Here we see Cisco's flowchart for troubleshooting IP Telephony. We start by getting a clear definition of the problem and then gathering facts. And by the way, we will talk about these different steps in more detail later on in this lesson, but for now, at a high level: we clearly define the problem, we understand who is impacted, how widespread the impact is, exactly what the symptom is, and whether it occurs all of the time or only part of the time.
We gather facts, maybe from interviewing users, maybe from issuing show and debug commands on our routers, maybe by looking at trace files from Communications Manager. Based on our experience and on documentation, we start to consider possibilities. We pick from that list of possibilities the most likely suspect, and we create an action plan around it. We implement the action plan and observe the results to see whether it works. If it does, great, we have solved the problem, and we need to document what we did for future reference. If it doesn't solve the problem, we might need to go back and create another action plan. Maybe we need to gather additional facts to create a more effective action plan. But notice, there is a feedback loop: if we don't succeed the first time, "try, try again" is the motto in this diagram.
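As a sketch of what fact gathering at the router CLI might look like, here are a few Cisco IOS show and debug commands commonly used when troubleshooting voice calls (availability and output vary by platform and IOS version):

```
show call active voice brief     ! summary of calls currently in progress
show voip rtp connections        ! local/remote RTP ports for active media streams
show sccp connections            ! active sessions on SCCP media resources
debug ccsip messages             ! watch SIP signaling in real time (use with care)
```

As always with debug commands, be careful running them on a busy production router; they can generate a great deal of output.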
Sample Network Problem: Define the Problem
To help illustrate Cisco's troubleshooting model, let's consider a sample trouble ticket. Here we're being told that a user in cluster A is trying to initiate a call with a user in cluster B, and interestingly, the phone in cluster B does ring; but as soon as the user at phone B goes off hook, the caller hears a fast busy tone. Based on our troubleshooting model, we want to begin by clearly defining the problem.
Let's contrast the observed behavior with the behavior we would expect. We would expect phone B to ring, for the user to go off hook, and for us to be able to talk with them. Instead, the signaling got through (notice that phone B did ring, so we have connectivity), but the audio path fails. That could be our problem definition: the audio stream fails even though the signaling, which appears to be SIP in this diagram, did get through.
Once we've clearly identified the problem, we can start gathering facts, such as determining when the problem first occurred. Maybe a change was made around the time the problem was first noticed. Is it a persistent problem? Does it occur just during the busy times of the day? Does it happen all the time? Does it not happen on weekends? And we might interview the user to understand what they're experiencing. Does the caller immediately hear fast busy, or do they hear some other tone or a message? Is anybody else besides this user having the problem? How widespread is it?
In our example, as part of gathering the facts, let's say that we interview the user, and from that interview we learn that the problem doesn't always happen; it happens intermittently. It started about two months ago but is now occurring more and more frequently. The user says they haven't changed anything on the telephone, and they mention that the problem usually seems to happen during the busiest times of the day. The user says that calls to phones within their own site work fine. And if we try to call somebody else at site B, other than the phone B we were trying in this case, do those calls work? Sometimes. So it seems we have an intermittent problem that becomes more pronounced during the busier times of the day; it's been happening for about two months, and it seems to be getting worse and worse.
As a next step, let's consider what might be going on. We've defined the problem, we've gathered the facts, and based on our experience, and based on literally our intuition in many cases, based on what we've seen before, maybe based on consultation with other network engineers, maybe based on documentation, we start to consider a list of possibilities.
In our example, we might have a list of possibilities like these. First, an incorrect regions definition. We see that we have two separate clusters interconnected via a SIP trunk. That trunk could belong to one region, the cluster at site A could belong to another region, and when we're calling between regions, that influences which codec is used. So maybe the region only allows up to G.729 while the phone is trying to use G.711. But I don't think that one is going to be true, because the call works some of the time. If we had an incorrect regions definition, it seems like it would always fail.
What about a codec mismatch? Again, if that's the issue, it probably happens all the time, it's not intermittent.
No transcoding resources. Well, something we didn't mention, but that we learned while gathering facts: the phones at site B used G.723 (first-generation Cisco IP phones did that), and the phones at site A used G.729 when calling over the WAN. Can a phone that speaks G.723 talk to one that speaks G.729? It can, but you need transcoding. In fact, you almost have to transcode twice, because transcoding converts between a high-bandwidth codec and a low-bandwidth codec, like going from G.711 to G.729. So here it's almost as if we're going from G.729 to G.711, and then from G.711 to G.723, and vice versa. So we do need transcoder resources. I think that's a good possibility: we might be running low on transcoder resources during the busier times of the day. That would explain why it works some of the time but not all of the time. Okay, we'll keep that on the list as a likely suspect.
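If we later decided to provision IOS-based transcoding resources, the configuration might look roughly like the following sketch. The interface name, CUCM address, profile number, device name (XCODE001), and session count here are all hypothetical, and the codecs and session limits you can configure depend on the DSPs available in the router:

```
voice-card 0
 dsp services dspfarm
!
sccp local GigabitEthernet0/0
sccp ccm 10.1.1.10 identifier 1 version 7.0
sccp
!
sccp ccm group 1
 associate ccm 1 priority 1
 associate profile 1 register XCODE001
!
dspfarm profile 1 transcode
 codec g711ulaw
 codec g729ar8
 codec g723r63
 maximum sessions 4
 associate application sccp
 no shutdown
```

The transcoder would also need to be added as a media resource in Communications Manager (matching the registered device name) so the clusters can actually invoke it.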
An access list on a router might be allowing signaling but denying RTP. That might explain it, as might an RTP header compression mismatch, but I have the same issue with both of these: if either were the underlying issue, it should be happening all of the time, not just part of the time (unless, I suppose, we had a time-of-day-based access list). From this list, I think the most likely cause is that we're out of transcoder resources during the busier times of the day.
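To illustrate that suspect, here is a hypothetical access list that would produce exactly this symptom: signaling succeeds but media fails. The port numbers assume SIP signaling on port 5060 and the default Cisco RTP range of UDP 16384 through 32767:

```
! Signaling on port 5060 is permitted...
access-list 101 permit tcp any any eq 5060
access-list 101 permit udp any any eq 5060
! ...but the RTP media range is denied, so calls set up yet carry no audio
access-list 101 deny   udp any any range 16384 32767
access-list 101 permit ip any any
```

As noted, a plain access list like this would break audio on every call; it would only explain an intermittent problem if it were applied on a time-of-day basis.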
Create Action Plan
Once we've identified our most likely suspect, we want to create an action plan, and if time permits it's often a good idea to document that plan so that you're able to back out. If you're going to enter a bunch of commands on a router and it doesn't work, wouldn't it be great to have those documented so you could back out of the config that didn't help and maybe made matters worse? So, weighing this against the urgency of the issue, document your action plan if you can. You might also want to collaborate with others and use some collective intelligence in developing the plan.
In our example, let's say our action plan is to add transcoding resources, such that calls between these two clusters can continue to use G.723 on one side and G.729 on the other with enough transcoding resources available. Another option might be to limit the maximum number of calls. We could set up CAC (Call Admission Control) to make sure that we don't place so many calls that we exhaust our transcoding resources. In fact, before we do this (and we could do this as part of gathering the facts), we could use RTMT to determine whether we're running out of transcoder resources. There's actually an out-of-transcoding-resources counter that can tell us if we ever did run out of transcoding resources. And if we did, we might want to go get additional transcoding resources.
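On an IOS device hosting the transcoding DSPs, a few commands (a sketch; exact names and output vary by platform and version) can help confirm whether active sessions are approaching the configured maximum during busy periods:

```
show dspfarm profile 1      ! profile state, maximum sessions, and sessions in use
show dspfarm dsp active     ! DSP channels currently in use
show sccp connections       ! transcoding sessions registered to Communications Manager
```

In RTMT, the counter mentioned above is found among the transcoder-related performance objects; watching it over a busy day would confirm or rule out this suspect before we spend money on more DSP hardware.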
Implement Action Plan
Next, we implement the action plan and watch what happens. We make sure that we're able to back out of the action plan if things don't go as predicted. Part of the plan might involve temporarily removing an access list so that we can modify it; be aware that we might be opening up a security hole when we do that. Also be aware that, depending on the urgency of the issue, you might want to implement the action plan outside of normal office hours to minimize the impact on your users.
After implementing the action plan, we observe the results. Did it fix the problem or not? If the problem appears to stop, we need to make sure that we didn't inject a new problem when we implemented the action plan. Assuming we didn't, and assuming the original symptom has stopped, we can document our results.
Restart the Problem-Solving Process
If we implement our action plan and it doesn't resolve the issue, we need to go back and possibly create another action plan. This might involve gathering additional information, but there is a feedback loop: we go back, create a different action plan, implement it, and see whether that resolved the issue.
As we mentioned earlier, after we resolve the issue and confirm that we haven't injected new issues into the network, we need to document what we did. This is sometimes called a post-mortem report. We want to document what the issue was, what the underlying cause was, what the fix was, and who had to be involved in the resolution. This can be useful to us and to other network administrators and engineers in the future when a similar problem arises.