Working at a service provider for hundreds of VoIP customers with thousands of phones means you’re going to encounter problems that make you scratch your head. Every now and then one of those problems gets escalated to me, and I’m left figuring out why something is happening. Sometimes that means running packet captures on all sorts of equipment, installing in-line network taps to grab traffic off the wire, or trudging through debug logs. Occasionally I get lucky, figure out the problem before we’ve even started, and look like a wizard. More often, though, I’m left just as confused as everyone else.
This problem has come to me in a number of iterations, but it always boils down to intermittent, unexplained phone reboots with no apparent rhyme or reason. Mid-call or sitting untouched, whatever the usage scenario, the customer was complaining that their phones were rebooting, and multiple Aastra models were affected. The last time this was happening with noticeable frequency was several months back, and we thought we had nabbed the problem when a provisioning error related to an access control list (ACL) on a switch was fixed. Phones stopped rebooting, the ticket was closed, and the customer was happy.
Fast forward to the other day when the problem reared its ugly head again after multiple phones in their office rebooted unexpectedly. The ACL was still in place, so this was something new. What could it be this time?
Digging in
When I was working on this issue last time, I pushed out a temporary config to their phones that enabled remote syslogging of debug information. The idea was that the phones would spit out some sort of useful log line prior to a reboot, like memory being exhausted or an error code, that would lead us to understand the problem. Around that same time I discovered the ACL issue, so those logs were never needed. I decided to leave the syslogging in place as a precaution, though, just in case the unexplained reboots returned. Now that the problem was happening again, I finally had information I hoped would lead me to an answer.
log server ip: 10.0.0.2
log server port: 514
log module linemgr: 7
log module user interface: 7
log module misc: 7
log module sip: 7
log module dis: 7
log module dstore: 7
log module ept: 7
log module ind: 7
log module kbd: 7
log module net: 7
log module provis: 7
log module rtpt: 7
log module snd: 7
log module prof: 7
log module xml: 7
log module stun: 7
Of course, tracking down a weird problem is never that simple: the phones didn’t log any information prior to their reboots, giving me nothing to go on. The Aastra phones offer very detailed log levels, ranging from a trickle of debug information all the way up to a full-on tsunami of completely indecipherable garbage, and apparently whatever was happening wouldn’t be exposed at the current level. The logs did reveal one thing, though: more phones than expected had rebooted, but each only once.
Jan 23 10:04:38 10.0.0.22 00:00:21.720000 SetParm: (CC) ERROR: Cannot get vendor options mac:00-07-9A-2C-A4:CD
Jan 24 11:57:27 10.0.0.17 00:00:24.910000 SetParm: (CC) ERROR: Cannot get vendor options mac:00-07-9A-22-AC:CD
Jan 24 12:19:33 10.0.0.23 00:00:22.940000 SetParm: (CC) ERROR: Cannot get vendor options mac:00-07-9A-22-A4:CD
Jan 24 12:30:28 10.0.0.25 00:00:21.800000 SetParm: (CC) ERROR: Cannot get vendor options mac:00-07-9A-2A-AF:CD
Jan 24 12:30:55 10.0.0.28 00:00:17.030000 SetParm: (CC) ERROR: Cannot get vendor options mac:00-07-9A-21-A1:CD
Jan 24 13:31:52 10.0.0.10 00:00:18.910000 SetParm: (CC) ERROR: Cannot get vendor options mac:00-07-9A-22-A3:CD
Jan 24 13:32:04 10.0.0.11 00:00:21.820000 SetParm: (CC) ERROR: Cannot get vendor options mac:00-07-9A-2A-A7:CD
Jan 24 14:07:20 10.0.0.12 00:00:16.910000 SetParm: (CC) ERROR: Cannot get vendor options mac:00-07-9A-22-A6:CD
Jan 24 14:59:04 10.0.0.19 00:00:17.900000 SetParm: (CC) ERROR: Cannot get vendor options mac:00-07-9A-22-AD:CD
What now?
It turned out that our network guys had performed maintenance on their switch the night before, and while correlation is not causation, it was at least a good place to start investigating. I caught a lucky break: the switch logs gave us our next clue, DHCP address conflicts.
Jan 24 10:20:54.133: %DHCPD-4-DECLINE_CONFLICT: DHCP address conflict: client 0007.9a22.f5cd declined 10.0.0.6.
Jan 24 10:21:11.133: %DHCPD-4-DECLINE_CONFLICT: DHCP address conflict: client 0007.9a22.f5cd declined 10.0.0.7.
Jan 24 10:21:29.133: %DHCPD-4-DECLINE_CONFLICT: DHCP address conflict: client 0007.9a22.f5cd declined 10.0.0.8.
Jan 24 10:21:43.134: %DHCPD-4-DECLINE_CONFLICT: DHCP address conflict: client 0007.9a22.f5cd declined 10.0.0.9.
Jan 24 10:21:58.134: %DHCPD-4-DECLINE_CONFLICT: DHCP address conflict: client 0007.9a22.f5cd declined 10.0.0.10.
Jan 24 10:22:15.142: %DHCPD-4-DECLINE_CONFLICT: DHCP address conflict: client 0007.9a22.f5cd declined 10.0.0.11.
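As an aside, IOS also keeps a conflict table separate from the binding database, which is another place these declines show up. I’m reconstructing the output from memory, so treat the columns and entries as illustrative, but inspecting and clearing it looks roughly like this:

# sh ip dhcp conflict
IP address        Detection method   Detection time
10.0.0.6          Gratuitous ARP     Jan 24 2013 10:20 AM
10.0.0.7          Gratuitous ARP     Jan 24 2013 10:21 AM
! stale entries can be cleared once the underlying problem is fixed
# clear ip dhcp conflict *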
Normally when a DHCP client’s lease expires it simply renews the lease on its current IP and everything hums along happily. In this case, though, you can see a client declining address after address. We can only assume the client was denied a renewal of its current lease and was unhappy with the IPs it was offered because it detected conflicts. Why would the switch’s DHCP server do this, though? It maintains a database of current bindings precisely to avoid handing out IPs that are already leased. As it turns out, due to an incomplete DHCP configuration on the switch, the reload after maintenance wiped the binding database clean, leaving it clueless about existing leases. When a client’s lease timer expired and it asked for a renewal, it was handed a new IP. Could this change cause Aastra phones to reboot?
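Before answering that, the wiped-database theory was easy enough to sanity check on the switch itself: right after a reload with no database agent configured, the binding table comes up empty, and the DHCP server debugs will show the renew-and-decline churn as it happens. A rough sketch, with the output trimmed and paraphrased from memory:

! nothing survives the reload without a database agent
# sh ip dhcp binding
Bindings from all pools not associated with VRF:
IP address      Client-ID/Hardware address      Lease expiration      Type

! watch the DISCOVER/OFFER/DECLINE exchanges live (very chatty; remember to turn it off)
# debug ip dhcp server packet
# no debug ip dhcp server packet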
Confirming the problem
The case was easy enough to test in our lab, which already had one of the same model phones, an Aastra 57i, addressed via DHCP. The test plan was to decrease the lease time to the shortest interval (one minute), then deny leasing its current IP, and finally remove the existing binding to force a new lease to be generated.
# sh ip dhcp binding
...
10.10.0.101     0007.9a12.abcd    Jan 25 2013 11:54 AM    Automatic    Active    Vlan1234
...
# conf t
(config)# ip dhcp pool voip-lab
(dhcp-config)# lease 0 0 1
(dhcp-config)# exit
(config)# ip dhcp excluded-address 10.10.0.1 10.10.0.101
# clear ip dhcp binding 10.10.0.101
After rebooting the test phone so it would pick up the shortened lease time, within a minute the switch reported that it could no longer lease the phone the IP it wanted, and voila! The phone’s screen lit up, posted a quick message that its IP had changed, and it promptly rebooted. The problem was confirmed and the mystery solved, but the result is still ugly: no poltergeists here, just an annoying network implementation. The customer also has Aastra 6731i phones, but one wasn’t available in the lab to test against.
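For completeness, backing the lab changes out afterwards is just a matter of removing the exclusion and letting the lease timer fall back to its default. Roughly, assuming the same pool name and range used above:

# conf t
(config)# no ip dhcp excluded-address 10.10.0.1 10.10.0.101
(config)# ip dhcp pool voip-lab
! removing the lease statement returns the pool to the default lease time
(dhcp-config)# no lease
(dhcp-config)# end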
Solving it for good
In the end, the problem can be solved a couple of different ways. The better long-term solution is to centralize DHCP onto dedicated DHCP servers rather than relying on the switches’ built-in DHCP server, and that is one of the long-term goals for our VoIP network. The other way of attacking the problem is to configure a DHCP database agent, which allows the switch to store its DHCP binding database on a remote FTP or TFTP server, or even locally in flash.
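For what it’s worth, the centralized approach mostly boils down to relaying DHCP from the voice VLAN to the dedicated servers and retiring the local pool. A rough sketch, with the server addresses purely hypothetical and the VLAN interface borrowed from the lab example:

# conf t
(config)# interface Vlan1234
(config-if)# ip helper-address 192.0.2.10
(config-if)# ip helper-address 192.0.2.11
(config-if)# exit
! once clients are leasing from the central servers, the local pool can go
(config)# no ip dhcp pool voip-lab
(config)# end

The database agent approach, which is what we went with in the short term, looks like this: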
# conf t
(config)# ip dhcp database flash:/dhcpbinding.db
(config)# exit
# sh ip dhcp database
URL       :  flash:/dhcpbinding.db
Read      :  Never
Written   :  Jan 25 2013 09:10 AM
Status    :  Database has changed. A file transfer to the agent is pending.
Delay     :  300 seconds
Timeout   :  300 seconds
Failures  :  0
Successes :  1
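The database can also live on a remote FTP or TFTP server instead of flash, which keeps the bindings safe even if the flash file is lost, and the write delay can be tightened up (60 seconds is the minimum, if I remember the documentation correctly). A hedged example, reusing our syslog box as a stand-in TFTP server purely for illustration:

# conf t
! write the binding database to a remote TFTP server, flushing changes every 60 seconds
(config)# ip dhcp database tftp://10.0.0.2/dhcpbinding.db write-delay 60
(config)# end

Either way, sh ip dhcp database should start reporting successful writes once the delay elapses.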
This is the simplest way for us to ensure the binding database is maintained between reloads in the short term, meaning we can continue with switch maintenance without this kind of disruption to our clients. With any luck, maybe this will be the last I hear of rebooting phones!