Sunday, June 17, 2018

         

                               Calls made after long idle hours are failing 


Recently, a customer reported a call failure issue. Since the scenario and the root cause of the issue were unusual, I would like to share that experience here.

Issue Description

Every morning, when users try to place calls from their desk phones to the PSTN, the call rings. However, the call gets disconnected as soon as it is answered.

Technical description: In this scenario, for a SIP-based VoIP call flow, the SIP signaling works and the phone rings fine. However, when the called person answers the call, it gets disconnected.
Hence, in this issue, the SIP signaling works fine but the media path fails.


Lync Platform: Lync 2013 MT platform.

Desk Phone - Yealink

Recent change: Firmware update. At the local site, the firmware on the Yealink phones had been upgraded. Note, however, that all the test cases run right after the firmware update were successful.

Phone Firmware versions

Firmware version without the issue:   66.9.0.25

New firmware version (that caused the issue) :   66.9.0.42


Workaround (when you are using the firmware 66.9.0.42)


  • Reboot the phone after several hours of idle time. After the reboot, the issue does not occur for the rest of the day. (OR)
  • Downgrade to another firmware version; in our case it was 66.9.0.25.



Scenario: 

The customer was migrated from their existing PBX to Lync 2013 MT several months ago. The users use Yealink desk phones to make calls. The phones had been working fine for the last few weeks. However, users reported that they were never able to make calls in the morning. Once they rebooted the phones, calls worked fine throughout the day. The next morning, however, the same issue came back, and it was again resolved once the phones were rebooted.


Troubleshooting:

  • Confirmed that UDP port 3478 (STUN) was allowed on the firewall.
  • Confirmed that the Lync/SfB (Skype for Business) server was listening on port 3478 for new sessions and that there were no server-related issues.
  • There were no connectivity issues between the phones and the Lync/SfB servers.
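The port check above can be sketched in a few lines of Python. This is only a minimal sketch, assuming a plain RFC 5389 Binding Request; the function names are my own, and whether a Lync/SfB server answers such a bare request at all is an assumption, so a missing reply does not prove that the port is blocked.

```python
import os
import socket
import struct

MAGIC_COOKIE = 0x2112A442   # fixed value defined by RFC 5389
BINDING_REQUEST = 0x0001    # STUN message type for a Binding Request


def build_binding_request(transaction_id=None):
    """Return (packet, transaction_id) for a Binding Request without attributes."""
    if transaction_id is None:
        transaction_id = os.urandom(12)
    # Header: type (2 bytes), body length (2), magic cookie (4), then the 12-byte ID.
    header = struct.pack("!HHI", BINDING_REQUEST, 0, MAGIC_COOKIE)
    return header + transaction_id, transaction_id


def is_matching_response(data, transaction_id):
    """True if data looks like a STUN message that echoes our transaction ID."""
    if len(data) < 20:
        return False
    _msg_type, _length, cookie = struct.unpack("!HHI", data[:8])
    return cookie == MAGIC_COOKIE and data[8:20] == transaction_id


def probe(host, port=3478, timeout=2.0):
    """Send one request over UDP; True if any matching reply (even an error) arrives."""
    packet, tid = build_binding_request()
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(packet, (host, port))
            data, _addr = sock.recvfrom(2048)
        except OSError:  # covers timeouts, refused ports and name lookup failures
            return False
    return is_matching_response(data, tid)
```

Calling `probe("edge.example.com")` (a placeholder host name) from the phone's network segment would then return `True` only if something on UDP 3478 replies with a STUN message for that transaction.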

Issue: After troubleshooting the issue with the customer, it was obvious that the issue occurred only on the phones which had the latest firmware 66.9.0.42.


Network trace collection and Analysis:

To collect the Wireshark traces, I connected to the Yealink phone using its IP address and captured a trace for both the non-working and the working scenario.

The following are the snapshots of the network trace collected from the phone on a morning when the calls were failing (after several hours of idle time). From these snapshots of the failure scenario, we could see that there were several STUN binding requests but no successful response. Moreover, the STUN requests failed with several strange errors, such as:


1. “Allocate Error Response error-code: 401 (Unauthorized)” – The request did not contain a Message-Integrity attribute.

Sometimes the STUN requests failed with other errors, such as:

2. “Allocate Error Response Code : 436” – The username supplied in the request is not known.

So, it was evident that the issue was due to the failing STUN requests and the lack of successful STUN responses.
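For readers who want to see where the 401 and 436 values come from, here is a minimal sketch of how the STUN ERROR-CODE attribute carries them: the third byte holds the class (hundreds digit) in its low 3 bits and the fourth byte holds the number, followed by a reason phrase. The function names are my own, and the sketch assumes you already have the raw attribute value from a capture.

```python
def decode_error_code(value: bytes):
    """Decode the value of a STUN ERROR-CODE attribute into (code, reason)."""
    if len(value) < 4:
        raise ValueError("ERROR-CODE value must be at least 4 bytes long")
    error_class = value[2] & 0x07   # hundreds digit, stored in the low 3 bits
    number = value[3]               # remaining part of the code, 0..99
    reason = value[4:].decode("utf-8", errors="replace")
    return error_class * 100 + number, reason


def encode_error_code(code: int, reason: str = "") -> bytes:
    """Build an ERROR-CODE value; handy for testing the decoder."""
    return bytes([0, 0, code // 100, code % 100]) + reason.encode("utf-8")


# The two errors seen in the failing trace, rebuilt and decoded again.
for code, reason in [(401, "Unauthorized"), (436, "Unknown Username")]:
    print(decode_error_code(encode_error_code(code, reason)))
```

So a class of 4 with a number of 1 is the 401 error, and a class of 4 with a number of 36 is the 436 error.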

While searching the internet for the STUN error messages (that we got from the Wireshark trace), I understood that the issue was related to the STUN responses or to the ICE keep-alives.







So, I tried to collect the Wireshark trace for a working scenario. I rebooted the phone and then collected the Wireshark trace (while the calls were working fine after the phone reboot).

After the reboot, the phones connected to the same Lync server as before, but now the calls worked. This implies that there were no connectivity problems between the phones and the Lync server: after the reboot, the phone sent out new STUN requests and immediately received a STUN Allocate Success response for the media flow.
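When comparing the two traces, it helps to know how "Allocate Error Response" and "Allocate Success Response" are told apart on the wire. The sketch below decodes the 16-bit STUN message type using the bit layout from RFC 5389, with the Allocate method number taken from TURN (RFC 5766). That the Lync variant of TURN uses the same layout is an assumption on my part, and the function name is hypothetical.

```python
import struct

CLASSES = {0: "Request", 1: "Indication", 2: "Success Response", 3: "Error Response"}
METHODS = {0x001: "Binding", 0x003: "Allocate"}


def describe_message_type(msg_type: int) -> str:
    """Split a STUN message type into its method and class and name them."""
    # The two class bits sit at bit 8 (C1) and bit 4 (C0) of the type field.
    cls = ((msg_type & 0x0100) >> 7) | ((msg_type & 0x0010) >> 4)
    # The method bits are the remaining bits, squeezed together.
    method = (msg_type & 0x000F) | ((msg_type & 0x00E0) >> 1) | ((msg_type & 0x3E00) >> 2)
    name = METHODS.get(method, f"Method 0x{method:03x}")
    return f"{name} {CLASSES[cls]}"


def describe_packet(data: bytes) -> str:
    """Describe a raw STUN packet by looking at its first two bytes."""
    (msg_type,) = struct.unpack("!H", data[:2])
    return describe_message_type(msg_type)


print(describe_message_type(0x0113))  # failing scenario
print(describe_message_type(0x0103))  # working scenario after the reboot
```

The first call prints "Allocate Error Response" and the second prints "Allocate Success Response", which are the two message names discussed in this post.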





Hence, I contacted Yealink support and provided the Wireshark traces for the working and non-working scenarios.

Yealink support checked the Wireshark traces I provided. They also confirmed the issue after performing tests at their end. Their product development team then worked on a hot-fix and provided it to us within a couple of days, and it fixed our problem.

Root cause of the issue: As per Yealink support, the root cause of the issue was: "the phones don’t update the STUN user information in time, new firmware hot-fix would let the phones update the STUN information every 10 minutes."
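To illustrate the idea behind the fix, here is a minimal sketch of a cache that refreshes a value once it is older than 10 minutes. This is not Yealink's implementation; the class name, the `fetch` callback and the injectable clock are all hypothetical and only show the general "refresh when stale" pattern.

```python
import time

REFRESH_INTERVAL_SECONDS = 10 * 60  # the 10-minute interval quoted by Yealink support


class StunCredentialCache:
    """Hand out cached STUN user information, refetching it once it is stale."""

    def __init__(self, fetch, interval=REFRESH_INTERVAL_SECONDS, clock=time.monotonic):
        self._fetch = fetch          # hypothetical callback that gets fresh information
        self._interval = interval
        self._clock = clock          # injectable so the logic can be tested without waiting
        self._credentials = None
        self._fetched_at = None

    def get(self):
        now = self._clock()
        stale = self._fetched_at is None or now - self._fetched_at >= self._interval
        if stale:
            self._credentials = self._fetch()
            self._fetched_at = now
        return self._credentials
```

With a cache like this, a phone that sat idle overnight would fetch fresh information on its next call instead of reusing what it obtained at boot time.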

A quick word about Yealink support in this case:

I must admit that, having worked with several other UC phone vendors, I found Yealink support great in this case. Earlier, when I faced similar firmware issues with other Microsoft UC vendors, my experience with their support was time-consuming and bad.

After all, the other premier UC vendor for Microsoft was in total denial mode for months rather than accepting the issues with their firmware. In those instances, not only did the vendor take several months to even acknowledge the issue in their product, but I also had to wait several months for them to fix it. Yealink support, on the other hand, was very quick in confirming the issue and provided us the hot-fix promptly. So, a big thank you to Yealink support :-)

Lessons learnt:


From this issue, we understood that there is one more scenario that needs to be tested after a firmware update. The takeaway from this experience is that you need to test the call flow after long idle hours as well (at least 12 hours after the desk phones were last rebooted).
