Calls made after long idle hours are failing
Recently, a customer reported a call failure issue. Since both the scenario and the root cause were unusual, I would like to share that experience here.
Issue Description:
Lync Platform: Lync 2013 MT platform.
Desk Phone - Yealink
Recent change: Firmware update. At the local site, the firmware was upgraded on the Yealink phones. Immediately after the firmware update, all the test cases were successful.
Phone Firmware versions:
Firmware version without the issue: 66.9.0.25
New firmware version (that caused the issue): 66.9.0.42
Every morning, when users tried to place calls from their desk phones to the PSTN, the call would ring. However, the call was disconnected as soon as it was answered.
Technical description: In this scenario, for a SIP-based VoIP call flow, the SIP signaling works and the phone rings fine. However, when the called person answers the call, it gets disconnected.
Hence, in this issue, the SIP signaling works fine but the media path always fails.
Workaround (when you are using the firmware 66.9.0.42):
- Reboot the phone after several hours of idle time. The issue then does not occur. (OR)
- Downgrade to another firmware version; in our case it was 66.9.0.25.
Troubleshooting:
- Confirmed that UDP port 3478 (STUN) was allowed through the firewall.
- Confirmed that the Lync/SfB server was listening on port 3478 for new sessions and that there were no server-related issues.
- Confirmed that there were no connectivity issues between the phones and the Lync/SfB (Skype for Business) servers.
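The port check above can be approximated from any machine on the phone's network. The following is a minimal sketch, assuming a plain RFC 5389 STUN Binding Request is a good enough probe; the function names are made up for illustration. A Lync/SfB Edge server uses Microsoft's own TURN variant and may simply ignore an unauthenticated Binding Request, so a timeout here does not prove that the port is blocked.

```python
import os
import socket
import struct

MAGIC_COOKIE = 0x2112A442  # fixed value defined by RFC 5389
BINDING_REQUEST = 0x0001   # STUN message type for a Binding Request


def build_binding_request(txid=None):
    """Return a 20-byte STUN Binding Request with no attributes."""
    txid = txid or os.urandom(12)
    if len(txid) != 12:
        raise ValueError("transaction ID must be 12 bytes")
    # type (2 bytes), message length (2 bytes, 0 = no attributes), cookie (4 bytes)
    return struct.pack("!HHI", BINDING_REQUEST, 0, MAGIC_COOKIE) + txid


def parse_header(data):
    """Split the 20-byte STUN header into (type, length, cookie, txid)."""
    if len(data) < 20:
        raise ValueError("too short for a STUN header")
    msg_type, length, cookie = struct.unpack("!HHI", data[:8])
    return msg_type, length, cookie, data[8:20]


def probe(host, port=3478, timeout=2.0):
    """Send one Binding Request over UDP; return the reply header or None."""
    request = build_binding_request()
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(request, (host, port))
        try:
            reply, _ = sock.recvfrom(2048)
        except OSError:  # covers a timeout as well as an ICMP-triggered error
            return None
    return parse_header(reply)
```

Calling `probe("edge.example.com")` (a placeholder host name) returns the parsed header of whatever comes back, or `None` if nothing does within the timeout.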
Issue: After troubleshooting with the customer, it became clear that the issue occurred only on the phones running the latest firmware, 66.9.0.42.
Network trace collection and Analysis:
To collect the Wireshark traces, I connected to the Yealink phone using its IP address and captured a trace for both the failing and the working scenario.
The following are snapshots of the network trace collected while the calls were failing (after several hours of idle time). First, let us look at the trace taken from the phone on a morning when the calls were failing. In the Wireshark trace of the failure scenario, we could see several STUN binding requests but no successful response. Moreover, the STUN requests failed with some unusual errors, such as:
1. “Allocate Error Response error-code: 401 (unauthorized) the request did not contain a Message-Integrity attribute”.
And sometimes the requests failed with other STUN errors, such as:
2. “Allocate Error Response Code: 436: The username supplied in the request is not known.”
So it was evident that the issue was related to the STUN requests and the lack of successful STUN responses.
While searching the internet for the STUN error messages seen in the Wireshark trace, I understood that the issue was related to the STUN responses or to the ICE keep-alives.
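For readers who want to decode such errors outside Wireshark, here is a small sketch of how the STUN ERROR-CODE attribute value is laid out according to RFC 5389: two reserved bytes, the error class (the hundreds digit), the error number (0 to 99), and then a UTF-8 reason phrase. The function names and the lookup table are illustrative only, and the labels are taken from the trace text quoted above rather than from any Lync documentation.

```python
import struct

# Labels based on the errors quoted from the trace; other codes stay unnamed.
KNOWN_CODES = {
    401: "Unauthorized",
    436: "Unknown username",
}


def decode_error_code(value):
    """Decode the value bytes of a STUN ERROR-CODE attribute."""
    if len(value) < 4:
        raise ValueError("ERROR-CODE value needs at least 4 bytes")
    # two reserved bytes, then class (low 3 bits) and number (0-99)
    _reserved, err_class, number = struct.unpack("!HBB", value[:4])
    code = (err_class & 0x07) * 100 + number
    reason = value[4:].decode("utf-8", errors="replace")
    return code, reason


def describe(value):
    """Return a one-line summary of an ERROR-CODE attribute value."""
    code, reason = decode_error_code(value)
    label = KNOWN_CODES.get(code, "unrecognised")
    return f"{code} ({label}): {reason}"
```

A 401 is therefore carried on the wire as class 4 and number 1, and a 436 as class 4 and number 36.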
Next, I tried to collect a Wireshark trace for a working scenario. To do so, I rebooted the phone and then collected the trace while the calls were working fine after the reboot.
After the reboot, the phones connected to the same Lync server as before, yet they started working. This implies that there was no general problem between the phones and the Lync server. After the reboot, the phone sent out new STUN binding requests and immediately received a STUN Allocate Success response for the media flow.
Root cause of the issue: According to Yealink support, the root cause was that "the phones don’t update the STUN user information in time, new firmware hot-fix would let the phones update the STUN information every 10 minutes."
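To illustrate why stale STUN user information only shows up after long idle periods, here is a purely hypothetical sketch of a credential cache with a 10-minute refresh interval. It is not Yealink's firmware logic: the class name, the `fetch` callback and the refresh-on-use approach (instead of a background timer) are all assumptions made for the example.

```python
import time

REFRESH_INTERVAL_S = 10 * 60  # the 10-minute interval quoted by Yealink support


class CredentialCache:
    """Hypothetical holder for the STUN user information a phone sends."""

    def __init__(self, fetch, clock=time.monotonic, interval=REFRESH_INTERVAL_S):
        self._fetch = fetch        # callable that returns fresh credentials
        self._clock = clock        # injectable so the logic can be tested
        self._interval = interval
        self._value = None
        self._fetched_at = None

    def get(self):
        """Return credentials, fetching new ones if the cached copy is too old."""
        now = self._clock()
        stale = self._fetched_at is None or now - self._fetched_at >= self._interval
        if stale:
            self._value = self._fetch()
            self._fetched_at = now
        return self._value
```

Without the staleness check, a phone that fetched its credentials once at boot would keep presenting them after they had expired on the server. That would be consistent with the 401 and 436 errors seen in the morning and with the fact that a reboot fixed the calls.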
A quick word about Yealink support in this case:
Having worked with several other UC phone vendors, I must say that Yealink support was great in this case. Earlier, when I faced similar firmware issues with other Microsoft UC vendors, dealing with their support was time-consuming and frustrating.
Another premier Microsoft UC vendor was in total denial for months rather than accepting the issues with their firmware. In those instances, not only did I have to wait several months for a fix, but the vendor also took several months even to acknowledge the issue in their product. Yealink support, on the other hand, was very quick to confirm the issue and provided us with a hot-fix immediately. So, a big thank you to Yealink Support :-)
Lessons learnt: