Calls made after long idle hours are failing
Technical description: In this scenario, in a SIP-based VoIP call flow, the SIP signaling works and the phone rings fine. However, as soon as the called person answers, the call gets disconnected.
In other words, the SIP signaling works fine but the media path always fails.
Lync Platform: Lync 2013 MT platform.
Desk Phone - Yealink
Recent change: Firmware update. At the local site, the firmware was upgraded on the Yealink phones. Immediately after the firmware update, all the test cases were successful.
Phone Firmware versions:
Firmware version without the issue: 184.108.40.206
Workaround (when you are using the firmware 220.127.116.11):
- Reboot the phone after several hours of idle time. The issue then does not occur. (OR)
- Downgrade to another version. In our case it was 18.104.22.168.
Checks performed:
- Confirmed that UDP port 3478 (STUN) was allowed in the firewall.
- Confirmed that the Lync/SfB server was listening on port 3478 for new sessions and that there were no server-related issues.
- Confirmed that there were no connectivity issues between the phones and the Lync/SfB (Skype for Business) servers.
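The port check above can also be scripted. The following is a minimal sketch, assuming the standard RFC 5389 message layout: it builds an attribute-less STUN Binding Request and sends it once over UDP. The function names are made up for illustration, and a Lync/SfB Edge server may ignore or reject an unauthenticated request, so any reply (even an error) suggests the port is reachable, while silence proves nothing.

```python
import os
import socket
import struct

STUN_MAGIC_COOKIE = 0x2112A442  # fixed value defined by RFC 5389
BINDING_REQUEST = 0x0001


def build_binding_request(transaction_id=None):
    """Return a 20-byte STUN Binding Request with no attributes."""
    if transaction_id is None:
        transaction_id = os.urandom(12)
    if len(transaction_id) != 12:
        raise ValueError("transaction ID must be 12 bytes")
    # Header: type (2 bytes), attribute length (2), magic cookie (4), transaction ID (12)
    return struct.pack("!HHI", BINDING_REQUEST, 0, STUN_MAGIC_COOKIE) + transaction_id


def probe_stun(host, port=3478, timeout=2.0):
    """Send one Binding Request; return the reply's message type, or None."""
    request = build_binding_request()
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(request, (host, port))
            data, _ = sock.recvfrom(2048)
        except OSError:
            return None  # timeout, or an ICMP error reported by the OS
    if len(data) < 20 or data[8:20] != request[8:20]:
        return None  # not a reply to our transaction
    return struct.unpack("!H", data[:2])[0]
```

Calling `probe_stun("edge.example.com")` (a placeholder host name) returns an integer message type when something answers on the port, and `None` otherwise.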
Issue: After troubleshooting with the customer, it became clear that the issue occurred only on the phones running the latest firmware, 22.214.171.124.
Network trace collection and Analysis:
To collect the Wireshark trace, I connected to the Yealink phone using its IP address and captured traces for both the failing and the working scenario.
The following are the snapshots of the network trace collected while the calls were failing (after several hours of idle time). First, let us look at the trace from the phone on a morning when the calls were failing. In the snapshots of the failure scenario, we could see several STUN binding requests but no successful response. Moreover, there were several strange error responses to the STUN requests, such as:
1. Allocate Error Response, error-code: 401 (Unauthorized): "the request did not contain a Message-Integrity attribute".
At other times, the STUN binding requests failed with other STUN errors.
Searching the internet for the STUN error messages (taken from the Wireshark trace) suggested that the issue was related to the STUN responses or to ICE keep-alives.
Hence, I contacted Yealink support and provided the Wireshark traces for the working and failing scenarios.
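For readers who want to pull such error codes out of a capture without reading each packet by hand, here is a minimal sketch that decodes the ERROR-CODE attribute of a STUN message. It assumes the RFC 5389 attribute layout (type 0x0009, error class in the low 3 bits of the third value byte, number in the fourth); the Lync/SfB variant of the protocol may differ in details, and the function name is only illustrative.

```python
import struct

ERROR_CODE_ATTR = 0x0009  # ERROR-CODE attribute type per RFC 5389


def parse_error_code(message):
    """Return (code, reason) from a STUN message's ERROR-CODE attribute, or None."""
    if len(message) < 20:
        return None
    _msg_type, msg_len = struct.unpack("!HH", message[:4])
    offset = 20  # attributes start right after the 20-byte header
    end = min(20 + msg_len, len(message))
    while offset + 4 <= end:
        attr_type, attr_len = struct.unpack("!HH", message[offset:offset + 4])
        value = message[offset + 4:offset + 4 + attr_len]
        if attr_type == ERROR_CODE_ATTR and len(value) >= 4:
            error_class = value[2] & 0x07  # hundreds digit
            number = value[3]              # remaining two digits
            reason = value[4:].decode("utf-8", errors="replace")
            return error_class * 100 + number, reason
        # Attribute values are padded to a 4-byte boundary
        offset += 4 + attr_len + (-attr_len % 4)
    return None
```

Fed the raw UDP payload of an error response, this returns a tuple such as the numeric code and the reason phrase, or `None` when the message carries no ERROR-CODE attribute.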
Yealink support checked the Wireshark traces I provided. They also confirmed the issue after performing the tests at their end. Their product development team then worked on a hot-fix and provided it to us within a couple of days, and it fixed our problem.
Root cause of the issue: According to Yealink support, the root cause was that "the phones don’t update the STUN user information in time, new firmware hot-fix would let the phones update the STUN information every 10 minutes."
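To illustrate the idea behind that fix, here is a minimal sketch of a credential cache that re-fetches STUN user information once it is older than 10 minutes. This is not Yealink's code: the class name and the `fetch` callable are hypothetical, and the sketch refreshes lazily on use rather than on a background timer, which is a simpler stand-in for the periodic update described by support.

```python
import time

REFRESH_INTERVAL_SECONDS = 10 * 60  # the 10-minute interval quoted by Yealink support


class StunCredentialCache:
    """Hypothetical cache that re-fetches STUN credentials once they are stale."""

    def __init__(self, fetch, interval=REFRESH_INTERVAL_SECONDS, clock=time.monotonic):
        self._fetch = fetch        # hypothetical callable returning fresh credentials
        self._interval = interval  # maximum credential age in seconds
        self._clock = clock        # injectable clock, useful for testing
        self._credentials = None
        self._fetched_at = None

    def get(self):
        """Return cached credentials, fetching new ones if none exist or they are too old."""
        now = self._clock()
        stale = self._fetched_at is None or now - self._fetched_at >= self._interval
        if stale:
            self._credentials = self._fetch()
            self._fetched_at = now
        return self._credentials
```

With a cache like this, a phone that has been idle overnight would fetch new credentials on the first call of the morning instead of reusing ones obtained at boot, which matches why a reboot worked as a workaround.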
A quick word about Yealink support in this case:
Having worked with several other UC phone vendors, I can tell you that Yealink support was great in this case. Earlier, when I faced similar firmware issues with other Microsoft UC vendors, my experience with their support was time-consuming and bad.
The other premier UC vendor for Microsoft was in total denial mode for months rather than accepting the issues with their firmware. In those instances, the vendor would take several months just to acknowledge the issue in their product, and I then had to wait several more months for them to fix it. Yealink support, on the other hand, was very quick to confirm the issue and provided us with the hot-fix right away. So, a big thank you to the Yealink Support :-)
From this issue we learned that there is one more scenario that needs to be tested after a firmware update. The takeaway from this experience is that you need to test the call flow after long idle hours as well (at least 12 hours after the desk phones were rebooted).