Sunday, January 14, 2018


                                   Identifying Active Speakers in a conference


Hello Readers,

Have you ever wondered how a conference server detects the active speakers in a conference and displays their name, photo, or video while they are speaking? Today we are going to discuss the technical details behind highlighting the active speakers in a conference.

I had this question in mind for quite some time and was trying to find an answer to it. As you may know, real-time communication uses RTP, carried over the transport protocol UDP (User Datagram Protocol), to move the media from one endpoint to the other. RTP does more than merely carry the traffic over UDP.


RTP (Real-time Transport Protocol):

Apart from carrying the media from one endpoint to the other, RTP also helps in identifying the active speakers in a conference call. It does that by using the Synchronization Source (SSRC) and Contributing Source (CSRC) identifiers. Let us see these in detail by looking at the RTP header.

In the RTP header (snapshot shown below) we have several fields such as the sequence number, timestamp, marker bit, and the synchronization source (SSRC) and contributing source (CSRC) identifiers. For today's topic, let us focus on the synchronization source and contributing source identifiers.
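To make the header layout concrete, here is a minimal sketch that reads the fixed 12-byte RTP header and the CSRC list that follows it, using the field layout from RFC 3550. The function name `parse_rtp_header` and the sample packet are made up for illustration; a real receiver would also have to handle header extensions and padding.

```python
import struct


def parse_rtp_header(packet: bytes) -> dict:
    """Parse the fixed RTP header plus the CSRC list (layout per RFC 3550)."""
    if len(packet) < 12:
        raise ValueError("an RTP packet needs at least 12 header bytes")
    # Byte 0: V(2) P(1) X(1) CC(4); byte 1: M(1) PT(7); then seq, timestamp, SSRC.
    first, second, seq, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    csrc_count = first & 0x0F
    end = 12 + 4 * csrc_count
    if len(packet) < end:
        raise ValueError("packet is shorter than its CSRC count implies")
    # Each CSRC entry is a 32-bit big-endian integer right after the SSRC.
    csrcs = [int.from_bytes(packet[i:i + 4], "big") for i in range(12, end, 4)]
    return {
        "version": first >> 6,
        "padding": bool(first & 0x20),
        "extension": bool(first & 0x10),
        "marker": bool(second & 0x80),
        "payload_type": second & 0x7F,
        "sequence_number": seq,
        "timestamp": timestamp,
        "ssrc": ssrc,
        "csrc": csrcs,
    }


# Hypothetical packet: version 2, no CSRCs, made-up SSRC, 20 bytes of dummy payload.
sample = struct.pack("!BBHII", 0x80, 0, 1, 160, 0x1234ABCD) + b"\x00" * 20
header = parse_rtp_header(sample)
print(hex(header["ssrc"]), header["csrc"])
```

The two fields this post cares about are `ssrc` (always present) and `csrc` (a list of zero to 15 entries, as counted by the 4-bit CC field).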





Let us see what actually happens in a conference call. A user joins the conference call using a SIP address. However, SIP is an application-layer signaling protocol: it sets up the session but does not carry the media, so it cannot identify the source of each media stream. Moreover, what if a user uses two video cameras in one session? In that case, you need a mechanism to differentiate the signals from the two devices, so that after the analog signal is sampled and converted to digital, it can be placed in an RTP packet along with the details of the capturing device. Is there a way to indicate in the RTP packet which device was used?

Yes. The Synchronization Source (SSRC) identifier in the RTP header, a 32-bit value chosen at random, identifies the actual device that was used to send the media in an RTP session. This identifier is globally unique within an RTP session. This is true even if you have multiple audio devices, such as a headset and a laptop microphone and speaker: each source gets its own SSRC. This does not mean that the SSRC stays the same across RTP sessions; it may change for the next RTP session.

So, having a separate SSRC identifier for each device helps in telling one device's input apart from another's. For example, if a user speaks through the headset, the RTP packets carry the SSRC identifier that belongs to the headset. With the help of the SSRC identifier, the conference server can identify the active speaker and display that speaker accordingly. This looks simple, doesn't it? Alright, now let us look at a real-world scenario.
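As a rough illustration of the idea above, the sketch below gives each device a random 32-bit SSRC and lets a server-side table map an SSRC back to a participant and device. The class `SsrcRegistry` and the names used are hypothetical. In a real deployment each endpoint picks its own SSRC, and the server learns who owns it from signaling or from RTCP source descriptions rather than from a table like this one.

```python
import secrets


class SsrcRegistry:
    """Toy lookup table from SSRC to (participant, device). Illustration only."""

    def __init__(self):
        self._sources = {}

    def allocate(self, participant: str, device: str) -> int:
        # Pick random 32-bit values until one is unused within this session.
        while True:
            ssrc = secrets.randbits(32)
            if ssrc not in self._sources:
                self._sources[ssrc] = (participant, device)
                return ssrc

    def lookup(self, ssrc: int):
        # Returns None when the SSRC is unknown in this session.
        return self._sources.get(ssrc)


registry = SsrcRegistry()
headset = registry.allocate("Participant 1", "headset")
laptop_mic = registry.allocate("Participant 1", "laptop microphone")
print(registry.lookup(headset))
```

The same participant ends up with two different SSRC values, one per device, which is what lets the server tell the two streams apart.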

Example:

Let us consider a conference with 5 participants. Chances are that the participants join from different networks and countries and have different bandwidth limits. Let us say 2 participants have excellent bandwidth, 1 has average bandwidth, and 2 are connected from a network with low bandwidth.
In this case, if you want to choose a common codec, it would obviously have to be one that the low-bandwidth network can support. But by doing so, we do not want the users with excellent bandwidth to get poor video quality. So how do we overcome this situation? Here comes the role of a mixer (the conference server does that).

RTP Mixer:

An RTP mixer (in the conference server) collects the input streams from all the participants, combines them into new RTP packets, and sends those to the endpoints. Thus, the users in the poor network location receive a quality that their network can support. Likewise, the participants with excellent bandwidth can receive the best quality.

Here comes the tricky part. If you differentiate the RTP streams only by their source, using the synchronization source, then in this case the synchronization source of every mixed packet is the mixer (conference server) itself. So having only an SSRC value in the RTP header is not enough to find the active speaker in a conference scenario. Hence we have another identifier, called the Contributing Source (CSRC) identifier, which helps in this situation.

The contributing source (CSRC) identifier plays a very significant role when the mixer collects the RTP streams from multiple users and combines them into a new RTP packet. While the RTP mixer (the conference server) creates the new RTP packet, it also includes the list of the speakers who were active at that instant: for example, participant #1 was talking and, at the same time, participant #5 was trying to ask a question, while the others were silent. So in this packet the SSRC holds the value of the mixer/conference server, and the CSRC list holds the SSRC values of participant #1 and participant #5 (the list can carry up to 15 entries). Thus, we get to see the active speakers in the conference even if we hear sound or noise from multiple users. That is great, but does the RTP mixer work if a user is behind a NAT or a firewall? Not on its own! So, here comes another important component, the RTP translator. As usual, let us check why we need it and how it helps us.
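Before we get to the translator, here is a minimal sketch of the mixer step described above: the mixer writes its own SSRC into the header and lists the SSRCs of the participants who contributed to the mix as CSRC entries, and a receiver maps those entries back to names. The function names, the SSRC values, and the `roster` table are all made up for illustration, and the audio payload itself is left out.

```python
import struct

MAX_CSRC = 15  # the CC field is 4 bits wide, so at most 15 CSRC entries fit


def build_mixed_header(mixer_ssrc, contributor_ssrcs, seq, timestamp, payload_type=0):
    """Build an RTP header whose SSRC is the mixer and whose CSRCs are the speakers."""
    csrcs = list(contributor_ssrcs)[:MAX_CSRC]
    first = (2 << 6) | len(csrcs)   # version 2, no padding, no extension, CC
    second = payload_type & 0x7F    # marker bit left clear
    header = struct.pack("!BBHII", first, second, seq & 0xFFFF,
                         timestamp & 0xFFFFFFFF, mixer_ssrc)
    return header + b"".join(struct.pack("!I", c) for c in csrcs)


def active_speakers(packet, roster):
    """Read the CSRC list of a mixed packet and translate it to display names."""
    count = packet[0] & 0x0F
    csrcs = [int.from_bytes(packet[12 + 4 * i:16 + 4 * i], "big") for i in range(count)]
    return [roster.get(c, f"unknown source {c:#010x}") for c in csrcs]


# Hypothetical SSRC values for the mixer and for participants #1 and #5.
roster = {0x11111111: "Participant 1", 0x55555555: "Participant 5"}
packet = build_mixed_header(0xAAAAAAAA, [0x11111111, 0x55555555], seq=7, timestamp=1120)
speakers = active_speakers(packet, roster)
print(speakers)
```

A client that receives this packet sees the mixer as the synchronization source, yet it can still highlight participant #1 and participant #5 because their identifiers travel in the CSRC list.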


RTP Translator:

The RTP mixer can help only if the participants can reach it directly. If they are behind a NAT or firewall, a participant may not be able to reach the mixer (conference server). Hence, we have another component called the RTP translator. Think of this translator as a server that sits in the DMZ with a funnel. It funnels the RTP traffic from the participants to the mixer and receives the new RTP stream from the conference. It also funnels that stream out to the other participants who are on the internet. Unlike a mixer, a translator forwards the packets with their SSRC intact.

A Participant Leaving or Exiting Scenario:

Alright, this sounds like a good option, but what happens if a person leaves the conference session? Well, to address this scenario, we have the RTCP BYE packet. An endpoint sends an RTCP BYE packet when it leaves a conference. Hence, the others get notified that a user is leaving the conference.

NOTE:  We have not discussed RTCP here yet. Let us discuss it some other day :-)
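For the curious, a BYE packet is small enough to sketch by hand. The example below assembles one following the RTCP BYE layout in RFC 3550 (packet type 203, a source count, the departing SSRCs, and an optional reason text). The function name `build_rtcp_bye` and the SSRC value are made up for illustration. Note that this builds only the BYE itself; real endpoints send it as part of a compound RTCP packet.

```python
import struct

RTCP_BYE = 203  # RTCP packet type for BYE


def build_rtcp_bye(ssrcs, reason=b""):
    """Assemble a standalone RTCP BYE packet for the given departing SSRCs."""
    if len(ssrcs) > 31:
        raise ValueError("the source count field is 5 bits wide (max 31)")
    if len(reason) > 255:
        raise ValueError("the reason text is limited to 255 bytes")
    body = b"".join(struct.pack("!I", s) for s in ssrcs)
    if reason:
        text = bytes([len(reason)]) + reason
        text += b"\x00" * (-len(text) % 4)  # pad to a 32-bit boundary
        body += text
    # Length field: packet size in 32-bit words minus one (the header is one word).
    length_words = len(body) // 4
    first = (2 << 6) | len(ssrcs)  # version 2, no padding, source count
    return struct.pack("!BBH", first, RTCP_BYE, length_words) + body


# Hypothetical SSRC of the participant who is leaving.
bye = build_rtcp_bye([0x11111111])
print(bye.hex())
```

A mixer that forwards this on to the other participants lets them drop the departed source from their speaker list.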

Reference: RFC 3550
           
To summarize, using the synchronization source and contributing source identifiers in the RTP header, we get to know the active speaker details in a conference call. I hope that you liked this topic and the discussion. Thank you for reading!