Identifying Active Speakers in a Conference
Hello Readers,
Have you ever wondered how a conference server detects the active speakers in a conference and then displays their name, photo, or video while they are speaking? Today we are going to discuss the technical details behind highlighting the active speakers in a conference.
I had this question in mind for quite some time and was trying to find an answer to it. As you probably know, real-time communication uses RTP on top of the transport protocol UDP (User Datagram Protocol) to carry the media from one endpoint to the other. RTP does more than merely carry the traffic over UDP.
RTP (Real-time Transport Protocol):
Apart from carrying the media from one endpoint to the other, RTP also helps in identifying the active speakers in a conference call. It does that by using the Synchronization Source (SSRC) and Contributing Source (CSRC) identifiers. Let us see these in detail by looking at the RTP header.
In the RTP header (snapshot shown below) we have several fields, such as the sequence number, the timestamp, the marker bit, and the Synchronization Source (SSRC) and Contributing Source (CSRC) identifiers. For today's topic, let us focus on the SSRC and CSRC identifiers.
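To make the header layout concrete, here is a minimal sketch of a parser for the 12-byte fixed RTP header and the optional CSRC list, following the layout in RFC 3550. The function name parse_rtp_header and the dictionary keys are my own choices for illustration, and the sample packet is hand-built, not captured from a real call.

```python
import struct

def parse_rtp_header(packet: bytes) -> dict:
    """Parse the fixed RTP header and the CSRC list (layout per RFC 3550)."""
    if len(packet) < 12:
        raise ValueError("RTP packet is shorter than the 12-byte fixed header")

    # Byte 0: V(2) P(1) X(1) CC(4); byte 1: M(1) PT(7); then seq, timestamp, SSRC.
    first, second, seq, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    version = first >> 6
    if version != 2:
        raise ValueError(f"unsupported RTP version {version}")

    csrc_count = first & 0x0F          # CC: number of CSRC entries (0 to 15)
    end = 12 + 4 * csrc_count
    if len(packet) < end:
        raise ValueError("packet is too short for its CSRC list")
    csrc = list(struct.unpack(f"!{csrc_count}I", packet[12:end]))

    return {
        "version": version,
        "padding": bool(first & 0x20),
        "extension": bool(first & 0x10),
        "marker": bool(second & 0x80),
        "payload_type": second & 0x7F,
        "sequence_number": seq,
        "timestamp": timestamp,
        "ssrc": ssrc,
        "csrc": csrc,
    }

# Hand-built sample: version 2, no CSRC entries, payload type 0.
sample = bytes([0x80, 0x00]) + struct.pack("!HII", 1, 160, 0x11223344)
header = parse_rtp_header(sample)
```

The SSRC always sits at bytes 8 to 11, and the CSRC list, when present, follows it directly. That is why a receiver can read both identifiers without knowing anything about the codec in the payload.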
Let us see what actually happens in a conference call. A user joins the conference call using a SIP address. However, SIP is an application-layer signaling protocol: it sets up the session but does not carry the media, so it cannot identify the device that is actually sending the media traffic. Moreover, what if a user uses two video cameras in one session? In that case, you need a mechanism to tell apart the signals from the two devices, so that after the analog signal is sampled and converted to digital, it can be placed in an RTP packet together with an identifier for the capturing device.
In the RTP packet, is there a way to indicate which device was used?
Yes. The Synchronization Source (SSRC) identifier in the RTP header identifies the actual source that sent the media in an RTP session. It is a randomly chosen 32-bit value that is globally unique within an RTP session. This holds even if you have multiple audio devices, such as a headset and a laptop microphone and speaker: each source gets its own SSRC. This does not mean that the SSRC stays the same across RTP sessions; it may change for the next RTP session.
So, having an SSRC identifier for each device helps in telling one device's input apart from another's. For example, if a user speaks through the headset, the SSRC in those RTP packets identifies the headset as the source. With the help of the SSRC identifier, the conference server can identify the active speaker and highlight that participant accordingly. This looks simple, doesn't it? Alright, now let us see a real-world scenario.
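As a rough illustration of the idea, the sketch below keeps a table from SSRC to the participant and device behind it. Everything here is hypothetical: the names new_ssrc and SourceRegistry are mine, and in a real deployment each endpoint picks its own SSRC while the server learns the mapping from signaling or RTCP. I have folded both roles into one class only to keep the example short.

```python
import secrets

def new_ssrc(in_use):
    """Pick a random 32-bit SSRC that is not already used in this session."""
    while True:
        candidate = secrets.randbits(32)
        if candidate not in in_use:
            return candidate

class SourceRegistry:
    """Hypothetical per-session table: SSRC -> (participant, device)."""

    def __init__(self):
        self._sources = {}

    def register(self, participant, device):
        # Each device gets its own SSRC, even when it belongs to the same user.
        ssrc = new_ssrc(self._sources)
        self._sources[ssrc] = (participant, device)
        return ssrc

    def lookup(self, ssrc):
        # Returns None when the SSRC is unknown in this session.
        return self._sources.get(ssrc)

registry = SourceRegistry()
headset = registry.register("alice", "headset")
laptop_mic = registry.register("alice", "laptop microphone")
```

Given an incoming packet's SSRC, registry.lookup(ssrc) tells the server which participant and which device produced it, and that is all it needs to highlight the speaker.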
Example:
Let us consider a conference with 5 participants. Chances are that the participants join from different networks and countries and have different bandwidth limits. Let us say 2 participants have excellent bandwidth, 1 has average bandwidth, and 2 are connected from a network with low bandwidth.
In this case, if you want to choose a common codec, it would obviously have to be one that the low-bandwidth network can support. But by doing so, we don't want the users who have excellent bandwidth to end up with poor video quality. So how do we overcome this situation? This is where a mixer comes in (the conference server plays this role).
RTP Mixer:
An RTP mixer (in the conference server) collects the inputs from all the participants. It then combines them into a new RTP stream and sends it to all the endpoints. Thus, the users in the poor network location receive the quality that their network can support. Likewise, the other participants who have excellent bandwidth can receive the stream with the best quality.
Here comes the tricky part. If you rely only on the Synchronization Source to tell RTP streams apart by their source, then in this case the Synchronization Source would be the mixer (the conference server), because the mixer is the one generating the new stream. So having only an SSRC value in the RTP header is not enough to find the active speaker in a conference scenario. Hence we have another identifier, called the Contributing Source (CSRC) identifier, which helps in this situation.
The Contributing Source (CSRC) identifier plays a very significant role when the mixer collects the RTP streams from multiple users and combines them into a new RTP packet. While the RTP mixer (the conference server) creates the new RTP packet, it also includes the list of the speakers who were active at that instant. For example, participant #1 was talking and, at the same time, participant #5 was trying to ask a question, while the others were silent. In this packet, the SSRC holds the value of the mixer/conference server, and the CSRC list holds the SSRC values of participant #1 and participant #5. Thus, we get to see the active speakers in the conference even if we hear sound or noise from multiple users. That is great, but does the RTP mixer work if a user is behind a NAT or a firewall? No! So, here comes another important component called the RTP translator. Let us check that scenario now. As usual, let us see why we need it and how it helps us.
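The mixer behaviour described above can be sketched roughly as follows. This is a minimal illustration rather than a real mixer: it builds only the header of the mixed packet and skips the audio mixing itself. The function names build_mixed_packet and active_speakers, the names table, and the SSRC values are assumptions made up for this example.

```python
import struct

MAX_CSRC = 15  # the CC field is 4 bits wide, so at most 15 contributors fit

def build_mixed_packet(mixer_ssrc, contributors, seq, timestamp,
                       payload=b"", payload_type=0):
    """Build an RTP packet whose SSRC is the mixer and whose CSRC list
    names the sources that contributed to this packet."""
    csrcs = list(contributors)[:MAX_CSRC]
    first = (2 << 6) | len(csrcs)              # version 2, CC = number of CSRCs
    header = struct.pack("!BBHII", first, payload_type & 0x7F,
                         seq & 0xFFFF, timestamp & 0xFFFFFFFF, mixer_ssrc)
    header += struct.pack(f"!{len(csrcs)}I", *csrcs)
    return header + payload

def active_speakers(packet, names):
    """Read the CSRC list of a mixed packet and map it to display names."""
    cc = packet[0] & 0x0F
    csrcs = struct.unpack(f"!{cc}I", packet[12:12 + 4 * cc])
    return [names.get(c, f"unknown ({c:#010x})") for c in csrcs]

# Hypothetical SSRC values for the two people talking in the example above.
names = {0x00000001: "Participant 1", 0x00000005: "Participant 5"}
packet = build_mixed_packet(mixer_ssrc=0xC0FFEE00,
                            contributors=[0x00000001, 0x00000005],
                            seq=1, timestamp=160)
speakers = active_speakers(packet, names)
```

A receiving client only has to look at the CSRC list of each mixed packet to know whose name, photo, or video to highlight, even though every packet carries the mixer's SSRC.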
RTP Translator:
The RTP mixer can help only if the participants can reach it directly. However, if they are behind a NAT or firewall, then a participant may not be able to reach the mixer (the conference server). Hence, we have another component called the RTP translator. Think of this translator as a server that sits in the DMZ with a funnel. It funnels the RTP traffic from all the participants to the mixer and gets the new RTP stream back from the conference. It then funnels that stream out to the other participants who are on the internet.
A Participant Leaving or Exiting Scenario:
Alright, this sounds like a good option, but what happens if a person leaves the conference session? To address this scenario, we have the RTCP BYE packet. RTCP (the RTP Control Protocol) is the companion control protocol to RTP, and the leaving participant's endpoint sends an RTCP BYE packet when that person leaves the conference. Hence, the others get notified that a user is leaving the conference.
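For the curious, here is a small sketch of what a BYE packet looks like on the wire, based on the packet format in RFC 3550 (packet type 203, a source count, the list of leaving SSRCs, and an optional reason string). The function names build_rtcp_bye and parse_rtcp_bye are my own. Note that a real endpoint sends BYE as part of a compound RTCP packet, which this sketch leaves out.

```python
import struct

RTCP_BYE = 203  # RTCP packet type for BYE

def build_rtcp_bye(ssrcs, reason=""):
    """Build a standalone RTCP BYE packet for the given SSRC identifiers."""
    if len(ssrcs) > 31:
        raise ValueError("the source count field holds at most 31 entries")
    body = struct.pack(f"!{len(ssrcs)}I", *ssrcs)
    if reason:
        text = reason.encode("utf-8")[:255]
        body += bytes([len(text)]) + text
        body += b"\x00" * (-len(body) % 4)     # pad to a 32-bit boundary
    # Length field = packet size in 32-bit words minus one (the header word).
    header = struct.pack("!BBH", (2 << 6) | len(ssrcs), RTCP_BYE, len(body) // 4)
    return header + body

def parse_rtcp_bye(packet):
    """Return (list of leaving SSRCs, reason text) from a BYE packet."""
    first, packet_type, _length = struct.unpack("!BBH", packet[:4])
    if first >> 6 != 2 or packet_type != RTCP_BYE:
        raise ValueError("not an RTCP BYE packet")
    count = first & 0x1F
    end = 4 + 4 * count
    ssrcs = list(struct.unpack(f"!{count}I", packet[4:end]))
    reason = ""
    if len(packet) > end:
        size = packet[end]
        reason = packet[end + 1:end + 1 + size].decode("utf-8", errors="replace")
    return ssrcs, reason

bye = build_rtcp_bye([0x11223344], "left")
leaving, why = parse_rtcp_bye(bye)
```

When the conference server parses such a packet, it can drop the listed SSRCs from its participant table and stop showing them as possible speakers.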
Reference: RFC 3550
To summarize, using the Synchronization Source and Contributing Source identifiers in the RTP header, we get to know the active speaker details in a conference call. I hope that you liked this topic and the discussion. Thank you for reading!