This is the second article in a two-part series on network diagnostics for video meetings. Please check out the first part of the series here, if you haven’t done so already.
In the first article, we covered the basics of creating a direct peer-to-peer connection that can be used for media exchange. Specifically, we covered the connections created during a typical WebRTC session and developed a list of required and desired network constraints. Now, we will use this information to develop specific network diagnostics procedures, which can be implemented as pass/fail tests. These tests tell you whether the user is able to take part in a WebRTC-based interaction and, if so, whether the user’s network conditions are good enough for an optimal meeting experience. We’ve found it very valuable to integrate these tests into the development and quality assurance process. It is also beneficial to embed these diagnostic capabilities into the product itself for end users, for example as a standalone troubleshooting page that helps users address issues with their network setup.
It is worth noting that we won’t be able to pinpoint the exact reason why a connection hasn’t been established or why certain infrastructure entities can’t be reached. When developing a tool that lets users troubleshoot their connection, the best we can do is plainly state which errors occurred during the checks, indicate how severely they may impact the user experience, and identify the potential causes of each failure. From there, it is up to the user to figure out which network or firewall restrictions exist and to try to eliminate them.
Diagnostics process breakdown and execution
Our diagnostics procedures can be divided into three distinct test stages or phases:
- Signaling server connection test. This is a straightforward, black-and-white check that can be implemented as a single connection test. During the test, the user attempts to connect to the signaling server. If the signaling server is unreachable, the user won’t be able to participate in a WebRTC connection.
- ICE connection diagnostics. In the next stage, we try to determine whether the user’s network is suitable for participating in a WebRTC connection within the given infrastructure (i.e. specific configuration settings, set of ICE servers, etc.). If it is, we try to predict the quality of such a connection. It’s a good idea to divide the ICE connection diagnostics into two stages. In the ICE discovery tests, we connect to the STUN and TURN servers and attempt candidate gathering. Running these tests helps you make preliminary predictions about the user’s experience. The next stage is the ICE connectivity check. Here, it is useful to try establishing a connection with each of the candidates gathered during the discovery stage.
- Network performance checks. This set of tests follows the connectivity checks and is only run if at least one connection was established successfully during the previous tests. For these tests, we select the connection with the best parameters and use it for bandwidth tests that determine the network’s operational capacity. We establish a data channel on top of the created ICE connection and send messages of varying sizes. Alternatively, we could send actual media, such as a captured video stream in various resolutions, in order to estimate the network’s capabilities for running a video meeting.
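The dependencies between the stages can be sketched as a small orchestrator. This is a minimal sketch rather than a definitive implementation: the `DiagnosticStages` interface, the stage names, and `runDiagnostics` are hypothetical names introduced for illustration, and each stage is passed in as a function so the sketch does not assume how the individual checks are implemented.

```typescript
// Hypothetical names for the diagnostics stages described above.
type StageName = "signaling" | "iceDiscovery" | "iceConnectivity" | "performance";

interface StageResult {
  stage: StageName;
  passed: boolean;
  detail?: string; // error message, if the stage threw
}

// Each stage is supplied by the caller and resolves to true on success.
interface DiagnosticStages {
  signaling: () => Promise<boolean>;
  iceDiscovery: () => Promise<boolean>;    // true if at least one candidate was gathered
  iceConnectivity: () => Promise<boolean>; // true if at least one connection was established
  performance: () => Promise<boolean>;
}

async function runDiagnostics(stages: DiagnosticStages): Promise<StageResult[]> {
  const results: StageResult[] = [];

  // Runs one stage, records its outcome, and treats an exception as a failure.
  const run = async (stage: StageName): Promise<boolean> => {
    let passed = false;
    let detail: string | undefined;
    try {
      passed = await stages[stage]();
    } catch (err) {
      detail = err instanceof Error ? err.message : String(err);
    }
    results.push({ stage, passed, detail });
    return passed;
  };

  // Assumption: signaling and ICE discovery are always run, so the user sees
  // both results even if the signaling server is unreachable.
  await run("signaling");
  const discovered = await run("iceDiscovery");

  // Connectivity checks only make sense if at least one candidate was gathered.
  const connected = discovered && (await run("iceConnectivity"));

  // Performance checks need at least one established connection.
  if (connected) {
    await run("performance");
  }
  return results;
}
```

A troubleshooting page could render the returned list directly, showing skipped stages as "not run" rather than as failures.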
While the signaling server and network performance checks are pretty straightforward to implement in practice, it makes sense to take a closer look at the ICE connection diagnostics below, as executing these tests and drawing conclusions based on the results may be challenging.
ICE discovery diagnostics
Within these checks, two tests will be performed:
- An attempt to gather server reflexive candidates using the STUN server
- An attempt to gather relay candidates using the TURN server
Based on the test results, the following conclusions can be made:
- Failure of both of these tests can be seen as a critical failure.
- Failure to gather STUN candidates can be seen as the network’s inability to support a direct peer-to-peer connection. This will inevitably introduce overhead: instead of flowing directly between peers, the media will flow via the TURN server, which acts as an intermediary proxy.
- Failure to gather TURN candidates alone may not impact the WebRTC session. It only means that the user may not be able to use the TURN server as a fallback in the event the direct connection fails.
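The conclusions above can be sketched as a small classifier over the gathered candidates. In a browser, the candidate lines would typically come from the `icecandidate` events of an `RTCPeerConnection` (the `candidate` string of each `RTCIceCandidate`); here they are passed in as plain strings so the logic can be shown on its own. The names `candidateType`, `assessDiscovery`, and the verdict labels are hypothetical, chosen for illustration.

```typescript
type CandidateType = "host" | "srflx" | "prflx" | "relay";

// Extracts the candidate type from an ICE candidate line, e.g.
// "candidate:1 1 udp 1677729535 203.0.113.5 46154 typ srflx raddr 10.0.0.2 rport 46154".
function candidateType(candidateLine: string): CandidateType | null {
  const match = /\btyp (host|srflx|prflx|relay)\b/.exec(candidateLine);
  return match ? (match[1] as CandidateType) : null;
}

// Hypothetical verdict labels mirroring the three conclusions above.
type DiscoveryVerdict = "ok" | "no-direct-path" | "no-relay-fallback" | "critical";

function assessDiscovery(candidateLines: string[]): DiscoveryVerdict {
  const types = new Set(candidateLines.map(candidateType));
  const hasStun = types.has("srflx"); // server reflexive candidate, gathered via STUN
  const hasTurn = types.has("relay"); // relay candidate, gathered via TURN

  if (!hasStun && !hasTurn) return "critical";
  if (!hasStun) return "no-direct-path";    // media will have to go through TURN
  if (!hasTurn) return "no-relay-fallback"; // no fallback if the direct path fails
  return "ok";
}
```

Host candidates are deliberately ignored in this sketch, since they are gathered locally and say nothing about whether the STUN and TURN servers were reachable.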
ICE connectivity diagnostics
Before discussing which tests should be performed in the connectivity diagnostics stage and which conclusions can be drawn from the test results, it’s worth noting two distinct approaches to implementing these checks:
- Connectivity checks via an echo connection. In this approach, we establish an echo connection within a web client. A single client represents both the initiating and the answering peer. We use the relay ICE candidate to establish the connection, so the data travels via the relay TURN server. The main advantage of this approach is that there is very little engineering overhead needed to implement these tests. The tests only require a TURN server; no additional supporting infrastructure is needed. The downside of the echo test is that while the TURN connectivity can be verified in this way, we can’t verify the successful connectivity via a direct peer-to-peer connection.
- Connectivity checks via a server-based client. To go the extra mile in preparing and conducting network diagnostics, in some cases it may make sense to develop a full-scale implementation of the WebRTC client on the web server. This enables you to verify connectivity via both STUN-generated and TURN-generated candidates, and provides the most accurate prediction of the user experience. Keep in mind, however, that this approach introduces significant engineering overhead.
Another thing to keep in mind about connectivity checks is that it only makes sense to conduct them if you have successfully gathered at least one candidate during the ICE discovery tests.
The ICE connectivity stage of the diagnostics runs as follows:
- WebRTC connections are established, and a dummy message is sent over the data channel established within these connections.
- WebRTC clients then reply with their own echoing message.
- The process is performed for both the STUN-gathered and TURN-gathered candidates. Note that you should test TURN-supported connectivity over both UDP and TCP if TCP is enabled on the server.
Depending on the connectivity test results, you can make a reasonable prediction regarding the user’s connection quality:
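One way to set up these separate runs is to build one peer connection configuration per path. The sketch below assumes the configuration objects are later passed to `RTCPeerConnection` in the browser; the local `PeerConfig` and `IceServerConfig` interfaces mirror only the fields used here, and `TurnSettings`, `buildConnectivityConfigs`, and the example host names are hypothetical. It relies on the standard `iceTransportPolicy: "relay"` setting to force traffic through TURN and on the `?transport=` parameter of TURN URLs to pick UDP or TCP.

```typescript
// Minimal local mirrors of the fields used from the WebRTC configuration.
interface IceServerConfig {
  urls: string[];
  username?: string;
  credential?: string;
}

interface PeerConfig {
  iceServers: IceServerConfig[];
  iceTransportPolicy: "all" | "relay";
}

// Hypothetical description of the TURN server under test.
interface TurnSettings {
  host: string;
  port: number;
  username: string;
  credential: string;
  tcpEnabled: boolean;
}

function buildConnectivityConfigs(
  stunUrl: string,
  turn: TurnSettings
): Record<string, PeerConfig> {
  const turnServer = (transport: "udp" | "tcp"): IceServerConfig => ({
    urls: [`turn:${turn.host}:${turn.port}?transport=${transport}`],
    username: turn.username,
    credential: turn.credential,
  });

  const configs: Record<string, PeerConfig> = {
    // With policy "all", host candidates are also in play, so the selected
    // candidate pair should be inspected before crediting STUN for a success.
    stun: { iceServers: [{ urls: [stunUrl] }], iceTransportPolicy: "all" },
    // Policy "relay" restricts the connection to relay (TURN) candidates.
    turnUdp: { iceServers: [turnServer("udp")], iceTransportPolicy: "relay" },
  };

  // Only test TURN over TCP if the server has it enabled.
  if (turn.tcpEnabled) {
    configs["turnTcp"] = { iceServers: [turnServer("tcp")], iceTransportPolicy: "relay" };
  }
  return configs;
}
```

Each configuration can then drive its own echo exchange over a data channel, so that a failure can be attributed to one specific path.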
- User has successfully connected via STUN. In this scenario, the user is able to connect to other peers via a direct UDP connection. This is the optimal way to connect. It is undesirable for the other connectivity checks to fail, because we want the remaining connection options to be available as a fallback. However, we won’t consider these failures critical errors, as they will not impact the user experience if network conditions remain unchanged during the meeting.
- User has successfully connected via TURN/UDP. If the connection via the TURN relay succeeds while the direct UDP connection fails, the user still has the second-best option to connect. Routing data through the TURN server introduces some overhead but, in many cases, this will not have a significant impact.
- User has successfully connected via TURN/TCP. If the connectivity test via TURN/TCP is the only one to run successfully, this means the user has the least optimal way to connect for WebRTC connections. Users are very likely to have a sub-par video meeting experience due to high latency.
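These three scenarios can be sketched as a ranking over the connectivity test results. This is an illustrative sketch only: the `ConnectivityResults` shape, the outlook labels, and `assessConnectivity` are hypothetical names, and the case where no path connects is added here as an assumption for completeness.

```typescript
type ConnectivityPath = "stun" | "turnUdp" | "turnTcp";

// Outcome of the echo exchange for each tested path.
type ConnectivityResults = Record<ConnectivityPath, boolean>;

// Hypothetical labels for the predicted meeting experience.
type ConnectionOutlook = "optimal" | "acceptable" | "degraded" | "unavailable";

interface ConnectivityVerdict {
  outlook: ConnectionOutlook;
  bestPath: ConnectivityPath | null;
  failedPaths: ConnectivityPath[]; // reported as warnings, not critical errors
}

function assessConnectivity(results: ConnectivityResults): ConnectivityVerdict {
  const paths: ConnectivityPath[] = ["stun", "turnUdp", "turnTcp"];
  const failedPaths = paths.filter((path) => !results[path]);

  // Direct UDP connection: the optimal way to connect.
  if (results.stun) return { outlook: "optimal", bestPath: "stun", failedPaths };
  // Relayed over UDP: second-best, with some overhead from the TURN hop.
  if (results.turnUdp) return { outlook: "acceptable", bestPath: "turnUdp", failedPaths };
  // Relayed over TCP only: likely sub-par due to high latency.
  if (results.turnTcp) return { outlook: "degraded", bestPath: "turnTcp", failedPaths };
  // Assumption: no successful path means the user cannot join at all.
  return { outlook: "unavailable", bestPath: null, failedPaths };
}
```

For example, a user behind a firewall that blocks UDP would get `{ outlook: "degraded", bestPath: "turnTcp" }` along with the two failed paths, which the troubleshooting page can surface as likely causes.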
It is worth noting that specific network conditions may change quite frequently, especially in rapidly changing environments, such as when users use wireless connections. Therefore, keep in mind that successfully passing all the network diagnostic tests will not necessarily mean that the user’s experience in the meeting will be optimal. Make users aware of this fact and encourage them to re-check their network conditions by repeating the connectivity diagnostics process if they experience any issues.