This page explains the relationship between audio and video timing within Snowmix. Understanding the conditions, limitations and possibilities regarding the timing of audio and video streams within Snowmix may be crucial for achieving sync between various sources of audio and video. The page has the following subsections:
Snowmix spends most of its time, unless running on inadequate hardware, waiting for input from its shared memory control sockets and its control connections. The latter may carry Snowmix commands or audio data. This input arrives asynchronously. When input arrives, Snowmix will detect this and execute the following 3 steps:
Snowmix operates in time slices of a frame period. A frame period is the inverse of the system framerate set for Snowmix. If the system framerate is set to 25 or 30, the frame period is 40 ms or 33.3 ms respectively. In addition to the asynchronous handling described above, Snowmix will at frame rate go through the following steps:
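The relation between framerate and frame period is simply the inverse, which can be checked with a few lines of plain Python (an illustration only, not part of Snowmix):

```python
def frame_period_ms(framerate: float) -> float:
    """Frame period in milliseconds: the inverse of the framerate."""
    return 1000.0 / framerate

print(frame_period_ms(25))            # -> 40.0
print(round(frame_period_ms(30), 1))  # -> 33.3
```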
So if Snowmix is set up to use output mode 0 and no extra delay is configured for the audio elements, a video frame and audio frames sent into Snowmix will be mixed and sent out within a frame period. That said, because steps 7, 8, 9 and 10 can take some time, ideally less than a frame period, there will be a time skew or difference between when mixed audio data is sent out of Snowmix and when a mixed video frame is sent out. For this reason, and because audio data is usually sent into Snowmix in smaller time chunks than a frame period, adding a delay of a frame period plus a few ms to all audio feeds is usually recommended. If the frame period is 40 ms, then adding a 50 ms delay using the command audio feed delay is usually a good choice.
audio feed delay 1 50
audio feed delay 2 50
etc.
Note that the delay in the above example is a one-time delay of 50 ms of silent samples, added to the queues using the specific audio feed as a source every time an external audio source connects to that specific Snowmix audio feed and begins to send samples into Snowmix. This delay may increase or decrease due to audio drift, explained later.
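How many silent samples such a delay corresponds to depends on the sample rate. A quick calculation in plain Python (an illustration, not Snowmix code):

```python
def silent_samples(sample_rate: int, delay_ms: int) -> int:
    """Number of silent samples per channel needed to cover delay_ms
    at the given sample rate."""
    return sample_rate * delay_ms // 1000

print(silent_samples(48000, 50))  # -> 2400
print(silent_samples(44100, 50))  # -> 2205
```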
When feeding a file or a live stream of audio and video into Snowmix, it is important that the feeder feeds the streams synchronously and in a timely manner. When using GStreamer to feed audio and video into Snowmix, this is usually ensured by setting the sync flag to true for the outputting modules as shown here:
( echo 'audio feed ctr isaudio 1'
gst-launch-1.0 -v tcpclientsrc host=a.b.c.d port=e do-timestamp=true !\
decodebin name=decoder !\
videorate ! videoscale ! videoconvert ! $MIXERFORMAT ! queue !\
shmsink socket-path=/tmp/feed1-control-pipe shm-size=51609600 wait-for-connection=0 sync=true decoder. !\
audioresample ! audioconvert ! $AUDIOFORMAT !\
queue ! fdsink fd=3 sync=true 3>&1 1>&2
) | nc 127.0.0.1 9999
Here we should note three important settings. The first is that timestamps are added at the source when possible (do-timestamp=true). The second is that the shmsink has the option sync=true set, and the third is that the fdsink has the option sync=true set too.
While we could theoretically introduce delay in the queues before the shmsink for video and the fdsink for audio by assigning a minimum threshold of buffers, this may not always work, as output buffers from the decoder may arrive in bursts. However, the render-delay option for shmsink and fdsink may work (untested). Note that in GStreamer the render-delay property of a sink is specified in nanoseconds. Adding a render delay may require an increase in shm-size for the shmsink.
If video frames are fed to Snowmix faster than Snowmix consumes them, Snowmix will discard incoming frames if the feed is set to live as opposed to recorded.
Snowmix will start discarding video frames when more than two video frames for a feed are ready in Snowmix. If a video source is providing video frames at a higher frame rate than Snowmix is configured for, frames will be dropped as needed every now and then. If the source is providing 25 frames per second and Snowmix is configured for 24 fps, a single frame is dropped every second.
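In the steady state the drop rate is just the difference between the source and system framerates, as this small Python illustration shows (not Snowmix code):

```python
def frames_dropped_per_second(source_fps: int, system_fps: int) -> int:
    """Surplus frames per second that must be discarded when a live
    source runs faster than the configured system framerate."""
    return max(0, source_fps - system_fps)

print(frames_dropped_per_second(25, 24))  # -> 1
print(frames_dropped_per_second(30, 25))  # -> 5
```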
If a video decoder is providing frames in bursts, but otherwise keeps the same frame rate as Snowmix, Snowmix will drop frames when more than 2 frames are decoded and ready inside Snowmix, while it will reuse an old frame if a new one is not present. This way the video stays in sync, but it will be choppy. To avoid this, the decoder/shmsink setup must provide a steady flow of frames. Assuming the video stream is encoded for a steady flow of frames, GStreamer will decode the stream with correct timestamps, and if the sync flag for shmsink is set to true, the shmsink will feed Snowmix with a steady flow of frames.
If a video source is providing frames too slowly for Snowmix, Snowmix will reuse the last frame received to compensate.
Audio data being fed into audio feeds in Snowmix is read asynchronously as fast as possible, and the audio data is queued immediately.
Audio mixers, which use audio feeds or other audio mixers as sources, will mix audio data at frame rate using the sample rate specified for the mixer. Mixed audio data is then queued for the audio elements, such as other mixers or audio sinks.
Audio sinks will write all queued audio data at frame rate. A typical setup looks like this:
External --> Audio feed --> Audio mixer --> Audio sink --> External
So due to the way the mixer element works, no matter whether input is too slow or too fast, the mixer will at frame rate provide a stable number of audio samples through audio sinks for output. If input is too fast, audio data will queue up at the source for an audio mixer, possibly resulting in an unacceptable delay. If input is too slow, the mixer will add silent audio samples for the source when audio samples are eventually missing. To compensate for audio data arriving too fast or too slow, the minimum and maximum buffer delay for the source of a mixer can be set. By default the min and max delay are disabled. In the following example the minimum and maximum delay are set for audio mixer 1 source 3:
audio mixer source mindelay 1 3 70
audio mixer source maxdelay 1 3 280
Here the minimum delay is set to 70 ms and the maximum delay is set to 280 ms. The effect is that the source is not mixed into the mixer before at least 70 ms is queued or buffered for the source. It also means that no more than 280 ms of audio samples will be queued for the source. Audio samples beyond this will be discarded.
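The interplay between mindelay, maxdelay, silence padding and discarding can be sketched as a toy model in plain Python. This is an illustration of the behavior described above, not Snowmix source code:

```python
class MixerSourceQueue:
    """Toy model of one mixer source queue governed by mindelay and
    maxdelay, with all quantities measured in ms of audio."""

    def __init__(self, mindelay_ms: int, maxdelay_ms: int):
        self.mindelay_ms = mindelay_ms
        self.maxdelay_ms = maxdelay_ms
        self.queued_ms = 0      # audio currently buffered for the source
        self.started = False    # mixing starts once mindelay is reached

    def add_audio(self, ms: int) -> int:
        """Queue incoming audio; return how many ms were discarded
        because the queue would exceed maxdelay."""
        self.queued_ms += ms
        discarded = max(0, self.queued_ms - self.maxdelay_ms)
        self.queued_ms -= discarded
        return discarded

    def mix_frame(self, frame_period_ms: int):
        """Consume one frame period of audio; return ms of silence
        padded in, or None if the source is not yet mixed."""
        if not self.started:
            if self.queued_ms < self.mindelay_ms:
                return None  # below mindelay: source not mixed yet
            self.started = True
        consumed = min(frame_period_ms, self.queued_ms)
        self.queued_ms -= consumed
        return frame_period_ms - consumed  # silence added for this frame


q = MixerSourceQueue(mindelay_ms=70, maxdelay_ms=280)
q.add_audio(40)
print(q.mix_frame(40))   # -> None (only 40 ms buffered, below 70 ms)
q.add_audio(40)
print(q.mix_frame(40))   # -> 0 (mixing started, no silence needed)
print(q.add_audio(300))  # -> 60 (60 ms discarded beyond the 280 ms cap)
```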
Because audio samples are sent into Snowmix in chunks also known as audio frames with a number of samples representing a length in time, the difference between mindelay and maxdelay should be greater than a couple of audio frames. Otherwise Snowmix may frequently discard audio samples unnecessarily and be forced to later add silent samples. Typically a difference of 150-200 ms is adequate, but it depends on the process or program that is feeding data into Snowmix and possibly the audio codec that audio data was decoded from before being fed into Snowmix as raw audio samples.
The mindelay and maxdelay feature of Snowmix will, when used correctly, compensate for audio drift. Audio drift occurs when an audio source is producing audio samples slightly faster or slower than an audio consumer is consuming them. This will always happen when the clocks of the producer and the consumer are not synchronized, as is usually the case.
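To get a feel for the magnitude of drift, a clock error is typically stated in parts per million (ppm). A quick Python calculation (an illustration, not Snowmix code) shows how such an error accumulates over time:

```python
def drift_ms_per_hour(ppm_error: float) -> float:
    """Accumulated audio drift in ms per hour caused by a clock that
    is off by ppm_error parts per million."""
    return 3_600_000 * ppm_error / 1_000_000

print(drift_ms_per_hour(100))  # -> 360.0 (a 100 ppm error drifts 360 ms/hour)
```

A drift of 360 ms per hour would exhaust a typical mindelay/maxdelay window well within an hour, which is why the window must keep absorbing the drift continuously.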
If the video output needs to be delayed, this can be achieved by setting the system output mode to a value larger than 0 and the system output delay to a value larger than 1. In the following example, video output is delayed by two frame periods:
system output mode 1
system output delay 2
If the audio output needs to be delayed, all sources for the mixer sourcing the audio sink must have mindelay set. In the following example, audio is delayed by 200 ms:
audio mixer source mindelay 1 1 200
audio mixer source mindelay 1 2 200
audio mixer source mindelay 1 3 200
audio mixer source mindelay 1 4 200
Note that if the maxdelay feature is to be used, the maxdelay value for the sources must typically be set to a value 150-200 ms larger than the mindelay.
Silent audio samples can be added upon request, either manually or automatically as part of a script. The commands available for this are the following:
audio feed add silence <feed id> <silence in ms>
audio mixer source add silence <mixer id> <source no> <silence in ms>
audio mixer add silence <mixer id> <silence in ms>
audio sink add silence <sink id> <silence in ms>
Because a mixer only consumes a fixed number of samples per frame period for every source, silent audio samples added with audio feed add silence or audio mixer source add silence will subsequently delay mixing of the specific source. Silent samples added using audio mixer add silence or audio sink add silence will produce more samples for output in the frame period the silent samples were added, but otherwise have no internal effect. Whether an external process reading the audio samples for output will be affected by the extra samples depends on the specific process or program.
Note that adding silent samples for sources of a mixer may result in dropping or discarding samples if the maxdelay feature has been enabled for the mixer source.
As with adding silent audio samples, Snowmix can drop or discard samples in audio queues. The commands are:
audio feed drop <feed id> <samples in ms>
audio mixer source drop silence <mixer id> <source no> <samples in ms>
audio mixer drop silence <mixer id> <samples in ms>
audio sink drop silence <sink id> <samples in ms>
While it is always possible to add silent samples to audio queues instantaneously, dropping samples may take longer. If an audio queue has 100 ms of samples and the queue is requested to drop 700 ms of samples, the 100 ms of samples will be dropped immediately, as well as the next 600 ms of samples added to the queue.
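This deferred dropping can be sketched as a small toy model in plain Python. It is an illustration of the behavior described above, not Snowmix source code:

```python
class DropQueue:
    """Toy model of deferred dropping: a drop request larger than the
    current queue keeps discarding newly arriving samples until the
    full requested amount has been dropped. Quantities are in ms."""

    def __init__(self):
        self.queued_ms = 0
        self.pending_drop_ms = 0

    def drop(self, ms: int):
        """Drop what is queued now; remember the rest as pending."""
        dropped_now = min(ms, self.queued_ms)
        self.queued_ms -= dropped_now
        self.pending_drop_ms += ms - dropped_now

    def add_audio(self, ms: int):
        """New audio first absorbs any pending drop, then queues."""
        absorbed = min(ms, self.pending_drop_ms)
        self.pending_drop_ms -= absorbed
        self.queued_ms += ms - absorbed


q = DropQueue()
q.add_audio(100)   # queue holds 100 ms
q.drop(700)        # 100 ms dropped now, 600 ms pending
q.add_audio(250)   # fully absorbed by the pending drop
q.add_audio(400)   # 350 ms absorbed, 50 ms finally queued
print(q.queued_ms, q.pending_drop_ms)  # -> 50 0
```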
Usually, when connecting a new audio and video device to Snowmix, it is necessary to check the sync/delay of the streams. This can be done manually by feeding the audio and video into Snowmix with as little delay as possible. Then Snowmix is set up to mix the audio and video to the output, and the output is displayed using an external player. When this is verified to work, a clapper board or a hand clap is used to generate a visible and audible signal on the output. Usually the audio is ahead of the video. Now the following steps can be used to determine the approximate delay needed.
The amount of delay added in steps 2-3 is then t1 and the amount of delay added in steps 4-5 is t2. Then the optimal delay to add to an audio source can be calculated like this:
Optimal delay = (2 x t1 + t2) / 2
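The formula above is straightforward to evaluate; a few lines of plain Python (an illustration, not Snowmix code):

```python
def optimal_delay_ms(t1: float, t2: float) -> float:
    """Optimal audio delay from the measured delays t1 and t2,
    using the formula (2 * t1 + t2) / 2 described above."""
    return (2 * t1 + t2) / 2

print(optimal_delay_ms(60, 170))  # -> 145.0
```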
If t1 was 60 and t2 was 170, the optimal delay would be 145 ms. If the optimal delay is 145 ms for audio feed 3, and audio feed 3 is source 2 in mixer 1, the following configuration may be useful:
audio feed delay 3 145
audio mixer source mindelay 1 2 145
audio mixer source maxdelay 1 2 325
This will ensure that when the audio source connects to Snowmix, 145 ms of silent audio samples is prepended once. Furthermore, the mixer using the feed as a source will keep at least 145 ms of samples in the source queue, adding silent samples when needed, and it will drop samples that would otherwise enlarge the queue beyond 325 ms.
If a camera and microphone are recording a stream from Snowmix, perhaps played on a laptop with both audio and video, and the recorded audio and video stream is sent back to Snowmix, it is possible to write a script for Snowmix that automatically determines the relative and absolute delay for both audio and video and subsequently determines the audio delay necessary for the camera's audio and video to be in sync in Snowmix.
Such a script is being developed as a Snowmix library (slib), but currently the script is hard to use and not supported.
If Snowmix is receiving audio and video streams from multiple sources, these sources may not always be in sync with each other. Especially if multiple different encoder hardware types and streaming protocols are used, the difference in delay of each stream may become noticeable. If this is the case, the video of the "fastest" video stream must be delayed to match the "slowest" video stream. The shmsink option render-delay can be used for this task, possibly requiring the shm-size of the shmsink to be enlarged.
As with video, audio from various sources may need an individual delay. As explained earlier, the render-delay of the fdsink can be used, but it is easier to add a delay for each audio feed in Snowmix, as well as setting mindelay and maxdelay for audio sources in the audio mixers of Snowmix.
Synchronizing multiple audio and video sources may be cumbersome, especially if the sources are not within physical reach, so testing the equipment in a lab before deployment is recommended.