On Wed, Feb 8, 2012 at 12:47 AM, Simon A. Eugster <simon.eu@...> wrote:
> On 02/08/2012 09:08 AM, Alexandre Prokoudine wrote:
>> On Wed, Feb 8, 2012 at 11:32 AM, Simon A. Eugster wrote:
>>> I have no idea how you manage to have a link to a solution for nearly
>>> every problem. Thanks for the link!
>> YW :)
>>> How accurate can we position audio streams? Just by full frames, or is
>>> it possible to have a finer granularity? When I synced audio/video I
>>> often had the problem that the audio was too early and after moving it
>>> by one frame it was too late.
Then your sense is more acute than most humans'.
>> Admittedly, I haven't had a chance to test it myself yet. However
> Already read! :)
> I rather meant kdenlive/MLT here. Can we move an audio clip by just a
> few samples or only by full frames?
At the framework level, only by frame, but something more precise can
be achieved with an audio filter, e.g. sox.delay.
>> "The algorithm I settled on resembles the method a human uses when
>> looking at the waveform view. First, it breaks each input audio stream
>> into 40 ms blocks and computes the mean absolute value of each block.
40 ms is the duration of one frame at 25 fps, so this value can simply
be computed from the frame rate.
>> The resulting 25 Hz signal is the “volume envelope”. The code
>> subtracts the mean volume from each track’s envelope, then performs a
>> cross-correlation between tracks and looks for the peak, which
>> identifies the relative shift.
This can be implemented as a passive transition that computes the shift and
reports it through a property. Then, an application can make a
frame-level adjustment at the edit level and apply sox.delay or
frei0r.delay0r filters for sub-frame accuracy. Or, the transition can
be dual pass, and perform the sub-frame adjustments itself on the
Alternatively, kdenlive is already getting all of the audio in a
consumer-frame-show event, so it could just do all of this analysis in
its own code and use existing filters for sub-frame accuracy.
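
To illustrate the split between the frame-level adjustment and the
sub-frame remainder, here is a minimal sketch. The function name
split_offset and its parameters are made up for illustration and are not
part of MLT or kdenlive. It assumes the measured shift is in milliseconds
and that the delay filter can only add a non-negative delay, which is why
it rounds the frame count down.

```python
import math

def split_offset(offset_ms, fps=25.0):
    """Split a shift into whole frames plus a non-negative sub-frame delay.

    Hypothetical helper: the frame count would be applied at the edit
    level, and the remainder handed to an audio delay filter.
    """
    frame_ms = 1000.0 / fps                  # 40 ms at 25 fps
    frames = math.floor(offset_ms / frame_ms)
    residual_ms = offset_ms - frames * frame_ms  # always in [0, frame_ms)
    return frames, residual_ms
```

For example, split_offset(47.0) moves the clip one frame later and leaves
7 ms for the filter, while split_offset(-30.0) moves it one frame earlier
and leaves 10 ms.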
>> To avoid performing N^2
>> cross-correlations, one clip is selected as the fixed reference, and
>> all others are compared to it. The peak position is quantized to the
>> block duration (creating an error of +/- 20ms), so to improve accuracy
>> a parabolic fit is used to interpolate the true maximum. I don’t know
>> the exact residual error, but I expect it’s typically less than 5 ms,
>> which should be plenty good enough, seeing as sound travels about 1
>> foot per ms."
>> Alexandre Prokoudine
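
For reference, the quoted algorithm could be sketched roughly as below.
This is only my reading of the description, not the original author's
code: all function names are hypothetical, it uses plain Python lists of
mono samples, and the brute-force correlation over a limited lag range is
an assumption made to keep the example short (a real implementation would
probably use an FFT).

```python
def envelope(samples, rate, block_ms=40):
    """Mean absolute value per block, with the overall mean subtracted."""
    n = int(rate * block_ms / 1000)          # samples per block
    env = [sum(abs(s) for s in samples[i:i + n]) / n
           for i in range(0, len(samples) - n + 1, n)]
    if not env:
        raise ValueError("input shorter than one block")
    mean = sum(env) / len(env)
    return [e - mean for e in env]

def cross_correlate(ref_env, other_env, max_lag):
    """corr[lag] = sum over i of other_env[i + lag] * ref_env[i]."""
    corr = {}
    for lag in range(-max_lag, max_lag + 1):
        total = 0.0
        for i, r in enumerate(ref_env):
            j = i + lag
            if 0 <= j < len(other_env):
                total += other_env[j] * r
        corr[lag] = total
    return corr

def parabolic_peak(corr):
    """Peak lag, refined by fitting a parabola through its two neighbours."""
    lag = max(corr, key=corr.get)
    if lag - 1 not in corr or lag + 1 not in corr:
        return float(lag)                    # peak at the edge, no fit possible
    y0, y1, y2 = corr[lag - 1], corr[lag], corr[lag + 1]
    denom = y0 - 2.0 * y1 + y2
    if denom == 0:
        return float(lag)
    return lag + 0.5 * (y0 - y2) / denom

def estimate_shift_ms(ref, other, rate, block_ms=40, max_lag_blocks=50):
    """Positive result: `other` is delayed relative to `ref`."""
    corr = cross_correlate(envelope(ref, rate, block_ms),
                           envelope(other, rate, block_ms),
                           max_lag_blocks)
    return parabolic_peak(corr) * block_ms
```

The result is in milliseconds, so it can be split into a frame count and
a sub-frame remainder as discussed above.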