As the audio I need for training is already recorded, I took a 10-minute
WAV file, then ran it through webrtcvad
( https://github.com/wiseman/py-webrtcvad ) to split the audio into
many small files (over 200).
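For anyone following along, here is a minimal sketch of the splitting idea, not the exact script used above. It groups consecutive speech frames into segments and reports their start and end times in milliseconds. The function names (`frame_size_bytes`, `speech_segments`) are made up for illustration, and the voice-activity decision is passed in as a callable so the sketch runs without any third-party package. The example script shipped with py-webrtcvad also smooths the decisions over a padding window, which this sketch leaves out.

```python
from typing import Callable, List, Tuple


def frame_size_bytes(sample_rate: int, frame_ms: int) -> int:
    """Number of bytes in one frame of 16-bit mono PCM."""
    return sample_rate * frame_ms // 1000 * 2


def speech_segments(
    pcm: bytes,
    sample_rate: int,
    is_speech: Callable[[bytes, int], bool],
    frame_ms: int = 30,
) -> List[Tuple[int, int]]:
    """Return (start_ms, end_ms) for each run of consecutive speech frames.

    `is_speech(frame, sample_rate)` is an assumed callable that decides
    whether a single frame contains speech.
    """
    size = frame_size_bytes(sample_rate, frame_ms)
    n_frames = len(pcm) // size  # a trailing partial frame is ignored
    segments = []
    start = None  # start time of the segment currently being built
    for i in range(n_frames):
        frame = pcm[i * size:(i + 1) * size]
        if is_speech(frame, sample_rate):
            if start is None:
                start = i * frame_ms
        elif start is not None:
            # Speech just ended: close the current segment.
            segments.append((start, i * frame_ms))
            start = None
    if start is not None:
        # Audio ended while still inside a speech segment.
        segments.append((start, n_frames * frame_ms))
    return segments
```

With py-webrtcvad installed, `webrtcvad.Vad(2).is_speech` can be passed as `is_speech`. It expects 16-bit mono PCM frames of 10, 20, or 30 ms at 8000, 16000, 32000, or 48000 Hz.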
When creating the transcripts for each file, I'm assuming that I should
attempt to write each word exactly as I hear it. Is that correct? Here are
some examples:
Spoken word --> because --> written word --> because
Spoken word --> cause --> written word --> 'cause
Spoken word --> dont --> written word --> don't
Spoken word --> its --> written word --> it's
Spoken word --> the um er --> written word --> the um er
Spoken word --> the the --> written word --> the the
Is it okay to put minor punctuation in the transcripts?
I am now going through each file (some are just noise or empty, so these
are deleted), listening to it, then recording the transcript in a text
file. I have now gone through those 83 WAV files and the total duration
is 5 min 48 seconds. The range is 0.63 seconds to 13.44 seconds, yet
most of the files are between 1 second and 6 seconds long.
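Totals like these can be checked from the WAV headers rather than by hand. The following is a rough sketch using only Python's standard `wave` module. The helper names (`wav_duration`, `summarize`) and the assumption that all segments sit in one folder as `*.wav` files are illustrative, not part of the workflow described above.

```python
import wave
from pathlib import Path


def wav_duration(path) -> float:
    """Duration in seconds, computed as frame count / frame rate."""
    with wave.open(str(path), "rb") as w:
        return w.getnframes() / w.getframerate()


def summarize(folder):
    """Return (file count, total, shortest, longest), durations in seconds.

    Only files matching *.wav directly inside `folder` are considered.
    """
    durations = [wav_duration(p) for p in sorted(Path(folder).glob("*.wav"))]
    if not durations:
        return 0, 0.0, 0.0, 0.0
    return len(durations), sum(durations), min(durations), max(durations)
```

Calling `summarize("segments")` on a hypothetical `segments` folder returns the number of files, the total duration, and the shortest and longest file, which is enough to compare against a target such as 1 hour (3600 seconds).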
I seem to remember reading somewhere that it was recommended to have at
least 1 hour of audio in preparation for the training. Will the 5 min
48 seconds be sufficient?
No
On Tue, 20 Mar 2018 20:40:01 -0000
"Nickolay V. Shmyrev" nshmyrev@users.sourceforge.net wrote:
Thanks Nickolay. Is 1 hr of audio the minimum required?
Is it okay to put minor punctuation in the transcripts?
Peter
Hi,
I currently have 50 minutes of audio input, but it is still too little to train context-dependent models.
Does anyone know the minimum audio input length for a context-dependent model?
BR
Marc