From: Daniel P. <dp...@gm...> - 2015-03-01 19:44:54
Yes, it definitely would. There are various ways to do it. You could run
ali-to-phones with --per-frame=true to get the per-frame phone labels and
convert them yourself; or you could do

  gunzip -c ali.1.gz | ali-to-post ark:- ark:- | \
    weight-silence-post 0.0 1:2:3:4:5 final.mdl ark:- ark:- | \
    post-to-weights ark:- ark,t:-

which will give you per-frame weights that are zero for silence and one for
speech; you can treat these as labels. The 1:2:3:4:5 should be the contents
of your data/lang/phones/silence.csl.
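
For concreteness, here are both of those options as a small script. This is
only a sketch: the exp/tri3_ali and data/lang paths are placeholders, not
anything fixed, so substitute whatever your setup actually uses.

  #!/usr/bin/env bash
  ali_dir=exp/tri3_ali   # placeholder: your alignment directory
  lang=data/lang         # placeholder: your lang directory
  silphones=$(cat $lang/phones/silence.csl)   # e.g. "1:2:3:4:5"

  # Option 1: per-frame phone labels; map phone ids to speech/nonspeech
  # yourself, using the silence-phone list in $lang/phones/silence.txt.
  gunzip -c $ali_dir/ali.1.gz | \
    ali-to-phones --per-frame=true $ali_dir/final.mdl ark:- ark,t:phones.txt

  # Option 2: per-frame 0/1 weights (0 = silence, 1 = speech), as text.
  gunzip -c $ali_dir/ali.1.gz | \
    ali-to-post ark:- ark:- | \
    weight-silence-post 0.0 $silphones $ali_dir/final.mdl ark:- ark:- | \
    post-to-weights ark:- ark,t:weights.txt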
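
The ark,t output is plain text, one utterance per line, in the form
"utt-id  [ w1 w2 ... wN ]", so if you want plain integer frame labels for an
external toolkit (e.g. the Theano idea below), something like this should
work -- again just a sketch, assuming weights.txt from the script above:

  # Drop the utt-id's brackets; fields 3..NF-1 hold the per-frame
  # weights, which are exactly 0 or 1 here.
  awk '{printf "%s", $1; for (i = 3; i < NF; i++) printf " %d", $i; print ""}' \
    weights.txt > labels.txt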

Dan

On Sun, Mar 1, 2015 at 7:20 AM, John Barnes <jcb...@gm...> wrote:

> Do frame-level labels from Kaldi acoustic models include the silence
> phones (e.g. SIL, SPN, ...)?
>
> If so, would it be possible to take aligned ASR data, extract those
> frame labels, collapse all nonsilence phones to a single class, and
> train the VAD DNN using an external framework like Theano?
>
> John
>
> On Saturday, February 28, 2015, Daniel Povey <dp...@gm...> wrote:
>
>> Hi,
>> I am cross-posting this to kaldi-developers as I think my reply might
>> be of interest to people subscribed to that list. This is a good
>> excuse to talk about the situation with Voice Activity Detection
>> (VAD) more generally.
>>
>> There definitely does need to be some good voice activity detection
>> in Kaldi at some point. Part of the reason it doesn't exist yet is
>> that it has never been clear to me that there is a "right" way to do
>> VAD, or even a right way to formulate it as a problem. For example,
>> how many classes should there be (music? laughter?), and what should
>> be done about cross-talk and background speakers? And how does this
>> all work in the online setting (e.g. is there a mechanism to
>> reclassify previous speech as background if we get much louder
>> speech)?
>>
>> Formulating it as a multi-class (speech/nonspeech/...) problem with
>> neural nets does seem to be one of the most natural ways to set it
>> up. However, I think it would make more sense to do this at the frame
>> level rather than the segment level. Some of the issues involved in
>> setting this up are a little complicated; for instance, it might be
>> necessary to change some of the command-line tools so they don't
>> require the transition model and can accept labels directly instead
>> of alignments.
>>
>> Right now I'm working on extending the online-nnet2 setup to use the
>> decoder backtrace to classify frames as silence or nonsilence, and to
>> use this to limit the iVector estimation to the non-silence frames.
>> This should at least remove the WER performance hit that we get from
>> not having speech/silence detection in online decoding. In the past
>> (e.g. for BABEL) we have done segmentation by doing a first pass of
>> recognition with a fairly simple model and post-processing the output
>> to create segments.
>>
>> Dan
>>
>> On Sat, Feb 28, 2015 at 8:05 AM, John Barnes <jcb...@gm...> wrote:
>>
>>> I'm interested in training a DNN voice activity detection system
>>> using Kaldi. I have a large corpus labeled at the segment level as
>>> speech and nonspeech. Are there any existing recipes to do this, or
>>> suggestions on how to modify a recipe to accomplish this task?
>>>
>>> Thanks
>>>
>>> John