From: Daniel P. <dp...@gm...> - 2015-03-01 19:44:54
Yes, it definitely would. There are various ways to do it. You could run
ali-to-phones with --per-frame=true to get the per-frame phone labels and
convert them yourself; or you could do

  gunzip -c ali.1.gz | ali-to-post ark:- ark:- | \
    weight-silence-post 0.0 1:2:3:4:5 final.mdl ark:- ark:- | \
    post-to-weights ark:- ark,t:-

which will give you per-frame weights that are zero for silence and one for
speech; you can treat these as labels. The 1:2:3:4:5 should be the contents
of your data/lang/phones/silence.csl.
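
For concreteness, here are both of those options as a small script. This is
only a sketch: the exp/tri3_ali and data/lang paths are placeholders, not
anything fixed, so substitute whatever your setup actually uses.

  #!/usr/bin/env bash
  ali_dir=exp/tri3_ali   # placeholder: your alignment directory
  lang=data/lang         # placeholder: your lang directory
  silphones=$(cat $lang/phones/silence.csl)   # e.g. "1:2:3:4:5"

  # Option 1: per-frame phone labels; map phone ids to speech/nonspeech
  # yourself, using the silence-phone list in $lang/phones/silence.txt.
  gunzip -c $ali_dir/ali.1.gz | \
    ali-to-phones --per-frame=true $ali_dir/final.mdl ark:- ark,t:phones.txt

  # Option 2: per-frame 0/1 weights (0 = silence, 1 = speech), as text.
  gunzip -c $ali_dir/ali.1.gz | \
    ali-to-post ark:- ark:- | \
    weight-silence-post 0.0 $silphones $ali_dir/final.mdl ark:- ark:- | \
    post-to-weights ark:- ark,t:weights.txt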
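
The ark,t output is plain text, one utterance per line, in the form
"utt-id  [ w1 w2 ... wN ]", so if you want plain integer frame labels for an
external toolkit (e.g. the Theano idea below), something like this should
work -- again just a sketch, assuming weights.txt from the script above:

  # Drop the utt-id's brackets; fields 3..NF-1 hold the per-frame
  # weights, which are exactly 0 or 1 here.
  awk '{printf "%s", $1; for (i = 3; i < NF; i++) printf " %d", $i; print ""}' \
    weights.txt > labels.txt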

Dan

On Sun, Mar 1, 2015 at 7:20 AM, John Barnes <jcb...@gm...> wrote:

> Do frame-level labels from Kaldi acoustic models include the silence
> phones (e.g. SIL, SPN, ...)?
>
> If so, would it be possible to take aligned ASR data, extract those
> frame labels, collapse all nonsilence phones to a single class, and
> train the VAD DNN using an external framework like Theano?
>
> John
>
> On Saturday, February 28, 2015, Daniel Povey <dp...@gm...> wrote:
>
>> Hi,
>> I am cross-posting this to kaldi-developers as I think my reply might
>> be of interest to people subscribed to that list. This is a good
>> excuse to talk about the situation with Voice Activity Detection
>> (VAD) more generally.
>>
>> There definitely does need to be some good voice activity detection
>> in Kaldi at some point. Part of the reason it doesn't exist yet is
>> that it has never been clear to me that there is a "right" way to do
>> VAD, or even a right way to formulate it as a problem. For example,
>> how many classes should there be (music? laughter?), and what should
>> be done about cross-talk and background speakers? And how does this
>> all work in the online setting (e.g. is there a mechanism to
>> reclassify previous speech as background if we get much louder
>> speech)?
>>
>> Formulating it as a multi-class (speech/nonspeech/...) problem with
>> neural nets does seem to be one of the most natural ways to set it
>> up. However, I think it would make more sense to do this at the frame
>> level rather than the segment level. Some of the issues involved in
>> setting this up are a little complicated; for instance, it might be
>> necessary to change some of the command-line tools so they don't
>> require the transition model and can accept labels directly instead
>> of alignments.
>>
>> Right now I'm working on extending the online-nnet2 setup to use the
>> decoder backtrace to classify frames as silence or nonsilence, and to
>> use this to limit the iVector estimation to the non-silence frames.
>> This should at least remove the WER performance hit that we get from
>> not having speech/silence detection in online decoding. In the past
>> (e.g. for BABEL) we have done segmentation by doing a first pass of
>> recognition with a fairly simple model and post-processing the output
>> to create segments.
>>
>> Dan
>>
>> On Sat, Feb 28, 2015 at 8:05 AM, John Barnes <jcb...@gm...> wrote:
>>
>>> I'm interested in training a DNN voice activity detection system
>>> using Kaldi. I have a large corpus labeled at the segment level as
>>> speech and nonspeech. Are there any existing recipes to do this, or
>>> suggestions on how to modify a recipe to accomplish this task?
>>>
>>> Thanks
>>>
>>> John