
Long audio alignment

2014-05-24
2014-06-15
  • Jonathan Roitgrund

    Hi,

    I'm working on a basic transcript synchronization system and I was hoping to use Kaldi for long audio alignment (as described on this Sphinx documentation page), using the approach of recursively refining the language model using only the parts of the transcript between confirmed anchor points.

    I've perused the (amazing) doc pages and the WFST paper and I have a basic idea of what I'd like to do but I have a couple of questions.

    Am I right in thinking that I only need to train one general purpose acoustic model (presumably from voxforge data, since it is the most complete freely available one) since I have no prior information about the speaker or audio? Can I generate this just once, and then make several different decoding graphs by composing it with different language models? Will it be stored as "H.fst"?

    Since in my several recursive passes only the language model will differ, can I re-use any data in between passes? From what I understand the alignment (ie the map from MFCC vector frames to transition IDs in the HMM model) depends only on the acoustic model, so it seems like I should be able to re-use this and decode the same alignment to different text depending on the language model I use.

    Finally, I'd love any insight on how best to use my transcript to build the language model. Since I know exactly what order the words are in and what sentences to expect, I assume a rule-based grammar would be ideal, and I should use something like Thrax. Does this sound right?

    Thanks in advance for any help.

     
    • Daniel Povey

      Daniel Povey - 2014-05-24

      Vassil (cc'd) has been doing some work in audio alignment and may have some
      comments,

      Am I right in thinking that I only need to train one general purpose acoustic
      model (presumably from voxforge data, since it is the most complete freely
      available one) since I have no prior information about the speaker or
      audio? Can I generate this just once, and then make several different
      decoding graphs by composing it with different language models? Will it be
      stored as "H.fst"?

      This should be enough, yes, so long as it's all in English. The model
      itself is called final.mdl, typically. To make different decoding graphs
      you could call utils/mkgraph.sh with suitable arguments; it takes the
      final.mdl and the grammar or language FST. H.fst is produced on the fly in
      Kaldi depending on which triphones you actually see, so can't really be
      re-used between different language models.
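
      For illustration, building two decoding graphs from the same acoustic
      model might look roughly like this; the directory names (exp/tri1,
      data/lang_pass*) are placeholders rather than anything from this
      thread, and utils/mkgraph.sh --help gives the exact arguments for your
      Kaldi version:

          # Each hypothetical data/lang_passN contains L.fst plus a G.fst
          # built from that pass's language model; all of them must share the
          # words.txt and phones.txt used at training time.
          utils/mkgraph.sh data/lang_pass1 exp/tri1 exp/tri1/graph_pass1
          utils/mkgraph.sh data/lang_pass2 exp/tri1 exp/tri1/graph_pass2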

      Since in my several recursive passes only the language model will differ,
      can I re-use any data in between passes? From what I understand the
      alignment (ie the map from MFCC vector frames to transition IDs in the HMM
      model) depends only on the acoustic model, so it seems like I should be
      able to re-use this and decode the same alignment to different text
      depending on the language model I use.

      Yes, in principle you could re-use some info between passes, but this will
      involve coding; also, alignment will typically be pretty fast as the
      language model is quite constraining, so this would probably be a waste of
      your (human) time.

      Finally, I'd love any insight on how best to use my transcript to build
      the language model. Since I know exactly what order the words are in and
      what sentences to expect, I assume a rule-based grammar would be ideal, and
      I should use something like Thrax (http://openfst.cs.nyu.edu/twiki/bin/view/GRM/Thrax).
      Does this sound right?

      Hm. I'd go for something with more of a statistical flavor. You could
      just train a regular ARPA language model (e.g. using SRILM) on the
      transcripts. Alternatively you could go for a model in which you can
      access the transcript words in sequence, but also optionally skip them or
      insert filler words. In the latter case, directly writing an FST from a
      script would be best (see the tutorial at www.openfst.org).
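
      To make the second option concrete, here is a minimal sketch (not
      anything from this thread) of a hand-written acceptor in OpenFst text
      format: it walks a hypothetical four-word transcript in order, lets
      each word be skipped at a small cost, and absorbs insertions via a
      filler self-loop. The filler word [unk], the costs, and the file names
      are all assumptions, and the filler must also exist in your lexicon.

          # skip.txt -- acceptor arcs as "src dst label [cost]"; a lone state
          # number marks a final state.  <eps> arcs skip a word, [unk]
          # self-loops absorb inserted/filler words.
          0 1 his
          0 1 <eps> 2.0
          1 2 freshly
          1 2 <eps> 2.0
          2 3 caught
          2 3 <eps> 2.0
          3 4 furs
          3 4 <eps> 2.0
          0 0 [unk] 4.0
          1 1 [unk] 4.0
          2 2 [unk] 4.0
          3 3 [unk] 4.0
          4 4 [unk] 4.0
          4

          # Compile with the same words.txt you use everywhere else.
          fstcompile --acceptor --isymbols=words.txt skip.txt | \
            fstarcsort --sort_type=ilabel > transcript.fst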

      Dan



       
    • Paul Dixon

      Paul Dixon - 2014-05-24

      Finally, I'd love any insight on how best to use my transcript to build
      the language model. Since I know exactly what order the words are in and
      what sentences to expect, I assume a rule-based grammar would be ideal, and
      I should use something like Thrax (http://openfst.cs.nyu.edu/twiki/bin/view/GRM/Thrax).
      Does this sound right?

      You could try a Factor Automaton
      http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=4960722
      http://www.cs.nyu.edu/~mohri/pub/fac.pdf
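
      For readers who have not met the term: a factor automaton of the
      transcript accepts every contiguous substring of it, which is useful
      when you do not know where in the long text a given audio chunk falls.
      Below is a hedged, brute-force sketch in OpenFst text format for a
      hypothetical four-word transcript (the papers above give compact,
      linear-size constructions): every state is final, and an epsilon arc
      runs from the start state to each later state, so any factor is
      accepted.

          # factor.txt -- linear chain over the transcript words, plus
          # epsilon arcs from the start state; a lone state number marks a
          # final state.
          0 1 his
          1 2 freshly
          2 3 caught
          3 4 furs
          0 1 <eps>
          0 2 <eps>
          0 3 <eps>
          0 4 <eps>
          0
          1
          2
          3
          4

          fstcompile --acceptor --isymbols=words.txt factor.txt | \
            fstrmepsilon | fstdeterminize | fstminimize > factor.fst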

       
  • Vassil Panayotov

    My setup is along the lines of what Dan said.
    I'm planning to make the code publicly available in two to three months' time.

    Vassil

     
  • Jonathan Roitgrund

    Thanks for all the replies, this is extremely helpful.
    Am I correct in understanding that the factor automaton approach is a specific case of writing custom FSTs to enforce word order while allowing for indels? I'll play around with different models and see what works best.

    Vassil, are you doing anything similar to Sail for acoustic model refinement in between iterations? They reference this flexible transcription alignment paper and this speaker adaptation paper, neither of which I fully understand.

     
    • Vassil Panayotov

      I don't have time at the moment to read the papers you refer to, but yes, my implementation does use acoustic model adaptation between the passes.

      Vassil

       
  • Jonathan Roitgrund

    I'm having some trouble understanding the alignment tools. It would be lovely if someone could shed some light on them.

    gmm-align-compiled seems to be just like gmm-decode (takes in an acoustic model, a decoding graph but as an rspecifier rather than as a .fst file, and features) except it outputs only alignments rather than both alignments and words. Doesn't aligning require the exact same computation as decoding (finding the highest-scoring path through the WFST)?

    gmm-align takes both transcriptions AND a decoding graph. If the transcriptions aren't used to build a linear acceptor (like the one generated by compile-training-graphs), what are they used for? Are they supposed to be actual user-supplied transcriptions, or are they supposed to be transitions output by a decoding pass? If the latter, doesn't the decoding pass output alignments anyway (gmm-decode takes an alignments-wspecifier)?

     
    • Daniel Povey

      Daniel Povey - 2014-06-08

      gmm-align-compiled seems to be just like gmm-decode (takes in an acoustic
      model, a decoding graph but as an rspecifier rather than as a .fst file,
      and features) except it outputs only alignments rather than both alignments
      and words. Doesn't aligning require the exact same computation as decoding
      (finding the highest-scoring path through the WFST)?

      gmm-align-compiled is generally used in a mode where a separate FST derived
      from each of the human-supplied transcripts has already been created and
      dumped to disk. Yes, it's the same computation as decoding, except the FST
      is typically utterance-specific.
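
      As a rough illustration of that mode (every path here is a placeholder,
      and plain per-utterance features are assumed, so adapt the feature
      pipeline to whatever your model was trained on):

          # Compile one FST per utterance from the transcripts...
          compile-train-graphs exp/tri1/tree exp/tri1/final.mdl data/lang/L.fst \
            "ark:utils/sym2int.pl -f 2- data/lang/words.txt data/train/text |" \
            ark:graphs.fsts
          # ...then align against those utterance-specific graphs.
          gmm-align-compiled --beam=10 --retry-beam=40 exp/tri1/final.mdl \
            ark:graphs.fsts scp:data/train/feats.scp ark:ali.ark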

      gmm-align takes both transcriptions AND a decoding graph. If the
      transcriptions aren't used to build a linear acceptor (like the one
      generated by compile-training-graphs), what are they used for? Are they
      supposed to be actual user-supplied transcriptions, or are they supposed to
      be transitions output by a decoding pass? If the latter, doesn't the
      decoding pass output alignments anyway (gmm-decode takes an
      alignments-wspecifier)?

      gmm-align takes transcriptions and a lexicon and a tree-- I don't think it
      takes a decoding graph, but it does take a lexicon that's represented as an
      FST. gmm-align is like gmm-align-compiled, except that it computes the
      utterance-specific FSTs on the fly, in memory, from the transcripts that it
      reads in.
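
      Under the same hypothetical paths, the on-the-fly equivalent would look
      something like this (tree, model and lexicon FST in place of a
      pre-compiled graph archive):

          gmm-align --beam=10 --retry-beam=40 exp/tri1/tree exp/tri1/final.mdl \
            data/lang/L.fst scp:data/train/feats.scp \
            "ark:utils/sym2int.pl -f 2- data/lang/words.txt data/train/text |" \
            ark:ali.ark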

      Dan



       
  • Jonathan Roitgrund

    That makes perfect sense, especially since gmm-align takes a lexicon rather than a full decoding graph.

    Thanks again for all the help and the wonderful example scripts - Kaldi has been an absolute pleasure to use so far.

    Quick question: is it possible for a very restrictive language model to yield worse results than a very general language model?

    I can transcribe an utterance with 100% accuracy using a bigram LM generated with MITLM from the entire Voxforge corpus.

    With a bigram model generated from just that utterance though, the transcription is completely wrong ("TO FLOOR HIS HE TO FLOOR HIS FRESHLY FLOOR THE" instead of "HIS FRESHLY CAUGHT FURS HE FLUNG TO THE FLOOR"). This happens with several utterances regardless of which acoustic model I use.

    To add to my puzzlement, the idea of my language model being "too restrictive" doesn't make sense given that gmm-align seems to be able to align the transcription, and from what I understand the grammar it creates on the fly is a linear acceptor, which should be even more restrictive.

     
    • Daniel Povey

      Daniel Povey - 2014-06-09

      Quick question: is it possible for a very restrictive language model to
      yield worse results than a very general language model?

      It shouldn't, if it's well matched to what you're recognizing.

      I can transcribe an utterance with 100% accuracy using a bigram LM
      generated with MITLM from the entire Voxforge corpus.

      With a bigram model generated from just that utterance though, the
      transcription is completely wrong ("TO FLOOR HIS HE TO FLOOR HIS FRESHLY
      FLOOR THE" instead of "HIS FRESHLY CAUGHT FURS HE FLUNG TO THE FLOOR").
      This happens with several utterances regardless of which acoustic model I
      use.

      To add to my puzzlement, the idea of my language model being "too
      restrictive" doesn't make sense given that gmm-align seems to be able to
      align the transcription, and from what I understand the grammar it creates
      on the fly is a linear acceptor, which should be even more restrictive.

      I would try using larger beams with your restrictive model. If it still
      doesn't work well, make sure that you don't have a mismatch like using a
      lexicon FST that has the wrong words.txt file. FST composition only makes
      sense if the integer identifiers of things like words are the same in the
      different source FSTs.
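
      As a sketch of both checks (the binaries are standard Kaldi tools, but
      every path below is a placeholder and the features must match your
      model):

          # Re-decode with wider beams...
          gmm-latgen-faster --beam=20 --lattice-beam=10 \
            --word-symbol-table=data/lang/words.txt \
            exp/tri1/final.mdl exp/tri1/graph/HCLG.fst \
            scp:data/test/feats.scp ark:/dev/null ark,t:trans.txt
          # ...and confirm that the lexicon and the grammar were built with
          # the same words.txt.
          cmp data/lang/words.txt data/lang_utt/words.txt || echo "words.txt mismatch"
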
      Dan

       
    • Vassil Panayotov

      I would also suggest using SRILM - in my experience it's more reliable than the FOSS alternatives.
      I think the recommended smoothing algorithm for short texts is Witten-Bell.

       
  • Jonathan Roitgrund

    Tried SRILM and W-B:

        ngram-count -lm data/local/lm.arpa -wbdiscount -text data/corpus.txt -order 3 -write-vocab $data/local/vocab-full.txt

    Tried beam sizes anywhere from 16 to 400.

    Same erroneous transcription every time.

    Could it be because I'm using a different phone list from the one that was used when training the tree for the acoustic model (I'm using tree, final.mdl, and final.mat from Voxforge training)?

     
    • Daniel Povey

      Daniel Povey - 2014-06-10

      Yes, that could be a problem.
      The phones.txt mappings have to be exactly the same as used during
      training, and you have to use a lexicon built with the same words.txt that
      you are using to map the words to integers while compiling the LM into an
      FST. Otherwise everything is meaningless because phones and words are
      mapped to strings in different ways.
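
      A quick sanity check along those lines (both directory names are
      placeholders for wherever your training-time and decoding-time lang
      directories live):

          for f in phones.txt words.txt; do
            cmp lang_from_training/$f lang_for_decoding/$f || echo "MISMATCH: $f"
          done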

      Dan


       
  • Jonathan Roitgrund

    Yep, Dan was right - copying over nonsilence_phones.txt and preventing it from being overwritten fixed it.

     
  • Jonathan Roitgrund

    Any suggestions for improving performance on noisy speech (from a movie)?
    I thought a constrained LM would be enough, but I get a huge error rate even with utterance-specific LMs (and this time it's the noise, not an error in my data preparation, because I get close to 100% on less noisy speech).

    I'm afraid acoustic model adaptation won't help much because the acoustic conditions are likely to be very different (probably even different speakers) between segments.

    I'm still looking around for some way of pre-processing the audio but I'm wondering if there's anything obvious to do at the Kaldi level.

     
    • Vassil Panayotov

      BTW, with respect to the "different speakers in different segments" problem you've mentioned, you could perhaps consider using speaker diarization (e.g. LIUM's) and doing per-speaker AM adaptation. Not sure how helpful that will be for noisy/cocktail-party/etc. speech, though...

       
  • Neil Nelson

    Neil Nelson - 2014-06-14

    For the better grade movies there is no hum or hiss in the usual continuous background noise sense that would be addressed by obtaining audio properties of a noise-only segment and reducing those properties with, for example, a Wiener filter. Rather, the audio we hear is commonly a mixing-board assembly of audio components from a variety of sources, and the 'noise' of the prior post consists of the non-speech components.

    We may think of each of these sources as 'actors', where the 'noise' is much like the background chatter at a party. The chatter comes from actors near and far and at separate spatial locations, and, to our advantage, modern movies are mixed to two or more audio channels with actor positioning obtained through a variety of effects. Methods of Auditory Scene Analysis intend to isolate these various actors. Here is an example paper: http://theses.eurasip.org/media/theses/documents/cobos-maximo-application-of-sound-source-separation-methods-to-advanced-spatial-audio-systems.pdf

     
    • Daniel Povey

      Daniel Povey - 2014-06-14

      For something that's simpler to use, you could try basis-fMLLR, which
      allows you to adapt on a small amount of data. Look in the scripts for
      anything with 'basis' in it.
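
      For example, from the Kaldi source root (gmm-est-basis-fmllr is one of
      the binaries involved; the exact script names may differ between Kaldi
      versions):

          find egs src -name '*basis*'
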
      Dan
