
Long audio alignment

2014-05-24
2014-06-15
  • Jonathan Roitgrund

    Hi,

    I'm working on a basic transcript synchronization system and I was hoping to use Kaldi for long audio alignment (as described on this Sphinx documentation page), using the approach of recursively refining the language model using only the parts of the transcript between confirmed anchor points.

    I've perused the (amazing) doc pages and the WFST paper and I have a basic idea of what I'd like to do but I have a couple of questions.

    Am I right in thinking that I only need to train one general purpose acoustic model (presumably from voxforge data, since it is the most complete freely available one) since I have no prior information about the speaker or audio? Can I generate this just once, and then make several different decoding graphs by composing it with different language models? Will it be stored as "H.fst"?

    Since in my several recursive passes only the language model will differ, can I re-use any data in between passes? From what I understand the alignment (ie the map from MFCC vector frames to transition IDs in the HMM model) depends only on the acoustic model, so it seems like I should be able to re-use this and decode the same alignment to different text depending on the language model I use.

    Finally, I'd love any insight on how best to use my transcript to build the language model. Since I know exactly what order the words are in and what sentences to expect, I assume a rule-based grammar would be ideal, and I should use something like Thrax. Does this sound right?

    Thanks in advance for any help.

     
    • Daniel Povey

      Daniel Povey - 2014-05-24

      Vassil (cc'd) has been doing some work in audio alignment and may have some
      comments,

      Am I right in thinking that I only need to train one general purpose acoustic
      model (presumably from voxforge data, since it is the most complete freely
      available one) since I have no prior information about the speaker or
      audio? Can I generate this just once, and then make several different
      decoding graphs by composing it with different language models? Will it be
      stored as "H.fst"?

      This should be enough, yes, so long as it's all in English. The model
      itself is called final.mdl, typically. To make different decoding graphs
      you could call utils/mkgraph.sh with suitable arguments; it takes the
      final.mdl and the grammar or language FST. H.fst is produced on the fly in
      Kaldi depending on which triphones you actually see, so can't really be
      re-used between different language models.
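
      For illustration, building two decoding graphs from the same acoustic
      model might look roughly like this; the directory names (exp/tri1,
      data/lang_pass*) are placeholders rather than anything from this
      thread, and utils/mkgraph.sh --help gives the exact arguments for your
      Kaldi version:

          # Each hypothetical data/lang_passN contains L.fst plus a G.fst
          # built from that pass's language model; all of them must share the
          # words.txt and phones.txt used at training time.
          utils/mkgraph.sh data/lang_pass1 exp/tri1 exp/tri1/graph_pass1
          utils/mkgraph.sh data/lang_pass2 exp/tri1 exp/tri1/graph_pass2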

      Since in my several recursive passes only the language model will differ,
      can I re-use any data in between passes? From what I understand the
      alignment (ie the map from MFCC vector frames to transition IDs in the HMM
      model) depends only on the acoustic model, so it seems like I should be
      able to re-use this and decode the same alignment to different text
      depending on the language model I use.

      Yes, in principle you could re-use some info between passes, but this will
      involve coding; also, alignment will typically be pretty fast as the
      language model is quite constraining, so this would probably be a waste of
      your (human) time.

      Finally, I'd love any insight on how best to use my transcript to build
      the language model. Since I know exactly what order the words are in and
      what sentences to expect, I assume a rule-based grammar would be ideal, and
      I should use something like Thrax (http://openfst.cs.nyu.edu/twiki/bin/view/GRM/Thrax).
      Does this sound right?

      Hm. I'd go for something with more of a statistical flavor. You could
      just train a regular ARPA language model (e.g. using SRILM) on the
      transcripts. Alternatively you could go for a model in which you can
      access the transcript words in sequence, but also optionally skip them or
      insert filler words. In the latter case, directly writing an FST from a
      script would be best (see the tutorial at www.openfst.org).
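
      To make the second option concrete, here is a minimal sketch (not
      anything from this thread) of a hand-written acceptor in OpenFst text
      format: it walks a hypothetical four-word transcript in order, lets
      each word be skipped at a small cost, and absorbs insertions via a
      filler self-loop. The filler word [unk], the costs, and the file names
      are all assumptions, and the filler must also exist in your lexicon.

          # skip.txt -- acceptor arcs as "src dst label [cost]"; a lone state
          # number marks a final state.  <eps> arcs skip a word, [unk]
          # self-loops absorb inserted/filler words.
          0 1 his
          0 1 <eps> 2.0
          1 2 freshly
          1 2 <eps> 2.0
          2 3 caught
          2 3 <eps> 2.0
          3 4 furs
          3 4 <eps> 2.0
          0 0 [unk] 4.0
          1 1 [unk] 4.0
          2 2 [unk] 4.0
          3 3 [unk] 4.0
          4 4 [unk] 4.0
          4

          # Compile with the same words.txt you use everywhere else.
          fstcompile --acceptor --isymbols=words.txt skip.txt | \
            fstarcsort --sort_type=ilabel > transcript.fst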

      Dan



       
    • Paul Dixon

      Paul Dixon - 2014-05-24

      Finally, I'd love any insight on how best to use my transcript to build
      the language model. Since I know exactly what order the words are in and
      what sentences to expect, I assume a rule-based grammar would be ideal, and
      I should use something like Thrax (http://openfst.cs.nyu.edu/twiki/bin/view/GRM/Thrax).
      Does this sound right?

      You could try a Factor Automaton
      http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=4960722
      http://www.cs.nyu.edu/~mohri/pub/fac.pdf
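
      For readers who have not met the term: a factor automaton of the
      transcript accepts every contiguous substring of it, which is useful
      when you do not know where in the long text a given audio chunk falls.
      Below is a hedged, brute-force sketch in OpenFst text format for a
      hypothetical four-word transcript (the papers above give compact,
      linear-size constructions): every state is final, and an epsilon arc
      runs from the start state to each later state, so any factor is
      accepted.

          # factor.txt -- linear chain over the transcript words, plus
          # epsilon arcs from the start state; a lone state number marks a
          # final state.
          0 1 his
          1 2 freshly
          2 3 caught
          3 4 furs
          0 1 <eps>
          0 2 <eps>
          0 3 <eps>
          0 4 <eps>
          0
          1
          2
          3
          4

          fstcompile --acceptor --isymbols=words.txt factor.txt | \
            fstrmepsilon | fstdeterminize | fstminimize > factor.fst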

       
  • Vassil Panayotov

    My setup is along the lines of what Dan said.
    I'm planning to make the code publicly available in two to three months' time.

    Vassil

     
  • Jonathan Roitgrund

    Thanks for all the replies, this is extremely helpful.
    Am I correct in understanding that the factor automaton approach is a specific case of writing custom FSTs to enforce word order while allowing for indels? I'll play around with different models and see what works best.

    Vassil, are you doing anything similar to Sail for acoustic model refinement in between iterations? They reference this flexible transcription alignment paper and this speaker adaptation paper, neither of which I fully understand.

     
    • Vassil Panayotov

      I don't have time at the moment to read the papers you refer to, but yes, my implementation does use acoustic model adaptation between the passes.

      Vassil

       
  • Jonathan Roitgrund

    I'm having some trouble understanding the alignment tools. It would be lovely if someone could shed some light on them.

    gmm-align-compiled seems to be just like gmm-decode (takes in an acoustic model, a decoding graph but as an rspecifier rather than as a .fst file, and features) except it outputs only alignments rather than both alignments and words. Doesn't aligning require the exact same computation as decoding (finding the highest-scoring path through the WFST)?

    gmm-align takes both transcriptions AND a decoding graph. If the transcriptions aren't used to build a linear acceptor (like the one generated by compile-training-graphs), what are they used for? Are they supposed to be actual user-supplied transcriptions, or are they supposed to be transitions output by a decoding pass? If the latter, doesn't the decoding pass output alignments anyway (gmm-decode takes an alignments-wspecifier)?

     
    • Daniel Povey

      Daniel Povey - 2014-06-08

      gmm-align-compiled seems to be just like gmm-decode (takes in an acoustic
      model, a decoding graph but as an rspecifier rather than as a .fst file,
      and features) except it outputs only alignments rather than both alignments
      and words. Doesn't aligning require the exact same computation as decoding
      (finding the highest-scoring path through the WFST)?

      gmm-align-compiled is generally used in a mode where a separate FST derived
      from each of the human-supplied transcripts has already been created and
      dumped to disk. Yes, it's the same computation as decoding, except the FST
      is typically utterance-specific.
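
      As a rough illustration of that mode (every path here is a placeholder,
      and plain per-utterance features are assumed, so adapt the feature
      pipeline to whatever your model was trained on):

          # Compile one FST per utterance from the transcripts...
          compile-train-graphs exp/tri1/tree exp/tri1/final.mdl data/lang/L.fst \
            "ark:utils/sym2int.pl -f 2- data/lang/words.txt data/train/text |" \
            ark:graphs.fsts
          # ...then align against those utterance-specific graphs.
          gmm-align-compiled --beam=10 --retry-beam=40 exp/tri1/final.mdl \
            ark:graphs.fsts scp:data/train/feats.scp ark:ali.ark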

      gmm-align takes both transcriptions AND a decoding graph. If the
      transcriptions aren't used to build a linear acceptor (like the one
      generated by compile-training-graphs), what are they used for? Are they
      supposed to be actual user-supplied transcriptions, or are they supposed to
      be transitions output by a decoding pass? If the latter, doesn't the
      decoding pass output alignments anyway (gmm-decode takes an
      alignments-wspecifier)?

      gmm-align takes transcriptions and a lexicon and a tree-- I don't think it
      takes a decoding graph, but it does take a lexicon that's represented as an
      FST. gmm-align is like gmm-align-compiled, except that it computes the
      utterance-specific FSTs on the fly, in memory, from the transcripts that it
      reads in.
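
      Under the same hypothetical paths, the on-the-fly equivalent would look
      something like this (tree, model and lexicon FST in place of a
      pre-compiled graph archive):

          gmm-align --beam=10 --retry-beam=40 exp/tri1/tree exp/tri1/final.mdl \
            data/lang/L.fst scp:data/train/feats.scp \
            "ark:utils/sym2int.pl -f 2- data/lang/words.txt data/train/text |" \
            ark:ali.ark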

      Dan



       
  • Jonathan Roitgrund

    That makes perfect sense, especially since gmm-align takes a lexicon rather than a full decoding graph.

    Thanks again for all the help and the wonderful example scripts - Kaldi has been an absolute pleasure to use so far.

    Quick question: is it possible for a very restrictive language model to yield worse results than a very general language model?

    I can transcribe an utterance with 100% accuracy using a bigram LM generated with MITLM from the entire Voxforge corpus.

    With a bigram model generated from just that utterance though, the transcription is completely wrong ("TO FLOOR HIS HE TO FLOOR HIS FRESHLY FLOOR THE" instead of "HIS FRESHLY CAUGHT FURS HE FLUNG TO THE FLOOR"). This happens with several utterances regardless of which acoustic model I use.

    To add to my puzzlement, the idea of my language model being "too restrictive" doesn't make sense given that gmm-align seems to be able to align the transcription, and from what I understand the grammar it creates on the fly is a linear acceptor, which should be even more restrictive.

     
    • Daniel Povey

      Daniel Povey - 2014-06-09

      Quick question: is it possible for a very restrictive language model to
      yield worse results than a very general language model?

      It shouldn't, if it's well matched to what you're recognizing.

      I can transcribe an utterance with 100% accuracy using a bigram LM
      generated with MITLM from the entire Voxforge corpus.

      With a bigram model generated from just that utterance though, the
      transcription is completely wrong ("TO FLOOR HIS HE TO FLOOR HIS FRESHLY
      FLOOR THE" instead of "HIS FRESHLY CAUGHT FURS HE FLUNG TO THE FLOOR").
      This happens with several utterances regardless of which acoustic model I
      use.

      To add to my puzzlement, the idea of my language model being "too
      restrictive" doesn't make sense given that gmm-align seems to be able to
      align the transcription, and from what I understand the grammar it creates
      on the fly is a linear acceptor, which should be even more restrictive.

      I would try using larger beams with your restrictive model. If it still
      doesn't work well, make sure that you don't have a mismatch like using a
      lexicon FST that has the wrong words.txt file. FST composition only makes
      sense if the integer identifiers of things like words are the same in the
      different source FSTs.
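
      As a sketch of both checks (the binaries are standard Kaldi tools, but
      every path below is a placeholder and the features must match your
      model):

          # Re-decode with wider beams...
          gmm-latgen-faster --beam=20 --lattice-beam=10 \
            --word-symbol-table=data/lang/words.txt \
            exp/tri1/final.mdl exp/tri1/graph/HCLG.fst \
            scp:data/test/feats.scp ark:/dev/null ark,t:trans.txt
          # ...and confirm that the lexicon and the grammar were built with
          # the same words.txt.
          cmp data/lang/words.txt data/lang_utt/words.txt || echo "words.txt mismatch"
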
      Dan

       
    • Vassil Panayotov

      I would also suggest using SRILM - in my experience it's more reliable than the FOSS alternatives.
      I think the recommended smoothing algorithm for short texts is Witten-Bell.

       
  • Jonathan Roitgrund

    Tried SRILM and W-B:

        ngram-count -lm data/local/lm.arpa -wbdiscount -text data/corpus.txt -order 3 -write-vocab $data/local/vocab-full.txt

    Tried beam sizes anywhere from 16 to 400.

    Same erroneous transcription every time.

    Could it be because I'm using a different phone list from the one that was used when training the tree for the acoustic model (I'm using tree, final.mdl, and final.mat from Voxforge training)?

     
    • Daniel Povey

      Daniel Povey - 2014-06-10

      Yes, that could be a problem.
      The phones.txt mappings have to be exactly the same as used during
      training, and you have to use a lexicon built with the same words.txt that
      you are using to map the words to integers while compiling the LM into an
      FST. Otherwise everything is meaningless because phones and words are
      mapped to strings in different ways.
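
      A quick sanity check along those lines (both directory names are
      placeholders for wherever your training-time and decoding-time lang
      directories live):

          for f in phones.txt words.txt; do
            cmp lang_from_training/$f lang_for_decoding/$f || echo "MISMATCH: $f"
          done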

      Dan


       
  • Jonathan Roitgrund

    Yep, Dan was right - copying over nonsilence_phones.txt and preventing it from being overwritten fixed it.

     
  • Jonathan Roitgrund

    Any suggestions for improving performance on noisy speech (from a movie)?
    I thought a constrained LM would be enough, but I get a huge error rate even with utterance-specific LMs (and this time it's the noise, not an error in my data preparation, because I get close to 100% on less noisy speech).

    I'm afraid acoustic model adaptation won't help much because the acoustic conditions are likely to be very different (probably even different speakers) between segments.

    I'm still looking around for some way of pre-processing the audio but I'm wondering if there's anything obvious to do at the Kaldi level.

     
    • Vassil Panayotov

      BTW, with respect to the "different speakers in different segments" problem you've mentioned, you could perhaps consider using speaker diarization (e.g. LIUM's) and doing per-speaker AM adaptation. Not sure how helpful that will be for noisy/cocktail-party/etc. speech, though...

       
  • Neil Nelson

    Neil Nelson - 2014-06-14

    For the better grade movies there is no hum or hiss in the usual continuous background noise sense that would be addressed by obtaining audio properties of a noise-only segment and reducing those properties with, for example, a Wiener filter. Rather, the audio we hear is commonly a mixing-board assembly of audio components from a variety of sources, and the 'noise' of the prior post consists of the non-speech components.

    We may think of each of these sources as 'actors', where the 'noise' is much like the background chatter at a party. The chatter comes from actors near and far and at separate spatial locations, and, to our advantage, modern movies are mixed to two or more audio channels with actor positioning obtained through a variety of effects. Methods of Auditory Scene Analysis intend to isolate these various actors. Here is an example paper: http://theses.eurasip.org/media/theses/documents/cobos-maximo-application-of-sound-source-separation-methods-to-advanced-spatial-audio-systems.pdf

     
    • Daniel Povey

      Daniel Povey - 2014-06-14

      For something that's simpler to use, you could try basis-fMLLR, which
      allows you to adapt on a small amount of data. Look in the scripts for
      anything with 'basis' in it.
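
      For example, from the Kaldi source root (gmm-est-basis-fmllr is one of
      the binaries involved; the exact script names may differ between Kaldi
      versions):

          find egs src -name '*basis*'
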
      Dan
