When creating a language model for LVCSR, should I use the same vocabulary for training the language model as I use for the LVCSR system, or should I, for example, use the top-N most frequent words in the text corpus as the vocabulary? By "the same vocabulary as the LVCSR" I mean the .vocab file used during training of the acoustic model, but with duplicate words removed.
Also, my .vocab file contains many words that are not present in the audio files used for training the acoustic model. Will this make the recognition less accurate?
When creating a language model for LVCSR, should I use the same vocabulary for training the language model as I use for the LVCSR system, or should I, for example, use the top-N most frequent words in the text corpus as the vocabulary?
You can use the top-N words as the vocabulary.
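For illustration, here is a minimal sketch of one way to pick a top-N vocabulary by word frequency; the function name and the lowercasing/whitespace tokenization are my own assumptions, not anything prescribed by the toolkit:

```python
from collections import Counter

def top_n_vocabulary(lines, n):
    """Count word frequencies over an iterable of corpus lines and return
    the N most frequent words, one candidate LM vocabulary.
    Assumes simple lowercased whitespace tokenization."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return [word for word, _ in counts.most_common(n)]

# toy corpus instead of a real .txt file
corpus = ["the cat sat on the mat", "the dog sat"]
vocab = top_n_vocabulary(corpus, 3)
```

In practice you would stream the real training corpus through this instead of the toy list, and normalize punctuation and case the same way you normalize the LM training text.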
Also, my .vocab file contains many words that are not present in the audio files used for training the acoustic model. Will this make the recognition less accurate?
No. Acoustic model training learns the sounds of the language, not its words. As long as the sounds of the language are learned, you can recognize any word built from them.
Ok, thank you! I'm still a little confused about the LM, though. When the LVCSR system encounters a word that is not present in the LM, will the probability for this word (based on its context) be zero? Will training the LM with Good-Turing discounting solve this problem, or must I train the LM with an open vocabulary (taking UNKs into account)?
Andreas, the recognizer only looks for the words in the LM; it does not look for other words. Though there are research systems that can spell out unknown words, our system can't do this yet. For that reason we recommend creating the LM with a comprehensive word list; you can take the first 200k words, for example. Open-vocabulary language models are not used in the speech recognizer; you need to train a closed-vocabulary model.
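One practical way to judge whether a word list is comprehensive enough is to measure the out-of-vocabulary (OOV) rate of some held-out text against it; with a closed-vocabulary LM, every OOV token is a guaranteed recognition error. A minimal sketch (the helper name is hypothetical):

```python
def oov_rate(tokens, vocab):
    """Fraction of word tokens not covered by the LM vocabulary.
    With a closed-vocabulary LM these tokens can never be recognized."""
    vocab = set(vocab)
    misses = sum(1 for w in tokens if w not in vocab)
    return misses / len(tokens) if tokens else 0.0

# toy example: 1 of 5 tokens falls outside the vocabulary
tokens = "the cat sat on zyzzyva".split()
rate = oov_rate(tokens, {"the", "cat", "sat", "on", "mat"})
```

If the OOV rate on representative test text is high, growing the vocabulary (e.g. from top 50k to top 200k words) is usually more effective than changing the discounting method.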
Ok, so you're saying any word that is not present in the LM will not be recognized? So my question is: why should I have a phonetic vocabulary for the AM that contains more words than the vocabulary of the LM, when "the recognizer only looks for words in the LM"?
Sorry for all the questions, I just need to get my head straight on this.
why should I have a phonetic vocabulary for the AM that contains more words than the vocabulary of the LM, when "the recognizer only looks for words in the LM"?
You should not. The extra words only matter if someone later creates a bigger LM; in that case it will use those words from your phonetic vocabulary.
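If you do want to trim the phonetic dictionary down to the LM vocabulary, the idea can be sketched as a simple filter. This assumes the common `WORD PH1 PH2 ...` line format and ignores alternate-pronunciation markers like `WORD(2)`, so treat it as a starting point rather than a complete tool:

```python
def prune_dictionary(dict_lines, lm_vocab):
    """Keep only pronunciation entries whose headword appears in the LM
    vocabulary; other entries are dead weight for the decoder.
    Assumes 'WORD PH1 PH2 ...' lines; does not handle 'WORD(2)' variants."""
    lm_vocab = set(lm_vocab)
    return [line for line in dict_lines if line.split()[0] in lm_vocab]

# toy dictionary entries
entries = ["cat K AE T", "dog D AO G", "mat M AE T"]
pruned = prune_dictionary(entries, {"cat", "mat"})
```

Keeping the dictionary and LM vocabularies in sync this way also makes it easy to spot LM words that have no pronunciation at all, which would otherwise be silently unrecognizable.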
Great, thank you Nickolay for clearing things up for me!