I have a question dealing with the training transcript. I have two text files: one has around 5,000 unique sentences, the other has over 5,000,000 unique sentences. Both are from the command and control domain. Which file should I use as the training file?
Second question: both of my files have one sentence per line. What else do I need to do to these files to make them ready for the CMU toolkit? Meaning, do I need to make them similar to the transcript files used for batch processing?
Do I need to add context cues to these transcript files, such as begin <s> and end </s> of speech markers, silence markers, etc.?
Do I need to add the phonetic representation of each line?
Anonymous - 2005-06-08
First of all, I rethought my earlier posting. Your LM training is significantly different from mine in that your command/control app has clearly defined utterance beginnings and endings (the start and end of each command), but my earlier dictation app did not, and that's where my problem originated. I am not sure, but I think that you should put <s> </s> around each utterance.
I think that you must be careful in using a systematically generated corpus such as you have described above. Each utterance/sentence in the corpus is assumed to be equally likely, and the 1-gram, 2-gram, and 3-gram probabilities will be estimated accordingly. On the other hand, if certain commands or classes of commands are expected to be more frequent than others, then you should try to represent that in the training corpus.
I am no longer in the same job as when I used the CMU SLM Toolkit earlier this year, so I am not in a position to offer detailed advice, but I'm sure there are others reading this forum who can.
Some helpful information on the .arpa LM format is found in http://fife.speech.cs.cmu.edu/sphinxman/decoding.html#01 .
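Coming back to the frequency point above: one simple way to represent expected command frequencies is to repeat the more frequent commands in the training file before building the LM, so the n-gram counts are no longer uniform. A rough sketch in Python (the keywords, weights, and file names here are all invented for illustration):

# weight_corpus.py -- replicate command sentences according to their
# expected usage frequency so the n-gram estimates are not uniform.
# Keywords, weights, and file names are invented for illustration.
weights = {"stop": 10, "go": 5}   # keyword -> repetition count

with open("commands.txt") as src, open("weighted.txt", "w") as dst:
    for line in src:
        sentence = line.strip()
        if not sentence:
            continue
        # Use the weight of the first matching keyword, default 1.
        count = next((w for k, w in weights.items() if k in sentence.split()), 1)
        dst.write((sentence + "\n") * count)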
Anonymous - 2005-06-08
First of all, I assume that you are addressing language model training using the CMU Statistical Language Model toolkit, not acoustic model training.
Usually a larger training corpus is better, but it may depend on the complexity of your command/control application -- if it's limited enough, then the 5-million-sentence corpus may not describe the domain any better. For initial experiments, the 5000-sentence corpus will be much faster to train.
For LM training, no phonetic representation is needed (or can be used); the LM is in terms of word tokens only.
I have only a little experience building LMs with this toolkit, but I believe that the <s> and </s> context cue markers are quite important, at least for Sphinx-4 (and I suspect for the other Sphinxen as well). I have found many questions about training from a text-only corpus, but few answers. See my 2005-06-02 posting under http://sourceforge.net/forum/forum.php?thread_id=1227551&forum_id=382337 .
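In case it helps, here is roughly how I would prepare a one-sentence-per-line file and then run the toolkit over it. The file names are made up, and the exact commands and options may differ by toolkit version, so treat this as a sketch rather than a recipe:

# add_markers.py -- wrap each training sentence in <s> ... </s>
# context cues before feeding it to the SLM toolkit.
# File names here are illustrative.
with open("commands.txt") as src, open("train.txt", "w") as dst:
    for line in src:
        sentence = line.strip()
        if sentence:
            dst.write("<s> " + sentence + " </s>\n")

# Afterwards, the usual toolkit pipeline is something like (check the
# toolkit documentation -- options vary between versions):
#   text2wfreq < train.txt | wfreq2vocab > train.vocab
#   text2idngram -vocab train.vocab < train.txt > train.idngram
#   idngram2lm -idngram train.idngram -vocab train.vocab -arpa train.arpa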
cheers,
jerry
Thanks, Jerry. I have been going through a few of your old posts, since you seem to have had many of the same questions I currently have.
My question is referring to the CMU Language Model toolkit.
Thanks,
Grad_Student
Based on what I read from the link you posted: is it mandatory that in my training transcript I add <s> and </s> at the beginning and end of each sentence, even though each of the sentences in the training text is on a separate line?
Also, the training text that I am using is a printout of all the possible sentences that a command and control BNF could produce. The majority of these sentences are exactly the same except for one-word differences, for example: "Robot A go that way" and "Robot B go that way". Will this affect the n-gram probabilities that will be assigned to each word?
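To make my question concrete, here is a toy bigram count over two such sentences (just counting, not toolkit code; I lowercased the words):

from collections import Counter

sentences = [
    "<s> robot a go that way </s>",
    "<s> robot b go that way </s>",
]

bigrams = Counter()
for s in sentences:
    words = s.split()
    for w1, w2 in zip(words, words[1:]):
        bigrams[(w1, w2)] += 1

for (w1, w2), n in bigrams.most_common():
    print(w1, w2, n)

# Shared bigrams like "go that" and "that way" get count 2, while
# "robot a" and "robot b" each get count 1 -- so the probability mass
# after "robot" is split between 'a' and 'b'.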
Finally, is there a web-accessible document I could read on what the information in the .arpa file represents? For example, I tried developing a bigram LM using the commands shown on the FAQ page of Sphinx-4. I ended up with an LM in ARPA format with information such as:
Absolute discounting was applied.
1-gram discounting constant : NaN
2-gram discounting constant : NaN
3-gram discounting constant : NaN
and
\data\
ngram 1=81
ngram 2=1
ngram 3=1
\1-grams:
NaN <UNK> -99.9990
-99.0000 1 0.0000
-99.0000 I 0.0000
-99.0000 a 0.0000
-99.0000 above 0.0000
I am pretty sure this is not the information I should be getting, but I am not sure how to read it. Does it mean the probability of 'a' in a unigram model is 0%?
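My tentative reading, which someone should please correct if it is wrong: the first column is a log10 probability and the last column is a backoff weight, so -99.0000 means 10^-99, i.e., essentially zero probability, rather than the trailing 0.0000 being a probability. A quick check:

# First column of an ARPA n-gram line = log10(probability),
# so -99.0 decodes to an essentially-zero probability.
for log10_p in (-99.0, -0.3010):
    print(log10_p, "->", 10 ** log10_p)
# -99.0   -> 1e-99 (effectively 0)
# -0.3010 -> ~0.5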
Thanks once again for the information.
Grad_Student