Organizing training text for CMU toolkit

  • Grad_Student

    Grad_Student - 2005-06-08

    I have a question dealing with the training transcript. I have two text files: one has around 5,000 unique sentences, and the other has over 5,000,000 unique sentences. Both are from the command-and-control domain. Which file should I use as the training file?

    Second question: both of my files have one sentence per line. What else do I need to do to these files to make them ready for the CMU toolkit? That is, do I need to make them similar to the transcript files used for batch processing?
    Do I need to add context cues to these transcript files, such as begin-of-speech (<s>) and end-of-speech (</s>) markers, silence markers, etc.?
    Do I need to add the phonetic representation of each line?

     
    • Anonymous

      Anonymous - 2005-06-08

      First of all, I rethought my earlier posting. Your LM training is significantly different from mine in that your command/control app has clearly defined utterance beginnings and endings (the start and end of each command), but my earlier dictation app did not, and that's where my problem originated. I am not sure, but I think that you should put <s> </s> around each utterance.
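
      If it helps, one quick way to add those markers to a one-sentence-per-line file is something like the following (just a sketch; the file names are placeholders):

      sed -e 's|^|<s> |' -e 's|$| </s>|' corpus.txt > corpus_tagged.txt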

      I think that you must be careful in using a systematically generated corpus such as the one you have described above. Each utterance/sentence in the corpus is assumed to be equally likely, and the 1-gram, 2-gram, and 3-gram probabilities will be estimated accordingly. On the other hand, if certain commands or classes of commands are expected to be more frequent than others, then you should try to represent that in the training corpus, for example by repeating the frequent commands (see the sketch below).
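
      A rough sketch of that idea, assuming a hypothetical weights.txt with an expected count, a tab, and a sentence on each line:

      awk -F'\t' '{ for (i = 0; i < $1; i++) print $2 }' weights.txt > corpus.txt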

      I am no longer in the same job as when I used the CMU SLM Toolkit earlier this year, so I am not in a position to offer detailed advice, but I'm sure there are others reading this forum who can.

      Some helpful information on the .arpa LM format can be found at http://fife.speech.cs.cmu.edu/sphinxman/decoding.html#01.

      cheers,
      jerry

       
    • Anonymous

      Anonymous - 2005-06-08

      First of all, I assume that you are addressing language model training using the CMU Statistical Language Model toolkit, not acoustic model training.

      Usually a larger training corpus is better, but it may depend on the complexity of your command/control application -- if it's limited enough, then the 5,000,000-sentence corpus may not describe the domain any better. For initial experiments, the 5,000-sentence corpus will be much faster to train on.

      For LM training, no phonetic representation is needed, nor can one be used; the LM is in terms of word tokens only. A typical toolkit run over such a word-token corpus is sketched below.
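
      For what it's worth, the usual CMU-Cambridge SLM toolkit pipeline over a plain text corpus looks something like this (file names are placeholders; check the toolkit documentation for the exact options):

      cat corpus.txt | text2wfreq | wfreq2vocab > corpus.vocab
      cat corpus.txt | text2idngram -vocab corpus.vocab > corpus.idngram
      idngram2lm -idngram corpus.idngram -vocab corpus.vocab -arpa corpus.arpa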

      I have only a little experience building LMs with this toolkit, but I believe that the <s> and </s> context cue markers are quite important, at least for Sphinx-4 (and I suspect for the other Sphinxen as well). I have found many questions about training from a text-only corpus, but few answers. See my 2005-06-02 posting under http://sourceforge.net/forum/forum.php?thread_id=1227551&forum_id=382337.

      cheers,
      jerry

       
    • Grad_Student

      Grad_Student - 2005-06-08

      Thanks, Jerry. I have been going through a few of your old posts, since you seem to have had many of the same questions I currently have.

      My question refers to the CMU Language Model toolkit.

      Thanks
      Grad_Student

       
    • Grad_Student

      Grad_Student - 2005-06-08

      Based on what I read from the link you posted, is it mandatory that in my training transcript I add <s> and </s> at the beginning and end of each sentence, even though each of the sentences in the training text is on a separate line?

      Also, the training text that I am using is a printout of all the possible sentences that a command-and-control BNF grammar could produce. The majority of these sentences are exactly the same except for one-word differences, for example "Robot A go that way" and "Robot B go that way". Will this affect the n-gram probabilities that will be assigned to each word?

      Finally, is there a web-accessible document I could read on what the information in the .arpa file represents? For example, I tried developing a bigram LM using the commands shown on the FAQ page of Sphinx-4. I ended up with an LM in ARPA format containing information such as:
      Absolute discounting was applied.
      1-gram discounting constant : NaN
      2-gram discounting constant : NaN
      3-gram discounting constant : NaN

      and

      \data\
      ngram 1=81
      ngram 2=1
      ngram 3=1

      \1-grams:
      NaN <UNK> -99.9990
      -99.0000 1 0.0000
      -99.0000 I 0.0000
      -99.0000 a 0.0000
      -99.0000 above 0.0000

      I am pretty sure this is not the information I should be getting, but I am not sure how to read it. Does it mean that the probability of 'a' in a unigram model is 0%?
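
      My tentative reading (which may be wrong) is that each line in the \1-grams: section is a log10 probability, then the word, then a log10 back-off weight, so -99.0000 would mean "effectively zero" rather than a probability of 0.0000. If so, something like this sketch would print the plain probabilities:

      awk '/\\1-grams:/ { on = 1; next } /\\/ { on = 0 } on && NF >= 2 { printf "P(%s) = %g\n", $2, 10^$1 }' corpus.arpa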

      Thanks Once again for the information
      Grad_Student

       
