
Open/Closed Vocabulary Language Models

CK
2014-09-05
2014-09-23
  • CK

    CK - 2014-09-05

    Hi,

    I am using the CMU Sphinx online LMTool to generate a language model. Which type of language model does it produce, open or closed vocabulary? I have seen some information about open and closed vocabulary language models.
    How does that difference affect recognition, in both performance and accuracy?

    Regards,
    Kalpana Challagulla

     
    • Nickolay V. Shmyrev

      An open vocabulary language model contains the special word <unk> and can assign probabilities to word sequences over an arbitrary vocabulary: the probability of an unknown word is replaced with the probability of the <unk> tag. This is useful for some applications, but it is not applicable to speech recognition, since a speech recognizer needs to know not just the probability of a word but also its pronunciation.

      In our decoders the <unk> word is not used at all, and any open vocabulary language model is essentially converted to a closed vocabulary one.
      So it does not matter which kind of language model you generate; it will be treated as a closed vocabulary language model anyway.
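      As a rough illustration (a minimal Python sketch with invented probabilities, not how the decoder actually works), the difference is that an open vocabulary model falls back to the <unk> probability for out-of-vocabulary words, while a closed vocabulary model cannot score them at all:

        # Toy unigram log10 probabilities, as they might appear in an ARPA file.
        # The vocabulary and the values here are made up for illustration only.
        LOGPROBS = {"<s>": -0.5, "</s>": -0.5, "hello": -1.2, "world": -1.5, "<unk>": -2.0}

        def score(words, open_vocab=True):
            """Sum log10 probabilities of a word sequence.
            Open vocabulary: OOV words are scored as <unk>.
            Closed vocabulary: OOV words cannot be scored at all."""
            total = 0.0
            for w in words:
                if w in LOGPROBS:
                    total += LOGPROBS[w]
                elif open_vocab:
                    total += LOGPROBS["<unk>"]  # unknown word replaced by <unk>
                else:
                    raise ValueError("OOV word %r in a closed vocabulary model" % w)
            return total

        print(score(["hello", "zanzibar"], open_vocab=True))  # about -3.2, "zanzibar" scored as <unk>
        # score(["hello", "zanzibar"], open_vocab=False) would raise an error: the word has
        # no probability in the model and, for a recognizer, no pronunciation either.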

       
  • CK

    CK - 2014-09-08

    Thank you Nickolay, that clears it up for me.
    I also need to understand how the perplexity score is calculated, how it differs between text that is covered by the language model and text that is not, and what the correct way is to compute the perplexity of a given language model for evaluation.

     
  • Nickolay V. Shmyrev

    I also need to understand how the perplexity score is calculated, how it differs between text that is covered by the language model and text that is not, and what the correct way is to compute the perplexity of a given language model for evaluation.

    Perplexity depends on both the language model and the text. Perplexity on the training text is obviously smaller. You cannot evaluate the perplexity of a model alone; you need to supply some text for evaluation. For the best estimate, this text should be in-domain but should not contain sentences from the training text.
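    As a rough sketch of the calculation (Python, with hypothetical per-word scores), perplexity is the inverse geometric mean of the probabilities the model assigns to the words of the evaluation text, or equivalently 2 raised to the entropy in bits per word:

        import math

        def perplexity(log10_word_probs):
            """Perplexity of an evaluation text, given the model's log10
            probability for each scored word (OOVs and context cues excluded)."""
            avg_log10 = sum(log10_word_probs) / len(log10_word_probs)
            entropy_bits = -avg_log10 / math.log10(2)  # average bits per word
            return 10 ** (-avg_log10), entropy_bits    # perplexity == 2 ** entropy_bits

        # Hypothetical per-word scores for a 4-word test sentence:
        print(perplexity([-0.4, -0.9, -0.3, -0.7]))    # roughly (3.76, 1.91)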

     
  • CK

    CK - 2014-09-12

    Thank you Nickolay

    Using CMUCLMTK, I am able to get a perplexity score for a specific small language model generated with the CMUCLMTK steps; the result is below:
    evallm :
    perplexity -text test2.txt
    Computing perplexity of the language model with respect
    to the text test2.txt
    Perplexity = 2.90, Entropy = 1.54 bits
    Computation based on 9 words.
    Number of 3-grams hit = 2 (22.22%)
    Number of 2-grams hit = 2 (22.22%)
    Number of 1-grams hit = 5 (55.56%)
    0 OOVs (0.00%) and 2 context cues were removed from the calculation.
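    (As a side note, the reported perplexity is just 2 raised to the reported entropy: 2^1.54 ≈ 2.91, which matches the 2.90 above up to rounding.)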
    Can we get a perplexity score for the generic English language model available on the CMU site against our own text?
    How can we do the same for the language models shipped with CMUSphinx (cmusphinx-5.0-en-us.lm)?

     
  • CK

    CK - 2014-09-16

    On unpacking the above-mentioned model I see a .dmp model, not ARPA format. Can we directly convert the .dmp language model to an .arpa language model using sphinx_lm_convert.exe?

     
    • Nickolay V. Shmyrev

      On unpacking the above-mentioned model I see a .dmp model, not ARPA format

      You need to double-check what you are doing. The link above points to a gzipped ARPA model.

      Can we directly convert the .dmp language model to an .arpa language model using sphinx_lm_convert.exe?

      Yes

              sphinx_lm_convert -i file.lm.dmp -o file.lm -ifmt dmp -ofmt arpa
      
       
  • CK

    CK - 2014-09-22

    Hi Nickolay,

    As already discussed, I am able to calculate the perplexity score for the small, specific language models we have generated. But when I tried to do the same for the generic en-us language model downloaded from the CMU site, I got the error below. Can you please help me work out how to overcome this?

    D:\CMUSphinix\CMUCLMTK\cmuclmtk-0.7-win32\cmuclmtk-0.7-win32>evallm.exe -arpa lm.arpa
    Reading in language model from file lm.arpa
    Reading in a 3-gram language model.
    Number of 1-grams = 19794.
    Number of 2-grams = 1377200.
    Number of 3-grams = 3178194.
    Reading unigrams...
    Warning, reading line -
    - gave unexpected input.

    Reading 2-grams...
    ....Error - Repeated 2-gram in ARPA format language model.

    Regards,
    Kalpana

     
    • Nickolay V. Shmyrev

      There might be a bug in cmuclmtk. Use SRILM instead.
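      For example, with SRILM the same check is a single command (the file names here are just the ones from your messages above):

        ngram -order 3 -lm lm.arpa -ppl test2.txt

      It reports the number of words and OOVs, the total log probability, and the perplexity of the text.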

       
