Arpa language model with pocketsphinx

2010-06-19
2012-09-22
  • Samuel Kitono

    Samuel Kitono - 2010-06-19

    I have downloaded a language model from
    http://www.keithv.com/software/csr/
    which provides a language model for a 5k vocab.
    But it presents the language model in ARPA format.

    Can I still use this language model the same way as lm format? So the init
    argument would be -lm languagemodel.ARPA instead of the usual -lm
    languagemodel.lm?

    Also can the dictionary have less vocabs than the language model vocab?

     
  • Nickolay V. Shmyrev

    Can I still use this language model the same way as lm format?

    There is no such thing as lm format. A model can be in one of two formats,
    ARPA or DMP, and you can convert between them using sphinx_lm_convert.
    Pocketsphinx can load ARPA models as well as DMP models. To load a model, use
    the -lm command-line option. Sometimes ARPA models need to be sorted first
    using sphinx_lm_sort.

    Also can the dictionary have less vocabs than the language model vocab?

    Did you mean words? Ideally, the dictionary should have pronunciations for
    all words from the language model vocabulary.

     
  • Samuel Kitono

    Samuel Kitono - 2010-06-19

    Did you mean words? Ideally, the dictionary should have pronunciations for
    all words from the language model vocabulary.

    True, the reason I am doing this is so that my speech recognition makes fewer
    errors. But I'm wondering whether this assumption is correct?

     
  • Nickolay V. Shmyrev

    True, the reason I am doing this is so that my speech recognition makes fewer
    errors. But I'm wondering whether this assumption is correct?

    No, it's not about fewer errors or more errors. If your language model has
    words that are missing from the dictionary, they are simply dropped from the
    search. That means you use only a part of your language model, not all of
    it. In that case you just waste memory and processing time.
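
    As a rough illustration of which words would be dropped, the following
    Python sketch lists language-model words that have no dictionary entry. It
    assumes a plain-text ARPA file and a CMU-style dictionary (word first on each
    line, alternate pronunciations marked like WORD(2)). The function names are
    hypothetical and are not part of pocketsphinx.

```python
import re

SPECIALS = {"<s>", "</s>", "<unk>"}  # LM tokens that need no pronunciation


def arpa_unigrams(lines):
    """Collect the words listed in the 1-grams section of an ARPA LM."""
    words = set()
    in_unigrams = False
    for raw in lines:
        line = raw.strip()
        if line == "\\1-grams:":
            in_unigrams = True
            continue
        if in_unigrams:
            if line.startswith("\\"):  # next section header or end marker
                break
            fields = line.split()
            if len(fields) >= 2:  # fields: logprob, word, optional backoff
                words.add(fields[1])
    return words


def dictionary_words(lines):
    """Collect headwords from a pronunciation dictionary."""
    words = set()
    for raw in lines:
        fields = raw.split()
        if fields:
            # Strip an alternate-pronunciation suffix such as "(2)".
            words.add(re.sub(r"\(\d+\)$", "", fields[0]))
    return words


def missing_from_dictionary(lm_lines, dict_lines):
    """Return LM words that have no pronunciation in the dictionary."""
    missing = arpa_unigrams(lm_lines) - dictionary_words(dict_lines) - SPECIALS
    return sorted(missing)
```

    With real files, the two arguments could simply be the open file objects,
    for example missing_from_dictionary(open("model.arpa"), open("words.dic")),
    where both file names are placeholders.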

     
  • Samuel Kitono

    Samuel Kitono - 2010-06-19

    Well, I do not mind wasting memory and processing time. My concern is the WER
    of my speech recognition, hence I was thinking of using a 3k-word
    dictionary list with the 5k language model. But do you think that by doing
    this I would decrease my WER?

    If only there were a 3k language model, I would have used it instead. I would
    be glad if you could tell me where I can get one, if possible.

    And thanks for your constant replies to my questions.

     
  • Nickolay V. Shmyrev

    If only there were a 3k language model, I would have used it instead. I would
    be glad if you could tell me where I can get one, if possible.

    The SRILM toolkit can limit the vocabulary of the language model to the set
    you need using the command
    "ngram -limit-vocab new.vocab -lm big.lm -write-lm small.lm".
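
    To illustrate what limiting the vocabulary means, here is a minimal Python
    sketch that keeps only the n-grams whose words all belong to a given word
    list. It is an approximation under stated assumptions, not how SRILM is
    implemented: it always keeps <s>, </s> and <unk>, it does not renormalize
    probabilities or backoff weights, and the name limit_arpa_vocab is
    hypothetical.

```python
SPECIALS = {"<s>", "</s>", "<unk>"}  # assumption: always keep these tokens


def limit_arpa_vocab(lines, vocab):
    """Drop every n-gram that contains a word outside `vocab`.

    `lines` is an iterable of ARPA text lines; the result is a list of lines
    for a smaller ARPA file with updated n-gram counts. Probabilities and
    backoff weights are copied unchanged (no renormalization).
    """
    keep = set(vocab) | SPECIALS
    sections = {}  # n-gram order -> kept lines of that order
    order = 0      # 0 until the first "\N-grams:" header is seen
    for raw in lines:
        line = raw.strip()
        if line.startswith("\\") and line.endswith("-grams:"):
            order = int(line[1:line.index("-")])
            sections[order] = []
        elif line == "\\end\\":
            break
        elif order and line:
            fields = line.split()
            words = fields[1:1 + order]  # skip the leading log-probability
            if all(w in keep for w in words):
                sections[order].append(line)

    out = ["\\data\\"]
    for n in sorted(sections):
        out.append("ngram " + str(n) + "=" + str(len(sections[n])))
    for n in sorted(sections):
        out.append("")
        out.append("\\" + str(n) + "-grams:")
        out.extend(sections[n])
    out += ["", "\\end\\"]
    return out
```

    Note that the word list must use the same letter case as the language
    model, otherwise nothing but the special tokens will match.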

     
  • Nickolay V. Shmyrev

    And this discussion is a very good illustration of how to ask proper
    questions. Instead of the dumb question "can I use a small dictionary", you
    should have asked "can I limit the vocabulary of the language model". To get
    a fast answer, you must describe the problem you have, not the way you think
    you can solve it.

     
  • Samuel Kitono

    Samuel Kitono - 2010-06-20

    Thanks nshmyrev, I will keep in mind to ask a direct question next time. But I
    am quite offended that you referred to my question as "dumb". I'm a newbie in
    this topic, and sometimes I may not know what the right question to ask is.

     
  • Samuel Kitono

    Samuel Kitono - 2010-07-09

    OK, I have a problem with SRILM. I ran
    ngram -limit-vocab turtle.vocab -lm lm_giga_64k_nvp_3gram.arpa -write-lm test2.lm

    turtle.vocab has these lines:

    A
    AND
    ARE
    AROUND
    BACKWARD
    BACKWARDS
    BYE
    CENTIMETER
    CENTIMETERS
    CHASE
    COLOR
    DEGREES
    DISPLAY
    DO
    DOING
    EIGHT
    EIGHTEEN
    EIGHTY
    ELEVEN
    EXIT
    EXPLORE
    FIFTEEN
    FIFTY
    FIND
    FINISH
    FIVE
    FORTY
    FORWARD
    FOUR
    FOURTEEN
    GO
    GREY
    GUARD
    HALF
    HALL
    HALLWAY
    HALT
    HELLO
    HOME
    HUNDRED
    KEVIN
    LAB
    LEFT
    LISTENING
    LOST
    METER
    METERS
    MINUS
    NINE
    NINETEEN
    NINETY
    OFFICE
    ONE
    PERSON
    QUARTER
    QUARTERS
    QUIT
    READY
    REID
    RIGHT
    ROBOMAN
    ROOM
    ROTATE
    SAY
    SEBASTIAN
    SEVEN
    SEVENTEEN
    SEVENTY
    SIX
    SIXTEEN
    SIXTY
    STOP
    TEN
    THE
    THEN
    THIRTEEN
    THIRTY
    THREE
    TO
    TOM
    TURN
    TWELVE
    TWENTY
    TWO
    UNDERSTAND
    WANDER
    WHAT
    WINDOW
    YOU
    

    while test2.lm has this as output, which is obviously wrong:

    \data\
    ngram 1=3
    ngram 2=4
    ngram 3=2
    
    \1-grams:
    -2.111539   </s>    0
    -99 <s> -2.176543
    -1.128404   <unk>   -1.532571
    
    \2-grams:
    -4.122183   <s> </s>    0
    -1.677757   <s> <unk>   -0.09743171
    -1.21712    <unk> </s>  0
    -1.362285   <unk> <unk> -0.2113529
    
    \3-grams:
    -1.162838   <unk> <unk> </s>
    -1.280766   <unk> <unk> <unk>
    
    \end\
    

    The ARPA file is huge, so I am just putting a link here that leads to the
    file (it is the 64k NVP 3-gram):
    http://www.keithv.com/software/giga/

    Please advise what is wrong with it...

     
  • Nickolay V. Shmyrev

    Please advise what is wrong with it...

    Your vocabulary is upper case, lm_giga is lower case. SRILM can't find any of
    the words you listed in the lm.
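
    A quick way to spot this kind of mismatch is to count how many vocabulary
    entries appear in the LM as written and how many appear after lowercasing.
    The sketch below is only an illustration: vocab_coverage is a hypothetical
    helper, and lm_words stands for the set of unigrams read from the ARPA file.

```python
def vocab_coverage(vocab_words, lm_words):
    """Return (exact_hits, lowercase_hits) for a word list against an LM.

    exact_hits counts entries found in the LM exactly as written;
    lowercase_hits counts entries found after converting them to lower case.
    """
    lm_words = set(lm_words)
    exact = sum(1 for w in vocab_words if w in lm_words)
    lowered = sum(1 for w in vocab_words if w.lower() in lm_words)
    return exact, lowered


def lowercase_vocab(vocab_words):
    """Return the word list converted to lower case, dropping empty lines."""
    return [w.strip().lower() for w in vocab_words if w.strip()]
```

    If the first number is zero and the second is not, the word list and the
    LM differ only in case, and lowercase_vocab gives a list that should match.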

     
  • Samuel Kitono

    Samuel Kitono - 2010-07-09

    I have changed turtle.vocab to lowercase, but it is still not working...

    Just in case, here is the console output:

    ...\srilm\bin\Debug>ngram -limit-vocab turtle.vocab -lm lm_giga_64k_nvp_3gram.arpa -write-lm test2.lm
lm_giga_64k_nvp_3gram.arpa: line 13: warning: non-zero probability for <unk> in closed-vocabulary LM
    
     
  • Samuel Kitono

    Samuel Kitono - 2010-07-09

    OK, I have found the problem. After looking at the SRILM documentation, the
    "correct" command line is the following:
    ngram -vocab turtle.vocab -limit-vocab -lm big.lm -write-lm small.lm

     
