I have downloaded a language model from
http://www.keithv.com/software/csr/
which provides a language model for a 5k vocabulary.
However, it comes in ARPA format.
Can I still use this language model the same way as the .lm format? That is,
would the init argument be -lm languagemodel.arpa instead of the usual -lm
languagemodel.lm?
Also, can the dictionary have fewer words than the language model vocabulary?
There is no such thing as an "lm format". A model can be in one of two formats,
ARPA and DMP, and you can convert between them using sphinx_lm_convert.
PocketSphinx can load ARPA models as well as DMP models. To load a model, use
the -lm command-line option. Sometimes ARPA models need to be sorted using
sphinx_lm_sort.
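For reference, the conversion and sorting steps mentioned above might look
like this (file names are placeholders, and exact options may vary between
sphinxbase versions):

```shell
# Convert an ARPA model to the binary DMP format.
sphinx_lm_convert -i languagemodel.arpa -o languagemodel.lm.DMP

# If PocketSphinx complains about n-gram ordering, sort the ARPA file first.
sphinx_lm_sort < languagemodel.arpa > languagemodel_sorted.arpa

# Either format can then be passed to the decoder with -lm.
```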
Did you mean words? Ideally the dictionary should have pronunciations for all
words in the language model vocabulary.
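For illustration, pronunciation dictionary entries in the CMU format are one
word per line followed by its phone sequence (the words and phones below are
just examples, not from the 5k model):

```
HELLO  HH AH L OW
WORLD  W ER L D
LEFT   L EH F T
```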
True; the reason I do this is so that my speech recognition will have fewer
errors. But I am wondering whether this assumption is correct?
No, it is not about fewer or more errors. If your language model has words
that are missing from the dictionary, they are simply dropped from the search.
That means you are not using your full language model but only part of it;
you merely waste memory and processing time.
Well, I do not mind wasting memory and processing time. My concern is the WER
of my speech recognition, hence I was thinking of using a 3k-word dictionary
with the 5k language model. But do you think doing this would decrease my WER?
If there were a 3k language model, I would have used it instead. I would be
glad if you could tell me where I can get one, if possible.
And thanks for your constant replies to my questions.
The SRILM toolkit can limit the vocabulary of the language model to the set
you need using the "ngram -limit-vocab new.vocab -lm big.lm -write-lm
small.lm" command.
And this discussion is a very good illustration of how to ask proper
questions. Instead of the dumb question "can I use a small dictionary", you
should have been asking "can I limit the vocabulary of the language model".
You must describe the problem you have to get a fast answer, not describe the
way you think you can solve it.
Thanks nshmyrev, I will keep in mind to ask a direct question next time. But I
am quite offended that you referred to my question as "dumb". I am a newbie in
this topic and sometimes I may not know the right question to ask.
OK, I have run into a problem with SRILM. I have run
ngram -limit-vocab turtle.vocab -lm lm_giga_64k_nvp_3gram.arpa -write-lm
test2.lm
turtle.vocab has these lines
while test2.lm has this as output, which is obviously wrong:
The ARPA file is huge, so I will just put a link here that leads to the file
(it is the 64k NVP 3-gram):
http://www.keithv.com/software/giga/
Please advise what is wrong with it.
Your vocabulary is upper case, while lm_giga is lower case. SRILM cannot find
any of the words you listed in the LM.
I have changed turtle.vocab to lowercase, but it is still not working.
Just in case, here is the console output:
OK, I have found the problem. After looking at the SRILM documentation, the
correct command line is the following:
ngram -vocab turtle.vocab -limit-vocab -lm big.lm -write-lm small.lm
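Conceptually, limiting the vocabulary keeps only the n-grams whose words all
appear in the target word list. The Python sketch below shows just that
filtering idea with made-up trigrams (it is not SRILM itself, which also
renormalizes the remaining probabilities):

```python
def limit_vocab(ngrams, vocab):
    """Keep only n-grams made entirely of in-vocabulary words."""
    keep = set(vocab)
    return [g for g in ngrams if all(w in keep for w in g)]

# Hypothetical trigrams from a larger model.
trigrams = [
    ("turn", "left", "now"),
    ("turn", "right", "now"),
    ("quantum", "flux", "now"),  # contains out-of-vocabulary words: dropped
]
vocab = ["turn", "left", "right", "now"]

print(limit_vocab(trigrams, vocab))
```

Any n-gram containing a word outside the vocabulary is discarded, which is
why the resulting small.lm covers only the words you supplied.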