I have downloaded a language model from
http://www.keithv.com/software/csr/
which provides a language model for a 5k vocabulary.
However, it comes in ARPA format.
Can I still use this language model the same way as the .lm format? That is,
would the init argument be -lm languagemodel.arpa instead of the usual -lm
languagemodel.lm?
Also, can the dictionary have fewer words than the language model vocabulary?
There is no such thing as an "lm format". A model can be in one of two formats,
ARPA and DMP, and you can convert between them using sphinx_lm_convert.
PocketSphinx can load ARPA models as well as DMP models. To load a model, use
the -lm command-line option. Sometimes ARPA models need to be sorted using
sphinx_lm_sort.
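For reference, the conversion and sorting steps mentioned above might look
like this (file names are placeholders, and exact options may vary between
sphinxbase versions):

```shell
# Convert an ARPA model to the binary DMP format.
sphinx_lm_convert -i languagemodel.arpa -o languagemodel.lm.DMP

# If PocketSphinx complains about n-gram ordering, sort the ARPA file first.
sphinx_lm_sort < languagemodel.arpa > languagemodel_sorted.arpa

# Either format can then be passed to the decoder with -lm.
```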
Did you mean words? Ideally the dictionary should have pronunciations for all
words in the language model vocabulary.
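For illustration, pronunciation dictionary entries in the CMU format are one
word per line followed by its phone sequence (the words and phones below are
just examples, not from the 5k model):

```
HELLO  HH AH L OW
WORLD  W ER L D
LEFT   L EH F T
```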
True; the reason I do this is so that my speech recognition will have fewer
errors. But I am wondering whether this assumption is correct?
No, it is not about fewer or more errors. If your language model has words
that are missing from the dictionary, they are simply dropped from the search.
That means you are not using your full language model but only part of it;
you merely waste memory and processing time.
Well, I do not mind wasting memory and processing time. My concern is the WER
of my speech recognition, hence I was thinking of using a 3k-word dictionary
with the 5k language model. But do you think doing this would decrease my WER?
If there were a 3k language model, I would have used it instead. I would be
glad if you could tell me where I can get one, if possible.
And thanks for your constant replies to my questions.
The SRILM toolkit can limit the vocabulary of the language model to the set
you need using the "ngram -limit-vocab new.vocab -lm big.lm -write-lm
small.lm" command.
And this discussion is a very good illustration of how to ask proper
questions. Instead of the dumb question "can I use a small dictionary", you
should have been asking "can I limit the vocabulary of the language model".
You must describe the problem you have to get a fast answer, not describe the
way you think you can solve it.
Thanks nshmyrev, I will keep in mind to ask a direct question next time. But I
am quite offended that you referred to my question as "dumb". I am a newbie in
this topic and sometimes I may not know the right question to ask.
OK, I have run into a problem with SRILM. I have run
ngram -limit-vocab turtle.vocab -lm lm_giga_64k_nvp_3gram.arpa -write-lm
test2.lm
turtle.vocab has these lines
while test2.lm has this as output, which is obviously wrong:
The ARPA file is huge, so I will just put a link here that leads to the file
(it is the 64k NVP 3-gram):
http://www.keithv.com/software/giga/
Please advise what is wrong with it.
Your vocabulary is upper case, while lm_giga is lower case. SRILM cannot find
any of the words you listed in the LM.
I have changed turtle.vocab to lowercase, but it is still not working.
Just in case, here is the console output:
OK, I have found the problem. After looking at the SRILM documentation, the
correct command line is the following:
ngram -vocab turtle.vocab -limit-vocab -lm big.lm -write-lm small.lm
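Conceptually, limiting the vocabulary keeps only the n-grams whose words all
appear in the target word list. The Python sketch below shows just that
filtering idea with made-up trigrams (it is not SRILM itself, which also
renormalizes the remaining probabilities):

```python
def limit_vocab(ngrams, vocab):
    """Keep only n-grams made entirely of in-vocabulary words."""
    keep = set(vocab)
    return [g for g in ngrams if all(w in keep for w in g)]

# Hypothetical trigrams from a larger model.
trigrams = [
    ("turn", "left", "now"),
    ("turn", "right", "now"),
    ("quantum", "flux", "now"),  # contains out-of-vocabulary words: dropped
]
vocab = ["turn", "left", "right", "now"]

print(limit_vocab(trigrams, vocab))
```

Any n-gram containing a word outside the vocabulary is discarded, which is
why the resulting small.lm covers only the words you supplied.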