I've been experimenting with generating language models using cmuclmtk-0.7
instead of mitlm. I had a couple of questions about it.
My first question is about text2idngram, specifically the hash size that is
passed to it. I can't use the default of 2000000 because the memory isn't
there for it on the device. I did some tests of reducing it to various numbers
and discovered that I could reduce it down to 5000 without the (final .arpa)
output apparently changing, and furthermore it reduced the time needed by
about a third. There seems to be nothing but upside to reducing the hash size.
Is there any downside?
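For reference, here is roughly the sequence of steps whose timing I compared, written as a rough sketch that drives the cmuclmtk-0.7 command-line tools. The file names are placeholders and the flags are the ones I understand from the tutorial linked below; the only value I am changing is -hash:

    import subprocess

    HASH_SIZE = "5000"  # reduced from the text2idngram default of 2000000

    def pipe(cmd, in_file, out_file=None):
        # Run one toolkit command, feeding in_file on stdin and, if given,
        # writing stdout to out_file.
        with open(in_file, "rb") as fin:
            if out_file is None:
                subprocess.run(cmd, stdin=fin, check=True)
            else:
                with open(out_file, "wb") as fout:
                    subprocess.run(cmd, stdin=fin, stdout=fout, check=True)

    # 1. word frequencies and vocabulary
    pipe(["text2wfreq"], "corpus.txt", "corpus.wfreq")
    pipe(["wfreq2vocab"], "corpus.wfreq", "corpus.vocab")

    # 2. id n-grams; this is the step whose hash table size I am reducing
    pipe(["text2idngram", "-vocab", "corpus.vocab",
          "-idngram", "corpus.idngram", "-hash", HASH_SIZE], "corpus.txt")

    # 3. ARPA model
    subprocess.run(["idngram2lm", "-vocab_type", "0",
                    "-idngram", "corpus.idngram",
                    "-vocab", "corpus.vocab",
                    "-arpa", "corpus.arpa"], check=True)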
My second question: when I generate an ARPA model using the method described at
http://cmusphinx.sourceforge.net/wiki/tutoriallm, I get some entries in my ARPA
file with end + start sentence markers in them, as in the following 3-gram:
-1.4260 </s> <s> TUTORIAL
When a language model is made in this way using primarily single words
surrounded by the sentence markers, for instance the following starting text
file:
<s> CHANGE MODEL </s>
<s> MONDAY </s>
<s> TUESDAY </s>
<s> WEDNESDAY </s>
<s> THURSDAY </s>
<s> FRIDAY </s>
<s> SATURDAY </s>
<s> SUNDAY </s>
<s> QUIDNUNC </s>
then almost every word will have an entry for this </s> <s> WORD pattern under
the 3-grams as well as the more expected <s> WORD </s> pattern, like so:
-0.9542 </s> <s> FRIDAY
-0.9542 </s> <s> MONDAY
-0.9542 </s> <s> QUIDNUNC
-0.9542 </s> <s> SATURDAY
-0.9542 </s> <s> SUNDAY
-0.9542 </s> <s> THURSDAY
-0.9542 </s> <s> TUESDAY
-0.9542 </s> <s> WEDNESDAY
-0.3010 <s> CHANGE MODEL
-0.3010 <s> FRIDAY </s>
-0.3010 <s> MONDAY </s>
-0.3010 <s> SATURDAY </s>
-0.3010 <s> SUNDAY </s>
-0.3010 <s> THURSDAY </s>
-0.3010 <s> TUESDAY </s>
-0.3010 <s> WEDNESDAY </s>
-0.3010 CHANGE MODEL </s>
-0.3010 FRIDAY </s> <s>
-0.3010 MODEL </s> <s>
-0.3010 MONDAY </s> <s>
-0.3010 SATURDAY </s> <s>
-0.3010 SUNDAY </s> <s>
-0.3010 THURSDAY </s> <s>
-0.3010 TUESDAY </s> <s>
-0.3010 WEDNESDAY </s> <s>
Is this expected? Is it problematic for recognition? Thanks for your insight.
> My first question is about text2idngram, specifically the hash size that is
> passed to it. I can't use the default of 2000000 because the memory isn't
> there for it on the device. I did some tests of reducing it to various numbers
> and discovered that I could reduce it down to 5000 without the (final .arpa)
> output apparently changing, and furthermore it reduced the time needed by
> about a third. There seems to be nothing but upside to reducing the hash size.
> Is there any downside?
I think it can be done, but it may require some rework in the outdated cmuclmtk
sources (the hash needs to be linked from sphinxbase).
> -1.4260 </s> <s> TUTORIAL
> Is this expected? Is it problematic for recognition? Thanks for your insight.
This shouldn't hurt. But it's not good either. I never had time to fix it in
cmuclmtk.
I just have guesses for this answer, so I will remain silent, but I suggest you
check out SRILM as well: http://www.speech.sri.com/projects/srilm/
Also mitlm and irstlm, they should be way better candidates.
> I just have guesses for this answer, so I will remain silent, but I suggest
> you check out SRILM as well.
I can't use srilm or irstlm because of their licenses; they'd probably be
usable for me, but unusable for my users. I have implemented a working port of
mitlm, but I would really like to drop the C++ requirement, since it's the only
non-C dependency I'm working with and it leads to some awkwardness in Xcode
that generates many support cases. But if the hash size reduction and the
</s> <s> issue both have unknown consequences in cmuclmtk, I guess I have to
stick with mitlm (which I do like, other than the C++ thing). Nickolay, what is
the potential danger of changing the hash size?
> But if the hash size reduction and the </s> <s> issue both have unknown
> consequences in cmuclmtk, I guess I have to stick with mitlm (which I do like
> other than the C++ thing).
The consequences are known, and they do not change anything. The model will
still be functional. You can safely go with cmuclmtk with a modified initial
hash if you want to use it.
That sounds good; can I ask you to expand on your previous comments a bit
then?
> I think it can be done, but it may require some rework in the outdated
> cmuclmtk sources (the hash needs to be linked from sphinxbase).
This sounds like there is something I still need to do in order for this to
work without issue.
> This shouldn't hurt. But it's not good either. I never had time to fix it in
> cmuclmtk.
Let's talk about the "not good"-ness a bit more. To me it looks like it is
effectively re-adding the 2-grams into the 3-gram section because it is taking
the end/start tags as words, so it probably raises the probability of the true
2-grams versus the true 3-grams a bit. Is there a problem with my doing a last
pass on the .arpa after it is created and just deleting the lines in which the
</s> <s> pattern appears, and then adjusting the n-gram counts in the \data\
section? Or will this have a distorting effect on the overall probabilities?
Whatever I do there will have to work as well for 1000 words derived from
complete sentences as for 10 words derived from a simple corpus for
command-and-control.
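To make that concrete, the kind of last pass I have in mind is something like
the sketch below. It is only an illustration of the approach, not a finished
tool: it assumes a standard ARPA file as produced by cmuclmtk, that dropping
whole n-gram lines containing "</s> <s>" is sufficient, and that only the
"ngram N=" counts in the \data\ header need to be patched afterwards.

    import re
    import sys

    def clean_arpa(in_path, out_path):
        with open(in_path, "r") as f:
            lines = f.readlines()

        removed = {}            # n-gram order -> number of lines dropped
        current_order = None    # which \N-grams: section we are inside
        kept = []

        for line in lines:
            section = re.match(r"\\(\d+)-grams:", line)
            if section:
                current_order = int(section.group(1))
                kept.append(line)
                continue
            if line.startswith("\\end\\"):
                current_order = None
            # n-gram lines look like "<logprob> w1 w2 [w3] [backoff]";
            # drop any that contain the end+start marker pair.
            if current_order is not None and "</s> <s>" in line:
                removed[current_order] = removed.get(current_order, 0) + 1
                continue
            kept.append(line)

        # Patch the counts in the \data\ header to match what was dropped.
        output = []
        for line in kept:
            header = re.match(r"ngram (\d+)=(\d+)", line)
            if header and int(header.group(1)) in removed:
                order = int(header.group(1))
                count = int(header.group(2)) - removed[order]
                line = "ngram %d=%d\n" % (order, count)
            output.append(line)

        with open(out_path, "w") as f:
            f.writelines(output)

    if __name__ == "__main__":
        clean_arpa(sys.argv[1], sys.argv[2])

Note that this leaves the backoff weights untouched, which is part of what I am
asking about with respect to distortion.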
> To me it looks like it is effectively re-adding the 2-grams into the 3-gram
> section because it is taking the end/start tags as words, so it probably
> raises the probability of the true 2-grams versus the true 3-grams a bit.
No, it's not like that. The decoder never queries those trigrams with
</s> <s>, so they have no effect except that they take memory.
> Is there a problem with my doing a last pass on the .arpa after it is created
> and just deleting the lines in which the </s> <s> pattern appears, and then
> adjusting the n-gram counts in the \data\ section? Or will this have a
> distorting effect on the overall probabilities? Whatever I do there will have
> to work as well for 1000 words derived from complete sentences as for 10
> words derived from a simple corpus for command-and-control.
Cleanup like this will work too.
I would rather understand how those things get into the counts and fix it
there. It shouldn't be complex, just some painful work to clean up cmuclmtk.
I can't give you any other advice.