Hi Nickolay,
I'm seeing some results I'm not accustomed to in CMUCLMTK 0.7 when creating an
ARPA LM. The initial corpus is as follows:
SELECT ALL
REMOVE ALL
CLOSE
CANCEL
BACK
UP
DOWN
PLAY
ONE
TWO
THREE
FOUR
FIVE
SIX
SEVEN
EIGHT
NINE
TEN
CHANGE MODEL
I'm using the method from the wiki on creating a language model with
CMUCLMTK, which has been working very well in OpenEars for LMs made from
longer sentences.
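For context, the pipeline I'm running is roughly the standard one from the wiki (the file names here are just placeholders, and flag spellings may differ slightly between CMUCLMTK versions):

    # count word frequencies and build the vocabulary
    text2wfreq < corpus.txt | wfreq2vocab > corpus.vocab
    # convert the corpus into id n-grams against that vocabulary
    text2idngram -vocab corpus.vocab -idngram corpus.idngram < corpus.txt
    # estimate the ARPA model (Good-Turing discounting by default)
    idngram2lm -vocab_type 0 -idngram corpus.idngram -vocab corpus.vocab -arpa corpus.arpa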
With this small corpus, when I get to idngram2lm, here is the logging I'm
seeing:
What happens in the end is that (as clearly warned above), every bigram and
trigram has a -99.999 probability and can never be recognized. The unigrams
appear to be fine. Can you give me any insight into why this is happening and
how I can fix it? Thank you.
> Can you give me any insight into why this is happening and how I can fix it?
The amount of training data is not sufficient to estimate the Good-Turing
discounting parameters properly. You need to use linear smoothing or even
absolute discounting. For a small vocabulary it's recommended to use JSGF; a
trigram ARPA model makes little sense for a few dozen words.
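Something like the following (a sketch only; check idngram2lm -help for the exact flag spellings in your build):

    # same command as before, but with absolute discounting instead of
    # the default Good-Turing
    idngram2lm -vocab_type 0 -idngram corpus.idngram -vocab corpus.vocab \
        -arpa corpus.arpa -absolute
    # or try linear smoothing instead by passing -linear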
Thank you, that makes sense. How many lines (or overall words) should the
corpus contain before it's worth trying Good-Turing?
This is for an automatic language model generator in the framework, so there
is currently no "automatically generate a JSGF grammar" option since it's more
complex to turn into a simple API.
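For reference, the kind of grammar Nickolay is describing would look roughly like this in JSGF (the grammar and rule names here are made up for illustration):

    #JSGF V1.0;
    grammar commands;
    public <command> = SELECT ALL | REMOVE ALL | CLOSE | CANCEL | BACK | UP | DOWN
                     | PLAY | ONE | TWO | THREE | FOUR | FIVE | SIX | SEVEN | EIGHT
                     | NINE | TEN | CHANGE MODEL ;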
Hi Nickolay,
So, that worked perfectly with the old corpus when I switched to absolute
discounting (linear smoothing gave the same -99.999 result). With this
different small corpus I now get a crash:
SUNDAY
MONDAY
TUESDAY
WEDNESDAY
THURSDAY
FRIDAY
SATURDAY
QUIDNUNC
CHANGE MODEL
I understand that JSGF would be better, but I still need to make ARPA work.
Here is the logging right up until the crash, thanks again for your help: