Hi Nickolay,
I'm seeing some results I'm not accustomed to in CMUCLMTK 0.7 when creating an
ARPA LM. The initial corpus is as follows:
SELECT ALL
REMOVE ALL
CLOSE
CANCEL
BACK
UP
DOWN
PLAY
ONE
TWO
THREE
FOUR
FIVE
SIX
SEVEN
EIGHT
NINE
TEN
CHANGE MODEL
I'm using the method from the wiki on creating a language model with
CMUCLMTK, which has been working very well in OpenEars for LMs made from
longer sentences.
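For context, the pipeline I'm running is roughly the standard one from the wiki (the file names here are just placeholders, and flag spellings may differ slightly between CMUCLMTK versions):

    # count word frequencies and build the vocabulary
    text2wfreq < corpus.txt | wfreq2vocab > corpus.vocab
    # convert the corpus into id n-grams against that vocabulary
    text2idngram -vocab corpus.vocab -idngram corpus.idngram < corpus.txt
    # estimate the ARPA model (Good-Turing discounting by default)
    idngram2lm -vocab_type 0 -idngram corpus.idngram -vocab corpus.vocab -arpa corpus.arpa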
With this small corpus, when I get to idngram2lm, here is the logging I'm
seeing:
What happens in the end is that (as clearly warned above), every bigram and
trigram has a -99.999 probability and can never be recognized. The unigrams
appear to be fine. Can you give me any insight into why this is happening and
how I can fix it? Thank you.
> Can you give me any insight into why this is happening and how I can fix it?
The amount of training data is not sufficient to estimate the Good-Turing
discounting parameters properly. You need to use linear smoothing or even
absolute discounting. For a small vocabulary it's recommended to use JSGF; a
trigram ARPA model makes little sense for a few dozen words.
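Something like the following (a sketch only; check idngram2lm -help for the exact flag spellings in your build):

    # same command as before, but with absolute discounting instead of
    # the default Good-Turing
    idngram2lm -vocab_type 0 -idngram corpus.idngram -vocab corpus.vocab \
        -arpa corpus.arpa -absolute
    # or try linear smoothing instead by passing -linear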
Thank you, that makes sense. How many lines (or overall words) should the
corpus contain before it's worth trying Good-Turing?
This is for an automatic language model generator in the framework, so there
is currently no "automatically generate a JSGF grammar" option since it's more
complex to turn into a simple API.
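For reference, the kind of grammar Nickolay is describing would look roughly like this in JSGF (the grammar and rule names here are made up for illustration):

    #JSGF V1.0;
    grammar commands;
    public <command> = SELECT ALL | REMOVE ALL | CLOSE | CANCEL | BACK | UP | DOWN
                     | PLAY | ONE | TWO | THREE | FOUR | FIVE | SIX | SEVEN | EIGHT
                     | NINE | TEN | CHANGE MODEL ;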
Hi Nickolay,
So, that worked perfectly with the old corpus when I switched to absolute
discounting (linear smoothing gave the same -99.999 result). With this
different small corpus I now get a crash:
SUNDAY
MONDAY
TUESDAY
WEDNESDAY
THURSDAY
FRIDAY
SATURDAY
QUIDNUNC
CHANGE MODEL
I understand that JSGF would be better, but I still need to make ARPA work.
Here is the logging right up until the crash, thanks again for your help: