
GT statistics are out of range

Halle · 2012-04-07 · 2012-09-22
  • Halle

    Halle - 2012-04-07

    Hi Nickolay,

    I'm seeing some results I'm not used to in CMUCLMTK 0.7 when creating an
    ARPA LM. The initial corpus is as follows:

    SELECT ALL
    REMOVE ALL
    CLOSE
    CANCEL
    BACK
    UP
    DOWN
    PLAY
    ONE
    TWO
    THREE
    FOUR
    FIVE
    SIX
    SEVEN
    EIGHT
    NINE
    TEN
    CHANGE MODEL

    This uses the method from the wiki on creating a language model with
    CMUCLMTK, which has been working very well in OpenEars for LMs made from
    longer sentences.
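    As an aside for anyone reproducing this: a quick per-sentence bigram count
    of the corpus above shows why smoothing is hard here, since almost every
    bigram occurs exactly once. (This is only a sketch; text2idngram counts
    the whole id stream, so its figures differ slightly.)

```python
from collections import Counter

# The corpus from above, one utterance per line
corpus = ["SELECT ALL", "REMOVE ALL", "CLOSE", "CANCEL", "BACK", "UP",
          "DOWN", "PLAY", "ONE", "TWO", "THREE", "FOUR", "FIVE", "SIX",
          "SEVEN", "EIGHT", "NINE", "TEN", "CHANGE MODEL"]

def ngram_counts(sentences, n):
    """Count n-grams per sentence, with <s>/</s> boundary markers."""
    counts = Counter()
    for s in sentences:
        tokens = ["<s>"] + s.split() + ["</s>"]
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

bigrams = ngram_counts(corpus, 2)
# Count-of-counts ("frequency of frequency"): how many distinct
# bigrams occur exactly r times
fof = Counter(bigrams.values())
print(fof)  # nearly all the mass sits at r = 1
```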

    With this small corpus, when I get to idngram2lm, here is the logging I'm
    seeing:

    2012-04-07 16:38:02.184 OpenEarsSampleApp[13820:10703] OPENEARSLOGGING: Starting text2idngram
    text2idngram
    Vocab                  : /Users/username/Library/Application Support/iPhone Simulator/5.1/Applications/B90C29D2-D88A-4379-9B5E-493EA72BB296/Documents/OpenEarsDynamicGrammar.vocab
    Output idngram         : /Users/username/Library/Application Support/iPhone Simulator/5.1/Applications/B90C29D2-D88A-4379-9B5E-493EA72BB296/Documents/OpenEarsDynamicGrammar.idngram
    N-gram buffer size     : 10
    Hash table size        : 5000
    Temp directory         : /Users/username/Library/Application Support/iPhone Simulator/5.1/Applications/B90C29D2-D88A-4379-9B5E-493EA72BB296/Documents/cmuclmtk-gyNhJz
    Max open files         : 20
    FOF size               : 10
    n                      : 3
    Initialising hash table...
    Reading vocabulary... 
    Allocating memory for the n-gram buffer...
    Reading text into the n-gram buffer...
    20,000 n-grams processed for each ".", 1,000,000 for each line.
    
    Sorting n-grams...
    Writing sorted n-grams to temporary file /Users/username/Library/Application Support/iPhone Simulator/5.1/Applications/B90C29D2-D88A-4379-9B5E-493EA72BB296/Documents/cmuclmtk-gyNhJz/1
    Merging 1 temporary files...
    
    2-grams occurring:  N times     > N times   Sug. -spec_num value
          0                          40          50
          1                  38           2          12
          2                   1           1          11
          3                   0           1          11
          4                   0           1          11
          5                   0           1          11
          6                   0           1          11
          7                   0           1          11
          8                   0           1          11
          9                   0           1          11
         10                   0           1          11
    
    3-grams occurring:  N times     > N times   Sug. -spec_num value
          0                          57          67
          1                  56           1          11
          2                   1           0          10
          3                   0           0          10
          4                   0           0          10
          5                   0           0          10
          6                   0           0          10
          7                   0           0          10
          8                   0           0          10
          9                   0           0          10
         10                   0           0          10
    text2idngram : Done.
    2012-04-07 16:38:02.195 OpenEarsSampleApp[13820:10703] OPENEARSLOGGING: Done with text2idngram
    2012-04-07 16:38:02.196 OpenEarsSampleApp[13820:10703] OPENEARSLOGGING: Starting idngram2lm
      n : 3
      Input file : /Users/username/Library/Application Support/iPhone Simulator/5.1/Applications/B90C29D2-D88A-4379-9B5E-493EA72BB296/Documents/OpenEarsDynamicGrammar.idngram     (binary format)
      Output files :
         ARPA format   : /Users/username/Library/Application Support/iPhone Simulator/5.1/Applications/B90C29D2-D88A-4379-9B5E-493EA72BB296/Documents/OpenEarsDynamicGrammar.arpa
      Vocabulary file : /Users/username/Library/Application Support/iPhone Simulator/5.1/Applications/B90C29D2-D88A-4379-9B5E-493EA72BB296/Documents/OpenEarsDynamicGrammar.vocab
      Context cues file : /Users/username/Library/Application Support/iPhone Simulator/5.1/Applications/B90C29D2-D88A-4379-9B5E-493EA72BB296/Documents/OpenEarsDynamicGrammar.ccs
      Cutoffs :
         2-gram : 0     3-gram : 0     
      Vocabulary type : Closed
      Minimum unigram count : 0
      Zeroton fraction : 1
      Counts will be stored in two bytes.
      Count table size : 65535
      Discounting method : Good-Turing
         Discounting ranges :
            1-gram : 1     2-gram : 7     3-gram : 7     
      Memory allocation for tree structure : 
         Allocate 10 MB of memory, shared equally between all n-gram tables.
      Back-off weight storage : 
         Back-off weights will be stored in four bytes.
    Reading vocabulary.
    
    read_wlist_into_siht: a list of 23 words was read from "/Users/username/Library/Application Support/iPhone Simulator/5.1/Applications/B90C29D2-D88A-4379-9B5E-493EA72BB296/Documents/OpenEarsDynamicGrammar.vocab".
    read_wlist_into_array: a list of 23 words was read from "/Users/username/Library/Application Support/iPhone Simulator/5.1/Applications/B90C29D2-D88A-4379-9B5E-493EA72BB296/Documents/OpenEarsDynamicGrammar.vocab".
    Context cue word : <s> id = 2
    Context cue word : </s> id = 1
    Allocated space for 357142 2-grams.
    Allocated space for 833333 3-grams.
    table_size 24
    Allocated 5714272 bytes to table for 2-grams.
    Allocated (2+3333332) bytes to table for 3-grams.
    Processing id n-gram file.
    20,000 n-grams processed for each ".", 1,000,000 for each line.
    
    fof[0][1] = 19
    fof[0][2] = 1
    fof[1][1] = 38
    fof[1][2] = 1
    fof[1][3] = 0
    fof[1][4] = 0
    fof[1][5] = 0
    fof[1][6] = 0
    fof[1][7] = 0
    fof[1][8] = 0
    fof[2][1] = 56
    fof[2][2] = 1
    fof[2][3] = 0
    fof[2][4] = 0
    fof[2][5] = 0
    fof[2][6] = 0
    fof[2][7] = 0
    fof[2][8] = 0
    Calculating discounted counts.
    Warning : 1-gram : Discounting range is 1; setting P(zeroton)=P(singleton).
    Discounted value : 0.86
    Warning : 2-gram : GT statistics are out of range; lowering cutoff to 6.
    Warning : 2-gram : GT statistics are out of range; lowering cutoff to 5.
    Warning : 2-gram : GT statistics are out of range; lowering cutoff to 4.
    Warning : 2-gram : GT statistics are out of range; lowering cutoff to 3.
    Warning : 2-gram : GT statistics are out of range; lowering cutoff to 2.
    Warning : 2-gram : GT statistics are out of range; lowering cutoff to 1.
    Warning : 2-gram : Discounting range of 1 is equivalent to excluding 
    singletons.
    2-gram : cutoff = 1, discounted values: 0.00
    Warning : 3-gram : GT statistics are out of range; lowering cutoff to 6.
    Warning : 3-gram : GT statistics are out of range; lowering cutoff to 5.
    Warning : 3-gram : GT statistics are out of range; lowering cutoff to 4.
    Warning : 3-gram : GT statistics are out of range; lowering cutoff to 3.
    Warning : 3-gram : GT statistics are out of range; lowering cutoff to 2.
    Warning : 3-gram : GT statistics are out of range; lowering cutoff to 1.
    Warning : 3-gram : Discounting range of 1 is equivalent to excluding 
    singletons.
    3-gram : cutoff = 1, discounted values: 0.00
       prob[1] = 1e-99 count = 0 
       prob[2] = 1e-99 count = 0 
       prob[3] = 0.095238095 count = 2 
       prob[4] = 0.041125542 count = 1 
       prob[5] = 0.041125542 count = 1 
       prob[6] = 0.041125542 count = 1 
       prob[7] = 0.041125542 count = 1 
       prob[8] = 0.041125542 count = 1 
       prob[9] = 0.041125542 count = 1 
       prob[10] = 0.041125542 count = 1 
       prob[11] = 0.041125542 count = 1 
       prob[12] = 1e-99 count = 0 
       prob[13] = 0.041125542 count = 1 
       prob[14] = 0.041125542 count = 1 
       prob[15] = 0.041125542 count = 1 
       prob[16] = 0.041125542 count = 1 
       prob[17] = 0.041125542 count = 1 
       prob[18] = 0.041125542 count = 1 
       prob[19] = 0.041125542 count = 1 
       prob[20] = 0.041125542 count = 1 
       prob[21] = 0.041125542 count = 1 
       prob[22] = 0.041125542 count = 1 
       prob[23] = 0.041125542 count = 1 
    Unigrams's discount mass is 0.123377 (n1/N = 0.904762)
    1 zerotons, P(zeroton) = 0.123377 P(singleton) = 0.0411255
    P(zeroton) was reduced to 0.0411255416 (1.000 of P(singleton))
    Unigram was renormalized to absorb a mass of 0.0822511
    prob[UNK] = 1e-99
    THE FINAL UNIGRAM:
     unigram[1]=1.08962e-99
     unigram[2]=1.08962e-99
     unigram[3]=0.103774
     unigram[4]=0.0448113
     unigram[5]=0.0448113
     unigram[6]=0.0448113
     unigram[7]=0.0448113
     unigram[8]=0.0448113
     unigram[9]=0.0448113
     unigram[10]=0.0448113
     unigram[11]=0.0448113
     unigram[12]=0.0448113
     unigram[13]=0.0448113
     unigram[14]=0.0448113
     unigram[15]=0.0448113
     unigram[16]=0.0448113
     unigram[17]=0.0448113
     unigram[18]=0.0448113
     unigram[19]=0.0448113
     unigram[20]=0.0448113
     unigram[21]=0.0448113
     unigram[22]=0.0448113
     unigram[23]=0.0448113
    Incrementing contexts...
    Calculating back-off weights...
    Warning : P(2) = 0 (0 / 19)
    ncount = 1
    Warning : P(2) = 0 (0 / 19)
    ncount = 1
    Warning : P(2) = 0 (0 / 19)
    ncount = 1
    Warning : P(2) = 0 (0 / 19)
    ncount = 1
    Warning : P(2) = 0 (0 / 19)
    ncount = 1
    Warning : P(2) = 0 (0 / 19)
    ncount = 1
    Warning : P(2) = 0 (0 / 19)
    ncount = 1
    Warning : P(2) = 0 (0 / 19)
    ncount = 1
    Warning : P(2) = 0 (0 / 19)
    ncount = 1
    Warning : P(2) = 0 (0 / 19)
    ncount = 1
    Warning : P(2) = 0 (0 / 19)
    ncount = 1
    Warning : P(2) = 0 (0 / 19)
    ncount = 1
    Warning : P(2) = 0 (0 / 19)
    ncount = 1
    Warning : P(2) = 0 (0 / 19)
    ncount = 1
    Warning : P(2) = 0 (0 / 19)
    ncount = 1
    Warning : P(2) = 0 (0 / 19)
    ncount = 1
    Warning : P(2) = 0 (0 / 19)
    ncount = 1
    Warning : P(2) = 0 (0 / 19)
    ncount = 1
    Warning : P(4) = 0 (0 / 1)
    ncount = 1
    Warning : P(5) = 0 (0 / 1)
    ncount = 1
    Warning : P(6) = 0 (0 / 1)
    ncount = 1
    Warning : P(7) = 0 (0 / 1)
    ncount = 1
    Warning : P(8) = 0 (0 / 1)
    ncount = 1
    Warning : P(9) = 0 (0 / 1)
    ncount = 1
    Warning : P(10) = 0 (0 / 1)
    ncount = 1
    Warning : P(11) = 0 (0 / 1)
    ncount = 1
    Warning : P(13) = 0 (0 / 1)
    ncount = 1
    Warning : P(14) = 0 (0 / 1)
    ncount = 1
    Warning : P(15) = 0 (0 / 1)
    ncount = 1
    Warning : P(16) = 0 (0 / 1)
    ncount = 1
    Warning : P(17) = 0 (0 / 1)
    ncount = 1
    Warning : P(18) = 0 (0 / 1)
    ncount = 1
    Warning : P(19) = 0 (0 / 1)
    ncount = 1
    Warning : P(20) = 0 (0 / 1)
    ncount = 1
    Warning : P(21) = 0 (0 / 1)
    ncount = 1
    Warning : P(22) = 0 (0 / 1)
    ncount = 1
    Warning : P(23) = 0 (0 / 1)
    ncount = 1
    Writing out language model...
    ARPA-style 3-gram will be written to /Users/username/Library/Application Support/iPhone Simulator/5.1/Applications/B90C29D2-D88A-4379-9B5E-493EA72BB296/Documents/OpenEarsDynamicGrammar.arpa
    idngram2lm : Done.
    

    I believe the issue is the one warned about here:

    Warning : 2-gram : GT statistics are out of range; lowering cutoff to 6.
    Warning : 2-gram : GT statistics are out of range; lowering cutoff to 5.
    Warning : 2-gram : GT statistics are out of range; lowering cutoff to 4.
    Warning : 2-gram : GT statistics are out of range; lowering cutoff to 3.
    Warning : 2-gram : GT statistics are out of range; lowering cutoff to 2.
    Warning : 2-gram : GT statistics are out of range; lowering cutoff to 1.
    Warning : 2-gram : Discounting range of 1 is equivalent to excluding 
    singletons.
    2-gram : cutoff = 1, discounted values: 0.00
    Warning : 3-gram : GT statistics are out of range; lowering cutoff to 6.
    Warning : 3-gram : GT statistics are out of range; lowering cutoff to 5.
    Warning : 3-gram : GT statistics are out of range; lowering cutoff to 4.
    Warning : 3-gram : GT statistics are out of range; lowering cutoff to 3.
    Warning : 3-gram : GT statistics are out of range; lowering cutoff to 2.
    Warning : 3-gram : GT statistics are out of range; lowering cutoff to 1.
    Warning : 3-gram : Discounting range of 1 is equivalent to excluding 
    singletons.
    

    What happens in the end is that, as warned above, every bigram and
    trigram gets a -99.999 log probability and can never be recognized. The
    unigrams appear to be fine. Can you give me any insight into why this is
    happening and how I can fix it? Thank you.

     
  • Nickolay V. Shmyrev

    Can you give me any insight into why this is happening and how I can fix it?

    The amount of training data is not sufficient to estimate the Good-Turing
    discount parameters properly. You need to use linear smoothing or even
    absolute discounting. For a small vocabulary it's recommended to use
    JSGF; a trigram ARPA model makes little sense for a few dozen words.
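    To see where the "out of range" warnings come from: Good-Turing derives
    its discount ratios from the count-of-counts table, roughly
    d_r = (r+1)·n_{r+1} / (r·n_r), and needs every n_r in the discounting
    range to be non-zero with each ratio inside (0, 1). A simplified sketch
    (not the toolkit's exact formula), applied to the bigram count-of-counts
    from the log above:

```python
def good_turing_discounts(fof, cutoff):
    """Simplified Good-Turing discount ratios d_r = (r+1)*n_{r+1}/(r*n_r)
    for r = 1..cutoff.  None marks an unusable ("out of range") entry:
    n_r or n_{r+1} is zero, or d_r falls outside (0, 1)."""
    out = []
    for r in range(1, cutoff + 1):
        n_r, n_r1 = fof.get(r, 0), fof.get(r + 1, 0)
        if n_r == 0 or n_r1 == 0:
            out.append(None)
            continue
        d = (r + 1) * n_r1 / (r * n_r)
        out.append(d if 0.0 < d < 1.0 else None)
    return out

# Bigram count-of-counts from the log: 38 bigrams seen once, 1 seen twice
print(good_turing_discounts({1: 38, 2: 1}, 7))
# Only d_1 can be estimated; every higher range is unusable, which is
# exactly the cascade of "lowering cutoff" warnings in the log.
```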

     
  • Halle

    Halle - 2012-04-07

    Thank you, that makes sense. How many lines (or overall words) should the
    corpus have before trying to use Good-Turing?

    This is for an automatic language model generator in the framework, so there
    is currently no "automatically generate a JSGF grammar" option since it's more
    complex to turn into a simple API.

     
  • Halle

    Halle - 2012-04-09

    Hi Nickolay,

    So, switching to absolute discounting worked perfectly with the old
    corpus (linear gave the same -99.999 result). With this different small
    corpus I now get a crash:

    SUNDAY
    MONDAY
    TUESDAY
    WEDNESDAY
    THURSDAY
    FRIDAY
    SATURDAY
    QUIDNUNC
    CHANGE MODEL

    I understand that JSGF would be better, but I still need to make ARPA work.

    Here is the logging right up until the crash, thanks again for your help:

    OPENEARSLOGGING: Starting text2wfreq_impl
    2012-04-09 21:52:11.874 OpenEarsSampleApp[45790:15103] OPENEARSLOGGING: Done with text2wfreq_impl
    2012-04-09 21:52:11.875 OpenEarsSampleApp[45790:15103] OPENEARSLOGGING: Able to open /app/Documents/OpenEarsDynamicGrammar_pipe.txt for reading.
    2012-04-09 21:52:11.969 OpenEarsSampleApp[45790:15103] OPENEARSLOGGING: Able to open /app/Documents/OpenEarsDynamicGrammar.vocab for reading.
    2012-04-09 21:52:11.969 OpenEarsSampleApp[45790:15103] OPENEARSLOGGING: Starting wfreq2vocab
    wfreq2vocab : Will generate a vocabulary containing the most
                  frequent 20000 words. Reading wfreq stream from stdin...
    ## Vocab generated by v2 of the CMU-Cambridge Statistcal
    ## Language Modeling toolkit.
    ##
    ## Includes 12 words ##
    wfreq2vocab : Done.
    2012-04-09 21:52:11.970 OpenEarsSampleApp[45790:15103] OPENEARSLOGGING: Done with wfreq2vocab
    2012-04-09 21:52:11.970 OpenEarsSampleApp[45790:15103] OPENEARSLOGGING: Starting text2idngram
    text2idngram
    Vocab                  : /app/Documents/OpenEarsDynamicGrammar.vocab
    Output idngram         : /app/Documents/OpenEarsDynamicGrammar.idngram
    N-gram buffer size     : 10
    Hash table size        : 5000
    Temp directory         : /app/Documents/cmuclmtk-UnlW6q
    Max open files         : 20
    FOF size               : 10
    n                      : 3
    Initialising hash table...
    Reading vocabulary... 
    Allocating memory for the n-gram buffer...
    Reading text into the n-gram buffer...
    20,000 n-grams processed for each ".", 1,000,000 for each line.
    
    Sorting n-grams...
    Writing sorted n-grams to temporary file /app/Documents/cmuclmtk-UnlW6q/1
    Merging 1 temporary files...
    
    2-grams occurring:  N times     > N times   Sug. -spec_num value
          0                          19          29
          1                  18           1          11
          2                   0           1          11
          3                   0           1          11
          4                   0           1          11
          5                   0           1          11
          6                   0           1          11
          7                   0           1          11
          8                   1           0          10
          9                   0           0          10
         10                   0           0          10
    
    3-grams occurring:  N times     > N times   Sug. -spec_num value
          0                          26          36
          1                  26           0          10
          2                   0           0          10
          3                   0           0          10
          4                   0           0          10
          5                   0           0          10
          6                   0           0          10
          7                   0           0          10
          8                   0           0          10
          9                   0           0          10
         10                   0           0          10
    text2idngram : Done.
    2012-04-09 21:52:12.100 OpenEarsSampleApp[45790:15103] OPENEARSLOGGING: Done with text2idngram
    2012-04-09 21:52:12.106 OpenEarsSampleApp[45790:15103] OPENEARSLOGGING: Starting idngram2lm
      n : 3
      Input file : /app/Documents/OpenEarsDynamicGrammar.idngram     (binary format)
      Output files :
         ARPA format   : /app/Documents/OpenEarsDynamicGrammar.arpa
      Vocabulary file : /app/Documents/OpenEarsDynamicGrammar.vocab
      Context cues file : /app/Documents/OpenEarsDynamicGrammar.ccs
      Cutoffs :
         2-gram : 0     3-gram : 0     
      Vocabulary type : Closed
      Minimum unigram count : 0
      Zeroton fraction : 1
      Counts will be stored in two bytes.
      Count table size : 65535
      Discounting method : Absolute
      Memory allocation for tree structure : 
         Allocate 10 MB of memory, shared equally between all n-gram tables.
      Back-off weight storage : 
         Back-off weights will be stored in four bytes.
    Reading vocabulary.
    
    read_wlist_into_siht: a list of 12 words was read from "/app/Documents/OpenEarsDynamicGrammar.vocab".
    read_wlist_into_array: a list of 12 words was read from "/app/Documents/OpenEarsDynamicGrammar.vocab".
    Context cue word : <s> id = 2
    Context cue word : </s> id = 1
    Allocated space for 357142 2-grams.
    Allocated space for 833333 3-grams.
    table_size 13
    Allocated 5714272 bytes to table for 2-grams.
    Allocated (2+3333332) bytes to table for 3-grams.
    Processing id n-gram file.
    20,000 n-grams processed for each ".", 1,000,000 for each line.
    
    Calculating discounted counts.
    Absolute discounting ratios :
    1-gram : 0 0.5 0.666667 0.75 0.8  ... 
    2-gram : 0 0.5 0.666667 0.75 0.8  ... 
    3-gram : 0 0.5 0.666667 0.75 0.8  ... 
       prob[1] = 1e-99 count = 0 
       prob[2] = 1e-99 count = 0 
       prob[3] = 0 count = 1 
       prob[4] = 0 count = 1 
       prob[5] = 1e-99 count = 0 
       prob[6] = 0 count = 1 
       prob[7] = 0 count = 1 
       prob[8] = 0 count = 1 
       prob[9] = 0 count = 1 
       prob[10] = 0 count = 1 
       prob[11] = 0 count = 1 
       prob[12] = 0 count = 1 
    Unigrams's discount mass is 1 (n1/N = 1)
    1 zerotons, P(zeroton) = 1 P(singleton) = 0
    P(zeroton) was reduced to 0.0000000000 (1.000 of P(singleton))
    Unigram was renormalized to absorb a mass of 1
    prob[UNK] = 1e-99
    THE FINAL UNIGRAM:
     unigram[1]=inf
     unigram[2]=inf
     unigram[3]=nan
     unigram[4]=nan
     unigram[5]=nan
     unigram[6]=nan
     unigram[7]=nan
     unigram[8]=nan
     unigram[9]=nan
     unigram[10]=nan
     unigram[11]=nan
     unigram[12]=nan
    Calculating back-off weights...
    Warning : P( 2 ) == inf
    Error : P( 2 | ) = inf
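    A guess at the mechanics of the crash, based on the log: the absolute
    discounting ratios are (r-1)/r, so a count of 1 is discounted to 0. With
    every real word occurring exactly once, the entire unigram mass is
    discounted away ("discount mass is 1"), and the renormalization then
    divides by zero, producing the inf/nan unigrams. A minimal sketch of that
    failure mode (not the toolkit's actual code):

```python
def absolute_discount_unigrams(counts, discount=1.0):
    """Subtract a constant discount from each count and renormalize.
    If every count equals the discount, the discounted total is 0 and
    the normalization divides by zero -- the same degenerate case as
    the day-names corpus, here reported explicitly as nan."""
    discounted = [max(c - discount, 0.0) for c in counts]
    total = sum(discounted)
    if total == 0.0:
        return [float("nan")] * len(counts)
    return [d / total for d in discounted]

# Ten vocabulary words, each seen exactly once:
print(absolute_discount_unigrams([1] * 10))  # all nan
```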
    
     
