Sentence recognition

Help
Anonymous
2011-07-27
2012-09-22
  • Anonymous

    Anonymous - 2011-07-27

    Hi all,

    First of all, thanks for your work; it is awesome.

    I'm planning to implement an ASR application which does the following:

    • Control with a button when to start and when to stop listening.
    • Show a sentence to say and check whether it is repeated correctly. If it is, show it in green (for example); if it is not (how to make this decision is still to be determined), show it in red and let the user repeat the operation.
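
One possible way to make the green/red decision described above is to compare the recognized text with the target sentence word by word. The sketch below is only an illustration: the function names and the `threshold` parameter are hypothetical, and the similarity measure (Python's `difflib.SequenceMatcher`) is an assumption, not something pocketsphinx provides.

```python
import difflib


def normalize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    words = []
    for word in text.lower().split():
        word = word.strip(".,;:!?¿¡\"'")
        if word:
            words.append(word)
    return words


def sentence_score(target, hypothesis):
    """Word-level similarity in [0, 1] between the target and the recognized text."""
    matcher = difflib.SequenceMatcher(None, normalize(target), normalize(hypothesis))
    return matcher.ratio()


def colour_for(target, hypothesis, threshold=1.0):
    """Return 'green' when the score reaches the (hypothetical) threshold, else 'red'."""
    return "green" if sentence_score(target, hypothesis) >= threshold else "red"
```

With `threshold=1.0` only an exact word-for-word match is shown in green; lowering the threshold would tolerate a few recognition errors, at the cost of accepting some wrong repetitions.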

    I'll need a DLL (built with pocketsphinx and sphinxbase, I guess) to build a C#
    plugin where the GUI is handled.

    I want the application to work with both Spanish and English, and
    there will initially be about 100 sentences.

    I have read how to build a language model and I tried the online tool with a
    few sentences. That's a very easy way to create the language model and I think
    it can be enough for my purpose. Does this work if I upload a corpus in
    Spanish (or any other language)?

    About the acoustic model, can I use some generic acoustic model? I know I can
    adapt it later, but can I use some good acoustic model to test my language
    models? Where can I find some for Spanish and English? Can I use those at
    https://sourceforge.net/projects/cmusphinx/files/Acoustic%20and%20Language%20Models/
    directly?

    As for the dictionary, the best option is to use the one that corresponds to
    the language model, right?

    Thanks again for your great work,

    Regards.

     
  • Anonymous

    Anonymous - 2011-07-29

    Thanks nshmyrev, very useful information :)

    I built an LM with LMTool following the example from
    http://cmusphinx.sourceforge.net/wiki/tutoriallm

    but if I write:

    pocketsphinx_continuous -lm 8521.lm -dict 8521.dic (sic from manual)

    I get the error that the hmm is not specified. I tried with:

    pocketsphinx_continuous -lm ....\4363\4363.lm -dict ....\4363\4363.dic -hmm ....\model\hmm\en\voxforge-en-0.4\model_parameters\voxforge_en_sphinx.cd_cont_5000

    But I have this output:

    INFO: cmd_ln.c(512): Parsing command line:
    pocketsphinx_continuous \
    -lm ....\4363\4363.lm \
    -dict ....\4363\4363.dic \
    -hmm ....\model\hmm\en\voxforge-en-0.4\model_parameters\voxforge_en_sphinx.cd_cont_5000

    Current configuration:

    -adcdev
    -agc none none
    -agcthresh 2.0 2.000000e+000
    -alpha 0.97 9.700000e-001
    -argfile
    -ascale 20.0 2.000000e+001
    -backtrace no no
    -beam 1e-48 1.000000e-048
    -bestpath yes yes
    -bestpathlw 9.5 9.500000e+000
    -bghist no no
    -ceplen 13 13
    -cmn current current
    -cmninit 8.0 8.0
    -compallsen no no
    -debug 0
    -dict ....\4363\4363.dic
    -dictcase no no
    -dither no no
    -doublebw no no
    -ds 1 1
    -fdict
    -feat 1s_c_d_dd 1s_c_d_dd
    -featparams
    -fillprob 1e-8 1.000000e-008
    -frate 100 100
    -fsg
    -fsgusealtpron yes yes
    -fsgusefiller yes yes
    -fwdflat yes yes
    -fwdflatbeam 1e-64 1.000000e-064
    -fwdflatefwid 4 4
    -fwdflatlw 8.5 8.500000e+000
    -fwdflatsfwin 25 25
    -fwdflatwbeam 7e-29 7.000000e-029
    -fwdtree yes yes
    -hmm ....\model\hmm\en\voxforge-en-0.4\model_parameters\voxforge_en_sphinx.cd_cont_5000
    -input_endian little little
    -jsgf
    -kdmaxbbi -1 -1
    -kdmaxdepth 0 0
    -kdtree
    -latsize 5000 5000
    -lda
    -ldadim 0 0
    -lextreedump 0 0
    -lifter 0 0
    -lm ....\4363\4363.lm
    -lmctl
    -lmname default default
    -logbase 1.0001 1.000100e+000
    -logfn
    -logspec no no
    -lowerf 133.33334 1.333333e+002
    -lpbeam 1e-40 1.000000e-040
    -lponlybeam 7e-29 7.000000e-029
    -lw 6.5 6.500000e+000
    -maxhmmpf -1 -1
    -maxnewoov 20 20
    -maxwpf -1 -1
    -mdef
    -mean
    -mfclogdir
    -mixw
    -mixwfloor 0.0000001 1.000000e-007
    -mllr
    -mmap yes yes
    -ncep 13 13
    -nfft 512 512
    -nfilt 40 40
    -nwpen 1.0 1.000000e+000
    -pbeam 1e-48 1.000000e-048
    -pip 1.0 1.000000e+000
    -pl_beam 1e-10 1.000000e-010
    -pl_pbeam 1e-5 1.000000e-005
    -pl_window 0 0
    -rawlogdir
    -remove_dc no no
    -round_filters yes yes
    -samprate 16000 1.600000e+004
    -seed -1 -1
    -sendump
    -senmgau
    -silprob 0.005 5.000000e-003
    -smoothspec no no
    -svspec
    -tmat
    -tmatfloor 0.0001 1.000000e-004
    -topn 4 4
    -topn_beam 0 0
    -toprule
    -transform legacy legacy
    -unit_area yes yes
    -upperf 6855.4976 6.855498e+003
    -usewdphones no no
    -uw 1.0 1.000000e+000
    -var
    -varfloor 0.0001 1.000000e-004
    -varnorm no no
    -verbose no no
    -warp_params
    -warp_type inverse_linear inverse_linear
    -wbeam 7e-29 7.000000e-029
    -wip 0.65 6.500000e-001
    -wlen 0.025625 2.562500e-002

    INFO: cmd_ln.c(512): Parsing command line:
    \
    -alpha 0.97 \
    -dither yes \
    -doublebw no \
    -nfilt 40 \
    -ncep 13 \
    -lowerf 133.333334 \
    -upperf 6855.4976 \
    -nfft 512 \
    -wlen 0.025625 \
    -transform legacy \
    -feat 1s_c_d_dd \
    -agc none \
    -cmn current \
    -varnorm no

    Current configuration:

    -agc none none
    -agcthresh 2.0 2.000000e+000
    -alpha 0.97 9.700000e-001
    -ceplen 13 13
    -cmn current current
    -cmninit 8.0 8.0
    -dither no yes
    -doublebw no no
    -feat 1s_c_d_dd 1s_c_d_dd
    -frate 100 100
    -input_endian little little
    -lda ....\model\hmm\en\voxforge-en-0.4\model_parameters\voxforge_en_sphinx.cd_cont_5000/feature_transform
    -ldadim 0 0
    -lifter 0 0
    -logspec no no
    -lowerf 133.33334 1.333333e+002
    -ncep 13 13
    -nfft 512 512
    -nfilt 40 40
    -remove_dc no no
    -round_filters yes yes
    -samprate 16000 1.600000e+004
    -seed -1 -1
    -smoothspec no no
    -svspec
    -transform legacy legacy
    -unit_area yes yes
    -upperf 6855.4976 6.855498e+003
    -varnorm no no
    -verbose no no
    -warp_params
    -warp_type inverse_linear inverse_linear
    -wlen 0.025625 2.562500e-002

    INFO: acmod.c(238): Parsed model-specific feature parameters from ....\model\hmm\en\voxforge-en-0.4\model_parameters\voxforge_en_sphinx.cd_cont_5000/feat.params
    INFO: fe_interface.c(288): You are using the internal mechanism to generate the seed.
    INFO: feat.c(848): Initializing feature stream to type: '1s_c_d_dd', ceplen=13, CMN='current', VARNORM='no', AGC='none'
    INFO: cmn.c(142): mean= 12.00, mean= 0.0
    INFO: acmod.c(153): Reading linear feature transformation from ....\model\hmm\en\voxforge-en-0.4\model_parameters\voxforge_en_sphinx.cd_cont_5000/feature_transform
    INFO: mdef.c(520): Reading model definition: ....\model\hmm\en\voxforge-en-0.4\model_parameters\voxforge_en_sphinx.cd_cont_5000/mdef
    INFO: bin_mdef.c(173): Allocating 104810 * 8 bytes (818 KiB) for CD tree
    INFO: tmat.c(205): Reading HMM transition probability matrices: ....\model\hmm\en\voxforge-en-0.4\model_parameters\voxforge_en_sphinx.cd_cont_5000/transition_matrices
    INFO: acmod.c(117): Attempting to use SCHMM computation module
    INFO: ms_gauden.c(198): Reading mixture gaussian parameter: ....\model\hmm\en\voxforge-en-0.4\model_parameters\voxforge_en_sphinx.cd_cont_5000/means
    INFO: ms_gauden.c(292): 5120 codebook, 1 feature, size 16x29
    INFO: ms_gauden.c(198): Reading mixture gaussian parameter: ....\model\hmm\en\voxforge-en-0.4\model_parameters\voxforge_en_sphinx.cd_cont_5000/variances
    INFO: ms_gauden.c(292): 5120 codebook, 1 feature, size 16x29
    INFO: ms_gauden.c(356): 175 variance values floored
    INFO: acmod.c(119): Attempting to use PTHMM computation module
    INFO: ms_gauden.c(198): Reading mixture gaussian parameter: ....\model\hmm\en\voxforge-en-0.4\model_parameters\voxforge_en_sphinx.cd_cont_5000/means
    INFO: ms_gauden.c(292): 5120 codebook, 1 feature, size 16x29
    INFO: ms_gauden.c(198): Reading mixture gaussian parameter: ....\model\hmm\en\voxforge-en-0.4\model_parameters\voxforge_en_sphinx.cd_cont_5000/variances
    INFO: ms_gauden.c(292): 5120 codebook, 1 feature, size 16x29
    INFO: ms_gauden.c(356): 175 variance values floored
    ERROR: "ptm_mgau.c", line 801: Number of codebooks exceeds 256: 5120
    INFO: acmod.c(121): Falling back to general multi-stream GMM computation
    INFO: ms_gauden.c(198): Reading mixture gaussian parameter: ....\model\hmm\en\voxforge-en-0.4\model_parameters\voxforge_en_sphinx.cd_cont_5000/means
    INFO: ms_gauden.c(292): 5120 codebook, 1 feature, size 16x29
    INFO: ms_gauden.c(198): Reading mixture gaussian parameter: ....\model\hmm\en\voxforge-en-0.4\model_parameters\voxforge_en_sphinx.cd_cont_5000/variances
    INFO: ms_gauden.c(292): 5120 codebook, 1 feature, size 16x29
    INFO: ms_gauden.c(356): 175 variance values floored
    INFO: ms_senone.c(160): Reading senone mixture weights: ....\model\hmm\en\voxforge-en-0.4\model_parameters\voxforge_en_sphinx.cd_cont_5000/mixture_weights
    INFO: ms_senone.c(211): Truncating senone logs3(pdf) values by 10 bits
    INFO: ms_senone.c(218): Not transposing mixture weights in memory
    INFO: ms_senone.c(277): Read mixture weights for 5120 senones: 1 features x 16 codewords
    INFO: ms_senone.c(331): Mapping senones to individual codebooks
    INFO: ms_mgau.c(123): The value of topn: 4
    INFO: dict.c(294): Allocating 4114 * 20 bytes (80 KiB) for word entries
    INFO: dict.c(306): Reading main dictionary: ....\4363\4363.dic
    INFO: dict.c(206): Allocated 0 KiB for strings, 0 KiB for phones
    INFO: dict.c(309): 15 words read
    INFO: dict.c(314): Reading filler dictionary: ....\model\hmm\en\voxforge-en-0.4\model_parameters\voxforge_en_sphinx.cd_cont_5000/noisedict
    INFO: dict.c(206): Allocated 0 KiB for strings, 0 KiB for phones
    INFO: dict.c(317): 3 words read
    INFO: dict2pid.c(396): Building PID tables for dictionary
    INFO: dict2pid.c(405): Allocating 40^3 * 2 bytes (125 KiB) for word-initial triphones
    INFO: dict2pid.c(131): Allocated 19360 bytes (18 KiB) for word-final triphones
    INFO: dict2pid.c(195): Allocated 19360 bytes (18 KiB) for single-phone word triphones
    INFO: ngram_model_arpa.c(476): ngrams 1=13, 2=18, 3=13
    INFO: ngram_model_arpa.c(135): Reading unigrams
    INFO: ngram_model_arpa.c(515): 13 = #unigrams created
    INFO: ngram_model_arpa.c(194): Reading bigrams
    INFO: ngram_model_arpa.c(531): 18 = #bigrams created
    INFO: ngram_model_arpa.c(532): 5 = #prob2 entries
    INFO: ngram_model_arpa.c(539): 3 = #bo_wt2 entries
    INFO: ngram_model_arpa.c(291): Reading trigrams
    INFO: ngram_model_arpa.c(552): 13 = #trigrams created
    INFO: ngram_model_arpa.c(553): 3 = #prob3 entries
    INFO: ngram_search_fwdtree.c(99): 13 unique initial diphones
    INFO: ngram_search_fwdtree.c(147): 0 root, 0 non-root channels, 4 single-phone words
    INFO: ngram_search_fwdtree.c(186): Creating search tree
    INFO: ngram_search_fwdtree.c(191): before: 0 root, 0 non-root channels, 4 single-phone words
    INFO: ngram_search_fwdtree.c(324): after: max nonroot chan increased to 160
    INFO: ngram_search_fwdtree.c(333): after: 13 root, 32 non-root channels, 3 single-phone words
    INFO: ngram_search_fwdflat.c(153): fwdflat: min_ef_width = 4, max_sf_win = 25
    Allocating 32 buffers of 2500 samples each
    INFO: continuous.c(261): pocketsphinx_continuous COMPILED ON: Jul 26 2011, AT: 09:20:22

    FATAL_ERROR: "continuous.c", line 135: cont_ad_calib failed

    So, are the LM and DICT independent of the acoustic model or not? What am I
    doing wrong?

    If I build my own LM and DICT, can't I use the acoustic models from SourceForge?

    This is my corpus (tutorial):

    open browser
    new e-mail
    forward
    backward
    next window
    last window
    open music player

    Thanks again.

    Regards

     
  • Nickolay V. Shmyrev

    So, is it LM and DICT independent from Acoustic model or not?

    No, they are not independent. The set of words in the language model must match
    the set of words in the dictionary. The set of phones in the dictionary must
    match the set of phones in the acoustic model.
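
The first of these two constraints (every word in the language model must be in the dictionary) can be checked mechanically before running the decoder. The following is a minimal sketch, assuming an ARPA-format LM and a plain-text Sphinx dictionary with one `WORD PH1 PH2 ...` entry per line; the function names are hypothetical and the parsing is deliberately simplified. Checking the phone set against the acoustic model would need the model's own phone list and is not attempted here.

```python
def arpa_unigrams(lm_text):
    """Collect the words listed in the 1-grams section of an ARPA LM."""
    words, in_unigrams = set(), False
    for line in lm_text.splitlines():
        line = line.strip()
        if line.startswith("\\"):
            # Section markers look like \data\, \1-grams:, \2-grams:, \end\
            in_unigrams = (line == "\\1-grams:")
            continue
        if in_unigrams and line:
            fields = line.split()
            if len(fields) >= 2:
                words.add(fields[1])  # fields: log-prob, word, optional back-off
    return words


def dict_words(dic_text):
    """Collect headwords from a dictionary; WORD(2) is treated as WORD."""
    words = set()
    for line in dic_text.splitlines():
        fields = line.split()
        if fields:
            words.add(fields[0].split("(")[0])
    return words


def missing_from_dict(lm_text, dic_text):
    """Words that are in the LM but have no pronunciation in the dictionary."""
    special = {"<s>", "</s>", "<UNK>"}  # LM markers, not dictionary entries
    return sorted(arpa_unigrams(lm_text) - dict_words(dic_text) - special)
```

An empty result means the vocabulary side is consistent; any word it returns would need a pronunciation added to the dictionary.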

    What I'm doing wrong?

    You are doing almost everything correctly. The error message (cont_ad_calib
    failed) says that it fails to record audio from the microphone. Probably your
    input is muted in the volume settings.

     
  • Anonymous

    Anonymous - 2011-07-29

    Yes, it was a microphone problem. Thanks :D

    If I want to build an LM in Spanish, I guess the online LMTool is not valid, is
    it? Do I have to use text2wfreq, text2idngram, idngram2lm and
    sphinx_lm_convert? Can I use them on Windows? I'll try tomorrow.

    I downloaded the VoxForge Spanish model but I cannot see the dictionary you told
    me was included. Is it that noisedict file?

    Thanks once again and best regards.

     
  • Nickolay V. Shmyrev

    I guess online LMTool is not valid.... is it?

    Yes

    Do I have to use text2wfreq & text2idngram & idngram2lm & sphinx_lm_convert
    ¿?

    Yes

    Can I use them in Windows? I'll try tomorrow.

    You can.

    but I cannot see the dictionary you told me was included.

    etc/voxforge_es_sphinx.dic

     
  • Anonymous

    Anonymous - 2011-08-02

    I found the Spanish dictionary and also the VoxForge English dictionary
    (cmudict.07a). I tried the previous example with the English dictionary and it
    works, but not when I try the following example:

    pocketsphinx_continuous.exe -dict ....\8237ES\8237.dic -lm ....\8237ES\8237.lm -hmm ....\model\hmm\es\voxforge-es-0.1.1\model_parameters\voxforge_es_sphinx.cd_cont_1500

    where the dict and lm contain some basic Spanish commands and were created with LMTool.

    Here is the output:

    INFO: cmd_ln.c(512): Parsing command line:
    pocketsphinx_continuous.exe \
    -dict ....\8237ES\8237.dic \
    -lm ....\8237ES\8237.lm \
    -hmm ....\model\hmm\es\voxforge-es-0.1.1\model_parameters\voxforge_es_sphinx.cd_cont_1500

    Current configuration:

    -adcdev
    -agc none none
    -agcthresh 2.0 2.000000e+000
    -alpha 0.97 9.700000e-001
    -argfile
    -ascale 20.0 2.000000e+001
    -backtrace no no
    -beam 1e-48 1.000000e-048
    -bestpath yes yes
    -bestpathlw 9.5 9.500000e+000
    -bghist no no
    -ceplen 13 13
    -cmn current current
    -cmninit 8.0 8.0
    -compallsen no no
    -debug 0
    -dict ....\8237ES\8237.dic
    -dictcase no no
    -dither no no
    -doublebw no no
    -ds 1 1
    -fdict
    -feat 1s_c_d_dd 1s_c_d_dd
    -featparams
    -fillprob 1e-8 1.000000e-008
    -frate 100 100
    -fsg
    -fsgusealtpron yes yes
    -fsgusefiller yes yes
    -fwdflat yes yes
    -fwdflatbeam 1e-64 1.000000e-064
    -fwdflatefwid 4 4
    -fwdflatlw 8.5 8.500000e+000
    -fwdflatsfwin 25 25
    -fwdflatwbeam 7e-29 7.000000e-029
    -fwdtree yes yes
    -hmm ....\model\hmm\es\voxforge-es-0.1.1\model_parameters\voxforge_es_sphinx.cd_cont_1500
    -input_endian little little
    -jsgf
    -kdmaxbbi -1 -1
    -kdmaxdepth 0 0
    -kdtree
    -latsize 5000 5000
    -lda
    -ldadim 0 0
    -lextreedump 0 0
    -lifter 0 0
    -lm ....\8237ES\8237.lm
    -lmctl
    -lmname default default
    -logbase 1.0001 1.000100e+000
    -logfn
    -logspec no no
    -lowerf 133.33334 1.333333e+002
    -lpbeam 1e-40 1.000000e-040
    -lponlybeam 7e-29 7.000000e-029
    -lw 6.5 6.500000e+000
    -maxhmmpf -1 -1
    -maxnewoov 20 20
    -maxwpf -1 -1
    -mdef
    -mean
    -mfclogdir
    -mixw
    -mixwfloor 0.0000001 1.000000e-007
    -mllr
    -mmap yes yes
    -ncep 13 13
    -nfft 512 512
    -nfilt 40 40
    -nwpen 1.0 1.000000e+000
    -pbeam 1e-48 1.000000e-048
    -pip 1.0 1.000000e+000
    -pl_beam 1e-10 1.000000e-010
    -pl_pbeam 1e-5 1.000000e-005
    -pl_window 0 0
    -rawlogdir
    -remove_dc no no
    -round_filters yes yes
    -samprate 16000 1.600000e+004
    -seed -1 -1
    -sendump
    -senmgau
    -silprob 0.005 5.000000e-003
    -smoothspec no no
    -svspec
    -tmat
    -tmatfloor 0.0001 1.000000e-004
    -topn 4 4
    -topn_beam 0 0
    -toprule
    -transform legacy legacy
    -unit_area yes yes
    -upperf 6855.4976 6.855498e+003
    -usewdphones no no
    -uw 1.0 1.000000e+000
    -var
    -varfloor 0.0001 1.000000e-004
    -varnorm no no
    -verbose no no
    -warp_params
    -warp_type inverse_linear inverse_linear
    -wbeam 7e-29 7.000000e-029
    -wip 0.65 6.500000e-001
    -wlen 0.025625 2.562500e-002

    INFO: cmd_ln.c(512): Parsing command line:
    \
    -alpha 0.97 \
    -dither yes \
    -doublebw no \
    -nfilt 32 \
    -ncep 13 \
    -lowerf 200 \
    -upperf 3500 \
    -nfft 256 \
    -wlen 0.0256 \
    -transform legacy \
    -feat 1s_c_d_dd \
    -agc none \
    -cmn current \
    -varnorm no

    Current configuration:

    -agc none none
    -agcthresh 2.0 2.000000e+000
    -alpha 0.97 9.700000e-001
    -ceplen 13 13
    -cmn current current
    -cmninit 8.0 8.0
    -dither no yes
    -doublebw no no
    -feat 1s_c_d_dd 1s_c_d_dd
    -frate 100 100
    -input_endian little little
    -lda ....\model\hmm\es\voxforge-es-0.1.1\model_parameters\voxforge_es_sphinx.cd_cont_1500/feature_transform
    -ldadim 0 0
    -lifter 0 0
    -logspec no no
    -lowerf 133.33334 2.000000e+002
    -ncep 13 13
    -nfft 512 256
    -nfilt 40 32
    -remove_dc no no
    -round_filters yes yes
    -samprate 16000 1.600000e+004
    -seed -1 -1
    -smoothspec no no
    -svspec
    -transform legacy legacy
    -unit_area yes yes
    -upperf 6855.4976 3.500000e+003
    -varnorm no no
    -verbose no no
    -warp_params
    -warp_type inverse_linear inverse_linear
    -wlen 0.025625 2.560000e-002

    INFO: acmod.c(238): Parsed model-specific feature parameters from ....\model\hmm\es\voxforge-es-0.1.1\model_parameters\voxforge_es_sphinx.cd_cont_1500/feat.params
    ERROR: "fe_interface.c", line 100: FFT: Number of points must be greater or equal to frame size (409 samples)

    The same happens with:

    pocketsphinx_continuous.exe -lm ....\8237ES\8237.lm -hmm ....\model\hmm\es\voxforge-es-0.1.1\model_parameters\voxforge_es_sphinx.cd_cont_1500 -dict ....\model\hmm\es\voxforge-es-0.1.1\etc\voxforge_es_sphinx.dic

    What can the problem be?

    On the other hand, I'm trying to build a statistical language model with
    text2wfreq, text2idngram, idngram2lm and sphinx_lm_convert. I downloaded
    pocketsphinx 0.7 (Windows) and I found sphinx_lm_convert in the sphinxbase
    project, but I cannot find the others. Do I have to download a different
    package? Where can I find them?

    Many thanks and best regards :)

     
  • Anonymous

    Anonymous - 2011-08-02

    Ok, samprate solved the problem :)
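
For later readers, the arithmetic behind the FFT error can be sketched as follows. This assumes the frame size is the window length multiplied by the sample rate and truncated to an integer, which matches the 409 samples reported in the log; the fixed sample rate of 8000 is an assumption (the post does not state which value was used), suggested by the model's own settings of -nfft 256 and -upperf 3500. The function names are hypothetical.

```python
def frame_size(wlen_seconds, samprate_hz):
    """Samples per analysis window (truncated, as the logged value suggests)."""
    return int(wlen_seconds * samprate_hz)


def fft_is_large_enough(nfft, wlen_seconds, samprate_hz):
    """The front end needs at least as many FFT points as samples per frame."""
    return nfft >= frame_size(wlen_seconds, samprate_hz)


# Values from the Spanish model's feat.params in the log above.
WLEN, NFFT = 0.0256, 256

for rate in (16000, 8000):
    print(rate, frame_size(WLEN, rate), fft_is_large_enough(NFFT, WLEN, rate))
```

At the default -samprate 16000 the frame is larger than the 256-point FFT, so initialization fails; at 8000 the frame fits.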

    I downloaded cmuclmtk 0.7 but it doesn't work (Win XP). I couldn't open
    pocketsphinx 0.7 (Visual Studio 2008), so I compiled the previous snapshot (0.6).

    The error is that it doesn't find MSVCR100.dll ... :?

    So I decided to download the previous version, and:

    text2wfreq < weather.txt | wfreq2vocab > weather.tmp.vocab
    text2wfreq : Reading text from standard input...
    wfreq2vocab : Will generate a vocabulary containing the most
    frequent 20000 words. Reading wfreq stream from stdin...
    text2wfreq : Done.
    wfreq2vocab : Done.

    But then I cannot generate the ARPA-format LM because "text2idngram -vocab
    weather.vocab -idngram weather.idngram < weather.closed.txt" doesn't find some
    file (I think).

    If I look at what text2wfreq generated, I can only see the
    weather.tmp.vocab file. I also renamed it to weather.vocab, but the
    idngram and closed files are missing.

    Maybe I'm not interpreting the manual correctly.

    Last question: I tried LMTool with Spanish, and I think the .dic it
    generates cannot be used with that LM. Or at least it doesn't work for me,
    maybe because the words are pronounced as if they were English? Using the
    VoxForge dictionary works well :)

     
  • Nickolay V. Shmyrev

    The error is that it doesn't find MSVCR100.dll ... :?

    This dll is part of VS 2010

    But then I cannot generate the arpa format LM because "text2idngram -vocab
    weather.vocab -idngram weather.idngram < weather.closed.txt" doesn't find some
    file (i think).

    In your case the command would be

    text2idngram -vocab weather.tmp.vocab -idngram weather.idngram < weather.txt
    

    That command will create weather.idngram.
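
Putting the steps from this thread together, the whole sequence can be sketched as a list of command lines. This only builds the strings and does not run anything; the flags are the ones quoted in this thread for the newer cmuclmtk, the `sphinx_lm_convert -i ... -o ...` step and the `.lm.DMP` output name follow the tutorial as far as I recall, and the helper name `lm_pipeline` is hypothetical. Older toolkit versions may not accept `-idngram` on text2idngram.

```python
def lm_pipeline(corpus, stem):
    """Return the shell commands that turn a text corpus into a Sphinx LM."""
    vocab = f"{stem}.tmp.vocab"
    idngram = f"{stem}.idngram"
    arpa = f"{stem}.arpa"
    return [
        # 1. Word frequencies -> vocabulary file
        f"text2wfreq < {corpus} | wfreq2vocab > {vocab}",
        # 2. Corpus + vocabulary -> binary id n-gram file
        f"text2idngram -vocab {vocab} -idngram {idngram} < {corpus}",
        # 3. Id n-grams -> ARPA language model (closed vocabulary)
        f"idngram2lm -vocab_type 0 -idngram {idngram} -vocab {vocab} -arpa {arpa}",
        # 4. ARPA -> binary format for the decoder (assumed output name)
        f"sphinx_lm_convert -i {arpa} -o {stem}.lm.DMP",
    ]


for command in lm_pipeline("weather.txt", "weather"):
    print(command)
```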

    tried LMTool with spanish, and I think that the .dic generated cannot be
    used with that LM..

    This is correct

     
  • Anonymous

    Anonymous - 2011-08-03

    Hi again,

    I think I have to change something else...

    text2idngram -vocab weather.tmp.vocab -idngram weather.idngram < weather.txt
    text2idngram
    Error : Unknown (or unprocessed) command line options:
    -idngram weather.idngram
    Rerun with the -help option for more information.
    

    I used the CMU-Cambridge Statistical Language Modeling Toolkit v2 documentation
    and changed your command to:

    text2idngram.exe -vocab weather.tmp.vocab < weather.txt > weather.idngram
    
    
    text2idngram
    Vocab                  : weather.tmp.vocab
    N-gram buffer size     : 100
    Hash table size        : 2000000
    Temp directory         : /usr/tmp/
    Max open files         : 20
    FOF size               : 10
    n                      : 3
    Initialising hash table...
    Reading vocabulary...
    Allocating memory for the n-gram buffer...
    Reading text into the n-gram buffer...
    20,000 n-grams processed for each ".", 1,000,000 for each line.
    
    Sorting n-grams...
    Writing sorted n-grams to temporary file C:\DOCUME~1\enne\CONFIG~1\Temp\text2idngram.temp.21
    Merging 1 temporary files...
    
    2-grams occurring:      N times         > N times       Sug. -spec_num value
          0                                              75              85
          1                              69               6              16
          2                               5               1              11
          3                               1               0              10
          4                               0               0              10
          5                               0               0              10
          6                               0               0              10
          7                               0               0              10
          8                               0               0              10
          9                               0               0              10
         10                               0               0              10
    
    3-grams occurring:      N times         > N times       Sug. -spec_num value
          0                                              80              90
          1                              78               2              12
          2                               2               0              10
          3                               0               0              10
          4                               0               0              10
          5                               0               0              10
          6                               0               0              10
          7                               0               0              10
          8                               0               0              10
          9                               0               0              10
         10                               0               0              10
    text2idngram : Done.
    

    Which looks better... but when I try to create the LM:

    idngram2lm.exe -vocab_type 0 -idngram weather.idngram -vocab weather.tmp.vocab -arpa weather.arpa
    n : 3
    Input file : weather.idngram (binary format)
    Output files :
    ARPA format : weather.arpa
    Vocabulary file : weather.tmp.vocab
    Cutoffs :
    2-gram : 0 3-gram : 0
    Vocabulary type : Closed
    Minimum unigram count : 0
    Zeroton fraction : 1
    Counts will be stored in two bytes.
    Count table size : 65535
    Discounting method : Good-Turing
    Discounting ranges :
    1-gram : 1 2-gram : 7 3-gram : 7
    Memory allocation for tree structure :
    Allocate 100 MB of memory, shared equally between all n-gram tables.
    Back-off weight storage :
    Back-off weights will be stored in four bytes.
    Reading vocabulary.

    read_wlist_into_siht: a list of 58 words was read from "weather.tmp.vocab".
    read_wlist_into_array: a list of 58 words was read from "weather.tmp.vocab".
    WARNING: appears as a vocabulary item, but is not labelled as a
    context cue.
    Allocated space for 5000000 2-grams.
    Allocated space for 12500000 3-grams.
    table_size 59
    Allocated 60000000 bytes to table for 2-grams.
    Allocated (2+25000000) bytes to table for 3-grams.
    Processing id n-gram file.
    20,000 n-grams processed for each ".", 1,000,000 for each line.
    Error : n-grams are not correctly ordered. Error occurred at ngram 19.

    Maybe text2idngram is wrong :?

    By the way, will I have problems when I repeat the process in Spanish? Do
    you know if accents cause problems? I mean symbols like ´, `, ¨ (e.g.
    camión, pingüino).

    Regards and thanks :)

     
  • Nickolay V. Shmyrev

    Unknown (or unprocessed) command line options:
    -idngram weather.idngram

    Looks like you are using an obsolete version. Maybe you want to try the latest one.

    By the way, will I have some problems when I repeat the process in spanish?
    

    Who knows

    Do you know if accents have problems?

    Accents are supported

     
  • Anonymous

    Anonymous - 2011-08-04

    Well, as I told you, I'm using the previous version because the latest one needs
    MSVCR100.dll:

    idngram2lm.exe -version
    idngram2lm.exe from the CMU-Cambridge SLM Toolkit, Version 3 alpha

    Anyway, I think it should be possible to continue with the previous version,
    without installing VS 2010. Only the last step (idngram2lm) fails.

    Or maybe the problem is in text2idngram, as I suggested. I don't know.

    Any idea?

    Thanks

     
  • Anonymous

    Anonymous - 2011-08-08

    Should I open a new thread with this last question? If so, I'll do it, but
    before that I prefer to ask it here...

    Thanks :)

     
  • Anonymous

    Anonymous - 2011-08-11

    Finally I installed VS 2010 and the tools work.

    For sentence recognition, which is the best way to build an LM? I tried the CMU
    tools with some sentences (I also tried the weather example with a few
    sentences) and the result is that it doesn't recognize any words.

    Maybe that is because the corpus is too small to build an LM that way? I tried
    a similar (few sentences) LM with the online LMTool and it seems to work better.

    If I build an LM, what level of recognition can I achieve? From what I have
    seen, it is word-level recognition. I mean, if I build an LM from sentences and
    I say a sentence with its words in a different order, it is still recognized.
    I thought sentence-level recognition was possible.

    So, what do the <s> and </s> tags do? I read the tutorial, the FAQ and some
    online information from different websites, and I think I am missing some
    basic information.

    Any guidelines? Which kind of LM do I need for my purpose?

    Many thanks and best regards.

     
  • Nickolay V. Shmyrev

    if I build a LM with sentences, and I say a sentence but in a different
    order than I have, it's also recognized.

    If you need a fixed word order, you can use a finite state grammar in JSGF
    format. The tutorial describes that.
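
As an illustration of the grammar approach, a fixed list of sentences can be turned into a small JSGF grammar in which each sentence is one alternative, so the words are only accepted in the given order. This is a minimal sketch: the grammar name, the rule name `<sentence>` and the helper `jsgf_from_sentences` are hypothetical, and the sentences are taken from the corpus posted earlier in this thread.

```python
def jsgf_from_sentences(sentences, grammar_name="sentences"):
    """Build a JSGF grammar whose public rule accepts exactly the given sentences."""
    alternatives = " | ".join(s.strip() for s in sentences if s.strip())
    return (
        "#JSGF V1.0;\n"
        f"grammar {grammar_name};\n"
        f"public <sentence> = {alternatives};\n"
    )


corpus = [
    "open browser",
    "new e-mail",
    "forward",
    "backward",
    "next window",
    "last window",
    "open music player",
]
print(jsgf_from_sentences(corpus))
```

The resulting text would be saved to a file and passed to the decoder through the -jsgf option listed in the configuration dump above, in place of -lm.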

    So, the <s> and </s> tags, what do they do?

    They mark the start and the end of a sentence.
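
For a statistical LM built with the CMU toolkit, this means each line of the training text is normally wrapped in these markers before counting n-grams. A minimal sketch, assuming one sentence per input line (the helper name `mark_sentences` is hypothetical):

```python
def mark_sentences(lines):
    """Wrap each non-empty line in <s> ... </s> sentence boundary markers."""
    return [f"<s> {line.strip()} </s>" for line in lines if line.strip()]


for marked in mark_sentences(["open browser", "next window", ""]):
    print(marked)
```

The markers let the model learn which words tend to begin and end a sentence, but an n-gram model built this way still does not enforce the word order of whole sentences.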

     
