Android tablet does not recognize the voice

Biet Hoang
2011-09-14
2012-09-22
  • Biet Hoang

    Biet Hoang - 2011-09-14

    Hi Nick,
    I am taking your advice to create a new thread, even though I have the same
    problem as in the thread
    https://sourceforge.net/projects/cmusphinx/forums/forum/5471/topic/4553620
    (how to build acoustic model).
    I am creating an Android application that needs a small vocabulary (~100
    words), but to keep things simple I started with the 10 digits (0-9).

    Following your advice in the forum thread linked above, here is what I came
    up with.

     
  • Biet Hoang

    Biet Hoang - 2011-09-14

    digit.dic:

    EIGHT EY T
    FIVE F AY V
    FOUR F AO R
    NINE N AY N
    ONE W AH N
    SEVEN S EH V AH N
    SIX S IH K S
    THREE TH R IY
    TWO T UW
    ZERO Z IH R OW
    ZERO(2) Z IY R OW

     
  • Biet Hoang

    Biet Hoang - 2011-09-14

    digit.phone:

    EH
    EY
    F
    IH
    IY
    K
    N
    OW
    R
    S
    SIL
    T
    TH
    UW
    V
    W
    Z
    AH
    AO
    AY
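A quick consistency check between the two files above can be sketched in Python. This is only an illustration, not part of SphinxTrain: the function names are made up here, and the data is copied inline from digit.dic and digit.phone as posted. It reports phones the dictionary uses that the phone list lacks, and listed phones that no word uses.

```python
# Sketch: compare the phones used in digit.dic with the digit.phone list.
# Data is pasted inline from the posts above; function names are illustrative.

DIGIT_DIC = """\
EIGHT EY T
FIVE F AY V
FOUR F AO R
NINE N AY N
ONE W AH N
SEVEN S EH V AH N
SIX S IH K S
THREE TH R IY
TWO T UW
ZERO Z IH R OW
ZERO(2) Z IY R OW
"""

DIGIT_PHONE = "EH EY F IH IY K N OW R S SIL T TH UW V W Z AH AO AY"


def phones_in_dictionary(dict_text):
    """Collect every phone that appears after the word on each line."""
    phones = set()
    for line in dict_text.splitlines():
        fields = line.split()
        if len(fields) >= 2:  # a word followed by at least one phone
            phones.update(fields[1:])
    return phones


def compare_phone_sets(dict_text, phone_text):
    """Return (missing, unused) as sorted lists."""
    used = phones_in_dictionary(dict_text)
    listed = set(phone_text.split())
    return sorted(used - listed), sorted(listed - used)


missing, unused = compare_phone_sets(DIGIT_DIC, DIGIT_PHONE)
print("used by dictionary but not listed:", missing)
print("listed but not used by any word:", unused)
```

For the files as posted, nothing is missing and only SIL is unused by the words, which is expected because SIL is the silence phone.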

     
  • Biet Hoang

    Biet Hoang - 2011-09-14

    digit_train.fileids:

    speaker1/spk1_one
    speaker1/spk1_two
    speaker1/spk1_three
    speaker1/spk1_four
    speaker1/spk1_five
    speaker1/spk1_six
    speaker1/spk1_seven
    speaker1/spk1_eight
    speaker1/spk1_nine
    speaker1/spk1_zero
    speaker2/spk2_one
    speaker2/spk2_two
    speaker2/spk2_three
    speaker2/spk2_four
    speaker2/spk2_five
    speaker2/spk2_six
    speaker2/spk2_seven
    speaker2/spk2_eight
    speaker2/spk2_nine
    speaker2/spk2_zero
    speaker3/spk3_one
    speaker3/spk3_two
    speaker3/spk3_three
    speaker3/spk3_four
    speaker3/spk3_five
    speaker3/spk3_six
    speaker3/spk3_seven
    speaker3/spk3_eight
    speaker3/spk3_nine
    speaker3/spk3_zero

     
  • Biet Hoang

    Biet Hoang - 2011-09-14

    digit_train.transcription:
    ONE (spk1_one)
    TWO (spk1_two)
    THREE (spk1_three)
    FOUR (spk1_four)
    FIVE (spk1_five)
    SIX (spk1_six)
    SEVEN (spk1_seven)
    EIGHT (spk1_eight)
    NINE (spk1_nine)
    ZERO (spk1_zero)
    ONE (spk2_one)
    TWO (spk2_two)
    THREE (spk2_three)
    FOUR (spk2_four)
    FIVE (spk2_five)
    SIX (spk2_six)
    SEVEN (spk2_seven)
    EIGHT (spk2_eight)
    NINE (spk2_nine)
    ZERO (spk2_zero)
    ONE (spk3_one)
    TWO (spk3_two)
    THREE (spk3_three)
    FOUR (spk3_four)
    FIVE (spk3_five)
    SIX (spk3_six)
    SEVEN (spk3_seven)
    EIGHT (spk3_eight)
    NINE (spk3_nine)
    ZERO (spk3_zero)
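The control file and the transcription file have to describe the same utterances. A minimal Python sketch of that check follows. It assumes that the two files should list utterances in the same order and that the id in parentheses equals the last path component of the fileids entry; both assumptions are inferred from the listings above, and the function names are illustrative only.

```python
import re

# Sketch: check that digit_train.fileids and digit_train.transcription line up.
# Assumption: same order in both files, and "(utt_id)" equals the file's base name.

UTT_ID = re.compile(r"\(([^()]+)\)\s*$")


def utterance_ids(transcription_lines):
    """Extract the id in trailing parentheses from each transcription line."""
    ids = []
    for line in transcription_lines:
        match = UTT_ID.search(line)
        if match:
            ids.append(match.group(1))
    return ids


def find_mismatches(fileid_lines, transcription_lines):
    """Return a list of human-readable problems (empty if consistent)."""
    base_names = [l.strip().rsplit("/", 1)[-1] for l in fileid_lines if l.strip()]
    utt_ids = utterance_ids(transcription_lines)
    problems = []
    if len(base_names) != len(utt_ids):
        problems.append(f"{len(base_names)} fileids vs {len(utt_ids)} transcripts")
    for n, (name, utt) in enumerate(zip(base_names, utt_ids), start=1):
        if name != utt:
            problems.append(f"line {n}: fileid {name!r} vs transcript id {utt!r}")
    return problems


fileids = ["speaker1/spk1_one", "speaker1/spk1_two", "speaker2/spk2_one"]
transcripts = ["ONE (spk1_one)", "TWO (spk1_two)", "ONE (spk2_one)"]
print(find_mismatches(fileids, transcripts))
```

Note that this only compares base names, so a wrong directory component in a fileids entry would not be caught by it.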

     
  • Biet Hoang

    Biet Hoang - 2011-09-14

    After running ./scripts_pl/RunAll.pl
    I got this message:

    Training for 2 Gaussian(s) completed after 6 iterations
    MODULE: 60 Lattice Generation
    Skipped: $ST::CFG_MMIE set to 'no' in sphinx_train.cfg
    MODULE: 61 Lattice Pruning
    Skipped: $ST::CFG_MMIE set to 'no' in sphinx_train.cfg
    MODULE: 62 Lattice Format Conversion
    Skipped: $ST::CFG_MMIE set to 'no' in sphinx_train.cfg
    MODULE: 65 MMIE Training
    Skipped: $ST::CFG_MMIE set to 'no' in sphinx_train.cfg
    MODULE: 90 deleted interpolation
    Skipped for continuous models

     
  • Biet Hoang

    Biet Hoang - 2011-09-14

    I assume that ./scripts_pl/RunAll.pl was successful, so I ran

    root@ubuntu:/home/hoangb/Projects/Android/v2text/digit#
    ./scripts_pl/decode/slave.pl
    MODULE: DECODE Decoding using models previously trained
    Decoding 30 segments starting at 0 (part 1 of 1)
    0%
    WARNING: This step had 0 ERROR messages and 1 WARNING messages. Please check
    the log file for details.
    Aligning results to find error rate
    SENTENCE ERROR: 13.3% (4/30) WORD ERROR RATE: 13.3% (3/30)
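The percentages in the last line are ratios of the counts shown in parentheses. A small arithmetic check in Python (nothing Sphinx-specific, and the helper name is made up here) shows that 4/30 is 13.3% while 3/30 is 10.0%, so the two figures as pasted do not agree with each other and are worth re-reading in the log file:

```python
# Sketch: recompute the error percentages from the counts in the pasted line.

def percent(errors, total):
    """Error rate as a percentage, rounded to one decimal place."""
    return round(100.0 * errors / total, 1)


print(percent(4, 30))  # sentence errors: 4 of 30
print(percent(3, 30))  # word errors: 3 of 30
```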

     
  • eliasmajic

    eliasmajic - 2011-09-14

    I don't see a question anywhere, but your training data set has far too little
    audio.

     
  • Biet Hoang

    Biet Hoang - 2011-09-14

    I ran

    pocketsphinx_continuous -hmm model_parameters/digit.cd_cont_1000 -lm etc/digit.lm -dict etc/digit.dic

    and started speaking:

    INFO: acmod.c(242): Parsed model-specific feature parameters from
    model_parameters/digit.cd_cont_1000/feat.params
    INFO: feat.c(684): Initializing feature stream to type: '1s_c_d_dd',
    ceplen=13, CMN='current', VARNORM='no', AGC='none'
    INFO: cmn.c(142): mean= 12.00, mean= 0.0
    INFO: mdef.c(520): Reading model definition:
    model_parameters/digit.cd_cont_1000/mdef
    INFO: bin_mdef.c(173): Allocating 373 * 8 bytes (2 KiB) for CD tree
    INFO: tmat.c(205): Reading HMM transition probability matrices:
    model_parameters/digit.cd_cont_1000/transition_matrices
    INFO: acmod.c(117): Attempting to use SCHMM computation module
    INFO: ms_gauden.c(198): Reading mixture gaussian parameter:
    model_parameters/digit.cd_cont_1000/means
    INFO: ms_gauden.c(292): 153 codebook, 1 feature, size:
    INFO: ms_gauden.c(294): 8x39
    INFO: ms_gauden.c(198): Reading mixture gaussian parameter:
    model_parameters/digit.cd_cont_1000/variances
    INFO: ms_gauden.c(292): 153 codebook, 1 feature, size:
    INFO: ms_gauden.c(294): 8x39
    INFO: ms_gauden.c(354): 40932 variance values floored
    INFO: acmod.c(119): Attempting to use PTHMM computation module
    INFO: ms_gauden.c(198): Reading mixture gaussian parameter:
    model_parameters/digit.cd_cont_1000/means
    INFO: ms_gauden.c(292): 153 codebook, 1 feature, size:
    INFO: ms_gauden.c(294): 8x39
    INFO: ms_gauden.c(198): Reading mixture gaussian parameter:
    model_parameters/digit.cd_cont_1000/variances
    INFO: ms_gauden.c(292): 153 codebook, 1 feature, size:
    INFO: ms_gauden.c(294): 8x39
    INFO: ms_gauden.c(354): 40932 variance values floored
    INFO: ptm_mgau.c(804): Number of codebooks doesn't match number of ciphones,
    doesn't look like PTM: 153 20
    INFO: acmod.c(121): Falling back to general multi-stream GMM computation
    INFO: ms_gauden.c(198): Reading mixture gaussian parameter:
    model_parameters/digit.cd_cont_1000/means
    INFO: ms_gauden.c(292): 153 codebook, 1 feature, size:
    INFO: ms_gauden.c(294): 8x39
    INFO: ms_gauden.c(198): Reading mixture gaussian parameter:
    model_parameters/digit.cd_cont_1000/variances
    INFO: ms_gauden.c(292): 153 codebook, 1 feature, size:
    INFO: ms_gauden.c(294): 8x39
    INFO: ms_gauden.c(354): 40932 variance values floored
    INFO: ms_senone.c(160): Reading senone mixture weights:
    model_parameters/digit.cd_cont_1000/mixture_weights
    INFO: ms_senone.c(211): Truncating senone logs3(pdf) values by 10 bits
    INFO: ms_senone.c(218): Not transposing mixture weights in memory
    INFO: ms_senone.c(277): Read mixture weights for 153 senones: 1 features x 8
    codewords
    INFO: ms_senone.c(331): Mapping senones to individual codebooks
    INFO: ms_mgau.c(122): The value of topn: 4
    INFO: dict.c(306): Allocating 4110 * 20 bytes (80 KiB) for word entries
    INFO: dict.c(321): Reading main dictionary: etc/digit.dic
    INFO: dict.c(212): Allocated 0 KiB for strings, 0 KiB for phones
    INFO: dict.c(324): 11 words read
    INFO: dict.c(330): Reading filler dictionary:
    model_parameters/digit.cd_cont_1000/noisedict
    INFO: dict.c(212): Allocated 0 KiB for strings, 0 KiB for phones
    INFO: dict.c(333): 3 words read
    INFO: dict2pid.c(396): Building PID tables for dictionary
    INFO: dict2pid.c(404): Allocating 20^3 * 2 bytes (15 KiB) for word-initial
    triphones
    INFO: dict2pid.c(131): Allocated 4880 bytes (4 KiB) for word-final triphones
    INFO: dict2pid.c(195): Allocated 4880 bytes (4 KiB) for single-phone word
    triphones
    INFO: ngram_model_arpa.c(477): ngrams 1=12, 2=20, 3=10
    INFO: ngram_model_arpa.c(135): Reading unigrams
    INFO: ngram_model_arpa.c(516): 12 = #unigrams created
    INFO: ngram_model_arpa.c(195): Reading bigrams
    INFO: ngram_model_arpa.c(533): 20 = #bigrams created
    INFO: ngram_model_arpa.c(534): 3 = #prob2 entries
    INFO: ngram_model_arpa.c(542): 3 = #bo_wt2 entries
    INFO: ngram_model_arpa.c(292): Reading trigrams
    INFO: ngram_model_arpa.c(555): 10 = #trigrams created
    INFO: ngram_model_arpa.c(556): 2 = #prob3 entries
    INFO: ngram_search_fwdtree.c(99): 11 unique initial diphones
    INFO: ngram_search_fwdtree.c(147): 0 root, 0 non-root channels, 4 single-phone
    words
    INFO: ngram_search_fwdtree.c(186): Creating search tree
    INFO: ngram_search_fwdtree.c(191): before: 0 root, 0 non-root channels, 4
    single-phone words
    INFO: ngram_search_fwdtree.c(326): after: max nonroot chan increased to 142
    INFO: ngram_search_fwdtree.c(338): after: 11 root, 14 non-root channels, 3
    single-phone words
    INFO: ngram_search_fwdflat.c(156): fwdflat: min_ef_width = 4, max_sf_win = 25
    INFO: continuous.c(367): pocketsphinx_continuous COMPILED ON: Sep 11 2011, AT:
    02:12:53

    Warning: Could not find Mic element
    READY....
    Listening...
    Recording is stopped, start recording with ad_start_rec
    Stopped listening, please wait...
    INFO: cmn_prior.c(121): cmn_prior_update: from < 8.00 0.00 0.00 0.00 0.00 0.00
    0.00 0.00 0.00 0.00 0.00 0.00 0.00 >
    INFO: cmn_prior.c(139): cmn_prior_update: to < 13.76 -0.00 -0.24 0.02 -0.25
    -0.06 -0.20 -0.11 -0.14 -0.14 -0.16 -0.12 -0.30 >
    INFO: ngram_search_fwdtree.c(1549): 455 words recognized (1/fr)
    INFO: ngram_search_fwdtree.c(1551): 8403 senones evaluated (22/fr)
    INFO: ngram_search_fwdtree.c(1553): 3478 channels searched (9/fr), 2675 1st,
    803 last
    INFO: ngram_search_fwdtree.c(1557): 803 words for which last channels
    evaluated (2/fr)
    INFO: ngram_search_fwdtree.c(1560): 0 candidate words for entering last phone
    (0/fr)
    INFO: ngram_search_fwdtree.c(1562): fwdtree 0.05 CPU 0.014 xRT
    INFO: ngram_search_fwdtree.c(1565): fwdtree 4.69 wall 1.230 xRT
    INFO: ngram_search_fwdflat.c(305): Utterance vocabulary contains 2 words
    INFO: ngram_search_fwdflat.c(940): 281 words recognized (1/fr)
    INFO: ngram_search_fwdflat.c(942): 1140 senones evaluated (3/fr)
    INFO: ngram_search_fwdflat.c(944): 751 channels searched (1/fr)
    INFO: ngram_search_fwdflat.c(946): 751 words searched (1/fr)
    INFO: ngram_search_fwdflat.c(948): 76 word transitions (0/fr)
    INFO: ngram_search_fwdflat.c(951): fwdflat 0.00 CPU 0.001 xRT
    INFO: ngram_search_fwdflat.c(954): fwdflat 0.00 wall 0.001 xRT
    INFO: ngram_search.c(1201): </s> not found in last frame, using <sil>.379
    instead
    INFO: ngram_search.c(1253): lattice start node <s>.0 end node <sil>.226
    INFO: ngram_search.c(1281): Eliminated 0 nodes before end node
    INFO: ngram_search.c(1386): Lattice has 9 nodes, 10 links
    INFO: ps_lattice.c(1352): Normalizer P(O) = alpha(<sil>:226:379) = -287518
    INFO: ps_lattice.c(1390): Joint P(O,S) = -287518 P(S|O) = 0
    INFO: ngram_search.c(875): bestpath 0.00 CPU 0.000 xRT
    INFO: ngram_search.c(878): bestpath 0.00 wall 0.000 xRT
    000000000:
    READY....
    Listening...
    Recording is stopped, start recording with ad_start_rec
    Stopped listening, please wait...
    INFO: cmn_prior.c(121): cmn_prior_update: from < 13.76 -0.00 -0.24 0.02 -0.25
    -0.06 -0.20 -0.11 -0.14 -0.14 -0.16 -0.12 -0.30 >
    INFO: cmn_prior.c(139): cmn_prior_update: to < 13.78 -0.04 -0.21 0.02 -0.23
    -0.06 -0.19 -0.12 -0.15 -0.16 -0.17 -0.13 -0.30 >
    INFO: ngram_search_fwdtree.c(1549): 181 words recognized (2/fr)
    INFO: ngram_search_fwdtree.c(1551): 2406 senones evaluated (29/fr)
    INFO: ngram_search_fwdtree.c(1553): 1003 channels searched (12/fr), 792 1st,
    211 last
    INFO: ngram_search_fwdtree.c(1557): 211 words for which last channels
    evaluated (2/fr)
    INFO: ngram_search_fwdtree.c(1560): 0 candidate words for entering last phone
    (0/fr)
    INFO: ngram_search_fwdtree.c(1562): fwdtree 0.02 CPU 0.019 xRT
    INFO: ngram_search_fwdtree.c(1565): fwdtree 1.67 wall 2.010 xRT
    INFO: ngram_search_fwdflat.c(305): Utterance vocabulary contains 2 words
    INFO: ngram_search_fwdflat.c(940): 127 words recognized (2/fr)
    INFO: ngram_search_fwdflat.c(942): 246 senones evaluated (3/fr)
    INFO: ngram_search_fwdflat.c(944): 303 channels searched (3/fr)
    INFO: ngram_search_fwdflat.c(946): 303 words searched (3/fr)
    INFO: ngram_search_fwdflat.c(948): 76 word transitions (0/fr)
    INFO: ngram_search_fwdflat.c(951): fwdflat -0.00 CPU -0.000 xRT
    INFO: ngram_search_fwdflat.c(954): fwdflat 0.00 wall 0.001 xRT
    INFO: ngram_search.c(1253): lattice start node <s>.0 end node </s>.42
    INFO: ngram_search.c(1281): Eliminated 0 nodes before end node
    INFO: ngram_search.c(1386): Lattice has 9 nodes, 4 links
    INFO: ps_lattice.c(1352): Normalizer P(O) = alpha(</s>:42:81) = -87429
    INFO: ps_lattice.c(1390): Joint P(O,S) = -87429 P(S|O) = 0
    INFO: ngram_search.c(875): bestpath -0.00 CPU -0.000 xRT
    INFO: ngram_search.c(878): bestpath 0.00 wall 0.000 xRT
    000000001:
    READY....
    Listening...
    Recording is stopped, start recording with ad_start_rec
    Stopped listening, please wait...
    INFO: cmn_prior.c(121): cmn_prior_update: from < 13.78 -0.04 -0.21 0.02 -0.23
    -0.06 -0.19 -0.12 -0.15 -0.16 -0.17 -0.13 -0.30 >
    INFO: cmn_prior.c(139): cmn_prior_update: to < 13.86 -0.09 -0.18 0.06 -0.23
    -0.08 -0.19 -0.14 -0.17 -0.18 -0.18 -0.14 -0.28 >
    INFO: ngram_search_fwdtree.c(1549): 189 words recognized (2/fr)
    INFO: ngram_search_fwdtree.c(1551): 2745 senones evaluated (29/fr)
    INFO: ngram_search_fwdtree.c(1553): 1131 channels searched (11/fr), 902 1st,
    229 last
    INFO: ngram_search_fwdtree.c(1557): 229 words for which last channels
    evaluated (2/fr)
    INFO: ngram_search_fwdtree.c(1560): 0 candidate words for entering last phone
    (0/fr)
    INFO: ngram_search_fwdtree.c(1562): fwdtree 0.02 CPU 0.021 xRT
    INFO: ngram_search_fwdtree.c(1565): fwdtree 1.86 wall 1.933 xRT
    INFO: ngram_search_fwdflat.c(305): Utterance vocabulary contains 2 words
    INFO: ngram_search_fwdflat.c(940): 133 words recognized (1/fr)
    INFO: ngram_search_fwdflat.c(942): 285 senones evaluated (3/fr)
    INFO: ngram_search_fwdflat.c(944): 334 channels searched (3/fr)
    INFO: ngram_search_fwdflat.c(946): 334 words searched (3/fr)
    INFO: ngram_search_fwdflat.c(948): 71 word transitions (0/fr)
    INFO: ngram_search_fwdflat.c(951): fwdflat -0.00 CPU -0.000 xRT
    INFO: ngram_search_fwdflat.c(954): fwdflat 0.00 wall 0.001 xRT
    INFO: ngram_search.c(1253): lattice start node <s>.0 end node </s>.90
    INFO: ngram_search.c(1281): Eliminated 0 nodes before end node
    INFO: ngram_search.c(1386): Lattice has 11 nodes, 12 links
    INFO: ps_lattice.c(1352): Normalizer P(O) = alpha(</s>:90:94) = -133883
    INFO: ps_lattice.c(1390): Joint P(O,S) = -134584 P(S|O) = -701
    INFO: ngram_search.c(875): bestpath -0.00 CPU -0.000 xRT
    INFO: ngram_search.c(878): bestpath 0.00 wall 0.000 xRT
    000000002:
    READY....
    Listening...
    ^CINFO: ngram_search_fwdtree.c(430): TOTAL fwdtree 0.09 CPU 0.016 xRT
    INFO: ngram_search_fwdtree.c(433): TOTAL fwdtree 8.21 wall 1.474 xRT
    INFO: ngram_search_fwdflat.c(174): TOTAL fwdflat 0.00 CPU 0.001 xRT
    INFO: ngram_search_fwdflat.c(177): TOTAL fwdflat 0.00 wall 0.001 xRT
    INFO: ngram_search.c(317): TOTAL bestpath 0.00 CPU 0.000 xRT
    INFO: ngram_search.c(320): TOTAL bestpath 0.00 wall 0.000 xRT
    root@ubuntu:/home/hoangb/Projects/Android/v2text/digit#
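In the output above, each recognized utterance ends with a line made of a nine-digit utterance id and a colon, followed by the hypothesis text. All three such lines are empty here, which is the symptom being reported. A rough Python sketch for pulling those lines out of a saved log is shown below; the line format is assumed from this paste only, and the function name is illustrative.

```python
import re

# Sketch: list (utterance id, hypothesis) pairs from a saved decoder log.
# Assumption: hypothesis lines look like "000000000: some words", as in the paste.

HYP_LINE = re.compile(r"^\s*(\d{9}):\s*(.*?)\s*$")


def hypotheses(log_text):
    """Return a list of (utterance_id, hypothesis_text) tuples."""
    results = []
    for line in log_text.splitlines():
        match = HYP_LINE.match(line)
        if match:
            results.append((match.group(1), match.group(2)))
    return results


sample_log = """\
INFO: ngram_search.c(878): bestpath 0.00 wall 0.000 xRT
000000000:
READY....
INFO: ngram_search.c(878): bestpath 0.00 wall 0.000 xRT
000000001:
READY....
"""

for utt_id, text in hypotheses(sample_log):
    print(utt_id, "->", repr(text) if text else "(empty hypothesis)")
```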

     
  • Biet Hoang

    Biet Hoang - 2011-09-14

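    I use this script to record the voice samples on Linux:

    for i in $(seq 1 10); do
        read -r sent                      # next prompt line from corpus.txt
        echo "$i. $sent"                  # show the prompt to the speaker
        fn=$(printf '%s' "${sent,,}")     # lower-cased prompt becomes the file name
        rec -r 8000 -e signed-integer -b 16 -c 1 "$fn.wav" 2>/dev/null
    done < corpus.txt

    Then I played a WAV file to verify the sample rate. It is 8000 Hz: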
    root@ubuntu:/home/hoangb/Projects/Android/v2text/digit# play
    wav/speaker1/spk1_one.wav

    wav/speaker1/spk1_one.wav:

    File Size: 49.2k Bit Rate: 128k
    Encoding: Signed PCM
    Channels: 1 @ 16-bit
    Samplerate: 8000Hz
    Replaygain: off
    Duration: 00:00:03.07

    In:100% 00:00:03.07 Out:24.6k Clip:0
    Done.
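The same header check can be done for every recording at once with Python's standard wave module, instead of playing files one by one. The sketch below is an illustration under the assumption that all recordings should match the rec options used above (8000 Hz, mono, 16-bit); the function names and the EXPECTED table are made up for this example.

```python
import wave

# Sketch: read a WAV header and compare it with the format used for recording.
# EXPECTED mirrors the rec options above: -r 8000 -c 1 -b 16 (2 bytes per sample).
EXPECTED = {"rate": 8000, "channels": 1, "sample_bytes": 2}


def describe_wav(path):
    """Return sample rate, channel count, sample width and duration of a WAV file."""
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
        return {
            "rate": rate,
            "channels": w.getnchannels(),
            "sample_bytes": w.getsampwidth(),
            "seconds": w.getnframes() / rate,
        }


def format_problems(info, expected=EXPECTED):
    """List every header field that differs from the expected format."""
    return [
        f"{key}: expected {want}, found {info[key]}"
        for key, want in expected.items()
        if info[key] != want
    ]
```

Calling `format_problems(describe_wav("wav/speaker1/spk1_one.wav"))` should return an empty list for the file shown above, given the header values that play reported.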

     
  • Biet Hoang

    Biet Hoang - 2011-09-14

    feat.params:

    -alpha 0.97
    -samprate 8000.0
    -doublebw no
    -nfilt 31
    -ncep 13
    -lowerf 200.00
    -upperf 3500.00
    -dither yes
    -nfft 512
    -wlen 0.0256
    -transform legacy
    -feat __CFG_FEATURE__
    -svspec __CFG_SVSPEC__
    -agc __CFG_AGC__
    -cmn __CFG_CMN__
    -varnorm __CFG_VARNORM__
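The last five entries carry configuration names rather than values, which suggests this listing is the template copy and not the feat.params inside the model directory (the decoder log earlier in the thread shows concrete values such as '1s_c_d_dd' being parsed from model_parameters/digit.cd_cont_1000/feat.params). A small Python sketch for flagging such unfilled entries follows; the placeholder pattern is an assumption based on the names shown here, and the function name is illustrative.

```python
import re

# Sketch: flag feat.params entries whose value still looks like a template name.
# Assumption: placeholders look like CFG_SOMETHING, with or without underscores around.
PLACEHOLDER = re.compile(r"_*CFG_[A-Z0-9_]+")


def unsubstituted(feat_params_text):
    """Return the option names whose value is a CFG_* placeholder."""
    flagged = []
    for line in feat_params_text.splitlines():
        fields = line.split()
        if len(fields) == 2 and PLACEHOLDER.fullmatch(fields[1]):
            flagged.append(fields[0])
    return flagged


sample = """\
-alpha 0.97
-samprate 8000.0
-feat CFG_FEATURE
-cmn __CFG_CMN__
"""
print(unsubstituted(sample))
```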

     
  • Biet Hoang

    Biet Hoang - 2011-09-14

    eliasmajic, you are quick!
    The question is: the application does not seem to recognize my voice on either Linux or Android. What have I done wrong?

     
  • Biet Hoang

    Biet Hoang - 2011-09-14

    eliasmajic, how many speakers do you think would be enough? I am looking
    for more people to help with recording, but before doing that I would like
    you to check whether anything is wrong with my audio files.

    I set $CFG_WAVFILE_TYPE = 'mswav', but I recorded the audio on Linux. Does that
    conflict with my current setting?

    # Audio waveform and feature file information

    $CFG_WAVFILES_DIR = "$CFG_BASE_DIR/wav";
    $CFG_WAVFILE_EXTENSION = 'wav';
    $CFG_WAVFILE_TYPE = 'mswav'; # one of nist, mswav, raw
    $CFG_FEATFILES_DIR = "$CFG_BASE_DIR/feat";

     
  • Biet Hoang

    Biet Hoang - 2011-09-14

    sphinx_train.cfg says to set $CFG_HMM_TYPE = '.semi.' for PocketSphinx
    and Sphinx II, but I see most people set $CFG_HMM_TYPE = '.cont.', so I set
    mine to '.cont.' as well.

    Below is my current setting. Can you please check?

    $CFG_HMM_TYPE = '.cont.'; # Sphinx III
    #$CFG_HMM_TYPE = '.semi.'; # PocketSphinx and Sphinx II
    #$CFG_HMM_TYPE = '.ptm.'; # PocketSphinx (larger data sets)

    ...
    $CFG_FINAL_NUM_DENSITIES = 2;
    ...
    $CFG_N_TIED_STATES = 200;

     
  • Biet Hoang

    Biet Hoang - 2011-09-14

    In RecognizerTask.java I set:
    c.setString("-hmm",
    "/sdcard/Android/data/edu.cmu.pocketsphinx/hmm/tv/digit.cd_cont_200");
    c.setString("-dict",
    "/sdcard/Android/data/edu.cmu.pocketsphinx/lm/tv/digit.dic");
    c.setString("-lm",
    "/sdcard/Android/data/edu.cmu.pocketsphinx/lm/tv/digit.lm.DMP");

    c.setString("-rawlogdir", "/sdcard/Android/data/edu.cmu.pocketsphinx");
    c.setFloat("-samprate", 8000.0);
    c.setInt("-maxhmmpf", 2000);
    c.setInt("-maxwpf", 10);
    c.setInt("-pl_window", 2);
    c.setBoolean("-backtrace", true);
    c.setBoolean("-bestpath", false);

     
  • eliasmajic

    eliasmajic - 2011-09-14

    You need hours of audio. The site below gives good numbers, but why not use
    an existing model? You are just doing simple digits.

    http://cmusphinx.sourceforge.net/wiki/tutorialam
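For a sense of scale, the thread lists 30 training clips, and the one clip inspected with play lasted 3.07 s. Assuming every clip is about that long (an assumption, since only one duration is shown), the whole training set can be estimated with a few lines of Python:

```python
# Sketch: estimate the total amount of training audio in this thread.
clips = 30               # entries in digit_train.fileids
seconds_per_clip = 3.07  # duration of spk1_one.wav; assumed typical for all clips

total_seconds = clips * seconds_per_clip
print(f"{total_seconds:.1f} s = {total_seconds / 60:.1f} min = {total_seconds / 3600:.3f} h")
```

Under that assumption the set is about a minute and a half of audio, a small fraction of one hour.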

     
  • Biet Hoang

    Biet Hoang - 2011-09-15

    Hi eliasmajic,
    I am building it for an application that only needs ~100 commands. I did try
    the existing model included in the pocketsphinx directory. It takes a long
    time to load because it is large, and its transcriptions were completely
    wrong as well, so I did not adapt it. I also tried the models below, without
    success:
    US English WSJ5K
    US English HUB4
    Can you suggest a good one that can be used on Android?
    Do I need to adapt the acoustic model before using it?
    I tried the tidigits model, and it loads fast and is accurate. That is why I
    decided to create a new model. I am still learning, so I would
    like to try anything that works first.

    Thank you very much for your help.

     
