Pocketsphinx on iOS - no hypothesis/recognition

A darsh
2016-06-09
2016-06-10
  • A darsh

    A darsh - 2016-06-09

    Hi,
    I've been trying to use pocketsphinx on iOS (both an iPhone and the simulator). I compiled it with the provided build_iphone.sh and linked it statically. Attempting to recognize speech produces no output on iOS. I converted the raw file that pocketsphinx wrote during that run to .wav and fed it to pocketsphinx_continuous (from the same build) with exactly the same parameters, and it works there. What could I be missing?

    For some reason the number of words, senones, etc. evaluated on iOS is considerably lower than with pocketsphinx_continuous, and I wonder whether that is why nothing is recognised on iOS.

    The output of pocketsphinx_continuous follows, with the iOS output below it.

    Current configuration:
    [NAME]          [DEFLT]     [VALUE]
    -agc            none        none
    -agcthresh      2.0     2.000000e+00
    -allphone               
    -allphone_ci        no      no
    -alpha          0.97        9.700000e-01
    -ascale         20.0        2.000000e+01
    -aw         1       1
    -backtrace      no      no
    -beam           1e-48       1.000000e-48
    -bestpath       yes     yes
    -bestpathlw     9.5     9.500000e+00
    -ceplen         13      13
    -cmn            live        current
    -cmninit        40,3,-1     40,3,-1
    -compallsen     no      no
    -debug                  0
    -dict                   data/model/dict
    -dictcase       no      no
    -dither         no      no
    -doublebw       no      no
    -ds         1       1
    -fdict                  
    -feat           1s_c_d_dd   s2_4x
    -featparams             
    -fillprob       1e-8        1.000000e-08
    -frate          100     100
    -fsg                    
    -fsgusealtpron      yes     yes
    -fsgusefiller       yes     yes
    -fwdflat        yes     yes
    -fwdflatbeam        1e-64       1.000000e-64
    -fwdflatefwid       4       4
    -fwdflatlw      8.5     8.500000e+00
    -fwdflatsfwin       25      25
    -fwdflatwbeam       7e-29       7.000000e-29
    -fwdtree        yes     yes
    -hmm                    data/model/acoustic
    -input_endian       little      little
    -jsgf                   
    -keyphrase              
    -kws                    
    -kws_delay      10      10
    -kws_plp        1e-1        1.000000e-01
    -kws_threshold      1       1.000000e+00
    -latsize        5000        5000
    -lda                    
    -ldadim         0       0
    -lifter         0       22
    -lm                 /tmp/test.lm
    -lmctl                  
    -lmname                 
    -logbase        1.0001      1.000100e+00
    -logfn                  
    -logspec        no      no
    -lowerf         133.33334   2.000000e+02
    -lpbeam         1e-40       1.000000e-40
    -lponlybeam     7e-29       7.000000e-29
    -lw         6.5     6.500000e+00
    -maxhmmpf       30000       30000
    -maxwpf         -1      -1
    -mdef                   
    -mean                   
    -mfclogdir              
    -min_endfr      0       0
    -mixw                   
    -mixwfloor      0.0000001   1.000000e-07
    -mllr                   
    -mmap           yes     yes
    -ncep           13      13
    -nfft           512     512
    -nfilt          40      15
    -nwpen          1.0     1.000000e+00
    -pbeam          1e-48       1.000000e-48
    -pip            1.0     1.000000e+00
    -pl_beam        1e-10       1.000000e-10
    -pl_pbeam       1e-10       1.000000e-10
    -pl_pip         1.0     1.000000e+00
    -pl_weight      3.0     3.000000e+00
    -pl_window      5       5
    -rawlogdir              
    -remove_dc      no      no
    -remove_noise       yes     yes
    -remove_silence     yes     yes
    -round_filters      yes     yes
    -samprate       16000       8.000000e+03
    -seed           -1      -1
    -sendump                
    -senlogdir              
    -senmgau                
    -silprob        0.005       5.000000e-03
    -smoothspec     no      no
    -svspec                 
    -tmat                   
    -tmatfloor      0.0001      1.000000e-04
    -topn           4       4
    -topn_beam      0       0
    -toprule                
    -transform      legacy      dct
    -unit_area      yes     yes
    -upperf         6855.4976   3.500000e+03
    -uw         1.0     1.000000e+00
    -vad_postspeech     50      50
    -vad_prespeech      20      20
    -vad_startspeech    10      10
    -vad_threshold      2.0     2.000000e+00
    -var                    
    -varfloor       0.0001      1.000000e-04
    -varnorm        no      no
    -verbose        no      no
    -warp_params                
    -warp_type      inverse_linear  inverse_linear
    -wbeam          7e-29       7.000000e-29
    -wip            0.65        6.500000e-01
    -wlen           0.025625    2.562500e-02
    
    INFO: feat.c(715): Initializing feature stream to type: 's2_4x', ceplen=13, CMN='batch', VARNORM='no', AGC='none'
    INFO: cmn.c(97): mean[0]= 12.00, mean[1..12]= 0.0
    INFO: mdef.c(518): Reading model definition: data/model/acoustic/mdef
    INFO: bin_mdef.c(181): Allocating 8899 * 8 bytes (69 KiB) for CD tree
    INFO: tmat.c(206): Reading HMM transition probability matrices: data/model/acoustic/transition_matrices
    INFO: acmod.c(117): Attempting to use PTM computation module
    INFO: ms_gauden.c(127): Reading mixture gaussian parameter: data/model/acoustic/means
    INFO: ms_gauden.c(242): 27 codebook, 4 feature, size: 
    INFO: ms_gauden.c(244):  64x12
    INFO: ms_gauden.c(244):  64x24
    INFO: ms_gauden.c(244):  64x3
    INFO: ms_gauden.c(244):  64x12
    INFO: ms_gauden.c(127): Reading mixture gaussian parameter: data/model/acoustic/variances
    INFO: ms_gauden.c(242): 27 codebook, 4 feature, size: 
    INFO: ms_gauden.c(244):  64x12
    INFO: ms_gauden.c(244):  64x24
    INFO: ms_gauden.c(244):  64x3
    INFO: ms_gauden.c(244):  64x12
    INFO: ms_gauden.c(304): 2 variance values floored
    INFO: ptm_mgau.c(476): Loading senones from dump file data/model/acoustic/sendump
    INFO: ptm_mgau.c(500): BEGIN FILE FORMAT DESCRIPTION
    INFO: ptm_mgau.c(563): Rows: 64, Columns: 4081
    INFO: ptm_mgau.c(595): Using memory-mapped I/O for senones
    ERROR: "mmio.c", line 226: Failed to mmap 0 bytes: Invalid argument
    INFO: ptm_mgau.c(838): Maximum top-N: 4
    INFO: phone_loop_search.c(114): State beam -225 Phone exit beam -225 Insertion penalty 0
    INFO: dict.c(320): Allocating 5245 * 32 bytes (163 KiB) for word entries
    INFO: dict.c(333): Reading main dictionary: data/model/dict
    INFO: dict.c(213): Dictionary size 1146, allocated 7 KiB for strings, 13 KiB for phones
    INFO: dict.c(336): 1146 words read
    INFO: dict.c(358): Reading filler dictionary: data/model/acoustic/noisedict
    INFO: dict.c(213): Dictionary size 1149, allocated 0 KiB for strings, 0 KiB for phones
    INFO: dict.c(361): 3 words read
    INFO: dict2pid.c(396): Building PID tables for dictionary
    INFO: dict2pid.c(406): Allocating 27^3 * 2 bytes (38 KiB) for word-initial triphones
    INFO: dict2pid.c(132): Allocated 17712 bytes (17 KiB) for word-final triphones
    INFO: dict2pid.c(196): Allocated 17712 bytes (17 KiB) for single-phone word triphones
    INFO: ngram_model_trie.c(354): Trying to read LM in trie binary format
    INFO: ngram_model_trie.c(365): Header doesn't match
    INFO: ngram_model_trie.c(177): Trying to read LM in arpa format
    INFO: ngram_model_trie.c(193): LM of order 3
    INFO: ngram_model_trie.c(195): #1-grams: 7
    INFO: ngram_model_trie.c(195): #2-grams: 9
    INFO: ngram_model_trie.c(195): #3-grams: 6
    INFO: lm_trie.c(474): Training quantizer
    INFO: lm_trie.c(482): Building LM trie
    INFO: ngram_search_fwdtree.c(74): Initializing search tree
    INFO: ngram_search_fwdtree.c(101): 146 unique initial diphones
    INFO: ngram_search_fwdtree.c(186): Creating search channels
    INFO: ngram_search_fwdtree.c(323): Max nonroot chan increased to 136
    INFO: ngram_search_fwdtree.c(333): Created 5 root, 8 non-root channels, 3 single-phone words
    INFO: ngram_search_fwdflat.c(157): fwdflat: min_ef_width = 4, max_sf_win = 25
    INFO: continuous.c(307): ./bin/x86_64/bin/pocketsphinx_continuous COMPILED ON: Jun  8 2016, AT: 18:48:35
    
    INFO: cmn_live.c(120): Update from < 40.00  3.00 -1.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00 >
    INFO: cmn_live.c(138): Update to   < 48.27  0.52 -9.28  7.13  1.50  2.32 -7.44 -1.02 -0.31 -3.32  2.96 -1.87 -0.13 >
    INFO: ngram_search_fwdtree.c(1550):      585 words recognized (3/fr)
    INFO: ngram_search_fwdtree.c(1552):    18108 senones evaluated (97/fr)
    INFO: ngram_search_fwdtree.c(1556):     7708 channels searched (41/fr), 775 1st, 6127 last
    INFO: ngram_search_fwdtree.c(1559):      711 words for which last channels evaluated (3/fr)
    INFO: ngram_search_fwdtree.c(1561):      278 candidate words for entering last phone (1/fr)
    INFO: ngram_search_fwdtree.c(1564): fwdtree 0.05 CPU 0.025 xRT
    INFO: ngram_search_fwdtree.c(1567): fwdtree 0.07 wall 0.035 xRT
    INFO: ngram_search_fwdflat.c(302): Utterance vocabulary contains 6 words
    INFO: ngram_search_fwdflat.c(948):      509 words recognized (3/fr)
    INFO: ngram_search_fwdflat.c(950):    20410 senones evaluated (110/fr)
    INFO: ngram_search_fwdflat.c(952):     9220 channels searched (49/fr)
    INFO: ngram_search_fwdflat.c(954):      940 words searched (5/fr)
    INFO: ngram_search_fwdflat.c(957):      472 word transitions (2/fr)
    INFO: ngram_search_fwdflat.c(960): fwdflat 0.02 CPU 0.012 xRT
    INFO: ngram_search_fwdflat.c(963): fwdflat 0.04 wall 0.021 xRT
    INFO: ngram_search.c(1250): lattice start node <s>.0 end node </s>.172
    INFO: ngram_search.c(1276): Eliminated 0 nodes before end node
    INFO: ngram_search.c(1381): Lattice has 66 nodes, 50 links
    INFO: ps_lattice.c(1380): Bestpath score: -3267
    INFO: ps_lattice.c(1384): Normalizer P(O) = alpha(</s>:172:184) = -205658
    INFO: ps_lattice.c(1441): Joint P(O,S) = -226521 P(S|O) = -20863
    INFO: ngram_search.c(872): bestpath 0.00 CPU 0.000 xRT
    INFO: ngram_search.c(875): bestpath 0.00 wall 0.000 xRT
    HOLA MI NOMBRE
    INFO: ngram_search_fwdtree.c(429): TOTAL fwdtree 0.05 CPU 0.025 xRT
    INFO: ngram_search_fwdtree.c(432): TOTAL fwdtree 0.07 wall 0.035 xRT
    INFO: ngram_search_fwdflat.c(176): TOTAL fwdflat 0.02 CPU 0.012 xRT
    INFO: ngram_search_fwdflat.c(179): TOTAL fwdflat 0.04 wall 0.021 xRT
    INFO: ngram_search.c(303): TOTAL bestpath 0.00 CPU 0.000 xRT
    INFO: ngram_search.c(306): TOTAL bestpath 0.00 wall 0.000 xRT
    

    On the iOS simulator -

    Current configuration:
    [NAME]          [DEFLT]     [VALUE]
    -agc            none        none
    -agcthresh      2.0     2.000000e+00
    -allphone               
    -allphone_ci        no      no
    -alpha          0.97        9.700000e-01
    -ascale         20.0        2.000000e+01
    -aw         1       1
    -backtrace      no      no
    -beam           1e-48       1.000000e-48
    -bestpath       yes     yes
    -bestpathlw     9.5     9.500000e+00
    -ceplen         13      13
    -cmn            live        current
    -cmninit        40,3,-1     40,3,-1
    -compallsen     no      no
    -debug                  3
    -dict                   /Users/..../data/model/dict
    -dictcase       no      no
    -dither         no      no
    -doublebw       no      no
    -ds         1       1
    -fdict                  
    -feat           1s_c_d_dd   s2_4x
    -featparams             
    -fillprob       1e-8        1.000000e-08
    -frate          100     100
    -fsg                    
    -fsgusealtpron      yes     yes
    -fsgusefiller       yes     yes
    -fwdflat        yes     yes
    -fwdflatbeam        1e-64       1.000000e-64
    -fwdflatefwid       4       4
    -fwdflatlw      8.5     8.500000e+00
    -fwdflatsfwin       25      25
    -fwdflatwbeam       7e-29       7.000000e-29
    -fwdtree        yes     yes
    -hmm                    /Users/..../data/model/acoustic
    -input_endian       little      little
    -jsgf                   
    -keyphrase              
    -kws                    
    -kws_delay      10      10
    -kws_plp        1e-1        1.000000e-01
    -kws_threshold      1       1.000000e+00
    -latsize        5000        5000
    -lda                    
    -ldadim         0       0
    -lifter         0       22
    -lm                 /Users/.../test.lm
    -lmctl                  
    -lmname                 
    -logbase        1.0001      1.000100e+00
    -logfn                  
    -logspec        no      no
    -lowerf         133.33334   2.000000e+02
    -lpbeam         1e-40       1.000000e-40
    -lponlybeam     7e-29       7.000000e-29
    -lw         6.5     6.500000e+00
    -maxhmmpf       30000       30000
    -maxwpf         -1      -1
    -mdef                   
    -mean                   
    -mfclogdir              
    -min_endfr      0       0
    -mixw                   
    -mixwfloor      0.0000001   1.000000e-07
    -mllr                   
    -mmap           yes     yes
    -ncep           13      13
    -nfft           512     512
    -nfilt          40      15
    -nwpen          1.0     1.000000e+00
    -pbeam          1e-48       1.000000e-48
    -pip            1.0     1.000000e+00
    -pl_beam        1e-10       1.000000e-10
    -pl_pbeam       1e-10       1.000000e-10
    -pl_pip         1.0     1.000000e+00
    -pl_weight      3.0     3.000000e+00
    -pl_window      5       5
    -rawlogdir              /Users/..../tmp/
    -remove_dc      no      no
    -remove_noise       yes     yes
    -remove_silence     yes     yes
    -round_filters      yes     yes
    -samprate       16000       8.000000e+03
    -seed           -1      -1
    -sendump                
    -senlogdir              
    -senmgau                
    -silprob        0.005       5.000000e-03
    -smoothspec     no      no
    -svspec                 
    -tmat                   
    -tmatfloor      0.0001      1.000000e-04
    -topn           4       4
    -topn_beam      0       0
    -toprule                
    -transform      legacy      dct
    -unit_area      yes     yes
    -upperf         6855.4976   3.500000e+03
    -uw         1.0     1.000000e+00
    -vad_postspeech     50      50
    -vad_prespeech      20      20
    -vad_startspeech    10      10
    -vad_threshold      2.0     2.000000e+00
    -var                    
    -varfloor       0.0001      1.000000e-04
    -varnorm        no      no
    -verbose        no      no
    -warp_params                
    -warp_type      inverse_linear  inverse_linear
    -wbeam          7e-29       7.000000e-29
    -wip            0.65        6.500000e-01
    -wlen           0.025625    2.562500e-02
    
    INFO: feat.c(715): Initializing feature stream to type: 's2_4x', ceplen=13, CMN='batch', VARNORM='no', AGC='none'
    INFO: cmn.c(97): mean[0]= 12.00, mean[1..12]= 0.0
    INFO: mdef.c(518): Reading model definition: /Users/..../data/model/acoustic/mdef
    INFO: bin_mdef.c(181): Allocating 8899 * 8 bytes (69 KiB) for CD tree
    INFO: tmat.c(206): Reading HMM transition probability matrices: /Users/..../data/model/acoustic/transition_matrices
    INFO: acmod.c(117): Attempting to use PTM computation module
    INFO: ms_gauden.c(127): Reading mixture gaussian parameter: /Users/..../data/model/acoustic/means
    INFO: ms_gauden.c(242): 27 codebook, 4 feature, size: 
    INFO: ms_gauden.c(244):  64x12
    INFO: ms_gauden.c(244):  64x24
    INFO: ms_gauden.c(244):  64x3
    INFO: ms_gauden.c(244):  64x12
    INFO: ms_gauden.c(127): Reading mixture gaussian parameter: /Users/..../data/model/acoustic/variances
    INFO: ms_gauden.c(242): 27 codebook, 4 feature, size: 
    INFO: ms_gauden.c(244):  64x12
    INFO: ms_gauden.c(244):  64x24
    INFO: ms_gauden.c(244):  64x3
    INFO: ms_gauden.c(244):  64x12
    INFO: ms_gauden.c(304): 2 variance values floored
    INFO: ptm_mgau.c(476): Loading senones from dump file /Users/..../data/model/acoustic/sendump
    INFO: ptm_mgau.c(500): BEGIN FILE FORMAT DESCRIPTION
    INFO: ptm_mgau.c(563): Rows: 64, Columns: 4081
    INFO: ptm_mgau.c(595): Using memory-mapped I/O for senones
    INFO: ptm_mgau.c(838): Maximum top-N: 4
    INFO: phone_loop_search.c(114): State beam -225 Phone exit beam -225 Insertion penalty 0
    INFO: dict.c(320): Allocating 5245 * 20 bytes (102 KiB) for word entries
    INFO: dict.c(333): Reading main dictionary: /Users/..../data/model/dict
    INFO: dict.c(213): Dictionary size 1146, allocated 7 KiB for strings, 13 KiB for phones
    INFO: dict.c(336): 1146 words read
    INFO: dict.c(358): Reading filler dictionary: /Users/..../data/model/acoustic/noisedict
    INFO: dict.c(213): Dictionary size 1149, allocated 0 KiB for strings, 0 KiB for phones
    INFO: dict.c(361): 3 words read
    INFO: dict2pid.c(396): Building PID tables for dictionary
    INFO: dict2pid.c(406): Allocating 27^3 * 2 bytes (38 KiB) for word-initial triphones
    INFO: dict2pid.c(132): Allocated 8856 bytes (8 KiB) for word-final triphones
    INFO: dict2pid.c(196): Allocated 8856 bytes (8 KiB) for single-phone word triphones
    INFO: ngram_model_trie.c(354): Trying to read LM in trie binary format
    INFO: ngram_model_trie.c(365): Header doesn't match
    INFO: ngram_model_trie.c(177): Trying to read LM in arpa format
    INFO: ngram_model_trie.c(193): LM of order 3
    INFO: ngram_model_trie.c(195): #1-grams: 7
    INFO: ngram_model_trie.c(195): #2-grams: 9
    INFO: ngram_model_trie.c(195): #3-grams: 6
    INFO: lm_trie.c(474): Training quantizer
    INFO: lm_trie.c(482): Building LM trie
    INFO: ngram_search_fwdtree.c(74): Initializing search tree
    INFO: ngram_search_fwdtree.c(101): 146 unique initial diphones
    INFO: ngram_search_fwdtree.c(186): Creating search channels
    INFO: ngram_search_fwdtree.c(323): Max nonroot chan increased to 136
    INFO: ngram_search_fwdtree.c(333): Created 5 root, 8 non-root channels, 3 single-phone words
    INFO: ngram_search_fwdflat.c(157): fwdflat: min_ef_width = 4, max_sf_win = 25
    INFO: pocketsphinx.c(982): Writing raw audio log file: /Users/..../tmp//000000000.raw
    Starting audio AudioCallbackData(maxCallbackNum: 30, num: 0, listener: LearnLangFramework.(createSR (LearnLangFramework.BaseAppDelegate) -> ()).(Listener #1), decoder: LearnLangFramework.DecoderWrapper)
    INFO: cmn_live.c(120): Update from < 40.00  3.00 -1.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00 >
    INFO: cmn_live.c(138): Update to   < 78.35 -10.78 -1.65 -5.07 -2.43 -4.69 -0.59 -0.63  1.50 -0.47  1.51 -0.67  0.83 >
    INFO: ngram_search_fwdtree.c(1550):      261 words recognized (3/fr)
    INFO: ngram_search_fwdtree.c(1552):     2669 senones evaluated (29/fr)
    INFO: ngram_search_fwdtree.c(1556):      940 channels searched (10/fr), 440 1st, 294 last
    INFO: ngram_search_fwdtree.c(1559):      269 words for which last channels evaluated (2/fr)
    INFO: ngram_search_fwdtree.c(1561):        2 candidate words for entering last phone (0/fr)
    INFO: ngram_search_fwdtree.c(1564): fwdtree 0.07 CPU 0.073 xRT
    INFO: ngram_search_fwdtree.c(1567): fwdtree 2.10 wall 2.286 xRT
    INFO: ngram_search_fwdflat.c(302): Utterance vocabulary contains 2 words
    INFO: ngram_search_fwdflat.c(948):      261 words recognized (3/fr)
    INFO: ngram_search_fwdflat.c(950):      273 senones evaluated (3/fr)
    INFO: ngram_search_fwdflat.c(952):      267 channels searched (2/fr)
    INFO: ngram_search_fwdflat.c(954):      267 words searched (2/fr)
    INFO: ngram_search_fwdflat.c(957):       76 word transitions (0/fr)
    INFO: ngram_search_fwdflat.c(960): fwdflat 0.00 CPU 0.003 xRT
    INFO: ngram_search_fwdflat.c(963): fwdflat 0.00 wall 0.003 xRT
    INFO: ngram_search.c(1250): lattice start node <s>.0 end node </s>.24
    INFO: ngram_search.c(1276): Eliminated 0 nodes before end node
    INFO: ngram_search.c(1381): Lattice has 9 nodes, 9 links
    INFO: ps_lattice.c(1380): Bestpath score: -372
    INFO: ps_lattice.c(1384): Normalizer P(O) = alpha(</s>:24:90) = -33419
    INFO: ps_lattice.c(1441): Joint P(O,S) = -38083 P(S|O) = -4664
    INFO: ngram_search.c(872): bestpath 0.00 CPU 0.000 xRT
    INFO: ngram_search.c(875): bestpath 0.00 wall 0.000 xRT
    
     
    • Nickolay V. Shmyrev

      You corrupted the raw data you feed into the decoder on iOS; the endianness is wrong, or something like that. Without seeing your code it is hard to help you there.
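
      To illustrate the byte-order point, here is a minimal sketch that decodes raw bytes as signed 16-bit little-endian samples, which is the layout the configuration above lists as `-input_endian little`. The helper name `decodeInt16LE` is hypothetical and not part of pocketsphinx.

```swift
// Decode raw bytes as signed 16-bit little-endian samples.
// `decodeInt16LE` is a hypothetical helper, not a pocketsphinx call.
func decodeInt16LE(_ bytes: [UInt8]) -> [Int16] {
    var samples: [Int16] = []
    samples.reserveCapacity(bytes.count / 2)
    var i = 0
    while i + 1 < bytes.count {
        // Low byte first, then high byte.
        let raw = UInt16(bytes[i]) | (UInt16(bytes[i + 1]) << 8)
        samples.append(Int16(bitPattern: raw))
        i += 2
    }
    // A trailing odd byte, if any, is ignored.
    return samples
}
```

      Read in the opposite order, the byte pair 0x01 0x00 would come out as 256 rather than 1, which is the kind of corruption a wrong endian setting produces.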

       
  • A darsh

    A darsh - 2016-06-09

    Thanks for your quick response, Nickolay! The data that I feed into pocketsphinx_continuous is actually the data logged by pocketsphinx on iOS (in rawlogdir). Doesn't that mean the input format is correct? I also played the same file in Audacity and it sounds fine.
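
    One caveat with this check: a file from rawlogdir has no header, so the sample format is whatever the importing tool is told or assumes. The following is a minimal sketch of the 44-byte header that a raw-to-.wav conversion has to supply, assuming mono integer PCM. `wavHeader` and `appendLE` are hypothetical helpers, and the 8000 Hz in the usage note is taken from the `-samprate` value in the configuration above.

```swift
// Append an integer to a byte array in little-endian order.
func appendLE<T: FixedWidthInteger>(_ value: T, to bytes: inout [UInt8]) {
    for shift in stride(from: 0, to: T.bitWidth, by: 8) {
        bytes.append(UInt8(truncatingIfNeeded: value >> shift))
    }
}

// Build a canonical 44-byte WAV header for mono integer PCM.
// Nothing verifies these fields against the samples that follow, so they
// must describe the data as it really is.
func wavHeader(dataByteCount: UInt32, sampleRate: UInt32,
               bitsPerSample: UInt16) -> [UInt8] {
    let channels: UInt16 = 1
    let formatTag: UInt16 = 1            // 1 = integer PCM
    let blockAlign = channels * (bitsPerSample / 8)
    var h: [UInt8] = []
    h += Array("RIFF".utf8)
    appendLE(36 + dataByteCount, to: &h) // size of everything after this field
    h += Array("WAVE".utf8)
    h += Array("fmt ".utf8)
    appendLE(UInt32(16), to: &h)         // size of the fmt chunk body
    appendLE(formatTag, to: &h)
    appendLE(channels, to: &h)
    appendLE(sampleRate, to: &h)
    appendLE(sampleRate * UInt32(blockAlign), to: &h) // bytes per second
    appendLE(blockAlign, to: &h)
    appendLE(bitsPerSample, to: &h)
    h += Array("data".utf8)
    appendLE(dataByteCount, to: &h)
    return h
}
```

    For example, `wavHeader(dataByteCount: n, sampleRate: 8000, bitsPerSample: 16)` followed by the `n` raw bytes gives a playable file only if those bytes really are 16-bit signed samples.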

    In code, I am initializing as follows -

          ps_default_search_args(conf)
          decoder = ps_init(conf)
          ps_start_utt(decoder)
    

    After initializing the decoder, I read the data in chunks of 2048 bytes and call -

       // The third argument is a count of 16-bit samples, hence byteslength / 2.
       ps_process_raw(decoder, UnsafePointer(bytes), byteslength / 2, 0, 0)
       ps_get_hyp(decoder, &score)
    

    in a loop.

    At the end

          ps_end_utt(decoder)
          ps_get_hyp(decoder, &score)
    

    My guess is that some part of the initialization is amiss.
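
    The chunked feeding described above can be sketched roughly as follows. This is not the actual app code: `feedInChunks` is a hypothetical helper, and its `feed` closure stands in for the `ps_process_raw` call so that the sketch runs without the library. It assumes the buffer already holds 16-bit samples.

```swift
// Walk a buffer of 16-bit samples in fixed-size chunks.
// `feed` stands in for the ps_process_raw call; it receives each chunk and
// its length in samples (not bytes).
func feedInChunks(_ samples: [Int16], chunkBytes: Int = 2048,
                  feed: ([Int16], Int) -> Void) {
    let samplesPerChunk = chunkBytes / MemoryLayout<Int16>.size
    var start = 0
    while start < samples.count {
        // The last chunk may be shorter than the others.
        let end = min(start + samplesPerChunk, samples.count)
        let chunk = Array(samples[start..<end])
        feed(chunk, chunk.count)
        start = end
    }
}
```

    With the default of 2048 bytes, each full chunk carries 1024 samples, which is the same value as `byteslength / 2` in the call above.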

     

    Last edit: A darsh 2016-06-09
    • Nickolay V. Shmyrev

      in a loop.

      You need to provide the complete code, not just a couple of lines.

       
  • A darsh

    A darsh - 2016-06-10

    It turned out that the input format (32-bit float) was indeed wrong. Audacity was able to import the raw file and export it to .wav, but when I added the header myself it did not work. I changed the input to 16-bit signed and it works now. Thanks for your input.
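
    For anyone hitting the same problem, here is a minimal sketch of the conversion, assuming the float samples are nominally in the range -1.0...1.0. `floatToInt16` is a hypothetical helper. Where the float samples come from in a given app, and whether the capture format can be switched to 16-bit directly, is not covered here.

```swift
// Convert 32-bit float samples (nominally -1.0...1.0) to 16-bit signed PCM.
func floatToInt16(_ input: [Float]) -> [Int16] {
    return input.map { (sample: Float) -> Int16 in
        // Treat NaN as silence and clamp, so the Int16 conversion cannot trap.
        let safe: Float = sample.isNaN ? 0 : sample
        let clamped = max(-1.0, min(1.0, safe))
        return Int16((clamped * 32767.0).rounded())
    }
}
```

    The resulting array holds the kind of samples `ps_process_raw` expects, with the sample count equal to the array length.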

     
