In the final release of pocketsphinx, is it possible to do continuous recognition?
Is there some way to do it like OpenEars does on the iPhone?
Thanks
Andre
You can use existing pocketsphinx functions to implement continuous
recognition. The first task you need to solve is recording the audio
continuously on Android; this is an Android-specific task. Then you can feed
this audio to pocketsphinx as it's done in pocketsphinx_continuous. You will
have to wrap several functions of the pocketsphinx API in swig to make them
accessible from Android Java code.
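The capture/decode handoff described above can be sketched in plain Java. The AudioSource interface below is a hypothetical stand-in for android.media.AudioRecord (which only exists on Android), so the pattern stays self-contained; on a device the real recorder would implement read().

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/*
 * Capture/decode handoff sketch. AudioSource stands in for
 * android.media.AudioRecord; the decoder thread drains the queue.
 */
public class CaptureLoop {

    public interface AudioSource {
        /** Fill buf with 16-bit samples; return samples read, or -1 at end of input. */
        int read(short[] buf);
    }

    /** Pump the source into a queue a decoder thread can drain; returns chunk count. */
    public static int pump(AudioSource src, BlockingQueue<short[]> out) {
        int chunks = 0;
        short[] buf = new short[160];      // 20 ms at 8 kHz, 16-bit mono
        while (src.read(buf) > 0) {
            out.offer(buf.clone());        // hand over a copy; never block the recorder
            chunks++;
        }
        return chunks;
    }
}
```

The queue decouples the recorder thread from the (slower) decoder thread, which is the usual shape for continuous capture.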
Nikolay,
I am already recording the audio continuously on Android. What I need is an
example of consuming pocketsphinx_continuous, in C or whatever, so I can know
which functions I should wrap with swig.
Thanks
Andre
Hmm... it doesn't look too difficult to port...
I will start on it. Do you know of any big challenges in doing this?
Do you know if anybody has done it yet?
Thanks
Andre
Do you know of any big challenges in doing this?
The challenge will start after the porting is done. There are tweaks needed to
provide a good confidence measure for the recognition results, and you will
have to implement noise cancellation.
Do you know if anybody has done it yet?
No, I'm not aware of any such thing released to the public.
Shouldn't I include the ad and cont_ad C source references in Android.mk?
Yes, you also need to write wrappers for them. Actually, I think it might be
easier to implement an energy-based endpointer in Java yourself, for example
the one from Google. You can find other implementations as well.
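A minimal energy-based endpointer of the kind suggested here can be sketched in a few lines of Java. The threshold is an invented value that would need calibration against real microphone silence, and a production version would add hangover/smoothing:

```java
/*
 * Minimal energy-based endpointer sketch: a frame counts as speech
 * when its mean absolute amplitude exceeds a tuned threshold.
 */
public class EnergyEndpointer {

    private final double threshold;  // mean absolute amplitude treated as speech

    public EnergyEndpointer(double threshold) {
        this.threshold = threshold;
    }

    /** True if this frame of 16-bit samples carries enough energy to be speech. */
    public boolean isSpeech(short[] frame) {
        long sum = 0;
        for (short s : frame)
            sum += Math.abs(s);
        return (double) sum / frame.length > threshold;
    }
}
```

Tracking transitions of isSpeech() across consecutive frames then gives the utterance start/end points.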
You mean that if I detect silence and only pass the array of bytes to the
already wrapped processRaw, it works?
Because the only functions in pocketsphinx_continuous that need to be wrapped
are the ad and cont_ad ones, and, since those are only needed to detect
silence and handle recording, maybe the ones already 'swigged' are enough.
From my analysis, only ad_open_dev, cont_ad_init, ad_start_rec, cont_ad_calib
and cont_ad_read need to be wrapped.
static void
recognize_from_microphone()
{
    ad_rec_t *ad;
    int16 adbuf[4096];
    int32 k, ts, rem;
    char const *hyp;
    char const *uttid;
    cont_ad_t *cont;
    char word[256];

    if ((ad = ad_open_dev(cmd_ln_str_r(config, "-adcdev"),
                          (int) cmd_ln_float32_r(config, "-samprate"))) == NULL)
        E_FATAL("Failed to open audio device\n");

    /* Initialize continuous listening module */
    if ((cont = cont_ad_init(ad, ad_read)) == NULL)
        E_FATAL("Failed to initialize voice activity detection\n");
    if (ad_start_rec(ad) < 0)
        E_FATAL("Failed to start recording\n");
    if (cont_ad_calib(cont) < 0)
        E_FATAL("Failed to calibrate voice activity detection\n");

    for (;;) {
        /* Indicate listening for next utterance */
        printf("READY....\n");
        fflush(stdout);
        fflush(stderr);

        /* Wait data for next utterance */
        while ((k = cont_ad_read(cont, adbuf, 4096)) == 0)
            sleep_msec(100);

        if (k < 0)
            E_FATAL("Failed to read audio\n");

        /*
         * Non-zero amount of data received; start recognition of new utterance.
         * NULL argument to uttproc_begin_utt => automatic generation of utterance-id.
         */
        if (ps_start_utt(ps, NULL) < 0)
            E_FATAL("Failed to start utterance\n");
        ps_process_raw(ps, adbuf, k, FALSE, FALSE);
You mean that if I detect silence and only pass the array of bytes to the
already wrapped processRaw, it works?
Yes, you can do voice activity detection yourself and pass raw bytes. This
solution might be more straightforward than using the cont_ad API, which is
rather complex.
From my analysis, only ad_open_dev, cont_ad_init, ad_start_rec, cont_ad_calib
and cont_ad_read need to be wrapped.
Yes, even with this limited set of functions it's already painful, I think.
Nikolay, I got very near, but am stuck on a crazy issue.
I already detect voice activity and capture the speech in PCM, 1 channel,
8 kHz, 16 bits per sample; it is recorded to a local file and passed to the
decoder as shorts.
-agc none none
-agcthresh 2.0 2.000000e+00
-alpha 0.97 9.700000e-01
-ceplen 13 13
-cmn current current
-cmninit 8.0 56,-3,1
-dither no no
-doublebw no no
-feat 1s_c_d_dd 1s_c_d_dd
-frate 100 100
-input_endian little little
-lda
-ldadim 0 0
-lifter 0 0
-logspec no no
-lowerf 133.33334 1.000000e+00
-ncep 13 13
-nfft 512 512
-nfilt 40 20
-remove_dc no yes
-round_filters yes no
-samprate 16000 8.000000e+03
-seed -1 -1
-smoothspec no no
-svspec 0-12/13-25/26-38
-transform legacy dct
-unit_area yes yes
-upperf 6855.4976 4.000000e+03
-varnorm no no
-verbose no no
-warp_params
-warp_type inverse_linear inverse_linear
-wlen 0.025625 2.500000e-02
INFO: acmod.c(242): Parsed model-specific feature parameters from
/sdcard/Android/data/pocketsphinx/hmm/en_US//feat.params
INFO: feat.c(697): Initializing feature stream to type: '1s_c_d_dd',
ceplen=13, CMN='current', VARNORM='no', AGC='none'
INFO: cmn.c(142): mean[0]= 12.00, mean[1..12]= 0.0
INFO: acmod.c(163): Using subvector specification 0-12/13-25/26-38
INFO: mdef.c(520): Reading model definition:
/sdcard/Android/data/pocketsphinx/hmm/en_US//mdef
INFO: mdef.c(531): Found byte-order mark BMDF, assuming this is a binary mdef
file
INFO: bin_mdef.c(330): Reading binary model definition:
/sdcard/Android/data/pocketsphinx/hmm/en_US//mdef
INFO: bin_mdef.c(507): 50 CI-phone, 143047 CD-phone, 3 emitstate/phone, 150
CI-sen, 5150 Sen, 27135 Sen-Seq
INFO: tmat.c(205): Reading HMM transition probability matrices:
/sdcard/Android/data/pocketsphinx/hmm/en_US//transition_matrices
INFO: acmod.c(117): Attempting to use SCHMM computation module
INFO: ms_gauden.c(198): Reading mixture gaussian parameter:
/sdcard/Android/data/pocketsphinx/hmm/en_US//means
INFO: ms_gauden.c(292): 1 codebook, 3 feature, size:
INFO: ms_gauden.c(294): 256x13
INFO: ms_gauden.c(294): 256x13
INFO: ms_gauden.c(294): 256x13
INFO: ms_gauden.c(198): Reading mixture gaussian parameter:
/sdcard/Android/data/pocketsphinx/hmm/en_US//variances
INFO: ms_gauden.c(292): 1 codebook, 3 feature, size:
INFO: ms_gauden.c(294): 256x13
INFO: ms_gauden.c(294): 256x13
INFO: ms_gauden.c(294): 256x13
INFO: ms_gauden.c(354): 0 variance values floored
INFO: s2_semi_mgau.c(908): Loading senones from dump file
/sdcard/Android/data/pocketsphinx/hmm/en_US//sendump
INFO: s2_semi_mgau.c(932): BEGIN FILE FORMAT DESCRIPTION
INFO: s2_semi_mgau.c(1027): Using memory-mapped I/O for senones
INFO: s2_semi_mgau.c(1304): Maximum top-N: 4 Top-N beams: 0 0 0
INFO: phone_loop_search.c(105): State beam -230231 Phone exit beam -115115
Insertion penalty 0
INFO: dict.c(306): Allocating 4114 * 20 bytes (80 KiB) for word entries
INFO: dict.c(321): Reading main dictionary:
/sdcard/Android/data/pocketsphinx/lm/en_US/dic.dic
INFO: dict.c(212): Allocated 0 KiB for strings, 0 KiB for phones
INFO: dict.c(324): 7 words read
INFO: dict.c(330): Reading filler dictionary:
/sdcard/Android/data/pocketsphinx/hmm/en_US//noisedict
INFO: dict.c(212): Allocated 0 KiB for strings, 0 KiB for phones
INFO: dict.c(333): 11 words read
INFO: dict2pid.c(396): Building PID tables for dictionary
INFO: dict2pid.c(404): Allocating 50^3 * 2 bytes (244 KiB) for word-initial
triphones
INFO: dict2pid.c(131): Allocated 30200 bytes (29 KiB) for word-final triphones
INFO: dict2pid.c(195): Allocated 30200 bytes (29 KiB) for single-phone word
triphones
INFO: fsg_search.c(145): FSG(beam: -1080, pbeam: -1080, wbeam: -634; wip: -26,
pip: 0)
INFO: jsgf.c(546): Defined rule: PUBLIC <grm.simple>
INFO: fsg_model.c(213): Computing transitive closure for null transitions
INFO: fsg_model.c(264): 0 null transitions added
INFO: fsg_model.c(411): Adding silence transitions for <sil> to FSG
INFO: fsg_model.c(431): Added 8 silence word transitions
INFO: fsg_model.c(411): Adding silence transitions for ++NOISE++ to FSG
INFO: fsg_model.c(431): Added 8 silence word transitions
INFO: fsg_model.c(411): Adding silence transitions for ++BREATH++ to FSG
INFO: fsg_model.c(431): Added 8 silence word transitions
INFO: fsg_model.c(411): Adding silence transitions for ++SMACK++ to FSG
INFO: fsg_model.c(431): Added 8 silence word transitions
INFO: fsg_model.c(411): Adding silence transitions for ++COUGH++ to FSG
INFO: fsg_model.c(431): Added 8 silence word transitions
INFO: fsg_model.c(411): Adding silence transitions for ++LAUGH++ to FSG
INFO: fsg_model.c(431): Added 8 silence word transitions
INFO: fsg_model.c(411): Adding silence transitions for ++TONE++ to FSG
INFO: fsg_model.c(431): Added 8 silence word transitions
INFO: fsg_model.c(411): Adding silence transitions for ++UH++ to FSG
INFO: fsg_model.c(431): Added 8 silence word transitions
INFO: fsg_model.c(411): Adding silence transitions for ++UM++ to FSG
INFO: fsg_model.c(431): Added 8 silence word transitions
INFO: fsg_search.c(364): Added 1 alternate word transitions
INFO: fsg_lextree.c(108): Allocated 816 bytes (0 KiB) for left and right
context phones
INFO: fsg_lextree.c(251): 102 HMM nodes in lextree (83 leaves)
INFO: fsg_lextree.c(253): Allocated 11016 bytes (10 KiB) for all lextree nodes
INFO: fsg_lextree.c(256): Allocated 8964 bytes (8 KiB) for lextree leafnodes
INFO: pocketsphinx.c(673): Writing raw audio log file:
/sdcard/Android/data/pocketsphinx/000000000.raw
INFO: cmn_prior.c(121): cmn_prior_update: from < 56.00 -3.00 1.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 >
INFO: cmn_prior.c(139): cmn_prior_update: to < 90.98 -7.73 -2.32 -1.44 -0.87
-0.60 -0.35 -0.24 -0.30 -0.12 -0.09 -0.12 -0.05 >
INFO: fsg_search.c(1030): 255 frames, 5212 HMMs (20/fr), 15652 senones
(61/fr), 691 history entries (2/fr)
ERROR: "fsg_search.c", line 1099: Final state not reached in frame 255
INFO: pocketsphinx.c(846): 000000000: (null) (1144249008)
INFO: word start end pprob ascr lscr lback
ERROR: "fsg_search.c", line 1099: Final state not reached in frame 255
INFO: pocketsphinx.c(673): Writing raw audio log file:
/sdcard/Android/data/pocketsphinx/000000001.raw
INFO: cmn_prior.c(121): cmn_prior_update: from < 90.98 -7.73 -2.32 -1.44 -0.87
-0.60 -0.35 -0.24 -0.30 -0.12 -0.09 -0.12 -0.05 >
INFO: cmn_prior.c(139): cmn_prior_update: to < 90.92 -7.97 -2.23 -1.37 -0.80
-0.59 -0.38 -0.28 -0.25 -0.14 -0.09 -0.14 -0.07 >
INFO: fsg_search.c(1030): 255 frames, 3479 HMMs (13/fr), 12305 senones
(48/fr), 254 history entries (0/fr)
ERROR: "fsg_search.c", line 1099: Final state not reached in frame 255
INFO: pocketsphinx.c(846): 000000001: (null) (2969816)
INFO: word start end pprob ascr lscr lback
ERROR: "fsg_search.c", line 1099: Final state not reached in frame 255
INFO: pocketsphinx.c(673): Writing raw audio log file:
/sdcard/Android/data/pocketsphinx/000000002.raw
INFO: cmn_prior.c(121): cmn_prior_update: from < 90.92 -7.97 -2.23 -1.37 -0.80
-0.59 -0.38 -0.28 -0.25 -0.14 -0.09 -0.14 -0.07 >
INFO: cmn_prior.c(139): cmn_prior_update: to < 90.90 -7.99 -2.23 -1.37 -0.81
-0.59 -0.38 -0.29 -0.23 -0.13 -0.10 -0.13 -0.08 >
INFO: fsg_search.c(1030): 153 frames, 2090 HMMs (13/fr), 7390 senones (48/fr),
152 history entries (0/fr)
ERROR: "fsg_search.c", line 1099: Final state not reached in frame 153
INFO: pocketsphinx.c(846): 000000002: (null) (1164416)
INFO: word start end pprob ascr lscr lback
ERROR: "fsg_search.c", line 1099: Final state not reached in frame 153
INFO: cmd_ln.c(559): Parsing command line:
\
-nfilt 20 \
-lowerf 1 \
-upperf 4000 \
-wlen 0.025 \
-transform dct \
-round_filters no \
-remove_dc yes \
-svspec 0-12/13-25/26-38 \
-feat 1s_c_d_dd \
-agc none \
-cmn current \
-cmninit 56,-3,1 \
-varnorm no
Current configuration:
-agc none none
-agcthresh 2.0 2.000000e+00
-alpha 0.97 9.700000e-01
-ceplen 13 13
-cmn current current
-cmninit 8.0 56,-3,1
-dither no no
-doublebw no no
-feat 1s_c_d_dd 1s_c_d_dd
-frate 100 100
-input_endian little little
-lda
-ldadim 0 0
-lifter 0 0
-logspec no no
-lowerf 133.33334 1.000000e+00
-ncep 13 13
-nfft 512 512
-nfilt 40 20
-remove_dc no yes
-round_filters yes no
-samprate 16000 8.000000e+03
-seed -1 -1
-smoothspec no no
-svspec 0-12/13-25/26-38
-transform legacy dct
-unit_area yes yes
-upperf 6855.4976 4.000000e+03
-varnorm no no
-verbose no no
-warp_params
-warp_type inverse_linear inverse_linear
-wlen 0.025625 2.500000e-02
INFO: acmod.c(242): Parsed model-specific feature parameters from
/sdcard/Android/data/pocketsphinx/hmm/en_US//feat.params
INFO: feat.c(697): Initializing feature stream to type: '1s_c_d_dd',
ceplen=13, CMN='current', VARNORM='no', AGC='none'
INFO: cmn.c(142): mean[0]= 12.00, mean[1..12]= 0.0
INFO: acmod.c(163): Using subvector specification 0-12/13-25/26-38
INFO: mdef.c(520): Reading model definition:
/sdcard/Android/data/pocketsphinx/hmm/en_US//mdef
INFO: mdef.c(531): Found byte-order mark BMDF, assuming this is a binary mdef
file
INFO: bin_mdef.c(330): Reading binary model definition:
/sdcard/Android/data/pocketsphinx/hmm/en_US//mdef
INFO: bin_mdef.c(507): 50 CI-phone, 143047 CD-phone, 3 emitstate/phone, 150
CI-sen, 5150 Sen, 27135 Sen-Seq
INFO: tmat.c(205): Reading HMM transition probability matrices:
/sdcard/Android/data/pocketsphinx/hmm/en_US//transition_matrices
INFO: acmod.c(117): Attempting to use SCHMM computation module
INFO: ms_gauden.c(198): Reading mixture gaussian parameter:
/sdcard/Android/data/pocketsphinx/hmm/en_US//means
INFO: ms_gauden.c(292): 1 codebook, 3 feature, size:
INFO: ms_gauden.c(294): 256x13
INFO: ms_gauden.c(294): 256x13
INFO: ms_gauden.c(294): 256x13
INFO: ms_gauden.c(198): Reading mixture gaussian parameter:
/sdcard/Android/data/pocketsphinx/hmm/en_US//variances
INFO: ms_gauden.c(292): 1 codebook, 3 feature, size:
INFO: ms_gauden.c(294): 256x13
INFO: ms_gauden.c(294): 256x13
INFO: ms_gauden.c(294): 256x13
INFO: ms_gauden.c(354): 0 variance values floored
INFO: s2_semi_mgau.c(908): Loading senones from dump file
/sdcard/Android/data/pocketsphinx/hmm/en_US//sendump
INFO: s2_semi_mgau.c(932): BEGIN FILE FORMAT DESCRIPTION
INFO: s2_semi_mgau.c(1027): Using memory-mapped I/O for senones
INFO: s2_semi_mgau.c(1304): Maximum top-N: 4 Top-N beams: 0 0 0
INFO: phone_loop_search.c(105): State beam -230231 Phone exit beam -115115
Insertion penalty 0
INFO: dict.c(306): Allocating 4114 * 20 bytes (80 KiB) for word entries
INFO: dict.c(321): Reading main dictionary:
/sdcard/Android/data/pocketsphinx/lm/en_US/dic.dic
INFO: dict.c(212): Allocated 0 KiB for strings, 0 KiB for phones
INFO: dict.c(324): 7 words read
INFO: dict.c(330): Reading filler dictionary:
/sdcard/Android/data/pocketsphinx/hmm/en_US//noisedict
INFO: dict.c(212): Allocated 0 KiB for strings, 0 KiB for phones
INFO: dict.c(333): 11 words read
INFO: dict2pid.c(396): Building PID tables for dictionary
INFO: dict2pid.c(404): Allocating 50^3 * 2 bytes (244 KiB) for word-initial
triphones
INFO: dict2pid.c(131): Allocated 30200 bytes (29 KiB) for word-final triphones
INFO: dict2pid.c(195): Allocated 30200 bytes (29 KiB) for single-phone word
triphones
INFO: fsg_search.c(145): FSG(beam: -1080, pbeam: -1080, wbeam: -634; wip: -26,
pip: 0)
INFO: jsgf.c(546): Defined rule: PUBLIC <grm.simple>
INFO: fsg_model.c(213): Computing transitive closure for null transitions
INFO: fsg_model.c(264): 0 null transitions added
INFO: fsg_model.c(411): Adding silence transitions for <sil> to FSG
INFO: fsg_model.c(431): Added 8 silence word transitions
INFO: fsg_model.c(411): Adding silence transitions for ++NOISE++ to FSG
INFO: fsg_model.c(431): Added 8 silence word transitions
INFO: fsg_model.c(411): Adding silence transitions for ++BREATH++ to FSG
INFO: fsg_model.c(431): Added 8 silence word transitions
INFO: fsg_model.c(411): Adding silence transitions for ++SMACK++ to FSG
INFO: fsg_model.c(431): Added 8 silence word transitions
INFO: fsg_model.c(411): Adding silence transitions for ++COUGH++ to FSG
INFO: fsg_model.c(431): Added 8 silence word transitions
INFO: fsg_model.c(411): Adding silence transitions for ++LAUGH++ to FSG
INFO: fsg_model.c(431): Added 8 silence word transitions
INFO: fsg_model.c(411): Adding silence transitions for ++TONE++ to FSG
INFO: fsg_model.c(431): Added 8 silence word transitions
INFO: fsg_model.c(411): Adding silence transitions for ++UH++ to FSG
INFO: fsg_model.c(431): Added 8 silence word transitions
INFO: fsg_model.c(411): Adding silence transitions for ++UM++ to FSG
INFO: fsg_model.c(431): Added 8 silence word transitions
INFO: fsg_search.c(364): Added 1 alternate word transitions
INFO: fsg_lextree.c(108): Allocated 816 bytes (0 KiB) for left and right
context phones
INFO: fsg_lextree.c(251): 102 HMM nodes in lextree (83 leaves)
INFO: fsg_lextree.c(253): Allocated 11016 bytes (10 KiB) for all lextree nodes
INFO: fsg_lextree.c(256): Allocated 8964 bytes (8 KiB) for lextree leafnodes
INFO: pocketsphinx.c(673): Writing raw audio log file:
/sdcard/Android/data/pocketsphinx/000000000.raw
INFO: cmn_prior.c(121): cmn_prior_update: from < 56.00 -3.00 1.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 >
INFO: cmn_prior.c(139): cmn_prior_update: to < 91.02 -7.05 -2.33 -1.44 -0.76
-0.54 -0.39 -0.21 -0.19 -0.14 -0.19 -0.15 -0.13 >
INFO: fsg_search.c(1030): 255 frames, 4972 HMMs (19/fr), 15047 senones
(59/fr), 641 history entries (2/fr)
ERROR: "fsg_search.c", line 1099: Final state not reached in frame 255
INFO: pocketsphinx.c(846): 000000000: (null) (1144249008)
INFO: word start end pprob ascr lscr lback
ERROR: "fsg_search.c", line 1099: Final state not reached in frame 255
Ok, I made it work without needing to generate a file. Now I am passing the
data from onBufferReceived(byte[] buf) directly to process_raw after
converting it to shorts. At least now I get some partialResults with wrong
values, but I never get an endResult.
So, I am wondering two things:
1 - I am obliged to pass 512 shorts to process_raw and this is misbehaving
the decoder. onBufferReceived raises an array of bytes with 320 positions and
I am passing it directly. Not much sense to me.
2 - Maybe the format of the audio fed by onBufferReceived is not what
pocketsphinx expects.
According to Google:
android.speech.RecognitionListener.onBufferReceived(byte[] buffer)
buffer: a buffer containing a sequence of big-endian 16-bit integers
representing a single channel audio stream. The sample rate is implementation
dependent.
If I save this data and put a wave header on it this way:
WaveHeader hdr = new WaveHeader(WaveHeader.FORMAT_PCM, (short) 1, 8000,
(short) 16, pcm.length);
The audio plays perfectly.
Makes sense?
Thanks
And if I record only the voice between silences and use pocketsphinx to
recognize from a recorded file? Is this possible?
I'm trying to do the same, but I don't know how to find the silence in the
audio. Can you please help me with that?
What did you use for finding the silences? Did you use
android.media.AudioRecord?
Any tips would be helpful for me.
Thanks,
Ok, I am accumulating it to 512 as I saw in some .c consuming example.
onBufferReceived raises 320 bytes, which converted to shorts gives 160. So,
if passing 320 shorts is fine, I will need to accumulate it only twice before
passing to process_raw.
About the little-endian issue, I think this could be a tremendous problem for
making this work.
Hope this is changeable in the Android source code; otherwise it is a brick
wall.
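The byte order need not be a brick wall: since onBufferReceived documents its buffer as big-endian 16-bit samples, the app can decode those bytes into a Java short[] before handing them to the wrapped process_raw. A sketch in pure Java, with no Android APIs involved:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/*
 * Sketch: reinterpret the big-endian PCM bytes delivered by
 * onBufferReceived as 16-bit samples. Once the data is in a short[],
 * byte order is no longer an issue on the Java side.
 */
public class PcmConvert {

    /** Decode big-endian 16-bit PCM bytes into an array of samples. */
    public static short[] bigEndianToShorts(byte[] pcm) {
        short[] out = new short[pcm.length / 2];
        ByteBuffer.wrap(pcm).order(ByteOrder.BIG_ENDIAN).asShortBuffer().get(out);
        return out;
    }
}
```

Each 320-byte buffer then yields 160 shorts, which can be accumulated into whatever chunk size the decoder call prefers.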
In the final release of pocketsphinx, is it possible to do continuous
recognition? Is there some way to do it like OpenEars does on the iPhone?
No, the current implementation doesn't let you do that. You need to implement
wrappers specific to continuous recognition.
You need to wrap the proper functions for Android Java code.
Can you please tell me which ones I need?
Thanks
Should I compile another library and write a full new wrapper, or is
libpocketsphinx.so enough, so that I should only add new wrappers to swig?
Thanks
Andre
Hello Andre
The source code for pocketsphinx_continuous is available here:
http://cmusphinx.svn.sourceforge.net/viewvc/cmusphinx/trunk/pocketsphinx/src/programs/continuous.c?revision=10974&view=markup
It's pretty small.
And what if I record only the voice between silences and use pocketsphinx to
recognize from the recorded file?
Is this possible?
Yes.
Nikolay,
Shouldn't I include the ad and cont_ad C source references in Android.mk?
I think they aren't being included in the default shared library compilation.
I didn't see any references to them in this file.
Thanks
Yes, you also need to write wrappers for them. Actually, I think it might be
easier to implement an energy-based endpointer in Java yourself, for example
the one from Google:
http://www.google.com/codesearch#hfE6470xZHk/chrome/browser/speech/endpointer/energy_endpointer.cc&type=cs
Or this:
http://www.google.com/codesearch#cZwlSNS7aEw/packages/inputmethods/LatinIME/java/src/com/android/inputmethod/voice/VoiceInput.java&q=endpoint%20speech%20lang:%5Ejava$&type=cs
You can find other implementations as well. Android seems to have this thing
already; maybe you just need to check the Android API?
Well, so... this way no new wrapper is needed...
I was trying to port it and it is painful... handling alsa issues and so
on...
So, just to double check:
After loading and configuring the decoder, I need to call startUtt when there
is voice activity and endUtt when activity stops, to get the final results.
There is no need to recreate the decoder after each silence period?
Yes.
You do not need to recreate the decoder. You can use the existing one; just
call start_utt again.
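The per-utterance cycle on the reused decoder can be sketched like this. Decoder and its startUtt/processRaw/endUtt methods are hypothetical wrapper names standing in for the swig-generated API, not the actual generated signatures:

```java
import java.util.List;

/*
 * Per-utterance cycle sketch. The same Decoder instance is reused for
 * every utterance; it is never recreated between silence periods.
 */
public class UtteranceLoop {

    public interface Decoder {
        void startUtt();                // begin a new utterance
        void processRaw(short[] buf);   // stream raw 16-bit samples
        String endUtt();                // finish and return the hypothesis
    }

    /** Feed one speech segment (audio between two silences) to the decoder. */
    public static String recognizeSegment(Decoder dec, List<short[]> buffers) {
        dec.startUtt();
        for (short[] b : buffers)
            dec.processRaw(b);
        return dec.endUtt();
    }
}
```

A VAD decision drives the loop: on a silence-to-speech transition call startUtt, stream buffers while speech lasts, and on the next silence call endUtt and read the result.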
I can see the decoder receives the data OK, because it generates it (you can
get a sample here: http://dl.dropbox.com/u/6231836/000000000.raw , saying
OPEN BROWSER).
But in the log above I am getting a lot of "Final state not reached in
frame" errors.
It looks like a very small detail. Do you have any idea?
Thanks
INFO: cmd_ln.c(559): Parsing command line:
\
-nfilt 20 \
-lowerf 1 \
-upperf 4000 \
-wlen 0.025 \
-transform dct \
-round_filters no \
-remove_dc yes \
-svspec 0-12/13-25/26-38 \
-feat 1s_c_d_dd \
-agc none \
-cmn current \
-cmninit 56,-3,1 \
-varnorm no
Current configuration:
-agc none none
-agcthresh 2.0 2.000000e+00
-alpha 0.97 9.700000e-01
-ceplen 13 13
-cmn current current
-cmninit 8.0 56,-3,1
-dither no no
-doublebw no no
-feat 1s_c_d_dd 1s_c_d_dd
-frate 100 100
-input_endian little little
-lda
-ldadim 0 0
-lifter 0 0
-logspec no no
-lowerf 133.33334 1.000000e+00
-ncep 13 13
-nfft 512 512
-nfilt 40 20
-remove_dc no yes
-round_filters yes no
-samprate 16000 8.000000e+03
-seed -1 -1
-smoothspec no no
-svspec 0-12/13-25/26-38
-transform legacy dct
-unit_area yes yes
-upperf 6855.4976 4.000000e+03
-varnorm no no
-verbose no no
-warp_params
-warp_type inverse_linear inverse_linear
-wlen 0.025625 2.500000e-02
INFO: acmod.c(242): Parsed model-specific feature parameters from
/sdcard/Android/data/pocketsphinx/hmm/en_US//feat.params
INFO: feat.c(697): Initializing feature stream to type: '1s_c_d_dd',
ceplen=13, CMN='current', VARNORM='no', AGC='none'
INFO: cmn.c(142): mean= 12.00, mean= 0.0
INFO: acmod.c(163): Using subvector specification 0-12/13-25/26-38
INFO: mdef.c(520): Reading model definition:
/sdcard/Android/data/pocketsphinx/hmm/en_US//mdef
INFO: mdef.c(531): Found byte-order mark BMDF, assuming this is a binary mdef
file
INFO: bin_mdef.c(330): Reading binary model definition:
/sdcard/Android/data/pocketsphinx/hmm/en_US//mdef
INFO: bin_mdef.c(507): 50 CI-phone, 143047 CD-phone, 3 emitstate/phone, 150
CI-sen, 5150 Sen, 27135 Sen-Seq
INFO: tmat.c(205): Reading HMM transition probability matrices:
/sdcard/Android/data/pocketsphinx/hmm/en_US//transition_matrices
INFO: acmod.c(117): Attempting to use SCHMM computation module
INFO: ms_gauden.c(198): Reading mixture gaussian parameter:
/sdcard/Android/data/pocketsphinx/hmm/en_US//means
INFO: ms_gauden.c(292): 1 codebook, 3 feature, size:
INFO: ms_gauden.c(294): 256x13
INFO: ms_gauden.c(294): 256x13
INFO: ms_gauden.c(294): 256x13
INFO: ms_gauden.c(198): Reading mixture gaussian parameter:
/sdcard/Android/data/pocketsphinx/hmm/en_US//variances
INFO: ms_gauden.c(292): 1 codebook, 3 feature, size:
INFO: ms_gauden.c(294): 256x13
INFO: ms_gauden.c(294): 256x13
INFO: ms_gauden.c(294): 256x13
INFO: ms_gauden.c(354): 0 variance values floored
INFO: s2_semi_mgau.c(908): Loading senones from dump file
/sdcard/Android/data/pocketsphinx/hmm/en_US//sendump
INFO: s2_semi_mgau.c(932): BEGIN FILE FORMAT DESCRIPTION
INFO: s2_semi_mgau.c(1027): Using memory-mapped I/O for senones
INFO: s2_semi_mgau.c(1304): Maximum top-N: 4 Top-N beams: 0 0 0
INFO: phone_loop_search.c(105): State beam -230231 Phone exit beam -115115
Insertion penalty 0
INFO: dict.c(306): Allocating 4114 * 20 bytes (80 KiB) for word entries
INFO: dict.c(321): Reading main dictionary:
/sdcard/Android/data/pocketsphinx/lm/en_US/dic.dic
INFO: dict.c(212): Allocated 0 KiB for strings, 0 KiB for phones
INFO: dict.c(324): 7 words read
INFO: dict.c(330): Reading filler dictionary:
/sdcard/Android/data/pocketsphinx/hmm/en_US//noisedict
INFO: dict.c(212): Allocated 0 KiB for strings, 0 KiB for phones
INFO: dict.c(333): 11 words read
INFO: dict2pid.c(396): Building PID tables for dictionary
INFO: dict2pid.c(404): Allocating 50^3 * 2 bytes (244 KiB) for word-initial
triphones
INFO: dict2pid.c(131): Allocated 30200 bytes (29 KiB) for word-final triphones
INFO: dict2pid.c(195): Allocated 30200 bytes (29 KiB) for single-phone word
triphones
INFO: fsg_search.c(145): FSG(beam: -1080, pbeam: -1080, wbeam: -634; wip: -26,
pip: 0)
INFO: jsgf.c(546): Defined rule: PUBLIC <grm.simple>
INFO: fsg_model.c(213): Computing transitive closure for null transitions
INFO: fsg_model.c(264): 0 null transitions added
INFO: fsg_model.c(411): Adding silence transitions for <sil> to FSG
INFO: fsg_model.c(431): Added 8 silence word transitions
INFO: fsg_model.c(411): Adding silence transitions for ++NOISE++ to FSG
INFO: fsg_model.c(431): Added 8 silence word transitions
INFO: fsg_model.c(411): Adding silence transitions for ++BREATH++ to FSG
INFO: fsg_model.c(431): Added 8 silence word transitions
INFO: fsg_model.c(411): Adding silence transitions for ++SMACK++ to FSG
INFO: fsg_model.c(431): Added 8 silence word transitions
INFO: fsg_model.c(411): Adding silence transitions for ++COUGH++ to FSG
INFO: fsg_model.c(431): Added 8 silence word transitions
INFO: fsg_model.c(411): Adding silence transitions for ++LAUGH++ to FSG
INFO: fsg_model.c(431): Added 8 silence word transitions
INFO: fsg_model.c(411): Adding silence transitions for ++TONE++ to FSG
INFO: fsg_model.c(431): Added 8 silence word transitions
INFO: fsg_model.c(411): Adding silence transitions for ++UH++ to FSG
INFO: fsg_model.c(431): Added 8 silence word transitions
INFO: fsg_model.c(411): Adding silence transitions for ++UM++ to FSG
INFO: fsg_model.c(431): Added 8 silence word transitions
INFO: fsg_search.c(364): Added 1 alternate word transitions
INFO: fsg_lextree.c(108): Allocated 816 bytes (0 KiB) for left and right
context phones
INFO: fsg_lextree.c(251): 102 HMM nodes in lextree (83 leaves)
INFO: fsg_lextree.c(253): Allocated 11016 bytes (10 KiB) for all lextree nodes
INFO: fsg_lextree.c(256): Allocated 8964 bytes (8 KiB) for lextree leafnodes
INFO: pocketsphinx.c(673): Writing raw audio log file:
/sdcard/Android/data/pocketsphinx/000000000.raw
INFO: cmn_prior.c(121): cmn_prior_update: from < 56.00 -3.00 1.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 >
INFO: cmn_prior.c(139): cmn_prior_update: to < 90.98 -7.73 -2.32 -1.44 -0.87
-0.60 -0.35 -0.24 -0.30 -0.12 -0.09 -0.12 -0.05 >
INFO: fsg_search.c(1030): 255 frames, 5212 HMMs (20/fr), 15652 senones
(61/fr), 691 history entries (2/fr)
ERROR: "fsg_search.c", line 1099: Final state not reached in frame 255
INFO: pocketsphinx.c(846): 000000000: (null) (1144249008)
INFO: word start end pprob ascr lscr lback
ERROR: "fsg_search.c", line 1099: Final state not reached in frame 255
INFO: pocketsphinx.c(673): Writing raw audio log file:
/sdcard/Android/data/pocketsphinx/000000001.raw
INFO: cmn_prior.c(121): cmn_prior_update: from < 90.98 -7.73 -2.32 -1.44 -0.87
-0.60 -0.35 -0.24 -0.30 -0.12 -0.09 -0.12 -0.05 >
INFO: cmn_prior.c(139): cmn_prior_update: to < 90.92 -7.97 -2.23 -1.37 -0.80
-0.59 -0.38 -0.28 -0.25 -0.14 -0.09 -0.14 -0.07 >
INFO: fsg_search.c(1030): 255 frames, 3479 HMMs (13/fr), 12305 senones
(48/fr), 254 history entries (0/fr)
ERROR: "fsg_search.c", line 1099: Final state not reached in frame 255
INFO: pocketsphinx.c(846): 000000001: (null) (2969816)
INFO: word start end pprob ascr lscr lback
ERROR: "fsg_search.c", line 1099: Final state not reached in frame 255
INFO: pocketsphinx.c(673): Writing raw audio log file:
/sdcard/Android/data/pocketsphinx/000000002.raw
INFO: cmn_prior.c(121): cmn_prior_update: from < 90.92 -7.97 -2.23 -1.37 -0.80
-0.59 -0.38 -0.28 -0.25 -0.14 -0.09 -0.14 -0.07 >
INFO: cmn_prior.c(139): cmn_prior_update: to < 90.90 -7.99 -2.23 -1.37 -0.81
-0.59 -0.38 -0.29 -0.23 -0.13 -0.10 -0.13 -0.08 >
INFO: fsg_search.c(1030): 153 frames, 2090 HMMs (13/fr), 7390 senones (48/fr),
152 history entries (0/fr)
ERROR: "fsg_search.c", line 1099: Final state not reached in frame 153
INFO: pocketsphinx.c(846): 000000002: (null) (1164416)
INFO: word start end pprob ascr lscr lback
ERROR: "fsg_search.c", line 1099: Final state not reached in frame 153
INFO: cmd_ln.c(559): Parsing command line:
\
-nfilt 20 \
-lowerf 1 \
-upperf 4000 \
-wlen 0.025 \
-transform dct \
-round_filters no \
-remove_dc yes \
-svspec 0-12/13-25/26-38 \
-feat 1s_c_d_dd \
-agc none \
-cmn current \
-cmninit 56,-3,1 \
-varnorm no
Current configuration:
-agc none none
-agcthresh 2.0 2.000000e+00
-alpha 0.97 9.700000e-01
-ceplen 13 13
-cmn current current
-cmninit 8.0 56,-3,1
-dither no no
-doublebw no no
-feat 1s_c_d_dd 1s_c_d_dd
-frate 100 100
-input_endian little little
-lda
-ldadim 0 0
-lifter 0 0
-logspec no no
-lowerf 133.33334 1.000000e+00
-ncep 13 13
-nfft 512 512
-nfilt 40 20
-remove_dc no yes
-round_filters yes no
-samprate 16000 8.000000e+03
-seed -1 -1
-smoothspec no no
-svspec 0-12/13-25/26-38
-transform legacy dct
-unit_area yes yes
-upperf 6855.4976 4.000000e+03
-varnorm no no
-verbose no no
-warp_params
-warp_type inverse_linear inverse_linear
-wlen 0.025625 2.500000e-02
INFO: acmod.c(242): Parsed model-specific feature parameters from
/sdcard/Android/data/pocketsphinx/hmm/en_US//feat.params
INFO: feat.c(697): Initializing feature stream to type: '1s_c_d_dd',
ceplen=13, CMN='current', VARNORM='no', AGC='none'
INFO: cmn.c(142): mean= 12.00, mean= 0.0
INFO: acmod.c(163): Using subvector specification 0-12/13-25/26-38
INFO: mdef.c(520): Reading model definition:
/sdcard/Android/data/pocketsphinx/hmm/en_US//mdef
INFO: mdef.c(531): Found byte-order mark BMDF, assuming this is a binary mdef
file
INFO: bin_mdef.c(330): Reading binary model definition:
/sdcard/Android/data/pocketsphinx/hmm/en_US//mdef
INFO: bin_mdef.c(507): 50 CI-phone, 143047 CD-phone, 3 emitstate/phone, 150
CI-sen, 5150 Sen, 27135 Sen-Seq
INFO: tmat.c(205): Reading HMM transition probability matrices:
/sdcard/Android/data/pocketsphinx/hmm/en_US//transition_matrices
INFO: acmod.c(117): Attempting to use SCHMM computation module
INFO: ms_gauden.c(198): Reading mixture gaussian parameter:
/sdcard/Android/data/pocketsphinx/hmm/en_US//means
INFO: ms_gauden.c(292): 1 codebook, 3 feature, size:
INFO: ms_gauden.c(294): 256x13
INFO: ms_gauden.c(294): 256x13
INFO: ms_gauden.c(294): 256x13
INFO: ms_gauden.c(198): Reading mixture gaussian parameter:
/sdcard/Android/data/pocketsphinx/hmm/en_US//variances
INFO: ms_gauden.c(292): 1 codebook, 3 feature, size:
INFO: ms_gauden.c(294): 256x13
INFO: ms_gauden.c(294): 256x13
INFO: ms_gauden.c(294): 256x13
INFO: ms_gauden.c(354): 0 variance values floored
INFO: s2_semi_mgau.c(908): Loading senones from dump file
/sdcard/Android/data/pocketsphinx/hmm/en_US//sendump
INFO: s2_semi_mgau.c(932): BEGIN FILE FORMAT DESCRIPTION
INFO: s2_semi_mgau.c(1027): Using memory-mapped I/O for senones
INFO: s2_semi_mgau.c(1304): Maximum top-N: 4 Top-N beams: 0 0 0
INFO: phone_loop_search.c(105): State beam -230231 Phone exit beam -115115
Insertion penalty 0
INFO: dict.c(306): Allocating 4114 * 20 bytes (80 KiB) for word entries
INFO: dict.c(321): Reading main dictionary:
/sdcard/Android/data/pocketsphinx/lm/en_US/dic.dic
INFO: dict.c(212): Allocated 0 KiB for strings, 0 KiB for phones
INFO: dict.c(324): 7 words read
INFO: dict.c(330): Reading filler dictionary:
/sdcard/Android/data/pocketsphinx/hmm/en_US//noisedict
INFO: dict.c(212): Allocated 0 KiB for strings, 0 KiB for phones
INFO: dict.c(333): 11 words read
INFO: dict2pid.c(396): Building PID tables for dictionary
INFO: dict2pid.c(404): Allocating 50^3 * 2 bytes (244 KiB) for word-initial
triphones
INFO: dict2pid.c(131): Allocated 30200 bytes (29 KiB) for word-final triphones
INFO: dict2pid.c(195): Allocated 30200 bytes (29 KiB) for single-phone word
triphones
INFO: fsg_search.c(145): FSG(beam: -1080, pbeam: -1080, wbeam: -634; wip: -26,
pip: 0)
INFO: jsgf.c(546): Defined rule: PUBLIC <grm.simple>
INFO: fsg_model.c(213): Computing transitive closure for null transitions
INFO: fsg_model.c(264): 0 null transitions added
INFO: fsg_model.c(411): Adding silence transitions for <sil> to FSG
INFO: fsg_model.c(431): Added 8 silence word transitions
INFO: fsg_model.c(411): Adding silence transitions for ++NOISE++ to FSG
INFO: fsg_model.c(431): Added 8 silence word transitions
INFO: fsg_model.c(411): Adding silence transitions for ++BREATH++ to FSG
INFO: fsg_model.c(431): Added 8 silence word transitions
INFO: fsg_model.c(411): Adding silence transitions for ++SMACK++ to FSG
INFO: fsg_model.c(431): Added 8 silence word transitions
INFO: fsg_model.c(411): Adding silence transitions for ++COUGH++ to FSG
INFO: fsg_model.c(431): Added 8 silence word transitions
INFO: fsg_model.c(411): Adding silence transitions for ++LAUGH++ to FSG
INFO: fsg_model.c(431): Added 8 silence word transitions
INFO: fsg_model.c(411): Adding silence transitions for ++TONE++ to FSG
INFO: fsg_model.c(431): Added 8 silence word transitions
INFO: fsg_model.c(411): Adding silence transitions for ++UH++ to FSG
INFO: fsg_model.c(431): Added 8 silence word transitions
INFO: fsg_model.c(411): Adding silence transitions for ++UM++ to FSG
INFO: fsg_model.c(431): Added 8 silence word transitions
INFO: fsg_search.c(364): Added 1 alternate word transitions
INFO: fsg_lextree.c(108): Allocated 816 bytes (0 KiB) for left and right
context phones
INFO: fsg_lextree.c(251): 102 HMM nodes in lextree (83 leaves)
INFO: fsg_lextree.c(253): Allocated 11016 bytes (10 KiB) for all lextree nodes
INFO: fsg_lextree.c(256): Allocated 8964 bytes (8 KiB) for lextree leafnodes
INFO: pocketsphinx.c(673): Writing raw audio log file:
/sdcard/Android/data/pocketsphinx/000000000.raw
INFO: cmn_prior.c(121): cmn_prior_update: from < 56.00 -3.00 1.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 >
INFO: cmn_prior.c(139): cmn_prior_update: to < 91.02 -7.05 -2.33 -1.44 -0.76
-0.54 -0.39 -0.21 -0.19 -0.14 -0.19 -0.15 -0.13 >
INFO: fsg_search.c(1030): 255 frames, 4972 HMMs (19/fr), 15047 senones
(59/fr), 641 history entries (2/fr)
ERROR: "fsg_search.c", line 1099: Final state not reached in frame 255
INFO: pocketsphinx.c(846): 000000000: (null) (1144249008)
INFO: word start end pprob ascr lscr lback
ERROR: "fsg_search.c", line 1099: Final state not reached in frame 255
OK, I made it work without needing to generate a file. Now I am passing the
data from onBufferReceived(byte[] buf) directly to process_raw after
converting it to shorts. At least now I get some partial results with wrong
values, but I never get a final result.
So, I am wondering two things:
1 - Am I obligated to pass 512 shorts to process_raw, and is this confusing
the decoder? onBufferReceived raises an array of 320 bytes and I am passing it
directly, which doesn't make much sense to me.
2 - Maybe the audio format delivered by onBufferReceived is not what
pocketsphinx expects.
According to Google's documentation:
android.speech.RecognitionListener.onBufferReceived(byte buffer)
buffer: a buffer containing a sequence of big-endian 16-bit integers
representing a single channel audio stream. The sample rate is implementation
dependent.
If I save this data and put a WAVE header on it like this:
WaveHeader hdr = new WaveHeader(WaveHeader.FORMAT_PCM, (short)1, 8000,
(short)16, pcm.length);
the audio plays perfectly.
Does that make sense?
Thanks
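Since onBufferReceived is documented to deliver big-endian 16-bit samples and pocketsphinx wants native little-endian shorts, a small conversion step in between may be all that's needed. Here is a minimal sketch of that conversion; the idea is that the resulting short[] would then be handed to whatever SWIG wrapper you generate for ps_process_raw (the wrapper's exact name and signature are assumptions, only the byte-to-short conversion is shown):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class BufferConverter {
    // Convert a big-endian 16-bit PCM byte buffer (as documented for
    // RecognitionListener.onBufferReceived) into short samples that
    // pocketsphinx's process_raw can consume directly.
    public static short[] toShorts(byte[] buf) {
        short[] samples = new short[buf.length / 2];
        ByteBuffer.wrap(buf)
                  .order(ByteOrder.BIG_ENDIAN) // interpret pairs as big-endian
                  .asShortBuffer()
                  .get(samples);
        return samples;
    }
}
```

With this, a 320-byte buffer from onBufferReceived becomes 160 shorts, which matches the frame sizes discussed below.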
Hi,
I'm trying to do the same, but I don't know how to find the silences in the
audio. Can you please help me with that?
What did you use for finding the silences? Did you use
android.media.AudioRecord?
Any tips would be helpful.
Thanks,
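As suggested earlier in the thread, instead of wrapping pocketsphinx's cont_ad module you can write a simple energy-based endpointer in Java yourself. A minimal sketch of the idea follows; the threshold is an assumption and would need tuning for your device and microphone, and a real endpointer would also want some hysteresis (e.g. require several consecutive silent frames before declaring end-of-utterance):

```java
public class SilenceDetector {
    // Returns true if the mean absolute amplitude of the frame falls
    // below the given threshold, i.e. the frame is "probably silence".
    public static boolean isSilence(short[] frame, double threshold) {
        long sum = 0;
        for (short s : frame) {
            sum += Math.abs(s); // accumulate absolute amplitudes
        }
        double avg = (double) sum / frame.length;
        return avg < threshold;
    }
}
```

You would call this on each frame coming out of AudioRecord (or onBufferReceived) and stop feeding process_raw once enough consecutive frames test as silent.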
I'm not sure why that is. 320 shorts is one frame, which looks like a natural
size. Why do you work with 512 shorts?
Pocketsphinx expects little-endian audio.
OK, I was accumulating up to 512 because I saw that in some .c consumer
example. onBufferReceived raises 320 bytes, which converted to shorts gives
160. So if passing 320 shorts is fine, I only need to accumulate twice before
passing to process_raw.
About the little-endian issue, I think this is going to be a tremendous
problem. Hopefully it can be worked around; otherwise it's a showstopper.
You don't need to accumulate; you can pass 160 too.
Thank you. Do you have any tips on the little-endian issue?
I will start digging into the Android code....