Hi all,
I have been playing with PocketSphinx on an Intel-based PC with a Linux Ubuntu installation
and I have been getting some good results (90%) with our tailored language model. I am using
a set of WAV files as test data.
I have come to trying to get this working on an NXP 8308-based PPC card, with Linux OS, and
this is where I am having issues. PocketSphinx builds and runs without any reported errors but
I seem to have lost the recognition! If I leave the default endian as big, then all I get is empty recognition,
but setting '-input_endian little' gives me some recognition, but down at the 5% mark on the same
input data.
My questions are: has anyone been using PPC for running this? If so, are there any compiler or pocketsphinx
settings that I should be looking at? Could this be a timing issue, as the processing core on the NXP 8308 is
a lot slower than the PC-based system?
Thanks
Russ
My questions are, has anyone been using PPC for running this?

No, you are the first.

If so are there any compiler or pocketsphinx settings that I should be looking at?

There are no specific settings.
In order to debug this issue, you need to pinpoint it to a specific file and try to ensure all scores and intermediate values in the decoding process are the same. You can compare acoustic scores and language model scores, for example.
You could try to reproduce this problem in qemu; that would help us debug it.

Could this be a timing issue, as the processing core on the NXP 8308 is a lot slower than the PC-based system?

Unlikely.
Thanks for the reply. These are the two runs that I did; everything up until the batch.c decoding is the same. The audio test file is a WAV that contains just the speech 'UNDO'.
The configuration for both runs were:
Current configuration:
[NAME] [DEFLT] [VALUE]
-agc none none
-agcthresh 2.0 2.000000e+00
-allphone
-allphone_ci no no
-alpha 0.97 9.700000e-01
-ascale 20.0 2.000000e+01
-aw 1 1
-backtrace no no
-beam 1e-48 1.000000e-48
-bestpath yes yes
-bestpathlw 9.5 9.500000e+00
-ceplen 13 13
-cmn current current
-cmninit 8.0 40,3,-1
-compallsen no no
-debug 0
-dict /en-dvi/dvi-en-us.dict
-dictcase no no
-dither no no
-doublebw no no
-ds 1 1
-fdict
-feat 1s_c_d_dd 1s_c_d_dd
-featparams
-fillprob 1e-8 1.000000e-08
-frate 100 100
-fsg
-fsgusealtpron yes yes
-fsgusefiller yes yes
-fwdflat yes yes
-fwdflatbeam 1e-64 1.000000e-64
-fwdflatefwid 4 4
-fwdflatlw 8.5 8.500000e+00
-fwdflatsfwin 25 25
-fwdflatwbeam 7e-29 7.000000e-29
-fwdtree yes yes
-hmm /en-dvi/en-us
-input_endian big little
-jsgf
-keyphrase
-kws
-kws_delay 10 10
-kws_plp 1e-1 1.000000e-01
-kws_threshold 1 1.000000e+00
-latsize 5000 5000
-lda
-ldadim 0 0
-lifter 0 22
-lm /en-dvi/dvi.lm
-lmctl
-lmname
-logbase 1.0001 1.000100e+00
-logfn
-logspec no no
-lowerf 133.33334 1.300000e+02
-lpbeam 1e-40 1.000000e-40
-lponlybeam 7e-29 7.000000e-29
-lw 6.5 6.500000e+00
-maxhmmpf 30000 30000
-maxwpf -1 -1
-mdef
-mean
-mfclogdir
-min_endfr 0 0
-mixw
-mixwfloor 0.0000001 1.000000e-07
-mllr
-mmap yes yes
-ncep 13 13
-nfft 512 512
-nfilt 40 25
-nwpen 1.0 1.000000e+00
-pbeam 1e-48 1.000000e-48
-pip 1.0 1.000000e+00
-pl_beam 1e-10 1.000000e-10
-pl_pbeam 1e-10 1.000000e-10
-pl_pip 1.0 1.000000e+00
-pl_weight 3.0 3.000000e+00
-pl_window 5 5
-rawlogdir
-remove_dc no no
-remove_noise yes yes
-remove_silence yes yes
-round_filters yes yes
-samprate 16000 1.600000e+04
-seed -1 -1
-sendump
-senlogdir
-senmgau
-silprob 0.005 5.000000e-03
-smoothspec no no
-svspec 0-12/13-25/26-38
-tmat
-tmatfloor 0.0001 1.000000e-04
-topn 4 4
-topn_beam 0 0
-toprule
-transform legacy dct
-unit_area yes yes
-upperf 6855.4976 6.800000e+03
-uw 1.0 1.000000e+00
-vad_postspeech 50 50
-vad_prespeech 20 20
-vad_startspeech 10 10
-vad_threshold 2.0 2.000000e+00
-var
-varfloor 0.0001 1.000000e-04
-varnorm no no
-verbose no no
-warp_params
-warp_type inverse_linear inverse_linear
-wbeam 7e-29 7.000000e-29
-wip 0.65 6.500000e-01
-wlen 0.025625 2.562500e-02
Working run on PC-based Linux:
INFO: batch.c(729): Decoding 'dvi_0001'
INFO: cmn.c(183): CMN: 46.62 11.12 -15.49 30.30 -14.97 -22.92 -1.66 -10.51 -12.51 -1.91 3.24 6.40 -10.64
INFO: ngram_search_fwdtree.c(1553): 424 words recognized (4/fr)
INFO: ngram_search_fwdtree.c(1555): 48962 senones evaluated (466/fr)
INFO: ngram_search_fwdtree.c(1559): 21922 channels searched (208/fr), 8100 1st, 3851 last
INFO: ngram_search_fwdtree.c(1562): 532 words for which last channels evaluated (5/fr)
INFO: ngram_search_fwdtree.c(1564): 506 candidate words for entering last phone (4/fr)
INFO: ngram_search_fwdtree.c(1567): fwdtree 0.07 CPU 0.065 xRT
INFO: ngram_search_fwdtree.c(1570): fwdtree 0.19 wall 0.177 xRT
INFO: ngram_search_fwdflat.c(302): Utterance vocabulary contains 7 words
INFO: ngram_search_fwdflat.c(948): 466 words recognized (4/fr)
INFO: ngram_search_fwdflat.c(950): 11888 senones evaluated (113/fr)
INFO: ngram_search_fwdflat.c(952): 7149 channels searched (68/fr)
INFO: ngram_search_fwdflat.c(954): 847 words searched (8/fr)
INFO: ngram_search_fwdflat.c(957): 288 word transitions (2/fr)
INFO: ngram_search_fwdflat.c(960): fwdflat 0.03 CPU 0.027 xRT
INFO: ngram_search_fwdflat.c(963): fwdflat 0.03 wall 0.024 xRT
INFO: ngram_search.c(1253): lattice start node <s>.0 end node </s>.71
INFO: ngram_search.c(1279): Eliminated 2 nodes before end node
INFO: ngram_search.c(1384): Lattice has 141 nodes, 106 links
INFO: ps_lattice.c(1380): Bestpath score: -2515
INFO: ps_lattice.c(1384): Normalizer P(O) = alpha(:71:103) = -195297
INFO: ps_lattice.c(1441): Joint P(O,S) = -205515 P(S|O) = -10218
INFO: ngram_search.c(875): bestpath 0.00 CPU 0.000 xRT
INFO: ngram_search.c(878): bestpath 0.00 wall 0.000 xRT
INFO: batch.c(761): dvi_0001: 1.04 seconds speech, 0.10 seconds CPU, 0.21 seconds wall
INFO: batch.c(763): dvi_0001: 0.09 xRT (CPU), 0.20 xRT (elapsed)
undo (dvi_0001 -2770)
dvi_0001 done --------------------------------------
INFO: batch.c(778): TOTAL 1.04 seconds speech, 0.10 seconds CPU, 0.21 seconds wall
INFO: batch.c(780): AVERAGE 0.09 xRT (CPU), 0.20 xRT (elapsed)
INFO: ngram_search_fwdtree.c(432): TOTAL fwdtree 0.07 CPU 0.065 xRT
INFO: ngram_search_fwdtree.c(435): TOTAL fwdtree 0.19 wall 0.178 xRT
INFO: ngram_search_fwdflat.c(176): TOTAL fwdflat 0.03 CPU 0.027 xRT
INFO: ngram_search_fwdflat.c(179): TOTAL fwdflat 0.03 wall 0.025 xRT
INFO: ngram_search.c(303): TOTAL bestpath 0.00 CPU 0.000 xRT
INFO: ngram_search.c(306): TOTAL bestpath 0.00 wall 0.000 xRT
Non-working run on the 8308-based Linux with little endian set and the same WAV file:
INFO: batch.c(729): Decoding 'dvi_0001'
INFO: cmn.c(183): CMN: 46.62 11.11 -15.49 30.28 -14.96 -22.94 -1.67 -10.52 -12.52 -1.93 3.22 6.39 -10.65
INFO: ngram_search_fwdtree.c(1553): 399 words recognized (4/fr)
INFO: ngram_search_fwdtree.c(1555): 23001 senones evaluated (219/fr)
INFO: ngram_search_fwdtree.c(1559): 11418 channels searched (108/fr), 2527 1st, 4919 last
INFO: ngram_search_fwdtree.c(1562): 448 words for which last channels evaluated (4/fr)
INFO: ngram_search_fwdtree.c(1564): 319 candidate words for entering last phone (3/fr)
INFO: ngram_search_fwdtree.c(1567): fwdtree 1.97 CPU 1.876 xRT
INFO: ngram_search_fwdtree.c(1570): fwdtree 1.97 wall 1.876 xRT
INFO: ngram_search_fwdflat.c(302): Utterance vocabulary contains 8 words
INFO: ngram_search_fwdflat.c(948): 592 words recognized (6/fr)
INFO: ngram_search_fwdflat.c(950): 15743 senones evaluated (150/fr)
INFO: ngram_search_fwdflat.c(952): 11665 channels searched (111/fr)
INFO: ngram_search_fwdflat.c(954): 1113 words searched (10/fr)
INFO: ngram_search_fwdflat.c(957): 326 word transitions (3/fr)
INFO: ngram_search_fwdflat.c(960): fwdflat 0.98 CPU 0.930 xRT
INFO: ngram_search_fwdflat.c(963): fwdflat 0.98 wall 0.930 xRT
INFO: ngram_search.c(1253): lattice start node <s>.0 end node </s>.71
INFO: ngram_search.c(1279): Eliminated 2 nodes before end node
INFO: ngram_search.c(1384): Lattice has 115 nodes, 129 links
INFO: ps_lattice.c(1380): Bestpath score: -5280
INFO: ps_lattice.c(1384): Normalizer P(O) = alpha(:71:103) = -350950
INFO: ps_lattice.c(1441): Joint P(O,S) = -427149 P(S|O) = -76199
INFO: ngram_search.c(875): bestpath 0.01 CPU 0.010 xRT
INFO: ngram_search.c(878): bestpath 0.01 wall 0.009 xRT
INFO: batch.c(761): dvi_0001: 1.04 seconds speech, 2.96 seconds CPU, 2.96 seconds wall
INFO: batch.c(763): dvi_0001: 2.84 xRT (CPU), 2.84 xRT (elapsed)
one down (dvi_0001 -4823)
dvi_0001 done --------------------------------------
INFO: batch.c(778): TOTAL 1.04 seconds speech, 2.96 seconds CPU, 2.96 seconds wall
INFO: batch.c(780): AVERAGE 2.84 xRT (CPU), 2.84 xRT (elapsed)
INFO: ngram_search_fwdtree.c(432): TOTAL fwdtree 1.97 CPU 1.894 xRT
INFO: ngram_search_fwdtree.c(435): TOTAL fwdtree 1.97 wall 1.894 xRT
INFO: ngram_search_fwdflat.c(176): TOTAL fwdflat 0.98 CPU 0.938 xRT
INFO: ngram_search_fwdflat.c(179): TOTAL fwdflat 0.98 wall 0.938 xRT
INFO: ngram_search.c(303): TOTAL bestpath 0.01 CPU 0.010 xRT
INFO: ngram_search.c(306): TOTAL bestpath 0.01 wall 0.009 xRT
Thanx
Russ
Last edit: Russ Pitman 2016-10-05
Ok, so the scores are different. I suspect something is wrong with the language model scores, since we didn't test this part. Could you test whether pocketsphinx-0.8 and sphinxbase-0.8 work?
Also, do you compile on the device or cross-compile? Did you run the tests in sphinxbase?
Ok, I was able to reproduce this in qemu; our new LM code does not work on big endian. The sphinxbase ngram tests fail. So it's just another issue to fix.
Hi there,
We have not tried 0.8 yet; does this issue exist in that version also?
Thanks
Russ
0.8 should be fine.
I'll try to fix this issue in coming days.
Fantastic! If you get it fixed, please let me know and I will give it a go :-)
Last edit: Russ Pitman 2016-10-06
Hi Nickolay,
Have you had any chance to look at this issue?
regards
Russ
Sorry, I didn't have time to look yet; it will take a few more days.
Hi there,
I have made some progress on this today.
I constructed a grammar file and this seems to work OK on the target hardware; no language model to worry about for the moment. I am going to start updating my target application and see what sort of timings I get. I am assuming that parts of the pocketsphinx API can be called directly, to save time, so that I don't have to keep calling pocketsphinx_batch?
Russ
Last edit: Russ Pitman 2016-10-13
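On the API question: the decoder can indeed be driven directly through the C API rather than repeatedly invoking pocketsphinx_batch. A rough sketch of the usual loop against the pocketsphinx-5prealpha-era API is below; the model, dictionary, grammar and audio paths are placeholders, and the exact signatures should be checked against your installed headers:

```c
#include <pocketsphinx.h>
#include <stdio.h>

int main(void)
{
    /* All paths here are placeholders -- substitute your own. */
    cmd_ln_t *config = cmd_ln_init(NULL, ps_args(), TRUE,
                                   "-hmm",  "/en-dvi/en-us",
                                   "-dict", "/en-dvi/dvi-en-us.dict",
                                   "-jsgf", "page.gram",
                                   NULL);
    ps_decoder_t *ps = ps_init(config);
    if (ps == NULL)
        return 1;

    /* Raw 16 kHz, 16-bit mono PCM (a WAV file minus its header). */
    FILE *fh = fopen("dvi_0001.raw", "rb");
    int16 buf[512];
    size_t n;

    ps_start_utt(ps);
    while ((n = fread(buf, sizeof(int16), 512, fh)) > 0)
        ps_process_raw(ps, buf, n, FALSE, FALSE);
    ps_end_utt(ps);

    int32 score;
    const char *hyp = ps_get_hyp(ps, &score);
    printf("hypothesis: %s (%d)\n", hyp ? hyp : "(none)", score);

    fclose(fh);
    ps_free(ps);
    cmd_ln_free_r(config);
    return 0;
}
```

Keeping the decoder initialized and calling ps_start_utt/ps_process_raw per utterance avoids the model-loading cost that dominates each pocketsphinx_batch invocation.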
Ok, let me know how it works. If it is too slow, you could try the semi-continuous model instead.
https://sourceforge.net/projects/cmusphinx/files/Acoustic%20and%20Language%20Models/US%20English/cmusphinx-en-us-semi-5.1.tar.gz/download
Overall, I'm for more powerful hardware for speech-related tasks. Also, it's probably better for you to consider keyword spotting instead of a grammar. It gives more natural experience to the user.
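For reference, keyword spotting in pocketsphinx takes a keyword list via the -kws option: one phrase per line, each with a per-phrase detection threshold. The phrases and thresholds below are just hypothetical starting points that would need tuning on test data:

```
page up /1e-20/
page down /1e-20/
undo /1e-30/
```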
Thanks, will have a look. What does the semi-continuous model do?
I agree that more powerful hardware would be better; this board is just for me to get a feel for what is needed.
I will have a look at keyword spotting.
Many thanks
Russ
Last edit: Russ Pitman 2016-10-13
Just an update: Pocketsphinx is running on the 8308 and the semi-continuous model does improve things, BUT I am still looking at ways to improve latency. I am using the grammar file rather than the LM as this seems faster. I only have 112 words in the dictionary and a limited grammar, as this is for controlling a multifunction display, but since the user is only allowed certain phrases I feel that the grammar file will be sufficient.
Russ
Hi Russ
There are various config options you can use to make it faster while keeping accuracy - beams (-beam, -wbeam, -pbeam), downsampling (-ds), top-N Gaussians (-topn), phoneme loop (-pl_window). You can check
http://cmusphinx.sourceforge.net/wiki/pocketsphinxhandhelds
You just need to test accuracy on test data while changing the parameters.
Hi Nickolay,
Thanks for the info, I will go and have a look :-)
I have another question about word detection speeds and speech gaps.
In my JSGF file I have an entry like this:
page ( up* | down*)+
I find that if I speak at a normal speed, something like 'page up up down' tends to get recognised as
PAGE UP UP UP UP DOWN
but yesterday I decided to talk a little faster and this DID manage to get PAGE UP UP DOWN!
Are there any threshold settings to allow for a more normal speed of speech?
Thanks
Russ
Last edit: Nickolay V. Shmyrev 2016-11-16
You can simplify that to:
page (up | down)+
There are many parameters, for example the word insertion penalty (-wip), but to optimize them you need to prepare a test set as described in http://cmusphinx.sourceforge.net/wiki/tutorialtuning
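Put together as a complete file, the simplified grammar might look like this (the grammar and rule names here are arbitrary):

```
#JSGF V1.0;

grammar page_control;

public <command> = page ( up | down )+ ;
```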