# We tell PocketSphinx that the user is finished saying what they wanted
# to say, and that it should make its best guess as to what that was.
self.decoder.end_utt()
# The following will get a hypothesis object with, amongst other things,
# the string of words that PocketSphinx thinks the user said.
self.hypothesis = self.decoder.hyp()
bestGuess = None
if self.hypothesis is not None:
    bestGuess = self.hypothesis.hypstr
    print 'I just heard you say: "{}"'.format(bestGuess)
# We are done with the microphone for now, so we'll close the stream.
self.stream.stop_stream()
self.stream.close()
# We have what we came for: a string representing what the user said.
# We'll now return it to the runMain function so that it can be
# processed and some meaning can be gleaned from it.
return bestGuess
I am running pocketsphinx on a Raspberry Pi 3 / Raspbian Jessie-lite and want to compare its recognition performance against a Raspberry Pi B+.
I have set -verbose yes and -logfn to psphinx.log, and I am getting lots of output, but in "grammar mode" nothing looks like what I get in "lm mode":
Q1) Is it possible to get timing for grammar mode?
Q2) for "lm mode" how do I interpret how long was the utterance, and how long did the reco take?
BTW: my reco listener loop is shown at the top of this thread (from Neil Davenport, https://github.com/bynds/makevoicedemo ).
Grammar mode should print similar numbers if you use the latest pocketsphinx.
From the lines in your log:
The 3 stages (fwdtree, fwdflat and bestpath) took 0.88 + 0.47 + 0.01 or 1.36 seconds of CPU time. The speed of decoding was 0.356 + 0.190 + 0.004 or 0.55xRT, which means 1 second of speech was decoded in 0.55 seconds of CPU time. The total length of the audio is 1.36 / 0.55 or 2.47 seconds, but that is probably irrelevant; the xRT number is what matters.
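The arithmetic above can be sketched in a few lines of Python; the stage names and values are taken from the quoted log numbers, while the variable names are my own:

```python
# Per-stage CPU seconds and per-stage xRT, as quoted from the log above.
cpu_times = {"fwdtree": 0.88, "fwdflat": 0.47, "bestpath": 0.01}
xrt = {"fwdtree": 0.356, "fwdflat": 0.190, "bestpath": 0.004}

total_cpu = sum(cpu_times.values())   # total CPU time: 1.36 s
total_xrt = sum(xrt.values())         # overall speed: 0.55 xRT
audio_len = total_cpu / total_xrt     # implied audio length: ~2.47 s

print("total CPU: %.2f s, %.2f xRT, audio: %.2f s"
      % (total_cpu, total_xrt, audio_len))
```

Since xRT = CPU time / audio time, dividing total CPU time by the overall xRT recovers the audio length.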
Superb - Thank You! Exactly what I needed.
xRT is a measure only "in-speech", correct?
total-wall-time is the sum of times from start of speech to end of speech, including inter-word silences?
Is there a term for total-cpu-time / total-wall-time?
result_14.txt
::::::::::::::
Utterances=14
CpuTime=56.42 seconds
CPU xRealTime=0.926 or 92.6% of one core
Actual Speech=60.9287 seconds
Utterances=92.89 seconds total
66% of utterances were speech
::::::::::::::
result_63.txt
::::::::::::::
Utterances=63
CpuTime=75.7 seconds
CPU xRealTime=0.52 or 52% of one core
Actual Speech=145.577 seconds
Utterances=189.77 seconds total
77% of utterances were speech
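The summary lines in the two result files can be reproduced from the raw figures; a minimal sketch (the dict layout and field names are my own, the numbers are from the files above):

```python
# CPU seconds, seconds of actual speech, and total utterance seconds
# for each result file quoted above.
results = {
    "result_14.txt": {"cpu": 56.42, "speech": 60.9287, "total": 92.89},
    "result_63.txt": {"cpu": 75.7,  "speech": 145.577, "total": 189.77},
}

for name, r in sorted(results.items()):
    xrt = r["cpu"] / r["speech"]                   # CPU xRealTime
    speech_pct = 100.0 * r["speech"] / r["total"]  # speech share of utterances
    print("%s: %.3f xRT, %.0f%% of utterance time was speech"
          % (name, xrt, speech_pct))
```

This reproduces 0.926 xRT / 66% for result_14.txt and 0.52 xRT / 77% for result_63.txt.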
Yes
Well, it is not exactly with silence included. Silence is always filtered out in processing and is not accounted for in the performance computation.
It's more about how much system time it took to process the speech. When you process from a microphone, yes, it waits for input, and the time it simply waits is included. When you process from a file, it is just the time taken to process the speech. Wall time also accounts for the machine doing something else; for example, if you are running some other computation, that will be included in the wall time.
Not really
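The CPU-time vs wall-time distinction described above can be demonstrated independently of pocketsphinx with the standard library's two clocks; this is only an illustration, not pocketsphinx code:

```python
import time

wall_start = time.monotonic()      # wall clock: includes time spent waiting
cpu_start = time.process_time()    # CPU clock: only time this process computes

time.sleep(0.5)                    # waiting on input (like a silent microphone)
sum(i * i for i in range(10**6))   # actual computation

wall = time.monotonic() - wall_start
cpu = time.process_time() - cpu_start

# The sleep shows up in wall time but contributes (almost) nothing to CPU time.
print("wall: %.2f s, cpu: %.2f s" % (wall, cpu))
```

When decoding from a file there is no waiting, so the two numbers converge; from a microphone, wall time grows with every pause.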
Nickolay - you are credited in Alan's Pi 3 Road Test Using CMU PocketSphinx: https://goo.gl/RrGgCm
and the video: https://vimeo.com/169445418
Thanks for your help.
Hey, Alan, many thanks, this is a very important publication. I shared it on our blog too:
http://cmusphinx.sourceforge.net/2016/06/should-you-select-raspberry-pi-3-or-raspberry-pi-b-for-cmusphinx/
Actually, it would be very interesting to evaluate keyword spotting mode too, which is supposed to be the primary operation mode for IoT. It might also be interesting to play with the decoding parameters for LVCSR; it might be reasonably fast and accurate after tuning.