Hello. I use keyword search and get word bounds after it is found. But the frames returned by ps_seg_frames are too big compared to the number of frames processed. I have made a small program reproducing the problem: https://yadi.sk/i/_JTvOD10ri6dt
It uses language data from https://sourceforge.net/projects/cmusphinx/files/Acoustic%20and%20Language%20Models/Russian/zero_ru_cont_8k_v3.tar.gz/download
and this raw sound file: https://yadi.sk/d/EApMmoEari6du
Program output on my machine is:
hyp: железяка
n_frames: 515
samples total: 48128
sample rate: 8000.000000
frame rate: 100
word: железяка, frames: 887-940
I.e. the word frame offsets are larger than the number of frames processed.
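To make the mismatch concrete, here is a trivial check with the numbers from the output above (the variable names are just for illustration, they are not from the test program):

    /* Rough sanity check using the figures printed above. */
    int    samples_total = 48128;
    double sample_rate   = 8000.0;   /* Hz */
    int    frame_rate    = 100;      /* frames per second */
    double max_expected  = samples_total / sample_rate * frame_rate;
    /* max_expected is about 601.6 frames (and the program reports n_frames: 515),
       yet ps_seg_frames() returns 887-940 for the keyword. */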
There was no need to start another thread, you could continue in https://sourceforge.net/p/cmusphinx/discussion/help/thread/8d848336/
To calculate the offsets properly you need to track speech/silence and restart utterances with ps_get_in_speech/ps_end_utt/ps_start_utt, like in continuous.c.
Also, in your code you do not need [...]; it just slows down loading.
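Roughly, that loop looks like this (a condensed sketch modeled on continuous.c; it assumes the 5prealpha API, i.e. ps_start_utt() without an uttid and ps_get_in_speech(), plus an already initialized decoder ps reading raw 16-bit, 8 kHz audio from fh):

    int16 buf[512];
    size_t nread;
    uint8 in_speech, utt_started = FALSE;

    ps_start_utt(ps);
    while ((nread = fread(buf, sizeof(int16), 512, fh)) > 0) {
        ps_process_raw(ps, buf, nread, FALSE, FALSE);
        in_speech = ps_get_in_speech(ps);
        if (in_speech && !utt_started)
            utt_started = TRUE;
        if (!in_speech && utt_started) {
            /* Speech just ended: close the utterance, read the result,
               then start a fresh one so that frame offsets stay meaningful. */
            ps_end_utt(ps);
            if (ps_get_hyp(ps, NULL) != NULL) {
                ps_seg_t *seg;
                for (seg = ps_seg_iter(ps); seg; seg = ps_seg_next(seg)) {
                    int sf, ef;
                    ps_seg_frames(seg, &sf, &ef);
                    printf("word: %s, frames: %d-%d\n", ps_seg_word(seg), sf, ef);
                }
            }
            ps_start_utt(ps);
            utt_started = FALSE;
        }
    }
    ps_end_utt(ps);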
I did look at continuous.c, but I still don't see where I am wrong. I also call ps_start_utt and ps_end_utt, and stop processing after the first hypothesis. If the frame offsets were lower than expected, I might have thought that I have to ignore data for which ps_get_in_speech returns 0, or something like this. But the frame offsets are larger than ps_n_frames and larger than (number_of_samples_fed / sample_rate * frame_rate); how can that be?

This is how pocketsphinx works, you need to restart the utterance on every silence.
Thanks, this seems to resolve my problem. Is this true for every mode, not only keyword search? Where is this documented?
Something strange happens.
After using this approach (restarting the utterance after each silence) my program stopped recognizing the keyword. Every time, after silence is detected the hypothesis is empty, while it does recognize the keyword if data are fed continuously until a hypothesis is found.
test.cc: start_end() restarts the utterance after silence and cont() doesn't.
synth.wav: the source audio