I have used Pocketsphinx several times now for speech recognition tasks on Android devices. Most recently, I tried to detect a single word with a grammar and then obtain time stamps for each of the segments. However, the time stamps do not match the actual times in the file. The decoder always reports that the word was detected right at the beginning of the file, after 7 frames of silence (sil).
This is my JSGF grammar:
This is the output for any file where I say the German word "Kompliment":
11-15 12:15:08.680 5858-6719/de.---.app I/ASR: Phoneme: sil Start: 0 End: 7
11-15 12:15:08.686 5858-6719/de.---.app I/ASR: Phoneme: k Start: 8 End: 16
11-15 12:15:08.686 5858-6719/de.---.app I/ASR: Phoneme: oo Start: 17 End: 25
11-15 12:15:08.686 5858-6719/de.---.app I/ASR: Phoneme: m Start: 26 End: 35
11-15 12:15:08.686 5858-6719/de.---.app I/ASR: Phoneme: p Start: 36 End: 42
11-15 12:15:08.686 5858-6719/de.---.app I/ASR: Phoneme: y Start: 43 End: 49
11-15 12:15:08.687 5858-6719/de.---.app I/ASR: Phoneme: iy Start: 50 End: 56
11-15 12:15:08.687 5858-6719/de.---.app I/ASR: Phoneme: m Start: 57 End: 67
11-15 12:15:08.687 5858-6719/de.---.app I/ASR: Phoneme: ehh Start: 68 End: 80
11-15 12:15:08.687 5858-6719/de.---.app I/ASR: Phoneme: n Start: 81 End: 100
11-15 12:15:08.687 5858-6719/de.---.app I/ASR: Phoneme: t Start: 101 End: 113
11-15 12:15:08.687 5858-6719/de.---.app I/ASR: Phoneme: sil Start: 114 End: 229
11-15 12:15:08.689 5858-6719/de.---.app I/ASR: RESULT: sil k oo m p y iy m ehh n t sil
11-15 12:15:08.689 5858-6719/de.---.app I/ASR: SCORE: -3536.0
11-15 12:15:08.689 5858-6719/de.---.app I/ASR: FILE DURATION: 4.608
The last line shows the duration of the recorded file in seconds. The final segment ends at frame 229, yet the file is 4.608 seconds long, so the frame numbers clearly do not line up with the total time.
The audio is RIFF/WAVE, 16 kHz, 16-bit, mono.
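To make the mismatch concrete, here is a minimal sketch of the arithmetic I am doing. It assumes the decoder runs at Pocketsphinx's default frame rate of 100 frames per second (the `-frate` setting); if the decoder is configured differently, the numbers change accordingly. `FrameTimes` and its methods are hypothetical helpers for illustration only, not part of the Pocketsphinx API.

```java
import java.util.Locale;

/** Hypothetical helper for illustration; not part of the Pocketsphinx API. */
final class FrameTimes {
    /** Assumed frame rate in frames per second (default value of -frate). */
    static final int ASSUMED_FRAME_RATE = 100;

    /** Converts a frame index from the segmentation into seconds. */
    static double frameToSeconds(int frame, int frameRate) {
        return frame / (double) frameRate;
    }

    /** Number of frames one would expect for a file of the given length. */
    static long expectedFrames(double durationSeconds, int frameRate) {
        return Math.round(durationSeconds * frameRate);
    }

    public static void main(String[] args) {
        int lastFrame = 229;          // end frame of the final "sil" segment in the log
        double fileDuration = 4.608;  // FILE DURATION from the log, in seconds

        double covered = frameToSeconds(lastFrame, ASSUMED_FRAME_RATE);
        long expected = expectedFrames(fileDuration, ASSUMED_FRAME_RATE);

        System.out.println(String.format(Locale.ROOT,
                "segments cover %.2f s, file is %.3f s (about %d frames expected)",
                covered, fileDuration, expected));
    }
}
```

Under that assumption the segmentation accounts for only about half of the recording, which is why I suspect the frame numbers are not counted from the start of the file.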
Can you please explain why this happens and how I can get correct frame numbers?