My application needs to extract phones from an audio recording, for which I'm using pocketsphinx. I've noticed that the accuracy of the detected phones varies a lot between recordings, but I can't figure out why.
I've followed the tips regarding Why my accuracy is poor. In particular, all files are raw 16kHz 16bit mono.
The following samples were processed with a custom build of pocketsphinx from version 5prealpha (2015-08-05), using the exact command line given in the article Phoneme Recognition (caveat emptor).
The first recording has a woman saying, "Smearing fruit on this art wouldn't make it better art; at least not to me." Raw file -- WAVE file
Output is SIL S N IH R IH NG F EY ER D AA N IH S AY R EH G D W ER Y UH M EY K IH P EH T AE R EH R D SIL IH D M IY S N EH D T IH M IY T
This isn't perfect, but quite close.
The second recording has a man saying, "Oh, well -- huh -- I find it does some interesting things with the space between viewer and image." Raw file -- WAVE file
Output is SIL OW M AA K UW HH L OW HH UH T D AH HH AW B AH D V AY D IH D T AH S AH M UH D B IH N T R IH S T IH NG S T IH NG Z G UH P AH S S P EY IY S P IY CH W IY N UW ZH IH Y UW ER IH D M IY T IH M IH UW D SH.
The first few phones are alright, but the rest has almost nothing to do with the actual dialog.
I don't understand why the first file gives good results, but the second one doesn't. Both files have identical technical properties, and both people are speaking clearly. To me, the second speaker even sounds clearer than the first one.
Is there anything I can do to improve the accuracy of the results?
I don't know much about speech recognition. It would be great if someone who does could tell me whether any of the following ideas make sense.
My command line looks like this: pocketsphinx_continuous.exe -infile <file> -hmm <directory> -allphone <phonetic lm> -backtrace yes -beam 1e-20 -pbeam 1e-20 -lw 2.0. Are there any settings I could tweak to get better results?
I'm currently using the model that comes with pocketsphinx 5prealpha (2015-08-05). Does it make sense to look for better models, or are they only for word (not phoneme) recognition?
Besides pocketsphinx, there are also Sphinx3 and Sphinx4. I haven't looked at those yet. Does either of them offer better accuracy at phoneme recognition?
If I perform word recognition instead of phoneme recognition, the results are quite good. Using the pronunciation dictionary, I could then convert the recognized words back to phonemes. The problem is, I need timestamps for each phoneme. Is there any way to get timestamps for the individual phonemes within a recognized word? I think this isn't possible in pocketsphinx, but maybe it's possible in Sphinx3 or Sphinx4?
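To illustrate the last idea: mapping recognized words back to phonemes would essentially be a dictionary lookup, roughly like the sketch below (the dictionary path is a placeholder). The catch is that this only gives me the phone sequence, not the per-phone timestamps I need.

#include <fstream>
#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Build word -> phoneme list from a CMUdict-style text file ("word PH1 PH2 ...").
std::map<std::string, std::vector<std::string>> loadDictionary(const std::string& path) {
    std::map<std::string, std::vector<std::string>> dictionary;
    std::ifstream file(path);
    std::string line;
    while (std::getline(file, line)) {
        std::istringstream stream(line);
        std::string word, phone;
        stream >> word;
        std::vector<std::string> phones;
        while (stream >> phone) phones.push_back(phone);
        if (!word.empty()) dictionary[word] = phones;
    }
    return dictionary;
}

int main() {
    // "cmudict-en-us.dict" is a placeholder path.
    auto dictionary = loadDictionary("cmudict-en-us.dict");
    for (const std::string& word : {"smearing", "fruit"}) {
        std::cout << word << ":";
        for (const auto& phone : dictionary[word]) std::cout << " " << phone;
        std::cout << "\n"; // Phones only -- their timestamps cannot be recovered this way.
    }
    return 0;
}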
Dear Daniel,
It's not like you can select a decoder to access phoneme times arbitrarily.
You could describe the application you want to implement in order to get a better answer on how to make it.
Hi Nickolay,
Thanks for your reply. I'm writing a tool to automatically create lip-sync animation for characters in games. My tool takes a voice recording, extracts the phonemes with their timestamps, then uses a set of rules to determine mouth positions based on these phonemes. This process happens during production of the game, so it doesn't have to work on a live stream.
Because I'm using pocketsphinx's output for animation, I need a precise timestamp for each individual phoneme.
So do you know the text beforehand or not?
I'm writing the tool for someone who needs a simple solution that works only with an audio file without transcription. So I need a solution that works without the text.
But once I have that working, I might add support for a second mode that takes text, too. How would that make things simpler?
OK, if you don't have the text, you can recognize first.
Once you have recognized, you'd better use the ps_alignment API, which is available in pocketsphinx. You can find the demo here:
https://github.com/cmusphinx/pocketsphinx/blob/master/test/unit/test_alignment.c
The header is not installed with pocketsphinx yet, unfortunately; you have to copy it into your project.
Okay, so let me make sure I understand you.
First, I run a regular word recognition on the sound file. That will give me the text.
Then I give both the text and the original sound file to ps_alignment. That will align the text with the sound file, giving me timestamps not just for every word, but for each individual phoneme.
Is that correct? If so, that would be really cool.
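If it helps, this is roughly how I picture step 1 in code (a sketch against the public API; the model paths and input file are placeholders, and error handling is omitted):

#include <pocketsphinx.h>
#include <cstdio>

int main() {
    // Placeholder paths -- substitute the actual acoustic model, LM, and dictionary.
    cmd_ln_t* config = cmd_ln_init(NULL, ps_args(), TRUE,
        "-hmm", "model/en-us",
        "-lm", "model/en-us.lm.bin",
        "-dict", "model/cmudict-en-us.dict",
        NULL);
    ps_decoder_t* decoder = ps_init(config);

    FILE* file = fopen("recording.raw", "rb"); // raw 16 kHz, 16-bit, mono
    ps_start_utt(decoder);
    int16 buffer[512];
    size_t sampleCount;
    while ((sampleCount = fread(buffer, sizeof(int16), 512, file)) > 0) {
        ps_process_raw(decoder, buffer, sampleCount, FALSE, FALSE);
    }
    ps_end_utt(decoder);

    int32 score;
    const char* text = ps_get_hyp(decoder, &score); // step 1: the recognized text
    printf("%s\n", text ? text : "");

    fclose(file);
    ps_free(decoder);
    cmd_ln_free_r(config);
    return 0;
}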
It seems to me that test_state_alignment.c demonstrates just what I need. There are two points I'm not yet sure about:
The program adds the dummy words <s> and </s> to the alignment. I understand from the code that these mean start/end of sentence. What purpose do they serve and where should I use them?
The function do_search that does the actual alignment is called 5 times before evaluating the results. Does repeating the search improve the results or is that just for testing?
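For context, this is how I read the construction part of the test (a condensed sketch, assuming dict, d2p, and the audio processing are set up the way the test does it; I've left out the actual search run):

// Assumes 'dict' (dict_t*) and 'd2p' (dict2pid_t*) have already been set up as
// in the test; ps_alignment.h is one of the headers that has to be copied into
// the project since it isn't installed.
ps_alignment_t* alignment = ps_alignment_init(d2p);

// <s> and </s> are the dummy start/end-of-utterance words from the dictionary.
ps_alignment_add_word(alignment, dict_wordid(dict, "<s>"), 0);
ps_alignment_add_word(alignment, dict_wordid(dict, "smearing"), 0); // duration 0, as in the test
ps_alignment_add_word(alignment, dict_wordid(dict, "fruit"), 0);
// ... one call per recognized word ...
ps_alignment_add_word(alignment, dict_wordid(dict, "</s>"), 0);
ps_alignment_populate(alignment);

// The test then runs the state-align search over the audio (the do_search calls)
// before iterating ps_alignment_phones() for per-phone start frames and durations.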
The program adds the dummy words <s> and </s> to the alignment. I understand from the code that these mean start/end of sentence. What purpose do they serve and where should I use them?
They mark the start and the end of the utterance.
The function do_search that does the actual alignment is called 5 times before evaluating the results. Does repeating the search improve the results or is that just for testing?
It is just for testing.
ps_alignment_add_word takes the word's duration as third argument. The test code simply passes 0, which seems to work fine. Does it make sense to pass an actual (non-zero) value, or will that value be ignored? In other words, does passing an actual value somehow speed up the alignment process or make it more precise?
No.
Hi Nickolay,
Thanks for your answer. I'm not sure what you mean by "no": no, it doesn't make sense to pass a value, or no, the value won't be ignored?
Could you tell me what the value is used for, given that the alignment algorithm will determine the durations anyway?
It doesn't make sense to pass a value.
Thanks Nickolay! I'm now doing word recognition followed by alignment, and the results are much better than with phoneme recognition.
The alignment works fine for speech without long pauses. However, I just tested it with a recording where the speaker takes long pauses between sentences, each up to 1.2s. The timing within each sentence is correct, but the alignment algorithm seems to cut the pauses short. That means that after each sentence, the reported times are more and more wrong.
My file is 58.7s long, but the last phone alignment reported by ps_alignment_iter_get ends at 55.1s. So by shortening each pause, the alignment algorithm ended up 3.6s short!
I've managed to fix the timing by setting -vad_prespeech and -vad_postspeech to 3000 each. However, I don't understand why these settings influence the timing returned by ps_alignment_iter_get. Is that intended?
Do high values for -vad_prespeech and -vad_postspeech bring any disadvantages, regarding performance or anything else?
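For reference, this is how I'm setting those two options at the moment (assuming I'm using the config API correctly; config is the cmd_ln_t* I initialize everything with):

// Equivalent to passing -vad_prespeech 3000 -vad_postspeech 3000 on the command line.
cmd_ln_set_int_r(config, "-vad_prespeech", 3000);
cmd_ln_set_int_r(config, "-vad_postspeech", 3000);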
Long pauses are harmful for accuracy because they do not allow the decoder to estimate channel properties reliably.
You need to use voice activity detection to split the audio into chunks, and you need to process each chunk separately, as in pocketsphinx_continuous.
Sounds good. This may even allow me to process the chunks in separate parallel threads, speeding up the process.
Is there any sample code showing how to perform voice activity detection in order to split the sound into chunks?
In continuous.c:
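Roughly, the pattern there looks like this (a condensed sketch of the chunking logic, not the verbatim file; decoder is an initialized ps_decoder_t*, and readAudio stands in for reading from the file or microphone):

ps_start_utt(decoder);
bool utteranceStarted = false;
int16 buffer[2048];
size_t sampleCount;
while ((sampleCount = readAudio(buffer, 2048)) > 0) { // readAudio is a hypothetical stand-in
    ps_process_raw(decoder, buffer, sampleCount, FALSE, FALSE);
    bool inSpeech = ps_get_in_speech(decoder);
    if (inSpeech && !utteranceStarted) {
        utteranceStarted = true; // speech has begun
    }
    if (!inSpeech && utteranceStarted) {
        // Silence after speech: close the utterance, take the result, start the next one.
        ps_end_utt(decoder);
        int32 score;
        const char* hypothesis = ps_get_hyp(decoder, &score);
        // ... use 'hypothesis' for this chunk ...
        ps_start_utt(decoder);
        utteranceStarted = false;
    }
}
ps_end_utt(decoder);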
Thanks a lot, Nickolay! When I originally read the code, I didn't realize it was splitting the sound into chunks.
I understand now how your code calls ps_start_utt and ps_end_utt depending on the result of ps_get_in_speech. However, when I try to do the same thing for alignment rather than recognition, I can't get it to work. Whenever I iterate over the alignment, all entries have start=0 and duration=0. It works fine when I call acmod_start_utt and acmod_end_utt only once at the start and end, followed by ps_alignment_phones.
Here's a shortened version of my code:
ps_search_start(search.get());
acmod_start_utt(acousticModel);
bool utteranceStarted = false;

// Process entire sound file
while (true) {
    // <Fill buffer or break (omitted)...>

    const int16* nextSample = buffer.data();
    size_t remainingSamples = buffer.size();
    while (acmod_process_raw(acousticModel, &nextSample, &remainingSamples, false) > 0) {
        while (acousticModel->n_feat_frame > 0) {
            ps_search_step(search.get(), acousticModel->output_frame);
            acmod_advance(acousticModel);
        }
    }

    bool inSpeech = ps_get_in_speech(&recognizer);
    if (inSpeech && !utteranceStarted) {
        utteranceStarted = true;
    }
    if (!inSpeech && utteranceStarted) {
        acmod_end_utt(acousticModel);

        // Extract phones with timestamps
        for (ps_alignment_iter_t* it = ps_alignment_phones(alignment.get()); it; it = ps_alignment_iter_next(it)) {
            // Get timing
            ps_alignment_entry_t* phoneEntry = ps_alignment_iter_get(it);
            int startFrame = phoneEntry->start;
            int duration = phoneEntry->duration;
            // <Store values (omitted)...>
        }

        acmod_start_utt(acousticModel);
        utteranceStarted = false;
    }
}

if (utteranceStarted) {
    // <Same ps_alignment_phones code as above (omitted)...>
}
ps_search_finish(search.get());
I'm sorry, from your code it is not clear whether you update the alignment sequence for every chunk or use one big alignment. You need to recognize first to get the word sequence, then use this word sequence to construct the alignment, then get the phone sequence. If the alignment word sequence does not match the audio, you will not get proper phone times, of course.
I'm sorry if I wasn't clear.
First I'm doing word recognition. That gives me a list of words for the entire recording.
Then I'm creating a ps_alignment_t with those words.
Then I'm aligning the words to the original recording.
Everything is working fine, with one exception: if the recording is longer than about 5 minutes, I get a malloc error during alignment. I think that is because I'm aligning the entire recording as one single utterance.
You showed me how to break word recognition into utterances. But now I'm trying to break alignment into utterances, too. I have a long recording and an alignment structure with the words for the entire recording. Is there a way to do continuous alignment on them? The code above shows my attempt, but it doesn't work.
Long audio alignment is supported in sphinx4; it is not supported in pocketsphinx. If you want to build long audio alignment with pocketsphinx, you need to recognize/align each utterance first. Usually this is done with a grammar constructed from the text or with a biased language model. There is a whole body of research on the subject cited on our wiki.
A 5-minute audio file requires long audio alignment, and that is not a trivial algorithm.
Once you have aligned the utterances and you know the text for each utterance, you can align to phones.
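In other words, something like the following per-utterance scheme (a sketch only; AudioChunk and alignPhonesForChunk are hypothetical stand-ins for the VAD split and the per-chunk recognition/alignment described above):

#include <cstdint>
#include <string>
#include <vector>

struct PhoneTiming {
    std::string phone;
    double startSeconds;
    double durationSeconds;
};

struct AudioChunk {                 // hypothetical: one VAD chunk plus its position in the file
    std::vector<int16_t> samples;
    size_t startSample;             // offset of the chunk within the whole recording
};

// Hypothetical helper: recognize + align one chunk, returning chunk-relative phone timings.
std::vector<PhoneTiming> alignPhonesForChunk(const AudioChunk& chunk);

std::vector<PhoneTiming> alignWholeRecording(const std::vector<AudioChunk>& chunks) {
    std::vector<PhoneTiming> allPhones;
    for (const AudioChunk& chunk : chunks) {
        double chunkOffsetSeconds = chunk.startSample / 16000.0; // 16 kHz input
        for (PhoneTiming phone : alignPhonesForChunk(chunk)) {
            phone.startSeconds += chunkOffsetSeconds; // shift chunk-relative time to file time
            allPhones.push_back(phone);
        }
    }
    return allPhones;
}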
Thanks for the clarification! I might check out Sphinx4 then. Just two questions about the long audio aligner in Sphinx4: