Does it give me timespamps for each phoneme (like pocketsphinx), or only per word?
No, not yet, unfortunately. The Sphinx4 aligner is still at the development stage, so it might not be that great; an accurate algorithm would require some more attention.
The pocketsphinx aligner is very fast. Someone wrote that the Sphinx4 aligner takes 30min for 3min of audio. Is that still correct?
No, it should not be that bad; the alignment should run in near real time.
Actually, I recommend checking this project:
https://github.com/lowerquality/gentle
It should output exactly what you need - words and phonemes with timestamps:
https://github.com/lowerquality/gentle/blob/master/tests/data/lucier_golden.json
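For reference, gentle's JSON output (as in the linked lucier_golden.json) lists each word with absolute start/end times plus its phones with per-phone durations, so absolute phoneme times are recovered by accumulating durations from each word's start. A small sketch of that; the field names follow gentle's output format, but the sample timing values here are invented for illustration:

```python
import json

# Made-up sample in the shape of gentle's output: each word carries absolute
# "start"/"end" times, while its "phones" carry durations only.
sample = json.loads("""
{
  "words": [
    {"word": "hey", "case": "success", "start": 0.50, "end": 0.86,
     "phones": [{"phone": "hh_B", "duration": 0.12},
                {"phone": "ey_E", "duration": 0.24}]},
    {"word": "move", "case": "success", "start": 1.20, "end": 1.62,
     "phones": [{"phone": "m_B", "duration": 0.10},
                {"phone": "uw_I", "duration": 0.20},
                {"phone": "v_E", "duration": 0.12}]}
  ]
}
""")

def phone_times(doc):
    """Yield (phone, start, end) with times relative to the whole file."""
    for word in doc["words"]:
        if word.get("case") != "success":  # skip words gentle failed to align
            continue
        t = word["start"]
        for p in word["phones"]:
            yield p["phone"], t, t + p["duration"]
            t += p["duration"]

for phone, start, end in phone_times(sample):
    print(f"{phone:6s} {start:.2f} - {end:.2f}")
```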
Thanks a lot!
Hi Daniel! I have the exact same requirement. I see you have successfully implemented this in Rhubarb Lip Sync. I am trying to accomplish something similar and was hoping you could give me some pointers. I am hoping to work on a project for GSoC this summer.
I have sent you a message on SourceForge regarding the same.
Hello, Daniel and Nickolay! I found that in rhubarb-lip-sync the VAD part is done by WebRTC. Are there any solutions that do VAD and forced phoneme alignment in a single unified framework with only pocketsphinx? Since pocketsphinx can do VAD, I think this may be possible.
Hi Daniel, Nickolay,
I am trying to solve a very similar problem, and I spent hours trying to figure out what the example code in test_state_align.c does (unfortunately, there are literally no comments in it).
I was finally able to figure out how to use this phoneme alignment function and managed to print out more accurate phonemes (and timestamps). However, I am now running into the same problem that Daniel is facing.
If, for example, I try to get the phoneme timestamps for the following sentence:
Hey <0.3 sec pause> Move
The phonemes I'm getting are HH, EY, M, UW, V. These are the correct phonemes, but the problem is that the 0.3 sec pause is being added to the duration of the "M" phoneme.
Is there any way I can fix it?
Thanks
Last edit: Ezra Miller 2016-02-11
If you have silence between "hey" and "move", you need to insert <sil> there in the alignment word sequence before starting the alignment.
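Concretely, the fix amounts to interleaving `<sil>` tokens with the words before handing the sequence to the aligner. A tiny pure-Python helper (independent of pocketsphinx) sketching that; the `edge_sil` option is just a convenience for also padding both ends of the utterance:

```python
def with_silences(words, edge_sil=True):
    """Interleave "<sil>" between words; optionally pad both ends too,
    so leading/trailing non-speech can be absorbed by a silence token."""
    out = []
    if edge_sil:
        out.append("<sil>")
    for i, w in enumerate(words):
        out.append(w)
        if i < len(words) - 1 or edge_sil:
            out.append("<sil>")
    return out

print(" ".join(with_silences(["hey", "move"])))
# "<sil> hey <sil> move <sil>"
```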
Thank you Nickolay,
By adding "<sil>" between every word I was able to fix long gaps between phonemes.
However, if the recording starts with a silence, this offset gets ignored when looking at phoneme time stamps.
Is there a way to know at what time the first phoneme (in my example, "HH") starts, relative to the actual file rather than the word? (In other words, if there is 0.5 sec of silence/non-speech at the beginning of the file, I would like to see "HH" start at around 0.5s.)
Thanks again!
Pocketsphinx performs voice activity detection, so you need to shift the results by the detected offset if you want to print them; acmod_stream_offset returns that offset. You can also disable voice activity detection with -remove_silence no, though I don't really recommend doing that.
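In other words, the aligner's frame indices are relative to where VAD started passing audio, and file-relative times are obtained by adding the stream offset back in. A sketch of the arithmetic, under two stated assumptions: the usual pocketsphinx defaults of 100 feature frames per second and 16 kHz input, and that acmod_stream_offset reports the offset in samples (worth verifying against the source):

```python
SAMPLE_RATE = 16000  # assumed input sample rate
FRAME_RATE = 100     # pocketsphinx default: one feature frame per 10 ms

def absolute_time(frame_index, stream_offset_samples):
    """Convert a decoder frame index (relative to where VAD started
    passing audio) into seconds relative to the start of the file,
    assuming the stream offset is reported in samples."""
    return stream_offset_samples / SAMPLE_RATE + frame_index / FRAME_RATE

# 0.5 s of leading silence skipped by VAD = 8000 samples at 16 kHz,
# so frame 0 of the alignment maps back to t = 0.5 s in the file.
print(absolute_time(0, 8000))   # 0.5
print(absolute_time(30, 8000))  # 0.8
```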
I wanted to thank you again for your advice.
Running with the -remove_silence no flag seems to work pretty well for my use case.
Also, as I mentioned earlier, adding "<sil>" between every word solved the problem with the long gaps. If there is a short gap between words, then SIL gets a duration of 0.03 seconds (this seems to be the minimum); in that case I simply ignore it and add its duration to the following phoneme.
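That post-processing step - dropping the minimum-length SIL entries and folding their duration into the next phoneme - can be sketched as a small pure-Python pass over (phone, duration) pairs. The 0.035 s threshold is an assumption chosen to sit just above the 0.03 s floor observed above:

```python
SIL_THRESHOLD = 0.035  # just above the ~0.03 s minimum SIL the aligner emits

def merge_short_sil(segments, threshold=SIL_THRESHOLD):
    """segments: ordered list of (phone, duration) pairs.
    Drop SIL entries at or below the threshold, adding their duration
    to the phone that follows (or the one before, at the very end)."""
    out = []
    carry = 0.0
    for phone, dur in segments:
        if phone == "SIL" and dur <= threshold:
            carry += dur
            continue
        out.append((phone, dur + carry))
        carry = 0.0
    if carry and out:  # trailing short SIL: fold into the last phone
        phone, dur = out[-1]
        out[-1] = (phone, dur + carry)
    return out

# The "hey <pause> move" example: the spurious 0.03 s SIL disappears
# and its duration is added to the "M" that follows it.
print(merge_short_sil([("HH", 0.08), ("EY", 0.20), ("SIL", 0.03),
                       ("M", 0.10), ("UW", 0.21), ("V", 0.12)]))
```

Genuinely long silences (above the threshold) pass through untouched, so real pauses keep their own SIL segment.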
Appreciate your help, it works great now!
Does this also work for Arabic?