PocketSphinx text alignment output for the sentence "Winter is cold here":
0 2.72 <s>
2.73 2.87 WINTER
2.80 4.25 IS
4.26 6.57 COLD
6.58 7.36 HERE
7.37 8.21 </s>
Hi Everyone,
I am using PocketSphinx's ps_alignment to align a sentence to a wave file and get the duration and spectral score of each word. I added the words to the alignment with the helper function ps_alignment_add_word().
My problem is this. Suppose the sentence to align is "How are you".
The speaker leaves a long silence between the words ARE and YOU, i.e. How are .................. you.
In Sphinx3, force alignment inserts a SIL phone between ARE and YOU by default, so the Sphinx3 output after text alignment is:
How are SIL you
PocketSphinx does not insert the SIL phone, so the alignment is incorrect.
1) How can I solve this issue?
2) Can optional silence be added between words in text alignment?
You can check the word times and insert additional silence tags into the decoding result.
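As a rough illustration of that suggestion, here is a minimal Python sketch that post-processes a list of word times and adds a silence entry wherever two consecutive words are separated by a gap. The function name, the 0.2 s threshold, the "SIL" tag and the sample times are all assumptions made up for this example; none of them come from PocketSphinx. It can only label gaps that are already visible in the reported times, so it does not help when the aligner has stretched a word over the pause.

```python
from typing import List, Optional, Tuple

# (start_sec, end_sec, word); this tuple layout is an assumption for the sketch
Segment = Tuple[float, float, str]


def insert_silence_tags(
    segments: List[Segment], min_gap: float = 0.2, sil: str = "SIL"
) -> List[Segment]:
    """Return a new list with a silence entry in every gap of at least min_gap seconds."""
    out: List[Segment] = []
    prev_end: Optional[float] = None
    for start, end, word in segments:
        # A gap between the previous word's end and this word's start becomes silence.
        if prev_end is not None and start - prev_end >= min_gap:
            out.append((prev_end, start, sil))
        out.append((start, end, word))
        prev_end = end
    return out


if __name__ == "__main__":
    # Hypothetical word times for "How are ... you" with a long pause before YOU.
    words = [(0.10, 0.45, "HOW"), (0.46, 0.80, "ARE"), (2.30, 2.70, "YOU")]
    for segment in insert_silence_tags(words):
        print(segment)
```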
Hi Nickolay,
Thanks for your reply.
Maybe I was not clear in describing the problem. Below is a detailed example of the issue.
The sentence I tried to align is: Winter is cold here
Word time intervals for the sentence in the wave file (manually marked using WaveSurfer):
Sphinx3 force alignment output:
PocketSphinx text alignment output:
As the example above shows, Sphinx3 force alignment inserts <sil> where there is silence, and its word alignment accuracy is good.
With PocketSphinx the word alignment is poor: because no silence phone is inserted, words are aligned to the wrong parts of the wave file, including the silent parts.
My question is:
In an FSG we can have optional silence before and after a word, and the decoder inserts SIL only when silence is present in the audio. The same is available in Sphinx3 force alignment.
Can the same thing be implemented using ps_alignment?
Last edit: Nickolay V. Shmyrev 2016-09-26
Alignment is designed for when you need phone times. If you just need word times, you can build an FSG and recognize with the FSG grammar; it will insert optional silence.
Last edit: Nickolay V. Shmyrev 2016-09-26
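For anyone who wants to try the FSG route, the sketch below writes a linear grammar for one sentence in the Sphinx FSG text format (FSG_BEGIN, NUM_STATES, START_STATE, FINAL_STATE, TRANSITION, FSG_END) as I understand it. The function name and the grammar name are made up for the example. The words must be spelled exactly as they appear in your dictionary, and the grammar would be loaded through the decoder's -fsg option; no silence is written here because, per the reply above, the decoder adds optional silence itself.

```python
from typing import List


def sentence_to_fsg(words: List[str], name: str = "sentence") -> str:
    """Build a linear FSG: state i goes to state i+1 on the i-th word."""
    if not words:
        raise ValueError("need at least one word")
    lines = [
        f"FSG_BEGIN {name}",
        f"NUM_STATES {len(words) + 1}",
        "START_STATE 0",
        f"FINAL_STATE {len(words)}",
    ]
    for i, word in enumerate(words):
        # Exactly one path through the grammar, so every transition gets probability 1.0.
        lines.append(f"TRANSITION {i} {i + 1} 1.0 {word}")
    lines.append("FSG_END")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(sentence_to_fsg("WINTER IS COLD HERE".split()), end="")
```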
Hi Nickolay,
Thanks for the reply.
I also need the phone times and phone spectral scores in addition to the word times and word spectral scores.
1) An FSG built at the word level does not give phone times and phone scores.
Is it a good idea to first run the audio through the FSG recognizer to get the recognized text with optional silence inserted, and then give that recognized text as input to text alignment?
For the example audio, the FSG recognizer gives: SIL WINTER SIL IS SIL COLD SIL HERE SIL.
So the input text to text alignment (the ps_alignment method) will be: WINTER SIL IS SIL COLD SIL HERE
2) Is there any alternative procedure I can follow to get both word and phone times and spectral scores?
You should have mentioned that in the original question.
Yes you can do that.
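The glue step between the two passes could look roughly like this: take the FSG hypothesis string, drop the silence at the very start and end, and keep the silences between words, which reproduces the WINTER SIL IS SIL COLD SIL HERE example from the question. The function name and the set of silence and sentence-boundary tokens are assumptions for the sketch; each returned token would then be added to the alignment with ps_alignment_add_word().

```python
from typing import Iterable, List


def fsg_hyp_to_alignment_words(
    hyp: str, sil_tokens: Iterable[str] = ("SIL", "<sil>")
) -> List[str]:
    """Strip edge silences and sentence markers; keep one silence between words."""
    sil = set(sil_tokens)
    # Sentence boundary markers are dropped everywhere (assumed to be <s> and </s>).
    tokens = [t for t in hyp.split() if t not in ("<s>", "</s>")]
    # Remove silence at the very beginning and the very end.
    while tokens and tokens[0] in sil:
        tokens.pop(0)
    while tokens and tokens[-1] in sil:
        tokens.pop()
    # Collapse runs of consecutive silences into a single entry.
    out: List[str] = []
    for token in tokens:
        if token in sil and out and out[-1] in sil:
            continue
        out.append(token)
    return out


if __name__ == "__main__":
    print(fsg_hyp_to_alignment_words("SIL WINTER SIL IS SIL COLD SIL HERE SIL"))
```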
Hi Nickolay,
Thanks for the reply.
Sorry, I forgot to point out that I also need phone times.
I will follow those steps: FSG first, then text alignment.
Thanks.
Kalita, I think this is because by default pocketsphinx adds invisible optional silence to every FSG state. See lines 234-236 and 90-111 of http://cmusphinx.sourceforge.net/doc/pocketsphinx/fsg__search_8c_source.html
Try your FSGs with explicit silence with the '-fsgusefiller no' switch and see if that works, please?
Can you get time alignments and acoustic scores using '-backtrace yes' ? I am just getting back into PocketSphinx after giving up on it in 2010 for reasons that I think have been addressed.
Last edit: James Salsman 2017-01-21
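To make the explicit-silence experiment concrete, here is a hedged sketch that generates a sentence FSG with a silence self-loop written on every state, mirroring what the decoder is said above to add by default. The function name, the <sil> token and the 0.1 self-loop probability are assumptions chosen for illustration; whether the decoder accepts and uses such a grammar with '-fsgusefiller no' is exactly what would need testing.

```python
from typing import List


def sentence_to_fsg_with_silence(
    words: List[str],
    sil: str = "<sil>",
    sil_prob: float = 0.1,
    name: str = "sentence_sil",
) -> str:
    """Build a linear FSG with an explicit optional-silence self-loop on each state."""
    if not words:
        raise ValueError("need at least one word")
    n_states = len(words) + 1
    lines = [
        f"FSG_BEGIN {name}",
        f"NUM_STATES {n_states}",
        "START_STATE 0",
        f"FINAL_STATE {n_states - 1}",
    ]
    for state in range(n_states):
        # Self-loop: silence may occur zero or more times before the next word
        # (and, on the final state, after the last word).
        lines.append(f"TRANSITION {state} {state} {sil_prob:.2f} {sil}")
        if state < len(words):
            lines.append(
                f"TRANSITION {state} {state + 1} {1.0 - sil_prob:.2f} {words[state]}"
            )
    lines.append("FSG_END")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(sentence_to_fsg_with_silence("WINTER IS COLD HERE".split()), end="")
```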