I am working on an experimental project in which I want to recognize singing as Do Re Mi Fa So La Ti...
What's the best approach for this?
I have tried the steps below, but the recognition performance is poor. What I did is based on the DialogDemo example in sphinx4.
Change or add words in cmudict-en-us.dict for "Do, Re, Mi...". I modified the pronunciations to reflect the closest pronunciation; for example, for Do I put 'D OW' instead of 'D UW' (see the sketch after these steps).
Change digits.grxml to use Do, Re, Mi, etc. in place of the digits.
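For reference, the dictionary entries look roughly like this (only the 'do' line is the one I actually described above; the ARPAbet phones for the other syllables are illustrative guesses, not entries that ship with cmudict):
do D OW
re R EY
mi M IY
fa F AA
so S OW
la L AA
ti T IY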
After these changes it can basically recognize some of my singing, but the error rate is really high.
What can I do to improve?
Shall I try to adapt the existing en-us acoustic model, or shall I create a new acoustic model from scratch?
I should mention that even recognizing the digits gave me a poor recognition rate. I guess my microphone may not be good enough?
Regarding bad accuracy on digits, this could be due to a bad mic or a strong accent. You should provide audio examples that did not work well so we can understand what happens.
For singing, you should definitely train a new model. Even better, include additional features such as pitch and chroma features.
Yes, by changing the microphone and trying to speak more clearly I can get an acceptable success rate for digits. For the sung syllables I also get better results. I am still using the standard en-US model.
Now I am starting to work with pocketsphinx, as the mobile platform is my ultimate target.
My further questions:
Grammar vs keyword list?
Should I use a grammar file (just with the words Do, Re, ...) or a keyword list? I am now using the grammar file, but occasionally I get extra recognized output such as "La" even when I just make some noise.
My grammar file (syllables.gram):
#JSGF V1.0;
grammar syllables;
<syllable> = do | re | mi | fa | so | la | ti ;
public <syllables> = <syllable>+;
Then I run:
./pocketsphinx_continuous.exe -inmic yes -jsgf ./syllables.gram -hmm ../../../model/en-us/en-us -dict ./syllables.dict
And I get a not-bad result when I sing 'do re mi fa so ... so fa mi re do'. However, I also get extra words such as a doubled 're', 'do', 'mi', etc.
I read that when using a grammar, the decoder will try to match any sound against the grammar. I think I may need to switch to keyword list search, since I can specify a threshold for each keyword to block out "out-of-grammar" sounds.
So I added a keyword file, syllables.kws:
do /1e-10/
re /1e-10/
mi /1e-10/
fa /1e-10/
so /1e-10/
la /1e-10/
ti /1e-10/
Then I run:
./pocketsphinx_continuous.exe -inmic yes -kws ./syllables.kws -hmm ../../../model/en-us/en-us -dict ./syllables.dict
If I set the threshold to 1e-45, it outputs a lot of words even when I am silent or just sing a single sound. With the 1e-10 thresholds shown in the file above, the recognition is worse than with the grammar. What may I be doing wrong here?
How to get the timestamp of each recognized word?
In any case, I will need to combine the recognized words with the pitch detection and correlate them, so I need to know the time at which each word was sung. How can I get that when programming with pocketsphinx?
BTW, you are right, I am going to detect the pitch as well; I am using another library for that.
"However, I also get extra words such as a doubled 're', 'do', 'mi', etc."
You can try tuning the word insertion penalty (-wip option). But in general, without a garbage loop it is hard to block noisy words. Consider a push-to-talk approach in a real application.
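For example, appended to your earlier grammar command (0.2 is just an arbitrary value to experiment with; lower values penalize inserted words more strongly):
./pocketsphinx_continuous.exe -inmic yes -jsgf ./syllables.gram -hmm ../../../model/en-us/en-us -dict ./syllables.dict -wip 0.2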
"If I set the threshold to 1e-45, it outputs a lot of words even when I am silent or just sing a single sound. With the 1e-10 thresholds shown in the file above, the recognition is worse than with the grammar."
You can spend more time tuning thresholds. Also, different thresholds can be optimal for different words. I'd suggest recording a test set and trying lots of different parameters (i.e., a grid search).
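A rough sketch of such a grid search is below: it batch-decodes one recorded test file with a range of keyword thresholds and prints the hypotheses so you can compare them against what was actually sung. The file paths, the threshold list, and the raw 16 kHz test recording are placeholders, and it uses the pocketsphinx SWIG Java API as I understand it, so verify the exact signatures against your version.

import edu.cmu.pocketsphinx.Config;
import edu.cmu.pocketsphinx.Decoder;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.PrintWriter;

public class KwsThresholdSearch {
    public static void main(String[] args) throws Exception {
        String[] syllables = {"do", "re", "mi", "fa", "so", "la", "ti"};
        String[] thresholds = {"1e-20", "1e-25", "1e-30", "1e-35", "1e-40"};
        for (String t : thresholds) {
            // Write a kws file that uses the same threshold for every syllable.
            try (PrintWriter out = new PrintWriter("syllables.kws")) {
                for (String s : syllables) out.println(s + " /" + t + "/");
            }
            Config c = Decoder.defaultConfig();
            c.setString("-hmm", "model/en-us/en-us");
            c.setString("-dict", "syllables.dict");
            c.setString("-kws", "syllables.kws");
            Decoder d = new Decoder(c);
            d.startUtt();
            // test.raw: 16 kHz, 16-bit, mono, headerless recording of the test singing.
            try (InputStream in = new FileInputStream("test.raw")) {
                byte[] buf = new byte[2048];
                short[] samples = new short[1024];
                int n;
                while ((n = in.read(buf)) > 0) {
                    // Convert little-endian 16-bit PCM bytes to samples.
                    for (int i = 0; i < n / 2; i++)
                        samples[i] = (short) ((buf[2 * i] & 0xff) | (buf[2 * i + 1] << 8));
                    d.processRaw(samples, n / 2, false, false);
                }
            }
            d.endUtt();
            String hyp = (d.hyp() == null) ? "(nothing)" : d.hyp().getHypstr();
            System.out.println("threshold " + t + ": " + hyp);
        }
    }
}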
"How to get the timestamp of each recognized word?"
Use the "-time yes" option.
Anyway, I do not think you will achieve what you expect with the en-US model.
For filtering out garbage syllables, I think I will use the onsets detected by the pitch tracker, so that should be fine; see the rough sketch at the end of this post.
How do I set options like '-time yes' in pocketsphinx?
You mean training a new model is better? I will try that and post an update here if there is any good news.
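Here is roughly what I have in mind for the onset filtering: keep only the recognized syllables whose start time lies close to a detected pitch onset. The class, the method name, and the tolerance value are just made up for illustration.

import java.util.ArrayList;
import java.util.List;

class OnsetFilter {
    // Return the indices of recognized words whose start time (in seconds) falls
    // within tolSec of any detected pitch onset; everything else is treated as garbage.
    static List<Integer> keepWordsNearOnsets(List<Double> wordStarts, List<Double> onsets, double tolSec) {
        List<Integer> kept = new ArrayList<>();
        for (int i = 0; i < wordStarts.size(); i++) {
            for (double onset : onsets) {
                if (Math.abs(onset - wordStarts.get(i)) <= tolSec) {
                    kept.add(i);
                    break;
                }
            }
        }
        return kept;
    }
}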
Last edit: Goldy Liang 2017-03-29
In the case of singing, training a new model is essential, yes.
Last edit: Arseniy Gorin 2017-03-30
Could you please also advise how to get the timestamp of each recognized syllable?
I tried passing in the '-time yes' option by invoking either of the following:
recognizer = defaultSetup()
    .setBoolean("-time", true)
    // or: .setString("-time", "yes")
But I am not sure I am passing it right, and if so, how do I then read the timestamps?
If I use the standalone recognizer (pocketsphinx_continuous -infile ... -time yes), I do get the timestamps. But I need to get them in the context of pocketsphinx on Android via the wrapper class. Is there an easy way to do that?
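To be clear, this is roughly what I am after on Android. It is a guess, not working code: I am assuming the wrapper's SpeechRecognizer exposes its Decoder (e.g. via getDecoder()) and that the SWIG-generated Decoder/Segment classes provide seg(), getWord(), getStartFrame() and getEndFrame(); please correct me if those are not the right calls.

import edu.cmu.pocketsphinx.Decoder;
import edu.cmu.pocketsphinx.Segment;

public class SyllableTimes {
    // Print each recognized word with its start/end time in seconds.
    // Frames are 10 ms each with the default -frate of 100.
    static void printWordTimes(Decoder decoder) {
        for (Segment seg : decoder.seg()) {
            double start = seg.getStartFrame() / 100.0;
            double end = seg.getEndFrame() / 100.0;
            System.out.println(seg.getWord() + " " + start + " " + end);
        }
    }
}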