I am working on an experimental project in which I want to recognize singing as Do Re Mi Fa So La Ti...
What's the best approach for this?
I have tried the steps below, but the recognition performance is poor. What I did is based on the DialogDemo example in sphinx4.
Change or add words in cmudict-en-us.dict for "Do, Re, Mi...". I modified the pronunciations to reflect the closest pronunciation; for example, for Do I put 'D OW' instead of 'D UW' (see the sketch after these steps).
Change digits.grxml to use Do, Re, Mi, etc. in place of the digits.
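For reference, the dictionary entries look roughly like this (only the 'do' line is the one I actually described above; the ARPAbet phones for the other syllables are illustrative guesses, not entries that ship with cmudict):
do D OW
re R EY
mi M IY
fa F AA
so S OW
la L AA
ti T IY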
After these changes it can basically recognize some of my singing, but the error rate is really high.
What can I do to improve?
Shall I try to adapt the existing en-us acoustic model, or shall I create a new acoustic model from scratch?
I should mention that even recognizing the digits gave me a poor recognition rate. I guess my microphone may not be good enough?
Regarding bad accuracy on digits, this could be due to a bad mic or a strong accent. You should provide audio examples that did not work well so we can understand what happens.
For singing, you should definitely train a new model. Even better, include additional features such as pitch and chroma features.
Yes, by changing the microphone and trying to speak more clearly I can get an acceptable success rate for digits. For the sung syllables I also get better results. I am still using the standard en-US model.
Now I am starting to work with pocketsphinx, as the mobile platform is my ultimate target.
My further questions:
Grammar vs keyword list?
Should I use a grammar file (just with the words Do, Re, ...) or a keyword list? I am now using the grammar file, but occasionally I get extra recognized output such as "La" even when I just make some noise.
My grammar file (syllables.gram):
#JSGF V1.0;
grammar syllables;
<syllable> = do | re | mi | fa | so | la | ti ;
public <syllables> = <syllable>+;
Then I run:
./pocketsphinx_continuous.exe -inmic yes -jsgf ./syllables.gram -hmm ../../../model/en-us/en-us -dict ./syllables.dict
And I get a not-bad result when I sing 'do re mi fa so ... so fa mi re do'. However, I also get extra words such as a doubled 're', 'do', 'mi', etc.
I read that when using a grammar, the decoder will try to match any sound against the grammar. I think I may need to switch to keyword list search, since I can specify a threshold for each keyword to block out "out-of-grammar" sounds.
So I added a keyword file, syllables.kws:
do /1e-10/
re /1e-10/
mi /1e-10/
fa /1e-10/
so /1e-10/
la /1e-10/
ti /1e-10/
Then I run:
./pocketsphinx_continuous.exe -inmic yes -kws ./syllables.kws -hmm ../../../model/en-us/en-us -dict ./syllables.dict
If I set the threshold to 1e-45, it outputs a lot of words even when I am silent or just sing a single sound. With the 1e-10 thresholds shown in the file above, the recognition is worse than with the grammar. What may I be doing wrong here?
How to get the timestamp of each recognized word?
In any case, I will need to combine the recognized words with the pitch detection and correlate them, so I need to know the time at which each word was sung. How can I get that when programming with pocketsphinx?
BTW, you are right, I am going to detect the pitch as well; I am using another library for that.
"However, I also get extra words such as a doubled 're', 'do', 'mi', etc."
You can try tuning the word insertion penalty (-wip option). But in general, without a garbage loop it is hard to block noisy words. Consider a push-to-talk approach in a real application.
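For example, appended to your earlier grammar command (0.2 is just an arbitrary value to experiment with; lower values penalize inserted words more strongly):
./pocketsphinx_continuous.exe -inmic yes -jsgf ./syllables.gram -hmm ../../../model/en-us/en-us -dict ./syllables.dict -wip 0.2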
"If I set the threshold to 1e-45, it outputs a lot of words even when I am silent or just sing a single sound. With the 1e-10 thresholds shown in the file above, the recognition is worse than with the grammar."
You can spend more time tuning thresholds. Also, different thresholds can be optimal for different words. I'd suggest recording a test set and trying lots of different parameters (i.e., a grid search).
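A rough sketch of such a grid search is below: it batch-decodes one recorded test file with a range of keyword thresholds and prints the hypotheses so you can compare them against what was actually sung. The file paths, the threshold list, and the raw 16 kHz test recording are placeholders, and it uses the pocketsphinx SWIG Java API as I understand it, so verify the exact signatures against your version.

import edu.cmu.pocketsphinx.Config;
import edu.cmu.pocketsphinx.Decoder;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.PrintWriter;

public class KwsThresholdSearch {
    public static void main(String[] args) throws Exception {
        String[] syllables = {"do", "re", "mi", "fa", "so", "la", "ti"};
        String[] thresholds = {"1e-20", "1e-25", "1e-30", "1e-35", "1e-40"};
        for (String t : thresholds) {
            // Write a kws file that uses the same threshold for every syllable.
            try (PrintWriter out = new PrintWriter("syllables.kws")) {
                for (String s : syllables) out.println(s + " /" + t + "/");
            }
            Config c = Decoder.defaultConfig();
            c.setString("-hmm", "model/en-us/en-us");
            c.setString("-dict", "syllables.dict");
            c.setString("-kws", "syllables.kws");
            Decoder d = new Decoder(c);
            d.startUtt();
            // test.raw: 16 kHz, 16-bit, mono, headerless recording of the test singing.
            try (InputStream in = new FileInputStream("test.raw")) {
                byte[] buf = new byte[2048];
                short[] samples = new short[1024];
                int n;
                while ((n = in.read(buf)) > 0) {
                    // Convert little-endian 16-bit PCM bytes to samples.
                    for (int i = 0; i < n / 2; i++)
                        samples[i] = (short) ((buf[2 * i] & 0xff) | (buf[2 * i + 1] << 8));
                    d.processRaw(samples, n / 2, false, false);
                }
            }
            d.endUtt();
            String hyp = (d.hyp() == null) ? "(nothing)" : d.hyp().getHypstr();
            System.out.println("threshold " + t + ": " + hyp);
        }
    }
}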
"How to get the timestamp of each recognized word?"
Use the "-time yes" option.
Anyway, I do not think you will achieve what you expect with the en-US model.
For filtering out garbage syllables, I think I will use the onsets detected by the pitch tracker, so that should be fine; see the rough sketch at the end of this post.
How do I set options like '-time yes' in pocketsphinx?
You mean training a new model is better? I will try that and post an update here if there is any good news.
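Here is roughly what I have in mind for the onset filtering: keep only the recognized syllables whose start time lies close to a detected pitch onset. The class, the method name, and the tolerance value are just made up for illustration.

import java.util.ArrayList;
import java.util.List;

class OnsetFilter {
    // Return the indices of recognized words whose start time (in seconds) falls
    // within tolSec of any detected pitch onset; everything else is treated as garbage.
    static List<Integer> keepWordsNearOnsets(List<Double> wordStarts, List<Double> onsets, double tolSec) {
        List<Integer> kept = new ArrayList<>();
        for (int i = 0; i < wordStarts.size(); i++) {
            for (double onset : onsets) {
                if (Math.abs(onset - wordStarts.get(i)) <= tolSec) {
                    kept.add(i);
                    break;
                }
            }
        }
        return kept;
    }
}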
Last edit: Goldy Liang 2017-03-29
In the case of singing, training a new model is essential, yes.
Last edit: Arseniy Gorin 2017-03-30
Could you please also advise how to get the timestamp of each recognized syllable?
I tried passing in the '-time yes' option by invoking either of the following:
recognizer = defaultSetup()
    .setBoolean("-time", true)
    // or: .setString("-time", "yes")
But I am not sure I am passing it right, and if so, how do I then read the timestamps?
If I use the standalone recognizer (pocketsphinx_continuous -infile ... -time yes), I do get the timestamps. But I need to get them in the context of pocketsphinx on Android via the wrapper class. Is there an easy way to do that?
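To be clear, this is roughly what I am after on Android. It is a guess, not working code: I am assuming the wrapper's SpeechRecognizer exposes its Decoder (e.g. via getDecoder()) and that the SWIG-generated Decoder/Segment classes provide seg(), getWord(), getStartFrame() and getEndFrame(); please correct me if those are not the right calls.

import edu.cmu.pocketsphinx.Decoder;
import edu.cmu.pocketsphinx.Segment;

public class SyllableTimes {
    // Print each recognized word with its start/end time in seconds.
    // Frames are 10 ms each with the default -frate of 100.
    static void printWordTimes(Decoder decoder) {
        for (Segment seg : decoder.seg()) {
            double start = seg.getStartFrame() / 100.0;
            double end = seg.getEndFrame() / 100.0;
            System.out.println(seg.getWord() + " " + start + " " + end);
        }
    }
}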