
Best approach for recognizing Do Re Mi Fa So La Ti ?

Help
2017-03-14
2017-03-14
  • Goldy Liang

    Goldy Liang - 2017-03-14

    I am working on an experimental project in which I want to recognize singing of Do Re Mi Fa So La Ti...

    What is the best approach for this?

    I have tried the steps below, but the recognition performance is poor. What I did is based on the DialogDemo example in sphinx4.

    • Change or add the words in cmudict-en-us.dict for "Do, Re, Mi...". I modified the pronunciations to be the closest match; for example, for Do I put 'D OW' instead of 'D UW'.

    • Change digits.grxml to use Do, Re, Mi, etc. in place of the digits.
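
    For reference, a dictionary along these lines might look like the sketch below; apart from 'D OW', which I mentioned above, the ARPAbet choices are rough guesses that would need listening tests:

    ```
    do D OW
    re R EY
    mi M IY
    fa F AA
    so S OW
    la L AA
    ti T IY
    ```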

    After these changes it can basically recognize some of my singing, but the error rate is really high.

    What can I do to improve?

    Shall I try to adapt the existing en-us acoustic model, or shall I create a new acoustic model from scratch?

    I should mention that even recognizing the digits gave me a poor recognition rate. I guess my microphone may not be good enough?

     
    • Arseniy Gorin

      Arseniy Gorin - 2017-03-14

      Regarding the bad accuracy on digits: this could be due to a bad mic or a strong accent. You should provide audio examples that did not work well so we can understand what happens.

      For singing, you should definitely train a new model. Even better, include additional features such as pitch and chroma features.
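
      To illustrate what a pitch feature is, here is a minimal autocorrelation sketch (not production code; real systems use estimators like YIN, and all names here are made up for the example):

```java
// Sketch: estimate the fundamental frequency of a mono audio frame by
// finding the strongest autocorrelation peak. Assumes a clean, voiced
// signal; a real pitch tracker needs voicing detection and smoothing.
public class PitchSketch {
    public static double estimatePitch(double[] frame, double sampleRate) {
        int minLag = (int) (sampleRate / 1000.0); // search pitches up to 1000 Hz
        int maxLag = (int) (sampleRate / 60.0);   // ... and down to 60 Hz
        double best = 0.0;
        int bestLag = minLag;
        for (int lag = minLag; lag <= maxLag && lag < frame.length; lag++) {
            double sum = 0.0;
            for (int i = 0; i + lag < frame.length; i++) {
                sum += frame[i] * frame[i + lag];
            }
            if (sum > best) { best = sum; bestLag = lag; }
        }
        return sampleRate / bestLag;
    }

    public static void main(String[] args) {
        // A 440 Hz sine at 16 kHz: the estimate should land near 440 Hz
        double sr = 16000.0;
        double[] frame = new double[1024];
        for (int i = 0; i < frame.length; i++) {
            frame[i] = Math.sin(2 * Math.PI * 440.0 * i / sr);
        }
        System.out.println(estimatePitch(frame, sr));
    }
}
```

      Such a per-frame pitch value (or a chroma vector derived from it) would be appended to the usual MFCC features during training.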

       
  • Goldy Liang

    Goldy Liang - 2017-03-16

    Yes, by changing the microphone and trying to speak more clearly I can get an acceptable success rate for digits, and a higher rate for the sung syllables as well. I am still using the standard en-US model.

    Now I am starting to work with pocketsphinx, since mobile platforms are my ultimate target.

    My further questions:

    Grammar vs. keyword list?
    Shall I use a grammar file (just with the words Do, Re, ...) or a keyword list? I am currently using the grammar file, but occasionally I get extra recognized output such as "la" even when I just make some noise.

    My grammar file (syllables.gram):

    grammar syllables;

    <syllable> = do | re | mi | fa | so | la | ti;

    public <syllables> = <syllable>+;
    

    Then I run:
    ./pocketsphinx_continuous.exe -inmic yes -jsgf ./syllables.gram -hmm ../../../model/en-us/en-us -dict ./syllables.dict

    And I got a not-bad result when I sang 'do re mi fa so ... so fa mi re do'. However, I also got extra words: doubled 're', 'do', 'mi', etc.

    INFO: ptm_mgau.c(500): Allocating 32 buffers of 2500 samples each
    fa
    do re re mi do so do
    so mi fa mi re do
    

    I read that when using a grammar, the decoder tries to match any sound to the grammar. I think I may need to switch to a keyword-list search, since there I can specify a threshold for each keyword to block out "out-of-grammar" sounds.

    So I added the keyword file syllables.kws:

    do /1e-10/
    re /1e-10/
    mi /1e-10/
    fa /1e-10/
    so /1e-10/
    la /1e-10/
    ti /1e-10/
    

    Then I run:
    ./pocketsphinx_continuous.exe -inmic yes -kws ./syllables.kws -hmm ../../../model/en-us/en-us -dict ./syllables.dict

    If I set the threshold to 1e-45, it outputs a lot of words even when I am silent or just sing a single sound. In the file shown above I tried 1e-10, but the recognition is worse than with the grammar.

    INFO: ptm_mgau.c(500): Allocating 32 buffers of 2500 samples each
    so
    fa  la  so  la  la  fa  ti  mi  mi  re  la
    la  so  la  fa  mi  re  do
    

    What may I be doing wrong here?

    How to get the timestamp of each recognized word
    In any case, I will need to combine the recognized words with the pitch detection and correlate them, so I need to know the time at which each word was spoken. How can I get that when programming with pocketsphinx?

    BTW, you are right: I am going to detect the pitch as well, using another library for that.

     
    • Arseniy Gorin

      Arseniy Gorin - 2017-03-16

      However I just got extra words like double 're', 'do', 'mi', etc.

      You can try tuning the word insertion penalty (the -wip option). But in general, without a garbage loop it is hard to block noisy words. Consider a push-to-talk approach in a real application.

      If I set threshold to 1e-45, it output a lot of words even though I am silent or just sing a sound. In the file I shown I tried 1e-10, but the recognition is worse than using grammar

      You can spend more time tuning the thresholds. Also, different thresholds can be optimal for different words. I'd suggest recording a test set and trying lots of different parameter combinations (i.e., a grid search).
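
      As an illustration of the grid-search idea, one could generate a keyword file per candidate threshold and score each against the recorded test set; this helper is only a sketch, with invented names:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Sketch: write one keyword-list file per candidate threshold, so each
// variant can be passed to the decoder via -kws and scored on the same
// recorded test set.
public class KwsGrid {
    // The same syllables as in syllables.kws above
    static final List<String> SYLLABLES =
            List.of("do", "re", "mi", "fa", "so", "la", "ti");

    /** Write one keyword-list file for the given threshold, e.g. "1e-10". */
    public static Path writeKws(Path dir, String threshold) {
        StringBuilder sb = new StringBuilder();
        for (String w : SYLLABLES) {
            sb.append(w).append(" /").append(threshold).append("/\n");
        }
        Path out = dir.resolve("syllables-" + threshold + ".kws");
        try {
            return Files.writeString(out, sb.toString());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("kws-grid");
        for (String t : new String[] {"1e-5", "1e-10", "1e-20", "1e-30", "1e-45"}) {
            System.out.println(writeKws(dir, t));
        }
    }
}
```

      A finer search would then give each word its own threshold rather than one shared value.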

      How to get the timestamp of each recognized text

      Use the "-time yes" option.
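
      The timestamped lines can then be parsed into (word, start, end) triples; this sketch assumes output lines of the form `word start end` with times in seconds, and simply skips everything else (INFO lines and so on):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: parse timestamped recognizer output of the form "word start end"
// (times in seconds). The class and field names are my own.
public class TimedWordParser {
    public static class TimedWord {
        public final String word;
        public final double start, end;
        TimedWord(String word, double start, double end) {
            this.word = word;
            this.start = start;
            this.end = end;
        }
        @Override public String toString() {
            return word + " [" + start + ", " + end + "]";
        }
    }

    public static List<TimedWord> parse(String output) {
        List<TimedWord> words = new ArrayList<>();
        for (String line : output.split("\n")) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length != 3) continue; // not a "word start end" line
            try {
                words.add(new TimedWord(parts[0],
                        Double.parseDouble(parts[1]),
                        Double.parseDouble(parts[2])));
            } catch (NumberFormatException e) {
                // first token of a non-timestamp line; ignore it
            }
        }
        return words;
    }

    public static void main(String[] args) {
        for (TimedWord w : parse("do 0.10 0.35\nre 0.40 0.70")) {
            System.out.println(w);
        }
    }
}
```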

      Anyway, I do not think you will achieve what you expect with the en-US model.

       
      • Goldy Liang

        Goldy Liang - 2017-03-29

        For filtering out garbage syllables, I think I will use the onsets detected by the pitch tracker, so that should be fine.
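
        The onset-based filtering I have in mind would look roughly like this (the tolerance value and all names are placeholders I still need to tune):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: keep only recognized words whose start time falls close to a
// pitch-detected note onset; everything else is treated as garbage.
public class OnsetFilter {
    /** Return the indices i where wordStarts[i] is within tol seconds of some onset. */
    public static List<Integer> keptIndices(double[] wordStarts, double[] onsets, double tol) {
        List<Integer> kept = new ArrayList<>();
        for (int i = 0; i < wordStarts.length; i++) {
            for (double onset : onsets) {
                if (Math.abs(wordStarts[i] - onset) <= tol) {
                    kept.add(i);
                    break;
                }
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        double[] starts = {0.10, 0.42, 0.95}; // recognized word start times (s)
        double[] onsets = {0.12, 0.40};       // pitch-detected note onsets (s)
        // The word starting at 0.95 s has no nearby onset and is dropped
        System.out.println(keptIndices(starts, onsets, 0.05));
    }
}
```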

        How do I set options like '-time yes' in pocketsphinx?

        You mean training a new model is better? I will try, and will post an update here if there is any good news.

         

        Last edit: Goldy Liang 2017-03-29
        • Arseniy Gorin

          Arseniy Gorin - 2017-03-30

          Yes: in the case of singing, training a new model is essential.

           

          Last edit: Arseniy Gorin 2017-03-30
          • Goldy Liang

            Goldy Liang - 2017-04-02

            Could you please also advise how to get the timestamp of each recognized syllable?

            I tried passing the '-time yes' option by invoking either of the below:
            recognizer = defaultSetup()
                .setBoolean("-time", true)
                // or: .setString("-time", "yes")

            But I am not sure I am passing it correctly, and if I am, how do I read the timestamps?

            If I use the standalone recognizer (pocketsphinx_continuous -infile ... -time yes), I do get the timestamps. But I need to get them within pocketsphinx on Android via the wrapper class. Is there an easy way to do that?
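
            I wonder whether the per-word timings would come from the low-level Decoder rather than the Hypothesis; something like the sketch below, though I have not verified these method names against the Android bindings, and the 100 frames/second assumes the default -frate:

```
// Sketch only: unverified against the SWIG-generated Android API
Decoder d = recognizer.getDecoder();
for (Segment seg : d.seg()) {
    double start = seg.getStartFrame() / 100.0; // frames -> seconds at -frate 100
    double end = seg.getEndFrame() / 100.0;
    System.out.println(seg.getWord() + " " + start + " " + end);
}
```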

             

