Multi-word keywords with PocketSphinx

Help
2016-09-26
2016-09-28
  • Stephen McCants

    Stephen McCants - 2016-09-26

    I'm working on getting PocketSphinx to identify a handful of keywords. I have a keyword file and I switch it to keyword mode. I've created a variety of short recordings to help me test the keyword recognition, and it works great for some and not at all for others.

    Here is my keyword file:
    or(3) control /1e-30/
    exit /1e-10/
    tourniquet checkpoint /1e-50/
    surgery start checkpoint /1e-50/
    surgery end checkpoint /1e-50/

    Works great for the first three. After that, it fails pretty badly (1 out of 4). I've tried adjusting the thresholds, but that either makes no difference or makes things worse. I read in another post that the limits were between 1 and 1e-50.

    All of these words are in the English dictionary, but I added or(3) to get the correct pronunciation.
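A minimal sketch for sanity-checking a keyword file like the one above before handing it to PocketSphinx. It assumes every non-empty line has the shape `<phrase> /<threshold>/`, as in the quoted file; the function name `parse_keyword_list` is hypothetical and not part of PocketSphinx.

```python
import re
from typing import List, Tuple

# Assumed line shape, taken from the keyword file quoted in the post:
#   <phrase> /<threshold>/
KEYWORD_LINE = re.compile(r"^(?P<phrase>.+?)\s*/(?P<threshold>[^/]+)/\s*$")


def parse_keyword_list(text: str) -> List[Tuple[str, float]]:
    """Return (phrase, threshold) pairs, raising ValueError on malformed lines."""
    entries = []
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue  # blank lines are skipped
        match = KEYWORD_LINE.match(line)
        if match is None:
            raise ValueError(
                f"line {number}: expected '<phrase> /<threshold>/', got {line!r}"
            )
        # float() raises ValueError itself if the threshold is not a number.
        threshold = float(match.group("threshold"))
        if threshold <= 0.0:
            raise ValueError(f"line {number}: threshold must be positive")
        entries.append((match.group("phrase"), threshold))
    return entries
```

Running this over the file first makes it easy to rule out a formatting slip (a missing slash, a stray character) before blaming the thresholds themselves.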

    I'm at a loss at this point and unsure how to proceed.

    Thanks in advance.
    --Stephen

     
    • Nickolay V. Shmyrev

      Tutorial says:

      Take a long recording with few occurrences of your keywords and some other sounds. You can take a movie sound or something else. The length of the audio should be approximately 1 hour

      It is not really helpful to tune the threshold on short files like you are doing. PocketSphinx does not have time to adapt to the volume and noise level, so you need to test on long files.

      I'm attaching one of the failing sound files.

      You also have noise and echo. If you are building a real-life system, it's better to look for a good microphone.

       
      • Nickolay V. Shmyrev

        Several other things: if you want to use thresholds smaller than 1e-50, you need to widen the beam, for example -beam 1e-150.

        It is better to split long phrases into subphrases. This way you get a much more reliable estimate, since you check both of them. You can check for "surgery start" and "checkpoint" instead of "surgery start checkpoint".
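One way to act on separately spotted subphrases is to accept the full command only when the second part follows the first within a short gap. The sketch below is plain Python and does not call PocketSphinx; the `Detection` tuple, the `combine_subphrases` name and the 1-second default gap are all assumptions made for illustration.

```python
from typing import List, NamedTuple, Tuple


class Detection(NamedTuple):
    phrase: str
    start: float  # seconds from the beginning of the audio
    end: float


def combine_subphrases(
    detections: List[Detection],
    first: str,
    second: str,
    max_gap: float = 1.0,  # assumed value, tune for your speakers
) -> List[Tuple[float, float]]:
    """Return (start, end) spans where `first` is followed by `second`.

    A pair is accepted when `second` starts no earlier than `first` ends
    and no later than `max_gap` seconds after it.
    """
    firsts = sorted((d for d in detections if d.phrase == first), key=lambda d: d.start)
    seconds = sorted((d for d in detections if d.phrase == second), key=lambda d: d.start)
    spans = []
    for f in firsts:
        for s in seconds:
            gap = s.start - f.end
            if 0.0 <= gap <= max_gap:
                spans.append((f.start, s.end))
                break  # use the earliest matching second part
    return spans
```

If the detector reports slightly overlapping boundaries for adjacent subphrases, the lower bound on the gap would need to be relaxed to a small negative value.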

         
        • Nickolay V. Shmyrev

          Tutorial also mentions that:

          If your keyphrase is very long, larger than 10 syllables, it is recommended to split it and spot for parts separately.

           
          • Stephen McCants

            Stephen McCants - 2016-09-27

            "Surgery start checkpoint" is only 6 syllables.... so nothing I have is over 10 syllables. Do you still recommend splitting them?

             
            • Nickolay V. Shmyrev

              Why not? It increases the reliability of detection.

               
        • Stephen McCants

          Stephen McCants - 2016-09-26

          I originally tried that, but my results were worse than with them combined. Ultimately a command will be something like "or control surgery start checkpoint".

           
      • Stephen McCants

        Stephen McCants - 2016-09-26

        Not sure where I'm going to come up with an hour-long audio file with a few references to things like "or control" and "surgery start checkpoint". Can I just record talking and occasionally say those things without any context to the rest of the talking?

        As to noise and echo, I fully expect that our customer deployments will be worse. It will be in an operating room possibly with music in the background. Certainly people talking and noise from surgical equipment.

        Other than the thresholds in the keywords file, is there anything I can tune?

         
        • Nickolay V. Shmyrev

          Can I just record talking and occasionally say those things without any context to the rest of the talking?

          You can

          As to noise and echo, I fully expect that our customer deployments will be worse. It will be in an operating room possibly with music in the background. Certainly people talking and noise from surgical equipment.

          You need hardware noise cancellation then. Either a microphone array or a headset.

           
  • Stephen McCants

    Stephen McCants - 2016-09-26

    I'm attaching one of the failing sound files.

     
  • Stephen McCants

    Stephen McCants - 2016-09-27

    Even if I had an hour long audio recorded by a good microphone, it still isn't clear to me how I would "tune". Actually, it isn't even clear that "tune" is the right word, although I'm convinced "train" is the wrong word.

    I've read http://cmusphinx.sourceforge.net/wiki/tutorialtuning, but it is unclear what is being tuned and how. What do I do with my results? What should I be changing to improve them?

    I've also read some of http://www.speech.cs.cmu.edu/sphinx/tutorial.html, but that seems like overkill for some simple adjustments.

     
    • Nickolay V. Shmyrev

      Even if I had an hour long audio recorded by a good microphone, it still isn't clear to me how I would "tune". Actually, it isn't even clear that "tune" is the right word, although I'm convinced "train" is the wrong word.

      http://cmusphinx.sourceforge.net/wiki/tutoriallm#keyword_lists

      Take a long recording with few occurrences of your keywords and some other sounds. You can take a movie sound or something else. The length of the audio should be approximately 1 hour
      Run keyword spotting on that file with different thresholds for every keyword, use the following command:
      pocketsphinx_continuous -infile <your_file.wav> -keyphrase <"your keyphrase"> -kws_threshold \
      <your_threshold> -time yes
      It will print many lines, some of them are keywords with detection times and confidences. You can also disable extra logs with -logfn your_file.log option to avoid clutter.

      From keyword spotting results count how many false alarms and missed detections you've encountered
      Select the threshold with smallest amount of false alarms and missed detections
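The steps above can be sketched in Python under a few stated assumptions. `sweep_commands` only builds argument lists from the flags shown in the quoted tutorial command and does not run anything. `count_errors` and `pick_threshold` assume you have already extracted detection times (in seconds) from the output yourself and know the true keyword times. All three function names and the 0.5-second matching tolerance are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def sweep_commands(wav: str, keyphrase: str, thresholds: Sequence[float]) -> List[List[str]]:
    """Build one pocketsphinx_continuous argument list per threshold."""
    return [
        ["pocketsphinx_continuous", "-infile", wav, "-keyphrase", keyphrase,
         "-kws_threshold", f"{t:g}", "-time", "yes"]
        for t in thresholds
    ]


def count_errors(
    detected: Sequence[float], expected: Sequence[float], tolerance: float = 0.5
) -> Tuple[int, int]:
    """Return (false_alarms, misses).

    Each expected time can be matched by at most one detection, so a
    repeated detection of the same utterance counts as a false alarm.
    """
    unmatched = sorted(expected)
    false_alarms = 0
    for t in sorted(detected):
        hit = next((e for e in unmatched if abs(e - t) <= tolerance), None)
        if hit is None:
            false_alarms += 1
        else:
            unmatched.remove(hit)
    return false_alarms, len(unmatched)


def pick_threshold(
    results: Dict[float, Sequence[float]], expected: Sequence[float], tolerance: float = 0.5
) -> float:
    """Pick the threshold with the fewest false alarms plus misses."""
    def total(threshold: float) -> int:
        false_alarms, misses = count_errors(results[threshold], expected, tolerance)
        return false_alarms + misses
    return min(results, key=total)
```

Summing false alarms and misses treats both errors as equally costly. If a missed command is worse than a spurious one in your deployment, weight the two counts differently before taking the minimum.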

       
      • Stephen McCants

        Stephen McCants - 2016-09-28

        Take a long recording with few occurrences of your keywords and some other sounds. You can take a movie sound or something else.

        Should I splice in my keywords to the movie sound track? There aren't any movies that are going to have "or control surgery start checkpoint" in it. Or would the splicing likely invalidate the exercise because the audio quality and volume will be very different from the movie sound track?

        Thanks for all your help.

         
        • Nickolay V. Shmyrev

          Yes, something like that should work. The threshold is a single number, so you don't need much data to estimate it.
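A rough sketch of the splicing idea, assuming both the background track and the keyword clip have already been decoded to 16-bit mono PCM at the same sample rate and loaded into `array("h")` buffers. The `splice_keyword` name and the `gain` parameter (a crude way to bring the clip closer to the background volume) are illustrative assumptions, not part of any PocketSphinx tooling.

```python
from array import array
from typing import List, Sequence, Tuple


def splice_keyword(
    background: array, clip: array, offsets: Sequence[int], gain: float = 1.0
) -> Tuple[array, List[int]]:
    """Insert `clip` into `background` at each sample offset.

    Offsets refer to positions in the original background. Returns the new
    sample buffer and the start index of every inserted clip in that buffer.
    """
    # Scale the clip and clamp to the signed 16-bit range.
    scaled = array(
        "h", (max(-32768, min(32767, int(round(s * gain)))) for s in clip)
    )
    out = array("h")
    starts: List[int] = []
    previous = 0
    for offset in sorted(offsets):
        if not 0 <= offset <= len(background):
            raise ValueError(f"offset {offset} is outside the background audio")
        out.extend(background[previous:offset])
        starts.append(len(out))  # where this copy of the clip begins
        out.extend(scaled)
        previous = offset
    out.extend(background[previous:])
    return out, starts
```

Dividing each returned start index by the sample rate gives the true keyword times in seconds, which can then serve as the reference when counting false alarms and missed detections.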

           
  • Stephen McCants

    Stephen McCants - 2016-09-27

    Also there is "adapting" (http://cmusphinx.sourceforge.net/wiki/tutorialadapt), but that seems wrong because it is for one person or one environment.

     
