CMU Sphinx / Forums / Help: Multi-word keywords with PocketSphinx

Stephen McCants - 2016-09-26

I'm working on getting PocketSphinx to identify a handful of keywords. I have a keyword file and I run PocketSphinx in keyword mode. I've created a variety of short recordings to help me test the keyword recognition, and it works great for some and not at all for others.

Here is my keyword file:
or(3) control /1e-30/
exit /1e-10/
tourniquet checkpoint /1e-50/
surgery start checkpoint /1e-50/
surgery end checkpoint /1e-50/

It works great for the first three. After that, it fails pretty badly (1 out of 4). I've tried adjusting the thresholds, but that either makes no difference or makes things worse. I read in another post that the limits were between 1 and 1e-50.

All of these words are in the English dictionary, but I added or(3) to get the correct pronunciation.

I'm at a loss at this point and unsure how to proceed.

Thanks in advance.
--Stephen

- Nickolay V. Shmyrev - 2016-09-26
  
  Tutorial says:
  
  > Take a long recording with few occurrences of your keywords and some other sounds. You can take a movie sound or something else. The length of the audio should be approximately 1 hour
  
  It is not really helpful to tune the threshold on short files like you are doing. Pocketsphinx does not have time to adapt to the volume and noise level, so you need to test on long files.
  
  > I'm attaching one of the failing sound files.
  
  You also have noise and echo. If you are building a real-life system, it's better to look for a good microphone.
  
  - Nickolay V. Shmyrev - 2016-09-26
    
    Several other things: if you want to use thresholds over 1e-50, you need to set the beam to a larger value, for example -beam 1e-150.
    
    It is better to split long phrases into subphrases. This way you get a much more reliable estimate, since you check both of them. You can check for "surgery start" and "checkpoint" instead of "surgery start checkpoint".
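
One way to act on the splitting suggestion is to spot the two subphrases separately and accept the command only when both fire close together. The following is a minimal Python sketch under stated assumptions: the (keyphrase, time) tuples, the `find_commands` helper, and the 2-second `max_gap` are hypothetical names and values chosen for illustration, not part of PocketSphinx. The detections would come from whatever your application collects from the keyword spotter.

```python
from typing import List, Tuple

Detection = Tuple[str, float]  # (keyphrase, detection time in seconds)


def find_commands(detections: List[Detection], first: str, second: str,
                  max_gap: float = 2.0) -> List[float]:
    """Return times of `first` detections that are followed by `second`
    within `max_gap` seconds."""
    commands = []
    pending = None  # time of the most recent unpaired `first` detection
    for phrase, time in sorted(detections, key=lambda d: d[1]):
        if phrase == first:
            pending = time
        elif phrase == second and pending is not None:
            if time - pending <= max_gap:
                commands.append(pending)
            pending = None  # a `second` detection always closes the pair
    return commands


# Hypothetical detections, not real PocketSphinx output.
detections = [
    ("surgery start", 10.1),
    ("checkpoint", 10.9),   # close enough: counts as a command
    ("checkpoint", 55.0),   # no preceding "surgery start": ignored
    ("surgery start", 80.2),
    ("checkpoint", 90.0),   # too late after "surgery start": ignored
]
print(find_commands(detections, "surgery start", "checkpoint"))
```

With this sample data only the first pair qualifies. The gap value is a design choice that would need tuning against real recordings.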
    
    - Nickolay V. Shmyrev - 2016-09-26
      
      Tutorial also mentions that:
      
      > If your keyphrase is very long, larger than 10 syllables, it is recommended to split it and spot for parts separately.
      
      - Stephen McCants - 2016-09-27
        
        "Surgery start checkpoint" is only 6 syllables.... so nothing I have is over 10 syllables. Do you still recommend splitting them?
        
        Nickolay V. Shmyrev - 2016-09-27
        
        Why not? It increases the reliability of detection.
        
    - Stephen McCants - 2016-09-26
      
      I originally tried that, but my results were worse than with them combined. Ultimately a command will be something like "or control surgery start checkpoint".
      
  - Stephen McCants - 2016-09-26
    
    I'm not sure where I'm going to come up with an hour-long audio file with a few references to things like "or control" and "surgery start checkpoint". Can I just record talking and occasionally say those things without any context to the rest of the talking?
    
    As to noise and echo, I fully expect that our customer deployments will be worse. It will be in an operating room possibly with music in the background. Certainly people talking and noise from surgical equipment.
    
    Other than the thresholds in the keywords file, is there anything I can tune?
    
    - Nickolay V. Shmyrev - 2016-09-26
      
      > Can I just record talking and occasionally say those things without any context to the rest of the talking?
      
      You can.
      
      > As to noise and echo, I fully expect that our customer deployments will be worse. It will be in an operating room possibly with music in the background. Certainly people talking and noise from surgical equipment.
      
      You need hardware noise cancellation then: either a microphone array or a headset.
      
      - Stephen McCants - 2016-09-26
        
        Would a fairly inexpensive microphone array like the one below make a difference?
        
        https://www.amazon.com/Andrea-Electronics-Technology-Microphone-C1-1024200-1/dp/B00H85ANIE?SubscriptionId=AKIAILSHYYTFIVPWUY6Q&tag=duckduckgo-d-20&linkCode=xm2&camp=2025&creative=165953&creativeASIN=B00H85ANIE
        
        Headsets are not going to work.
        
        Nickolay V. Shmyrev - 2016-09-26
        
        Yes, something like that should be much better than what you have now.
        
Stephen McCants - 2016-09-26

I'm attaching one of the failing sound files.

surgeryStartCheckpoint1.wav

Stephen McCants - 2016-09-27

Even if I had an hour-long audio recording made with a good microphone, it still isn't clear to me how I would "tune". Actually, it isn't even clear that "tune" is the right word, although I'm convinced "train" is the wrong word.

I've read http://cmusphinx.sourceforge.net/wiki/tutorialtuning, but it is unclear what is being tuned and how. What do I do with my results? What should I be changing to improve them?

I've also read some of http://www.speech.cs.cmu.edu/sphinx/tutorial.html, but that seems like overkill for some simple adjustments.

- Nickolay V. Shmyrev - 2016-09-27
  
  > Even if I had an hour-long audio recording made with a good microphone, it still isn't clear to me how I would "tune". Actually, it isn't even clear that "tune" is the right word, although I'm convinced "train" is the wrong word.
  
  http://cmusphinx.sourceforge.net/wiki/tutoriallm#keyword_lists
  
  1. Take a long recording with few occurrences of your keywords and some other sounds. You can take a movie sound or something else. The length of the audio should be approximately 1 hour.
  2. Run keyword spotting on that file with different thresholds for every keyword, use the following command:
  
       pocketsphinx_continuous -infile <your_file.wav> -keyphrase <"your keyphrase"> -kws_threshold \
       <your_threshold> -time yes
  
     It will print many lines, some of them are keywords with detection times and confidences. You can also disable extra logs with -logfn your_file.log option to avoid clutter.
  3. From keyword spotting results count how many false alarms and missed detections you've encountered.
  4. Select the threshold with smallest amount of false alarms and missed detections.
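
The counting and selection described above can be sketched in Python. This is only an illustration under stated assumptions: you have noted by hand when the keyphrase was actually spoken, and you have copied the detection times for each threshold out of the spotter's output yourself. The helper names `score_detections` and `pick_threshold`, the 1-second tolerance, and all of the sample numbers are hypothetical.

```python
from typing import Dict, List, Tuple


def score_detections(truth: List[float], detected: List[float],
                     tolerance: float = 1.0) -> Tuple[int, int]:
    """Return (missed, false_alarms) for one run of the keyword spotter."""
    unmatched = sorted(detected)
    missed = 0
    for t in sorted(truth):
        # Take the earliest unmatched detection close enough to this occurrence.
        match = next((d for d in unmatched if abs(d - t) <= tolerance), None)
        if match is None:
            missed += 1
        else:
            unmatched.remove(match)
    # Detections that matched no true occurrence are false alarms.
    return missed, len(unmatched)


def pick_threshold(truth: List[float], runs: Dict[str, List[float]]) -> str:
    """Pick the threshold whose run has the fewest missed + false alarms."""
    def errors(threshold: str) -> int:
        missed, false_alarms = score_detections(truth, runs[threshold])
        return missed + false_alarms
    return min(runs, key=errors)


# Hypothetical data: the keyphrase was spoken at 12 s, 340 s and 1805 s.
truth = [12.0, 340.0, 1805.0]
# Detection times per threshold, made up for illustration.
runs = {
    "1e-10": [12.3],
    "1e-30": [12.2, 340.4, 1805.1],
    "1e-50": [12.2, 95.0, 340.4, 700.8, 1805.1],
}
for threshold, detected in runs.items():
    print(threshold, score_detections(truth, detected))
print("best:", pick_threshold(truth, runs))
```

Weighting missed detections and false alarms equally is a simplification. In a setting where one kind of error costs more than the other, the sum in `errors` could be weighted accordingly.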
  
  - Stephen McCants - 2016-09-28
    
    > Take a long recording with few occurrences of your keywords and some other sounds. You can take a movie sound or something else.
    
    Should I splice my keywords into the movie soundtrack? There aren't any movies that are going to have "or control surgery start checkpoint" in them. Or would the splicing likely invalidate the exercise because the audio quality and volume will be very different from the movie soundtrack?
    
    Thanks for all your help.
    
    - Nickolay V. Shmyrev - 2016-09-28
      
      Yes, something like that should work. The threshold is a single number, so you don't need much data to estimate it.
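
The splicing idea could be sketched as follows. This is a rough Python illustration, not a recommended recipe: it assumes both the background and the keyword clip are already raw mono PCM bytes at the same sample rate and sample width (16 kHz, 16-bit here, chosen as an assumption), and the `splice_clips` helper is a hypothetical name. It also returns the times at which the clip ends up in the output, which can serve as the reference times when counting missed detections and false alarms.

```python
from typing import Iterable, List, Tuple


def splice_clips(background: bytes, clip: bytes, offsets_s: Iterable[float],
                 rate: int = 16000, width: int = 2) -> Tuple[bytes, List[float]]:
    """Insert `clip` into `background` at each offset (seconds into the
    background). Returns the new audio and the clip start times in it."""
    out = bytearray()
    times = []
    pos = 0
    for offset in sorted(offsets_s):
        cut = int(offset * rate) * width  # byte index aligned to a whole sample
        out += background[pos:cut]
        times.append(len(out) / (rate * width))  # where the clip starts in the output
        out += clip
        pos = cut
    out += background[pos:]  # remainder of the background after the last insert
    return bytes(out), times


# Hypothetical data: 4 s of silence as background and a 1 s dummy clip.
background = bytes(2 * 16000 * 4)
clip = b"\x01\x00" * 16000
audio, times = splice_clips(background, clip, [1.0, 3.0])
print(len(audio) / (16000 * 2), times)
```

Each inserted clip shifts everything after it, which is why the returned times differ from the requested offsets. The sketch does nothing about the volume and audio quality mismatch raised in the question.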
      
Stephen McCants - 2016-09-27

Also there is "adapting" (http://cmusphinx.sourceforge.net/wiki/tutorialadapt), but that seems wrong because it is for one person or one environment.
