CMU Sphinx / Forums / Help: Improving PocketSphinx Recognition Accuracy

Mark - 2008-11-10

What is being done to improve pocketsphinx (PS) recognition accuracy when voice signals come in over the phone to, say, a soft switch like FreeSwitch that uses PS? Moreover, how would one start learning to improve the accuracy of PS (possibly within FS), or get help doing so?

I'm new to speech recognition but have recently read an article (http://www.google.com/patents?id=mZ2jAAAAEBAJ&dq=Pattern+recognition+accuracy+with+distortions) in which two modified copies of the voice input signal, x, are made before it reaches any speech recognition system. The three lists of possible utterances derived from these three signals are then compared to determine the best choice for what the utterance could have been. The technique is claimed to reduce recognition errors by up to 80%. It will obviously slow things down, but it is used in call centers that utilize Fluency Voice technology.
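
As a rough sketch of how three candidate lists might be combined, the snippet below sums a confidence score per utterance across the lists and picks the highest total. The summing rule, the `vote` function, the scores and the sample utterances are all assumptions made for illustration; the cited article may well combine the lists differently.

```python
from collections import defaultdict


def vote(nbest_lists):
    """Pick the utterance with the highest total score across all lists.

    Each list holds (utterance, score) pairs, higher score = more confident.
    Summing scores is an assumed combination rule, used only for illustration.
    """
    totals = defaultdict(float)
    for nbest in nbest_lists:
        for utterance, score in nbest:
            totals[utterance] += score
    if not totals:
        return None
    # On a tie, max() keeps the utterance that was seen first.
    return max(totals, key=totals.get)


# Hypothetical candidate lists from the original signal and its two variants.
original = [("pepperoni", 0.5), ("pepper only", 0.4)]
weak_boosted = [("pepper only", 0.6), ("pepperoni", 0.3)]
strong_boosted = [("pepperoni", 0.7)]

best = vote([original, weak_boosted, strong_boosted])
```

Here one variant disagrees with the other two, and the combined score still favours the candidate that most of the signals support.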

The technique seems simple.

Basically, one variant of the original voice signal x is "expanded", or amplified in a nonlinear way, so that weaker signal strengths are magnified more than stronger ones. In the other variant, x is expanded so that stronger signal strengths are magnified more than weaker ones. A gain factor is also applied to each of the two new signals to account for the overall change in signal strength. At least that's my take on the article.

The formula used for signal expansion is just a simple power function of the form y = g*x^c, where y is the new signal, g is the gain factor and c is the power.

Expected values of c range from 0.6 to 1.4, with g around 20 for c=0.6 and g=0.1 for c=1.4 (see the cited article for more details).
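
A minimal sketch of the two expansions, using the values of c and g quoted above. Two things here are assumptions rather than anything taken from the article: the samples are treated as plain signed amplitudes, and the power is applied to the magnitude with the sign restored afterwards, since x^c is not a real number for negative x and fractional c. The `expand` function and the sample values are hypothetical.

```python
def expand(samples, gain, power):
    """Apply y = gain * |x|**power to each sample, keeping the sample's sign."""
    out = []
    for x in samples:
        magnitude = gain * abs(x) ** power
        out.append(magnitude if x >= 0 else -magnitude)
    return out


# Hypothetical amplitudes: silence, a weak sample (both signs), a strong sample.
signal = [0, 100, -100, 10000]

# c < 1: weak samples are magnified relatively more than strong ones.
weak_boosted = expand(signal, gain=20.0, power=0.6)

# c > 1: strong samples are magnified relatively more than weak ones.
strong_boosted = expand(signal, gain=0.1, power=1.4)
```

A real implementation would probably also need to clip the result to the sample format's range before handing it to the recognizer.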

Now, this is just one example, and there may be other or better ways to improve PS accuracy for phone signals. Any guidance on this and on how to implement it in pocketsphinx (maybe even within FS) would be much appreciated.

Mark.

- Nickolay V. Shmyrev - 2008-11-11
  
  Well, it's not as easy as you might think. Also, I really wonder whether it makes any sense to implement patented technology in an open source ASR system.
  
  There are ways to improve recognition quality, of course, but many of them require significant work. Also, telephony applications have their own issues that are even more important than accuracy. I suggest you define the problem first, evaluate your current system and look for the bottleneck.
  
  - Mark - 2008-11-11
    
    Yes, the power function is in a patent but I was only using this as an example.
    
    You mentioned ways to improve accuracy but said that they need a lot of work. Would you be kind enough to give me an overview of these or point me in the right direction to read more? This may give me enough background to determine whether it's worth the effort. More important was what you said about telephony applications having other issues more pressing than accuracy. Please let me know what these are, or where I can look for more information.
    
    Right now, I'm setting up a test system at home in my spare time. In a business setting, the system I want to cobble together and try will first be set up in a few small practices of doctor friends and some beauty salons. There are some larger places that want to try things out, but the increase in call density won't be very big initially. Right now I can't say much more than the context of the application's use. However, I'm thinking about the problems I might run into and how to solve or at least dissolve them. A very big issue is always the user/customer experience with these speech recognition systems, so accuracy and speed are things I'm always concerned about.
    
    Thanks
    
    Mark
    
- Nickolay V. Shmyrev - 2008-11-11
  
  Well, there are issues in both the decoder and the interface with the
  telephony application.
  
  First, about the decoder: pocketsphinx right now is the best supported
  and most feature-rich decoder of the family, but in general it's still
  oriented toward embedded devices. For telephony applications you
  will probably need to extend it a lot. The features that are currently
  missing are probably:
  
  Out-of-box support for multiple recognizers (probably more a freeswitch
  issue and a model training issue, for example we have no free
  male/female model).
  
  Speaker clustering.
  
  Automatic VTLN estimation from pitch (This looks simple).
  
  Good endpointer.
  
  Discriminative training support in SphinxTrain (Huge task).
  
  Good and clean support for a garbage model, to be able to filter out
  out-of-grammar words.
  
  Embedded RASTA extraction and RASTA model training.
  
  Advanced feature extraction.
  
  Another issue is dialog tracking and understanding. CMU folks are doing
  work on dialog systems; for example, RavenClaw is available:
  
  http://www.ravenclaw-olympus.org/systems_overview.html
  
  It would be worth looking at it and trying to integrate it into
  freepbx. The decoder will need to support a combined language model, and
  you'll also need a component for postprocessing. The postprocessing includes
  disfluency removal, text normalization and text boundary detection. Integration
  with NLTK would probably be useful for sense extraction.
  
  If you need more details on any of the above, feel free to ask.
  
  - Mark - 2008-11-11
    
    Nickolay, thank you for providing the items on accuracy.
    
    Wow, that's a long list but I now have at least four options.
    
    I'm testing so I can leave it as it is and see if unmodified pocketsphinx is good enough.
    
    LumenVox is supported by FreeSwitch. Also, since LumenVox is supposedly based on pocketsphinx then maybe they have done some of the things you have suggested to improve accuracy. If so I could just get a single port lite version license. Otherwise, this would not be an option since I'm getting pocketsphinx for free and I don't want to be paying LumenVox just for some easy to use set-up tools.
    
    Of your list of items, I could make a ranking table of "ease of development" by "degree of improvement" and pick from there. Could you give me a table that estimates these?
    
    Maybe an approach similar to the one outlined in the patent would be easier than any of the things from your list. With patents there are always loopholes but even if there are none, I could use something different than a power function, say a "fast power estimating function" like those you find in computer graphics libraries, or maybe some other quick non-linear expansion. Actually, I'm a bit surprised that they could have patented this approach since it's used in a lot of places throughout science and engineering. Maybe, this is the first time it was used in speech recognition so it's "patentable" but not novel.
    
    I'm not familiar with Raven and what dialogue tracking systems are so I'll have a look. The folks at FreeSwitch might be interested and I'll let them know.
    
    Also, you mention FreePBX. I've heard of them but don't know anything about FreePBX. Are you suggesting that FreePBX might be better suited to what I'm doing than FreeSwitch?
    
    Mark.
    
    - Nickolay V. Shmyrev - 2008-11-12
      
      > I'm testing so I can leave it as it is and see if unmodified pocketsphinx is good enough.
      
      Most probably it's not enough. Try saying something different during the call and see the result :) Even a single "hm" or "um" will break everything. Or think about what happens if a customer asks for a pizza and the recognizer predicts something else.
      
      > LumenVox is supported by FreeSwitch
      
      Well, if you have the money it's probably easiest just to pay. I wouldn't say that "Lumenvox is just a modified pocketsphinx". The issue is that speech systems require a huge amount of tuning and tricks, which takes a lot of time and money. The basics are known to everyone; it's the modifications that are tricky. I haven't used Lumenvox though, and I don't know how they perform. The support for the engine in Freeswitch itself is rather limited as far as I can see.
      
      > Of your list of items, I could make a ranking table of "ease of development" by "degree of improvement" and pick from there. Could you give me a table that estimates these?
      
      Well, my opinion is that it's impossible to do just the easiest thing and leave the others. They are all required, and the time frames for many of the tasks are on the order of a year. I think we should just poke dhd about that. For example, I don't know the state of the VTLN warp factor estimator right now.
      
      > Maybe an approach similar to the one outlined in the patent would be easier than any of the things from your list.
      
      Well, what the patent above describes is called voting across multiple recognizers, and almost everyone uses it nowadays. It's also listed as point 1 in my list. Of course it must be implemented. I suggest you read articles rather than patents, because articles give a better overview of the system as a whole:
      
      http://www.tc-star.org/pubblicazioni/scientific_publications/IBM/tcstareval06.pdf
      
      > Also, you mention FreePBX.
      
      Oops, it was a typo of course. I'd really recommend staying with Freeswitch, just because I like the developers :) Asterisk and its clones are also worth considering, but it's a matter of personal taste.
      
      - Mark - 2008-11-12
        
        Thanks, you made my day ;-)
        
        I forgot about a 5th option, which Brian of FS reminded me about: using Voxeo's Prophecy platform for ASR/TTS, since they give away a 2-port license for free. This works for the set-up I'm considering, which uses Linksys SPA3102's, because FS speaks MRCP.
        
        However, I would still very much like a usable pocketsphinx inside FS as an ASR backup in case Voxeo's service goes down, or in case pocketsphinx ends up being the only ASR used for other reasons. Also, I think a second ASR/TTS (e.g. Cepstral, Flite) gives one more options on the hardware that can be used. Finally, options help to persuade somebody to try the system, and after all persuasion is where "the rubber will meet the road."
        
        Lastly, I like FreeSwitch as well because they just "feel" right, so I want to work with them. I'm betting they will overtake Asterisk for various reasons, one being the multiple OS's that FreeSwitch can run on natively. Plus, FreePBX or Asterisk won't do since they are Linux-only native solutions, and my target market is virtually 100% Windows users: this stuff must run on a Windows machine alongside other Windows applications, with no extra boxes besides ATAs. That's also going to be an interesting issue, but call density is very low and Windows application use isn't intensive. Again, Voxeo helps, but it's not always going to be implemented.
        
        How about the future of pocketsphinx? Will many of the improvements you talk about be put together into a "pocketsphinx integrated development environment", so that the people at FS could easily put it into their system and novice/hobby "cobbler" type programmers like me could easily make use of an ASR that works really well? What time frames are we looking at?
        
        I appreciate the link to the paper which helps get me up to speed and your patient assistance.
        
        Mark.
        
- Mark - 2008-11-13
  
  Nickolay
  
  If you are familiar with the "Pizza Demo" for FreeSwitch then I would appreciate a hand because I believe I'm having a pocketsphinx related problem that has to do with one of the items needed on your pocketsphinx improvements list.
  
  The FS pizza demo uses two JavaScript files (ps_pizza.js, SpeechTools.jm) and at one point in the order you are asked if you want to "pick your own toppings." If you choose this option then a bunch of choices are given. When one responds with their toppings choices, it seems problems occur if you don't say your toppings "fast enough." I don't think it's a speed problem with that part of the session timing out because I believe I set that correctly and ranged the setting between 1000-10000 milliseconds.
  
  Instead, it might be that pocketsphinx is catching one of my pauses as I say my topping choices and interprets that as the end of my choices. If the "good end pointer" on your list means discriminating between pauses and stops then is there a way to test this and what could I do to improve the situation?
  
  Mark.
  
  - Nickolay V. Shmyrev - 2008-11-14
    
    Well, probably so; I need to look closer at this. We need to find out what exactly is passed to pocketsphinx. For example, it's not good to pass silence to the recognizer.
    
    For example, I quickly looked at how speech is collected: there is SWITCH_DECLARE(switch_status_t) switch_ivr_collect_digits_callback,
    and indeed it uses a fixed timeout to cancel recording. It would be nice to implement a callback from the recognizer to get the time when the last word was recognized, and to set the timeout from that time.
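    
    The idea of counting the timeout from the last recognized word, rather than from the start of recording, can be sketched roughly like this. The `UtteranceTimeout` class and its `on_word` hook are hypothetical names made up for illustration; nothing like them is known to exist in FreeSWITCH or pocketsphinx, and the timestamps are passed in by hand to keep the example self-contained.

    ```python
    class UtteranceTimeout:
        """Deadline that restarts every time the recognizer reports a word."""

        def __init__(self, start_ms, timeout_ms):
            self.timeout_ms = timeout_ms
            self.last_activity_ms = start_ms

        def on_word(self, now_ms):
            # Hypothetical hook: call this whenever a new word is recognized.
            self.last_activity_ms = now_ms

        def expired(self, now_ms):
            # Time is measured from the last word, not from the start.
            return now_ms - self.last_activity_ms >= self.timeout_ms


    timer = UtteranceTimeout(start_ms=0, timeout_ms=5000)
    timer.on_word(4000)  # the caller names a topping 4 s in

    # A fixed 5 s timeout would already have fired at 6 s; this one has not.
    still_listening = not timer.expired(6000)
    timed_out = timer.expired(9000)  # 5 s after the last word
    ```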
    
    - Mark - 2008-11-15
      
      Well, it's neither dft_min nor dft_confirm, since one acts as the minimum score accepted and the other acts as the minimum score when no confirmation is requested.
      
      When I first read your response it sounded right, and now I'm even more convinced, since I manipulated all the timeouts I could find, hard-coded and not, with no effect. Getting into the C code is at the edge of what I've barely done, but I'll try anything more than once. However, you seem familiar with the work the guys at FS did on this integration, and I'm wondering whether one could control the collection process as you suggested right from JavaScript, by fixing either of those script files?
      
      Thanks.
      Mark.
      
    - Mark - 2008-11-15
      
      Nickolay, your good tip is helping.
      
      I think the fixed timeout needs to stay in case the recognizer doesn't recognize any of the toppings that are said.
      
      However, as you suggested, another timer could be used.
      
      Maybe a setTimeout in the JavaScript callback function?
      
      What do you think?
      
      - Nickolay V. Shmyrev - 2008-11-15
        
        Sorry, I don't quite understand which setTimeout you are talking about.
        
        There is a constructor pizza.toppingsObtainer = new SpeechObtainer(asr, 1, 5000);
        
        5000 here is a timeout. You can make it bigger if you want, hopefully it will be sufficient for some cases. Or do you speak and try to recognize something different?
        
      - Mark - 2008-11-18
        
        Some of the problems I encountered were helped by choosing different values for dft_min and dft_confirm. Their correct meanings are:
        
        dft_confirm is the threshold (default 400): the higher the number, the louder you have to talk to be considered "talking"
        dft_min is for silence hits (default 35): the number of hits below the threshold before detecting "stop talking"
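        
        Read that way, the two settings behave like a simple level-based end-of-speech detector, which can be sketched as follows. This is only an illustration of the description above, not FreeSWITCH's actual code: the `find_end_of_speech` function is hypothetical, one "hit" is assumed to be one audio frame, the hits are assumed to be consecutive, and the frame levels are made-up numbers.

        ```python
        def find_end_of_speech(levels, threshold=400, silence_hits=35):
            """Return the index of the frame where "stop talking" is declared, or None."""
            talking = False
            quiet_run = 0
            for i, level in enumerate(levels):
                if level >= threshold:
                    # Loud enough to count as "talking"; any quiet run starts over.
                    talking = True
                    quiet_run = 0
                elif talking:
                    quiet_run += 1
                    if quiet_run >= silence_hits:
                        return i
            return None


        # Hypothetical frame levels: speech, a short pause, more speech, then silence.
        levels = [10, 500, 600, 20, 30, 550, 15, 10, 5]

        cut_at_pause = find_end_of_speech(levels, threshold=400, silence_hits=2)
        waits_for_end = find_end_of_speech(levels, threshold=400, silence_hits=3)
        ```

        With too few silence hits the detector stops at the pause in the middle of the utterance; allowing one more hit lets it wait for the real end.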
        
        However, I'm also getting recognition on strange patterns that all start with a superscript of 2222 followed by nothing or standard script like:
        
        2222
        2222||||?
        2222h
        2222p?o??
        2222h|
        2222||||?
        
        What's going on here and what can I do about it?
        
        Mark.
        
- Mark - 2008-11-14
  
  You've gotten to places I haven't ventured but I have looked a bit at the two JavaScript files and see a couple things that look like frequency settings.
  
  In ps_pizza.js one finds these variables and their initial values:
  
  dft_min = 40;
  dft_confirm = 70;
  
  They get passed into the setGrammar method of the SpeechObtainer class and then into the Grammar class. Both classes are in the SpeechTools.jm file. There they are called min_score and confirm_score respectively and, if not set, take default values of 1 and 400.
  
  I'm guessing dft might mean "discrete Fourier transform" and that these variables are frequency settings for some filters. Maybe these values need to be changed. I'll explore more and see where they lead.
  
  Do you have copies of these two JavaScript files?
  