I'd like to have several states where each state has a few (10 to 20) words that can be recognized.
I tried defining a JSGF grammar with several public rules and disabling all but one of the rules (a code sketch follows the steps below) by
1. Obtaining the JSGFGrammar from the configuration.
2. Obtaining the RuleGrammar from JSGFGrammar.getRuleGrammar().
3. Disabling all public rules in the JSGF grammar except the one I want in the current state by calling RuleGrammar.setEnabled(ruleName, false).
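In code, those three steps look roughly like this (a minimal sketch; package names may differ across Sphinx-4 versions, and the grammar component name is just whatever your config.xml uses):

import javax.speech.recognition.RuleGrammar;
import edu.cmu.sphinx.jsapi.JSGFGrammar;

public class RuleSwitcher {
    /** Enable only the public rule for the current dialog state. */
    public static void enableOnly(JSGFGrammar jsgfGrammar, String currentStateRule) {
        // Step 2: get the JSAPI RuleGrammar behind the Sphinx-4 JSGFGrammar.
        RuleGrammar ruleGrammar = jsgfGrammar.getRuleGrammar();
        // Step 3: disable every public rule except the current state's rule.
        String[] names = ruleGrammar.listRuleNames();
        for (int i = 0; i < names.length; i++) {
            ruleGrammar.setEnabled(names[i], names[i].equals(currentStateRule));
        }
    }
}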
The enabled state of the disabled JSGF public rules does change to disabled, but words in the disabled rules are still recognized.
Is there a way to dynamically change the set of words that can be recognized? I'd prefer a JSGF grammar because I'm getting a lot of value out of the JSGF {tags}, but I'm open to anything.
Stan:
Adding support for this is at the very top of our TODO list; however, it may still be a little while before it is implemented (vacations and such will be competing for time). In the meantime, there are some simple things that you can do to make this work (although not as convenient or efficient as our final solution).
One approach is to configure your application to have multiple recognizers, one for each grammar. Since the recognizers are not run simultaneously they can share most of their subcomponents.
This can be set up in the config.xml file. Each recognizer would need its own decoder, each decoder would need its own search manager, each search manager would need its own version of the linguist, and each linguist would need its own version of the appropriate grammar.
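To sketch the idea (the component and property names below are patterned after the demo config files; this is illustrative, not a complete working file):

<config>
    <!-- Recognizer A's chain; a parallel recognizerB -> decoderB -> searchManagerB
         -> linguistB -> grammarB chain would reuse the same shared components. -->
    <component name="recognizerA" type="edu.cmu.sphinx.recognizer.Recognizer">
        <property name="decoder" value="decoderA"/>
    </component>
    <component name="decoderA" type="edu.cmu.sphinx.decoder.Decoder">
        <property name="searchManager" value="searchManagerA"/>
    </component>
    <component name="searchManagerA"
               type="edu.cmu.sphinx.decoder.search.SimpleBreadthFirstSearchManager">
        <property name="linguist" value="linguistA"/>
    </component>
    <component name="linguistA" type="edu.cmu.sphinx.linguist.flat.FlatLinguist">
        <property name="grammar" value="grammarA"/>
        <property name="acousticModel" value="sharedAcousticModel"/>
    </component>
</config>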
Once you have set up multiple recognizers, your application can effectively move between the different states in your dialog graph by calling 'recognize' on the appropriate recognizer.
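In application code, that would look something like this (again a sketch; 'cm' is the ConfigurationManager that loaded the config.xml above):

import edu.cmu.sphinx.recognizer.Recognizer;
import edu.cmu.sphinx.result.Result;

// Allocate both recognizers up front; the shared acoustic model loads only once.
Recognizer recognizerA = (Recognizer) cm.lookup("recognizerA");
Recognizer recognizerB = (Recognizer) cm.lookup("recognizerB");
recognizerA.allocate();
recognizerB.allocate();

// In each dialog state, decode with that state's recognizer.
Result result = inStateA ? recognizerA.recognize() : recognizerB.recognize();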
Again, this is a workaround until we properly implement this feature. Also note that I haven't tested this approach, so you will be breaking new ground.
Let me know if you need help setting up the config.xml file for this.
Paul
Thanks Paul.
I've broken all kinds of new ground watching sphinx4 change over the last several months :-)
Also, I forgot to mention I'm using sphinx4 rather than its predecessors.
Initializing recognizers takes quite a while (about a minute on a slow machine). If I have ten nodes in my dialog graph, we're talking about ten minutes of initialization.
Am I correct? Is it the recognizer initialization that takes the time? (No problem for one, but ten would be a show stopper.) Also, would multiple recognizers raise the memory requirements (now at 256 MB for sphinx4 with one recognizer)?
Currently, I made a SpeechRecognizer bean that requires an array of "current valid responses" as a parameter to a "recognizeSpeech" method. I placed the union of the phrases for all states in the grammar, and I have the bean fire a "userPrompt" event with "what?" if the recognized phrase isn't in the "current valid responses" list. I'm hoping the accuracy will be even greater when I can restrict the grammar to the current state's valid response list.
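The filtering boils down to something like this (a sketch; recognizeSpeech and the userPrompt event are my bean's own names, not Sphinx-4 APIs):

/** Recognize once and accept only phrases valid in the current state. */
public String recognizeSpeech(String[] currentValidResponses) {
    // The grammar contains the union of all states' phrases.
    String heard = recognizer.recognize().getBestResultNoFiller();
    for (int i = 0; i < currentValidResponses.length; i++) {
        if (currentValidResponses[i].equals(heard)) {
            return heard;           // valid in this state
        }
    }
    fireUserPrompt("what?");        // not valid here; re-prompt the user
    return null;
}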
BTW, the accuracy and speed of sphinx4 recognition of a few short phrase choices is remarkable. Maybe an illusory correlation, but it seems better than the commercial counterparts.
Thanks,
--stan
Stan:
Yes, you've been a great groundbreaker, thanks for your patience, and especially, thanks for your kind words about our system.
I believe you mentioned in the past that loading grammars was a small part of the init time for your app. I believe that for your system, the loading of the acoustic model accounts for a significant fraction of the startup time. Since the acoustic model can be shared by the recognizers, it would not need to be reloaded, so I do not think you'd have a 10-minute load time.
As for footprint, again, the acoustic model accounts for a significant portion of the memory footprint. Since this is shared, I do not think you'd see a significant memory footprint increase.
Note that we've made a recent code change to allow the sharing of acoustic models. If you want to use this technique, you'll need to grab the latest code from CVS since it is not included in the src.gzip file yet.
Once again, I must remind you that I've not actually tried this technique. I don't see any reason why it won't work, but ... well, you know how it goes with software.
I am glad to hear that you are getting good speed and accuracy. We are always interested in people's experiences with S4. If you have a moment and feel inspired to share your experiences, we'd be extremely interested. In particular, we'd like to know your grammar vocabulary size; your computer's CPU, speed, and memory; and, if available, the RT (real-time) factor and accuracy that you are seeing. This data would help us understand how S4 is being used and where we should spend our resources. If you are interested, you can email this data to me at Paul dot Lamere at Sun dot com.
Paul
Thanks again Paul.
Yes, I'm interested in experimenting with multiple recognizers to see what happens in initialization speed and overall memory requirements.
My application is what I call SpeechPlayer, a voice-recognition-directed music player.
I have an old P2 300 MHz 384 MB machine gathering dust that I decided to connect to the stereo and use as a music box for MP3s, WAVs, etc.
I have an old CRT to go along with the old computer, but aesthetics dictate I not put the CRT next to the stereo in the living room.
So, I have the old computer plugged into the stereo and running with a keyboard and microphone. I press a key to begin speech recognition and use a one-second window to capture speech responses.
The SpeechPlayer program uses Sphinx4 for recognition, FreeTTS for speech, Java Sound to play WAVs, etc., and the JavaLayer library to play MP3s. I plan to add a little DSP processing for equalization and so on, but it sounds pretty good right now.
The commands are pretty simple: "Would you like to play something or shutdown the computer?" If the response is "play something", then the computer asks "Would you like to play rock, light, Neil Young, or Dylan?" Each response naming music to be played has an associated JSGF tag that points to a directory containing music files. It's all easy to set up by adding directories with music files and associated entries in the grammar file.
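The grammar file looks roughly like this (a sketch; the rule names and the directory paths in the tags are made up for illustration):

#JSGF V1.0;
grammar speechplayer;

// Each tag names the directory of music files to play.
public <topMenu>  = play something {playMenu}
                  | shutdown the computer {shutdown};
public <playMenu> = play rock {D:/music/rock}
                  | play light {D:/music/light}
                  | play neil young {D:/music/neilyoung}
                  | play dylan {D:/music/dylan};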
Initialization takes quite a while on the old P2, but that includes system boot time. Once going, though, response is too fast to bother timing. Also, sphinx4 might like my voice, but I'm getting 100% accuracy on a total vocabulary of {what, nevermind, shutdown the computer, play rock, play light, play neil young, play dylan}. I didn't bother measuring anything because all is working perfectly.
1. I couldn't get endpointing working well enough to be usable, but I might not be using it correctly. A one- or two-second time window works well, though.
2. It was tricky to ensure my app window was in focus in order to capture keystrokes to trigger the recognition. I would have used a mouse click, but I couldn't ensure the system wouldn't respond to wherever my unseen cursor happened to be. I'm open to suggestions for interrupting a song and triggering speech recognition.
3. I'd like to add the capability to select a specific song from a given directory (yes, I can listen to Thrasher several times over), but this will greatly increase the vocabulary and probably reduce accuracy.
Everything works today, and it's all Java, so my friend with a Mac can run it as well (but, of course, he has an iPod and doesn't need SpeechPlayer :-)
Wish list:
1. Support for the JSAPI JSGF methods to disable public rules dynamically.
2. A good way to absolutely, positively capture a mouse click with a Java program (but then, how could you terminate the program? :-)
3. The ability to specify the Java memory parameter in the manifest file of an executable jar file. (I have to use a .bat file today.)
4. A fix for the bug that requires me to place FreeTTS support files in a path with no spaces in its name. (I can't keep everything together in a directory under the standard Windows "My Documents".)
My current sphinx4 config file is rather lengthy so I'll mail it to you for suggestions as to running multiple recognizers. When and if I get it working I'll publish the resulting config file here.
Thanks again,
--stan
Stan:
Thanks for the info. I'm glad to see that S4 is working so well for you, even on a lowly P2 300 MHz 384 MB system. As for your wishlist:
1) This is one of our highest priorities as well.
2) I'm not a GUI programmer, so I'm not sure if this will help or not, but the 1.4 APIs include a full-screen exclusive mode that allows your app to take over the full screen. Not sure how this works when there is no display, though.
http://java.sun.com/docs/books/tutorial/extra/fullscreen/exclusivemode.html
3) If you package your app with a JNLP file, you can specify all sorts of things, including the heap size, classpaths, jar files, and required Java version. JNLP files are described in the WebStart documentation here:
http://java.sun.com/products/javawebstart/developers.html
Note that WebStart isn't just for starting java apps over the web, it is a great way to package apps that are to be run locally. It sounds like this is what you are trying to do.
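For example, a minimal local-launch JNLP descriptor with a larger heap might look like this (a sketch; the file names and paths are placeholders):

<?xml version="1.0" encoding="UTF-8"?>
<jnlp spec="1.0+" codebase="file:///C:/speechplayer/" href="speechplayer.jnlp">
    <information>
        <title>SpeechPlayer</title>
        <vendor>Stan</vendor>
        <offline-allowed/>
    </information>
    <resources>
        <!-- max-heap-size replaces the -Xmx setting from the .bat file -->
        <j2se version="1.4+" max-heap-size="256m"/>
        <jar href="speechplayer.jar"/>
        <jar href="sphinx4.jar"/>
    </resources>
    <application-desc main-class="SpeechPlayer"/>
</jnlp>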
4) This bug has been fixed in the latest FreeTTS release (June 16, 2004). Download it and try it.
Hope this helps
Paul
Stan:
I've tried out this multiple-recognizer technique, and it does indeed work. I was able to set up two separate grammars and switch back and forth between the two.
The first grammar takes 15 seconds to load and the second grammar takes less than 1 second to load on my system, which validates our theory that most of the time is spent loading the acoustic model, which only needs to be loaded once.
Memory footprint grew by about 20%, which isn't too bad.
I've written up a page describing the experiment. This includes the config file and the java source used for the experiment. Note that I based the config file on the helloworld.config.xml (which uses the endpointer).
The write-up is on the twiki at:
http://www.speech.cs.cmu.edu/cgi-bin/cmusphinx/twiki/view/Sphinx4/SwappingGrammars
Good luck
Paul
Hello,
First of all, thanks for the great software! I'm also interested in using dynamic grammars. I'm thinking of implementing a speech-directed XForms client (http://www.w3.org/MarkUp/Forms/).
So what I basically need is to generate a new Grammar for every form.
Can I somehow direct the configuration manager or the recognizer to reload the grammar efficiently without re-initializing the acoustic models? Or can I somehow (efficiently) re-initialize the whole configuration manager without re-initializing the acoustic models etc?
Best Regards,
Mikko:
This is certainly the most requested enhancement for Sphinx-4. Adding direct support for this is a high priority for us. However, it may still be a bit of time before this can be implemented. In the meantime, there is a workaround that allows you to use multiple grammars. It is described here:
http://www.speech.cs.cmu.edu/cgi-bin/cmusphinx/twiki/view/Sphinx4/SwappingGrammars
This allows you to swap between predefined grammars.
I am guessing that you will want to dynamically generate the JSGF, so switching between predefined grammars may not be adequate.
It may be possible to do what you want by telling the linguist to reload the grammars. Let me try a few experiments to see how well this works. I'll get back to you shortly.
Paul
Hi,
Thanks for the quick reply. Indeed, I need to generate the grammars dynamically, so the method shown in
http://www.speech.cs.cmu.edu/cgi-bin/cmusphinx/twiki/view/Sphinx4/SwappingGrammars
does not work for me.
If I could just write the generated grammar into a file and say: reloadGrammar somewhere, that would be sufficient.
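In other words, something with this shape (purely hypothetical; reloadGrammar(), generateJSGF(), and writeFile() are invented here just to show the flow I'm after, not real Sphinx-4 calls):

// Hypothetical flow for speech-enabling one XForms form.
void switchForm(Form form) throws Exception {
    String jsgf = generateJSGF(form);   // build JSGF text for this form's fields
    writeFile("forms.gram", jsgf);      // write it where the linguist will look
    jsgfGrammar.reloadGrammar();        // hypothetical call: rebuild the search graph
}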
-mikko
Mikko:
I've done an experiment and indeed there is a way to do this. I've added a description of this technique here:
http://www.speech.cs.cmu.edu/cgi-bin/cmusphinx/twiki/view/Sphinx4/SwappingGrammars
Please try it out and let me know how it works for you. Note that you'll need to grab the latest Sphinx-4 source from the CVS repository since I had to make a minor change to get this to work.
Paul
Hi,
Thanks! This seems like just what I need. Boy, you're fast :) I will try it out more next week. Have a good weekend!
-mikko
Hi,
Dynamic grammars work well with your approach and fix! Good job! My XForms implementation is already speech-enabled.
Now I'm just fighting to have Sphinx4 and FreeTTS working at the same time on Linux. You have been able to make them work in the Card demo, but I fail to understand what the trick is. I just get different error messages when trying to run them both, even though I call microphone.stop() before using FreeTTS. A FAQ on this subject would be greatly appreciated!
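The sequence I am attempting has roughly this shape (a sketch; the method names are from Sphinx-4's edu.cmu.sphinx.frontend.util.Microphone and FreeTTS's com.sun.speech.freetts.Voice as I understand them, so correct me if they differ in your versions):

// Stop capturing before speaking so the audio device is free for output.
microphone.stopRecording();
voice.speak("Please fill in the next field");   // FreeTTS output
microphone.startRecording();                    // resume capture
Result result = recognizer.recognize();

But instead I get errors like the ones below.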
-mikko
javax.sound.sampled.LineUnavailableException: Audio Device Unavailable
at com.sun.media.sound.HeadspaceMixer.nResume(Native Method)
at com.sun.media.sound.HeadspaceMixer.implOpen(HeadspaceMixer.java:346)
at com.sun.media.sound.AbstractMixer.open(AbstractMixer.java:286)
at com.sun.media.sound.AbstractMixer.open(AbstractMixer.java:323)
at com.sun.media.sound.AbstractDataLine.open(AbstractDataLine.java:101)
at com.sun.speech.freetts.audio.JavaStreamingAudioPlayer.openLine(JavaStreamingAudioPlayer.java:161)
at com.sun.speech.freetts.audio.JavaStreamingAudioPlayer.begin(JavaStreamingAudioPlayer.java:332)
at com.sun.speech.freetts.relp.LPCResult.playWaveSamples(LPCResult.java:503)
at com.sun.speech.freetts.relp.LPCResult.playWave(LPCResult.java:402)
at com.sun.speech.freetts.relp.AudioOutput.processUtterance(AudioOutput.java:56)
at com.sun.speech.freetts.Voice.runProcessor(Voice.java:551)
at com.sun.speech.freetts.Voice.outputUtterance(Voice.java:510)
at com.sun.speech.freetts.Voice.access$100(Voice.java:80)
at com.sun.speech.freetts.Voice$1.run(Voice.java:473)
java.lang.UnsupportedOperationException: Can't get line
at com.sun.speech.freetts.audio.JavaStreamingAudioPlayer.openLine(JavaStreamingAudioPlayer.java:171)
at com.sun.speech.freetts.audio.JavaStreamingAudioPlayer.begin(JavaStreamingAudioPlayer.java:332)
at com.sun.speech.freetts.relp.LPCResult.playWaveSamples(LPCResult.java:503)
at com.sun.speech.freetts.relp.LPCResult.playWave(LPCResult.java:402)
at com.sun.speech.freetts.relp.AudioOutput.processUtterance(AudioOutput.java:56)
at com.sun.speech.freetts.Voice.runProcessor(Voice.java:551)
at com.sun.speech.freetts.Voice.outputUtterance(Voice.java:510)
at com.sun.speech.freetts.Voice.access$100(Voice.java:80)
at com.sun.speech.freetts.Voice$1.run(Voice.java:473)