
Sphinx 2 with limited vocabulary and FSG

Forum: Help
Created: 2009-09-08
Updated: 2012-09-22
  • boris gougeon

    boris gougeon - 2009-09-08

    Hi,

    I'm trying to implement a speech recognizer client using Sphinx II. My goal is to have someone read a written sentence from a text and to get a confidence score for each word.
    So for that I'm using Finite State Grammars and a limited dictionary.
    I got something working, but the confidence scores are very low (< 0.30). I'm using the same settings as the simple Reco example of Sphinx 2.
    Since there is only one possibility for each word pronounced (each word is one state in the grammar), I should normally get high scores and time-efficient recognition, but the hypotheses take time to be computed. Also, I need to get the hypothesis as soon as the word has been pronounced.
    So now I don't actually know where to go to improve this. Am I using the right solution for this kind of problem? Should I instead use Language Models, and if yes, how do I restrict the context to a given sentence in a story?

    Also, do I need to use the Sphinx trainer? So far I've generated the language models/dictionary using the web tool.

    Your help would be much appreciated.
    Thanks, Boris Gougeon
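
    For reference, the grammar for each sentence is just a linear chain with one word per transition, something like this in the Sphinx FSG text format (the grammar name, state numbers and words here are only an illustration):

    FSG_BEGIN sentence
    NUM_STATES 4
    START_STATE 0
    FINAL_STATE 3
    TRANSITION 0 1 1.0 the
    TRANSITION 1 2 1.0 quick
    TRANSITION 2 3 1.0 fox
    FSG_END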

     
    • Nickolay V. Shmyrev

      Hm, first of all, please use pocketsphinx instead of sphinx2; sphinx2 is really deprecated.
      Once you have pocketsphinx results, it will be possible to fix the bugs with confidence, if there are any.

       
    • boris gougeon

      boris gougeon - 2009-09-08

      Thanks for your quick answer Nickolay.

      I'm now trying to migrate to pocketsphinx. Unfortunately I need word segmentation with confidence scores for the respective words inside an utterance, meaning on partial results. I read in the doc that this is not possible right now using the segmentation iterator.
      Do you know a way to do that?

      Thanks,
      Boris G.

       
      • Nickolay V. Shmyrev

        Something like

        static void
        dump_result(int32 start)
        {
            /* ps is the ps_decoder_t * for the current decoder */
            ps_seg_t *iter = ps_seg_iter(ps, NULL);

            while (iter != NULL) {
                int32 sf, ef, pprob;
                float conf;

                /* start/end frame of this word segment */
                ps_seg_frames(iter, &sf, &ef);

                /* posterior probability of the segment, in log domain */
                pprob = ps_seg_prob(iter, NULL, NULL, NULL);
                conf = logmath_exp(ps_get_logmath(ps), pprob);

                /* frames are centiseconds, hence the division by 100.0 */
                printf("%s %f %f %f\n", ps_seg_word(iter),
                       (sf + start) / 100.0, (ef + start) / 100.0, conf);

                iter = ps_seg_next(iter);
            }
        }

        It should be rather straightforward. See the doxygen documentation for more details.
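
        For context, dump_result is meant to be called only after the utterance is finished, roughly like this (read_audio is a placeholder for your capture code; the second argument of ps_start_utt is an optional utterance id):

        int16 buf[2048];
        int32 nsamp;

        ps_start_utt(ps, NULL);                       /* begin a new utterance */
        while ((nsamp = read_audio(buf, 2048)) > 0)   /* placeholder capture call */
            ps_process_raw(ps, buf, nsamp, FALSE, FALSE);
        ps_end_utt(ps);                               /* run the final search pass */
        dump_result(0);                               /* segment scores are valid now */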

         
        • boris gougeon

          boris gougeon - 2009-09-08

          Thanks for your help. The thing is that here, all conf scores will be 1 if you dump the iterator inside an utterance in order to get partial results. Do you know if there is any way to get partial results with real scores?

          Here is what the doc says: "even if -bestpath is enabled, it will also return zero when called on a partial result".

          Thanks, B.

           
          • Nickolay V. Shmyrev

            > The thing is that here, all conf scores will be 1 if you dump the iterator inside an utterance in order to get partial results.

            The search is done this way; I don't think it's possible to calculate something on a partial result. Are you sure you need that? To get a confidence, you need to finish the utterance first.

             
    • boris gougeon

      boris gougeon - 2009-09-09

      I guess so, since I need feedback for each word. As soon as the person says a word, I need to highlight the word said, depending on the confidence score.
      Another solution would be to have an utterance per word, but I guess it is quite heavy to do that. Do you have any other idea how I could implement that? Sphinx2 offers partial results with conf scores, but they are very bad, and it's not time-efficient at all...

       
      • Nickolay V. Shmyrev

        > I guess so, since I need feedback for each word.

        You don't need this. You need to split the incoming audio into chunks at the pauses and decode each chunk. As soon as the user makes a pause, you present them the result. This is how a live decoder works; see the sketch at the end of this post.

        > Another solution would be to have an utterance per word, but I guess it is quite heavy to do that.

        I'm not sure why you guess that.

        > Sphinx2 offers partial results with conf scores, but they are very bad, and it's not time-efficient at all...

        I don't think the numbers you are talking about are confidence scores; they are probably something different. Also, I don't think they are bad.
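
        A rough sketch of that loop, modeled on the continuous.c example shipped with pocketsphinx (cont_ad is the silence filter from sphinxbase; exact details differ between versions):

        #include <pocketsphinx.h>
        #include <ad.h>
        #include <cont_ad.h>
        #include <unistd.h>

        /* Decode one chunk of speech per pause; ps and ad are assumed
           to be initialized already. */
        static void
        recognize_from_mic(ps_decoder_t *ps, ad_rec_t *ad)
        {
            cont_ad_t *cont = cont_ad_init(ad, ad_read); /* silence filter */
            int16 adbuf[4096];
            int32 k, ts;
            char const *hyp, *uttid;

            ad_start_rec(ad);
            cont_ad_calib(cont);      /* calibrate to the background noise */

            for (;;) {
                /* wait until the filter returns non-silent audio */
                while ((k = cont_ad_read(cont, adbuf, 4096)) == 0)
                    usleep(100000);

                ps_start_utt(ps, NULL);
                ps_process_raw(ps, adbuf, k, FALSE, FALSE);
                ts = cont->read_ts;

                /* keep decoding until about one second of silence */
                for (;;) {
                    if ((k = cont_ad_read(cont, adbuf, 4096)) < 0)
                        break;
                    if (k == 0) {
                        if ((cont->read_ts - ts) > DEFAULT_SAMPLES_PER_SEC)
                            break;    /* long pause: the chunk is over */
                    } else {
                        ts = cont->read_ts;
                        ps_process_raw(ps, adbuf, k, FALSE, FALSE);
                    }
                }

                ps_end_utt(ps);
                hyp = ps_get_hyp(ps, NULL, &uttid);
                printf("chunk: %s\n", hyp ? hyp : "");
                /* walk the segment iterator here for per-word scores */
            }
        }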

         
    • boris gougeon

      boris gougeon - 2009-09-09

      Ok thanks for your help, it's gonna be very helpful.

      So if I understand well, I just need to do something like the continuous recording example. But once the user pauses (between words) I'll load the next word to recognize and feed back the result. This means that each word will be considered as an utterance (and then a two-state grammar).
      What I was wondering about was the loading time for an FSG: if I have, say, 2000 words in my story, it means I will have a set of 2000 FSGs to preload, and that's why I supposed it was heavy.
      I also thought computing the result between two utterances would take some time, resulting in something not really fluent.

      Thanks again for your help, Speech Recognition is not obvious when you just start!
      B.

       
      • Nickolay V. Shmyrev

        > So if I understand well, I just need to do something like the continuous recording example. But once the user pauses (between words) I'll load the next word to recognize and feed back the result. This means that each word will be considered as an utterance (and then a two-state grammar). What I was wondering about was the loading time for an FSG: if I have, say, 2000 words in my story, it means I will have a set of 2000 FSGs to preload, and that's why I supposed it was heavy. I also thought computing the result between two utterances would take some time, resulting in something not really fluent.

        You understand correctly about the continuous recognition example. The rest is wrong. You don't need a finite state grammar or a set of grammars. You need to build a trigram language model from the text and recognize the audio using it. After the end of each utterance you should process the result and present the scores.
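
        The decoder setup is then just the usual one, with -lm pointing at the model built from your story text. Roughly, with placeholder paths:

        #include <pocketsphinx.h>

        /* Minimal decoder setup with a trigram LM built from the story
           text. All paths below are placeholders for your own files. */
        cmd_ln_t *config = cmd_ln_init(NULL, ps_args(), TRUE,
                "-hmm",  "model/hmm/wsj1",  /* acoustic model directory */
                "-lm",   "story.lm",        /* trigram LM from the story */
                "-dict", "story.dic",       /* dictionary covering its words */
                NULL);
        ps_decoder_t *ps = ps_init(config);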

         
    • boris gougeon

      boris gougeon - 2009-09-09

      Ok, thanks again for your answer! So if I generate the trigram language model, using the Simple LM tool or the web-based language model generator, I will be able to recognize a word by itself even if it is in the middle of a sentence, right? But for that, would I need to restrict the context to the current sentence?
      Is it possible to restrict the context? The LM generator generates a .sent file; is that the file I need to use?

       
      • Nickolay V. Shmyrev

        There is no need to restrict the context. Once you restrict the variants, you get bad confidence scores. The variability of the language model is important. If a text has 2000 words, the accuracy will be around 95%.

        If you want to make sure you selected the correct path, most likely you need to dump a lattice and implement a custom search there.
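
        Roughly, once the utterance is finished (ps_get_lattice exists in recent pocketsphinx versions; the file name is arbitrary):

        /* After ps_end_utt(): grab the word lattice and dump it to a file
           so you can run a custom search over it. */
        ps_lattice_t *dag = ps_get_lattice(ps);
        if (dag != NULL)
            ps_lattice_write(dag, "utt.lat");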

         
