
Utterance probabilities in JSGF Grammar

Horia Cucu
2012-05-12
2012-09-21
  • Horia Cucu

    Horia Cucu - 2012-05-12

    Hi,

    I've created a simple JSGF Grammar for recognizing numbers up to 9 digits long
    using the JSGFDemo and the DialogDemo for guidance. I've tested the
    application and apparently the recognition rate is different for different
    types of numbers (smaller/larger numbers). My guess is that the probabilities
    assigned to different types of utterances (within the grammar) are to blame.
    That's because, at this point, I haven't explicitly specified probabilities for
    the different alternatives in the grammar.

    My question is: is there a way to print valid utterances plus their
    probabilities when printing a set of utterances (just like in the DialogDemo)?
    And a second question: is there a way to check if a sequence of words is valid
    according to the grammar? And if it is valid, is there a way to check its
    probability within the grammar?

    Thanks,
    Horia

     
  • Horia Cucu

    Horia Cucu - 2012-05-29

    Hi all,

    I'm coming back to this subject. The questions above are still unanswered
    for me, but I'll try to give you more details.
    Here's the grammar I'm trying to use:

    #JSGF V1.0;
    grammar numbers;
    <unitsWO0> = one | two | three | four | five | six | seven | eight | nine;
    <units> = zero | <unitsWO0>;
    <elevens> = eleven | twelve | thirteen | fourteen | fifteen | sixteen | seventeen | eighteen | nineteen;
    <simpleTens> = twenty | thirty | forty | fifty | sixty | seventy | eighty | ninety;
    <max2digitNumbers> = <units> | <elevens> | <simpleTens> [<unitsWO0>];
    <3digitNumbers> = <unitsWO0> hundred [<max2digitNumbers>];
    <max3digitNumbers> = <max2digitNumbers> | <3digitNumbers>;
    <max6digitNumbers> = <max3digitNumbers> | (<max3digitNumbers> thousand [<max3digitNumbers>]);
    <max9digitNumbers> = <max6digitNumbers> | (<max3digitNumbers> million [<max6digitNumbers>]);
    public <rationalNumbers> = <max9digitNumbers> [point <max3digitNumbers>];
    

    Now here's my new question: My spoken utterance is "five million three hundred
    thousand". The BestResultNoFiller returned by Sphinx4 is "five million three
    hundred thousand" (which is good), but the BestFinalResultNoFiller is "". I
    can't understand this, because I think that, given the above grammar, the text
    "five million three hundred thousand" should be a valid and complete
    utterance.

    Any ideas? And if you have any ideas regarding my previous questions I would
    be very happy :)
    Thanks,
    Horia

     
  • Horia Cucu

    Horia Cucu - 2012-05-29

    One more clue (which really puzzles me!): if I replace this line:

    <max6digitNumbers> = <max3digitNumbers> | (<max3digitNumbers> thousand [<max3digitNumbers>]);
    

    with this line:

    <max6digitNumbers> = <max3digitNumbers> | (<max3digitNumbers> thousand); /* [<max3digitNumbers>]);*/
    

    That is, when I comment out the optional part of <max6digitNumbers> I get
    the BestFinalResultNoFiller "five million three hundred thousand".

    Is this a bug? Or am I going crazy?
    Horia

     
  • Nickolay V. Shmyrev

    "is there a way to check if a sequence of words is valid according to the
    grammar? "

    JSAPI implements a parse method that does this. See the JSAPI demos in
    sphinx4 for details.
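
    To illustrate what such a validity check does, here is a minimal Python
    sketch. It is not the JSAPI parse method and does not use Sphinx4 at all: it
    is a hand translation of the English grammar posted earlier in this thread
    (with "eight" spelled correctly) into small matcher functions. All helper
    names (word, seq, alt, opt, is_valid) are hypothetical and exist only for
    this example.

```python
# Word lists copied from the grammar posted in this thread.
UNITS_WO0 = "one two three four five six seven eight nine".split()
UNITS = ["zero"] + UNITS_WO0
ELEVENS = ("eleven twelve thirteen fourteen fifteen sixteen "
           "seventeen eighteen nineteen").split()
SIMPLE_TENS = "twenty thirty forty fifty sixty seventy eighty ninety".split()


def word(options):
    """Match exactly one token taken from `options`."""
    def match(tokens, i):
        return {i + 1} if i < len(tokens) and tokens[i] in options else set()
    return match


def seq(*parts):
    """Match all parts one after another (JSGF sequence)."""
    def match(tokens, i):
        positions = {i}
        for part in parts:
            positions = {j for p in positions for j in part(tokens, p)}
        return positions
    return match


def alt(*choices):
    """Match any one of the choices (JSGF `a | b`)."""
    def match(tokens, i):
        return set().union(*(c(tokens, i) for c in choices))
    return match


def opt(part):
    """Match the part or skip it (JSGF `[a]`)."""
    def match(tokens, i):
        return {i} | part(tokens, i)
    return match


# Each matcher returns the set of token positions where a match may end.
units_wo0 = word(UNITS_WO0)
max2 = alt(word(UNITS), word(ELEVENS), seq(word(SIMPLE_TENS), opt(units_wo0)))
three = seq(units_wo0, word(["hundred"]), opt(max2))
max3 = alt(max2, three)
max6 = alt(max3, seq(max3, word(["thousand"]), opt(max3)))
max9 = alt(max6, seq(max3, word(["million"]), opt(max6)))
rational = seq(max9, opt(seq(word(["point"]), max3)))


def is_valid(utterance):
    """True if the whole utterance is accepted by the public rule."""
    tokens = utterance.split()
    return len(tokens) in rational(tokens, 0)
```

    Under this sketch, is_valid("five million three hundred thousand") is True,
    so the utterance is complete according to the grammar as posted, while
    is_valid("ten") is False because the posted grammar has no rule containing
    "ten".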

     
  • Nickolay V. Shmyrev

    Now here's my new question: My spoken utterance is "five million three
    hundred thousand". The BestResultNoFiller returned by Sphinx4 is "five million
    three hundred thousand" (which is good), but the BestFinalResultNoFiller is
    "". I can't understand this, because I think that, given the above grammar, the
    text "five million three hundred thousand" should be a valid and complete
    utterance.

    This might be a bug, but it's pointless to discuss it without an example to
    run. There might be a difference in configuration which prevents the grammar
    from reaching the final state. For example, the speech grammar might require a
    final silence after the node, and that silence was not found in your audio.

     
  • Horia Cucu

    Horia Cucu - 2012-06-01

    This might be a bug, but it's pointless to discuss it without an example to
    run. There might be a difference in configuration which prevents the grammar
    from reaching the final state. For example, the speech grammar might require a
    final silence after the node, and that silence was not found in your audio.

    I agree, but how can I provide you with an example? Would the config file be
    enough or do you need a full application (with an acoustic model, dictionary
    and everything else)?

    Thanks a lot,
    Horia

     
  • Nickolay V. Shmyrev

    The best example is usually a small application which reads data from a file
    and returns the result you consider wrong. That will ease debugging of the
    problem and get you an answer faster. You can pack everything into an archive
    and upload it to a public file sharing service like Dropbox. Don't forget to
    post the link afterwards.

     
  • Horia Cucu

    Horia Cucu - 2012-06-02

    Hi Nickolay,

    I've packed a simple application and uploaded it here: speed.pub.ro/tmp/
    I've provided two acoustic models (AM#1 and AM#2) - can be changed from the
    config file.
    I'm posting below the output of the system when it uses AM#1 and then AM#2
    trying to transcribe 4 wav files (also provided).
    The grammar models a number with a maximum of 9 digits, optionally followed
    by a point and a number of up to 3 digits.
    The language is Romanian.

    Here are my questions:

    1. why in the case of file 1338298530832.wav AM#1 outputs a (correct) best result which is not marked as final although it can be final according to the grammar?
      how can I debug the above issue?
    2. is there any way to tell the recognizer that there won't be any other frames coming so it should try to reach the exit node of the grammar using the available frames?
    3. do you have any tips for designing this type of grammar? mine has 1583 nodes (which seems huge for this simple task!)
    4. how can I design the grammar to have a uniform probability density over all the numbers? is there a well known method for this or should I think of something smart?

    output for AM#1:

    1338451258896.wav:
        reference=douăzeci (twenty)
        bestFinalResultNoFiller=
        bestResultNoFiller=nouăzeci şi
    1338298530832.wav:
        reference=cinci milioane trei sute de mii (five million three hundred thousand)
        bestFinalResultNoFiller=
        bestResultNoFiller=cinci milioane trei sute de mii
    1338451531011.wav:
        reference=şaizeci şi doi (sixty two)
        bestFinalResultNoFiller=cincisprezece
        bestResultNoFiller=cincisprezece
    1338451418840.wav:
        reference=treizeci şi cinci (thirty five)
        bestFinalResultNoFiller=cincizeci şi cinci mii
        bestResultNoFiller=cincizeci şi cinci mii
    

    output for AM#2:

    1338451258896.wav:
        reference=douăzeci (twenty)
        bestFinalResultNoFiller=douăzeci
        bestResultNoFiller=douăzeci
    1338298530832.wav:
        reference=cinci milioane trei sute de mii (five million three hundred thousand)
        bestFinalResultNoFiller=cinci milioane trei sute de mii opt
        bestResultNoFiller=cinci milioane trei sute de mii opt
    1338451531011.wav:
        reference=şaizeci şi doi (sixty two)
        bestFinalResultNoFiller=şaizeci şi doi
        bestResultNoFiller=şaizeci şi doi
    1338451418840.wav:
        reference=treizeci şi cinci (thirty five)
        bestFinalResultNoFiller=treizeci şi cinci
        bestResultNoFiller=treizeci şi cinci
    

    Thank you very much for your help!
    Horia

     
  • Nickolay V. Shmyrev

    1. why in the case of file 1338298530832.wav AM#1 outputs a (correct) best
      result which is not marked as final although it can be final according to the
      grammar?

    In order to be marked as final, a token must reach the grammar's final state.
    In your case the grammar's final state is not reached, only the state of the
    last word. The reason is that your word insertion probability is very small
    compared to the beam (1e-36 and 1e-80), so the grammar's final token is pruned
    during expansion. The possible fixes are to widen the beam to 1e-200 or to
    increase the word insertion probability to 0.2. The second option, increasing
    the word insertion probability, is far more reasonable.
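
    The arithmetic behind this can be sketched in Python. This is a toy model
    only, not Sphinx4's pruning code: it assumes 1e-36 is the word insertion
    probability and 1e-80 is the relative beam, and that a path paying
    extra_words more insertion penalties than the best competing path is pruned
    once the accumulated penalty falls outside the beam. Acoustic scores are
    ignored, and the function name survives_beam is hypothetical.

```python
import math


def survives_beam(extra_words, word_insertion_prob, relative_beam):
    """Toy check: does a path with `extra_words` additional word insertion
    penalties stay within `relative_beam` of the best path?

    Everything is compared in the log10 domain, ignoring acoustic scores.
    """
    penalty_log10 = extra_words * math.log10(word_insertion_prob)
    return penalty_log10 >= math.log10(relative_beam)


# Settings assumed from the thread: insertion probability 1e-36, beam 1e-80.
print(survives_beam(2, 1e-36, 1e-80))   # two extra words: 1e-72 is inside 1e-80
print(survives_beam(3, 1e-36, 1e-80))   # three extra words: 1e-108 is outside

# The two suggested fixes, under the same toy model.
print(survives_beam(3, 0.2, 1e-80))     # larger insertion probability
print(survives_beam(3, 1e-36, 1e-200))  # wider beam
```

    In this model each extra word costs 36 orders of magnitude, so an 80-order
    beam is exhausted after two words, whereas with an insertion probability of
    0.2 each word costs less than one order of magnitude.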

    how can I debug the above issue?

    Just like any other application, go step by step and explore the structures
    and their contents.

    2. is there any way to tell the recognizer that there won't be any other
      frames coming so it should try to reach the exit node of the grammar using the
      available frames?

    This is what is already happening when you retrieve bestFinalResult.

    3. do you have any tips for designing this type of grammar? mine has 1583
      nodes (which seems huge for this simple task!)

    The number of nodes is not huge; it should be OK to handle. There are ways to
    optimize it. For example, this structure is more optimal:

    <number>

    than this one:

    <number> | thousand <number> | <number> thousand <number> hundred <number>

    Skips are better than branches, though it's not a big deal.

    You can dump a grammar to a dot file (set the property showGrammar to true)
    and explore it with Graphviz.

    4. how can I design the grammar to have a uniform probability density over
      all the numbers? is there a well known method for this or should I think of
      something smart?

    The probability is already uniform, there is no way to adjust anything.

     
  • Horia Cucu

    Horia Cucu - 2012-06-04

    Hello again!

    First of all thank you very much for your help!

    Regarding the problem I had (word insertion probability is very small compared
    to the beam): this happened because I tried to configure the recognizer just
    as in the DialogDemo, without trying to understand what's with all these
    properties (insertionProbabilities, beams, etc., etc.).

    I've also developed a large vocabulary ASR system for Romanian with CMU Sphinx
    (and Sphinx 4), but this was also a blind process (I've just used the HUB
    config with some minor modifications).

    I really want to be able to go deeper into Sphinx4, understand all the
    framework, components and properties and finally develop real-world
    applications for Romanian. Maybe even contribute with other components. The
    question is: where do I start?! It seems that the non-systematic,
    example-by-example method of trying to understand and build systems got me to
    this point. What do you suggest I do next?

    Thanks,
    Horia

    PS: To deepen my understanding, I've been reading and using the following:

    1. http://cmusphinx.sourceforge.net/sphinx4/
    2. javadocs
    3. source code and config files for various demos
    4. http://cmusphinx.sourceforge.net/sphinx4/doc/Sphinx4Whitepaper.pdf

    PS2: I finished my PhD in speech recognition last year (so I have a decent
    theoretical background)

     
  • Nickolay V. Shmyrev

    without trying to understand what's with all these properties

    so I have a decent theoretical background

    Those statements contradict each other. A theoretical background usually means
    you know what the beam is. If you are still not familiar with the algorithms,
    you might want to review a speech recognition textbook.

    I really want to be able to go deeper into Sphinx4, understand all the
    framework, components and properties and finally develop real-world
    applications for Romanian.

    The practical approach is usually to start developing a real-world
    application and go deeper into the system as its needs require.

     
  • Horia Cucu

    Horia Cucu - 2012-06-06

    Those statements contradict each other. A theoretical background usually means
    you know what the beam is. If you are still not familiar with the algorithms,
    you might want to review a speech recognition textbook.

    ... what can I say?! :( Maybe you're right.
    I'll try to dig in further and be more specific when I ask questions. If you
    have any suggestions for speech recognition textbooks that are closely
    linked to Sphinx4, please tell me.

    horia

     
