Hi,
I've created a simple JSGF Grammar for recognizing numbers up to 9 digits long
using the JSGFDemo and the DialogDemo for guidance. I've tested the
application and apparently the recognition rate is different for different
types of numbers (smaller/larger numbers). My guess is that the probabilities
assigned to different types of utterances (within the grammar) are to blame.
That's because at this point I haven't explicitly specified probabilities for
the different alternatives in the grammar.
My question is: is there a way to print valid utterances plus their
probabilities when printing a set of utterances (just like in the DialogDemo)?
And a second question: is there a way to check if a sequence of words is valid
according to the grammar? And if it is valid, is there a way to check its
probability within the grammar?
Thanks,
Horia
Hi all,
I'm coming back to this subject. The issues above are still open
questions for me, but I'll try to give you more details.
Here's the grammar I'm trying to use:
Now here's my new question: My spoken utterance is "five million three hundred
thousand". The BestResultNoFiller returned by Sphinx4 is "five million three
hundred thousand" (which is good), but the BestFinalResultNoFiller is "". I
can't understand this because I think that given the above grammar the text
"five million three hundred thousand" would be a valid and complete
utterance.
Any ideas? And if you have any ideas regarding my previous questions I would
be very happy :)
Thanks,
Horia
One more clue (which really puzzles me!): if I replace this line:
with this line:
That is, when I comment out the optional part of <max6digitnumbers> I get
the BestFinalResultNoFiller "five million three hundred thousand".
Is this a bug? Or am I going crazy?
Horia
If the grammar syntax is complicated, you can plot the FSG using Graphviz:
https://sourceforge.net/projects/cmusphinx/forums/forum/5471/topic/4593645
From the plot you can easily check which utterances are valid.
This is equivalent to printing all possible character sequences that match a
regular expression, but it runs into trouble if there is a * (loop) somewhere,
because there are then infinitely many possible answers (see
http://stackoverflow.com/questions/7614962/create-set-of-all-possible-matches-for-a-given-regex).
The parse method is implemented in JSAPI to do that. See the JSAPI demos in
Sphinx4 for details.
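To make the idea concrete, here is a minimal, self-contained Java sketch of checking a word sequence against a grammar graph and computing its probability. It does not use the JSAPI or Sphinx4 classes: the GrammarWalkSketch class, its state numbering, the toy grammar, and the assumption that every alternative leaving a state (including stopping at a final state) is equally likely are all made up for illustration.

```java
import java.util.*;

public class GrammarWalkSketch {
    // state -> (word -> next state). Hypothetical toy graph, not Sphinx4's grammar classes.
    private final Map<Integer, Map<String, Integer>> arcs = new HashMap<>();
    private final Set<Integer> finals = new HashSet<>();

    void addArc(int from, String word, int to) {
        arcs.computeIfAbsent(from, k -> new LinkedHashMap<>()).put(word, to);
    }

    void markFinal(int state) {
        finals.add(state);
    }

    // Number of equally likely choices at a state: each outgoing word,
    // plus "stop here" if the state is final (an assumption of this sketch).
    private int choices(int state) {
        int n = arcs.getOrDefault(state, Collections.emptyMap()).size();
        return finals.contains(state) ? n + 1 : n;
    }

    // Returns the probability of the word sequence, or 0.0 if the grammar rejects it.
    double probability(List<String> words) {
        int state = 0; // state 0 is the start state
        double p = 1.0;
        for (String word : words) {
            Integer next = arcs.getOrDefault(state, Collections.emptyMap()).get(word);
            if (next == null) {
                return 0.0; // no arc for this word: not a valid utterance
            }
            p /= choices(state);
            state = next;
        }
        if (!finals.contains(state)) {
            return 0.0; // valid prefix, but not a complete utterance
        }
        return p / choices(state);
    }

    // Toy grammar: ("five" | "three") ["hundred"]
    static GrammarWalkSketch toyGrammar() {
        GrammarWalkSketch g = new GrammarWalkSketch();
        g.addArc(0, "five", 1);
        g.addArc(0, "three", 1);
        g.addArc(1, "hundred", 2);
        g.markFinal(1);
        g.markFinal(2);
        return g;
    }

    public static void main(String[] args) {
        GrammarWalkSketch g = toyGrammar();
        System.out.println(g.probability(Arrays.asList("five", "hundred"))); // 0.25
        System.out.println(g.probability(Arrays.asList("hundred")));         // 0.0
    }
}
```

A return value of 0.0 means the sequence is not a complete utterance of the toy grammar; any positive value is the product of the branch probabilities along the path.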
This might be a bug, but it's pointless to discuss it without an example to
run. There might be a difference in configuration that prevents the grammar
from reaching the final state. For example, the speech grammar might require a
final silence after the node, which was not found in your audio.
I agree, but how can I provide you with an example? Would the config file be
enough or do you need a full application (with an acoustic model, dictionary
and everything else)?
Thanks a lot,
Horia
The best example is usually a small application that reads data from a file
and returns the result you believe to be wrong. That will ease debugging of
the problem and get you an answer faster. You can pack everything
into an archive and upload it to a public file-sharing service like Dropbox.
Don't forget to post a link afterwards.
Hi Nicolay,
I've packed a simple application and uploaded it here: speed.pub.ro/tmp/
I've provided two acoustic models (AM#1 and AM#2) - can be changed from the
config file.
I'm posting below the output of the system when it uses AM#1 and then AM#2
trying to transcribe 4 wav files (also provided).
The grammar models a number of at most 9 digits, optionally followed
by a dot and a 3-digit number.
The language is Romanian.
Here are my questions:
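Why, in the case of file 1338298530832.wav, does AM#1 output a (correct) best result that is not marked as final, although it can be final according to the grammar?
How can I debug the above issue?
Is there any way to tell the recognizer that no more frames are coming, so that it should try to reach the exit node of the grammar using the available frames?
Do you have any tips for designing this type of grammar? Mine has 1583 nodes (which seems huge for this simple task!).
How can I design the grammar to have a uniform probability density over all the numbers? Is there a well-known method for this, or should I think of something smart?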
output for AM#1:
output for AM#2:
Thank you very much for your help!
Horia
Why, in the case of file 1338298530832.wav, does AM#1 output a (correct) best
result that is not marked as final, although it can be final according to the
grammar?
In order to be marked as final, a token has to reach the grammar's final state.
In your case the grammar's final state is not reached, only the last word
state. The reason is that your word insertion probability is very small
compared to the beam (1e-36 and 1e-80 respectively), so the grammar's final
token is pruned during expansion. The options are to widen the beam to 1e-200
or to increase the word insertion probability to 0.2. The second option,
increasing the word insertion probability, is far more reasonable.
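As a rough illustration of why the final token gets pruned, here is a toy Java sketch of relative beam pruning in the log domain. It is not Sphinx4's actual pruning code: the class, the method names, the best score of -500, and in particular the assumption that the pruned token has paid three more word insertion penalties than the best token are hypothetical. Only the values 1e-36, 1e-80, 1e-200 and 0.2 come from the discussion above.

```java
public class BeamPruningSketch {
    // A token survives relative beam pruning if its score is within the beam
    // of the best score. All values are log10 probabilities.
    static boolean survives(double log10Score, double log10Best, double log10Beam) {
        return log10Score >= log10Best + log10Beam;
    }

    // Score of a token that has paid `penalties` more word insertion penalties
    // than the best token (hypothetical setup, for illustration only).
    static double penalized(double log10Best, double log10Wip, int penalties) {
        return log10Best + penalties * log10Wip;
    }

    public static void main(String[] args) {
        double best = -500.0; // arbitrary best score
        int penalties = 3;    // hypothetical number of extra penalties

        // Word insertion probability 1e-36, beam 1e-80: the token is pruned.
        System.out.println(survives(penalized(best, -36.0, penalties), best, -80.0));  // false

        // Same word insertion probability, beam widened to 1e-200: the token survives.
        System.out.println(survives(penalized(best, -36.0, penalties), best, -200.0)); // true

        // Word insertion probability raised to 0.2, original beam: the token survives.
        System.out.println(survives(penalized(best, Math.log10(0.2), penalties), best, -80.0)); // true
    }
}
```

With a penalty of 1e-36 per word, a few penalties are enough to push a token below a 1e-80 beam, which is why raising the word insertion probability is the cheaper fix: widening the beam keeps many more tokens alive during the search.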
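How can I debug the above issue?
Just like any other application: go step by step and explore the structures
and their contents.
Is there any way to tell the recognizer that no more frames are coming, so
that it should try to reach the exit node of the grammar using the available
frames?
This is what already happens when you retrieve bestFinalResult.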
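Do you have any tips for designing this type of grammar? Mine has 1583
nodes (which seems huge for this simple task!).
The number of nodes is not huge; it should be fine to handle. There are ways
to optimize it. For example, this structure is more optimal: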
<number>
than this one:
<number> | thousand <number> | <number> thousand <number> hundred <number>
Skips are better than branches, though it's not a big deal.
You can dump the grammar to a DOT file (set the showGrammar property to true)
and explore it with Graphviz.
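For readers who have not seen the DOT format, here is a small Java sketch of what such a dump boils down to: one line per arc of the grammar graph. This is not the code Sphinx4 runs when showGrammar is enabled; the DotDumpSketch class, the toDot method and the three-arc graph are hypothetical and only show the file format that Graphviz reads.

```java
import java.util.ArrayList;
import java.util.List;

public class DotDumpSketch {
    // Builds Graphviz DOT text from arcs given as {fromNode, toNode, label}.
    static String toDot(String graphName, List<String[]> arcs) {
        StringBuilder sb = new StringBuilder("digraph " + graphName + " {\n");
        for (String[] arc : arcs) {
            sb.append("  \"").append(arc[0]).append("\" -> \"").append(arc[1])
              .append("\" [label=\"").append(arc[2]).append("\"];\n");
        }
        return sb.append("}\n").toString();
    }

    public static void main(String[] args) {
        // Toy graph for ("five" | "three") ["hundred"]; node names are made up.
        List<String[]> arcs = new ArrayList<>();
        arcs.add(new String[] {"start", "digit", "five"});
        arcs.add(new String[] {"start", "digit", "three"});
        arcs.add(new String[] {"digit", "end", "hundred"});
        System.out.print(toDot("grammar", arcs));
    }
}
```

If the output is saved as grammar.dot, it can be rendered with the Graphviz command line, for example `dot -Tpng grammar.dot -o grammar.png`.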
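How can I design the grammar to have a uniform probability density over
all the numbers? Is there a well-known method for this, or should I think of
something smart?
The probability is already uniform, there is no way to adjust anything.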
Hello again!
First of all thank you very much for your help!
Regarding the problem I had (word insertion probability is very small compared
to the beam): this happened because I tried to configure the recognizer just
as in the DialogDemo, without trying to understand what's with all these
properties (insertionProbabilities, beams, etc., etc.).
I've also developed a large vocabulary ASR system for Romanian with CMU Sphinx
(and Sphinx 4), but this was also a blind process (I've just used the HUB
config with some minor modifications).
I really want to be able to go deeper into Sphinx4, understand all the
framework, components and properties and finally develop real-world
applications for Romanian. Maybe even contribute other components. The
question is: where do I start?! It seems that the non-systematic,
example-by-example method of trying to understand and build systems got me to
this point. What do you suggest I do next?
Thanks,
Horia
PS: To deepen my understanding, I've been reading and using the
following:
PS2: I finished my PhD in speech recognition last year (so I have a decent
theoretical background)
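"without trying to understand what's with all these properties"
"so I have a decent theoretical background"
Those statements contradict each other. A theoretical background usually means
you know what the beam is. If you are still not familiar with the algorithms,
you might want to review a speech recognition textbook.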
The practical approach is usually to start developing a real-world
application and go deeper into the system as the need arises.
... what can I say?! :( Maybe you're right.
I'll try to dig in further and be more specific when I ask questions. If you
have any suggestions about speech recognition textbooks that are closely
linked to Sphinx4, please tell me.
Horia