I created a corpus (for reference, I appended it to the
end of this message) and uploaded it to the Sphinx
Knowledge Base Tool
(http://www.speech.cs.cmu.edu/tools/lmtool.html). I then
used the resulting files to test recognition. I noticed
the following anomalies:
1. In "live" mode, SHOOT_THE_OTHER_ONE and
SHOOT_THE_LEFT_ONE have a tendency to be recognized as
SHOOT_THE_RIGHT_ONE or worse. I never saw
SHOOT_THE_LEFT_ONE (rightly or wrongly) recognized. One
might assume that something simple is broken.
However, this is contradicted by the fact that all the
other phrases are recognized with a very high degree of
accuracy (including distinguishing between GO_LEFT and
GO_RIGHT).
2. I tried to investigate further in "batch" mode. I made
recordings of "shoot the other one", "shoot the left one"
and "shoot the right one" and asked for a list of the top
4 hypotheses. Here are the results:
"shoot the other one"
<s> SHOOT_THE_RIGHT_ONE
"shoot the left one"
<s> GOOD
<s>
<s> SHOOT_IT
<s> GOOD GOOD
"shoot the right one"
<s> SHOOT_THE_RIGHT_ONE
<s> GOOD
<s>
Aside from the fact that it does so poorly, I am confused
by these results on a deeper level. For example, take the
results of "shoot the right one". It gets this right, but
why are SHOOT_THE_OTHER_ONE and SHOOT_THE_LEFT_ONE not in
the list of hypotheses? To explain my confusion, here
are the relevant entries in the domain.dic file:
SHOOT_THE_LEFT_ONE SH UW T DH AH L EH F T HH W AH N
SHOOT_THE_OTHER_ONE SH UW T DH AH AH DH AXR HH W AH N
SHOOT_THE_RIGHT_ONE SH UW T DH AH R AY T HH W AH N
As you can see, these 3 differ only in the choice of "L EH
F T" (left), "AH DH AXR" (other) and "R AY T" (right).
But how can the recognition engine recognize all of "SH UW
T DH AH L EH F T HH W AH N" without also thinking
"SH UW T DH AH AH DH AXR HH W AH N" and "SH UW
T DH AH R AY T HH W AH N" are high probability
alternatives?
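To make the overlap concrete, here is a minimal Python sketch (not part of Sphinx; the helper names `parse_dic` and `split_shared` are made up for illustration) that takes the three domain.dic entries above and separates the phones they share from the phones that differ. It only compares the dictionary text and says nothing about how the decoder actually scores hypotheses.

```python
# Entries copied from the domain.dic excerpt above.
DIC = """\
SHOOT_THE_LEFT_ONE SH UW T DH AH L EH F T HH W AH N
SHOOT_THE_OTHER_ONE SH UW T DH AH AH DH AXR HH W AH N
SHOOT_THE_RIGHT_ONE SH UW T DH AH R AY T HH W AH N
"""


def parse_dic(text):
    """Map each dictionary word to its list of phones."""
    entries = {}
    for line in text.splitlines():
        parts = line.split()
        if parts:
            entries[parts[0]] = parts[1:]
    return entries


def common_prefix_len(seqs):
    """Count leading positions where every sequence has the same item."""
    n = 0
    for column in zip(*seqs):
        if len(set(column)) != 1:
            break
        n += 1
    return n


def split_shared(entries):
    """Return (shared prefix, per-word differing middle, shared suffix)."""
    seqs = list(entries.values())
    p = common_prefix_len(seqs)
    # Measure the suffix on what is left after the prefix, so the two never overlap.
    rests = [seq[p:] for seq in seqs]
    s = common_prefix_len([rest[::-1] for rest in rests])
    prefix = seqs[0][:p]
    suffix = seqs[0][len(seqs[0]) - s:]
    middles = {word: seq[p:len(seq) - s] for word, seq in entries.items()}
    return prefix, middles, suffix


if __name__ == "__main__":
    prefix, middles, suffix = split_shared(parse_dic(DIC))
    print("shared start:", " ".join(prefix))
    print("shared end:  ", " ".join(suffix))
    for word, phones in middles.items():
        print(word, "->", " ".join(phones))
```

Nine of the twelve or thirteen phones in each entry are shared, which is why I would have expected the other two phrases to show up among the top hypotheses.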
Any insight into the results I am seeing would be much
appreciated.
Thanks.
John
For reference, here is the domain.corpus file I used:
HELP_ME
COME_HERE
GO_LEFT
GO_RIGHT
SHOOT_IT
SHOOT_THE_OTHER_ONE
SHOOT_THE_LEFT_ONE
SHOOT_THE_RIGHT_ONE
RUN_AWAY
GET_OUT_OF_THE_WAY
NO
GOOD
If you are using Sphinx 2.03, the problem is that it doesn't work. I found, after trying to figure out what **I** was doing wrong for a couple of months, that it wasn't me. Although Sphinx 2.02 blows up with a memory leak if you give it too many samples, it **will** work when fed feature files -- maybe even 16 kHz data. It will not work with 8 kHz data -- I haven't found any that do.
It seems that on this site, you get what you pay for? I'm interested in 8 kHz telephone data. Apparently, Sphinx was designed for just the 16 kHz microphone-type data?
I think the problem you were having with the feat files has been corrected -- if that was what you meant by "0.3 doesn't work". Try getting the latest version out of CVS; I just checked in a patch.
Sphinx will do 8K models also, and with some small changes, can handle arbitrary sample rates that you have acoustic models for. There is a sample telephone bandwidth model in http://www.speech.cs.cmu.edu/sphinx/models/hmm/ .
kevin
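Since the sample rate of the audio has to match the acoustic model, a quick sanity check before decoding can rule out a mismatch. This is only a sketch using Python's standard `wave` module, not anything shipped with Sphinx; the function name `sample_rate_matches` and the file name in the usage line are hypothetical, and it assumes the recordings are plain WAV files.

```python
import wave


def sample_rate_matches(wav_path, model_rate_hz):
    """Return True if the WAV file's sample rate equals the model's rate."""
    with wave.open(wav_path, "rb") as wav:
        return wav.getframerate() == model_rate_hz


# Hypothetical usage: check a telephone recording against an 8 kHz model.
# sample_rate_matches("shoot_the_left_one.wav", 8000)
```

If this returns False, the recording would need resampling, or a model trained at the recording's rate, before the recognition results mean anything.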
The reason sphinx 2.03 doesn't work is that someone had introduced a bug into search.c in building the lextree. I've pointed out the bug to Kevin Lenzo. Hopefully the repository will be updated with the fix soon. BTW, sphinx does work with 8KHz telephone data as well. Maybe you're seeing the effects of this bug?
The patch has been applied to search.c. Thanks, rkm; that's going to help quite a bit.
Kevin