Anonymous - 2001-08-10
I have been playing with Sphinx2 and trying to get it to do wordspotting. Given an arbitrary utterance, I want it to detect
only the words that are in the dictionary and throw away everything else. My first attempt was to create a "noise"
dictionary in which a single $GARBAGE word has every phone as an alternate pronunciation:
$GARBAGE AA
$GARBAGE(2) AE
$GARBAGE(3) AH
$GARBAGE(4) AO
...
$GARBAGE(43) ZH
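A listing like the one above does not have to be typed by hand. The following is only a minimal sketch in Python of how it could be generated; the function name `garbage_dictionary` and the short `example_phones` list are made up for illustration, and only the four phones quoted above are used because the full phone set is elided in the post.

```python
def garbage_dictionary(phones, word="$GARBAGE"):
    """Return dictionary lines giving `word` one single-phone pronunciation per phone."""
    lines = []
    for index, phone in enumerate(phones, start=1):
        # The first pronunciation has no suffix; alternates are numbered (2), (3), ...
        label = word if index == 1 else f"{word}({index})"
        lines.append(f"{label} {phone}")
    return lines


# Hypothetical input: only the phones quoted in the post, not the full phone set.
example_phones = ["AA", "AE", "AH", "AO"]
noise_lines = garbage_dictionary(example_phones)
print("\n".join(noise_lines))
```

Passing the complete phone list of the acoustic model in place of `example_phones` would produce the rest of the entries in the same format.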
Then I create the "official" dictionary containing the list of words I want to look for (I made a recording from CNBC):
COFFEE K AA F IY
FINANCE F AX N AE N S
FINANCE(2) F AY N AE N S
FLOWERS F L AW AXR Z
MCCAIN M AX K EY N
REFORM R AX F AO R M
SENATE S EH N AX T
CAMPAIGN K AE M P EY N
Finally, I created a simple language-model. This is one part I did wrong, but a snippet from it looks like this:
-0.9999 </s> 0.0000
-0.9999 <s> 0.0000
-9.9999 COFFEE 0.0000
-9.9999 FINANCE 0.0000
-9.9999 FLOWERS 0.0000
...
-9.9999 CAMPAIGN 0.0000
I took my 60 seconds of audio and sliced it into 10-second pieces. Eventually, I'll use the methods in 'cont_ad_base.c' to
segment the audio at silences, but I'm just being lazy right now. I send each piece to the engine using
'uttproc_begin_utt()', 'uttproc_rawdata()', and 'uttproc_end_utt()' and, at first, used 'uttproc_result()' to display
the results.
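The fixed-length slicing step can be sketched roughly as follows. This is Python rather than the C used by the engine, it does not call Sphinx2 at all, and the 16 kHz sampling rate and the name `slice_samples` are assumptions, since the post does not give them.

```python
def slice_samples(samples, sample_rate, piece_seconds):
    """Split a sequence of samples into consecutive pieces of piece_seconds each."""
    piece_len = sample_rate * piece_seconds
    if piece_len <= 0:
        raise ValueError("piece length must be positive")
    # The last piece may be shorter if the audio is not an exact multiple.
    return [samples[start:start + piece_len]
            for start in range(0, len(samples), piece_len)]


SAMPLE_RATE = 16000                   # assumed; the post does not state the rate
audio = [0] * (60 * SAMPLE_RATE)      # stand-in for 60 seconds of recorded samples
pieces = slice_samples(audio, SAMPLE_RATE, 10)
print(len(pieces), len(pieces[0]))
```

Each element of `pieces` corresponds to one utterance that would be handed to the engine between the begin and end calls named above.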
The results were very accurate, but unfortunately the engine freezes up. I suspect that it freezes because the
$GARBAGE word makes the lattice nearly impossible to evaluate. I got around the freeze-up by making a new
'uttproc_result_wordspot()' in uttproc.c that doesn't call 'uttproc_windup()'. Instead, it calls 'search_finish_fwd()'
directly. One of the remaining function calls in 'uttproc_windup()' (again, I was too lazy to figure out which) is the one that
causes the freeze, so bypassing it lets the engine run without freezing (yay!).
The other problem is that it's really slow. Slower than real time, in fact: it takes roughly 15 seconds to evaluate 10
seconds of data. I suspect, again, that this is because my $GARBAGE word makes it really hard to make a good guess. I
suspect there might be a way to make a new phoneme that has pronunciations for everything, but I'm not sure how.
So, I thought I'd share my results and ask these questions:
- Although this works, is this just a silly way to be trying wordspotting?
- Is there a way to avoid the freezing that is different than the way I did it?
- Is there a way to make it faster?
- Sphinx3 was the engine that did the broadcast news transcription, wasn't it? Is it available for download?
Thanks,
Bob
I have received a lot of e-mail lately about my post here and the feasibility of using Sphinx2 as a wordspotting engine. The technique that I described here, although it works, is incredibly slow, and I'm not convinced I was doing it "right". I have also experimented a little with the functions used for recognizing phonemes instead of whole words to try to speed up the process. For example, the utterance "go forward ten meters" is translated to:
SIL SILe DD G OW F W UH ER M DD T AE N UW DX EY D ER SH SILe
We can create a table of phones that are good replacements for other phones and assign an error value to each replacement that has to be made. Then we can calculate the total percent error for an utterance to see whether we have a match. I wonder, though: is phoneme recognition still bounded by the language model you are using, or can you now say "anything"?
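One way the replacement-table idea could be sketched is as a weighted edit distance between a keyword's pronunciation and any stretch of the recognized phone string. This is not something Sphinx2 provides. In the Python below, the table values, the cost of 1.0 for a missing or extra phone, and the pronunciations for GO, TEN and the function names are all assumptions made for illustration.

```python
def substitution_cost(expected, heard_phone, table):
    """Cost of hearing heard_phone where expected was wanted."""
    if expected == heard_phone:
        return 0.0
    return table.get((expected, heard_phone), 1.0)  # unlisted swaps cost a full error


def best_match_error(pronunciation, heard, table):
    """Lowest percent error of pronunciation against any stretch of heard."""
    # previous[j]: cheapest cost of aligning the phones consumed so far
    # with a stretch of `heard` that ends at position j.
    previous = [0.0] * (len(heard) + 1)  # a match may start anywhere for free
    for expected in pronunciation:
        current = [previous[0] + 1.0]
        for j, phone in enumerate(heard, start=1):
            current.append(min(
                previous[j - 1] + substitution_cost(expected, phone, table),
                previous[j] + 1.0,      # expected phone was not heard at all
                current[j - 1] + 1.0,   # an extra phone was heard
            ))
        previous = current
    return 100.0 * min(previous) / len(pronunciation)


heard = "SIL SILe DD G OW F W UH ER M DD T AE N UW DX EY D ER SH SILe".split()
table = {("EH", "AE"): 0.25}            # hypothetical error value
print(best_match_error(["G", "OW"], heard, table))
print(best_match_error(["T", "EH", "N"], heard, table))
print(best_match_error(["K", "AA", "F", "IY"], heard, table))
```

A keyword would then count as spotted when its percent error falls under some threshold, which would have to be tuned on real recordings.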
** I would like to formally "open up" this topic for conversation, rather than discussing it privately with a few individuals, in case anyone is interested in sharing how they have attempted to get Sphinx to perform this important type of operation. In the end, nobody I know has it working very well at all...
Lenzo, I would be particularly interested in any feedback that you could give, since you seem to be the local authority on the inner workings of Sphinx.
Thanks,
Bob.
When you say wordspotting, I assume you're referring to picking out a word or a group of words from the output, and that this is not some speech lingo I'm not aware of. If that is the case, then what I did for the game I'm making with Sphinx was to create a speech parser that simply reads in a name.dat and a command.dat file. These are simple text files with the words I wanted it to recognize. Since Sphinx outputs a single string, I simply tokenized the string and compared the tokens against the words I wanted to detect. If this sounds clunky and slow, it actually works out well, since I have cont.c running in one thread and use a mutex to pass this single string to the parser in another. So while you're gathering new speech input, it is sorting the last sentence, and so on. It was actually quite fast in my case, but I haven't tried it on more than 100 words so far.
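The two-thread arrangement could look roughly like this in Python. It is only a sketch: a `queue.Queue` stands in for the mutex-protected string, the recognizer is simulated by two fixed strings, and the word set and function names are hypothetical stand-ins for the contents of name.dat and the real parser.

```python
import queue
import threading


def spot_words(hypothesis, wanted):
    """Return the tokens of hypothesis that appear in the wanted set, in order."""
    return [token for token in hypothesis.upper().split() if token in wanted]


def parser_worker(inbox, wanted, results):
    """Consume hypothesis strings until a None sentinel arrives."""
    while True:
        hypothesis = inbox.get()
        if hypothesis is None:
            break
        results.append(spot_words(hypothesis, wanted))


wanted_words = {"HORNBLOWER", "JONES"}   # hypothetical stand-in for name.dat
inbox = queue.Queue()                    # thread-safe hand-off between the threads
results = []
worker = threading.Thread(target=parser_worker, args=(inbox, wanted_words, results))
worker.start()
for text in ["blah hornblower blah", "jones the the jones"]:
    inbox.put(text)                      # the recognizer thread would do this per utterance
inbox.put(None)                          # tell the parser thread to stop
worker.join()
print(results)
```

Because the parser works from the queue, the recognizer side never waits for the previous sentence to finish being sorted.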
Just a quick thought: since you're looking for words that will only be found in your speech dictionary, simply read the text file you use to make the LM and dic files into your arrays, tokenize the speech string output against these arrays, and return the result in whatever form you want (the detected words, I assume). I have it coming back as numbers: when hornblower is detected it's -1, jones is -2, etc., and an order gets a simple code, like "set the main sail" being 23. So I know that when the speech input is "blah blah blah hornblower set the main sail the the the blah blah", the speech parser outputs -1,23, and that is passed to my 3D engine.
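The numeric coding described above might be sketched like this in Python. The codes are the ones quoted in the post, but the function name `encode` and the longest-phrase-first matching rule are assumptions, not a description of the actual parser.

```python
def encode(hypothesis, codes):
    """Translate known words and phrases in hypothesis into their numeric codes."""
    tokens = hypothesis.lower().split()
    phrases = {tuple(phrase.lower().split()): code for phrase, code in codes.items()}
    longest = max((len(phrase) for phrase in phrases), default=0)
    output = []
    position = 0
    while position < len(tokens):
        # Try the longest phrase starting here first so multi-word orders win.
        for length in range(min(longest, len(tokens) - position), 0, -1):
            candidate = tuple(tokens[position:position + length])
            if candidate in phrases:
                output.append(phrases[candidate])
                position += length
                break
        else:
            position += 1  # unknown word: skip it
    return output


codes = {"hornblower": -1, "jones": -2, "set the main sail": 23}
spoken = "blah blah blah hornblower set the main sail the the the blah blah"
print(encode(spoken, codes))  # prints [-1, 23]
```

Names and orders share one lookup here, so adding a new order only means adding another entry to `codes`.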