I am wondering if there is anything new with respect to word spotting and Sphinx. Following a few posts about word spotting last year, there were comments that Sphinx 4 supported word spotting; however, the Sphinx FAQ states that Sphinx does not support word spotting.
Nuance's VoCon 3000 embedded solution lets you define word spotting in the grammar files for specific words. Is Sphinx any closer to a simple implementation of word spotting yet? It is really critical for my application.
I need to feed the decoder arbitrary chunks of audio and have it look for and find specific words, as defined in the grammar file, in real time. I also need results that show the location of the recognized words within the entire audio sample sent to the decoder.
Thanks in advance...
Hello
>however, the Sphinx FAQ states that sphinx does not support word spotting.
The FAQ is correct; word spotting is not supported.
> Is Sphinx any closer to a simple implementation of word spotting yet?
If you are still interested, we could discuss the implementation details.
Nickolay,
Thanks for the reply. Yes I am interested in implementation details as I need word spotting for my application.
I have read "Rejecting Out-of-Grammar Utterances" http://www.speech.cs.cmu.edu/sphinx/twiki/bin/view/Sphinx4/RejectionHandling
which uses a CI Phone Loop implementation with FlatLinguist.
There is also a discussion on using anti-word models found on page 454 of "Spoken Language Processing" (Huang, Acero, Hon).
I am uncertain whether there are other proposed word spotting solutions and whether some work already exists or is in progress for Sphinx 4. I would hate to reinvent the wheel.
My goal here is to word spot for multiple words defined in a grammar file and be able to find them anywhere within a supplied audio sample, which may or may not be just a clip of a complete utterance.
Example:
Where the audio is (complete utterance):
"American three twenty one heavy taxi via alpha bravo delta hold short of runway one niner left"
the grammar (although incomplete here) contains:
<...> <aircraft_callsign> <...> ||
<...> taxi <...> || <...> taxi via <...> || <...> hold short <...> || <...> runway <...> ||
<...> alpha <...> ....... <...> zulu <...> ||
<...> one <...> ....... <...> niner <...>
where <...> word <...> denotes a keyword search that is not dependent on the length of the audio fed to the decoder. The audio might be the entire transmission as stated above, or it might be truncated/trimmed and fed to the decoder in stages (in order to narrow the options), as shown here:
This example would be for aircraft call sign recognition only:
Audio transmission clipped to:
"American three twenty one heavy taxi vi"
and the grammar in this case specifically looks for
<...> <call_sign> <...>
anywhere in the audio sample.
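To make the pattern concrete, the keyword slots might be sketched as a JSGF grammar along these lines. This is a sketch only: the <filler> rule and its word list are hypothetical placeholders, since JSGF itself cannot express an open-vocabulary garbage model; that is exactly the piece a CI phone loop (or similar filler model) would have to supply.

```
#JSGF V1.0;

grammar atc_spot;

// Hypothetical filler standing in for the <...> slots. A real word
// spotter would back this with a CI phone loop or garbage model,
// not an explicit word list like this one.
<filler> = ( heavy | taxi | via | alpha | bravo | delta )* ;

<call_sign> = american | delta | united ;

public <spot> = <filler> <call_sign> <filler> ;
```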
Results need to supply the word(s) found, the location (in samples) within the audio sample where the word(s) occur, and a confidence score.
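For the sample-offset part of that requirement, here is a minimal pure-Java sketch (not Sphinx API; the 16 kHz sample rate and 10 ms frame shift are assumptions about the front end) that converts decoder frame indices, which is what recognizers typically report, into sample positions in the original audio buffer:

```java
public class FrameToSample {
    // Assumed front-end settings: 16 kHz audio, 10 ms frame shift.
    static final int SAMPLE_RATE = 16000;
    static final int FRAME_SHIFT_MS = 10;

    /** Convert a decoder frame index to the first sample of that frame. */
    static long frameToSample(int frame) {
        return (long) frame * SAMPLE_RATE * FRAME_SHIFT_MS / 1000;
    }

    public static void main(String[] args) {
        // A word reported at frames [120, 163] maps to this sample range:
        System.out.println(frameToSample(120) + ".." + frameToSample(163));
        // prints "19200..26080"
    }
}
```

With these settings each frame covers 160 samples, so word boundaries are only accurate to the frame shift; finer positions would require looking inside the frame.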
Obviously Air Traffic Control recognition is challenging enough in itself; however, I have already had great success with real-time ATC recognition using Nuance's VoCon 3000 embedded platform.
My interest here is whether I can modify and ultimately use Sphinx 4 to achieve my needs.
BTW, are you at CMU? I am an alumnus of CMU (Master's Software Engineering Program) and also taught distance core courses for the CMU MSE program. I also know Alex R. and Rich S. in the speech department, and I was a good friend of Jim Tomayko.
I am planning a visit to campus in the near future.
I can also be reached off line at 'winters@matrixhci.com'
Thanks
> Results need to supply the word(s) found, the location (in samples) within the audio sample where the word(s) occur, and a confidence score.
Thanks, I've got the idea, but I need some time to look at the range of existing algorithms. To be honest, I don't really trust the CI loop; it never really worked for me. The variant with an n-best full-text transcription (probably not very accurate) followed by a lattice search looks more appealing to me.
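The n-best idea can be sketched in miniature like this: decode once, take the n best hypotheses as word strings, and score a keyword by the fraction of hypotheses it appears in. This is a toy, pure-Java illustration with made-up hypothesis strings; a real lattice search would use per-word acoustic and language scores rather than a simple vote.

```java
import java.util.*;

public class NBestKeywordSearch {
    /**
     * Return the keywords that appear in any n-best hypothesis,
     * scored by the fraction of hypotheses containing them.
     */
    static Map<String, Double> spot(List<String> nbest, Set<String> keywords) {
        Map<String, Double> hits = new LinkedHashMap<>();
        for (String kw : keywords) {
            long count = nbest.stream()
                .filter(h -> Arrays.asList(h.split(" ")).contains(kw))
                .count();
            if (count > 0) hits.put(kw, (double) count / nbest.size());
        }
        return hits;
    }

    public static void main(String[] args) {
        // Hypothetical 3-best output for one ATC transmission.
        List<String> nbest = Arrays.asList(
            "american three twenty one heavy taxi via alpha",
            "american three twenty heavy taxi the alpha",
            "americans three twenty one heavy taxi via alpha");
        Map<String, Double> hits =
            spot(nbest, new HashSet<>(Arrays.asList("taxi", "runway")));
        System.out.println(hits);  // prints "{taxi=1.0}"
    }
}
```

The vote across hypotheses is a crude stand-in for a posterior; the appeal of the lattice variant is that the same counting can be done over lattice arcs with real scores and time marks attached.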
> BTW, are you at CMU? I am an alumnus of CMU (Master's Software Engineering Program) and also taught distance core courses for the CMU MSE program. I also know Alex R. and Rich S. in the speech department, and I was a good friend of Jim Tomayko.
> I am planning a visit to campus in the near future.
Unfortunately not. But please ping the CMU staff to make them code something as well :)