I have some recorded voice in WAV format
(over a telephone line, English, many speakers; vocabulary: "IT, computer", but it might be unlimited).
I just want to extract some words from the WAV files so I can do a kind of approximate keyword search (text search) over the audio files. Is this possible with Sphinx? Accuracy is not so important; 40-50% of words correctly recognized would be OK.
If yes, could anyone provide me some simple example config file or code?
Thanks in advance for any suggestion.
Phuong Nguyen
I implemented a "word spotter" in Sphinx, but it proved feasible only with a small acoustic model. I am told that research in the late 90s showed that doing full-blown recognition to convert speech to text and then doing normal text searching on the result proved more effective than word spotting.
I won't have time to give you really good examples, but let me tell you what you'll need:
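For the "recognize, then text-search" route, the approximate keyword search itself is straightforward. Here is a minimal Python sketch (nothing Sphinx-specific; the similarity cutoff and the sample transcript are made up for illustration), using difflib for crude string-level fuzzy matching:

```python
import difflib

def fuzzy_keyword_hits(transcript, keywords, cutoff=0.6):
    """Return keywords that approximately match some word in the transcript.

    Uses difflib's similarity ratio as a crude score; a real system would
    likely do better with a phonetic distance. `cutoff` is the minimum
    similarity (0..1) for a word to count as a hit.
    """
    words = transcript.lower().split()
    hits = {}
    for kw in keywords:
        matches = difflib.get_close_matches(kw.lower(), words, n=3, cutoff=cutoff)
        if matches:
            hits[kw] = matches
    return hits

# Recognizer output with typical errors ("curnel" for "kernel",
# "firewall" split into two words):
text = "please reboot the curnel after the fire wall update"
print(fuzzy_keyword_hits(text, ["kernel", "firewall", "router"]))
```

A phonetic distance over the recognizer's phone output would likely beat plain string similarity, but this shows the idea, and 40-50% word accuracy is enough for this kind of lookup to return useful hits.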
(1) You will probably need to use an 8 kHz acoustic model (since you mentioned it is telephone data). DO NOT UPSAMPLE the WAVs to 16 kHz to use the 16 kHz models. There are several posts about which parameters to change to use the 8 kHz WSJ model.
(2) You will need a good n-gram model. The pre-built WSJ and HUB4 models are probably inadequate, though it certainly wouldn't hurt to start with those to see what kind of accuracy you get. Training your own n-gram model means you need lots of text (preferably at least 1 million words) in your domain (IT, computer). The SRI-LM Toolkit and the CMU-Cambridge Language Modeling Toolkit are adequate. I believe there are a few posts on the matter.
(3) You may need to manually add some new words to the dictionary along with their corresponding pronunciations; I believe there were some recent posts about that as well.
(4) You will have to set up the speech recognizer. I do not currently have time to give an example config file. However, the Transcriber and WavFile demos are an excellent start for the code you would need to run. You would also need a front end similar to the one found in those config files (changing the sample rate, of course). The HelloNGram demo is an excellent example of what is involved in running an n-gram recognizer. You would essentially need to change the acoustic model, the language model, and possibly the dictionary.
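For step (1), the relevant knobs live in the front-end section of the Sphinx-4 XML configuration. This fragment is only a hedged sketch: the class name and property keys follow the demo config.xml files, and the telephone-bandwidth values (some setups use 130-3700 Hz instead) should be checked against the actual 8 kHz model you use:

```
<!-- Mel filter bank tuned for 8 kHz telephone speech (illustrative values). -->
<component name="melFilterBank"
           type="edu.cmu.sphinx.frontend.frequencywarp.MelFrequencyFilterBank">
    <!-- Telephone audio carries roughly 300-3400 Hz; stay inside that band. -->
    <property name="minimumFrequency" value="200"/>
    <property name="maximumFrequency" value="3500"/>
    <property name="numberFilters" value="31"/>
</component>
```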
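For step (3), dictionary entries are one word per line followed by its phone sequence, with `(2)`, `(3)`, ... marking alternate pronunciations. The entries below are illustrative ARPAbet guesses only; verify each phone against the phone set of the acoustic model you use:

```
FIREWALL     F AY ER W AO L
LINUX        L IH N AH K S
ROUTER       R AW T ER
ROUTER(2)    R UW T ER
```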
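For step (4), the pieces to swap in a HelloNGram-style config.xml are the language model, the dictionary, and (via its own section) the acoustic model. Again, only a hedged sketch: the class names and property keys follow the Sphinx-4 demos and may differ in your version, and every file path here is hypothetical:

```
<!-- Domain language model (ARPA format) produced in step (2). -->
<component name="ngramModel"
           type="edu.cmu.sphinx.linguist.language.ngram.SimpleNGramModel">
    <property name="location" value="file:models/it-domain.arpa"/>
    <property name="dictionary" value="dictionary"/>
    <property name="maxDepth" value="3"/>
</component>

<!-- Dictionary extended with the new words from step (3). -->
<component name="dictionary"
           type="edu.cmu.sphinx.linguist.dictionary.FastDictionary">
    <property name="dictionaryPath" value="file:models/it-domain.dic"/>
    <property name="fillerPath" value="file:models/fillerdict"/>
</component>
```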
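For step (2), the toolkits named above do the real work, but the core of n-gram training is just counting. Here is a toy Python sketch of maximum-likelihood bigram estimation (the corpus is invented; real toolkits also apply smoothing and back-off so unseen word pairs don't get zero probability):

```python
from collections import Counter

def bigram_mle(sentences):
    """Maximum-likelihood bigram probabilities P(w2 | w1) from a toy corpus."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        tokens = ["<s>"] + s.lower().split() + ["</s>"]
        unigrams.update(tokens[:-1])          # count each bigram context
        bigrams.update(zip(tokens, tokens[1:]))
    return {pair: count / unigrams[pair[0]] for pair, count in bigrams.items()}

corpus = ["the server is down", "the server is up", "reboot the server"]
probs = bigram_mle(corpus)
print(probs[("the", "server")])   # "the server" follows every "the" -> 1.0
```

With a real IT-domain corpus you would feed the toolkit the raw text and get back an ARPA-format model file that Sphinx can load directly.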
> I am told that research in the late 90s showed
> that doing full-blown recognition to convert
> speech to text and then doing normal text
> searching on the result proved more effective than
> word spotting.
Who told you that? I would be eager to be pointed to some papers on this result. Any ideas?
Best regards,
Holger
Both keyword spotting and speech recognition are principled approaches. As for which one is better in a given situation, there is actually no definitive answer.
If we are talking about speech for a dialogue system, the dialogue may contain only the keywords you want to pick up, so keyword spotting could be a contender.
If we are talking about a dictation-type task where the keyword-per-utterance density is high, then intuitively SR with a trigram could be better.
Another common approach is to use SR and post-process the output with a robust parser. It is very similar to a keyword spotter, but it allows a grammar to be specified.
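To make the "SR plus robust parser" idea concrete, here is a toy Python sketch. The grammar is just a couple of regular-expression rules (the rule names and patterns are invented for illustration) applied to noisy recognizer output, skipping over the filler words in between:

```python
import re

# A toy "robust parse": scan recognizer output for a small grammar of
# command patterns, ignoring everything between the pieces we care about.
RULES = {
    "restart": re.compile(r"\b(restart|reboot)\b.*?\b(server|router|service)\b"),
    "status":  re.compile(r"\b(check|show)\b.*?\bstatus\b"),
}

def robust_parse(transcript):
    """Return the names of all grammar rules that match the transcript."""
    hyp = transcript.lower()
    return [name for name, pattern in RULES.items() if pattern.search(hyp)]

print(robust_parse("uh please reboot the mail server thanks"))  # ['restart']
```

Unlike a plain keyword spotter, the rules can require keywords in a particular order and distinguish, say, a restart command from a status query.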
When you use the three in real life, their results can be very similar. So we should probably stay open on this issue rather than jump to a quick conclusion.
-a
Thanks for the useful information.
I will try!
Best regards,
Nguyen