Hello,
I'm using Sphinx 4 in a live application with a vocabulary of about 35 commands in a grammar, from a dictionary with 37 words. The configuration is based on the HelloDigits demo, but with the WSJ acoustic model. For testing I'd like to be able to create some audio files of in-grammar speech, and then run the recognizer on them so I can tweak the configuration and see how it affects performance in a rigorous, repeatable way.
The best way I could think of to do this was to create a version of the configuration file that was identical except for a StreamDataSource in the pipeline instead of a Microphone. I looked at the wavfile demo's config file for some guidance on this. The initial result with several different short audio files was either a blank result or <unk>.
I noticed that the wavfile demo config has no SpeechClassifier, SpeechMarker or NonSpeechDataFilter in its pipeline, while my original live config did, so I tried it both with and without those elements. I turned on "debug" on the SpeechClassifier to see what was being tagged as speech, and saw that almost every frame was classified as speech, even with a file that had about 10 seconds of silence followed by a single short utterance. So I've been playing with the "adjustment" parameter on the SpeechClassifier to get it to adapt more quickly to the ambient sound level in the file, and that seems to be helping somewhat.
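In case it's relevant, here is a minimal sketch of how I'm pushing a file through the recognizer, modeled on the wavfile demo. The config file path and the component names ("recognizer", "streamDataSource") are just what my test config happens to use, so treat them as placeholders:

import java.io.File;
import java.net.URL;

import javax.sound.sampled.AudioInputStream;
import javax.sound.sampled.AudioSystem;

import edu.cmu.sphinx.frontend.util.StreamDataSource;
import edu.cmu.sphinx.recognizer.Recognizer;
import edu.cmu.sphinx.result.Result;
import edu.cmu.sphinx.util.props.ConfigurationManager;

public class FileDecodeTest {

    public static void main(String[] args) throws Exception {
        // "file-config.xml" is a placeholder for the file-based copy of the live config
        // (identical except for a StreamDataSource in place of the Microphone).
        URL configUrl = new File("file-config.xml").toURI().toURL();
        ConfigurationManager cm = new ConfigurationManager(configUrl);

        // Component names must match whatever the config file defines.
        Recognizer recognizer = (Recognizer) cm.lookup("recognizer");
        StreamDataSource reader = (StreamDataSource) cm.lookup("streamDataSource");

        recognizer.allocate();

        // The audio file has to match the front end's expected format
        // (e.g. 16 kHz, 16-bit, mono for the WSJ model).
        File wav = new File(args[0]);
        AudioInputStream ais = AudioSystem.getAudioInputStream(wav);
        reader.setInputStream(ais, wav.getName());

        // recognize() returns one Result per utterance and null at end of stream.
        Result result;
        while ((result = recognizer.recognize()) != null) {
            System.out.println(wav.getName() + " -> " + result.getBestFinalResultNoFiller());
        }

        recognizer.deallocate();
    }
}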
Sorry for the length of this, but I have three questions:
-Is there a better way to run a live configuration with recorded data?
-Is there some reason not to use the SpeechClassifier and associated elements when decoding from files?
-How much silence should I include in the file before the speech starts, to let the levels adjust? The files in the wavfile demo didn't seem to have much.
Thanks,
-Jay