TL;DR: I am working on a project that takes in the audio from a podcast and outputs a processed version with bad words "bleeped." I don't care about the accuracy of the rest of the transcript, only that the StreamSpeechRecognizer is highly accurate on this smallish set of words. I'd love some advice on optimizing Sphinx for this purpose.
I'm a new dad who loves to listen to podcasts. I'd love to listen to them aloud with my son as he gets older, but you can't always know if a guest will use words that a 3-year-old ~probably~ shouldn't be hearing too often. I am not, however, a speech analysis expert. I am working on a Python script that downloads the most recent episode of a show I like, converts it to a .wav file, uses pocketsphinx to transcribe it and force-align the words, and then replaces the audio at the timestamps of curse words with a beep sound.
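For context, the last step of that pipeline (beeping out the flagged time ranges) can be sketched roughly as below. This is a minimal illustration, not my actual script: it assumes the word timings have already been converted to seconds and that the .wav file is mono 16-bit PCM, and the names `bleep_samples`, `bleep_wav`, `freq`, and `amplitude` are made up for the example.

```python
import math
import struct
import wave


def bleep_samples(samples, rate, spans, freq=1000.0, amplitude=0.3):
    """Return a copy of mono 16-bit samples with each (start_s, end_s)
    span overwritten by a sine tone. Times are assumed to be in seconds."""
    out = list(samples)
    peak = int(amplitude * 32767)
    for start_s, end_s in spans:
        # Clamp the span to the bounds of the audio.
        first = max(0, int(start_s * rate))
        last = min(len(out), int(end_s * rate))
        for i in range(first, last):
            out[i] = int(peak * math.sin(2 * math.pi * freq * (i - first) / rate))
    return out


def bleep_wav(src_path, dst_path, spans):
    """Read a mono 16-bit WAV, bleep the given spans, and write the result."""
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        if params.nchannels != 1 or params.sampwidth != 2:
            raise ValueError("this sketch only handles mono 16-bit PCM")
        frames = src.readframes(params.nframes)
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    cleaned = bleep_samples(samples, params.framerate, spans)
    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)
        dst.writeframes(struct.pack("<%dh" % len(cleaned), *cleaned))
```

With something like this, a call such as `bleep_wav("episode.wav", "episode_clean.wav", [(12.4, 12.9)])` would cover one flagged word, so the part I actually need help with is getting reliable spans out of the recognizer.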
However, the overall transcription accuracy is fairly low, and it is especially bad for the curse words. I have read through the documentation on CMU Sphinx's website and some of the source code, but I don't really know where to start. I was hoping someone with a little more knowledge of the fundamentals could help me understand which parameters I can tweak to improve this.
I was planning to use PocketSphinx so that I didn't have to touch the Java directly, but if it'll make the project work better, I'm happy to work with Sphinx 4.
I really appreciate any help you can provide!