
Building a speech recognition system for a library archive

  • Victor He

    Victor He - 2018-06-29

    Hi,

    We are looking to develop a speech recognition system for transcribing audio files from a library archive, in order to generate subtitle files that include keywords such as the names of people or events mentioned in each clip. The files include news recordings, radio podcasts, etc., and the speakers have New Zealand accents. We are really new to the speech recognition business, and were hoping to get some tips on whether we are on the right track for improving the recognizer's accuracy.

    We initially used the default US-English acoustic model, language model, and dictionary, and tried them on a sample audio clip: a 1 minute 43 second news recording. The recording had a variety of speakers and audio qualities: two presenters who enunciated words clearly, and an interviewee who spoke less clearly. Using pocketsphinx_batch and the word_align.pl script, we got a WER of 75.63% when transcribing the whole clip as a single audio file with a one-line reference transcription, and a WER of 67.30% when we split the file into 36 smaller sentence-level clips and transcribed them one by one. Note that the subtitles we used as the reference do not exactly match the words spoken by the interviewee (the presenters' subtitles were accurate), as some of it was paraphrased, so the result is only a rough estimate. We also tried segmented clips of just one of the presenters, totalling about 50 seconds of audio with accurate subtitle transcriptions, and the WER was 8.03%.
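    For reference, the invocations were along these lines (the model paths and file names below are placeholders rather than our exact arguments; test.fileids lists one entry per clip and test.transcription holds the matching reference lines):

        # decode each file listed in test.fileids with the default en-us models
        pocketsphinx_batch \
            -adcin yes \
            -cepdir wav \
            -cepext .wav \
            -ctl test.fileids \
            -lm en-us.lm.bin \
            -dict cmudict-en-us.dict \
            -hmm en-us \
            -hyp test.hyp

        # align the hypotheses against the reference transcription to get WER
        word_align.pl test.transcription test.hyp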

    We tried to gauge the extent of the acoustic model mismatch by constructing a language model just from the sentences in the sample news audio (so that errors from language model mismatch are largely removed). With it, the WER decreased to 34.18% for transcribing the clip as a whole and 26.35% for the segmented sentence clips (which include the partially mismatched subtitles for the interviewee). The segmented clips for the presenter stayed at 8.03% WER.
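    (In case the details matter: a small language model like that can be built with the cmuclmtk tools, roughly as below; sentences.txt stands for the file of reference sentences, one per line, each wrapped in <s> ... </s> markers.)

        # count words and build a closed vocabulary from the reference sentences
        text2wfreq < sentences.txt | wfreq2vocab > sentences.vocab
        # convert the text to id n-grams and estimate an ARPA model over the closed vocabulary
        text2idngram -vocab sentences.vocab -idngram sentences.idngram < sentences.txt
        idngram2lm -vocab_type 0 -idngram sentences.idngram -vocab sentences.vocab -arpa sentences.lm
        # convert to the binary format that pocketsphinx loads faster
        sphinx_lm_convert -i sentences.lm -o sentences.lm.bin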

    We also attempted to adapt the default acoustic model with the roughly 400 New Zealand dialect recordings on VoxForge. When transcribing after this adaptation, we saw improvements in some parts of the transcription, but other areas got worse. The WER for the adapted model came out at 74.68%, which is probably not a noticeable improvement over the default one.
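    (For completeness, a MAP adaptation along the lines of the CMUSphinx adaptation tutorial looks roughly like the sketch below; nz.fileids and nz.transcription stand in for the VoxForge file list and transcripts, and the -ts2cbfn/-svspec settings assume the default en-us PTM model.)

        # extract MFCC features matching the model's feat.params
        sphinx_fe -argfile en-us/feat.params -samprate 16000 \
            -c nz.fileids -di . -do . -ei wav -eo mfc -mswav yes

        # accumulate observation statistics from the adaptation data
        bw -hmmdir en-us -moddeffn en-us/mdef.txt -ts2cbfn .ptm. \
            -feat 1s_c_d_dd -svspec 0-12/13-25/26-38 -cmn current -agc none \
            -dictfn cmudict-en-us.dict -ctlfn nz.fileids \
            -lsnfn nz.transcription -accumdir .

        # MAP-update a copy of the default model with those statistics
        cp -a en-us en-us-adapt
        map_adapt -moddeffn en-us/mdef.txt -ts2cbfn .ptm. \
            -meanfn en-us/means -varfn en-us/variances \
            -mixwfn en-us/mixture_weights -tmatfn en-us/transition_matrices \
            -accumdirs . \
            -mapmeanfn en-us-adapt/means -mapvarfn en-us-adapt/variances \
            -mapmixwfn en-us-adapt/mixture_weights -maptmatfn en-us-adapt/transition_matrices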

    Would the adaptation described here be the correct approach for trying to improve performance? If so, should we just throw more New Zealand speech at it, and how much data would you recommend? It is difficult to find NZ speech data, but we do have plenty at our disposal (albeit the subtitles are poor transcriptions and may not work conveniently for the adaptation process), such as news recordings, radio podcasts, etc.

    Additionally, we think it is probably a good idea to extend/improve the dictionary and language model as well. How would you recommend going about this? The language used in the archive isn't constrained to any specific specialised field, so would you recommend just adding some New Zealand-related texts to extend the default model, or should we build one from scratch? Some of the errors are definitely related to the dictionary not including New Zealand names or other proper nouns.
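    (To illustrate what we mean on the dictionary side: we assume missing words can simply be appended to a copy of the default dictionary in the usual word-plus-phones format, e.g. as below. The pronunciations are our own rough guesses using the en-us phoneset, not verified, and we understand the new words also need to appear in the language model before they can actually be recognised.)

        # copy the default dictionary and append New Zealand words
        # (pronunciations below are rough guesses, not verified)
        cp cmudict-en-us.dict cmudict-en-us-nz.dict
        printf '%s\n' \
            'aotearoa AA OW T EH AH R OW AH' \
            'whanganui W AO NG AH N UW IY' >> cmudict-en-us-nz.dict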

    Finally, we do realise that the testing done here is quite brief and likely not indicative of how suitable the default model is for our use. Should we do some more tests with the default model, to get a more concrete idea of how well it performs against the archive files, before trying to improve the performance? We are also happy to post the audio files used and their transcriptions if that helps.

    Thanks

     
    • Nickolay V. Shmyrev

      We are really new to the speech recognition business, and were hoping to get some tips on whether we are on the right track for improving the recognizer's accuracy.

      Business does not work this way, you know.

      Some tips on whether we are on the right track for improving the recognizer's accuracy.

      I don't think you are on the right track at all; it seems you are doing a random search. You need to use the Kaldi toolkit, and you need to build new models, both acoustic and language.

       
