Sphinx Post processing tool language model and accuracy

Forum: Help

Creator: Maathangi Sankar

Created: 2017-12-06

Updated: 2017-12-07

Maathangi Sankar - 2017-12-06

Hi!
I'm interested in adding appropriate punctuation to transcribed ASR output for English. I used the post-processing framework that is part of Sphinx, with the Gutenberg language model.
https://cmusphinx.github.io/2012/08/postprocessing-framework/
Wondering if there's an update to this language model or this branch that I can use for better results?
When I tried this on a passage from the Gutenberg text corpus, it appears that after some initial phrases, commas get added in between every word. Any idea why this might be happening, or pointers to what I can do to improve the accuracy here?

Any help regarding this would be super awesome!
Thank you very much!

Nickolay V. Shmyrev - 2017-12-06

https://github.com/ottokart/punctuator2 is much more accurate.
  

Maathangi Sankar - 2017-12-07

Thank you very much!!
