Hi,
If I want to explicitly detect filler words, would I have to train my own model? My assumption is that when the CMU-provided acoustic models were trained, filler words were put in the noise dictionary.
Alternatively, what would be a good approach to detecting filled-pause disfluencies with Sphinx? My initial approach was to add them to the dictionary in PocketSphinx (tested on Android, but I can use a PC too) and do keyword spotting. I wasn't happy with the accuracy there because of both false positives and recognition errors. I have also not done any adaptation yet, since I didn't know whether it would be redundant or whether just saying a bunch of filler words for adaptation would be a good idea. Most of the research papers trained on multiple features such as prosody and lexical position, so I am not sure whether using Sphinx for this will be as effective.
Note: by filled pauses I mean words such as "umm" and "uhh", which vary a fair bit from person to person.
Last edit: Paritosh Gupta 2016-05-09
Filler words like "um" and "uh" do not even require a specific model; they are composed of the same phones as ordinary words. Disfluencies are also not only filler words; they include half-word restarts, repeats and so on. A longer-than-usual silence is also a disfluency.
Keyword spotting is not the right approach to disfluency detection; it requires keywords of 3-4 syllables. Instead, you need to decode with a large-vocabulary recognizer and try to detect disfluencies from the recognizer result.
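To make "detect something from the recognizer result" a bit more concrete, here is a minimal sketch in Python. It assumes you already have the decoder's word segments as (word, start, end) tuples with times in seconds; how you get them out of the recognizer is not shown. The names Segment, FILLERS, find_disfluencies and the 0.7 second pause threshold are all hypothetical choices for illustration, not part of any Sphinx API. It only covers the simple rule-based cases mentioned above (filler words, immediate repeats, long silences), not the prosodic or lexical features used in the research papers.

```python
from typing import List, NamedTuple, Tuple


class Segment(NamedTuple):
    """One recognized word with its time span (hypothetical structure)."""
    word: str
    start: float  # seconds
    end: float    # seconds


# Assumed filler spellings; they must match whatever is in your dictionary.
FILLERS = {"um", "umm", "uh", "uhh"}


def find_disfluencies(
    segments: List[Segment], max_pause: float = 0.7
) -> List[Tuple[str, str, float]]:
    """Return (kind, word, time) events found in a decoded word sequence.

    kind is one of "filled_pause", "repeat" or "long_pause".
    max_pause is an assumed threshold and should be tuned per speaker.
    """
    events = []
    prev = None
    for seg in segments:
        word = seg.word.lower()
        if prev is not None:
            # A gap between words longer than the threshold counts as a pause.
            if seg.start - prev.end > max_pause:
                events.append(("long_pause", "", prev.end))
            # The same non-filler word twice in a row counts as a repeat.
            if word == prev.word.lower() and word not in FILLERS:
                events.append(("repeat", seg.word, seg.start))
        if word in FILLERS:
            events.append(("filled_pause", seg.word, seg.start))
        prev = seg
    return events


if __name__ == "__main__":
    # Made-up decoder output for "i um i i think".
    demo = [
        Segment("i", 0.0, 0.2),
        Segment("um", 0.3, 0.6),
        Segment("i", 0.7, 0.8),
        Segment("i", 0.85, 0.95),
        Segment("think", 2.0, 2.4),
    ]
    for event in find_disfluencies(demo):
        print(event)
```

Half-word restarts are not handled here, since a large-vocabulary recognizer usually maps them to some other word and spotting them needs more than this kind of rule.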
Before jumping into code, you can do a research overview; papers like
http://www.eecs.berkeley.edu/~gdurrett/papers/ferguson-durrett-klein-naacl2015.pdf
http://languagelog.ldc.upenn.edu/myl/naacl2013-vlsp-disfluency.pdf
should help you to get some ideas.