I am trying to build my own language model from some transcripts. I was wondering whether I need to include the fillers or remove them while building it. For example, which of these is right:
" Um I have a point to make um that I am the pilot "
or
" I have a point to make that I am the pilot "
Any help is appreciated; I look forward to your reply.
There are various views on this problem. On the one hand, leaving the fillers in can reduce language model quality; on the other hand, fillers can often be predicted by the language model too, so it can be worth keeping them. You can read related publications on the subject:
http://www.academia.edu/15462889/Handling_Disfluencies_in_Spontaneous_Language_Models
https://www.sri.com/sites/default/files/publications/enriching_speech_recognition_with_automatic.pdf
I would say that if your biggest LM source (the one contributing most to perplexity) has disfluencies, keep them; otherwise remove them.
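If you want to test both options empirically, a minimal sketch for producing a filler-free copy of the training text might look like the following. It assumes plain-text transcripts with one utterance per line; the file names and the filler list are placeholders you would adapt to your data. You can then train one LM on each version and compare perplexity on the same held-out text.

# Hypothetical filler list -- adjust to the fillers that actually occur
# in your transcripts (um, uh, er, ...).
FILLERS = {"um", "uh", "er", "hmm"}

def strip_fillers(line):
    """Drop filler tokens from one transcript line."""
    return " ".join(t for t in line.split() if t.lower() not in FILLERS)

# Write a filler-free copy alongside the original, so you can train one LM
# on each version and compare their perplexity on held-out data.
with open("transcripts.txt") as src, open("transcripts_nofillers.txt", "w") as dst:
    for line in src:
        dst.write(strip_fillers(line.strip()) + "\n")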
Another question along the same lines: when building the language model, does it matter if I have very long sentences (like 200 words on one line), or is it better to break them up into smaller sentences and then train the language model? I look forward to your reply, and thanks for the earlier answer.
It is better to break them up, and it is also important that those breaks match the actual breaks people make when speaking. It is not a trivial task, though.
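As a rough illustration only (not the recommended approach, since fixed-size chunks will not line up with real spoken pauses), a fallback sketch that simply caps line length might look like this. The word cap and file names are assumptions; prefer splitting on pause or punctuation marks if your transcripts contain them.

MAX_WORDS = 30  # hypothetical cap; ideally break where speakers actually pause

def split_line(line, max_words=MAX_WORDS):
    """Naively split an overly long transcript line into shorter chunks."""
    words = line.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

with open("transcripts.txt") as src, open("transcripts_split.txt", "w") as dst:
    for line in src:
        for chunk in split_line(line.strip()):
            if chunk:
                dst.write(chunk + "\n")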
Thanks Nickolay. I will do that and check the results. Thanks for the help :)