I noticed that the filler dictionary in the English PTM from February doesn't contain the full set of filler "words"
[BREATH] [COUGH] [NOISE] [SMACK] [UH] [UM]
It contains only [NOISE] and [SPEECH].
I also noticed that some of the speech fillers falsely trigger recognition of words the recognizer is listening for. I was wondering:
1. what is the significance of [NOISE] and [SPEECH]?
2. when doing adaptation, what should be used to transcribe fillers (e.g. breath, cough, lip smack, etc.)? I'm guessing it should be [NOISE] for all of those, and [SPEECH] for background speech, correct?
Right, the fillers were merged to reduce model complexity.
All non-speech noises like breath, cough, and lip smack should be transcribed as [NOISE]. All speech fillers like uh and um should be transcribed as [SPEECH].
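If your existing adaptation transcripts already use the older, finer-grained tags, something like the following could collapse them before adaptation. This is only a rough sketch: the mapping follows the reply above, but the helper name `normalize_fillers`, the bracketed upper-case tag spelling, and the sample transcript line are assumptions for illustration, so check the tags against the filler dictionary shipped with your model.

```python
import re

# Mapping based on the reply above. The exact tag spelling is an assumption;
# verify it against the filler dictionary of the model you are adapting.
FILLER_MAP = {
    "[BREATH]": "[NOISE]",
    "[COUGH]": "[NOISE]",
    "[SMACK]": "[NOISE]",
    "[NOISE]": "[NOISE]",
    "[UH]": "[SPEECH]",
    "[UM]": "[SPEECH]",
    "[SPEECH]": "[SPEECH]",
}

# Matches bracketed tags made of letters only, e.g. [BREATH] or [um].
TAG_RE = re.compile(r"\[[A-Za-z]+\]")


def normalize_fillers(line):
    """Replace fine-grained filler tags with [NOISE] or [SPEECH].

    Tags that are not in FILLER_MAP are left untouched so they can be
    reviewed by hand.
    """
    def replace_tag(match):
        tag = match.group(0)
        return FILLER_MAP.get(tag.upper(), tag)

    return TAG_RE.sub(replace_tag, line)


if __name__ == "__main__":
    # Hypothetical transcript line, for illustration only.
    print(normalize_fillers("<s> [BREATH] hello [UM] world [SMACK] </s> (utt_001)"))
```

Unknown tags are deliberately passed through unchanged rather than guessed at, since a tag missing from the filler dictionary would need a manual decision anyway.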