Hi, dear all,
I trained a monophone (CI stage) ASR system at the syllable level. The basic unit is the syllable rather than the phoneme, and with only about 0.15 hours of data there is too little to train a triphone system. The performance is OK, about 6.7% WER. But the detected boundaries in the output seem to have a weird problem: they are shifted ahead of the audio.
Please look at the spectrograms of three samples and their corresponding detections. The example is shown in the picture in the next post. You can see that the first syllable "PA" always leads the audio. Starting from the second syllable "TA", the boundaries seem to fit the audio track nicely. The detected boundaries seem to be shifted by one segment, leading the audio track.
Could anyone tell me why I see this phenomenon? Is it because I only have a CI system rather than a CD system?
Thank you.
Last edit: tfpeach 2016-02-18