Hi All,
We are a non-profit organisation and we chose sphinx4 to automate our transcription process. It took us 8 months to finally get decoding working in Sphinx. The acoustic model has been trained on one person's voice, using around 40 hours of audio data. We have also built a trigram LM and a pronunciation dictionary. The accuracy on the test data when decoding is 82.2%.
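For context, a Kneser-Ney trigram LM of the kind we use can be built with something along these lines (a sketch only, assuming SRILM's ngram-count; corpus.txt and trigram.lm are placeholder names):

    # Interpolated trigram LM with modified Kneser-Ney discounting (SRILM)
    ngram-count -order 3 -interpolate -kndiscount \
        -text corpus.txt -lm trigram.lm

sphinx4 can then load the resulting model, either in ARPA form or converted to a binary format, depending on the language model class configured.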
We initially started this work with the hope that we could completely automate the manual transcription work that we do. We also assumed that decoding a single person's voice would give us more accurate results with (speaker-independent) Sphinx.

Our manual transcriber finds correcting the 18% of errors more tedious than transcribing afresh.
We have used forced alignment and 32 Gaussians per state during training, and we have experimented with Kneser-Ney discounting coefficients in the language model. We have also experimented with the language weight and arrived at a suitable value.
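For context, the language weight we tuned is the one exposed in the standard sphinx4 XML configuration, roughly like the fragment below (the value and names here are illustrative, not our actual settings):

    <!-- global property picked up by the linguist -->
    <property name="languageWeight" value="10.5"/>

    <component name="lexTreeLinguist"
               type="edu.cmu.sphinx.linguist.lextree.LexTreeLinguist">
        <property name="languageWeight" value="${languageWeight}"/>
        <!-- remaining linguist properties omitted -->
    </component>

A higher language weight makes the decoder trust the language model more relative to the acoustic scores, so it needs re-tuning whenever the LM or acoustic model changes.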
We want to reach a point where the correction of errors is minimal; an accuracy of around 95% would be very helpful. We are now running out of ideas about what the best thing to do next is.
We want to know what kind of customizations we can make to achieve the desired accuracy. We can check in / share our findings with the group. I just love what this group of people has done, and it would be a great thing if our organisation could use Sphinx in its everyday work :)
Thanks,
Dharani
Dear Dharani, obviously it's possible to improve accuracy, but to do that we need to have access to the data you have; otherwise it's basically senseless to guess. Some of your decisions, like 32 Gaussians for a single speaker's voice, look strange, but without looking at the actual data and code it's hard to help you.
Also, you could have saved 6 of your 8 months by sharing this data when the project was started. I hope that will be a lesson for you in the future :)