Greetings! I am attempting to write an application that will perform
transcription of various videos. Since the videos could come from anywhere, I
am using the HUB4 model that pocketsphinx seems to use by default, and an
unrestricted grammar. Just to get started, I am using the
pocketsphinx_continuous app with the -infile option on the latest builds to get
an idea of what accuracy I can achieve. With fairly high-quality audio, in
16 kHz mono PCM, I am getting an accuracy of less than 60%.
This leaves me with a few questions:
- What type of accuracy can I expect in this scenario?
- How can I improve the accuracy (assuming an unrestricted grammar and an untrained model)?
- Is pocketsphinx or sphinx4 better suited for this use case?
- Is there a model that may be better suited for my use case?
Any help would be appreciated. Thanks!
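For reference, the invocation described above is roughly the following; the
file names are placeholders, and -infile expects audio matching the model's
format (here 16 kHz, 16-bit, mono):

```shell
# Decode a 16 kHz, 16-bit, mono WAV file with the default models
# and write the hypothesis transcript to a file.
pocketsphinx_continuous -infile video_audio.wav > hypothesis.txt
```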
Decoding is always restricted somehow; for example, there is a language model
or a grammar. To learn more about the way CMUSphinx decoders work, please read
the tutorial:
http://cmusphinx.sourceforge.net/wiki/tutorial
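As an illustration of a grammar-restricted search (the grammar name and words
here are made up for the example), a minimal JSGF grammar looks like:

```
#JSGF V1.0;

grammar commands;

public <command> = (play | pause | stop) [the video];
```

With a grammar like this the decoder can only ever output one of the listed
phrases, which is why "unrestricted" decoding really means decoding against a
large-vocabulary language model instead.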
The default language model used by pocketsphinx is not very good
(hub4.5000.DMP); it has just 5000 words. It makes sense to build your own
language model from existing video transcripts, for example from closed
captions.
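Language-model toolkits generally want a plain-text corpus, one utterance per
line, so the first step is stripping the caption formatting. A minimal sketch
for SubRip (.srt) captions, assuming an English, Latin-alphabet transcript
(the sample text is made up):

```python
import re

def srt_to_corpus(srt_text):
    """Strip SubRip (.srt) cue numbers, timestamps, and markup, leaving
    one lowercase line per caption -- the plain-text corpus format that
    LM toolkits typically expect."""
    lines = []
    for line in srt_text.splitlines():
        line = line.strip()
        if not line or line.isdigit():
            continue                        # blank separator or cue index
        if re.match(r"\d{2}:\d{2}:\d{2},\d{3} --> ", line):
            continue                        # timestamp line
        line = re.sub(r"<[^>]+>", "", line)             # drop <i>, <b>, ... tags
        line = re.sub(r"[^a-z' ]", " ", line.lower())   # keep letters/apostrophes
        line = re.sub(r"\s+", " ", line).strip()
        if line:
            lines.append(line)
    return "\n".join(lines)

sample = """1
00:00:01,000 --> 00:00:03,000
<i>Hello, world!</i>

2
00:00:03,500 --> 00:00:05,000
This is a caption."""
corpus = srt_to_corpus(sample)
```

The resulting text file can then be fed to an LM toolkit to produce an n-gram
model for the decoder.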
The default acoustic model of pocketsphinx is quite good; it makes sense to
try it instead of HUB4.
What type of accuracy can I expect in this scenario?
Accuracy depends on many factors, including, for example, the decoder used to
extract the audio track. Overall, with the default models, 60% is the expected
accuracy. Further improvements require you to adapt the models and to
implement custom components.
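Accuracy figures like the 60% above are conventionally reported as one minus
the word error rate (WER): the word-level edit distance between the reference
transcript and the hypothesis, divided by the number of reference words. A
minimal sketch (the example sentences are made up):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference words,
    computed as a word-level Levenshtein distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Classic dynamic-programming edit-distance table over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# The hypothesis drops one word out of six: WER = 1/6, i.e. ~83% accuracy.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```

Scoring each component change with the same metric on the same test set is
what makes the component-by-component improvements below comparable.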
How can I improve the accuracy (assuming an unrestricted grammar and an untrained model)?
I suggest you set up a prototype and then improve components one by one. Try
to build a better language model, adapt the acoustic model, and implement
improvements. For example, video decoding often requires a specialized music
filtering component.
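As a very rough starting point for such a filter (this is a generic heuristic,
not a CMUSphinx component): per-frame RMS energy and zero-crossing rate are
two cheap features sometimes used to flag steady, tonal, music-like segments
before passing the rest to the recognizer. A sketch on a synthetic signal:

```python
import math

def frame_features(samples, rate=16000, frame_ms=25):
    """Split a mono signal into fixed-size frames and compute RMS energy
    and zero-crossing rate (ZCR) per frame."""
    n = int(rate * frame_ms / 1000)
    feats = []
    for start in range(0, len(samples) - n + 1, n):
        frame = samples[start:start + n]
        rms = math.sqrt(sum(x * x for x in frame) / n)
        zcr = sum(1 for a, b in zip(frame, frame[1:])
                  if (a < 0) != (b < 0)) / (n - 1)
        feats.append((rms, zcr))
    return feats

# Synthetic demo: one second of a pure 440 Hz tone. A steady tone shows
# near-constant energy and a low, stable ZCR across frames, unlike speech.
rate = 16000
tone = [math.sin(2 * math.pi * 440 * t / rate) for t in range(rate)]
feats = frame_features(tone, rate)
```

A real music detector would need far more than this, but it illustrates the
kind of front-end component the pipeline can grow.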
Is pocketsphinx or sphinx4 better suited for this use case?
For server-based applications it is better to use sphinx4. For more details
see:
http://cmusphinx.sourceforge.net/wiki/versions