Hi,
In order to get a feel for the relative accuracies of various open-source acoustic
models, we created a speech recognition application resembling that of a Nokia 6303
cellphone. The total vocabulary was 147 words: 29 regular English words and 118
Indian names (and related) words. The dictionary was created using lmtool, and a
"command and control" grammar was described in JSGF. The grammar permitted control
applications (pure English words/sentences) and name dialling (directly speaking
the name). Voice samples were collected from 4 adults (2 males, 2 females, all
with Indian accents), where each individual made 98 utterances covering both the
pure-English "command and control" phrases and name dialling. Voice recording was
done in mono at a sample rate of 16 kHz using good-quality headset mics (Logitech
and Moserbaer) in a relatively quiet environment. Pocketsphinx_batch was used.
Following are the observations.
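For readers unfamiliar with JSGF, the grammar described above might look roughly like the following sketch. The rule names, control phrases, and names here are all illustrative placeholders, not the grammar actually used:

```
#JSGF V1.0;
grammar phone;

// Top-level rule: either a control phrase or name dialling.
public <command> = <control> | <dial>;

// "Command and control" phrases (pure English).
<control> = open contacts | show missed calls | volume up | volume down;

// Name dialling: the name is spoken directly, with an optional "call".
<dial> = [call] <name>;
<name> = anand | lakshmi | rajesh;
```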
Floating-point and fixed-point accuracies are practically the same.
Accuracies with various acoustic models are as follows:
a) hub4wsj_sc_8k: 8 kHz semicontinuous model that comes with pocketsphinx6
   WER (for English words): 16.2%
   WER (overall): 52.1%
b) 5000-senone, 16 kHz semicontinuous WSJ model taken from
   http://www.speech.cs.cmu.edu/sphinx/models/
   WER (for English words): 24.3%
   WER (overall): 54.1%
c) 3000-senone, 8 kHz, 32-Gaussian continuous WSJ model taken from
   http://www.speech.cs.cmu.edu/sphinx/models/
   WER (for English words): 29.7%
   WER (overall): 61.8%
d) wsj1 model that came with pocketsphinx5 (not sure about the sampling
   frequency and whether it is continuous/semicontinuous)
   WER (for English words): 27.0%
   WER (overall): 60.5%
e) hub4_cd_continuous_8gau_1s_c_d_dd: 6000-senone, 8 kHz, 8-Gaussian continuous
   model taken from
   http://www.speech.cs.cmu.edu/sphinx/models/
   WER (for English words): 20.3%
   WER (overall): 53.2%
f) wsj_all_cd30.mllt_cd_cont_4000_16k: 4000-senone, 16 kHz, 8-Gaussian
   continuous model taken from
   http://www.speech.cs.cmu.edu/sphinx/models/
   WER (for English words): 30.4%
   WER (overall): 53.2%
g) voxforge_en_sphinx.cd_cont_3000: 8 kHz continuous model taken from
   http://www.repository.voxforge1.org/downloads/Main/Trunk/AcousticModels/Sphinx/
   With this model almost everything was recognized wrongly, so it was dropped.
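For reference, the WER figures above follow the standard definition: word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words. A minimal sketch of that computation (not the scoring tool actually used for these numbers):

```python
def wer(ref, hyp):
    """Word error rate between a reference and hypothesis transcription,
    via the standard Levenshtein dynamic program over words."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between the first i ref words and first j hyp words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i  # deleting i reference words
    for j in range(len(h) + 1):
        d[0][j] = j  # inserting j hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)
```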
Q01. As per the results, hub4wsj_sc_8k is outperforming the other models. Is
this expected?
Q02. Continuous models c), e) and f) are actually performing worse than
semicontinuous model a). Shouldn't it be the other way round?
Q03. The model taken from VoxForge, g), gives almost 100% error. Any idea why?
Q04. It should be possible to further improve the accuracy of this application
by manually editing the pronunciations of certain Indian names (TBD). Any idea
of a ballpark figure that a commercial application like this would target?
(Interestingly, the Nokia 6303 gives 100% error without training and doesn't
work so great even after user adaptation training.)
A01. Yes, it's the latest and the best model; it was trained on the biggest
database.
A02. What actually matters here is the database size and the updated training
procedure.
A03. No idea; you are probably not using it properly. A sample decoding setup
could help here.
A04. 95% accuracy is easily reachable in this case.
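For completeness, a batch decoding setup of the kind referred to in Q03 is typically along these lines. File names and paths here are hypothetical; the flags are standard pocketsphinx_batch options:

```
pocketsphinx_batch \
    -adcin yes \
    -cepdir wav -cepext .wav \
    -ctl test.fileids \
    -dict phone.dic \
    -jsgf phone.gram \
    -hmm hub4wsj_sc_8k \
    -hyp test.hyp
```

The resulting test.hyp can then be scored against the reference transcription to obtain WER, for example with sphinxtrain's word_align.pl script.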
Thanks nshmyrev.
Are even better semi-continuous models expected in any future pocketsphinx
releases?
Regards.
Yes
Thanks Nsh.
Any timelines for those releases?
We have the schedule on the wiki:
http://cmusphinx.sourceforge.net/wiki/releaseschedule
but I don't think anything is definite there.
If you want better models you probably want to contact me directly.
Thanks Nsh.
For now the available models will do. I'll contact you directly in the future
should we need the best of the models or any other specialized help.
Regards,
FYI...
After manually editing the pronunciations of the Indian names and adding
alternate pronunciations for some of the English words (to match the Indian way
of pronouncing them), the following WER numbers are obtained:
a) hub4wsj_sc_8k: 8 kHz semicontinuous model that comes with pocketsphinx6.
   WER (for English words): 10.8%
   WER (overall): 22.75%
PS: The sample size is small and the results are particularly bad for one of
the female speakers, so the numbers might actually be better than these.
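The pronunciation edits mentioned above are plain-text changes to the Sphinx dictionary; an alternate pronunciation is added as a second entry with a (2) suffix on the word. A hypothetical example in CMUdict/ARPAbet style (the words and phone sequences are illustrative, not the actual edits made):

```
ANAND        AH N AA N D
ANAND(2)     AA N AH N D
SCHEDULE     S K EH JH UH L
SCHEDULE(2)  SH EH D Y UW L
```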