Relative accuracies of acoustic models.

  • creative64

    creative64 - 2010-06-05

    Hi,

    To get a feel for the relative accuracies of various open source acoustic
    models, we created a Nokia 6303 cellphone-like speech recognition
    application. The total vocabulary was 147 words: 29 regular English words
    and 118 Indian names (and related words). The dictionary was created using
    lmtool, and a "command and control" grammar was described in JSGF. The
    grammar permitted control applications (pure English words/sentences) and
    name dialling (directly speaking the name); a rough sketch of such a
    grammar is shown below, and the numbered observations follow it. Voice
    samples were collected from 4 adults (2 males, 2 females, all with Indian
    accents), where each individual made 98 utterances covering both the pure
    English "command and control" phrases and name dialling. Voice recording
    was done in mono at a sample rate of 16 kHz using good quality headset
    mics (Logitech and Moser Baer) in a relatively quiet environment.
    Pocketsphinx_batch was used.
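    For illustration only, a minimal JSGF grammar combining command-and-control
    phrases with bare-name dialling could look roughly like the sketch below;
    the rule names, command phrases and names are made-up placeholders, not the
    application's actual grammar.

        #JSGF V1.0;
        grammar phone;

        // Top level: either a control phrase or a bare name (name dialling).
        public <top> = <control> | <name>;

        // Control phrases: pure English words/sentences.
        <control> = open messages
                  | show missed calls
                  | call <name>;

        // Names: the real application had 118 Indian names here.
        <name> = rahul | priya | suresh;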

    1. Floating-point and fixed-point accuracies are practically the same.

    2. Accuracies with various acoustic models are as follows:

    a) hub4wsj_sc_8k, the 8 kHz semi-continuous model that comes with pocketsphinx 0.6

    WER (for English words) 16.2%
    WER (overall) 52.1%

    b) 5000 senone, 16 kHz semi-continuous WSJ model taken from
    http://www.speech.cs.cmu.edu/sphinx/models/

    WER (for English words) 24.3%
    WER (overall) 54.1%

    c) 3000 senone, 8 kHz, 32 Gaussian continuous WSJ model taken from
    http://www.speech.cs.cmu.edu/sphinx/models/

    WER (for English words) 29.7%
    WER (overall) 61.8%

    d) wsj1 model that came with pocketsphinx 0.5 (not sure about the sampling
    frequency or whether it is continuous or semi-continuous)

    WER (for English words) 27.0%
    WER (overall) 60.5%

    e) hub4_cd_continuous_8gau_1s_c_d_dd: 6000 senone, 8 kHz, 8 Gaussian continuous
    model taken from
    http://www.speech.cs.cmu.edu/sphinx/models/

    WER (for English words) 20.3%
    WER (overall) 53.2%

    f) wsj_all_cd30.mllt_cd_cont_4000_16k: 4000 senone, 16 kHz, 8 Gaussian
    continuous model taken from
    http://www.speech.cs.cmu.edu/sphinx/models/

    WER (for English words) 30.4%
    WER (overall) 53.2%

    g) voxforge_en_sphinx.cd_cont_3000: 8 kHz continuous model taken from
    http://www.repository.voxforge1.org/downloads/Main/Trunk/AcousticModels/Sphinx/

    With this model almost everything was recognized wrongly, so it was
    dropped.
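    (For reference, WER here is the standard word error rate: substitutions
    plus deletions plus insertions, divided by the number of words in the
    reference transcripts, times 100. As a made-up example, 16.2% WER against
    a 500-word reference means roughly 81 word errors of those three kinds
    combined.)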

    Q01. As per the results, hub4wsj_sc_8k outperforms the other models. Is this
    expected?

    Q02. Continuous models c), e) and f) actually perform worse than the
    semi-continuous model a). Shouldn't it be the other way round?

    Q03. The model taken from voxforge, g), gives almost 100% error. Any idea why?

    Q04. It should be possible to further improve the accuracy of this application
    by manually editing the pronunciations of certain Indian names (TBD). Any idea
    of a ballpark figure that a commercial application like this would target?
    (Interestingly, the Nokia 6303 gives 100% error without training and doesn't
    work so great even after user adaptation training.)

     
  • Nickolay V. Shmyrev

    Q01. As per the results, hub4wsj_sc_8k outperforms the other models. Is this
    expected?

    Yes, it's the latest and the best model. It was trained on the biggest
    database.

    Q02. Continuous models c), e) and f) actually perform worse than the
    semi-continuous model a). Shouldn't it be the other way round?

    Actually, what matters here is the database size and the updated training
    procedure.

    Q03. The model taken from voxforge, g), gives almost 100% error. Any idea why?

    No idea, probably you aren't using it properly. A sample decoding setup could
    help here.
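    For illustration, a batch decoding setup of roughly the shape being asked
    about is sketched below. All paths, file names and option values are
    placeholders, not the poster's actual setup; note also that the input audio
    has to match the model's bandwidth (an 8 kHz model will not decode 16 kHz
    recordings well).

        # Hypothetical pocketsphinx_batch invocation; every name is a placeholder.
        # -hmm:  acoustic model directory
        # -jsgf / -dict: the grammar and pronunciation dictionary
        # -ctl:  list of utterance ids; -cepdir/-cepext: where the audio files live
        # -adcin yes: inputs are audio files rather than precomputed cepstra
        pocketsphinx_batch \
            -hmm /path/to/voxforge_en_sphinx.cd_cont_3000 \
            -jsgf phone.gram \
            -dict phone.dic \
            -ctl utterances.fileids \
            -cepdir audio -cepext .raw \
            -adcin yes \
            -samprate 16000 \
            -hyp decode.hyp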

    Q04. It should be possible to further improve the accuracy of this application
    by manually editing the pronunciations of certain Indian names (TBD). Any idea
    of a ballpark figure that a commercial application like this would target?
    (Interestingly, the Nokia 6303 gives 100% error without training and doesn't
    work so great even after user adaptation training.)

    95% accuracy is easily reachable in this case.

     
  • creative64

    creative64 - 2010-06-07

    Thanks nshmyrev.

    Are even better semi-continuous models expected in any future pocketsphinx
    releases?

    Regards.

     
  • Nickolay V. Shmyrev

    Are even better semi-continuous models expected in any future pocketsphinx
    releases?

    Yes

     
  • creative64

    creative64 - 2010-06-08

    Thanks Nsh.

    Any timelines for those releases?

     
  • Nickolay V. Shmyrev

    We have the schedule on the wiki:

    http://cmusphinx.sourceforge.net/wiki/releaseschedule

    but I don't think anything is definite there.

    If you want better models you probably want to contact me directly.

     
  • creative64

    creative64 - 2010-06-09

    Thanx Nsh.

    For now the available models will do. We'll contact you directly in the
    future should we need the best models or any other specialized help.

    Regards,

     
  • creative64

    creative64 - 2010-06-10

    FYI...

    After manually editing pronunciations for the Indian names and adding
    alternate pronunciations for some of the English words (to match the Indian
    way of pronouncing them), the following WER numbers were obtained (a sketch
    of the kind of dictionary edit involved is at the end of this post):

    a) hub4wsj_sc_8k, the 8 kHz semi-continuous model that comes with pocketsphinx 0.6

    WER (for English words) 10.8%
    WER (overall) 22.75%

    PS: As the sample size is small and the results are particularly bad for one
    of the female speakers, the numbers might actually be better than these.
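    To illustrate the kind of dictionary edit described above (the words and
    phone strings here are made up for the example, not the actual changes), an
    alternate pronunciation is added in the CMU dictionary format by appending
    a (2), (3), ... variant to the word:

        RAHUL        R AA HH UH L
        RAHUL(2)     R AA UH L
        MESSAGES     M EH S AH JH AH Z
        MESSAGES(2)  M EH S EY JH EH S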

     
