Acoustic Model for multilingual command app

Anonymous
2012-07-10
2012-09-22
  • Anonymous

    Anonymous - 2012-07-10

    Hi

    I'm looking into development of an iPhone app which will use PocketSphinx to
    recognise a set of around 15 - 20 phrases, with between 1 and 3 words each.
    The app will need to support a variety of languages (e.g. English, German,
    Chinese), but only one language at a time (i.e. the user selects their
    language, and voice recognition is performed only in that language). I was
    hoping someone could answer some questions:

    1) How much recording data would be needed to get a reasonable accuracy for
    recognition? The documentation on training suggests needing many hours of
    recordings, but is this just for recognising a large number of words?

    2) Ideally, we'd like to be able to adapt the acoustic model for the user on
    the device, so that they can re-record certain commands and use these
    recordings to train the model to better suit them if the recognition was
    inaccurate. Would this be possible, or would they need to record a large
    amount of speech? Ideally, we would have them re-record a command once to
    improve its accuracy, though if they need to re-record every command in one
    go, that is also OK.

    3) Following on from that, has anyone managed to run the training tools on an
    iOS device? Is it possible to build them for iOS? Or does PocketSphinx have
    some method of adapting built in that can be used at runtime?

    Any help or suggestions would be very much appreciated.

    Thanks :)

     
  • Nickolay V. Shmyrev

    The documentation on training suggests needing many hours of recordings, but
    is this just for recognising a large number of words?

    You have no reason to think that our documentation is wrong. If something is
    stated there it's usually true.

    Even for 10 words you still need a large amount of data. The acoustic model is
    a statistical model, and it needs a large amount of data. You can still use
    the existing models that we provide for many languages, or help to train
    models for new languages.

    Would this be possible, or would they need to record a large amount of
    speech? Ideally

    For adaptation, reasonable improvement starts from 30 seconds of adaptation
    audio. But for quick adaptation you either need to use continuous models
    (relatively slow) or implement fast adaptation for semi-continuous models (not
    implemented yet).

    Following on from that, has anyone managed to run the training tools on an
    iOS device?

    Yes

    Is it possible to build them for iOS?

    It's no different. Sphinxtrain uses the same configure/make/make install
    process as other packages.

    Or does PocketSphinx have some method of adapting built in that can be used
    at runtime?

    No
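
As a rough illustration of the 30-second adaptation figure quoted above, the
following minimal sketch totals the length of a user's recorded commands and
checks it against that threshold. It is not part of PocketSphinx or
Sphinxtrain: the function names, the constant, and the assumption that the
recordings are WAV files in a single directory are all hypothetical. It only
measures audio duration and does not perform any adaptation.

```python
import wave
from pathlib import Path

# Threshold taken from the reply above; the constant name is hypothetical.
MIN_ADAPTATION_SECONDS = 30.0


def wav_duration_seconds(path):
    """Return the length of one WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def total_duration_seconds(directory):
    """Sum the lengths of all *.wav files directly inside `directory`."""
    return sum(wav_duration_seconds(p) for p in sorted(Path(directory).glob("*.wav")))


def has_enough_adaptation_audio(directory, minimum=MIN_ADAPTATION_SECONDS):
    """True if the recordings reach the minimum total duration."""
    return total_duration_seconds(directory) >= minimum
```

With 15 to 20 commands of one to three words each, a single pass through the
command list may well fall short of 30 seconds, so a check like this could
tell the app whether to ask the user for another round of recordings.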

     
  • Nickolay V. Shmyrev

    There is also some R&D on multilingual acoustic models, which means that in
    theory you can train a single model for many languages. But for that you need
    some data for most of the languages, and you also need an expert in phonetics
    to build a common phoneset. Such an initial model can then serve as an
    adaptation starting point.

    However, such a model will require extensive R&D and source code
    modifications too.

     
  • Anonymous

    Anonymous - 2012-07-11

    Thanks for the help. Do you know what the license is for the acoustic models
    provided here
    (https://sourceforge.net/projects/cmusphinx/files/Acoustic%20and%20Language%20Models/)?
    I downloaded a couple but didn't see any license in there.

    Also, is there a way to reduce the size of a model? I downloaded the Mandarin
    one, and it was around 200 MB, which is going to be too big to run on the
    device.

     
  • Nickolay V. Shmyrev

    Thanks for the help. Do you know what the license is for the acoustic models
    provided here
    (https://sourceforge.net/projects/cmusphinx/files/Acoustic%20and%20Language%20Models/)?
    I downloaded a couple but didn't see any license in there.

    Most of the models have the same license as CMUSphinx.

    Also, is there a way to reduce the size of a model? I downloaded the
    Mandarin one, and it was around 200 MB, which is going to be too big to run
    on the device.

    The Mandarin Broadcast model is 20 MB, not 200. If you are looking for a
    smaller size, say 5 MB, you need to train another model. Anything less than
    5 MB is not practical for speaker-independent large-vocabulary speech
    recognition.

     
