Hi

I'm looking into developing an iPhone app which will use PocketSphinx to
recognise a set of around 15-20 phrases of between 1 and 3 words each.
The app will need to support a variety of languages (e.g. English, German,
Chinese), but only one language at a time (i.e. the user selects their
language, and voice recognition is performed only in that language). I was
hoping someone could answer some questions:
1) How much recording data would be needed to get reasonable accuracy for
recognition? The documentation on training suggests needing many hours of
recordings, but is this just for recognising a large number of words?
2) Ideally, we'd like to be able to adapt the acoustic model to the user on
the device, so that they can re-record certain commands and use these
recordings to train the model to better suit them if recognition is
inaccurate. Would this be possible, or would they need to record a large
amount of speech? Ideally, we would have them re-record a command once to
improve its accuracy, though if they need to re-record every command in one
go, that is also OK.
3) Following on from that, has anyone managed to run the training tools on an
iOS device? Is it possible to build them for iOS? Or does PocketSphinx have
some built-in method of adaptation that can be used at runtime?
Any help or suggestions would be very much appreciated.
Thanks :)
The documentation on training suggests needing many hours of recordings, but
is this just for recognising a large number of words?
You have no reason to think that our documentation is wrong. If something is
stated there, it's usually true.
Even for 10 words you still need a large amount of data. The acoustic model is
a statistical model, and it needs a large amount of data. You can still use
the existing models we provide for many languages, or help to train models
for new languages.
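
For a small command set like this, the usual way to use one of those existing
models is to constrain the search with a small grammar rather than a large
language model. Below is a minimal sketch of that setup; the model and file
names (model/en-us, commands.dic, commands.gram) are placeholders, not files
shipped with PocketSphinx:

    #include <pocketsphinx.h>

    int main(void)
    {
        /* commands.gram would be a small JSGF grammar listing the phrases,
         * for example:
         *   #JSGF V1.0;
         *   grammar commands;
         *   public <command> = open door | close door | take photo;
         */
        cmd_ln_t *config = cmd_ln_init(NULL, ps_args(), TRUE,
            "-hmm", "model/en-us",    /* existing acoustic model */
            "-dict", "commands.dic",  /* pronunciations for your words */
            "-jsgf", "commands.gram", /* finite grammar of commands */
            NULL);
        if (config == NULL)
            return 1;

        ps_decoder_t *ps = ps_init(config);
        if (ps == NULL)
            return 1;

        /* Audio is then fed through ps_start_utt()/ps_process_raw()/
         * ps_end_utt() and the result read with ps_get_hyp(); the exact
         * signatures of those calls vary between PocketSphinx versions. */

        ps_free(ps);
        return 0;
    }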
Would this be possible, or would they need to record a large amount of
speech?
For adaptation, reasonable improvement starts from 30 seconds of adaptation
audio. But for quick adaptation you either need to use continuous models
(relatively slow) or implement fast adaptation for semi-continuous models (not
implemented yet).
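
Concretely, one hedged way this could work: estimate an MLLR transform offline
from the user's ~30 seconds of audio with the SphinxTrain tools, ship the
resulting file to the device, and have PocketSphinx load it at start-up via
its -mllr option. A sketch, reusing the placeholder names from the example
above (mllr_matrix is a hypothetical file name):

    /* Same setup as before, plus a precomputed per-user MLLR transform.
     * Estimating the transform is an offline SphinxTrain step; PocketSphinx
     * only applies the result at decode time. */
    cmd_ln_t *config = cmd_ln_init(NULL, ps_args(), TRUE,
        "-hmm", "model/en-us",
        "-dict", "commands.dic",
        "-jsgf", "commands.gram",
        "-mllr", "mllr_matrix",   /* per-user adaptation transform */
        NULL);
    ps_decoder_t *ps = ps_init(config);

Note this applies a transform; it does not compute one on the device.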
Following on from that, has anyone managed to run the training tools on an
iOS device?
Yes
Is it possible to build them for iOS?
It's no different. SphinxTrain uses the same configure/make/make install
process as other packages.
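
In other words, the standard autotools sequence; for iOS you would
cross-compile. The host triplet and compiler invocation below are assumptions
that depend on your Xcode/SDK setup, not values documented by SphinxTrain:

    # sketch only: adjust the host triplet and compiler for your toolchain
    ./configure --host=arm-apple-darwin CC="$(xcrun -sdk iphoneos -f clang)"
    make
    make install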
Or does PocketSphinx have some built-in method of adaptation that can be used
at runtime?
No
There is also some R&D on multilingual acoustic models, which means that in
theory you can train a single model for many languages. But for that you need
some data for most of the languages, and you also need an expert in phonetics
to build a common phoneset. Such an initial model can then serve as a starting
point for adaptation.
However, such a model would also require extensive R&D and source code
modifications.
Thanks for the help. Do you know what the license is for the acoustic models
provided here (https://sourceforge.net/projects/cmusphinx/files/Acoustic%20and%20Language%20Models/)?
I downloaded a couple but didn't see any license in there.
Also, is there a way to reduce the size of a model? I downloaded the Mandarin
one, and it was around 200 MB, which is going to be too big to run on the
device.
Most of the models have the same license as CMUSphinx.
Also, is there a way to reduce the size of a model? I downloaded the Mandarin
one, and it was around 200 MB, which is going to be too big to run on the
device.
The Mandarin Broadcast model is 20 MB, not 200. If you are looking for a
smaller size, say 5 MB, you need to train another model. Anything less than
5 MB is not practical for speaker-independent large-vocabulary speech
recognition.
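
For what it's worth, model size at training time is driven mainly by the
model type and the number of tied states (senones), so a model trained for a
small command set can come out far smaller than a broadcast model. A hedged
sketch of the relevant sphinx_train.cfg lines; the values are illustrative
guesses for a small task, not recommended settings:

    # sphinx_train.cfg (excerpt; values are illustrative only)
    $CFG_HMM_TYPE      = '.semi.';  # semi-continuous models are smaller/faster
    $CFG_N_TIED_STATES = 200;       # few senones for a ~20-phrase task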