Hi,
I've followed the tutorial for Adapting an Acoustic Model. It seems to be OK, but I have 3 questions:
1) I am planning to support different languages in an app on Android/iOS. As my dictionary won't be more than 50 words, I was wondering whether it is a good idea to adapt the en-us model instead of creating a whole new acoustic model for each language?
2) If I choose the adaptation solution, can I add some new phonemes? (For example, the sound "w" in "win" doesn't exist in German, so can I add this phone/phoneme?) Actually, I don't really understand this sentence:
"Cross-language adaptation also make sense, for example you can adapt English model to sounds of other language by creating a phoneset map and creating other language dictionary with English phoneset." (http://cmusphinx.sourceforge.net/wiki/tutorialadapt)
For example, the en-us acoustic model has its dictionary written in ArpaBet. The French dictionary (also available on SourceForge) uses a totally different phonetic notation. Does that mean I have to add all the words spoken in the new audio files (which would be in French) to the en-us dictionary using the ArpaBet notation?
3) In the tutorial for Adapting an Acoustic Model, it is written: "You can delete the files en-us-adapt/mixture_weights and en-us-adapt/mdef.txt to save space if you like, because they are not used by the decoder." I downloaded the demo version for Android and the en-us-ptm model is around 7 MB. Can I delete the mdef file (around 3 MB) in every acoustic model we download?
Thanks :)
Last edit: Paul Rolin 2015-07-03
Instead of asking random questions, you could just describe your technical requirements. What is your application about? What is the vocabulary size in each language? Is the vocabulary predefined and fixed, or will there be new words? Do you have any data for adaptation, and what is its size? How much memory do you plan to use? What accuracy do you need? This will enable us to give you more informed answers.
There are multiple solutions here, from stripping the acoustic model down to the words you need, to downloading the model for the required language from a server at runtime. A third possible solution would be to train a joint model for all languages with a merged dictionary and a merged phoneset.
1) I am planning to support different languages in an app on Android/iOS. As my dictionary won't be more than 50 words, I was wondering whether it is a good idea to adapt the en-us model instead of creating a whole new acoustic model for each language?
It very much depends on the amount of data you have for training/adaptation. If you want to build a serious application, the data is critical for good accuracy; the memory size can be optimized. Cross-language adaptation requires a significant amount of data and it is not always the best solution. A few megabytes saved are not worth an accuracy drop, which would be far more irritating to the users.
Also, our models are not equally good; for example, the French model is very weak. You have to take that into account.
2) If I choose the adaptation solution, can I add some new phonemes? (For example, the sound "w" in "win" doesn't exist in German, so can I add this phone/phoneme?)
No, you cannot add phonemes during adaptation.
Does that mean I have to add all the words spoken in the new audio files (which would be in French) to the en-us dictionary using the ArpaBet notation?
Yes, you will have to do that.
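To give a concrete (purely illustrative) picture of that mapping step, here is a small Java sketch. The phone correspondences and the dictionary entry below are invented for the example; the real French-to-ArpaBet map has to be worked out phone by phone, and some French sounds have no close ArpaBet equivalent at all.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: rewrite dictionary entries from a French phoneset
// into the en-us (ArpaBet) phoneset using a hand-made phone map.
public class PhonesetMapSketch {
    public static void main(String[] args) {
        // Hypothetical correspondences -- NOT a validated French-to-ArpaBet map.
        Map<String, String> frToArpabet = new HashMap<>();
        frToArpabet.put("aa", "AA");
        frToArpabet.put("uu", "UW");
        frToArpabet.put("rr", "R");
        frToArpabet.put("bb", "B");

        // Made-up French dictionary line: word followed by its phones.
        String frenchEntry = "exemple bb aa rr uu";

        String[] parts = frenchEntry.split("\\s+");
        StringBuilder mapped = new StringBuilder(parts[0]);
        for (int i = 1; i < parts.length; i++) {
            // Keep the original phone if no mapping is defined for it.
            mapped.append(' ').append(frToArpabet.getOrDefault(parts[i], parts[i]));
        }
        System.out.println(mapped); // prints: exemple B AA R UW
    }
}
```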
Cross-language adaptation is helpful in very specific cases, for example if you want to bootstrap a model for a new language in an unsupervised way; in many other cases it does not work very well. There are multiple publications about models shared across languages. If you are interested in this topic you can review them; there are whole PhD theses:
http://mi.eng.cam.ac.uk/~fd257/publications/PhD_FrankDiehl.pdf
or a more recent publication
http://www.cstr.ed.ac.uk/downloads/publications/2013/Ghoshal_ICASSP2013.pdf
and the references therein.
3) In the tutorial for Adapting an Acoustic Model, it is written: "You can delete the files en-us-adapt/mixture_weights and en-us-adapt/mdef.txt to save space if you like, because they are not used by the decoder." I downloaded the demo version for Android and the en-us-ptm model is around 7 MB. Can I delete the mdef file (around 3 MB) in every acoustic model we download?
You cannot delete mdef altogether; you need to keep the mdef file converted to binary form. You can remove the bigger text mdef file. In en-us the mdef file is 3 MB.
It is possible to compress the mdef file into a more compact form, which we are planning to do one day, but that would require significant modification of the pocketsphinx algorithms.
Sorry for my late answer, I was away on vacation. I read the first thesis and it is very interesting, though a bit complicated, I think, for the use I have in mind.
To be more precise about my issue:
For now, we are making children's apps for both iOS and Android with mini-games in them. We would like to implement a speech recognition module so that some mini-games can be played by voice. (I know that children's voice recognition is not as good as adults'.) The words to be recognized are therefore limited (no more than 20) for each app, but I would like a module that could be reused in other apps I make later. So there may be new words (but normally no more than 60-70 words in total). Also, the apps we make are in 7 languages (English, French, Spanish, Dutch, German, Italian, Portuguese), so I would like to implement voice recognition for all those languages.
The language is chosen at the beginning of the game, so there is no problem of cross-lingual recognition.
The main issue is that the app size is already around 200+ MB, so I would like to avoid a voice recognition module that adds 100 MB just for one mini-game (and so just to recognize one or two words...).
Edit: according to the VoxForge acoustic models, here are the sizes I found:
EN: 6.6 MB
SP: 4.4 MB
DE: 8.8 MB
NL: 11.4 MB
FR: 45.1 MB
So we get:
EN + DE + SP = 19.7 MB
+ NL = 31.1 MB
+ FR = 76.1 MB
Am I right? (I took the size of the model parameters file.)
That's why I thought of using one, two or three acoustic models to cover the 7 languages. Indeed, the EN, SP and DE acoustic models seem to be good, low-footprint models. But I don't know how to get good accuracy for the other languages. (I tried using the EN model to recognize French - I'm French - and indeed I have to speak like an English person speaking French to be recognized well.)
Maybe a solution could be to wait for the French model and only implement voice recognition for the other languages for now. But I'll still have a problem with Portuguese and Italian, so I have to find a solution.
I also have 3 new questions:
1) What about Kaldi? Would it be a better program for my needs? (Though I don't see any mobile port of Kaldi.)
2) Do you know when the French acoustic model will be available? In 1 month, 6 months, more?
3) Can we change the acoustic model within the app? For example, I would like a button to choose which acoustic model and which dictionary I want, and then a "menu" button to be able to go back and change the acoustic model. I have no idea how to do this. Do you have an idea?
Thanks a lot for your answers,
Paul
Will read the 2nd thesis :)
Last edit: Paul Rolin 2015-07-15
The main issue is that the app size is already around 200+ MB, so I would like to avoid a voice recognition module that adds 100 MB just for one mini-game (and so just to recognize one or two words...).
This is your minor issue. The biggest issue is recognizing children's speech, which is a very hard problem. It requires recordings of children for all the languages you need, and more work on the decoder itself to handle highly unstable children's speech.
You will have to perform data collection, since none of the publicly available models supports children's speech. The only available database of children's speech is the CMU Kids database, which you can download; however, it is not trivial to train a model on this database.
If you train your own model, you will be able to make it any size you like, for example 1-2 MB per language.
You can also download the models at runtime.
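As a rough sketch of the runtime-download option in plain Java (usable from Android), assuming you host zipped models on your own server; the URL is a placeholder, and error handling and background threading are left out:

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.net.URL;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Rough sketch: fetch a zipped acoustic model from your own server and unpack it
// into the app's files directory, so the app only carries the languages it needs.
public class ModelDownloader {
    public static void downloadAndUnpack(String zipUrl, File targetDir) throws IOException {
        targetDir.mkdirs();
        try (ZipInputStream zis = new ZipInputStream(new URL(zipUrl).openStream())) {
            ZipEntry entry;
            byte[] buf = new byte[8192];
            while ((entry = zis.getNextEntry()) != null) {
                File out = new File(targetDir, entry.getName());
                if (entry.isDirectory()) {
                    out.mkdirs();
                    continue;
                }
                out.getParentFile().mkdirs();
                try (FileOutputStream fos = new FileOutputStream(out)) {
                    int n;
                    while ((n = zis.read(buf)) > 0) {
                        fos.write(buf, 0, n);
                    }
                }
            }
        }
    }
}
```

That way the app ships with no bundled model at all and only fetches the language the user actually selects.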
What about Kaldi? Would it be a better program for my needs? (Though I don't see any mobile port of Kaldi.)
The biggest problem is the data, not the software.
Do you know when the French acoustic model will be available? In 1 month, 6 months, more?
The French model will be available in 1 month, but it will not support children's speech.
Can we change the acoustic model within the app? For example, I would like a button to choose which acoustic model and which dictionary I want, and then a "menu" button to be able to go back and change the acoustic model. I have no idea how to do this. Do you have an idea?
Yes, sure, you can restart the decoder at any time with a new model. You can download the selected model from the web, as Google does for speech recognition.
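With pocketsphinx-android, the switch could look roughly like the sketch below, assuming the models have already been unpacked into modelsDir; the directory name, dictionary name and keyphrase here are made-up placeholders, not files we ship:

```java
import java.io.File;
import java.io.IOException;

import edu.cmu.pocketsphinx.RecognitionListener;
import edu.cmu.pocketsphinx.SpeechRecognizer;
import edu.cmu.pocketsphinx.SpeechRecognizerSetup;

// Sketch: shut down the current recognizer and start a new one with another
// acoustic model and dictionary when the user picks a different language.
public class LanguageSwitcher {
    private SpeechRecognizer recognizer;

    public void switchLanguage(File modelsDir, String modelDirName, String dictFileName,
                               RecognitionListener listener) throws IOException {
        if (recognizer != null) {
            recognizer.cancel();
            recognizer.shutdown();  // release the previous decoder
        }
        recognizer = SpeechRecognizerSetup.defaultSetup()
                .setAcousticModel(new File(modelsDir, modelDirName))  // e.g. "de-ptm" (placeholder)
                .setDictionary(new File(modelsDir, dictFileName))     // e.g. "de.dict" (placeholder)
                .getRecognizer();
        recognizer.addListener(listener);
        // Keyword-spotting search over the game word(s); "game_words" is just a search name.
        recognizer.addKeyphraseSearch("game_words", "bravo");
        recognizer.startListening("game_words");
    }
}
```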
Hi,
Thank you for your answer.
According to what you say, it seems that recognizing children's speech is very different from recognizing adult speech.
And children's speech databases hardly exist (or are very expensive, except for the CMU Kids one). What do you mean by "it is not trivial to train a model on this database"? Is it very different from training a normal acoustic model as in your tutorial? Also, it seems the CMU Kids corpus was recorded at 16 kHz; would that be okay for speech recognition on phones? Is there a way to convert the sampling rate, other than playing the speech on my phone and recording it?
If I had to collect the children's speech data myself, how much data would I need? (I saw that for command-and-control it is around 5 hours and 200 speakers, but for children would it be the same or even more?) Would it be possible to use 2 hours of data from 20-30 children (even if the recognition would then not be perfect)?
What kind of data should I record? (I will have a more or less small list of words, but it is not decided yet.) Should the children read normal sentences or only the list of words I would like to recognize?
Thanks!
Last edit: Paul Rolin 2015-07-16
Is it very different from training a normal acoustic model as in your tutorial?
It is not different, but because kids do not speak cleanly and the database does not have clean transcriptions, the accuracy of the result is not high.
Also, it seems the CMU Kids corpus was recorded at 16 kHz; would that be okay for speech recognition on phones? Is there a way to convert the sampling rate, other than playing the speech on my phone and recording it?
Phones can record at 16 kHz as well; you do not need to make any changes here.
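For reference, capturing 16 kHz, 16-bit mono PCM on Android uses the standard AudioRecord API, so the phone side is not a problem; a minimal sketch (permission handling and file writing omitted):

```java
import android.media.AudioFormat;
import android.media.AudioRecord;
import android.media.MediaRecorder;

// Minimal sketch: capture 16 kHz, 16-bit mono PCM on Android, the format the
// en-us models expect. Permission checks and writing to a file are omitted.
public class Recorder16k {
    private volatile boolean running = true;

    public void record() {
        int sampleRate = 16000;
        int bufSize = AudioRecord.getMinBufferSize(sampleRate,
                AudioFormat.CHANNEL_IN_MONO, AudioFormat.ENCODING_PCM_16BIT);
        AudioRecord recorder = new AudioRecord(MediaRecorder.AudioSource.MIC,
                sampleRate, AudioFormat.CHANNEL_IN_MONO,
                AudioFormat.ENCODING_PCM_16BIT, bufSize);
        short[] buffer = new short[bufSize / 2];
        recorder.startRecording();
        while (running) {
            int read = recorder.read(buffer, 0, buffer.length);
            // ... write 'read' samples to a raw/WAV file or feed them to the decoder ...
        }
        recorder.stop();
        recorder.release();
    }
}
```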
If I had to collect the children's speech data myself, how much data would I need? (I saw that for command-and-control it is around 5 hours and 200 speakers, but for children would it be the same or even more?)
This estimate is valid.
Would it be possible to use 2 hours of data from 20-30 children (even if the recognition would then not be perfect)?
That is OK, but accuracy will be lower.
What kind of data should I record? (I will have a more or less small list of words, but it is not decided yet.) Should the children read normal sentences or only the list of words I would like to recognize?
You need to record the words you want to recognize. If you want to make your database extensible for future use, you might record normal sentences as well.
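If it helps when you organize the recordings, the training/adaptation tutorial expects a .fileids file listing the utterance ids and a matching .transcription file; roughly like this, where the ids and words are made-up examples:

my_db.fileids (one utterance id per line):

```
kids_0001
kids_0002
```

my_db.transcription (one line per utterance, with the matching id in parentheses):

```
<s> bravo </s> (kids_0001)
<s> banane </s> (kids_0002)
```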
OK, thank you very much! I will try the adult acoustic model with some children and will decide whether a children's acoustic model is needed for what I want to do.
Last edit: Paul Rolin 2015-07-22
I have one last question: I would like to implement the recognizer on both Android and iOS. For Android, it's OK. But for iOS, I saw that OpenEars uses Sphinx. However, I think you wrote somewhere that keyword spotting is not implemented in OpenEars. Or is it what they call Rejecto?
Is it possible to use pocketsphinx on iOS fairly easily? Or would you advise me to use the OpenEars version?
Thanks
PS: sorry, I posted my message twice and I can't find a way to delete the first one...
The latest OpenEars version should be in sync with the latest pocketsphinx, so you can use keyword spotting there. You can also use pocketsphinx as a library; just add it to your iOS project.
Hi again,
As it seems I may have to use the pocketsphinx functions directly (so in C), I'm looking for a way to easily understand how the functions are used.
So I would need the mapping between the "native" methods in the Java JNI layer and the C functions. For that, it seems I need the pre-compiled .so file. Is it possible to get these files? (I searched in the pocketsphinx folder and I can't find the file used to build the .so; I guess it involves JNI, but I can't find it.)
Thanks in advance
We use SWIG (http://swig.org) to wrap the C code for Java. The required interface files are in pocketsphinx/swig; they have the .i extension.
To build the pocketsphinx JNI files for Android you need to use the pocketsphinx-android project; it is available on subversion/github.
How can I train a language model on Android? I have been searching for two hours, but I did not find any relevant content about it. Please help.