Forum thread with extra details
I'm planning on using Sphinx for an upcoming HLE attempt at emulating the N64 game Hey You, Pikachu! Basically, the game has a list of 640 phrases that can be used in-game. However, many of them are duplicates for varying pronunciations, so the compressed list (duplicates removed) comes out to 459 phrases, 151 of which are Pokemon names.
The issue with emulating this is that the original game came with a mic and a Voice Recognition Unit (VRU) that no one has figured out how to emulate. Rather than trying to emulate the circuitry, I decided that it'd be simpler to just use modern voice recognition technology and map each recognized voice command to the output the original VRU would have produced for it.
What I would like to know is which functionality of CMU Sphinx (Pocketsphinx, really) would be ideal for this kind of work. Is it keyword recognition, or something more complex?
You can use pocketsphinx with a JSGF grammar for this game. See the tutorials for details:
http://cmusphinx.sourceforge.net/wiki/tutorialpocketsphinx
http://cmusphinx.sourceforge.net/wiki/tutoriallm
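For readers unfamiliar with JSGF, here is a minimal sketch of how a flat phrase list could be turned into a grammar with a single public rule. It is written in Python purely for illustration. The grammar name, rule name, and sample phrases are assumptions, not the game's actual list.

```python
def build_jsgf(phrases, grammar_name="commands", rule_name="command"):
    """Return JSGF text with one public rule listing each phrase as an alternative."""
    unique = []
    for phrase in phrases:
        cleaned = " ".join(phrase.split())  # collapse stray whitespace
        if cleaned and cleaned not in unique:  # drop blanks and duplicates, keep order
            unique.append(cleaned)
    alternatives = " | ".join(unique)
    return (
        "#JSGF V1.0;\n"
        f"grammar {grammar_name};\n"
        f"public <{rule_name}> = {alternatives};\n"
    )


# Sample phrases only; a real list would come from the game's phrase table.
grammar_text = build_jsgf(["sleep tight", "see you tomorrow", "sleep tight"])
print(grammar_text)
```

Every word used in the grammar still has to exist in the pronunciation dictionary that is loaded alongside it.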
Wow! Thanks for the quick response!
I noticed the tutorial says "live audio input is somewhat platform-specific." Is there a tutorial which goes into cross-platform support for live audio, or perhaps a library that abstracts it away from the OS? Ideally I'm looking for something that can handle cross-platform mic abstraction as well, assuming such a thing exists.
One more question: Will the library already detect non-English words, specifically Pokemon names? If not, how do I go about adding them in? Would using a simple English web service for building the DB work in my case?
Very helpful, very fast so far, really appreciate it :D
Last edit: Daniel Eck 2015-04-16
Dear Daniel
There is no cross-platform library, since the platforms we support are very different. For example, there is no single way to record audio on Android, iOS, and Windows Phone. Some libraries, like PortAudio, claim to be cross-platform; you can use them if needed.
You can use the web service to create an initial dictionary for Pokemon names. Still, it is recommended to review the dictionary manually afterward, because automatically generated pronunciations might have errors.
So I've put together a library using the web service and the commands listed there, and it works quite well with pocketsphinx_continuous, but it left me with a few questions:
1. How can I reduce the size of the language model? I only need it to recognize specific keywords, so I don't see a reason to include the entire model. Currently it's at 24 meg or so, and I'd like to try to get this whole package to under a meg if possible.
2. How can I stop it from trying to find multiple phrases with each input? For example, if I say two words, one after the other, it outputs them as two words, even if a phrase with them together doesn't exist. I'd prefer it to simply wait until input is over and pick whatever single phrase seemed to be the closest to what was said, like the original game does.
3. How do I go about improving the dictionary manually? Pikachu and some other Pokemon names are being troublesome. They work sometimes, but Pikachu in particular should be more easily detected.
4. Finally, how can I embed the functionality of pocketsphinx_continuous.exe into a program? The tutorial covered a program that reads from a file. How do I call the built-in mic reading capability and turn it on/off?
Thank you so much, this was actually a lot easier than I expected it to be, you've been very helpful.
Last edit: Daniel Eck 2015-05-05
1. To search for keyphrases, use keyphrase spotting mode; then you do not need a language model at all.
2. In keyphrase spotting mode there should be no such issue.
3. Open the dictionary with a text editor and add the entries you need.
4. Open the pocketsphinx_continuous source code (continuous.c) and see what is going on there.
Last edit: Nickolay V. Shmyrev 2015-05-05
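As an illustration of what hand-added dictionary entries look like, here is a small Python sketch that formats CMUdict-style lines, where alternate pronunciations of a word are written as WORD(2), WORD(3), and so on. The phone sequences for PIKACHU below are illustrative guesses, not reviewed pronunciations. The uppercase headword is also an assumption, chosen to match lmtool-style output.

```python
def format_dict_entries(word, pronunciations):
    """Format CMUdict-style lines: first variant is WORD, later ones WORD(2), WORD(3), ..."""
    lines = []
    for index, phones in enumerate(pronunciations, start=1):
        key = word if index == 1 else f"{word}({index})"
        lines.append(f"{key} {' '.join(phones)}")
    return lines


entries = format_dict_entries(
    "PIKACHU",
    [
        ["P", "IY", "K", "AH", "CH", "UW"],  # illustrative guess
        ["P", "IH", "K", "AH", "CH", "UW"],  # illustrative alternate
    ],
)
print("\n".join(entries))
```

Adding a second variant for a troublesome name is one way to cover speakers who pronounce it differently. Whether it helps is something to check against real recordings.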
Thanks a lot! I'm just having issues implementing keyphrase spotting mode. I modified my phrase list to look like this:
~~~~~~~~~~~~
lets check it out /1e-1/
sleep tight /1e-1/
see you tomorrow /1e-1/
see you in the morning /1e-1/
~~~~~~~~~~~~
and tried putting it in with
It said that no acoustic model was specified, so I passed the model in with -hmm, and then it gave a whole bunch of errors like: "kws_search.c", line 165: The word 'ZAPDOS' is missing in the dictionary (ZAPDOS, etc.).
I've tried passing in the dictionary and lm file I had before as arguments, but I get the same error.
Last edit: Daniel Eck 2015-05-05
First of all you need to use the latest version 5prealpha:
http://sourceforge.net/projects/cmusphinx/files/pocketsphinx/5prealpha/pocketsphinx-5prealpha-win32.zip/download
Then on Windows you need to specify an acoustic model and the dictionary:
Yeah, I've got the latest version, and I used that exact command and I'm still getting a whole bunch of "The word is missing in the dictionary" errors.
I've attached a picture of the exact command I used and the folder layout. I really appreciate the help, and sorry to be getting so stuck.
Sorry, but you are using some small hyp.dic and not cmudict-en-us.dict as I wrote.
If your hyp.dic was created with lmtool, please note that words are case sensitive, so the words in your phrases.txt must be uppercase, as they are in the dictionary. You also seem to be using other words in your phrases, not the ones you posted above. Please also note that phrases should not contain punctuation.
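The checks described above (exact case match against the dictionary, no punctuation) can be automated before handing the phrase list to the recognizer. Below is a minimal Python sketch, assuming CMUdict-style dictionary text and phrases that do not yet carry the /threshold/ suffix. The function names and sample data are hypothetical. Allowing apostrophes inside words is also an assumption.

```python
import re
import string

# Apostrophes are tolerated here as an assumption; everything else counts as punctuation.
PUNCTUATION = set(string.punctuation) - {"'"}


def load_dictionary_words(dict_text):
    """Collect headwords from CMUdict-style text, dropping variant markers like WORD(2)."""
    words = set()
    for line in dict_text.splitlines():
        parts = line.split()
        if parts:
            words.add(re.sub(r"\(\d+\)$", "", parts[0]))
    return words


def find_problems(phrases, dict_words):
    """Return (phrase, reason) pairs for phrases that would likely be rejected."""
    problems = []
    for phrase in phrases:
        if any(ch in PUNCTUATION for ch in phrase):
            problems.append((phrase, "contains punctuation"))
            continue
        for word in phrase.split():
            if word not in dict_words:  # comparison is case sensitive on purpose
                problems.append((phrase, f"'{word}' is not in the dictionary"))
                break
    return problems


sample_dict = "SLEEP S L IY P\nTIGHT T AY T\nZAPDOS Z AE P D OW S\n"
words = load_dictionary_words(sample_dict)
for phrase, reason in find_problems(["SLEEP TIGHT", "zapdos", "SLEEP TIGHT!"], words):
    print(f"{phrase}: {reason}")
```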
Oh yeah, I had tried both dictionaries, but the case sensitivity was the issue! Thanks a ton! I'll take out the punctuation as well.
It does appear to still be listening for multiple phrases together (#2) and outputting them all at once. Is there like a flag or something to eliminate that?
Also, now that I have it using keyphrase spotting with the dictionary I generated, the detection quality is a lot worse (even though I was using the same .dic file before). Any reason why that would be? I've actually run them side by side and it's clear the quality has gotten worse.
Last edit: Daniel Eck 2015-05-05
Keyphrase detection is controlled by the threshold. If there are false alarms, you can raise the threshold; if detections are missed, lower it.
Long phrases are hard to detect.
You can provide audio data and phrase lists in order to get help on accuracy.
Also, for 640 phrases a language model indeed makes more sense. You can probably just use the small language model you created from your phrases.
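Since thresholds are set per phrase in the keyphrase list, tuning is easier if the list is generated rather than edited by hand. The Python sketch below writes entries in the "phrase /threshold/" format shown earlier in the thread, with a default value and per-phrase overrides. The override value is a hypothetical placeholder, not a recommended setting.

```python
def build_keyphrase_list(phrases, default_threshold="1e-1", overrides=None):
    """Return keyphrase-list text with one 'phrase /threshold/' entry per line."""
    overrides = overrides or {}
    lines = []
    for phrase in phrases:
        threshold = overrides.get(phrase, default_threshold)
        lines.append(f"{phrase} /{threshold}/")
    return "\n".join(lines) + "\n"


keyphrase_text = build_keyphrase_list(
    ["SLEEP TIGHT", "SEE YOU IN THE MORNING"],
    overrides={"SEE YOU IN THE MORNING": "1e-20"},  # hypothetical tuned value
)
print(keyphrase_text, end="")
```

Keeping the overrides in one place makes it straightforward to adjust a single troublesome phrase and re-test it against recorded audio.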
It is not really possible to force recognition of a single phrase in this case; the decoder just recognizes what was said. You have to postprocess the output to decide what to do with multiple phrases.
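One possible form of that postprocessing is to reduce whatever text the recognizer returns to the single closest entry in the known phrase list. Here is a minimal Python sketch using the standard library's difflib. It compares the text of the hypothesis, not the audio, so it is only a rough stand-in for "closest to what was said". The function name, cutoff value, and uppercase normalization are assumptions.

```python
import difflib


def pick_single_phrase(hypothesis, known_phrases, cutoff=0.6):
    """Map a raw hypothesis string to the closest known phrase, or None if nothing is close."""
    normalized = " ".join(hypothesis.upper().split())
    matches = difflib.get_close_matches(normalized, known_phrases, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Sample phrases only; the real list would hold the game's 459 unique phrases.
PHRASES = ["SLEEP TIGHT", "SEE YOU TOMORROW", "PIKACHU"]
print(pick_single_phrase("sleep tight pikachu", PHRASES))
```

The chosen phrase could then be looked up in a table that maps each phrase to the value the original VRU reported for it.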
Yeah, and it actually comes out to only 459 phrases with repeats for different pronunciations removed. I'll probably need to look into the thresholds to fix my accuracy issues.
So I noticed that I still need to use the -hmm flag to link to the 2MB mdef file in en-us. Is there any way this can be reduced, or is it required?
Also, thanks a ton again, your help has been amazing, I did not expect to have an essentially working solution this fast.
The acoustic model is required; it describes the sounds of the language.