Hello,
I'm a student working on a final project for my degree with a fellow alum of mine. We are developing an Android application that needs to be able to process speech-to-text on data from input files. We're still researching exactly how we're going to go about doing this, and came across the pocketsphinx library as a potential solution.
I read through the demo pocketsphinx android application, which seems to me to use the microphone to input speech data directly.
The issue is that our app needs the microphone to record the audio we want transcribed. This means we're using an AudioRecord, which ties up the mic, so we can't send a live capture to Google's standard SpeechRecognizer at the same time.
My question, then, is: does anyone know whether pocketsphinx on Android can take a byte stream or something similar, so that we can pass speech files to the pocketsphinx API instead of a capture from the device's microphone? Has this been done before, or, if not, could we somewhat easily override the existing implementation of the pocketsphinx Android SpeechRecognizer?
Thanks for any help or suggestions you'd be willing to offer.
http://stackoverflow.com/questions/29008111/give-a-file-as-input-to-pocketsphinx-on-android
Thank you! This looks like exactly what we had in mind.
I'm not sure if I should move this into another topic, but I was hoping somebody wouldn't mind taking a look at our approach - I downloaded pocketsphinx for android and the demo app, and was able to get through a good chunk of it on my own, but I've been stuck for a day or so.
We capture audio as PCM, so I first convert that data to WAV.
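Roughly, that conversion amounts to writing a standard 44-byte RIFF header in front of the raw samples; a minimal sketch, assuming 16-bit mono little-endian PCM (the class and method names here are placeholders, not our exact code):

import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class WavWriter {
    // Wraps raw little-endian 16-bit mono PCM in a standard 44-byte RIFF/WAVE header.
    public static void pcmToWav(File pcm, File wav, int sampleRate) throws IOException {
        byte[] data = new byte[(int) pcm.length()];
        try (DataInputStream in = new DataInputStream(new FileInputStream(pcm))) {
            in.readFully(data);
        }
        int byteRate = sampleRate * 2; // mono, 16-bit => 2 bytes per sample frame
        ByteBuffer header = ByteBuffer.allocate(44).order(ByteOrder.LITTLE_ENDIAN);
        header.put("RIFF".getBytes()).putInt(36 + data.length).put("WAVE".getBytes());
        header.put("fmt ".getBytes()).putInt(16) // PCM fmt chunk size
              .putShort((short) 1)               // audio format 1 = PCM
              .putShort((short) 1)               // channels = mono
              .putInt(sampleRate).putInt(byteRate)
              .putShort((short) 2)               // block align
              .putShort((short) 16);             // bits per sample
        header.put("data".getBytes()).putInt(data.length);
        try (FileOutputStream out = new FileOutputStream(wav)) {
            out.write(header.array());
            out.write(data);
        }
    }
}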
I use this code to set up the Decoder:
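public void translate() {
    Config c = Decoder.defaultConfig();
    Assets assets = null;
    File assetDir = null;
    try {
        // Sync the bundled model files (acoustic model, LM, dictionary) to the filesystem
        assets = new Assets(_context);
        assetDir = assets.syncAssets();
        Log.d("DEBUG", assetDir.toString());
    } catch (IOException e) {
        Log.d("DEBUG", "Assets couldn't be created");
        e.printStackTrace();
        return;
    }
    c.setString("-hmm", new File(assetDir, "en-us").toString());
    c.setString("-lm", new File(assetDir, "en-us.lm.dmp").toString());
    c.setString("-dict", new File(assetDir, "cmudict-en-us.dict").toString());
    Decoder d = new Decoder(c);

    FileInputStream stream = null;
    URI testwav = null;
    try {
        testwav = new URI("file:" + _wavFileName);
    } catch (URISyntaxException e) {
        e.printStackTrace();
        Log.d("DEBUG", "URI broke");
        return;
    }
    try {
        stream = new FileInputStream(new File(testwav));
    } catch (FileNotFoundException e) {
        e.printStackTrace();
        Log.d("DEBUG", "File stream broke");
        return;
    }

    d.startUtt();
    byte[] b = new byte[4096];
    try {
        int nbytes;
        while ((nbytes = stream.read(b)) >= 0) {
            ByteBuffer bb = ByteBuffer.wrap(b, 0, nbytes);
            // Not needed on desktop but required on android
            bb.order(ByteOrder.LITTLE_ENDIAN);
            short[] s = new short[nbytes / 2];
            bb.asShortBuffer().get(s);
            d.processRaw(s, nbytes / 2, false, false);
        }
    } catch (IOException e) {
        e.printStackTrace();
        Log.d("DEBUG", "io broken");
        return;
    }
    d.endUtt();
    System.out.println(d.hyp().getHypstr());
    for (Segment seg : d.seg()) {
        _results.add(seg.getWord());
    }
}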
When we call translate() after recording our PCM and converting to a WAV file, we get this logcat output:
I suspect we're not setting something up about the libraries correctly. I'm not exactly sure what, though.
If I need to provide more information, just let me know what you'd like to see.
Thanks for any help!
-Chris
Well, most likely it's a problem with the byte endianness of the stream. You need to check whether you actually feed in the data. You can also add the -rawlogdir option to store the data the recognizer actually receives and analyze it.
Terribly sorry about the long turnaround on my reply.
We've been hard at work on various parts of our app and the Transcript feature got pushed to the back burner. If you'd like this opened in a new thread, feel free to move it.
I determined at some point that we were forming our .wav files incorrectly. I've since modified the code that produces them.
However, I currently see the same error that I did above - that <s> could not be found in the first frame. I believe this has something to do with our dictionaries being malformed.
I would like to better understand what this error means and how I can correct it. Thanks for all your help so far, Nickolay, and sorry again about taking so long to get back to you.
EDIT: To clarify, I am now reasonably sure that the .wav files have the correct endianness and the correct file headers. We produce and consume little-endian files. I saw similar issues to ours being caused by <s> entries not existing in the dictionary. Also, I'm unsure how to use the -rawlogdir option you mention - is it as simple as adding it to our run configuration? Thanks again!
-Chris F
Last edit: Christopher Farnsworth 2016-01-22
Hello Chris
It's great you've got back to this problem.
However, I currently see the same error that I did above - that <s> could not be found in the first frame. I believe this has something to do with our dictionaries being malformed.
Well, you need to show your language model and the dictionary then. It is hard to say why <s> is missing. Are you sure it works on desktop with the same models?
Also, I'm unsure how to use the -rawlogdir option you mention - is it as simple as adding it to our run configuration?
c.setString("-rawlogdir", assetDir.toString());
should work
Last edit: Nickolay V. Shmyrev 2016-01-23
Hello Nickolay,
Thank you for all the suggestions so far. We managed to get the system working somewhat!
However, the speech recognition is pretty inaccurate. For instance, if I speak into the mic the following phrase: "Hello phone", I sometimes get a correct result, but other times get "a low bone" or a similar-sounding phrase. With that in mind, I'm attaching the assets directory that we're using. Maybe you'll notice if something looks wrong?
EDIT: I wanted to clarify the problem a bit, so here goes.
Here's the code that does the decoding. I know our WAV files are correct because they play back with exactly the same audio as the raw PCM (no audible difference), so I think that's a good sign that they're in working order.
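public void translate() {
    Config c = Decoder.defaultConfig();
    Assets assets = null;
    File assetDir = null;
    try {
        // Sync the bundled model files (acoustic model, LM, dictionary) to the filesystem
        assets = new Assets(_context);
        assetDir = assets.syncAssets();
        Log.d("DEBUG", assetDir.toString());
    } catch (IOException e) {
        Log.d("DEBUG", "Assets couldn't be created");
        e.printStackTrace();
        return;
    }
    c.setString("-hmm", new File(assetDir, "en-us").toString());
    c.setString("-lm", new File(assetDir, "en-us.lm.dmp").toString());
    c.setString("-dict", new File(assetDir, "cmudict-en-us.dict").toString());
    Decoder d = new Decoder(c);

    FileInputStream stream = null;
    URI testwav = null;
    try {
        testwav = new URI("file:" + _wavFileName);
    } catch (URISyntaxException e) {
        e.printStackTrace();
        Log.d("DEBUG", "URI creation failed");
        return;
    }
    try {
        stream = new FileInputStream(new File(testwav));
    } catch (FileNotFoundException e) {
        e.printStackTrace();
        Log.d("DEBUG", "File stream initialization failed");
        return;
    }

    d.startUtt();
    byte[] b = new byte[4096];
    try {
        int nbytes;
        while ((nbytes = stream.read(b)) >= 0) {
            ByteBuffer bb = ByteBuffer.wrap(b, 0, nbytes);
            // Not needed on desktop but required on android
            bb.order(ByteOrder.LITTLE_ENDIAN);
            short[] s = new short[nbytes / 2];
            bb.asShortBuffer().get(s);
            d.processRaw(s, nbytes / 2, false, false);
        }
    } catch (IOException e) {
        e.printStackTrace();
        Log.d("DEBUG", "IO Failed");
        return;
    }
    d.endUtt();
    //System.out.println(d.hyp().getHypstr());
    //_results.add(d.hyp().getHypstr());
    SegmentList segments = d.seg();
    for (Segment seg : segments) {
        _results.add(seg.getWord());
    }
}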
This is mostly based on code I pulled from a few sources, adapted to fit our logic a bit.
Right now, I'm getting some strange results. I think I might be misunderstanding what a segment is conceptually.
If I say "TEST" into the microphone, I get very strange results.
The strings currently produced by speaking "TEST" are "al opt out". Is that a good sign that the dictionary is off? Alternatively, do you notice anything obviously wrong with our code?
Thanks again,
Chris F
Last edit: Christopher Farnsworth 2016-01-26
I'm sorry, you write about the microphone many times, but you are actually recognizing from a file. How is the microphone related to that?
It would be better to provide logcat messages so I can understand what is going on.
Large-vocabulary recognition on an Android phone is going to be slow and I doubt you'll be able to do it in real time; you need to somehow restrict the vocabulary.
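For example, one way to restrict it is to decode against a small JSGF grammar instead of the full language model; a rough sketch (the grammar name, contents, and file name are only an illustration):

// hello.gram (illustrative):
//   #JSGF V1.0;
//   grammar commands;
//   public <command> = hello phone | testing hello testing;
//
// assetDir is the synced assets directory, as in the translate() code above.
Config c = Decoder.defaultConfig();
c.setString("-hmm", new File(assetDir, "en-us").toString());
c.setString("-dict", new File(assetDir, "cmudict-en-us.dict").toString());
c.setString("-jsgf", new File(assetDir, "hello.gram").toString()); // replaces the -lm line
Decoder d = new Decoder(c);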
Sorry, I didn't mean to confuse - we're originally capturing data from the mic, saving that to a file, and that gets passed to Sphinx. I should have said "we provide a file containing the spoken phrase 'hello'."
I'm attaching the logcat output from a recent run in case it helps. That said, because the accuracy of recognition has greatly improved, I think we may simply be working with too limited a dictionary. I am pretty sure we are still using the dictionary from the PocketSphinx demo, if that helps.
We did recently turn the sample rate we capture audio at down from 44100 to 16000 and saw dramatic improvements. It seems like we may be missing some words in our dictionary. Since we're doing a transcript, I would prefer to limit the vocabulary as little as possible.
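A 16 kHz capture setup along these lines might look roughly like this (buffer handling simplified; just a sketch, and it assumes the RECORD_AUDIO permission is already granted):

int sampleRate = 16000; // matches the acoustic model / pocketsphinx -samprate default
int minBuf = AudioRecord.getMinBufferSize(sampleRate,
        AudioFormat.CHANNEL_IN_MONO, AudioFormat.ENCODING_PCM_16BIT);
AudioRecord recorder = new AudioRecord(MediaRecorder.AudioSource.MIC,
        sampleRate, AudioFormat.CHANNEL_IN_MONO,
        AudioFormat.ENCODING_PCM_16BIT, minBuf * 4);
recorder.startRecording();
short[] buf = new short[minBuf / 2];
// in the capture loop: read samples and append them to the PCM file
int read = recorder.read(buf, 0, buf.length);
// ... when done:
recorder.stop();
recorder.release();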
The data I've attached was from a test run in which I gave it a .wav file containing the spoken words "testing hello testing" with a brief pause in between each word. I got back the results "test hello for test". This is really close to being accurate, I just think perhaps "testing" is missing from the dictionary.
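A quick way to confirm would be to scan the dictionary for the word before decoding; a minimal sketch (assuming the cmudict-en-us.dict file from the demo assets):

// Returns true if the CMU dict contains an entry for the given word.
// Entries look like: "testing T EH1 S T IH0 NG"; alternates look like "testing(2) ...".
private boolean dictContains(File dictFile, String word) throws IOException {
    BufferedReader reader = new BufferedReader(new FileReader(dictFile));
    try {
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.startsWith(word + " ") || line.startsWith(word + "(")) {
                return true;
            }
        }
        return false;
    } finally {
        reader.close();
    }
}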
Realtime audio recognition isn't an issue. We are fine with running the recognition as a background service and notifying the user when it finishes up.
Sorry for any confusion my last post caused, and thanks again for all your help. I know this has been a long process, and I know we're just one of many small teams you're assisting. I really appreciate it.
Hello everyone,
I have to do some tests using Pocketsphinx for a project of mine. I want to give .wav files as input, but I have problems with defining the path of a .wav file (at first I am experimenting with only a single wav file). I always get FileNotFoundException. I have tried different ways to access it (from the external storage directory, from the assets directory, etc.), but I cannot. Could anyone please let me know how I can get the path of a wav file?
Thank you very much!
Leutrim
Hello Leutrim
First of all please avoid hijacking old unrelated threads. If you have a question, start a new thread.
You can explore the device filesystem and find the file with adb shell. See more details here:
http://developer.android.com/tools/help/shell.html
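If the wav ships inside the APK's assets, one option is to first copy it out to the app's private files directory and use that path; a minimal sketch (the asset name is just an example):

// Copies a wav bundled under assets/ into the app's private files dir
// and returns an absolute path that FileInputStream can open.
private String copyAssetWav(Context context, String assetName) throws IOException {
    File out = new File(context.getFilesDir(), assetName); // e.g. assetName = "test.wav"
    InputStream in = context.getAssets().open(assetName);
    FileOutputStream os = new FileOutputStream(out);
    byte[] buf = new byte[4096];
    int n;
    while ((n = in.read(buf)) != -1) {
        os.write(buf, 0, n);
    }
    os.close();
    in.close();
    return out.getAbsolutePath();
}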
Hello,
I am also trying to transcribe an audio file using similar code and I am getting the same logs as "Christopher Farnsworth - 2015-11-13". Is there any other way to transcribe an audio file, or has anyone gotten results using the code discussed above?
You need to share the log you get.