
Using file as input data for pocketsphinx

2015-11-09 – 2016-02-18
  • Christopher Farnsworth

    Hello,

    I'm a student working on a final project for my degree with a fellow alum of mine. We are developing an Android application that needs to run speech-to-text on audio from input files. We're still researching exactly how we'll go about this, and came across the pocketsphinx library as a potential solution.

    I read through the pocketsphinx Android demo application, which appears to capture speech data directly from the microphone.

    The issue is that our app already uses the microphone, via an AudioRecord, to record what the user says for the transcript. That ties up the mic, so we can't also send a live capture to Google's standard SpeechRecognizer.

    My question, then: does pocketsphinx on Android have the capability to take a byte stream or something similar, so that we can pass speech files to the pocketsphinx API instead of a capture from the device's microphone? Has this been done before, or, if not, could we easily override the existing implementation of the pocketsphinx Android SpeechRecognizer?
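
    Conceptually, what we're hoping exists is something along these lines (just a sketch of the Decoder API from the SWIG bindings as we understand it, not verified code):

        // Sketch: feed raw 16-bit PCM samples read from a file to the
        // decoder instead of the microphone. Assumes "config" points at a
        // valid acoustic model, LM, and dictionary, and that "samples" is
        // a short[] of 16 kHz mono audio.
        Decoder d = new Decoder(config);
        d.startUtt();
        d.processRaw(samples, samples.length, false, false);
        d.endUtt();
        String text = d.hyp().getHypstr();  // hyp() may be null if nothing was recognized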

    Thanks for any help or suggestions you'd be willing to offer.

     
  • Christopher Farnsworth

    Thank you! This looks like exactly what we had in mind.

     
  • Christopher Farnsworth

    I'm not sure if I should move this into another topic, but I was hoping somebody wouldn't mind taking a look at our approach. I downloaded pocketsphinx for Android and the demo app and was able to get through a good chunk of it on my own, but I've been stuck for a day or so.

    We capture audio as PCM, so I first convert that data to WAV.
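
    For reference, the conversion just prepends the standard 44-byte RIFF/WAVE header to the raw samples. A stripped-down sketch of that step (not our exact code; sampleRate is a placeholder, and the audio is assumed to be 16-bit mono little-endian PCM):

        // Sketch: wrap raw 16-bit mono little-endian PCM in a standard
        // 44-byte RIFF/WAVE header.
        static void pcmToWav(File pcmFile, File wavFile, int sampleRate)
                throws IOException {
            byte[] pcm = new byte[(int) pcmFile.length()];
            try (DataInputStream in =
                    new DataInputStream(new FileInputStream(pcmFile))) {
                in.readFully(pcm);
            }
            ByteBuffer h = ByteBuffer.allocate(44).order(ByteOrder.LITTLE_ENDIAN);
            h.put("RIFF".getBytes()).putInt(36 + pcm.length).put("WAVE".getBytes());
            h.put("fmt ".getBytes()).putInt(16);  // fmt chunk, 16 bytes follow
            h.putShort((short) 1);                // audio format: PCM
            h.putShort((short) 1);                // channels: mono
            h.putInt(sampleRate);                 // sample rate
            h.putInt(sampleRate * 2);             // byte rate: rate * 1 ch * 2 bytes
            h.putShort((short) 2);                // block align
            h.putShort((short) 16);               // bits per sample
            h.put("data".getBytes()).putInt(pcm.length);
            try (FileOutputStream out = new FileOutputStream(wavFile)) {
                out.write(h.array());
                out.write(pcm);
            }
        }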

    I use this code to set up the Decoder:

        public void translate() {
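            // Imports assumed: edu.cmu.pocketsphinx (Assets, Config, Decoder,
            // Hypothesis, Segment), java.io.*, java.net.URI,
            // java.net.URISyntaxException, java.nio.ByteBuffer, java.nio.ByteOrder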
            Config c = Decoder.defaultConfig();
            Assets assets = null;
            File assetDir = null;
            try {
                assets = new Assets(_context);
                assetDir = assets.syncAssets();
                Log.d("DEBUG", assetDir.toString());
            } catch (IOException e) {
                Log.d("DEBUG", "Assets couldn't be created");
                e.printStackTrace();
                return;
            }
    
            c.setString("-hmm", new File(assetDir, "/en-us").toString());
            c.setString("-lm", new File(assetDir, "en-us.lm.dmp").toString());
            c.setString("-dict", new File(assetDir, "cmudict-en-us.dict").toString());
            Decoder d = new Decoder(c);
            FileInputStream stream = null;
            URI testwav = null;
            try {
                testwav = new URI("file:" + _wavFileName);
            } catch (URISyntaxException e) {
                e.printStackTrace();
                Log.d("DEBUG", "URI broke");
                return;
            }
            try {
                stream = new FileInputStream(new File(testwav));
            } catch (FileNotFoundException e) {
                e.printStackTrace();
                Log.d("DEBUG", "File stream broke");
                return;
            }
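            // Note: this loop feeds the whole file to the decoder, including
            // the 44-byte WAV header, which gets treated as audio. Skipping it
            // first (e.g. stream.skip(44)) may be safer.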
            d.startUtt();
            byte[] b = new byte[4096];
            try {
                int nbytes;
                while ((nbytes = stream.read(b)) >= 0) {
                    ByteBuffer bb = ByteBuffer.wrap(b, 0, nbytes);
    
                    // Not needed on desktop but required on android
                    bb.order(ByteOrder.LITTLE_ENDIAN);
    
                    short[] s = new short[nbytes/2];
                    bb.asShortBuffer().get(s);
                    d.processRaw(s, nbytes/2, false, false);
                }
            } catch (IOException e) {
                e.printStackTrace();
                Log.d("DEBUG", "io broken");
                return;
            }
            d.endUtt();
            Hypothesis hyp = d.hyp();  // may be null if nothing was recognized
            if (hyp != null) {
                System.out.println(hyp.getHypstr());
            }
            for (Segment seg : d.seg()) {
                _results.add(seg.getWord());
            }
        }
    

    When we call translate() after recording our PCM and converting to a WAV file, we get this logcat output:

    11-13 13:38:34.606  30220-30220/test.stagecoach I/Assets﹕ Skipping asset cmudict-en-us.dict: checksums are equal
    11-13 13:38:34.606  30220-30220/test.stagecoach I/Assets﹕ Skipping asset en-us/noisedict: checksums are equal
    11-13 13:38:34.606  30220-30220/test.stagecoach I/Assets﹕ Skipping asset en-us/means: checksums are equal
    11-13 13:38:34.606  30220-30220/test.stagecoach I/Assets﹕ Skipping asset en-us/feat.params: checksums are equal
    11-13 13:38:34.606  30220-30220/test.stagecoach I/Assets﹕ Skipping asset en-us/README: checksums are equal
    11-13 13:38:34.606  30220-30220/test.stagecoach I/Assets﹕ Skipping asset en-us/sendump: checksums are equal
    11-13 13:38:34.606  30220-30220/test.stagecoach I/Assets﹕ Skipping asset en-us/variances: checksums are equal
    11-13 13:38:34.606  30220-30220/test.stagecoach I/Assets﹕ Skipping asset en-us/transition_matrices: checksums are equal
    11-13 13:38:34.606  30220-30220/test.stagecoach I/Assets﹕ Skipping asset en-us.lm.dmp: checksums are equal
    11-13 13:38:34.606  30220-30220/test.stagecoach I/Assets﹕ Skipping asset en-us/mdef: checksums are equal
    11-13 13:38:34.616  30220-30220/test.stagecoach D/DEBUG﹕ /storage/emulated/0/Android/data/test.stagecoach/files/sync
    11-13 13:38:34.616  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: pocketsphinx.c(145): Parsed model-specific feature parameters from /storage/emulated/0/Android/data/test.stagecoach/files/sync/en-us/feat.params
    11-13 13:38:34.666  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: feat.c(715): Initializing feature stream to type: '1s_c_d_dd', ceplen=13, CMN='current', VARNORM='no', AGC='none'
    11-13 13:38:34.666  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: cmn.c(143): mean[0]= 12.00, mean[1..12]= 0.0
    11-13 13:38:34.666  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: acmod.c(164): Using subvector specification 0-12/13-25/26-38
    11-13 13:38:34.666  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: mdef.c(518): Reading model definition: /storage/emulated/0/Android/data/test.stagecoach/files/sync/en-us/mdef
    11-13 13:38:34.666  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: mdef.c(531): Found byte-order mark BMDF, assuming this is a binary mdef file
    11-13 13:38:34.666  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: bin_mdef.c(336): Reading binary model definition: /storage/emulated/0/Android/data/test.stagecoach/files/sync/en-us/mdef
    11-13 13:38:34.736  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: bin_mdef.c(516): 42 CI-phone, 137053 CD-phone, 3 emitstate/phone, 126 CI-sen, 5126 Sen, 29324 Sen-Seq
    11-13 13:38:34.736  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: tmat.c(206): Reading HMM transition probability matrices: /storage/emulated/0/Android/data/test.stagecoach/files/sync/en-us/transition_matrices
    11-13 13:38:34.736  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: acmod.c(117): Attempting to use PTM computation module
    11-13 13:38:34.746  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: ms_gauden.c(198): Reading mixture gaussian parameter: /storage/emulated/0/Android/data/test.stagecoach/files/sync/en-us/means
    11-13 13:38:34.766  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: ms_gauden.c(292): 42 codebook, 3 feature, size:
    11-13 13:38:34.766  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: ms_gauden.c(294):  128x13
    11-13 13:38:34.766  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: ms_gauden.c(294):  128x13
    11-13 13:38:34.766  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: ms_gauden.c(294):  128x13
    11-13 13:38:34.766  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: ms_gauden.c(198): Reading mixture gaussian parameter: /storage/emulated/0/Android/data/test.stagecoach/files/sync/en-us/variances
    11-13 13:38:34.786  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: ms_gauden.c(292): 42 codebook, 3 feature, size:
    11-13 13:38:34.786  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: ms_gauden.c(294):  128x13
    11-13 13:38:34.786  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: ms_gauden.c(294):  128x13
    11-13 13:38:34.786  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: ms_gauden.c(294):  128x13
    11-13 13:38:34.856  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: ms_gauden.c(354): 222 variance values floored
    11-13 13:38:34.856  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: ptm_mgau.c(476): Loading senones from dump file /storage/emulated/0/Android/data/test.stagecoach/files/sync/en-us/sendump
    11-13 13:38:34.856  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: ptm_mgau.c(500): BEGIN FILE FORMAT DESCRIPTION
    11-13 13:38:34.856  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: ptm_mgau.c(563): Rows: 128, Columns: 5126
    11-13 13:38:34.856  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: ptm_mgau.c(595): Using memory-mapped I/O for senones
    11-13 13:38:34.866  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: ptm_mgau.c(835): Maximum top-N: 4
    11-13 13:38:34.866  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: phone_loop_search.c(114): State beam -225 Phone exit beam -225 Insertion penalty 0
    11-13 13:38:34.986  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: dict.c(320): Allocating 137526 * 20 bytes (2686 KiB) for word entries
    11-13 13:38:34.986  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: dict.c(333): Reading main dictionary: /storage/emulated/0/Android/data/test.stagecoach/files/sync/cmudict-en-us.dict
    11-13 13:38:35.776  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: dict.c(213): Allocated 1007 KiB for strings, 1662 KiB for phones
    11-13 13:38:35.776  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: dict.c(336): 133425 words read
    11-13 13:38:35.776  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: dict.c(358): Reading filler dictionary: /storage/emulated/0/Android/data/test.stagecoach/files/sync/en-us/noisedict
    11-13 13:38:35.776  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: dict.c(213): Allocated 0 KiB for strings, 0 KiB for phones
    11-13 13:38:35.776  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: dict.c(361): 5 words read
    11-13 13:38:35.776  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: dict2pid.c(396): Building PID tables for dictionary
    11-13 13:38:35.776  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: dict2pid.c(406): Allocating 42^3 * 2 bytes (144 KiB) for word-initial triphones
    11-13 13:38:35.896  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: dict2pid.c(132): Allocated 21336 bytes (20 KiB) for word-final triphones
    11-13 13:38:35.906  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: dict2pid.c(196): Allocated 21336 bytes (20 KiB) for single-phone word triphones
    11-13 13:38:35.906  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: ngram_model_trie.c(399): Trying to read LM in trie binary format
    11-13 13:38:35.916  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: ngram_model_trie.c(410): Header doesn't match
    11-13 13:38:35.916  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: ngram_model_trie.c(177): Trying to read LM in arpa format
    11-13 13:38:36.386  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: ngram_model_trie.c(69): No \data\ mark in LM file
    11-13 13:38:36.386  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: ngram_model_trie.c(489): Trying to read LM in DMP format
    11-13 13:38:36.406  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: ngram_model_trie.c(562): ngrams 1=19794, 2=1377200, 3=3178194
    11-13 13:38:58.346  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: lm_trie.c(317): Training quantizer
    11-13 13:39:01.516  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: lm_trie.c(323): Building LM trie
    11-13 13:39:15.806  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: ngram_search_fwdtree.c(99): 788 unique initial diphones
    11-13 13:39:15.816  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: ngram_search_fwdtree.c(148): 0 root, 0 non-root channels, 56 single-phone words
    11-13 13:39:15.816  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: ngram_search_fwdtree.c(186): Creating search tree
    11-13 13:39:15.816  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: ngram_search_fwdtree.c(192): before: 0 root, 0 non-root channels, 56 single-phone words
    11-13 13:39:16.076  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: ngram_search_fwdtree.c(326): after: max nonroot chan increased to 44782
    11-13 13:39:16.076  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: ngram_search_fwdtree.c(339): after: 573 root, 44654 non-root channels, 47 single-phone words
    11-13 13:39:16.076  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: ngram_search_fwdflat.c(157): fwdflat: min_ef_width = 4, max_sf_win = 25
    11-13 13:39:16.076  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: ngram_search_fwdflat.c(302): Utterance vocabulary contains 0 words
    11-13 13:47:08.986    1986-1986/test.stagecoach E/cmusphinx﹕ ERROR: "ngram_search.c", line 1142: Couldn't find <s> in first frame
    

    I suspect we're not setting something up correctly with the libraries, though I'm not exactly sure what.

    If I need to provide more information, just let me know what you'd like to see.

    Thanks for any help!
    -Chris

     
    • Nickolay V. Shmyrev

      Well, most likely it's a problem with the byte endianness of the stream. You need to check whether you are actually feeding in the data. You can also add the -rawlogdir option to store the data the recognizer actually receives, and analyze it.

       
  • Christopher Farnsworth

    Terribly sorry about the long turnaround on my reply.

    We've been hard at work on various parts of our app and the Transcript feature got pushed to the back burner. If you'd like this opened in a new thread, feel free to move it.

    I determined at some point that we were forming our .wav files incorrectly. I've since modified the code that produces them.

    However, I currently see the same error that I did above: that <s> could not be found in the first frame. I believe this has something to do with our dictionaries being malformed.

    I would like to better understand what this error means and how I can correct it. Thanks for all your help so far, Nickolay, and sorry again about taking so long to get back to you.

    EDIT: To clarify, I am now reasonably sure that the .wav files have the correct endianness and the correct file headers. We produce and consume little-endian files. I've seen issues similar to ours that were the result of <s> entries not existing in the dictionary. Also, I'm unsure how to add the -rawlogdir option you mention - is it as simple as adding it to our run configuration? Thanks again!

    -Chris F

     

    Last edit: Christopher Farnsworth 2016-01-22
    • Nickolay V. Shmyrev

      Hello Chris

      It's great that you've come back to this problem.

      However, I currently see the same error that I did above: that <s> could not be found in the first frame. I believe this has something to do with our dictionaries being malformed.

      Well, you need to show your language model and the dictionary then. It is hard to say why <s> is missing. Are you sure it works on desktop with the same models?

      Also, I'm unsure how to add the -rawlogdir option you mention - is it as simple as adding it to our run configuration?

          c.setString("-rawlogdir", new File(assetDir).toString());
      

      should work

       

      Last edit: Nickolay V. Shmyrev 2016-01-23
  • Christopher Farnsworth

    Hello Nickolay,

    Thank you for all the suggestions so far. We managed to get the system working somewhat!

    However, the speech recognition is pretty inaccurate. For instance, if I speak the phrase "Hello phone" into the mic, I sometimes get a correct result, but other times I get "a low bone" or a similar-sounding phrase. With that in mind, I'm attaching the assets directory we're using. Maybe you'll notice if something looks wrong?

    EDIT: I wanted to clarify the problem a bit, so here goes.

    Here's the code that does the decoding. I know our WAV files are intact because they play back with exactly the same audio as the raw PCM (no audible difference), so I think that's a good sign they're in working order.

        public void translate() {
            Config c = Decoder.defaultConfig();
            Assets assets = null;
            File assetDir = null;
            try {
                assets = new Assets(_context);
                assetDir = assets.syncAssets();
                Log.d("DEBUG", assetDir.toString());
            } catch (IOException e) {
                Log.d("DEBUG", "Assets couldn't be created");
                e.printStackTrace();
                return;
            }
    
            c.setString("-hmm", new File(assetDir, "/en-us").toString());
            c.setString("-lm", new File(assetDir, "en-us.lm.dmp").toString());
            c.setString("-dict", new File(assetDir, "cmudict-en-us.dict").toString());
            Decoder d = new Decoder(c);
            FileInputStream stream = null;
            URI testwav = null;
            try {
                testwav = new URI("file:" + _wavFileName);
            } catch (URISyntaxException e) {
                e.printStackTrace();
                Log.d("DEBUG", "URI creation failed");
                return;
            }
            try {
                stream = new FileInputStream(new File(testwav));
            } catch (FileNotFoundException e) {
                e.printStackTrace();
                Log.d("DEBUG", "File stream initialization failed");
                return;
            }
            d.startUtt();
            byte[] b = new byte[4096];
            try {
                int nbytes;
                while ((nbytes = stream.read(b)) >= 0) {
                    ByteBuffer bb = ByteBuffer.wrap(b, 0, nbytes);
    
                    // Not needed on desktop but required on android
                    bb.order(ByteOrder.LITTLE_ENDIAN);
    
                    short[] s = new short[nbytes/2];
                    bb.asShortBuffer().get(s);
                    d.processRaw(s, nbytes/2, false, false);
                }
            } catch (IOException e) {
                e.printStackTrace();
                Log.d("DEBUG", "IO Failed");
                return;
            }
            d.endUtt();
            //System.out.println(d.hyp().getHypstr());
            //_results.add(d.hyp().getHypstr());
            SegmentList segments = d.seg();
            for (Segment seg : segments) {
                _results.add(seg.getWord());
            }
        }
    

    This is mostly based on code I pulled from a few sources, adapted a bit to fit our logic.

    Right now I'm getting some strange results, and I think I might be misunderstanding what a segment is conceptually.

    If I say "TEST" into the microphone, I get very strange results.

    The strings currently produced by speaking "TEST" are "al opt out". Is that a good sign that the dictionary is off? Alternatively, do you notice anything obviously wrong with our code?
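
    One thing I haven't ruled out is the header itself. A quick sanity check like this (a sketch; the offsets follow the standard 44-byte RIFF layout) would confirm what the file actually claims about channels, sample rate, and sample width:

        // Sketch: read the format fields straight out of a standard
        // 44-byte WAV header.
        static void dumpWavFormat(File wav) throws IOException {
            byte[] header = new byte[44];
            try (DataInputStream in =
                    new DataInputStream(new FileInputStream(wav))) {
                in.readFully(header);
            }
            ByteBuffer b = ByteBuffer.wrap(header).order(ByteOrder.LITTLE_ENDIAN);
            Log.d("DEBUG", "channels=" + b.getShort(22)
                    + " rate=" + b.getInt(24)
                    + " bits=" + b.getShort(34));
        }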

    Thanks again,
    Chris F

     

    Last edit: Christopher Farnsworth 2016-01-26
    • Nickolay V. Shmyrev

      I'm sorry, you write about the microphone many times, but you are actually recognizing from a file. How is the microphone related to this?

      It would be better to provide logcat messages so I can understand what is going on.

      Large-vocabulary recognition on an Android phone is going to be slow, and I doubt you'll be able to do it in real time; you will need to restrict the vocabulary somehow.
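
      For example, a keyphrase or grammar search can replace the full language model. A sketch (the search names and grammar file here are placeholders):

          // Sketch: register narrower searches and activate one of them.
          d.setKeyphrase("kws", "hello phone");
          d.setJsgfFile("commands", new File(assetDir, "commands.gram").toString());
          d.setSearch("kws");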

       
  • Christopher Farnsworth

    Sorry, I didn't mean to confuse things - we originally capture data from the mic, save it to a file, and pass that file to Sphinx. I should have said "we provide a file containing the spoken phrase 'hello'."

    I'm attaching the logcat output from a recent run in case it helps. That said, since recognition accuracy has greatly improved, I think we may simply be working with too limited a dictionary. I'm pretty sure we're still using the dictionary from the pocketsphinx demo, if that helps.

    We recently turned the sample rate we capture audio at down from 44100 Hz to 16000 Hz and saw dramatic improvements. It seems we may be missing some words in our dictionary. Since we're producing a transcript, I'd prefer to limit the vocabulary as little as possible.
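
    (For anyone with the same problem: on our side this amounted to changing the AudioRecord parameters, roughly like the sketch below; it is not our exact code.)

        // Sketch: capture 16 kHz, 16-bit mono PCM, matching what the
        // default en-us acoustic model expects.
        int rate = 16000;
        int minBuf = AudioRecord.getMinBufferSize(rate,
                AudioFormat.CHANNEL_IN_MONO, AudioFormat.ENCODING_PCM_16BIT);
        AudioRecord recorder = new AudioRecord(MediaRecorder.AudioSource.MIC,
                rate, AudioFormat.CHANNEL_IN_MONO, AudioFormat.ENCODING_PCM_16BIT,
                minBuf * 4);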

    The data I've attached is from a test run in which I gave it a .wav file containing the spoken words "testing hello testing" with a brief pause between each word. I got back the results "test hello for test". This is really close to accurate; I just think perhaps "testing" is missing from the dictionary.
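
    (From what I've read, the .dict file is plain text, one word per line followed by its phones, something like "testing T EH S T IH NG", so a missing word can be added by hand. That transcription is my guess at a CMUdict-style entry; I haven't verified it.)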

    Real-time recognition isn't a requirement. We're fine with running the recognition as a background service and notifying the user when it finishes.

    Sorry for any confusion my last post caused, and thanks again for all your help. I know this has been a long process, and I know we're just one of many small teams you're assisting. I really appreciate it.

     
  • Leo

    Leo - 2016-02-17

    Hello everyone,

    I have to run some tests using pocketsphinx for a project of mine. I want to give it .wav files as input, but I'm having problems defining the path of a .wav file (to start, I'm experimenting with just a single wav file). I always get a FileNotFoundException. I have tried different ways to access it (from the external directory, from the assets directory, etc.), but none work. Could anyone please tell me how to get the path of a wav file?

    Thank you very much!
    Leutrim

     
    • Nickolay V. Shmyrev

      Hello Leutrim

      First of all, please avoid hijacking old, unrelated threads. If you have a question, start a new thread.

      You can explore the device filesystem and find the file with adb shell. See more details here:

      http://developer.android.com/tools/help/shell.html
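
      As a sketch of checking the same thing programmatically (getExternalFilesDir is the standard Android API for a per-app directory; the filename is a placeholder):

          // Sketch: resolve a runtime-valid path under the app's external
          // files directory and verify it exists before opening it.
          File wav = new File(context.getExternalFilesDir(null), "test.wav");
          if (!wav.exists()) {
              Log.d("DEBUG", "Not found: " + wav.getAbsolutePath());
          }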

       
  • Murtuza Vhora

    Murtuza Vhora - 2017-05-23

    Hello,
    I am also trying to transcribe an audio file using similar code, and I am getting the same logs that "Christopher Farnsworth - 2015-11-13" got. Is there another way to transcribe an audio file, or has someone gotten results with the code discussed above?

     
    • Nickolay V. Shmyrev

      You need to share the log you get.

       
