
Using file as input data for pocketsphinx

2015-11-09 – 2016-02-18
  • Christopher Farnsworth

    Hello,

    I'm a student working on a final project for my degree with a fellow alum of mine. We are developing an Android application that needs to run speech-to-text on audio from input files. We're still researching exactly how we'll go about this, and came across the pocketsphinx library as a potential solution.

    I read through the pocketsphinx Android demo application, which appears to capture speech data directly from the microphone.

    The issue is that our app already uses the microphone, via an AudioRecord, to record what the user says for the transcript. That ties up the mic, so we can't also send a live capture to Google's standard SpeechRecognizer.

    My question, then: does pocketsphinx on Android have the capability to take a byte stream or something similar, so that we can pass speech files to the pocketsphinx API instead of a capture from the device's microphone? Has this been done before, or, if not, could we easily override the existing implementation of the pocketsphinx Android SpeechRecognizer?
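
    Conceptually, what we're hoping exists is something along these lines (just a sketch of the Decoder API from the SWIG bindings as we understand it, not verified code):

        // Sketch: feed raw 16-bit PCM samples read from a file to the
        // decoder instead of the microphone. Assumes "config" points at a
        // valid acoustic model, LM, and dictionary, and that "samples" is
        // a short[] of 16 kHz mono audio.
        Decoder d = new Decoder(config);
        d.startUtt();
        d.processRaw(samples, samples.length, false, false);
        d.endUtt();
        String text = d.hyp().getHypstr();  // hyp() may be null if nothing was recognized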

    Thanks for any help or suggestions you'd be willing to offer.

     
  • Christopher Farnsworth

    Thank you! This looks like exactly what we had in mind.

     
  • Christopher Farnsworth

    I'm not sure if I should move this into another topic, but I was hoping somebody wouldn't mind taking a look at our approach. I downloaded pocketsphinx for Android and the demo app and was able to get through a good chunk of it on my own, but I've been stuck for a day or so.

    We capture audio as PCM, so I first convert that data to WAV.
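
    For reference, the conversion just prepends the standard 44-byte RIFF/WAVE header to the raw samples. A stripped-down sketch of that step (not our exact code; sampleRate is a placeholder, and the audio is assumed to be 16-bit mono little-endian PCM):

        // Sketch: wrap raw 16-bit mono little-endian PCM in a standard
        // 44-byte RIFF/WAVE header.
        static void pcmToWav(File pcmFile, File wavFile, int sampleRate)
                throws IOException {
            byte[] pcm = new byte[(int) pcmFile.length()];
            try (DataInputStream in =
                    new DataInputStream(new FileInputStream(pcmFile))) {
                in.readFully(pcm);
            }
            ByteBuffer h = ByteBuffer.allocate(44).order(ByteOrder.LITTLE_ENDIAN);
            h.put("RIFF".getBytes()).putInt(36 + pcm.length).put("WAVE".getBytes());
            h.put("fmt ".getBytes()).putInt(16);  // fmt chunk, 16 bytes follow
            h.putShort((short) 1);                // audio format: PCM
            h.putShort((short) 1);                // channels: mono
            h.putInt(sampleRate);                 // sample rate
            h.putInt(sampleRate * 2);             // byte rate: rate * 1 ch * 2 bytes
            h.putShort((short) 2);                // block align
            h.putShort((short) 16);               // bits per sample
            h.put("data".getBytes()).putInt(pcm.length);
            try (FileOutputStream out = new FileOutputStream(wavFile)) {
                out.write(h.array());
                out.write(pcm);
            }
        }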

    I use this code to set up the Decoder:

        public void translate() {
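            // Imports assumed: edu.cmu.pocketsphinx (Assets, Config, Decoder,
            // Hypothesis, Segment), java.io.*, java.net.URI,
            // java.net.URISyntaxException, java.nio.ByteBuffer, java.nio.ByteOrder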
            Config c = Decoder.defaultConfig();
            Assets assets = null;
            File assetDir = null;
            try {
                assets = new Assets(_context);
                assetDir = assets.syncAssets();
                Log.d("DEBUG", assetDir.toString());
            } catch (IOException e) {
                Log.d("DEBUG", "Assets couldn't be created");
                e.printStackTrace();
                return;
            }
    
            c.setString("-hmm", new File(assetDir, "/en-us").toString());
            c.setString("-lm", new File(assetDir, "en-us.lm.dmp").toString());
            c.setString("-dict", new File(assetDir, "cmudict-en-us.dict").toString());
            Decoder d = new Decoder(c);
            FileInputStream stream = null;
            URI testwav = null;
            try {
                testwav = new URI("file:" + _wavFileName);
            } catch (URISyntaxException e) {
                e.printStackTrace();
                Log.d("DEBUG", "URI broke");
                return;
            }
            try {
                stream = new FileInputStream(new File(testwav));
            } catch (FileNotFoundException e) {
                e.printStackTrace();
                Log.d("DEBUG", "File stream broke");
                return;
            }
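            // Note: this loop feeds the whole file to the decoder, including
            // the 44-byte WAV header, which gets treated as audio. Skipping it
            // first (e.g. stream.skip(44)) may be safer.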
            d.startUtt();
            byte[] b = new byte[4096];
            try {
                int nbytes;
                while ((nbytes = stream.read(b)) >= 0) {
                    ByteBuffer bb = ByteBuffer.wrap(b, 0, nbytes);
    
                    // Not needed on desktop but required on android
                    bb.order(ByteOrder.LITTLE_ENDIAN);
    
                    short[] s = new short[nbytes/2];
                    bb.asShortBuffer().get(s);
                    d.processRaw(s, nbytes/2, false, false);
                }
            } catch (IOException e) {
                e.printStackTrace();
                Log.d("DEBUG", "io broken");
                return;
            }
            d.endUtt();
            Hypothesis hyp = d.hyp();  // may be null if nothing was recognized
            if (hyp != null) {
                System.out.println(hyp.getHypstr());
            }
            for (Segment seg : d.seg()) {
                _results.add(seg.getWord());
            }
        }
    

    When we call translate() after recording our PCM and converting to a WAV file, we get this logcat output:

    11-13 13:38:34.606  30220-30220/test.stagecoach I/Assets﹕ Skipping asset cmudict-en-us.dict: checksums are equal
    11-13 13:38:34.606  30220-30220/test.stagecoach I/Assets﹕ Skipping asset en-us/noisedict: checksums are equal
    11-13 13:38:34.606  30220-30220/test.stagecoach I/Assets﹕ Skipping asset en-us/means: checksums are equal
    11-13 13:38:34.606  30220-30220/test.stagecoach I/Assets﹕ Skipping asset en-us/feat.params: checksums are equal
    11-13 13:38:34.606  30220-30220/test.stagecoach I/Assets﹕ Skipping asset en-us/README: checksums are equal
    11-13 13:38:34.606  30220-30220/test.stagecoach I/Assets﹕ Skipping asset en-us/sendump: checksums are equal
    11-13 13:38:34.606  30220-30220/test.stagecoach I/Assets﹕ Skipping asset en-us/variances: checksums are equal
    11-13 13:38:34.606  30220-30220/test.stagecoach I/Assets﹕ Skipping asset en-us/transition_matrices: checksums are equal
    11-13 13:38:34.606  30220-30220/test.stagecoach I/Assets﹕ Skipping asset en-us.lm.dmp: checksums are equal
    11-13 13:38:34.606  30220-30220/test.stagecoach I/Assets﹕ Skipping asset en-us/mdef: checksums are equal
    11-13 13:38:34.616  30220-30220/test.stagecoach D/DEBUG﹕ /storage/emulated/0/Android/data/test.stagecoach/files/sync
    11-13 13:38:34.616  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: pocketsphinx.c(145): Parsed model-specific feature parameters from /storage/emulated/0/Android/data/test.stagecoach/files/sync/en-us/feat.params
    11-13 13:38:34.666  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: feat.c(715): Initializing feature stream to type: '1s_c_d_dd', ceplen=13, CMN='current', VARNORM='no', AGC='none'
    11-13 13:38:34.666  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: cmn.c(143): mean[0]= 12.00, mean[1..12]= 0.0
    11-13 13:38:34.666  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: acmod.c(164): Using subvector specification 0-12/13-25/26-38
    11-13 13:38:34.666  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: mdef.c(518): Reading model definition: /storage/emulated/0/Android/data/test.stagecoach/files/sync/en-us/mdef
    11-13 13:38:34.666  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: mdef.c(531): Found byte-order mark BMDF, assuming this is a binary mdef file
    11-13 13:38:34.666  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: bin_mdef.c(336): Reading binary model definition: /storage/emulated/0/Android/data/test.stagecoach/files/sync/en-us/mdef
    11-13 13:38:34.736  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: bin_mdef.c(516): 42 CI-phone, 137053 CD-phone, 3 emitstate/phone, 126 CI-sen, 5126 Sen, 29324 Sen-Seq
    11-13 13:38:34.736  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: tmat.c(206): Reading HMM transition probability matrices: /storage/emulated/0/Android/data/test.stagecoach/files/sync/en-us/transition_matrices
    11-13 13:38:34.736  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: acmod.c(117): Attempting to use PTM computation module
    11-13 13:38:34.746  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: ms_gauden.c(198): Reading mixture gaussian parameter: /storage/emulated/0/Android/data/test.stagecoach/files/sync/en-us/means
    11-13 13:38:34.766  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: ms_gauden.c(292): 42 codebook, 3 feature, size:
    11-13 13:38:34.766  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: ms_gauden.c(294):  128x13
    11-13 13:38:34.766  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: ms_gauden.c(294):  128x13
    11-13 13:38:34.766  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: ms_gauden.c(294):  128x13
    11-13 13:38:34.766  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: ms_gauden.c(198): Reading mixture gaussian parameter: /storage/emulated/0/Android/data/test.stagecoach/files/sync/en-us/variances
    11-13 13:38:34.786  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: ms_gauden.c(292): 42 codebook, 3 feature, size:
    11-13 13:38:34.786  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: ms_gauden.c(294):  128x13
    11-13 13:38:34.786  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: ms_gauden.c(294):  128x13
    11-13 13:38:34.786  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: ms_gauden.c(294):  128x13
    11-13 13:38:34.856  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: ms_gauden.c(354): 222 variance values floored
    11-13 13:38:34.856  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: ptm_mgau.c(476): Loading senones from dump file /storage/emulated/0/Android/data/test.stagecoach/files/sync/en-us/sendump
    11-13 13:38:34.856  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: ptm_mgau.c(500): BEGIN FILE FORMAT DESCRIPTION
    11-13 13:38:34.856  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: ptm_mgau.c(563): Rows: 128, Columns: 5126
    11-13 13:38:34.856  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: ptm_mgau.c(595): Using memory-mapped I/O for senones
    11-13 13:38:34.866  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: ptm_mgau.c(835): Maximum top-N: 4
    11-13 13:38:34.866  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: phone_loop_search.c(114): State beam -225 Phone exit beam -225 Insertion penalty 0
    11-13 13:38:34.986  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: dict.c(320): Allocating 137526 * 20 bytes (2686 KiB) for word entries
    11-13 13:38:34.986  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: dict.c(333): Reading main dictionary: /storage/emulated/0/Android/data/test.stagecoach/files/sync/cmudict-en-us.dict
    11-13 13:38:35.776  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: dict.c(213): Allocated 1007 KiB for strings, 1662 KiB for phones
    11-13 13:38:35.776  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: dict.c(336): 133425 words read
    11-13 13:38:35.776  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: dict.c(358): Reading filler dictionary: /storage/emulated/0/Android/data/test.stagecoach/files/sync/en-us/noisedict
    11-13 13:38:35.776  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: dict.c(213): Allocated 0 KiB for strings, 0 KiB for phones
    11-13 13:38:35.776  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: dict.c(361): 5 words read
    11-13 13:38:35.776  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: dict2pid.c(396): Building PID tables for dictionary
    11-13 13:38:35.776  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: dict2pid.c(406): Allocating 42^3 * 2 bytes (144 KiB) for word-initial triphones
    11-13 13:38:35.896  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: dict2pid.c(132): Allocated 21336 bytes (20 KiB) for word-final triphones
    11-13 13:38:35.906  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: dict2pid.c(196): Allocated 21336 bytes (20 KiB) for single-phone word triphones
    11-13 13:38:35.906  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: ngram_model_trie.c(399): Trying to read LM in trie binary format
    11-13 13:38:35.916  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: ngram_model_trie.c(410): Header doesn't match
    11-13 13:38:35.916  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: ngram_model_trie.c(177): Trying to read LM in arpa format
    11-13 13:38:36.386  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: ngram_model_trie.c(69): No \data\ mark in LM file
    11-13 13:38:36.386  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: ngram_model_trie.c(489): Trying to read LM in DMP format
    11-13 13:38:36.406  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: ngram_model_trie.c(562): ngrams 1=19794, 2=1377200, 3=3178194
    11-13 13:38:58.346  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: lm_trie.c(317): Training quantizer
    11-13 13:39:01.516  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: lm_trie.c(323): Building LM trie
    11-13 13:39:15.806  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: ngram_search_fwdtree.c(99): 788 unique initial diphones
    11-13 13:39:15.816  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: ngram_search_fwdtree.c(148): 0 root, 0 non-root channels, 56 single-phone words
    11-13 13:39:15.816  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: ngram_search_fwdtree.c(186): Creating search tree
    11-13 13:39:15.816  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: ngram_search_fwdtree.c(192): before: 0 root, 0 non-root channels, 56 single-phone words
    11-13 13:39:16.076  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: ngram_search_fwdtree.c(326): after: max nonroot chan increased to 44782
    11-13 13:39:16.076  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: ngram_search_fwdtree.c(339): after: 573 root, 44654 non-root channels, 47 single-phone words
    11-13 13:39:16.076  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: ngram_search_fwdflat.c(157): fwdflat: min_ef_width = 4, max_sf_win = 25
    11-13 13:39:16.076  30220-30220/test.stagecoach I/cmusphinx﹕ INFO: ngram_search_fwdflat.c(302): Utterance vocabulary contains 0 words
    11-13 13:47:08.986    1986-1986/test.stagecoach E/cmusphinx﹕ ERROR: "ngram_search.c", line 1142: Couldn't find <s> in first frame
    

    I suspect we're not setting something up correctly with the libraries, though I'm not exactly sure what.

    If I need to provide more information, just let me know what you'd like to see.

    Thanks for any help!
    -Chris

     
    • Nickolay V. Shmyrev

      Well, most likely it's a problem with the byte endianness of the stream. You need to check whether you are actually feeding in the data. You can also add the -rawlogdir option to store the data the recognizer actually receives, and analyze it.

       
  • Christopher Farnsworth

    Terribly sorry about the long turnaround on my reply.

    We've been hard at work on various parts of our app and the Transcript feature got pushed to the back burner. If you'd like this opened in a new thread, feel free to move it.

    I determined at some point that we were forming our .wav files incorrectly. I've since modified the code that produces them.

    However, I currently see the same error that I did above: that <s> could not be found in the first frame. I believe this has something to do with our dictionaries being malformed.

    I would like to better understand what this error means and how I can correct it. Thanks for all your help so far, Nickolay, and sorry again about taking so long to get back to you.

    EDIT: To clarify, I am now reasonably sure that the .wav files have the correct endianness and the correct file headers. We produce and consume little-endian files. I've seen issues similar to ours that were the result of <s> entries not existing in the dictionary. Also, I'm unsure how to add the -rawlogdir option you mention - is it as simple as adding it to our run configuration? Thanks again!

    -Chris F

     

    Last edit: Christopher Farnsworth 2016-01-22
    • Nickolay V. Shmyrev

      Hello Chris

      It's great that you've come back to this problem.

      However, I currently see the same error that I did above: that <s> could not be found in the first frame. I believe this has something to do with our dictionaries being malformed.

      Well, you need to show your language model and the dictionary then. It is hard to say why <s> is missing. Are you sure it works on desktop with the same models?

      Also, I'm unsure how to add the -rawlogdir option you mention - is it as simple as adding it to our run configuration?

          c.setString("-rawlogdir", new File(assetDir).toString());
      

      should work

       

      Last edit: Nickolay V. Shmyrev 2016-01-23
  • Christopher Farnsworth

    Hello Nickolay,

    Thank you for all the suggestions so far. We managed to get the system working somewhat!

    However, the speech recognition is pretty inaccurate. For instance, if I speak the phrase "Hello phone" into the mic, I sometimes get a correct result, but other times I get "a low bone" or a similar-sounding phrase. With that in mind, I'm attaching the assets directory we're using. Maybe you'll notice if something looks wrong?

    EDIT: I wanted to clarify the problem a bit, so here goes.

    Here's the code that does the decoding. I know our WAV files are intact because they play back with exactly the same audio as the raw PCM (no audible difference), so I think that's a good sign they're in working order.

        public void translate() {
            Config c = Decoder.defaultConfig();
            Assets assets = null;
            File assetDir = null;
            try {
                assets = new Assets(_context);
                assetDir = assets.syncAssets();
                Log.d("DEBUG", assetDir.toString());
            } catch (IOException e) {
                Log.d("DEBUG", "Assets couldn't be created");
                e.printStackTrace();
                return;
            }
    
            c.setString("-hmm", new File(assetDir, "/en-us").toString());
            c.setString("-lm", new File(assetDir, "en-us.lm.dmp").toString());
            c.setString("-dict", new File(assetDir, "cmudict-en-us.dict").toString());
            Decoder d = new Decoder(c);
            FileInputStream stream = null;
            URI testwav = null;
            try {
                testwav = new URI("file:" + _wavFileName);
            } catch (URISyntaxException e) {
                e.printStackTrace();
                Log.d("DEBUG", "URI creation failed");
                return;
            }
            try {
                stream = new FileInputStream(new File(testwav));
            } catch (FileNotFoundException e) {
                e.printStackTrace();
                Log.d("DEBUG", "File stream initialization failed");
                return;
            }
            d.startUtt();
            byte[] b = new byte[4096];
            try {
                int nbytes;
                while ((nbytes = stream.read(b)) >= 0) {
                    ByteBuffer bb = ByteBuffer.wrap(b, 0, nbytes);
    
                    // Not needed on desktop but required on android
                    bb.order(ByteOrder.LITTLE_ENDIAN);
    
                    short[] s = new short[nbytes/2];
                    bb.asShortBuffer().get(s);
                    d.processRaw(s, nbytes/2, false, false);
                }
            } catch (IOException e) {
                e.printStackTrace();
                Log.d("DEBUG", "IO Failed");
                return;
            }
            d.endUtt();
            //System.out.println(d.hyp().getHypstr());
            //_results.add(d.hyp().getHypstr());
            SegmentList segments = d.seg();
            for (Segment seg : segments) {
                _results.add(seg.getWord());
            }
        }
    

    This is mostly based on code I pulled from a few sources, adapted a bit to fit our logic.

    Right now I'm getting some strange results, and I think I might be misunderstanding what a segment is conceptually.

    If I say "TEST" into the microphone, I get very strange results.

    The strings currently produced by speaking "TEST" are "al opt out". Is that a good sign that the dictionary is off? Alternatively, do you notice anything obviously wrong with our code?
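
    One thing I haven't ruled out is the header itself. A quick sanity check like this (a sketch; the offsets follow the standard 44-byte RIFF layout) would confirm what the file actually claims about channels, sample rate, and sample width:

        // Sketch: read the format fields straight out of a standard
        // 44-byte WAV header.
        static void dumpWavFormat(File wav) throws IOException {
            byte[] header = new byte[44];
            try (DataInputStream in =
                    new DataInputStream(new FileInputStream(wav))) {
                in.readFully(header);
            }
            ByteBuffer b = ByteBuffer.wrap(header).order(ByteOrder.LITTLE_ENDIAN);
            Log.d("DEBUG", "channels=" + b.getShort(22)
                    + " rate=" + b.getInt(24)
                    + " bits=" + b.getShort(34));
        }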

    Thanks again,
    Chris F

     

    Last edit: Christopher Farnsworth 2016-01-26
    • Nickolay V. Shmyrev

      I'm sorry, you write about the microphone many times, but you are actually recognizing from a file. How is the microphone related to this?

      It would be better to provide logcat messages so I can understand what is going on.

      Large-vocabulary recognition on an Android phone is going to be slow, and I doubt you'll be able to do it in real time; you will need to restrict the vocabulary somehow.
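
      For example, a keyphrase or grammar search can replace the full language model. A sketch (the search names and grammar file here are placeholders):

          // Sketch: register narrower searches and activate one of them.
          d.setKeyphrase("kws", "hello phone");
          d.setJsgfFile("commands", new File(assetDir, "commands.gram").toString());
          d.setSearch("kws");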

       
  • Christopher Farnsworth

    Sorry, I didn't mean to confuse things - we originally capture data from the mic, save it to a file, and pass that file to Sphinx. I should have said "we provide a file containing the spoken phrase 'hello'."

    I'm attaching the logcat output from a recent run in case it helps. That said, since recognition accuracy has greatly improved, I think we may simply be working with too limited a dictionary. I'm pretty sure we're still using the dictionary from the pocketsphinx demo, if that helps.

    We recently turned the sample rate we capture audio at down from 44100 Hz to 16000 Hz and saw dramatic improvements. It seems we may be missing some words in our dictionary. Since we're producing a transcript, I'd prefer to limit the vocabulary as little as possible.
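
    (For anyone with the same problem: on our side this amounted to changing the AudioRecord parameters, roughly like the sketch below; it is not our exact code.)

        // Sketch: capture 16 kHz, 16-bit mono PCM, matching what the
        // default en-us acoustic model expects.
        int rate = 16000;
        int minBuf = AudioRecord.getMinBufferSize(rate,
                AudioFormat.CHANNEL_IN_MONO, AudioFormat.ENCODING_PCM_16BIT);
        AudioRecord recorder = new AudioRecord(MediaRecorder.AudioSource.MIC,
                rate, AudioFormat.CHANNEL_IN_MONO, AudioFormat.ENCODING_PCM_16BIT,
                minBuf * 4);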

    The data I've attached is from a test run in which I gave it a .wav file containing the spoken words "testing hello testing" with a brief pause between each word. I got back the results "test hello for test". This is really close to accurate; I just think perhaps "testing" is missing from the dictionary.
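
    (From what I've read, the .dict file is plain text, one word per line followed by its phones, something like "testing T EH S T IH NG", so a missing word can be added by hand. That transcription is my guess at a CMUdict-style entry; I haven't verified it.)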

    Real-time recognition isn't a requirement. We're fine with running the recognition as a background service and notifying the user when it finishes.

    Sorry for any confusion my last post caused, and thanks again for all your help. I know this has been a long process, and I know we're just one of many small teams you're assisting. I really appreciate it.

     
  • Leo

    Leo - 2016-02-17

    Hello everyone,

    I have to run some tests using pocketsphinx for a project of mine. I want to give it .wav files as input, but I'm having problems defining the path of a .wav file (to start, I'm experimenting with just a single wav file). I always get a FileNotFoundException. I have tried different ways to access it (from the external directory, from the assets directory, etc.), but none work. Could anyone please tell me how to get the path of a wav file?

    Thank you very much!
    Leutrim

     
    • Nickolay V. Shmyrev

      Hello Leutrim

      First of all, please avoid hijacking old, unrelated threads. If you have a question, start a new thread.

      You can explore the device filesystem and find the file with adb shell. See more details here:

      http://developer.android.com/tools/help/shell.html
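
      As a sketch of checking the same thing programmatically (getExternalFilesDir is the standard Android API for a per-app directory; the filename is a placeholder):

          // Sketch: resolve a runtime-valid path under the app's external
          // files directory and verify it exists before opening it.
          File wav = new File(context.getExternalFilesDir(null), "test.wav");
          if (!wav.exists()) {
              Log.d("DEBUG", "Not found: " + wav.getAbsolutePath());
          }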

       
  • Murtuza Vhora

    Murtuza Vhora - 2017-05-23

    Hello,
    I am also trying to transcribe an audio file using similar code, and I am getting the same logs that "Christopher Farnsworth - 2015-11-13" got. Is there another way to transcribe an audio file, or has someone gotten results with the code discussed above?

     
    • Nickolay V. Shmyrev

      You need to share the log you get.

       
