I wanted to create an application in Objective-C where
1. I will give it a few audio clips.
2. It will spit out the transcripts.
The obvious choice is to go with PocketSphinx, but the problem is accuracy. I ran a few tests.
First I found some clips (16-bit, 16 kHz, mono, little-endian) online and tried them. Then I tried mine. The online clips were nicely recognized, but mine were really bad.
Results:

Clip 1: Found Online, Native Speaker
Original Script : once there was a young rat named arthur who never could make up his mind
PocketSphinx : once there was a young rat named arthur who never could make up his mind
Accuracy: Fantastic

Clip 2: Found Online, Native Speaker
Original Script : whenever his friends ask him if he would like to go with them
PocketSphinx : whenever his friends ask him if you would like to call with them
Accuracy: Very Good

Clip 3: Found Online, Native Speaker
Original Script : he would only answer i don't know. He wouldn't say yes or no either
PocketSphinx : you would only answer i don't know what you wouldn't say yes or no either
Accuracy: Very Good

Clip 4: YouTube, Native Speaker
Original Script : let's talk about merge sort. So far you've seen bubble sort, insertion sort and selection sort. Although, I kind of wave my hand at what I mean by better, merge sort generally performs better than any of these three sorts.
PocketSphinx : let's talk about words were so far are you see all story user shoes or in selections or although all kind of waive my hand what i mean i'd better words are generally performs better in any of these resorts
Accuracy: Lol

Clip 5: Mine, Non-native Speaker
Original Script : Over the years, there have been many frameworks using JavaScript to create iOS applications. So what makes React Native special?
PocketSphinx : or three years the army navy frame looks using jobless rate to create a pilot had editions so what makes recreate in spaceship
Accuracy: Lol
Based on the results above, I thought something was wrong with my audio format, but all files return the following when I type $ file filename.wav:
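As an aside, the header fields that `file` reports can also be checked programmatically. A minimal sketch in Python's standard `wave` module (the function name and file path are my own, not anything from PocketSphinx):

```python
# Sanity-check that a clip matches the format PocketSphinx expects:
# 16-bit samples, 16 kHz sample rate, mono.
import wave

def matches_pocketsphinx_format(path):
    with wave.open(path, "rb") as w:
        return (w.getframerate() == 16000    # 16 kHz sample rate
                and w.getsampwidth() == 2    # 2 bytes = 16-bit samples
                and w.getnchannels() == 1)   # mono
```

Note that a passing check only confirms the container header; as discussed below, a clip re-encoded through a lossy codec can still decode poorly even when the WAV header looks right.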
So, without any clue, I thought I should either train PocketSphinx to understand my accent better (which I have no clue how to do) or limit PocketSphinx's vocabulary by giving it the transcript of my audio clips. The second option has to be done at runtime, but I am not aware of any API that creates a model dynamically.
What is the best way to achieve what I am after?
What is wrong with Clip 5 compared to Clip 1? Clip 1 has noise, yet it is better recognized.
If I have a transcript of my speech, is there a way I could feed both the audio and the transcript into the system so that I could get better accuracy?
Appreciate your time, guys. Thank you very much.
Last edit: Bavan Palan 2016-06-13
Clip 4: YouTube, Native Speaker

This one was compressed heavily with a codec, so the audio was corrupted. We are not that great on compressed audio yet.
Clip 5: Mine, Non native Speaker
Yes, our models are not great for non-native speakers.
A specialized topic contributes to the accuracy in the two samples above as well. The generic model is biased toward simpler language.
The second option has to be done at runtime, but I am not aware of any api that create a model dynamically.
It was discussed on the list; the API is simple but not implemented yet. You would have to reimplement the quick_lm.pl Perl script yourself. You also need to integrate a g2p component to assign pronunciations to unknown words.
Brilliant explanation. Thank you so much, Nickolay.