I wanted to create an application in Objective-C where
1. I will give it a few audio clips.
2. It will spit out the transcripts.
The obvious choice is to go with PocketSphinx, but the problem is accuracy. I ran a few tests.
First I found some clips (16-bit, 16 kHz, mono, little-endian) online and tried them. Then I tried mine. The online clips were nicely recognized, but mine were really bad.
Results:

Clip 1: Found Online, Native Speaker
Original Script : once there was a young rat named arthur who never could make up his mind
PocketSphinx : once there was a young rat named arthur who never could make up his mind
Accuracy: Fantastic

Clip 2: Found Online, Native Speaker
Original Script : whenever his friends ask him if he would like to go with them
PocketSphinx : whenever his friends ask him if you would like to call with them
Accuracy: Very Good

Clip 3: Found Online, Native Speaker
Original Script : he would only answer i don't know. He wouldn't say yes or no either
PocketSphinx : you would only answer i don't know what you wouldn't say yes or no either
Accuracy: Very Good

Clip 4: YouTube, Native Speaker
Original Script : let's talk about merge sort. So far you've seen bubble sort, insertion sort and selection sort. Although, I kind of wave my hand at what I mean by better, merge sort generally performs better than any of these three sorts.
PocketSphinx : let's talk about words were so far are you see all story user shoes or in selections or although all kind of waive my hand what i mean i'd better words are generally performs better in any of these resorts
Accuracy: Lol

Clip 5: Mine, Non-native Speaker
Original Script : Over the years, there have been many frameworks using JavaScript to create iOS applications. So what makes React Native special?
PocketSphinx : or three years the army navy frame looks using jobless rate to create a pilot had editions so what makes recreate in spaceship
Accuracy: Lol
Based on the results above, I thought something was wrong with my audio format, but all files return the following when I type $ file filename.wav:
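As an aside, the header fields that `file` reports can also be checked programmatically. A minimal sketch in Python's standard `wave` module (the function name and file path are my own, not anything from PocketSphinx):

```python
# Sanity-check that a clip matches the format PocketSphinx expects:
# 16-bit samples, 16 kHz sample rate, mono.
import wave

def matches_pocketsphinx_format(path):
    with wave.open(path, "rb") as w:
        return (w.getframerate() == 16000    # 16 kHz sample rate
                and w.getsampwidth() == 2    # 2 bytes = 16-bit samples
                and w.getnchannels() == 1)   # mono
```

Note that a passing check only confirms the container header; as discussed below, a clip re-encoded through a lossy codec can still decode poorly even when the WAV header looks right.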
So, without any clue, I thought I should either train PocketSphinx to understand my accent better (which I have no clue how to do) or limit PocketSphinx's vocabulary by giving it the transcript of my audio clips. The second option has to be done at runtime, but I am not aware of any API that creates a model dynamically.
What is the best way to achieve what I am after?
What is wrong with Clip 5 compared to Clip 1? Clip 1 has noise, yet it is better recognized.
If I have a transcript of my speech, is there a way I could feed both the audio and the transcript into the system so that I could get better accuracy?
Appreciate your time, guys. Thank you very much.
Last edit: Bavan Palan 2016-06-13
Clip 4: YouTube, Native Speaker

This one was compressed heavily with a codec, so the audio was corrupted. We are not that great on compressed audio yet.
Clip 5: Mine, Non native Speaker
Yes, our models are not great for non-native speakers.
A specialized topic contributes to the accuracy in the two samples above as well. The generic model is biased toward simpler language.
The second option has to be done at runtime, but I am not aware of any api that create a model dynamically.
It was discussed on the list; the API is simple but not implemented yet. You would have to reimplement the quick_lm.pl Perl script yourself. You also need to integrate a g2p component to assign pronunciations to unknown words.
Brilliant explanation. Thank you so much, Nickolay.