Hi everyone,
I'm still getting used to Sphinx; it's a very well organized piece of software. I'm evaluating its potential for transcribing speech to text (for something like automatic closed captioning). I've copied the transcriber demo and modified it a bit. I'm trying to transcribe the audio file found here:
http://www.thegoleffect.com/sphinx4/peterRabbit1_humanread.wav
I've also been trying to use some TTS generated audio files with better success. If anyone would like those files, please let me know.
The audio on the file should read:
"Once upon a time there were four little rabbits. [and] Their names were Flopsy, Mopsy, Cotton-tail, and Peter."
Currently, the program (VansDecoder.jar) gives the following results:
"one second and are west lower little rather just
very at shade lot see cone table and to third
middle not and a rear to further it between"
The rest of the files (including the code) is available here:
http://www.thegoleffect.com/sphinx4/
I've been changing the configuration file a lot and trying different things, beam widths and whatnot, but I haven't had much luck. I'm not sure which settings I can change to maximize the level of recognition from Sphinx. Should I customize a JSGF grammar for the story? Customize a language model? Retrain the acoustic model 0_0? Any help is greatly appreciated. Thanks for your time.
Best Regards,
Van Nguyen
Hello Van,
I could not find your files from the link you provided! All the links lead to this error:
The requested URL /sphinx4/pass6/Pg4Sent2.txt was not found on this server.
Additionally, a 404 Not Found error was encountered while trying to use an ErrorDocument to handle the request.
I would be grateful if you could upload those files again.
Regards
Maks
You need another language model and another dictionary. You can create basic ones with the online lmtool:
http://www.speech.cs.cmu.edu/tools/lmtool.html
and for a more advanced one you'll need cmuclmtk. The other tuning change is setting the word insertion probability (wip) to 0.7:
<property name="wordInsertionProbability" value="0.7"/>
Try these changes first, please.
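For reference, here is a sketch of where such settings live in a Sphinx4 XML configuration file. The component names (`dictionary`, `trigramModel`, `lexTreeLinguist`) follow the stock demo configs and may differ in yours; `my.dic` and `my.lm` are placeholder names for the lmtool output files:

```xml
<!-- my.dic / my.lm are placeholders for the lmtool-generated files -->
<component name="dictionary"
           type="edu.cmu.sphinx.linguist.dictionary.FastDictionary">
    <property name="dictionaryPath" value="my.dic"/>
    <!-- other properties unchanged -->
</component>
<component name="trigramModel"
           type="edu.cmu.sphinx.linguist.language.ngram.SimpleNGramModel">
    <property name="location" value="my.lm"/>
    <!-- other properties unchanged -->
</component>
<component name="lexTreeLinguist"
           type="edu.cmu.sphinx.linguist.lextree.LexTreeLinguist">
    <property name="wordInsertionProbability" value="0.7"/>
    <!-- other properties unchanged -->
</component>
```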
Your file is encoded at 44100 Hz; once you convert it to 16 kHz, as it should be, you'll get this with the above config:
RESULT: once upon a time there were four little rabbits
RESULT: and then names were
RESULT: got thing mopsy
RESULT: cottontail and peter
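For the 44100 Hz to 16 kHz conversion itself, a tool such as sox is the usual route (e.g. `sox in.wav -r 16000 out.wav`). Purely as an illustration of what that conversion does, here is a naive linear-interpolation down-sampler over raw PCM samples. This is a sketch, not production resampling: it applies no low-pass filter, so it can alias.

```java
/** Naive PCM sample-rate converter: linear interpolation, no anti-alias filter. */
public class NaiveResampler {
    public static short[] resample(short[] in, int srcRate, int dstRate) {
        int outLen = (int) ((long) in.length * dstRate / srcRate);
        short[] out = new short[outLen];
        double step = (double) srcRate / dstRate;    // source samples per output sample
        for (int i = 0; i < outLen; i++) {
            double pos = i * step;
            int j = (int) pos;                       // left neighbour
            double frac = pos - j;
            int k = Math.min(j + 1, in.length - 1);  // right neighbour, clamped
            out[i] = (short) Math.round(in[j] * (1 - frac) + in[k] * frac);
        }
        return out;
    }

    public static void main(String[] args) {
        short[] src = new short[44100];              // one second at 44.1 kHz
        short[] dst = NaiveResampler.resample(src, 44100, 16000);
        System.out.println(dst.length);              // one second at 16 kHz -> 16000
    }
}
```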
Hello Mr. Shmyrev,
Thank you for your super fast response :O.
I was able to use the lmtool to get a language model (0980.lm) and a new dictionary (0980.dic). I modified the config to use those new files, as well as the wip modification. Updated files are in the following folder:
http://www.thegoleffect.com/sphinx4/pass2/
Updated Results:
whom who a thing a long
hung nothing looked
11:50.891 INFO wordPruningSearchMa Average Tokens/State: 695
whom whom on out
whom hung
whom who
could hoeing
whom hide
Did I use the new files incorrectly? Thanks for your help!!!!!!
Best Regards,
Van Nguyen
And once you adjust the beams a bit, the result will be precise.
Wow, the result is impeccable with those settings. That was thrilling :D. But it only works for that file. I have a larger file containing the one above:
http://www.thegoleffect.com/sphinx4/pass3/peterRabbit_gutenberg_part1.wav
But it is a very long file, so here is a cut portion that follows the file given earlier:
http://www.thegoleffect.com/sphinx4/pass3/peterRabbit1_humanread2.wav
How do you determine the values to configure the beams for each audio file? What if I have to work with a different set with a different speaker?
Regardless, I'm thoroughly impressed with Sphinx4 now :-D. Thank you very much for your time, again and again.
Best Regards,
Van Nguyen
Hello again,
It would be interesting to know whether Sphinx is capable of printing out time stamps for the words in the file, for the sake of building a karaoke-like system. How does Sphinx determine word boundaries? Very fascinating stuff! :D Thank you for your help!
Best Regards,
Van Nguyen
Hello,
If I try to run VansDecoder, even when it gives correct results, the program just stalls at the end no matter how long I let it run. Is there some kind of infinite loop going on? An error in my code? I'm not sure :(.
Best Regards,
Van Nguyen
> How do you determine the values to configure the beams for each audio file? What if I have to work with a different set with a different speaker?
Beams are actually not the most important thing; they only restrict the search space and affect the speed of recognition. You can select them so that the system is more or less precise and runs in a reasonable time.
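For context, beam widths are usually set on the active-list factory in the Sphinx4 XML config. A sketch, assuming the stock demo layout (component names and values here are illustrative, not a recommendation):

```xml
<!-- Hypothetical snippet; the stock demo configs name this component "activeList" -->
<component name="activeList"
           type="edu.cmu.sphinx.decoder.search.PartitionActiveListFactory">
    <!-- wider beams: slower but more thorough search -->
    <property name="absoluteBeamWidth" value="10000"/>
    <property name="relativeBeamWidth" value="1E-80"/>
    <property name="logMath" value="logMath"/>
</component>
```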
> It would be interesting to know whether Sphinx is capable of printing out time stamps for the words in the file, for the sake of building a karaoke-like system. How does Sphinx determine word boundaries?
Sure, you can use Result.getTimedBestResult() in your Java code. If you are specifically looking at the task of aligning speech with existing text, say for subtitles, this task is called "forced alignment" and can be done a little bit differently.
> If I try to run VansDecoder, even when it gives correct results, the program just stalls at the end no matter how long I let it run
If you are using the latest trunk, it's a known bug, not solved yet. The released version should perform better.
> Sure, you can use Result.getTimedBestResult() in your Java code.
Using the following code, the results of Result.getTimedBestResult() get printed as blank lines. I tried different combinations of parameters, but it seems to be blank no matter what. Did I do something wrong? I wrote the code based on the javadoc, but it's not working right :(.
Result result = recognizer.recognize();
if (result != null) {
    String resultText = result.getBestResultNoFiller();
    String timedResult = result.getTimedBestResult(false, true);
    System.out.println(resultText);
    System.out.println(timedResult);
    unitTestBuffer.add(result);
} else {
    done = true;
}
> this task is called "forced alignment" and can be done a little bit differently.
If I want to do forced alignment, should I still be using Sphinx4? I'll dig into FA some more, thanks for the name! I wouldn't have found it on my own! Thanks again :D
> If you are using the latest trunk, it's a known bug, not solved yet. The released version should perform better.
Awesome, phew. I thought I had put in an infinite loop or something. Thanks for the peace of mind.
Best Regards,
Van Nguyen
Is there more I have to add or do to get result.getTimedBestResult(boolean, boolean) to work? I've read the javadocs, tried different combinations, and used the Eclipse debugger. I'm not good enough with Java to tell 0_0. Any ideas? :-\
Thanks for your help, as always! :D
It works with the WavFile demo; it seems to depend on the decoder, grammar, or other bits of the config. A closer investigation will take more time.
Hi,
By changing the activeList setup from the activeListManager (with factories) to one based on the activeList from the WavFile demo, the results I got were:
once(0.54,0.86) upon(0.86,0.91) a(0.91,1.5) time(1.5,1.71) there(1.89,2.03) were(2.03,2.37) four(2.37,2.61) little(2.61,-1.0) rabbits(-1.0,-1.0)
and(3.91,4.06) then(4.06,4.41) names(4.41,-1.0) were(-1.0,-1.0)
flopsy(6.18,6.45) mopsy(-1.0,-1.0)
cottontail(8.23,8.79) and(8.79,-1.0) peter(-1.0,-1.0)
The numeric results are somewhat inaccurate (especially looking at some of the -1.0's), so more experimentation needs to be done before it's perfected. Nickolay, do you have any intuition about why this is happening? Please let me know.
I have uploaded a copy of my files here:
http://www.thegoleffect.com/sphinx4/pass4
Thanks for your time.
Best Regards,
Van Nguyen
Also, as a happy side-effect, the program no longer hangs upon completion.
If you treat the first number in parentheses as the start, in seconds, of the next word in the list, it's pretty close to the right timestamps. I suppose the "-1.0"s could just be ignored that way: skip straight to a non-negative number to get the right timestamp. Is that how it's supposed to work?
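As a sketch of that interpretation, a hypothetical helper (assuming the `word(start,end)` text format shown above) could parse the timed output and drop the -1.0 placeholders:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Parses "word(start,end)" tokens from getTimedBestResult-style output,
 *  skipping entries whose start time is the -1.0 placeholder. */
public class TimedResultParser {
    // e.g. once(0.54,0.86) -- word, then start and end times in parentheses
    private static final Pattern TOKEN =
            Pattern.compile("(\\S+)\\(([-\\d.]+),([-\\d.]+)\\)");

    public static List<String> wordsWithValidStart(String timed) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(timed);
        while (m.find()) {
            double start = Double.parseDouble(m.group(2));
            if (start >= 0) {                  // ignore -1.0 placeholders
                out.add(m.group(1) + "@" + start);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "once(0.54,0.86) little(2.61,-1.0) rabbits(-1.0,-1.0)";
        System.out.println(wordsWithValidStart(line));  // [once@0.54, little@2.61]
    }
}
```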
I forgot to mention:
I modified Result.java like so:
private String getTimedWordPath(Token token, boolean wantFiller) {
    StringBuffer sb = new StringBuffer();
I put comments around the emitting-token chunk so that things work out properly. If I call getTimedBestResult with wordTokenFirst set to false, nothing is printed out. I haven't figured that out yet.
To fix the wordTokenFirst issue, I added the line with the comment "VL's Addition".
private String getTimedWordTokenLastPath(Token token, boolean wantFiller) {
    StringBuffer sb = new StringBuffer();
    Word word = null;
    Data lastFeature = null;
    Data lastWordFirstFeature = null;
Using this method results in highly accurate time stamps.
I haven't tested any corner cases, but this stuff might be useful to add to the Sphinx4 code base, I suppose.
Best Regards,
Van Nguyen
Hello,
I used lmtool to make a new dictionary and language model but when I plug them in, I get the following error:
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 0
at edu.cmu.sphinx.linguist.lextree.HMMTree.collectEntryAndExitUnits(HMMTree.java:198)
at edu.cmu.sphinx.linguist.lextree.HMMTree.compile(HMMTree.java:152)
at edu.cmu.sphinx.linguist.lextree.HMMTree.<init>(HMMTree.java:73)
at edu.cmu.sphinx.linguist.lextree.LexTreeLinguist.generateHmmTree(LexTreeLinguist.java:366)
at edu.cmu.sphinx.linguist.lextree.LexTreeLinguist.compileGrammar(LexTreeLinguist.java:353)
at edu.cmu.sphinx.linguist.lextree.LexTreeLinguist.allocate(LexTreeLinguist.java:279)
at edu.cmu.sphinx.decoder.search.SimpleBreadthFirstSearchManager.allocate(SimpleBreadthFirstSearchManager.java:567)
at edu.cmu.sphinx.decoder.AbstractDecoder.allocate(AbstractDecoder.java:66)
at edu.cmu.sphinx.recognizer.Recognizer.allocate(Recognizer.java:158)
at demo.sphinx.vansdecoder.VansDecoder.main(VansDecoder.java:64)
Does anyone have a clue what's causing it? My current code can be found here:
http://www.thegoleffect.com/sphinx4/pass5
I appreciate any assistance you can offer. Thanks for your time.
Best Regards,
Van Nguyen
ArrayIndex problem fixed. For now.
Okay, so current status:
I have a new recorded audio sample. I've put all the necessary files here:
http://www.thegoleffect.com/sphinx4/pass6
It's set up using a custom dictionary/LM from lmtool, specific to this file only. The peterrabbit.* files are general ones for the entire corpus, but they don't work as well.
The audio file says:
"She went through the woods to the bakery where she bought some bread."
Sphinx4 thinks it says:
"she went through the woods to to bakery where she bought to bread"
What can I do to get that last inkling of accuracy? I've tried a variety of beam settings and other settings. Do I need to change my search module? My linguist? I have no clue. It feels like I'm making progress though! hehehe
Thanks for reading!
Best Regards,
Van Nguyen
3) Can you please try with the old ActiveListManager? It probably will not dump times, but it will be more precise.
1) Can you please post the changes you've made so far as patches, with some description of the solution? It would make sense to commit them into trunk.
2) What sentences did you generate the language model from?
> 2) What sentences did you generate the language model from?
for the general peterrabbit.*
http://www.thegoleffect.com/sphinx4/pass6/PeterRabbitSimplified.txt
for the Pg4Sent2.wav file:
http://www.thegoleffect.com/sphinx4/pass6/Pg4Sent2.txt
> 1) Can you please post the changes you've made so far as patches, with some description of the solution? It would make sense to commit them into trunk.
What do I do to create a patch I can submit? I have a lot of extraneous stuff, but I would love to contribute the bug fixes and help others out. I have zero experience with patches, though.
> 3) Can you please try with the old ActiveListManager? It probably will not dump times, but it will be more precise.
Ah, that's a good point. Thanks! I had abandoned the ALM, but I could probably use both separately to get accuracy and times, with some subtle but (mostly) acceptable mismatches.
I know it's possible to get the pronunciations (phonemes) printed out via Sphinx, but is there a way to get those AND time stamps? That would be nice. Is that functionality included in Sphinx4 at the moment, or is that something I would have to put together?
As always, thanks for your help, Nickolay! I hope to learn from your example :D.
Best Regards,
Van Nguyen
Is there a way to have lmtool create batches of dictionaries and language models? Or to have it run locally for scripting purposes?
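(For local, scriptable language-model building, the cmuclmtk toolkit mentioned earlier provides a command-line pipeline. A sketch, assuming cmuclmtk is installed and `corpus.txt` is your sentence list; exact flags may vary by version:)

```shell
# Build a vocabulary and an ARPA-format n-gram LM from corpus.txt
text2wfreq < corpus.txt | wfreq2vocab > corpus.vocab
text2idngram -vocab corpus.vocab -idngram corpus.idngram < corpus.txt
idngram2lm -vocab_type 0 -idngram corpus.idngram -vocab corpus.vocab -arpa corpus.lm
```

Note that cmuclmtk builds the language model only; the pronunciation dictionary would still need to be assembled separately (e.g. by extracting the needed words from cmudict).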