Hi, I'm trying to convert audio from a WAV file to text using Sphinx4. I've built a very small database for testing and adapted the WavFile demo to do the task. When I decode the same files that were used in training, everything goes right, but if I record the same word in another file, using the same method I used for the training files, the result is always <SIL>.
How can I solve this? Thank you.
Gustavo
"very small" database for testing will not work. Usually test set is 10% of training data. Is it your case?
If you need more details, paste complete log.
Hello again Nickolay ^^, thanks for answering.
The database consists of only 4 words; my program doesn't need to transcribe long speech, it will just execute some tasks associated with the spoken words.
You mean the program output log? The output is the audio file attributes and the decoded text, as in the WavFile demo.
WITH THE ORIGINAL FILE USED FOR TRAINING--------------------
Loading Recognizer...
Decoding /C:/Documents/TesteSphinx2/src/teste.wav
WAVE (.wav) file, byte length: 40048, data format: PCM_SIGNED 16000.0 Hz, 16 bit, mono, 2 bytes/frame, little-endian, frame length: 20002
RESULT: teste
WITH THE NEW FILE, WITH THE SAME WORD RECORDED--------------
Loading Recognizer...
Decoding /C:/Documents/TesteSphinx2/src/teste2.wav
WAVE (.wav) file, byte length: 48058, data format: PCM_SIGNED 16000.0 Hz, 16 bit, mono, 2 bytes/frame, little-endian, frame length: 24000
RESULT:
THIS IS MY CONFIGURATION FILE-------------------------------
<?xml version="1.0" encoding="UTF-8"?>
<!--
Sphinx-4 Configuration file
-->
<!-- ******** -->
<!-- g_teste configuration file -->
<!-- ******** -->
<config>
</config>
THIS IS MY GRAMMAR-------------------------------------------
#JSGF V1.0;
grammar g_teste_grammar;
public <teste> = (alo | lontras | teste | focas) * ;
Can you please paste the prompts you used for recording, and the dictionary? It's recommended to have at least 10 examples of each word in the dictionary, so your prompts should contain at least 40 words. Is that the case?
Another way to proceed is an iterative process: whenever a test word is not recognized, add it to the training set and record another test word. Such a process should converge to a working system.
I also wonder why absolute beam is -1:
<property name="absoluteBeamWidth" value="-1"/>
An AbsoluteBeamWidth of -1 means that no absolute beam width is used. I frequently use this setting and rely solely on the RelativeBeamWidth, as do many of the example configurations in the code base.
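Both beam widths control pruning during the search: the absolute beam caps how many active hypotheses are kept per frame, while the relative beam discards any hypothesis whose score falls too far below the current best one. In the stock demo configurations they are set on the active list factory; a typical fragment (component and property names taken from the standard Sphinx-4 demo configs, so treat this as a sketch rather than your exact file) looks like:

<component name="activeList"
           type="edu.cmu.sphinx.decoder.search.PartitionActiveListFactory">
    <property name="logMath" value="logMath"/>
    <!-- -1 disables the absolute cap, so only the relative beam prunes -->
    <property name="absoluteBeamWidth" value="-1"/>
    <!-- drop hypotheses scoring below best * 1E-80 -->
    <property name="relativeBeamWidth" value="1E-80"/>
</component>

A vocabulary as small as yours rarely needs an absolute cap, which is why -1 is a common default.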
Thanks a lot Nickolay.
I had only 1 wav file for each word; now I use 10 and the results are much better, though not perfect. How many different voices, and how many wave files per voice, do I need in order to make the recognition "speaker independent"? Do I need to use both male and female voices?
I don't know what value AbsoluteBeamWidth should have; I use the same value as the demos, as Robbie said. What are AbsoluteBeamWidth and RelativeBeamWidth used for?
I have one more problem now (looks like they will never end ^^): the recognition still returns <SIL> or erroneous results when I use the microphone. I guess my microphone output sample rate is not 16 kHz as it should be. Do you know how to change the microphone properties in Windows XP?
And shouldn't the result always be a word in the dictionary?
Nickolay, if you need more information to discover what is happening, would it be possible for us to talk through MSN? This would speed things up. If you can add me, my MSN e-mail is [gustavobap#yahoo.com.br].
The dictionary uses English phonemes to simulate the Portuguese ones,
since there is no Portuguese language model available.
THIS IS THE DICTIONARY=============================
ALO AH L OW
FOCAS F AA K AH S
LONTRAS L OW N T R AH S
TESTE T EH S T EH
===================================================
THIS IS THE FILLER DICTIONARY======================
<s> SIL
</s> SIL
<sil> SIL
==================================================
THIS IS THE TRANSCRIPTION=========================
<s> ALO </s> (alo0)
<s> ALO </s> (alo1)
<s> ALO </s> (alo2)
<s> ALO </s> (alo3)
<s> ALO </s> (alo4)
<s> ALO </s> (alo5)
<s> ALO </s> (alo6)
<s> ALO </s> (alo7)
<s> ALO </s> (alo8)
<s> ALO </s> (alo9)
<s> FOCAS </s> (focas0)
<s> FOCAS </s> (focas1)
<s> FOCAS </s> (focas2)
<s> FOCAS </s> (focas3)
<s> FOCAS </s> (focas4)
<s> FOCAS </s> (focas5)
<s> FOCAS </s> (focas6)
<s> FOCAS </s> (focas7)
<s> FOCAS </s> (focas8)
<s> FOCAS </s> (focas9)
<s> LONTRAS </s> (lontras0)
<s> LONTRAS </s> (lontras1)
<s> LONTRAS </s> (lontras2)
<s> LONTRAS </s> (lontras3)
<s> LONTRAS </s> (lontras4)
<s> LONTRAS </s> (lontras5)
<s> LONTRAS </s> (lontras6)
<s> LONTRAS </s> (lontras7)
<s> LONTRAS </s> (lontras8)
<s> LONTRAS </s> (lontras9)
<s> TESTE </s> (teste0)
<s> TESTE </s> (teste1)
<s> TESTE </s> (teste2)
<s> TESTE </s> (teste3)
<s> TESTE </s> (teste4)
<s> TESTE </s> (teste5)
<s> TESTE </s> (teste6)
<s> TESTE </s> (teste7)
<s> TESTE </s> (teste8)
<s> TESTE </s> (teste9)
==================================================
THIS IS THE MAIN JAVA CLASS=======================
package src;
import java.io.File;
import java.io.IOException;
import java.net.URL;
import javax.sound.sampled.AudioInputStream;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.UnsupportedAudioFileException;
import edu.cmu.sphinx.frontend.util.StreamDataSource;
import edu.cmu.sphinx.recognizer.Recognizer;
import edu.cmu.sphinx.result.Result;
import edu.cmu.sphinx.util.props.ConfigurationManager;
import edu.cmu.sphinx.util.props.PropertyException;
public class GTeste {

    /** Main method for running the WavFile demo. */
    public static void main(String[] args) {
        try {
            URL audioFileURL;
            if (args.length > 0) {
                audioFileURL = new File(args[0]).toURI().toURL();
            } else {
                audioFileURL = GTeste.class.getResource("testeR.wav");
            }
            URL configURL = GTeste.class.getResource("/g_teste_config.xml");

            System.out.println("Loading Recognizer...\n");
            ConfigurationManager cm = new ConfigurationManager(configURL);
            Recognizer recognizer = (Recognizer) cm.lookup("recognizer");

            /* allocate the resources necessary for the recognizer */
            recognizer.allocate();

            System.out.println("Decoding " + audioFileURL.getFile());
            System.out.println(AudioSystem.getAudioFileFormat(audioFileURL));

            StreamDataSource reader = (StreamDataSource) cm.lookup("streamDataSource");
            AudioInputStream ais = AudioSystem.getAudioInputStream(audioFileURL);

            /* set the stream data source to read from the audio file */
            reader.setInputStream(ais, audioFileURL.getFile());

            /* decode the audio file */
            Result result = recognizer.recognize();

            /* print out the result */
            if (result != null) {
                System.out.println("\nRESULT: " + result.getBestFinalResultNoFiller() + "\n");
            } else {
                System.out.println("Result: null\n");
            }
        } catch (IOException e) {
            System.err.println("Problem when loading WavFile: " + e);
            e.printStackTrace();
        } catch (PropertyException e) {
            System.err.println("Problem configuring WavFile: " + e);
            e.printStackTrace();
        } catch (InstantiationException e) {
            System.err.println("Problem creating WavFile: " + e);
            e.printStackTrace();
        } catch (UnsupportedAudioFileException e) {
            System.err.println("Audio file format not supported: " + e);
            e.printStackTrace();
        }
    }
}
===============================================================
> I had only 1 wav file for each word; now I use 10 and the results are much better, though not perfect. How many different voices, and how many wave files per voice, do I need in order to make the recognition "speaker independent"? Do I need to use both male and female voices?
Well, I suggest you take a look at the TIDIGITS database design:
http://www.ldc.upenn.edu/Catalog/docs/LDC93S10/tidigits.readme.html
Of course you don't need 300 speakers, but at least 100 are required for true independence. Another detail you can take from TIDIGITS (see the sphinx4 example) is a word-oriented dictionary:
ALO A_1 L_1 O_1
FOCAS F_2 O_2 K_2 A_2 S_2
LONTRAS L_3 O_3 N_3 T_3 R_3 A_3 S_3
TESTE T_4 E_4 S_4 T_4 E_4
or something like that. Check whether such a dictionary/phoneset improves the WER.
> if you need more information to discover what is happening, would it be possible for us to talk through MSN?
Well, it's possible, but I suppose we should create an IRC channel like #cmusphinx on freenode instead. I don't know whether people here would support this, though; it should probably be discussed in a separate thread.
> I guess my microphone output sample rate is not 16 kHz as it should be. Do you know how to change the microphone properties in Windows XP?
I suppose you should just set the microphone properties:
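For example, something along these lines (the property names belong to the standard edu.cmu.sphinx.frontend.util.Microphone component as I remember them, so double-check them against your Sphinx-4 version):

<component name="microphone" type="edu.cmu.sphinx.frontend.util.Microphone">
    <!-- request 16 kHz, 16-bit, signed, mono, little-endian capture -->
    <property name="sampleRate" value="16000"/>
    <property name="bitsPerSample" value="16"/>
    <property name="channels" value="1"/>
    <property name="signed" value="true"/>
    <property name="bigEndian" value="false"/>
</component>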
But of course it's better to set the 16000 Hz rate somehow; I'm not sure whether all hardware supports it.
Hi again,
Correcting my last post: when I use the microphone, the recognition ALWAYS returns <SIL>, not erroneous results.
The configuration file I use with the microphone is like the one before, but the property "frontend" is changed to
<property name="frontend" value="epFrontEnd"/>
and the part
<!-- ******** -->
<!-- The frontend configuration -->
<!-- ******** -->
is replaced by this:
================================================
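(essentially the endpointed front end from the HelloWorld demo; the pipeline item names below follow that stock configuration, so take this as a sketch of the shape rather than an exact copy of my file)

<component name="epFrontEnd" type="edu.cmu.sphinx.frontend.FrontEnd">
    <propertylist name="pipeline">
        <item>microphone</item>
        <!-- endpointer stages: classify frames, mark speech, drop silence -->
        <item>speechClassifier</item>
        <item>speechMarker</item>
        <item>nonSpeechDataFilter</item>
        <!-- the usual MFCC feature extraction chain -->
        <item>preemphasizer</item>
        <item>windower</item>
        <item>fft</item>
        <item>melFilterBank</item>
        <item>dct</item>
        <item>liveCMN</item>
        <item>featureExtraction</item>
    </propertylist>
</component>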
=======================================================
THIS IS THE JAVA MAIN CLASS, USING MICROPHONE==========
/*
 * Copyright 1999-2004 Carnegie Mellon University.
 * Portions Copyright 2004 Sun Microsystems, Inc.
 * Portions Copyright 2004 Mitsubishi Electric Research Laboratories.
 * All Rights Reserved. Use is subject to license terms.
 * See the file "license.terms" for information on usage and
 * redistribution of this file, and for a DISCLAIMER OF ALL
 * WARRANTIES.
 */
package src;
import edu.cmu.sphinx.frontend.util.Microphone;
import edu.cmu.sphinx.recognizer.Recognizer;
import edu.cmu.sphinx.result.Result;
import edu.cmu.sphinx.util.props.ConfigurationManager;
import edu.cmu.sphinx.util.props.PropertyException;
import java.io.File;
import java.io.IOException;
import java.net.URL;
/*
 * A simple HelloWorld demo showing a simple speech application
 * built using Sphinx-4. This application uses the Sphinx-4 endpointer,
 * which automatically segments incoming audio into utterances and silences.
 */
public class GTeste2 {

    /** Main method for running the HelloWorld demo. */
    public static void main(String[] args) {
        try {
            URL url;
            if (args.length > 0) {
                url = new File(args[0]).toURI().toURL();
            } else {
                url = GTeste2.class.getResource("/g_teste_config2.xml");
            }

            System.out.println("Loading...");
            ConfigurationManager cm = new ConfigurationManager(url);
            Recognizer recognizer = (Recognizer) cm.lookup("recognizer");
            Microphone microphone = (Microphone) cm.lookup("microphone");

            /* allocate the resources necessary for the recognizer */
            recognizer.allocate();

            /* the microphone will keep recording until the program exits */
            if (microphone.startRecording()) {
                System.out.println("Say: (alo | lontras | teste | focas)");
                while (true) {
                    System.out.println("Start speaking. Press Ctrl-C to quit.\n");
                    /*
                     * This method returns when the end of speech is reached.
                     * Note that the endpointer determines the end of speech.
                     */
                    Result result = recognizer.recognize();
                    if (result != null) {
                        String resultText = result.getBestFinalResultNoFiller();
                        System.out.println("You said: " + resultText + "\n");
                    } else {
                        System.out.println("I can't hear what you said.\n");
                    }
                }
            } else {
                System.out.println("Cannot start the microphone.");
                recognizer.deallocate();
                System.exit(1);
            }
        } catch (IOException e) {
            System.err.println("Problem when loading HelloWorld: " + e);
            e.printStackTrace();
        } catch (PropertyException e) {
            System.err.println("Problem configuring HelloWorld: " + e);
            e.printStackTrace();
        } catch (InstantiationException e) {
            System.err.println("Problem creating HelloWorld: " + e);
            e.printStackTrace();
        }
    }
}
==========================================================================
I discovered what is wrong with the microphone. I've set the properties this way:
====================================================================================
as it should be, to be used with a database where the files are:
WAVE (.wav) file, byte length: 32048, data format: PCM_SIGNED 16000.0 Hz, 16 bit, mono, 2 bytes/frame, little-endian, frame length: 16002
The problem is, even with this property set
<property name="bigEndianData" value="false"/>
the microphone is initialized like this:
================================================================
PCM_SIGNED 16000.0 Hz, 16 bit, mono, 2 bytes/frame, big-endian
================================================================
It is BIG-ENDIAN, while my training files are LITTLE-ENDIAN.
Do you know how to solve this?
I just discovered that the property should be set like this:
<property name="bigEndian" value="false"/>
The audio format is correct now, but the results are still wrong =[.
The recognizer is taking too long to return, and the recognition is <SIL> or erroneous.
Something weird is happening... I guess I'm doing something really wrong.