CMU Sphinx / Forums / Help: WARNING: This phone (SIL) occurs in the phonelist but not in any word in the transcription

Suranga Premakumara - 2015-12-09

I get this warning when I try to train the acoustic model:

WARNING: This phone (SIL) occurs in the phonelist (/home/suranga/Downloads/Final_Dev/an4/etc/an4.phone), but not in any word in the transcription (/home/suranga/Downloads/Final_Dev/an4/etc/an4_train.transcription)

My phone list file contains the SIL phone, and SIL is not used anywhere in the transcription file. (I use a Sinhala Unicode transcription file with no English words.)

Link to an image of my filler file: https://lh3.googleusercontent.com/-d0WgJbv6xfA/VmfXncKuk2I/AAAAAAAABGc/ZOU3TYx2m5E/s957-Ic42/filler%252520%2525282%252529.png

- Nickolay V. Shmyrev - 2015-12-09
  
  Each transcription in your training transcription file must start with <s> and end with </s>.
  
  Last edit: Nickolay V. Shmyrev 2015-12-09
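
As a rough illustration of the rule above, the following sketch checks whether a single transcription line is wrapped in <s> and </s>. It assumes each line has the shape `<s> words </s> (utterance_id)`; the class name, method name, and sample IDs are hypothetical and are not part of SphinxTrain.

```java
import java.util.regex.Pattern;

class TranscriptLineCheck {
    // Assumed line shape: "<s> words ... </s> (utterance_id)".
    private static final Pattern LINE =
        Pattern.compile("<s>\\s+(?:.*\\s+)?</s>\\s+\\(\\S+\\)\\s*");

    // Returns true when the whole line matches the assumed shape.
    static boolean isWellFormed(String line) {
        return LINE.matcher(line).matches();
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<s> hello world </s> (SentNum_1)")); // prints true
        System.out.println(isWellFormed("hello world (SentNum_1)"));          // prints false
    }
}
```

Running a check like this over every line before training points at the offending line directly, instead of leaving it to a warning from the verification module.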
  

When I add <s> and </s> tags to my transcription file, it gives me this error:

Sphinxtrain binaries path: /usr/local/libexec/sphinxtrain
Running the training
MODULE: 000 Computing feature from audio files
Extracting features from segments starting at (part 1 of 1)
Extracting features from segments starting at (part 1 of 1)
Feature extraction is done
MODULE: 00 verify training files
Phase 1: Checking to see if the dict and filler dict agrees with the phonelist file.
Found 2948 words using 42 phones
Phase 2: Checking to make sure there are not duplicate entries in the dictionary
Phase 3: Check general format for the fileids file; utterance length (must be positive); files exist
Phase 4: Checking number of lines in the transcript file should match lines in fileids file
Phase 5: Determine amount of training data, see if n_tied_states seems reasonable.
Estimated Total Hours Training: 1.03019166666667
This is a small amount of data, no comment at this time
Phase 6: Checking that all the words in the transcript are in the dictionary
Words in dictionary: 2945
Words in filler dictionary: 3
WARNING: This word: <s> was in the transcript file, but is not in the dictionary (<s> ගම්බද පාසල් බේරා දෙන්න මාතර පාතේගම දෙව්සිරිගම වල්පිට පාසල පසුගිය වසරේ වැසී ගියේය මෙම පාසල පමණක් නොව තවත් බොහෝ ගම්බද පාසල්වල ළමයින් සංඛ්‍යාව එන්න එන්නම අඩු වී යමින් ඒවාද වැසී යන තත්ත්වයක් දකින්නට අසන්නට ලැබේ </s> ). Do cases match?
WARNING: Utterance ID mismatch on line 5: User1/SentNum_5 vs SentNum_4
Phase 7: Checking that all the phones in the transcript are in the phonelist, and all phones in the phonelist appear at least once

My transcription file, filler file, dictionary, language model, and fileids file are attached here.

an4.dic

an4.filler

an4.lm.DMP

an4_train.fileids

an4_train.transcription

Suranga Premakumara - 2015-12-09

Ah, it seems okay now. I had mistakenly used the same ID twice. Sorry to disturb you.
This error is fixed now. Thanks for the help.

Last edit: Suranga Premakumara 2015-12-09
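
A duplicated utterance ID like this can be caught before training. The sketch below assumes, based on the "Utterance ID mismatch on line 5: User1/SentNum_5 vs SentNum_4" warning, that the last path component of each fileids entry is compared with the ID in parentheses on the same line of the transcription. All class and method names, and the sample data, are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

class UtteranceIdCheck {
    // Last path component of a fileids entry, e.g. "User1/SentNum_5" -> "SentNum_5".
    static String baseName(String fileId) {
        String trimmed = fileId.trim();
        return trimmed.substring(trimmed.lastIndexOf('/') + 1);
    }

    // ID inside the trailing parentheses of a transcript line, or "" if absent.
    static String transcriptId(String line) {
        String trimmed = line.trim();
        int open = trimmed.lastIndexOf('(');
        if (open < 0 || !trimmed.endsWith(")")) {
            return "";
        }
        return trimmed.substring(open + 1, trimmed.length() - 1).trim();
    }

    // 1-based number of the first line whose IDs differ, or -1 if all agree.
    static int firstMismatch(List<String> fileIds, List<String> transcript) {
        int n = Math.min(fileIds.size(), transcript.size());
        for (int i = 0; i < n; i++) {
            if (!baseName(fileIds.get(i)).equals(transcriptId(transcript.get(i)))) {
                return i + 1;
            }
        }
        // A length difference counts as a mismatch on the first unmatched line.
        return fileIds.size() == transcript.size() ? -1 : n + 1;
    }

    public static void main(String[] args) {
        List<String> fileIds = Arrays.asList("User1/SentNum_4", "User1/SentNum_5");
        List<String> transcript = Arrays.asList(
            "<s> one </s> (SentNum_4)",
            "<s> two </s> (SentNum_4)"); // same ID pasted twice
        System.out.println(firstMismatch(fileIds, transcript)); // prints 2
    }
}
```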


Suranga Premakumara - 2015-12-09

After these changes I got this error:

MODULE: 90 deleted interpolation
Skipped for continuous models
MODULE: DECODE Decoding using models previously trained
Decoding 154 segments starting at 0 (part 1 of 1)
0%
Aligning results to find error rate
word_align.pl failed with error code 65280 at /usr/local/lib/sphinxtrain/scripts/decode/slave.pl line 173.

In my sphinx_train.cfg file I tried each of the following values in turn:
$CFG_N_TIED_STATES = 1
$CFG_N_TIED_STATES = 2
$CFG_N_TIED_STATES = 4
$CFG_N_TIED_STATES = 8
$CFG_N_TIED_STATES = 200
$CFG_N_TIED_STATES = 1000

**But the error is still there. I have 1 hour of training data.**

My decode log file shows this warning:
WARN: "ms_mgau.c", line 145: -topn argument (4) invalid or > #density codewords (1); set to latter

Last edit: Suranga Premakumara 2015-12-09
- Nickolay V. Shmyrev - 2015-12-09
  
  This is just a warning, it should not affect results. Alignment failed for some other reason which you need to find in the logs.
  
  You can share the acoustic model training folder in order to get help on this issue.
  

Suranga Premakumara - 2015-12-09

Here I have attached my acoustic model (not including the wav and feat folders).
There are no errors or warnings in the log files.

Last edit: Suranga Premakumara 2015-12-09

acoustic_model.zip


Nickolay V. Shmyrev - 2015-12-09

You have empty lines and extra UTF-8 BOM (byte order mark) symbols in the file an4_test.transcription. You need to remove them. The number of lines must exactly match the number of lines in the fileids file.
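
The cleanup described here can be sketched in a few lines. The example below assumes the transcription has already been read as UTF-8 text, in which case Java keeps a byte order mark as the character U+FEFF. It drops that character and any blank lines, then compares the result with the expected line count. The class name and the sample lines are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TranscriptCleaner {
    // A UTF-8 byte order mark decodes to this single character.
    private static final String BOM = "\uFEFF";

    // Drops BOM characters and blank lines; returns the remaining lines trimmed.
    static List<String> clean(List<String> lines) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            String stripped = line.replace(BOM, "").trim();
            if (!stripped.isEmpty()) {
                out.add(stripped);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList(
            "\uFEFF<s> one </s> (SentNum_1)", "", "<s> two </s> (SentNum_2)", "   ");
        List<String> cleaned = clean(raw);
        int fileIdCount = 2; // hypothetical number of lines in the fileids file
        System.out.println(cleaned.size() == fileIdCount); // prints true
    }
}
```

For real files, the lines could be loaded with `Files.readAllLines(path, StandardCharsets.UTF_8)` and written back with `Files.write(path, cleaned, StandardCharsets.UTF_8)`, which does not add a BOM.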


Suranga Premakumara - 2015-12-10

I used the Notepad++ encoding option "Convert to UTF-8" to remove the UTF-8 BOM.
I removed the whitespace and rebuilt the acoustic model.
The sentence error rate is 90% and the word error rate is 10%.
When I use the model in NetBeans, this is my code:

System.out.println("Loading models...");
Configuration configuration = new Configuration();
configuration.setAcousticModelPath("resource:/edu/cmu/sphinx/models/en-us/aa");
configuration.setDictionaryPath("resource:/edu/cmu/sphinx/models/en-us/an4.dic");
configuration.setLanguageModelPath("resource:/edu/cmu/sphinx/models/en-us/an4.lm");
LiveSpeechRecognizer recognizer = new LiveSpeechRecognizer(configuration);
recognizer.startRecognition(true);
SpeechResult result = recognizer.getResult();
System.out.println(result.getResult());
System.out.println("outside loop");
recognizer.stopRecognition();
}

and the line System.out.println(result.getResult()); prints:
<s> à¶¶à·?à¶½à·”à·€ </s>

What are those characters (à¶¶à·?à¶½à·”à·€)? Am I still doing something wrong?
There is no error, warning, or exception in the console, and I expected a result like <s> අම්මා </s>.
(I think "à¶¶à·?à¶½à·”à·€" is what the UTF-8 bytes look like when they are read as ANSI.)

Here I have attached my acoustic model files, language model, and dictionary file.

Last edit: Suranga Premakumara 2015-12-10

an4.cd_cont_200.zip
- Nickolay V. Shmyrev - 2015-12-10
  
  This is just output in the wrong encoding. You can change the console encoding to UTF-8, or write the output to a file and open it in a text editor with the encoding specified. You can also convert the result to the encoding you need before you output it.
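
The effect can be reproduced in isolation. The sketch below takes one Sinhala letter, shows that its UTF-8 bytes read with a single-byte charset begin with "à¶" (the same prefix as in the garbled output), and then prints through a `PrintStream` set explicitly to UTF-8. ISO-8859-1 is used here as a stand-in for whatever single-byte charset the console actually used, which is an assumption. The class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

class EncodingDemo {
    // Reinterprets the UTF-8 bytes of text as ISO-8859-1, one char per byte.
    static String misdecode(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    // Encodes text through a PrintStream that is explicitly set to UTF-8.
    static byte[] printAsUtf8(String text) throws UnsupportedEncodingException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        PrintStream out = new PrintStream(sink, true, "UTF-8");
        out.print(text);
        out.flush();
        return sink.toByteArray();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String sinhalaA = "\u0D85"; // SINHALA LETTER AYANNA
        // The first two misdecoded characters are a-grave and the pilcrow sign.
        System.out.println(misdecode(sinhalaA).startsWith("\u00E0\u00B6")); // prints true
        System.out.println(printAsUtf8(sinhalaA).length); // prints 3
        // Console output with an explicit encoding instead of the platform default.
        PrintStream console = new PrintStream(System.out, true, "UTF-8");
        console.println(sinhalaA);
    }
}
```

The last line only displays correctly if the terminal or IDE console is itself set to UTF-8.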
  

Suranga Premakumara - 2015-12-10

I use this code segment to write the output to a file:

SpeechResult result = recognizer.getResult();
String resultText = result.getHypothesis();
PrintWriter writer = new PrintWriter("the-file-name.txt", "UTF-8");
writer.println("The first line: " + resultText);
writer.println("The second line සිංහල");
writer.close();

Here I have attached my output file.
Converting the encoding with Notepad++ did not make it human-readable.

Last edit: Suranga Premakumara 2015-12-10

the-file-name.txt
- Nickolay V. Shmyrev - 2015-12-10
  
  You can add -Dfile.encoding=UTF-8 to the Java options when you run your code to force it to use UTF-8.
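
One way to confirm that the option took effect is to print the charset the JVM falls back to when code does not name one. This is a minimal sketch with a hypothetical class name. It does not show how the recognizer reads its dictionary or language model, which presumably relies on that default.

```java
import java.nio.charset.Charset;

class DefaultCharsetCheck {
    // Describes the charset used when no explicit charset is given.
    static String report() {
        return "default charset = " + Charset.defaultCharset().name()
            + ", file.encoding = " + System.getProperty("file.encoding");
    }

    public static void main(String[] args) {
        // Run as: java -Dfile.encoding=UTF-8 DefaultCharsetCheck
        System.out.println(report());
    }
}
```

If the report still names a platform charset such as a Windows code page, the option did not reach the JVM that runs the application, for example because the IDE launches it with its own settings.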
  

Suranga Premakumara - 2015-12-10

As you told me, I created a system variable for -Dfile.encoding=UTF-8 and restarted the computer. Now it works fine. Thank you very much for your kind help.


Suranga Premakumara - 2015-12-11

I want to know: does using the same sentence list with training audio from different speakers help to improve accuracy? Or does using different sentence lists with different speakers help to improve accuracy?
Or both?

I ask because I have 150 sentences (one hour of audio), and I decided to record clips of different users reading those sentences.

Last edit: Suranga Premakumara 2015-12-12

- Nickolay V. Shmyrev - 2015-12-12
  
  You need to use different sentences
  
