
WARNING: This phone (SIL) occurs in the phonelist but not in any word in the transcription

Help
2015-12-09
2015-12-12
  • Suranga Premakumara

    I got this error when I tried to train the acoustic model:

    WARNING: This phone (SIL) occurs in the phonelist (/home/suranga/Downloads/Final_Dev/an4/etc/an4.phone), but not in any word in the transcription (/home/suranga/Downloads/Final_Dev/an4/etc/an4_train.transcription)

    My phone list file contains the SIL phone, but SIL is not used anywhere in the transcription file. (I use a Sinhala Unicode transcription file, with no English words.)

    Link to an image of my filler file: https://lh3.googleusercontent.com/-d0WgJbv6xfA/VmfXncKuk2I/AAAAAAAABGc/ZOU3TYx2m5E/s957-Ic42/filler%252520%2525282%252529.png

     
    • Nickolay V. Shmyrev

      Each transcription in your training transcription file must start with <s> and end with </s>.
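As a minimal sketch of the rule above, the helper below wraps one transcription line in `<s>` and `</s>`. It assumes each line has the usual layout of words followed by an utterance ID in parentheses; the class and method names are hypothetical and not part of SphinxTrain.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of SphinxTrain.
final class TranscriptMarkers {
    // Assumed line layout: "words ... (utterance_id)"; the ID part is optional here.
    private static final Pattern LINE = Pattern.compile("^(.*?)\\s*(\\(\\S+\\))\\s*$");

    // Returns the line with <s> ... </s> around the words, leaving the ID untouched.
    static String addSentenceMarkers(String line) {
        String trimmed = line.trim();
        Matcher m = LINE.matcher(trimmed);
        boolean hasId = m.matches();
        String words = hasId ? m.group(1).trim() : trimmed;
        String id = hasId ? " " + m.group(2) : "";
        // Only add a marker that is not already present.
        if (!words.startsWith("<s>")) {
            words = "<s> " + words;
        }
        if (!words.endsWith("</s>")) {
            words = words + " </s>";
        }
        return words + id;
    }
}
```

Calling `TranscriptMarkers.addSentenceMarkers(line)` on every line of the training transcription, then writing the lines back as UTF-8, would give a file in the expected form.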

       

      Last edit: Nickolay V. Shmyrev 2015-12-09
  • Suranga Premakumara

    When I add the tags to my transcription file, I get the following errors:

    Sphinxtrain binaries path: /usr/local/libexec/sphinxtrain
    Running the training
    MODULE: 000 Computing feature from audio files
    Extracting features from  segments starting at  (part 1 of 1) 
    Extracting features from  segments starting at  (part 1 of 1) 
    Feature extraction is done
    MODULE: 00 verify training files
        Phase 1: Checking to see if the dict and filler dict agrees with the phonelist file.
            Found 2948 words using 42 phones
        Phase 2: Checking to make sure there are not duplicate entries in the dictionary
        Phase 3: Check general format for the fileids file; utterance length (must be positive); files exist
        Phase 4: Checking number of lines in the transcript file should match lines in fileids file
        Phase 5: Determine amount of training data, see if n_tied_states seems reasonable.
            Estimated Total Hours Training: 1.03019166666667
            This is a small amount of data, no comment at this time
        Phase 6: Checking that all the words in the transcript are in the dictionary
            Words in dictionary: 2945
            Words in filler dictionary: 3
    WARNING: This word: <s> was in the transcript file, but is not in the dictionary (<s> ගම්බද පාසල් බේරා දෙන්න මාතර පාතේගම දෙව්සිරිගම වල්පිට පාසල පසුගිය වසරේ වැසී ගියේය මෙම පාසල පමණක් නොව තවත් බොහෝ ගම්බද පාසල්වල ළමයින් සංඛ්‍යාව එන්න එන්නම අඩු වී යමින් ඒවාද වැසී යන තත්ත්වයක් දකින්නට අසන්නට ලැබේ </s> ). Do cases match?
    WARNING: Utterance ID mismatch on line 5: User1/SentNum_5 vs SentNum_4
        Phase 7: Checking that all the phones in the transcript are in the phonelist, and all phones in the phonelist appear at least once
    

    My transcription file, filler file, dictionary, language model, and fileids file are attached here.

     
  • Suranga Premakumara

    Ah, it seems okay now. I had mistakenly put the same ID twice. Sorry to disturb you.
    This error is fixed now. Thanks for the help.

     

    Last edit: Suranga Premakumara 2015-12-09
  • Suranga Premakumara

    After these changes I got this error:

    MODULE: 90 deleted interpolation
    Skipped for continuous models
    MODULE: DECODE Decoding using models previously trained
            Decoding 154 segments starting at 0 (part 1 of 1) 
            0% 
            Aligning results to find error rate
    word_align.pl failed with error code 65280 at /usr/local/lib/sphinxtrain/scripts/decode/slave.pl line 173.
    

    In my sphinx_train.cfg file I tried each of the following values in turn:
    $CFG_N_TIED_STATES = 1
    $CFG_N_TIED_STATES = 2
    $CFG_N_TIED_STATES = 4
    $CFG_N_TIED_STATES = 8
    $CFG_N_TIED_STATES = 200
    $CFG_N_TIED_STATES = 1000

    But the error is still there.
    I have 1 hour of training data.

    My decode log file shows this warning:
    WARN: "ms_mgau.c", line 145: -topn argument (4) invalid or > #density codewords (1); set to latter

     

    Last edit: Suranga Premakumara 2015-12-09
    • Nickolay V. Shmyrev

      This is just a warning, it should not affect results. Alignment failed for some other reason which you need to find in the logs.

      You can share the acoustic model training folder in order to get help on this issue.

       
  • Suranga Premakumara

    Here I have attached my acoustic model training folder (not including the wav and feat folders).
    There are no errors or warnings in the log files.

     

    Last edit: Suranga Premakumara 2015-12-09
  • Nickolay V. Shmyrev

    You have empty lines and extra UTF-8 BOM symbols in the file an4_test.transcription. You need to remove them. The number of lines must match the number of lines in the fileids file exactly.
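One way to do this cleanup without a text editor is sketched below. It removes byte order mark characters (U+FEFF) and blank lines, and can compare the line count against the fileids file. The class and method names are hypothetical and not part of SphinxTrain, and the sketch assumes both files are UTF-8.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of SphinxTrain.
final class TranscriptCleaner {
    // Removes U+FEFF characters, trims each line and drops lines that end up empty.
    static List<String> clean(List<String> lines) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            String s = line.replace("\uFEFF", "").trim();
            if (!s.isEmpty()) {
                out.add(s);
            }
        }
        return out;
    }

    // Reads a UTF-8 file and writes the cleaned lines to another UTF-8 file.
    static void cleanFile(Path in, Path out) throws IOException {
        List<String> cleaned = clean(Files.readAllLines(in, StandardCharsets.UTF_8));
        Files.write(out, cleaned, StandardCharsets.UTF_8);
    }

    // True when the transcription and the fileids file have the same number of lines.
    static boolean sameLineCount(Path transcription, Path fileids) throws IOException {
        return Files.readAllLines(transcription, StandardCharsets.UTF_8).size()
                == Files.readAllLines(fileids, StandardCharsets.UTF_8).size();
    }
}
```

For example, `TranscriptCleaner.cleanFile(Paths.get("etc/an4_test.transcription"), Paths.get("etc/an4_test.clean.transcription"))` would write a cleaned copy; the output file name here is only an illustration.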

     
  • Suranga Premakumara

    I used Notepad++ to convert the encoding to plain UTF-8, which removes the UTF-8 BOM.
    I removed the extra whitespace and built the acoustic model.
    The sentence error rate is 90% and the word error rate is 10%.
    When I use it in NetBeans, I run this code:

        System.out.println("Loading models...");
        Configuration configuration = new Configuration();
        configuration.setAcousticModelPath("resource:/edu/cmu/sphinx/models/en-us/aa");
        configuration.setDictionaryPath("resource:/edu/cmu/sphinx/models/en-us/an4.dic");
        configuration.setLanguageModelPath("resource:/edu/cmu/sphinx/models/en-us/an4.lm");
        LiveSpeechRecognizer recognizer = new LiveSpeechRecognizer(configuration);
        recognizer.startRecognition(true);
        SpeechResult result = recognizer.getResult();
        System.out.println(result.getResult());
        System.out.println("outside loop");
        recognizer.stopRecognition();
    }
    

    The line System.out.println(result.getResult()); prints:
    <s> බ�ලුව </s>

    What are those characters (බ�ලුව)? Am I still doing something wrong?
    There is no error, warning or exception in the console, and I expected a result like <s> අම්මා </s>.
    (I think "බ�ලුව" are the ANSI values corresponding to the UTF-8 bytes.)

    Here I have attached my acoustic model files, language model and dictionary file.

     

    Last edit: Suranga Premakumara 2015-12-10
    • Nickolay V. Shmyrev

      This is just output in the wrong encoding. You can change the console encoding to UTF-8, or write the output to a file and open it in a text editor with the encoding specified. You can also convert the text to the encoding you need before you output the result.
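A small sketch of the last option: wrapping an output stream in a `PrintStream` that always encodes text as UTF-8, regardless of the platform default. The class and method names are hypothetical. Whether the text then displays correctly still depends on the console itself being set to UTF-8.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

// Hypothetical helper for printing recognition results as UTF-8.
final class Utf8Output {
    // Wraps the target stream so that everything printed is encoded as UTF-8.
    static PrintStream utf8(OutputStream target) {
        try {
            // The second argument enables auto-flush on println.
            return new PrintStream(target, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a required charset on every Java platform.
            throw new IllegalStateException(e);
        }
    }
}
```

With this in place, something like `Utf8Output.utf8(System.out).println(result.getHypothesis());` prints the hypothesis as UTF-8 bytes instead of using the default console encoding.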

       
  • Suranga Premakumara

    I use this code segment to write the output to a file:

    SpeechResult result = recognizer.getResult();
    String resultText = result.getHypothesis(); 
    PrintWriter writer = new PrintWriter("the-file-name.txt", "UTF-8");
    writer.println("The first line: "+resultText);
    writer.println("The second line සිංහල");
    writer.close();
    

    Here I have attached my output file.
    Converting the encoding with Notepad++ did not make it human-readable.

     

    Last edit: Suranga Premakumara 2015-12-10
    • Nickolay V. Shmyrev

      You can add -Dfile.encoding=UTF-8 to the Java options when you run your code to force it to use UTF-8.
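To confirm that the option actually reached the JVM, a quick check like the sketch below can be run inside the application. The class and method names are hypothetical; `Charset.defaultCharset()` and the `file.encoding` system property are standard Java.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical diagnostic helper.
final class EncodingCheck {
    // True when the JVM's default charset is UTF-8.
    static boolean defaultIsUtf8() {
        return StandardCharsets.UTF_8.equals(Charset.defaultCharset());
    }

    // Human-readable summary of the encoding settings the JVM is using.
    static String describe() {
        return "file.encoding=" + System.getProperty("file.encoding")
                + ", default charset=" + Charset.defaultCharset();
    }
}
```

Printing `EncodingCheck.describe()` at startup shows which encoding is in effect, which helps when the option is set in an IDE or environment variable rather than on the command line.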

       
  • Suranga Premakumara

    As you told me, I created a system variable for -Dfile.encoding=UTF-8 and restarted the computer. Then it worked fine. Thank you very much for your kind help.

     
  • Suranga Premakumara

    I want to know: does using the same sentence list with training audio from different speakers help to improve accuracy? Or does using different sentence lists with different speakers help?
    Or both?

    I ask because I have 150 sentences (one hour of audio), and I decided to record clips of those sentences with different users.

     

    Last edit: Suranga Premakumara 2015-12-12
    • Nickolay V. Shmyrev

      You need to use different sentences

       
