
Problems creating my acoustic model

Help
Created: 2016-12-22
Last updated: 2017-01-04
  • Ayelet Goldstein

    I am trying to build a small acoustic model.
When I run sphinxtrain run, I receive the following error:

MODULE: DECODE Decoding using models previously trained
Decoding 8 segments starting at 0 (part 1 of 1)
0% ERROR: FATAL: "batch.c", line 822: PocketSphinx decoder init failed

ERROR: This step had 1 ERROR messages and 0 WARNING messages. Please check the log file for details.
ERROR: Failed to start pocketsphinx_batch
Aligning results to find error rate

The log in the logDir/Decode folder shows the following error:
INFO: feat.c(715): Initializing feature stream to type: '1s_c_d_dd', ceplen=13, CMN='batch', VARNORM='no', AGC='none'
ERROR: "acmod.c", line 79: Folder '[...]sphinxtrain/scripts/an4/model_parameters/an4.cd_cont_256' does not contain acoustic model definition 'mdef'
FATAL: "batch.c", line 822: PocketSphinx decoder init failed
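
To double-check, I listed the model folder the log complains about (path relative to my an4 directory):

$ ls model_parameters/an4.cd_cont_256

and indeed there is no 'mdef' file there, just as the error says.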

    I tried checking similar messages in the forum but nothing seemed to work for my case.

I need help!
Thank you very much in advance!
P.S. The link to my logDir folder is https://drive.google.com/open?id=0B01ecFOycElEaWJkYklNUHFoUUE
P.P.S. Here is the link to my an4 folder: https://drive.google.com/open?id=0B01ecFOycElENTlHZnlpeDROSFk

     

    Last edit: Ayelet Goldstein 2016-12-23
  • Ayelet Goldstein

Thank you!!

     
  • Ayelet Goldstein

Dear Nickolay, thank you for referring me to the source where I could solve my previous bug.
I have now succeeded in training my acoustic model, but I have a new problem: I get 0% accuracy.
Following the tutorials, I built a phoneme-based language model, since I want to try to recognize some given phrases (to check the user's fluency in reading Hebrew).
I am converting the Hebrew text to phonemes to train the language model.
I also trained my acoustic model, which for now I expect to recognize only myself, using only words that I used in the training.
What am I doing wrong?
Again, a link to my an4 folder: https://drive.google.com/open?id=0B01ecFOycElEUjU2azJrTWFOVjA

When I run:
$ pocketsphinx_continuous -hmm model_parameters/an4.ci_cont -lm etc/an4.lm.bin -dict etc/an4.dic -infile wav/Ayelet/file_2.wav
I can't see the expected phrase converted into phonemes.
And when I run:
$ sphinxtrain -s decode run
I get a 100% error rate for both sentences and words.
I am surely missing something in the process.
Thank you very much for your patience!

     
  • Ayelet Goldstein

I found the following error in the file an4.g2p.evaluate.log:
IOError: [Errno 2] No such file or directory: '/.../cmusphinx-code/sphinxtrain/scripts/an4/g2p/an4.test'
Who or what was supposed to generate this file?

     
  • Ayelet Goldstein

I tried doing everything from the beginning, but I am still getting 0% accuracy, and I am not sure what I am doing wrong. Or maybe I just need much more data?
But I want to build a small model using just predefined words; that is, I want to recognize only a relatively small number of words (~200).
My objective is to recognize whether a person is correctly reading a phrase that is displayed to him/her (in Hebrew), so the program knows what it is expected to recognize.

    So what I did is the following:
1) I created a phonetic language model, as described in http://cmusphinx.sourceforge.net/wiki/phonemerecognition
I used SRILM to build my model with the command:
ngram-count -text phoneticTranslationText.txt -lm myPhoneticModel.lm
where phoneticTranslationText.txt is the Hebrew text written as phonemes; for example, the first sentence is "SIL O RE N TA MI Y D RA TZA H KE LE V SIL", which is the phonetic rendering of the phrase ארן תמיד רצה כלב.
    I have 86 sentences like this.
    My phonetic model looks like:
\data\
ngram 1=81
ngram 2=582
ngram 3=199

\1-grams:
-1.223562 </s>
-99 <s> -1.884235
-1.442057 A -0.2517919
-2.857031 B -0.626858
-1.534811 BA -0.07855582
-2.680939 BE -0.2802781
-2.857031 BI -0.3330438
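
(For completeness: I ran ngram-count with its defaults; I understand one can also pass an explicit n-gram order and discounting method, for example

ngram-count -order 3 -wbdiscount -text phoneticTranslationText.txt -lm myPhoneticModel.lm

but I have not experimented with those options yet.)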

    2) Then I built the acoustic model:
    Following the tutorial in http://cmusphinx.sourceforge.net/wiki/tutorialam I built the following files:
    - transcription file (for train and test)
- dictionary - with the Hebrew words and their phonemes (e.g. אֹרֶן O RE N)
    - the filler dictionary
    - the fileids (both for train and test)
    - The file listing all the phonemes (A, BA, ...)

    And I edited the "sphinx_train.cfg" as described.
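
For reference, here are illustrative lines from my etc files (file IDs shortened; I hope I got the formats right):

etc/an4_train.transcription:
<s> O RE N TA MI Y D RA TZA H KE LE V </s> (file_1)

etc/an4_train.fileids:
Ayelet/file_1

etc/an4.dic:
אֹרֶן O RE N

etc/an4.filler:
<s> SIL
</s> SIL
<sil> SIL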

Because it is such a small model, I understand I have to make it context-independent (is that indeed a requirement here?). So I changed $CFG_CD_TRAIN to 'no', and I learned from your previous post that in this case I also need to change $DEC_CFG_MODEL_NAME to "$CFG_EXPTNAME.ci_cont".
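
Concretely, the two lines in my sphinx_train.cfg now read:

$CFG_CD_TRAIN = 'no';
$DEC_CFG_MODEL_NAME = "$CFG_EXPTNAME.ci_cont";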

I trained the model successfully, without any error messages, but I get 0% accuracy.
I tried having 8 sentences in the test set, then having 1 sentence in the test set, but still nothing improved.

I am not sure what I am missing, and your help, or direction to sources where I can learn more about what I am doing wrong, would be greatly appreciated.

    Thank you,
    Ayelet

     

    Last edit: Ayelet Goldstein 2016-12-27
    • Nickolay V. Shmyrev

You do not have enough data; data requirements are listed at the beginning of the acoustic model training tutorial.

       
  • Ayelet Goldstein

    Thank you for the fast reply!
But is so much data indeed necessary even if my requirements are so low?
I want to recognize always the same speaker I used for training, the phrases to be recognized are exactly the same ones I am training on, I know which phrase is being displayed, and in Hebrew there are no different ways of pronouncing the same letters (for example, א with a certain mark below it will always be pronounced like "A", unlike English, where a letter is sometimes A and sometimes O, etc.).
Maybe there is an easier way to check whether a phrase is being read correctly, without having to build a robust acoustic model?

     

    Last edit: Ayelet Goldstein 2016-12-27
    • Nickolay V. Shmyrev

But is so much data indeed necessary even if my requirements are so low?

      Yes

Maybe there is an easier way to check whether a phrase is being read correctly, without having to build a robust acoustic model?

      No

       
      • Ayelet Goldstein

        Thank you again for the fast reply !
        Seems I will have to continue fighting my way towards a robust model then :)

         
  • Ayelet Goldstein

OK, I now have 0.5 h of speech according to sphinxtrain. I have only 9 different phrases; each phrase I recorded multiple times. I just want to make sure I am on the right path, and to see if the accuracy improves at all.
If my vocabulary is relatively small, then I might need less data, am I right? (just as the manual states that for "commands, single speaker" about 1 hour of data is needed)
I trained the model on about 20-40 recordings of each sentence, and tested on one sentence of each type (a total of 9 sentences tested).
I still get a round ZERO in my accuracy.
I tried transliterating all the text; still no improvement in accuracy.
I tried training with context-dependent set to YES; still no improvement.
When training, I get a message saying I have 2 errors, but I can't see them in the log:
    Phase 3: Forward-Backward
    Baum welch starting for 1 Gaussian(s), iteration: 1 (1 of 1)
    0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
    ERROR: This step had 2 ERROR messages and 0 WARNING messages. Please check the log file for details.

A few questions:
1) Is my zero-percent accuracy still due to not enough training data, or am I missing something else here?
2) Where can I see these 2 errors mentioned during the training?
3) In general, should I perform "alignment" during training?
4) Should I run with context-dependent set to yes?
5) Is it necessary to transliterate the Hebrew text (meaning, converting all the Hebrew words to their corresponding English form), or is it not required?
6) To be able to recognize X predefined sentences (single speaker), what should the size of my training data be? More specifically, how many times do I have to record each sentence? And what if I want the sentences to be recognized by multiple speakers?
7) In the language model (using SRILM), should I use word-to-phoneme or sentence-to-phoneme?
ngram-count -text (file containing the sentences written as phonemes) OR ngram-count -vocab (file containing words mapped to their specific phonemes)
And should the text used to train the language model contain all the duplicated sentences, or only distinct sentences? (If I want to recognize only, let's say, 5 sentences, should the model contain duplications of these sentences, or only the 5 distinct sentences?)

    Thank you very much,
    Ayelet

P.S. A link to my DB: https://drive.google.com/open?id=0B01ecFOycElEUWxZZktDRHBVUW8

     

    Last edit: Ayelet Goldstein 2017-01-02
    • Nickolay V. Shmyrev

1) Is my zero-percent accuracy still due to not enough training data, or am I missing something else here?

The default training process is not supposed to evaluate phonetic decoding accuracy; it evaluates word decoding accuracy. You have to either modify the test reference to list phonemes and the decoding script to use allphone decoding mode, or modify the scoring script to score phonemes and the decoding script to use allphone decoding mode.
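
For a quick manual check you can also run an allphone decode directly from the command line; roughly like this, reusing the paths from your earlier message (-allphone takes the phonetic LM in place of -lm):

pocketsphinx_continuous -hmm model_parameters/an4.ci_cont -allphone etc/an4.lm.bin -infile wav/Ayelet/file_2.wav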

2) Where can I see these 2 errors mentioned during the training?

In the logs in the logdir folder.
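
For example, something like this should list the files that contain them (assuming the default logdir layout):

grep -l ERROR logdir/*/*.log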

      3) In general, should I perform "alignment" during training ?

      It depends on the quality of your data.
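
(If you try it: forced alignment is controlled in sphinx_train.cfg, I believe via $CFG_FORCEDALIGN = 'yes';)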

4) Should I run with context-dependent set to yes?

Context-dependent decoding is usually slow; it depends on whether you want to tolerate that.

5) Is it necessary to transliterate the Hebrew text (meaning, converting all the Hebrew words to their corresponding English form), or is it not required?

      No

6) To be able to recognize X predefined sentences (single speaker), what should the size of my training data be? More specifically, how many times do I have to record each sentence? And what if I want the sentences to be recognized by multiple speakers?

This question is answered in the first line of the acoustic model training tutorial: http://cmusphinx.sourceforge.net/wiki/tutorialam

7) In the language model (using SRILM), should I use word-to-phoneme or sentence-to-phoneme?
ngram-count -text (file containing the sentences written as phonemes) OR ngram-count -vocab (file containing words mapped to their specific phonemes)

Phonetic model training is covered in detail on the wiki:

      http://cmusphinx.sourceforge.net/wiki/phonemerecognition#training_phonetic_language_model_for_decoding

And should the text used to train the language model contain all the duplicated sentences, or only distinct sentences? (If I want to recognize only, let's say, 5 sentences, should the model contain duplications of these sentences, or only the 5 distinct sentences?)

This question is easy to answer if you try to understand how language models and machine learning work; there are many tutorials on the web, and you need to read them first. Without that knowledge there is not much sense in continuing; you will wander blindly.

       
  • Ayelet Goldstein

    "Default training process is not supposed to evaluate phonetic decoding accuracy, it evaluates word decoding accuracy. You have to modify the test reference to list phonemes and decoding script to use allphone decoding mode. Or you can modify scoring script to score phonemes and modify decoding script to use allphone decoding mode."

Can you guide me on how to modify them? Or where can I find instructions or a manual on how to modify them?

    "should I perform "alignment" during training ?" - It depends on the quality of your data.

    So if the quality is good alignment is not required ? Or where i can find more information about what is the alignment process and its effects ?

    "To be able to recognize X predefined sentences,(single speaker), what's the size of my training data, or more specifically, how many times I have to record each sentence ? And what if I want it to >be recognized by multiple speakers ?" - This question is answered in the first line of the acoustic model training tutorial http://cmusphinx.sourceforge.net/wiki/tutorialam
    > I trained the acoustic model based on this tutorial, and I found there that for single speaker I need 1 hour of recording for "command and control", 10 hours for dictation. But I didn't find anything written regarding a few sentences. Like 5-10 sentences, how many hours are required ?

I read the acoustic model training tutorial and also the language model building tutorial and followed all their steps.
I am still having problems and would be very happy if you could help me or direct me to additional sources where I can learn how to progress. Are there additional tutorials that explain how to use the CMU Sphinx library?

    Thank you,
    Ayelet

     

