
Need help in training DB

Help
shashank
2015-04-22
2015-04-22
  • shashank

    shashank - 2015-04-22

    Hi,

    I am currently trying to train the model. I have a few questions about this.

    1. How do we prepare the transcription file? Manually, or are there any tools to do that?
      Can we have three lines between <s> and </s>?

    2. How much training data is needed, in minutes? I have a 50-minute recording. Will a 3-minute chunk be sufficient to train the model?

    3. I tried running the sphinxtrain run command and got this error:

    Use of uninitialized value $_[0] in substitution (s///) at /usr/share/perl/5.20/File/Basename.pm line 341, <trn> line 6.
    fileparse(): need a valid pathname at /usr/local/lib/sphinxtrain/scripts/00.verify/verify_all.pl line 352.

    What does this mean?

    Please let me know. Your help is very much appreciated.

    Thanks,
    Shashank

     
    • bic-user

      bic-user - 2015-04-22

      How do we prepare the transcription file?

      There are no specific tools for that.

      Duration and the training procedure are covered in the tutorial. Did you read http://cmusphinx.sourceforge.net/wiki/tutorialam carefully?
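For reference, the tutorial linked above describes a transcription file with one utterance per line, wrapped in <s> and </s> and followed by the utterance ID in parentheses, plus a fileids file that lists the matching audio paths without extension in the same order. Below is a minimal sketch of producing both from a single list so they stay in sync. The paths, texts, and the function name build_training_lists are made up for illustration and are not part of sphinxtrain.

```python
import os

def build_training_lists(utterances):
    """Return (transcription_lines, fileids_lines) from (path, text) pairs.

    `path` is assumed to be the audio path relative to the wav directory,
    without the file extension.
    """
    transcription, fileids = [], []
    for path, text in utterances:
        utt_id = os.path.basename(path)   # "speaker_1/file_1" -> "file_1"
        words = " ".join(text.split())    # collapse whitespace, keep one line
        transcription.append(f"<s> {words} </s> ({utt_id})")
        fileids.append(path)
    return transcription, fileids

# Hypothetical data, for illustration only.
utterances = [
    ("speaker_1/file_1", "hello world"),
    ("speaker_2/file_2", "foo  bar"),
]
transcription, fileids = build_training_lists(utterances)
print("\n".join(transcription))
print("\n".join(fileids))
```

Writing both lists from the same loop guarantees that line N of the transcription always describes line N of the fileids file.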

       
      • shashank

        shashank - 2015-04-22

        Yes, I read them. But I have a question. Can you please let me know whether the attached file is a valid transcription file?

        Also, I want to know the minimum amount of training data required. I mean, in minutes.

        Thank you

         
        • bic-user

          bic-user - 2015-04-22

          Can you please let me know whether the attached file is a valid transcription file?

          The transcription file looks OK.

          I want to know the minimum amount of training data required. I mean, in minutes.

          The tutorial gives clear suggestions on the amount of training data. Not sure what confuses you.

           
  • shashank

    shashank - 2015-04-22

    Really sorry for the trouble. I have a better picture now. A few steps failed while verifying the training files.

    Phase 4: Checking number of lines in the transcript file should match lines in fileids file
    FAILED
    Phase 5: Determine amount of training data, see if n_tied_states seems reasonable.
    Estimated Total Hours Training: 0.0990166666666667
    WARNING: Not enough data for the training
    FAILED
    Phase 6: Checking that all the words in the transcript are in the dictionary
    Words in dictionary: 133709
    Words in filler dictionary: 3
    WARNING: Bad line in transcript:
    FAILED
    Phase 7: Checking that all the phones in the transcript are in the phonelist, and all phones in the phonelist appear at least once
    passed

    And then it stopped. I understand that the training data is not sufficient, and I will work on that. But why did it fail in phase 4 and phase 6? Is anything wrong with the transcription file?

    Thank you

     
    • bic-user

      bic-user - 2015-04-22

      I think "Checking number of lines in the transcript file should match lines in fileids file" is self-descriptive enough. Otherwise, you need to provide both the fileids and transcription files so I can take a look.
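The Phase 4 condition can be checked by hand before running the trainer. The following is a rough sketch, not the logic of verify_all.pl itself: it assumes each transcript line ends with "(utterance_id)" and that the ID equals the last path component of the fileids entry on the same line. The function name compare_lists and the sample data are hypothetical.

```python
import re

# Assumed line shape: any text followed by "(utterance_id)" at the end.
UTT_ID = re.compile(r"\(([^()\s]+)\)\s*$")

def compare_lists(transcript_lines, fileid_lines):
    """Return a list of problems; an empty list means the files line up."""
    problems = []
    if len(transcript_lines) != len(fileid_lines):
        problems.append(
            f"line count differs: {len(transcript_lines)} transcript "
            f"vs {len(fileid_lines)} fileids"
        )
    pairs = zip(transcript_lines, fileid_lines)
    for n, (trn, fid) in enumerate(pairs, start=1):
        match = UTT_ID.search(trn)
        if match is None:
            problems.append(f"line {n}: no (utterance_id) at end of transcript line")
        elif match.group(1) != fid.strip().split("/")[-1]:
            problems.append(
                f"line {n}: id {match.group(1)!r} does not match fileid {fid.strip()!r}"
            )
    return problems

# Hypothetical data: a stray blank line shifts the transcript out of step.
transcript = ["<s> hello world </s> (file_1)", "", "<s> foo bar </s> (file_2)"]
fileids = ["speaker_1/file_1", "speaker_2/file_2"]
problems = compare_lists(transcript, fileids)
for problem in problems:
    print(problem)
```

A blank line counts as a line here, so an empty line in the middle or at the end of the transcript is one way the two counts can diverge.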

       
  • shashank

    shashank - 2015-04-22

    Phase 2: Checking to make sure there are not duplicate entries in the dictionary
    passed
    Phase 3: Check general format for the fileids file; utterance length (must be positive); files exist
    passed
    Phase 4: Checking number of lines in the transcript file should match lines in fileids file
    passed
    Phase 5: Determine amount of training data, see if n_tied_states seems reasonable.
    Estimated Total Hours Training: 0.542222222222222
    This is a small amount of data, no comment at this time
    WARNING
    Phase 6: Checking that all the words in the transcript are in the dictionary
    Words in dictionary: 133707
    Words in filler dictionary: 3
    passed
    Phase 7: Checking that all the phones in the transcript are in the phonelist, and all phones in the phonelist appear at least once
    passed

    I was able to fix the problems. This is what I got now. So, the only thing left is the required amount of data, right? Please let me know.

    Thank you so much for your quick responses.
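The Phase 6 check in the output above (every transcript word must be in the dictionary) can be approximated offline. This sketch assumes dictionary lines of the form "WORD PHONE PHONE ...", alternate pronunciations written as "WORD(2)", and a filler dictionary that defines <s>, </s>, and <sil>. The function name missing_words and the sample entries are illustrative, and the real verification script may differ in details.

```python
import re

def missing_words(transcript_lines, dictionary_lines, filler_lines=()):
    """Return the sorted transcript words that have no dictionary entry."""
    known = set()
    for line in list(dictionary_lines) + list(filler_lines):
        fields = line.split()
        if fields:
            # "WORD(2)" marks an alternate pronunciation of "WORD".
            known.add(re.sub(r"\(\d+\)$", "", fields[0]))
    missing = set()
    for line in transcript_lines:
        # Drop the trailing "(utterance_id)" before splitting into words.
        text = re.sub(r"\([^()]*\)\s*$", "", line)
        missing.update(word for word in text.split() if word not in known)
    return sorted(missing)

# Hypothetical data, for illustration only.
dictionary = ["hello HH AH L OW", "world W ER L D", "read R EH D", "read(2) R IY D"]
filler = ["<s> SIL", "</s> SIL", "<sil> SIL"]
transcript = ["<s> hello world </s> (file_1)", "<s> hello sphinx </s> (file_2)"]
result = missing_words(transcript, dictionary, filler)
print(result)
```

Any word the function reports needs either a dictionary entry or a correction in the transcript. The comparison is case-sensitive, so the transcript and dictionary must use the same case.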

     
  • bic-user

    bic-user - 2015-04-22

    So, the only thing left is the required amount of data, right?

    According to the info you provided, yes.

     
  • shashank

    shashank - 2015-04-22

    I'm currently using only 5 minutes of recordings for training. That could be the problem. I will work on increasing it.

    Thank you so much again.
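The "Estimated Total Hours Training" values in the Phase 5 output earlier in the thread are given in hours. Converting them to minutes makes them easier to compare with recording lengths. This is a small arithmetic sketch using the two values from the logs, and the helper name hours_to_minutes is made up.

```python
def hours_to_minutes(hours):
    """Convert a duration in hours to minutes."""
    return hours * 60

# Values copied from the two Phase 5 reports in this thread.
first_run = hours_to_minutes(0.0990166666666667)
second_run = hours_to_minutes(0.542222222222222)
print(f"{first_run:.1f} min, {second_run:.1f} min")  # prints "5.9 min, 32.5 min"
```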

     
