CMU Sphinx / Forums / Help: Need help in training DB

shashank - 2015-04-22

Hi,

I am currently trying to train the model. I have a few questions regarding this.

How do we prepare the transcription file? Manually, or are there any tools to do that?
Can we have three lines between <s> and </s>?

How much training data is needed, in minutes? I have a recording of 50 minutes. Will a chunk of 3 minutes be sufficient to train the model?

I tried running sphinxtrain run command. I got this error.

Use of uninitialized value $_[0] in substitution (s///) at /usr/share/perl/5.20/File/Basename.pm line 341, <trn> line 6.
fileparse(): need a valid pathname at /usr/local/lib/sphinxtrain/scripts/00.verify/verify_all.pl line 352.

What does this mean?

Please let me know. Your help is very much appreciated.

Thanks,
Shashank
- bic-user - 2015-04-22
  
  How do we prepare the transcription file?
  
  no specific tools for that
  
  The amount of data and the training procedure are covered in the tutorial. Did you read http://cmusphinx.sourceforge.net/wiki/tutorialam carefully?
  
  - shashank - 2015-04-22
    
    Yes, I read it. But I have a question. Can you please let me know whether the attached file is a valid transcription file?
    
    Also, I want to know the minimum amount of training data required. I mean, in minutes.
    
    Thank you
    
    transcription
    
    - bic-user - 2015-04-22
      
      Can you please let me know whether the attached file is a valid transcription file?
      
      The transcription file looks OK.
      
      I want to know the minimum amount of training data required. I mean, in minutes.
      
      The tutorial gives clear suggestions on the amount of training data. Not sure what confuses you.
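
As a reference for the transcription format discussed above, here is a minimal Python sketch that builds matching fileids and transcription lines by hand. It assumes the convention shown in the tutorial, where each transcript line looks like `<s> text </s> (utterance_id)` and the id is the last path component of the corresponding fileids entry. The function name `build_training_lists` and the sample paths are hypothetical, not part of SphinxTrain.

```python
def build_training_lists(utterances):
    """Build parallel fileids and transcription lines.

    utterances: list of (relative_path_without_extension, text) pairs.
    Hypothetical helper for illustration; not a SphinxTrain API.
    """
    fileids, transcript = [], []
    for path, text in utterances:
        # Assumed convention: the utterance id is the file's base name.
        utt_id = path.rsplit("/", 1)[-1]
        # Collapse repeated whitespace so each utterance stays on one line.
        words = " ".join(text.split())
        fileids.append(path)
        transcript.append(f"<s> {words} </s> ({utt_id})")
    return fileids, transcript


fileids, transcript = build_training_lists([
    ("speaker_1/file_1", "hello world"),
    ("speaker_1/file_2", "foo  bar"),
])
```

Writing both lists from the same source keeps them the same length and in the same order, which is what the verification step expects.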
      

shashank - 2015-04-22

Really sorry for the trouble. I have a better picture now. A few steps failed while verifying the training files.

Phase 4: Checking number of lines in the transcript file should match lines in fileids file
FAILED
Phase 5: Determine amount of training data, see if n_tied_states seems reasonable.
Estimated Total Hours Training: 0.0990166666666667
WARNING: Not enough data for the training
FAILED
Phase 6: Checking that all the words in the transcript are in the dictionary
Words in dictionary: 133709
Words in filler dictionary: 3
WARNING: Bad line in transcript:
FAILED
Phase 7: Checking that all the phones in the transcript are in the phonelist, and all phones in the phonelist appear at least once
passed

And it stopped. I understand that the training data is not sufficient, and I will work on that. But why did it fail in Phase 4 and Phase 6? Is anything wrong with the transcription file?

Thank you

- bic-user - 2015-04-22
  
  I think "Checking number of lines in the transcript file should match lines in fileids file" is self-descriptive enough. Otherwise, you need to provide both the fileids and the transcription file so I can take a look.
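
To make the Phase 4 check concrete, here is a minimal Python sketch of the same idea. It is not SphinxTrain's own code: `check_alignment` is a hypothetical helper, and it assumes the tutorial's convention that each transcript line ends with the utterance id in parentheses, matching the last path component of the fileids line at the same position.

```python
import re
from pathlib import PurePosixPath

# Matches a trailing "(utterance_id)" at the end of a transcript line.
UTT_ID = re.compile(r"\(([^()\s]+)\)\s*$")


def check_alignment(fileids, transcript):
    """Return a list of problems found; an empty list means the inputs agree.

    Hypothetical helper for illustration; not part of SphinxTrain.
    """
    problems = []
    if len(fileids) != len(transcript):
        problems.append(
            f"line count mismatch: {len(fileids)} fileids "
            f"vs {len(transcript)} transcript lines"
        )
    # Compare line by line as far as both lists go.
    for n, (fid, line) in enumerate(zip(fileids, transcript), start=1):
        if not line.strip():
            problems.append(f"transcript line {n}: empty line")
            continue
        match = UTT_ID.search(line)
        if match is None:
            problems.append(f"transcript line {n}: no (utterance_id) at end")
            continue
        expected = PurePosixPath(fid.strip()).name
        if match.group(1) != expected:
            problems.append(
                f"transcript line {n}: id {match.group(1)!r} "
                f"does not match fileid {expected!r}"
            )
    return problems


problems = check_alignment(
    ["speaker_1/file_1", "speaker_1/file_2"],
    ["<s> hello world </s> (file_1)", "", "<s> foo </s> (file_2)"],
)
```

In this made-up example a stray empty line in the transcript shows up both as a count mismatch and as a bad line. That is one plausible, unconfirmed explanation for Phase 4 and Phase 6 failing together.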
  

shashank - 2015-04-22

Phase 2: Checking to make sure there are not duplicate entries in the dictionary
passed
Phase 3: Check general format for the fileids file; utterance length (must be positive); files exist
passed
Phase 4: Checking number of lines in the transcript file should match lines in fileids file
passed
Phase 5: Determine amount of training data, see if n_tied_states seems reasonable.
Estimated Total Hours Training: 0.542222222222222
This is a small amount of data, no comment at this time
WARNING
Phase 6: Checking that all the words in the transcript are in the dictionary
Words in dictionary: 133707
Words in filler dictionary: 3
passed
Phase 7: Checking that all the phones in the transcript are in the phonelist, and all phones in the phonelist appear at least once
passed

I was able to fix the problems. This is what I got now. So, the only thing left is the required amount of data, right? Please let me know.

Thank you so much for your quick responses.


bic-user - 2015-04-22

So, the only thing left is the required amount of data, right?

According to the info you have provided, yes.


shashank - 2015-04-22

I'm currently using only 5 minutes of recordings for training. That could be the problem. I will work on increasing it.

Thank you so much again.
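
For reference, the "Estimated Total Hours Training" values that Phase 5 reported in this thread convert to minutes as sketched below. The helper name `hours_to_minutes` is only illustrative; the two input values are copied from the logs above.

```python
def hours_to_minutes(hours):
    """Convert a duration in hours to minutes."""
    return hours * 60.0


# Values copied from the two Phase 5 reports in this thread.
first_run = hours_to_minutes(0.0990166666666667)
second_run = hours_to_minutes(0.542222222222222)

print(f"first run:  {first_run:.2f} minutes")   # 5.94
print(f"second run: {second_run:.2f} minutes")  # 32.53
```

So the first run saw roughly 6 minutes of audio and the second roughly 32.5 minutes.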
