Hi,
I am currently trying to train the model, and I have a few questions about it.
How do we prepare the transcription file? Manually, or are there any tools for that?
Can we have three lines between <s> and </s>?
How much training data is needed, in minutes? I have a recording of 50 minutes. Will a chunk of 3 minutes be sufficient to train the model?
I tried running the sphinxtrain run command and got this error:
Use of uninitialized value $_[0] in substitution (s///) at /usr/share/perl/5.20/File/Basename.pm line 341, <trn> line 6.
fileparse(): need a valid pathname at /usr/local/lib/sphinxtrain/scripts/00.verify/verify_all.pl line 352.
What does this mean?
Please let me know. Your help is very much appreciated.
Thanks,
Shashank
There are no specific tools for that.
The amount of data and the training procedure are covered in the tutorial. Did you read http://cmusphinx.sourceforge.net/wiki/tutorialam carefully?
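Since there is no dedicated tool, the two control files are often produced by a small script. Below is a minimal sketch, assuming the layout described in the tutorial: one utterance per line, written as `<s> text </s> (utterance_id)` in the transcription file, with the matching relative path (no extension) on the same line number of the fileids file. The function name `build_control_lines` and the sample utterances are hypothetical, not part of sphinxtrain.

```python
def build_control_lines(utterances):
    """Build matching fileids and transcription lines.

    `utterances` is a list of (relative_path_without_extension, text) pairs.
    Both returned lists have one entry per utterance, in the same order.
    """
    fileid_lines = []
    transcript_lines = []
    for fileid, text in utterances:
        # Assumption: the last path component is used as the utterance id.
        utt_id = fileid.split("/")[-1]
        # Collapse all whitespace (including newlines) so that each
        # utterance stays on exactly one line.
        words = " ".join(text.split())
        fileid_lines.append(fileid)
        transcript_lines.append(f"<s> {words} </s> ({utt_id})")
    return fileid_lines, transcript_lines


# Hypothetical example data.
fileids, transcript = build_control_lines([
    ("speaker_1/file_1", "hello world"),
    ("speaker_2/file_2", "foo\nbar"),
])
print("\n".join(fileids))
print("\n".join(transcript))
```

Each list can then be written out with one entry per line. Because the text is collapsed onto a single line, a transcript that was typed across several lines still ends up as one line per utterance.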
Yes, I read it. But I have a question. Can you please let me know whether the attached file is a valid transcription file?
Also, I want to know the minimum amount of training data required, in minutes.
Thank you
The transcription file looks OK.
The tutorial gives clear suggestions on the amount of training data. I'm not sure what confuses you.
Really sorry for troubling you. I have a better picture now. A few steps failed while verifying the training files.
Phase 4: Checking number of lines in the transcript file should match lines in fileids file
FAILED
Phase 5: Determine amount of training data, see if n_tied_states seems reasonable.
Estimated Total Hours Training: 0.0990166666666667
WARNING: Not enough data for the training
FAILED
Phase 6: Checking that all the words in the transcript are in the dictionary
Words in dictionary: 133709
Words in filler dictionary: 3
WARNING: Bad line in transcript:
FAILED
Phase 7: Checking that all the phones in the transcript are in the phonelist, and all phones in the phonelist appear at least once
passed
And then it stopped. I understand that the training data is not sufficient, and I will work on that. But why did it fail in phase 4 and phase 6? Is anything wrong with the transcription file?
Thank you
I think "Checking number of lines in the transcript file should match lines in fileids file" is self-descriptive enough. Otherwise, you need to provide both the fileids and transcription files so I can take a look.
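The Phase 4 and Phase 6 checks can be approximated offline before running the trainer. The sketch below is not a port of verify_all.pl. It assumes transcript lines of the form `<s> text </s> (utterance_id)`, fileids that end in the utterance id, and a dictionary already loaded as a set of words. The names `check_alignment` and `missing_words` are hypothetical. One plausible cause of a "Bad line in transcript" warning with nothing after the colon is a blank or wrapped line, although the thread does not confirm this.

```python
import re

# Assumed line shape: "<s> some words </s> (utterance_id)"
TRANSCRIPT_LINE = re.compile(r"^(?P<text>.*?)\s*\((?P<utt_id>[^()\s]+)\)\s*$")


def check_alignment(fileid_lines, transcript_lines):
    """Return a list of problems; an empty list means none were found."""
    problems = []
    if len(fileid_lines) != len(transcript_lines):
        problems.append(
            f"line count differs: {len(fileid_lines)} fileids, "
            f"{len(transcript_lines)} transcript lines"
        )
    pairs = zip(fileid_lines, transcript_lines)
    for number, (fileid, line) in enumerate(pairs, start=1):
        match = TRANSCRIPT_LINE.match(line.strip())
        if match is None:
            # Blank lines and lines without "(id)" end up here.
            problems.append(f"transcript line {number}: no trailing (utterance_id)")
            continue
        expected = fileid.strip().split("/")[-1]
        if match.group("utt_id") != expected:
            problems.append(
                f"transcript line {number}: id {match.group('utt_id')!r} "
                f"does not match fileid {expected!r}"
            )
    return problems


def missing_words(transcript_lines, dictionary_words,
                  filler_words=("<s>", "</s>", "<sil>")):
    """Return the sorted transcript words found in neither word list."""
    known = set(dictionary_words) | set(filler_words)
    missing = set()
    for line in transcript_lines:
        match = TRANSCRIPT_LINE.match(line.strip())
        text = match.group("text") if match else line
        missing.update(word for word in text.split() if word not in known)
    return sorted(missing)
```

Both functions take plain lists of lines, for example the result of `splitlines()` on each file's contents, so a stray blank line at the end of the transcript shows up as a line-count difference.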
Phase 2: Checking to make sure there are not duplicate entries in the dictionary
passed
Phase 3: Check general format for the fileids file; utterance length (must be positive); files exist
passed
Phase 4: Checking number of lines in the transcript file should match lines in fileids file
passed
Phase 5: Determine amount of training data, see if n_tied_states seems reasonable.
Estimated Total Hours Training: 0.542222222222222
This is a small amount of data, no comment at this time
WARNING
Phase 6: Checking that all the words in the transcript are in the dictionary
Words in dictionary: 133707
Words in filler dictionary: 3
passed
Phase 7: Checking that all the phones in the transcript are in the phonelist, and all phones in the phonelist appear at least once
passed
I was able to fix the problems. This is what I got now. So, the only thing left is the required amount of data, right? Please let me know.
Thank you so much for your quick responses.
According to the info you provided, yes.
I'm currently using only 5 minutes of recordings for training. That could be the problem. I will work on increasing it.
Thank you so much again.
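Phase 5 reports its estimate in hours, which is easy to misread when thinking in minutes. Here is a small sketch of the conversion applied to the two figures from the logs above. The helper `estimated_hours` and its 16 kHz default sample rate are assumptions for illustration, not taken from sphinxtrain.

```python
def hours_to_minutes(hours):
    """Convert a duration in hours to minutes."""
    return hours * 60.0


def estimated_hours(sample_counts, sample_rate=16000):
    """Total audio length in hours from per-file sample counts.

    Assumption: every file uses the same sample rate (16 kHz here).
    """
    return sum(sample_counts) / sample_rate / 3600.0


# The two "Estimated Total Hours Training" values reported in this thread.
for hours in (0.0990166666666667, 0.542222222222222):
    print(f"{hours:.4f} h = {hours_to_minutes(hours):.1f} min")
# prints:
# 0.0990 h = 5.9 min
# 0.5422 h = 32.5 min
```

So the first run saw about 6 minutes of audio and the second about 32.5 minutes.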