What is an efficient way of choosing the audio files for training and testing?
For now I have 1267 audio files in a female voice and 1267 in a male voice with the same transcriptions. I have taken the same 1013 files from each speaker as training data (2026 training files in total) and the remaining files as test data. Please let me know if this is the right way (80% train and 20% test). The data amounts to roughly 5.8 hours.
Initially my audio was:
$ soxi text_1.wav
Input File : 'text_1.wav'
Channels : 1
Sample Rate : 48000
Precision : 16-bit
Duration : 00:00:12.39 = 594854 samples ~ 929.459 CDDA sectors
File Size : 1.19M
Bit Rate : 768k
Sample Encoding: 16-bit Signed Integer PCM
This is the command I used to downsample to 16kHz:
$ sox text_1.wav -b 16 text_1-f.wav rate 16k
After resampling
Input File : 'text_1-f.wav'
Channels : 1
Sample Rate : 16000
Precision : 16-bit
Duration : 00:00:12.39 = 198285 samples ~ 929.461 CDDA sectors
File Size : 397k
Bit Rate : 256k
Sample Encoding: 16-bit Signed Integer PCM
Though training was successful for the word model (apart from errors about some audio files not reaching a final state, which I removed), I'm getting a word error rate of 93%.
I assumed that as the data increases the WER should decrease, but instead I've seen the WER increase, which means poor accuracy.
Is it because I have chosen a wrong way of dividing train and test? (Should some sentences from the training set also appear in the test set?)
Or is it because I downsampled the audio to 16kHz that the recognition is so poor?
Or is it due to the feature extraction?
Last edit: Tania Mendonca 2017-03-22
Quite a few questions in one place. Still:
> Please let me know if this is the right way (80% train and 20% test).

This ratio is fine. You could even use only 10% for testing to leave more data for training.
> I'm getting a word error rate of 93%.

That is far too high; something has likely gone wrong. You should analyse the alignment files in the decoding directory. You can also share your working directory for further analysis.
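For reference, WER is the word-level edit distance between the reference and the hypothesis, divided by the number of reference words. Your toolkit's scoring script computes this for you; the sketch below is only to show the metric itself:

```python
def wer(ref, hyp):
    """Word error rate: word-level Levenshtein distance / number of reference words."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = edit distance between the first i ref words and first j hyp words
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(h) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])  # substitution or match
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(r)][len(h)] / len(r)

print(wer("the cat sat on the mat", "the cat sat mat"))  # 2 deletions over 6 words
```

A 93% WER means almost every reference word was substituted, deleted, or replaced by an insertion, which usually points to a setup problem rather than a modelling one.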
> Should some sentences from the training set also appear in the test set?

No. Moreover, it is not recommended to have the same speakers in both train and test. It is good, however, for train and test to be phonetically balanced.
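With only two speakers a fully speaker-disjoint split is not possible here, but you can at least keep every sentence entirely on one side of the split. A minimal sketch, assuming utterances are keyed by a hypothetical speaker/sentence-id pair:

```python
import random

def split_by_sentence(sentence_ids, speakers, test_frac=0.2, seed=0):
    """Assign whole sentences to train or test, so no transcription appears
    in both sets; both speakers' recordings of a sentence stay together."""
    ids = sorted(sentence_ids)
    rng = random.Random(seed)  # fixed seed for a reproducible split
    rng.shuffle(ids)
    n_test = int(len(ids) * test_frac)
    test_sents = set(ids[:n_test])
    train, test = [], []
    for sent in sentence_ids:
        for spk in speakers:
            utt = f"{spk}-{sent}"  # hypothetical utterance-id scheme
            (test if sent in test_sents else train).append(utt)
    return train, test

train, test = split_by_sentence(range(1267), ["female", "male"])
print(len(train), len(test))  # 2028 506
```

Shuffling sentence ids before splitting (rather than taking the first 1013) also avoids any ordering bias in how the corpus was recorded.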
> Or is it because I downsampled the audio to 16kHz that the recognition is so poor?

Unlikely.
> Or is it due to the feature extraction?

It could be, but it is difficult to say without your working directory to replicate the problem.
When I checked the perplexity of my language model on the test set, I got a perplexity of 5700, and the OOV rate is 53%.
Is it because of this that my WER is 93%?
Yes.
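To see why: a word that never occurs in the language model's training text cannot be output by the decoder at all, so the OOV rate is a hard floor on the achievable WER. A toy sketch of how the OOV rate is computed from transcripts (function name is illustrative):

```python
def oov_rate(train_transcripts, test_transcripts):
    """Fraction of test tokens whose word never occurs in the training text.
    The decoder cannot hypothesise these words, so this bounds WER from below."""
    vocab = {w for line in train_transcripts for w in line.split()}
    test_tokens = [w for line in test_transcripts for w in line.split()]
    oov = sum(1 for w in test_tokens if w not in vocab)
    return oov / len(test_tokens)

train = ["the cat sat", "the dog ran"]
test = ["the cat jumped", "a bird flew"]
print(oov_rate(train, test))  # 4 of 6 test tokens are unseen
```

With 53% of test tokens unseen, at least every second word is guaranteed wrong before acoustics are even considered.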
Is there a way I can reduce the perplexity of the language model?
Use more relevant data in language model training.
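By way of illustration, here is a toy add-one-smoothed unigram model (real LM toolkits use n-grams with far better smoothing) showing that training text closer to the test domain yields lower perplexity:

```python
import math

def unigram_perplexity(train_tokens, test_tokens):
    """Perplexity of an add-one-smoothed unigram model over test_tokens."""
    counts = {}
    for w in train_tokens:
        counts[w] = counts.get(w, 0) + 1
    vocab = set(train_tokens) | set(test_tokens)  # closed vocabulary for smoothing
    total = len(train_tokens)
    log_prob = 0.0
    for w in test_tokens:
        p = (counts.get(w, 0) + 1) / (total + len(vocab))
        log_prob += math.log(p)
    return math.exp(-log_prob / len(test_tokens))

test = "the cat sat on the mat".split()
unrelated = "the dog".split()
relevant = "the cat sat on the mat the cat ran".split()
print(unigram_perplexity(unrelated, test) > unigram_perplexity(relevant, test))  # True
```

The same principle holds for the real model: adding in-domain text (or interpolating an in-domain LM with a background one) both shrinks the OOV rate and lowers perplexity.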