
Acoustic model training for 8 kHz British English telephone audio, with accuracy as a priority

  • Orest

    Orest - 2015-04-02

    Hi Nickolay, I have a few questions regarding sphinxtrain. I am trying CMU Sphinx with the aim of creating a good acoustic model for 8 kHz telephone recordings (large vocabulary) spoken with a British English accent, and then using pocketsphinx to convert the audio to text. I have access to 1000+ hours of audio recorded in British English, together with transcriptions.
    The model should be able to transcribe spontaneous British English telephone audio (average length 10 seconds; roughly 1 recording in 10 is longer than 20 seconds) with satisfactory accuracy. It is not dictation: the context can be anything and the speaker is not known.

    I tried model adaptation, and by experimenting with the parameters and a custom language model, the best configuration I could achieve was the en-us-8khz model adapted with a custom language model generated from the transcriptions. The accuracy was 52% on the test sets (measured with word_align.pl from sphinxtrain), although this number is inconclusive because the test set was very small. That is not enough, so I am now trying full model training by following the tutorial at http://cmusphinx.sourceforge.net/wiki/tutorialam.
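
    (For reference, I compute that number by aligning the reference transcriptions against the decoder hypotheses with word_align.pl, roughly like this; the file names are placeholders and the script comes from the sphinxtrain scripts directory:)

    # reference and hypothesis files: one utterance per line, in the same order
    perl word_align.pl test.transcription test.hyp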

    The aim is to tune training (and decoding in pocketsphinx) for accuracy rather than speed. The program needs to process audio files offline, not in a live mode, so speed is not the priority, and saving computational power at the expense of accuracy is not needed either.
    20 xRT on average is still acceptable.

    $CFG_STATESPERHMM = 3;
    

    Does it make sense to increase it to 5 in order to improve accuracy? Does pocketsphinx support models with 5 states per HMM?

    $CFG_HMM_TYPE = '.cont.'; # Sphinx 4, PocketSphinx
    

    I'm choosing continuous models for accuracy, because I read that PTM models are less accurate; on the other hand, I read that PTM models will be better supported in the future because of their good speed/accuracy ratio. Can a good PTM model match a good continuous model in terms of WER if both are trained on the same training database, or is the PTM model likely to always be approximately 10% less accurate?

    $CFG_FINAL_NUM_DENSITIES = 32;
    

    From reading the tutorial, a value of either 32 or 64 seems reasonable to me; would you try any other value?

    $CFG_N_TIED_STATES = 8000;
    

    A value in the range of 6000 to 12000 seems reasonable to me; would you try any other value?

    # (yes/no) Train multiple-gaussian context-independent models (useful
    # for alignment, use 'no' otherwise) in the models created
    # specifically for forced alignment
    $CFG_FALIGN_CI_MGAU = 'yes';
    # (yes/no) Train multiple-gaussian context-independent models (useful
    # for alignment, use 'no' otherwise)
    $CFG_CI_MGAU = 'yes';
    
    
    $CFG_FORCEDALIGN = 'yes';
    

    I'm keeping these values set to 'yes', together with $CFG_FORCEDALIGN ("# Use force-aligned transcripts (if available) as input to training").

    From what I understood, setting these values to 'yes' (together with CFG_FALIGN_CI_MGAU and CFG_CI_MGAU) causes sphinxtrain to filter out the samples whose audio is not considered a good fit for the transcription. (To use it, I compiled sphinx3_align from Sphinx 3 and placed it in my sphinxtrain installation folder under libexec/sphinxtrain, as instructed by the corresponding script.)
    This seems to increase the quality of the training in my case (I don't have conclusive results yet), because the transcriptions I have are not always accurate (approximately 1 out of 8 transcriptions is not 100% accurate), so this should filter out the audio samples that don't match their transcription well enough (where "well enough" depends on the $CFG_FORCE_ALIGN_BEAM value). Did I interpret this functionality correctly? Is there any place in the logdir where I can see the number of discarded sentences as a percentage?

    Is trainingTask/falignout/trainingTask.alignedfiles the list of files that survived the filtering done by sphinx3_align when forced alignment is enabled?
    If I want to calculate the percentage of discarded samples, is it correct to take the number of lines in trainingTask/falignout/trainingTask.alignedfiles and compare it with the number of lines in trainingTask/etc/trainingTask_train.fileids?
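
    (A minimal shell sketch of that comparison, assuming the paths above:)

    aligned=$(wc -l < trainingTask/falignout/trainingTask.alignedfiles)
    total=$(wc -l < trainingTask/etc/trainingTask_train.fileids)
    echo "discarded: $(( (total - aligned) * 100 / total ))%"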

    I would like to make the forced-alignment process stricter (with the idea of keeping only the samples with a perfect match to the transcription) by modifying $CFG_FORCE_ALIGN_BEAM. I don't mind discarding 50% of the samples if that keeps only the good ones and gives a better WER/accuracy on the test sets; from my understanding, it is worth discarding 50% of the samples to make sure I keep only the fit ones, as long as there are enough samples in the database. Do you think strict filtering is a good idea for this task?

    Sometimes training crashes at module 00 Phase 3 (Check general format for the fileids file; utterance length (must be positive); files exist)

    I get a long list of

    "WARNING: Error in '(trainingTask_train.fileids)'', the feature file '(path to file.mfc)' does not exist, or is empty
    

    When I check the log in logdir/000.comp_feat, I notice that at some position sphinx_fe fails with:

    INFO: sphinx_fe.c(1043): Processing 7674 utterances at position 145806
    ERROR: "sphinx_fe.c", line 129: Failed to read RIFF header: File exists
    Thu Apr  2 11:18:35 2015
    

    This causes all the subsequent IDs in that part (I'm using $CFG_NPART with values > 1) to fail. From my understanding, this happens because some files in my database (maybe one out of 200000 on average) are somehow corrupted (the ID that fails is not the first one).

    so I "soxi" the ".wav" file ID corresponding to the first error in the list of errors and in fact it says:

    soxi FAIL formats: can't open input file `631163.wav': WAVE: RIFF header not found
    

    (Sometimes the offending audio file that breaks the chain reports an "input/output error" instead.)

    If I run soxi on the other files in the list of errors (for example, the next one, corresponding to the second "the feature file '(path to file.mfc)' does not exist, or is empty" error), they seem to be OK according to soxi, for example:

    Input File     : '14141785.wav'
    Channels       : 1
    Sample Rate    : 8000
    Precision      : 16-bit
    Duration       : 00:00:11.48 = 91842 samples ~ 861.019 CDDA sectors
    File Size      : 184k
    Bit Rate       : 128k
    Sample Encoding: 16-bit Signed Integer PCM
    

    So my guess is that once an ID fails, it breaks the chain for all the other IDs in that NPART list, and the error goes unnoticed by the scripts until MODULE 00, Phase 3.
    I could just delete the ID from _train.fileids and _train.transcription, but I'm using a script that builds the training database from different files every time, so deleting the ID by hand would not solve my issue.
    My question is: is there any easy way to let MODULE 000 discard IDs that produce errors without crashing the training?
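
    (In the meantime, a possible workaround on my side would be to scan the fileids list with soxi before training and drop the unreadable entries. This is only a sketch: it assumes the wav files live under trainingTask/wav and that each transcription line ends with the utterance ID in parentheses, as in the usual sphinxtrain layout.)

    cd trainingTask
    cp etc/trainingTask_train.fileids fileids.ok
    cp etc/trainingTask_train.transcription transcription.ok
    while read -r id; do
        # drop any utterance whose wav file cannot be read by soxi
        if ! soxi "wav/${id}.wav" > /dev/null 2>&1; then
            echo "dropping corrupt file: ${id}"
            grep -v "^${id}$" fileids.ok > tmp && mv tmp fileids.ok
            grep -v "(${id})" transcription.ok > tmp && mv tmp transcription.ok
        fi
    done < etc/trainingTask_train.fileids
    mv fileids.ok etc/trainingTask_train.fileids
    mv transcription.ok etc/trainingTask_train.transcription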

    For the language model, I normally generate one with MITLM, using all the transcriptions in the training data (excluding the testing transcriptions). I noticed that there are tools that allow me to optimize the language model to minimize perplexity; would this optimization significantly improve my final accuracy on the test sets (for example, more than a 5% improvement)?
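
    (Concretely, I mean something along these lines, if I read the MITLM documentation correctly; the file names are placeholders, and -opt-perp is the kind of perplexity optimization I had in mind:)

    estimate-ngram -order 3 -text train_transcriptions.txt \
        -smoothing ModKN \
        -opt-perp heldout_transcriptions.txt \
        -write-lm trainingTask.lm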

    I had a look at the RNNLM Toolkit. If I understood correctly, it can create language models using a recurrent-neural-network approach; is it likely to produce language models that are better (for decoding the test sets) than the ones MITLM would create with default parameters from the same input text?
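
    (The kind of training run I have in mind looks roughly like this, based on the toolkit's example scripts; the hidden-layer size and other values are just guesses on my part:)

    rnnlm -train train_transcriptions.txt -valid heldout_transcriptions.txt \
          -rnnlm model.rnn -hidden 200 -class 100 -bptt 4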

    Do you think training a model with the BEEP dictionary is a better alternative to cmudict 0.7 for spontaneous British English 8 kHz telephone audio?

    What other values/parameters would you experiment with to increase accuracy other than the ones above?

    One last question: I had in mind to also try Kaldi. Do you know about Kaldi? Could Kaldi be a better fit for my task (prioritizing accuracy over speed for offline audio-to-text conversion, speaker independent, large vocabulary, 8 kHz unpredictable speech)?
    Do you think using DNNs with Kaldi is a good choice for my task? Is it likely to give a better WER/accuracy for my task?

    My first (and current) training results (Estimated Total Hours Training: 377.89) are:

    MODULE: DECODE Decoding using models previously trained
            Aligning results to find error rate
            SENTENCE ERROR: 89.3% (620/694)   WORD ERROR RATE: 64.5% (5486/8502)
    

    as shown in the sphinxtrain on-screen output.

    TOTAL Words: 8502 Correct: 5095 Errors: 5486
    TOTAL Percent correct = 59.93% Error = 64.53% Accuracy = 35.47%
    TOTAL Insertions: 2079 Deletions: 929 Substitutions: 2478
    

    That's the result of the same training run, taken from result/trainingTask.align.
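
    (For reference, these totals fit the usual WER bookkeeping, where errors = substitutions + deletions + insertions:)

    WER      = (2478 + 929 + 2079) / 8502 = 5486 / 8502 ≈ 64.53%
    Correct  = 5095 / 8502 ≈ 59.93%
    Accuracy = 100% - WER ≈ 35.47%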

    Leaving everything at the default except for NPART, my sphinx_train.cfg is:

    # Feature extraction parameters
    $CFG_WAVFILE_SRATE = 8000.0;
    $CFG_NUM_FILT = 15; # For wideband speech it's 25, for telephone 8khz reasonable value is 15
    $CFG_LO_FILT = 200; # For telephone 8kHz speech value is 200
    $CFG_HI_FILT = 3500; # For telephone 8kHz speech value is 3500
    $CFG_TRANSFORM = "dct"; # Previously legacy transform is used, but dct is more accurate
    $CFG_LIFTER = "22"; # Cepstrum lifter is smoothing to improve recognition
    $CFG_VECTOR_LENGTH = 13; # 13 is usually enough
    
    $CFG_HMM_TYPE = '.cont.'; # Sphinx 4, PocketSphinx
    $CFG_FINAL_NUM_DENSITIES = 32; # in the continuous-model elsif section
    
    $CFG_STATESPERHMM = 4;
    
    # (yes/no) Train multiple-gaussian context-independent models (useful
    # for alignment, use 'no' otherwise) in the models created
    # specifically for forced alignment
    $CFG_FALIGN_CI_MGAU = 'yes';
    # (yes/no) Train multiple-gaussian context-independent models (useful
    # for alignment, use 'no' otherwise)
    $CFG_CI_MGAU = 'yes';
    # (yes/no) Train context-dependent models
    $CFG_CD_TRAIN = 'yes';
    # Number of tied states (senones) to create in decision-tree clustering
    $CFG_N_TIED_STATES = 8000;
    # How many parts to run Forward-Backward estimatinon in
    $CFG_NPART = 46;
    
    
    # Use force-aligned transcripts (if available) as input to training
    $CFG_FORCEDALIGN = 'yes';
    
     

    Last edit: Orest 2015-04-02
  • Nickolay V. Shmyrev

    does it make sense to increase it to 5 in order to improve accuracy?

    No

    does pocket sphinx support models with HMM STATE of 5 ?

    Yes

    can a good PTM model match a good continuous model in terms of WER if they are both trained in the same training database?

    No

    or the PTM is likely to always be approximately 10% less accurate?

    Yes

    a value in the range of 6000 to 12000 seems reasonable to me, would you try any other value?

    No

    (approximately 1 out of 8 transcriptions are not 100% accurate), so this would filter out the audio samples that don't seem to correspond to the transcription with a good ratio (and the ratio depends from the $CFG_FORCE_ALIGN_BEAM value). Did I interpret this functionality correctly?

    Not exactly. The purpose of alignment is to select among pronunciation variants in your dictionary and insert silence where appropriate.

    Forced alignment is not a very good algorithm for filtering bad transcripts; it has significant disadvantages. For example, it can filter out correct transcripts as well.
    It can be used to some extent, but strict filtering does not improve accuracy. It is not recommended to reduce the beam, because that will filter out more correct transcripts and the model will not be able to learn.

    Kaldi implements more advanced algorithms for cleaning up the transcripts, which might be implemented in sphinxtrain one day. For example, there is the script find_bad_utts.sh, which uses a unigram decoder to figure out whether a transcript matches the audio.
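
    (For reference, in a standard Kaldi recipe the invocation is roughly like the line below; the directory names are the conventional ones from the egs/ recipes, so check the script's usage message for the exact arguments.)

    steps/cleanup/find_bad_utts.sh --nj 20 data/train data/lang exp/tri3 exp/tri3_bad_utts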

    Is there any place in the logdir where I can see the number of discarded sentences in the form of percentage?

    You can compare the number of input utterances with the number of aligned utterances in the falignout folder.

    causes all the subsequent IDs in that part (I'm using $CFG_NPART with values > 1) to fail.

    I've just committed a fix where it should issue an error and proceed. Thanks for the report.

    For the language model, I normally generate a language model with MITLM, I create it using all the transcriptions in the training data (excluding testing transcriptions), I noticed that there are tools that allow me to optimize the language model to minimize perplexity, would this process of optimization improve significantly my final Accuracy on testing sets? (for example more than 5% improvement ? )

    It is better to consider introducing more language model data from other sources; most algorithms are not as helpful as additional data. Another helpful thing would be to use RNNLM models, which tend to be more accurate. MITLM tricks are not going to be a significant advance.

    is it likely to create language models that are better (in decoding testing sets) than the ones MITLM would create with default parameters using the same input text?

    Yes

    Do you think training a model with beep dictionary is a better alternative than cmudict0.7 for spontaneous British English 8khz telephonic audio?

    The BEEP dictionary has a non-commercial license. There is no good free dictionary for UK English.

    Do you think using DNN's with Kaldi is a good choice for my task? is it likely to have a better WER/Accuracy rate for my task?

    Yes, Kaldi DNN is significantly more accurate.

    SENTENCE ERROR: 89.3% (620/694) WORD ERROR RATE: 64.5% (5486/8502)

    This is not a very good result; maybe your input data is too dirty, or there are issues with the language model. It is hard to say offhand.

     
  • Orest

    Orest - 2015-04-08

    I've just committed a fix where it should issue an error and proceed. Thanks for the report.

    Thanks Nickolay, but it seems that it didn't completely solve the issue. I deliberately added a wav for which the soxi command returns:

    soxi FAIL formats: can't open input file `631163.wav': WAVE: RIFF header not found
    

    I added it as the last line of the corresponding _train.fileids and _train.transcription, and set $CFG_NPART to 40.

    In Phase 3 it now issues the warning as you said and proceeds, but then it stops without feedback in Phase 7:

    Sphinxtrain path: /opt/sphinxtrain/lib/sphinxtrain
    Sphinxtrain binaries path: /opt/sphinxtrain/libexec/sphinxtrain
    Running the training
    MODULE: 000 Computing feature from audio files
    Feature extraction is done
    MODULE: 00 verify training files
        Phase 1: Checking to see if the dict and filler dict agrees with the phonelist file.
            Found 129254 words using 40 phones
        Phase 2: Checking to make sure there are not duplicate entries in the dictionary
        Phase 3: Check general format for the fileids file; utterance length (must be positive); files exist
    WARNING: Error in '/path-to_train.fileids', the feature file 'path-to-mfc file.mfc' does not exist, or is empty
        Phase 4: Checking number of lines in the transcript file should match lines in fileids file
        Phase 5: Determine amount of training data, see if n_tied_states seems reasonable.
            Estimated Total Hours Training: 56.4381803418803 (just for testing)
            Rule of thumb suggests 3000, however there is no correct answer
        Phase 6: Checking that all the words in the transcript are in the dictionary
            Words in dictionary: 129251
            Words in filler dictionary: 3
        Phase 7: Checking that all the phones in the transcript are in the phonelist, and all phones in the phonelist appear at least once
    

    In the logdir, only the "000.comp_feat" folder is created, with 80 log files (40 for testing and 40 for training), and when I open the last training log file, at the end it says:

    INFO: sphinx_fe.c(1049): Processing all remaining utterances at position 28743
    ERROR: "sphinx_fe.c", line 129: Failed to read RIFF headerWed Apr  8 15:53:13 2015
    

    To be sure that's the reason, I deleted the offending file from _train.fileids and _train.transcription in etc/, deleted everything except etc/ and wav/, and ran the training again; this time the training worked.
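
    (For anyone reading later, a couple of grep one-liners are enough for that kind of removal; this is just a sketch, assuming the ID appears once per file and the transcription lines end with the utterance ID in parentheses:)

    grep -v '^631163$' etc/trainingTask_train.fileids > tmp && mv tmp etc/trainingTask_train.fileids
    grep -v '(631163)' etc/trainingTask_train.transcription > tmp && mv tmp etc/trainingTask_train.transcription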

    I did this test after reinstalling sphinxbase from https://sourceforge.net/p/cmusphinx/code/HEAD/tree/trunk/sphinxbase/ (where I saw that you modified sphinx_fe.c)

    (and my sphinxtrain is from github)

    How can I have sphinxtrain skip training/processing files that produce errors during the sphinx_fe stage?

     

    Last edit: Orest 2015-04-08
  • Orest

    Orest - 2015-04-22

    It is better to consider to introduce more language model data from other sources. Most algorithms are not as helpful as additional data.

    I didn't understand this; why is it better to consider introducing more language model data from other sources?
    Let's assume I want to transcribe speaker-independent telephone audio on a certain topic (for example, product feedback),
    and let's assume I have 1000 hours of audio + transcriptions, where those 1000 hours represent the target audio/topic/accent I need to transcribe, and let's say the 1000 hours correspond to 1400000 audio files (and 1400000 transcriptions).

    I can train an acoustic model with sphinxtrain from this training database, but for the language model, why is it better to add external language-model data to my existing 1400000 transcriptions rather than just building a language model from those 1400000 transcriptions alone?

    Another helpful thing would be to use RNNLM models which tend to be more accurate. MITLM tricks are not going to be significant advance.

    I tried to familiarize myself with the RNNLM Toolkit (a very cool project) with the idea of creating an ARPA-format language model for pocketsphinx, but recurrent nets are not finite state machines, so even if a function were implemented to generate an ARPA-format language model with the RNNLM Toolkit, it would only be an approximation, and it still wouldn't use an RNN language model during decoding.
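
    (The approximation I mean is the usual one of sampling a large amount of text from the trained RNN and estimating a conventional n-gram on it; a rough sketch, assuming the toolkit's -gen option and the MITLM command from before:)

    # generate a synthetic corpus from the trained RNN LM, then fit an ARPA n-gram to it
    rnnlm -rnnlm model.rnn -gen 10000000 > generated.txt
    estimate-ngram -order 3 -text generated.txt -write-lm approx.lm
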
    Nickolay, do you think implementing RNN language models (in decoding) in cmu-sphinx would be a good idea?

     

    Last edit: Orest 2015-04-22
    • Nickolay V. Shmyrev

      Nickolay, do you think implementing RNN language models (in decoding) in cmu-sphinx would be a good idea?

      It might be a good idea for you; we do not plan to implement them ourselves.

       

