When I tried merging the default CMU Sphinx LM with a custom language model, I got the following output:
Reading in a 3-gram language model.
Number of 1-grams = 226.
Number of 2-grams = 913.
Number of 3-grams = 1595.
Reading unigrams...
Reading 2-grams...
Reading 3-grams...
Reading in a 3-gram language model.
Number of 1-grams = 72354.
Number of 2-grams = 6581523.
Number of 3-grams = 7704188.
Reading unigrams...
Reading 2-grams...
Error - Repeated 2-gram in ARPA format language model.
When I tried combining two custom-trained models, the execution was successful.
Any information will be helpful!!
I searched for duplicates in the big LM, and there are none. There are n-grams in the LM like:
-3.8544 service joined 0.0000
-4.3938 service joining 0.0000
-3.9777 service joint -0.1926
-3.8638 service jointly 0.0000
Are these considered duplicate n-grams?
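For what it's worth, those four entries are distinct bigrams (same first word, different second word), so they should not count as duplicates. One way to check for genuinely repeated 2-grams is a sketch like the following; `arpa_path` is a placeholder filename, and the parsing assumes the standard plain-text ARPA layout:

```python
# Sketch: scan the \2-grams: section of an ARPA LM for repeated
# word sequences. Assumes each entry is "logprob w1 w2 [backoff]".
def find_duplicate_bigrams(arpa_path):
    seen = set()
    duplicates = []
    in_bigrams = False
    with open(arpa_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line == "\\2-grams:":
                in_bigrams = True
                continue
            if in_bigrams:
                if line.startswith("\\"):   # next section begins
                    break
                if not line:
                    continue
                fields = line.split()
                # the bigram is the two words after the log-prob
                bigram = tuple(fields[1:3])
                if bigram in seen:
                    duplicates.append(bigram)
                seen.add(bigram)
    return duplicates
```

By this check, "service joined" and "service joining" are different keys and would never be reported.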
if (pos_of_novelty == i && j != 1)
    quit(-1, "Error - Repeated %d-gram in ARPA format language model.\n", i);
This is the code in lm_combine.c where I get the error.
Can you tell me what else might cause that error when combining two LMs?
Anything will be helpful!!
Regards,
Manoj
The command I used:
lm_combine.exe -lm1 custom.lm -lm2 en-70k-0.1.lm -weight w.wt -lm mix.lm
w.wt:
custom.lm 0.5
en-70k-0.1.lm 0.5
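As far as I understand, lm_combine linearly interpolates the two models' probabilities using these weights, so each shared n-gram in the mixture gets w1·P1 + w2·P2 in probability space. A small sketch of that arithmetic (the log-probability values below are made up for illustration, not real entries):

```python
import math

# Sketch of the linear interpolation lm_combine is meant to apply,
# assuming it mixes probabilities (not log-probs) with the weights
# from w.wt. Inputs are ARPA-style log10 probabilities.
def interpolate(logp1, logp2, w1=0.5, w2=0.5):
    """Mix two log10 probabilities with linear weights."""
    p_mix = w1 * 10 ** logp1 + w2 * 10 ** logp2
    return math.log10(p_mix)

# With equal weights, mixing two identical probabilities leaves
# them unchanged: interpolate(-1.0, -1.0) is -1.0.
```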
The message says that the big LM has duplicated n-grams, and that might be the case. You can fix duplicated n-grams with a text editor or with a script.
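I'm not aware of a ready-made dedup script shipping with the toolkit, but a minimal sketch that keeps only the first occurrence of each n-gram in every section could look like this (assuming a plain-text ARPA file that can be streamed line by line; the file names are examples):

```python
# Sketch: drop repeated n-gram entries from an ARPA file, keeping
# the first occurrence within each \N-grams: section. Assumes the
# usual entry layout "logprob w1 ... wN [backoff]".
def dedupe_arpa(src_path, dst_path):
    seen = set()
    order = 0                       # current n-gram order, 0 = outside a section
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            stripped = line.strip()
            if stripped.startswith("\\"):
                # section header like "\2-grams:", or "\data\" / "\end\"
                order = int(stripped[1]) if stripped.endswith("-grams:") else 0
                seen.clear()
                dst.write(line)
            elif order and stripped:
                # key on the n-gram's word tuple to spot exact repeats
                words = tuple(stripped.split()[1:order + 1])
                if words in seen:
                    continue        # skip the repeated entry
                seen.add(words)
                dst.write(line)
            else:
                dst.write(line)
```

Note that after removing entries you would also need to update the `ngram N=...` counts in the `\data\` header to match.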
Thanks for the information.
Is there any such script available?
Can you tell me what dataset was used to train the default language model?
Maybe it expects the n-grams to be sorted; try sorting with sphinx_lm_sort.
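The pos_of_novelty check quoted above compares each entry only with the one before it, so the reader appears to assume every \N-grams: section is in sorted order. sphinx_lm_sort produces that order; as a rough in-memory sketch of the same idea (not the actual tool, and assuming the model fits in memory):

```python
# Sketch: sort the entries of every \N-grams: section of an ARPA LM
# by their word tuple, leaving headers and other lines untouched.
def sort_arpa(lines):
    out, section = [], []
    order = 0                       # current n-gram order, 0 = outside a section

    def flush():
        # emit the buffered section in sorted n-gram order
        section.sort(key=lambda l: tuple(l.split()[1:order + 1]))
        out.extend(section)
        section.clear()

    for line in lines:
        stripped = line.strip()
        if stripped.startswith("\\"):
            flush()
            order = int(stripped[1]) if stripped.endswith("-grams:") else 0
            out.append(line)
        elif order and stripped:
            section.append(line)
        else:
            flush()
            out.append(line)
    flush()
    return out
```

A real LM of this size (7.7M trigrams) would be better handled by sphinx_lm_sort itself or an external sort, but the principle is the same.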