CMU Sphinx / Forums / Help: Need to add or replace words in the pocketsphinx dictionary

RUPA SINGH - 2016-11-05

Hi Nickolay,
I am following the CMUSphinx tutorial and have added and replaced some words in the existing dictionary. For example:
1 W AH N
10 W AH N Z IY R OW
11 W AH N W AH N
11TH W AH N W AH N T IY EY CH
12 W AH N T UW
19 W AH N AY N
1939 W AH N AY N TH R IY N AY N
1946 W AH N AY N F OW R S IH K S
1984 W AH N AY N EY T F OW R
1985 W AH N AY N EY T F AY V
1988 W AH N AY N EY T EY T
1989 W AH N AY N EY T N AY N
1ST W AH N EH S T IY
2 T UW
25 T UW F AY V
3 TH R IY
39 TH R IY N AY N
4 F OW R
7000 S EH V AH N Z IY R OW Z IY R OW Z IY R OW
72 S EH V AH N T UW
7500 S EH V AH N F AY V Z IY R OW Z IY R OW
8 EY T
9 N AY N

I have replaced '8' with 'EIGHT'.

output before replacement : YOU TO ALL IS THAT IS THAT IS THE MOTOR USE AUTOS YOU TO AS BY THE LORD WHO VEHICLE ACT. THE PROVISIONS OF CHAPTER 8 THAT IS REGARDING THE TP THE THE SHORT IS WAS IN THE EFFECT DEAL WITH EFFECT FROM JULY AMENDMENT IN WHAT IS THE THAT STILL TALKING LORD THE MOTOR VEHICLE THAT 19 IN

output after replacement : YOU TO ALL IS THAT IS THAT IS THE MOTOR USE AUTOS YOU TO AS BY THE LORD WHO VEHICLE ACT. THE PROVISIONS OF CHAPTER THE THAT IS REGARDING THE TP THE THE SHORT IS WAS IN THE EFFECT DEAL WITH EFFECT FROM JULY AMENDMENT IN WHAT IS THE THAT STILL TALKING LORD THE MOTOR VEHICLE THAT 19 IN

In the second output, 'EIGHT' should appear instead of 'THE'. I am unable to add or replace words in the dictionary.

Thank you

- Nickolay V. Shmyrev - 2016-11-05
  
  You need to update the language model, not just the dictionary
  

RUPA SINGH - 2016-11-05

Hi Nickolay,
Thank you so much for the reply.
I have updated the language model, replaced '8' with 'EIGHT' and tested. It's working now.
Output: YOU TO ALL IS THAT IS THAT IS THE MOTOR USE AUTOS YOU TO AS BY THE LORD WHO VEHICLE ACT
THE PROVISIONS OF CHAPTER EIGHT THAT IS REGARDING THE TP THE THE SHORT IS WAS IN THE EFFECT DEAL WITH EFFECT FROM JULY AMENDMENT IN WHAT IS THE THAT STILL TALKING LORD THE MOTOR VEHICLE THAT 19 IN

I checked this for one word, but what if I have multiple words to add to the dictionary? How do I update the language model with those words?

Thank you

- Nickolay V. Shmyrev - 2016-11-05
  
  It is covered in http://cmusphinx.sourceforge.net/wiki/tutoriallmadvanced
  
  For numbers it is recommended to postprocess the text output, not insert numbers in the language model.
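A minimal sketch of the post-processing suggested above, assuming the language model and dictionary contain spelled-out digit words (EIGHT, NINE, ...) and that numbers are spoken digit by digit, as in the dictionary entries earlier in this thread. The names `DIGIT_WORDS` and `digits_from_words` are made up for illustration and are not part of pocketsphinx.

```python
# Hypothetical post-processing step: the recognizer emits spelled-out
# digit words, and this helper rewrites them as digits afterwards.
DIGIT_WORDS = {
    "ZERO": "0", "ONE": "1", "TWO": "2", "THREE": "3", "FOUR": "4",
    "FIVE": "5", "SIX": "6", "SEVEN": "7", "EIGHT": "8", "NINE": "9",
}


def digits_from_words(hypothesis: str) -> str:
    """Replace digit words with digits, joining consecutive digits."""
    out = []
    run = ""  # digits collected from consecutive digit words
    for token in hypothesis.split():
        digit = DIGIT_WORDS.get(token.upper())
        if digit is not None:
            run += digit
            continue
        if run:
            # A non-digit word ends the current run of digits.
            out.append(run)
            run = ""
        out.append(token)
    if run:
        out.append(run)
    return " ".join(out)


print(digits_from_words("THE PROVISIONS OF CHAPTER EIGHT"))
print(digits_from_words("WITH EFFECT FROM JULY ONE NINE FOUR SIX"))
```

This naive mapping also rewrites ONE when it is used as a pronoun, so a real post-processor would need some context rules on top of it.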
  

RUPA SINGH - 2016-11-05

Thanks for the reply.
Earlier I used the LM tool to create the language model. For testing purposes I created an LM from 20 sentences. Can I use the LM tool instead of the SRILM tool for a large-scale language model?

Thank you

- Nickolay V. Shmyrev - 2016-11-05
  
  Can I use the LM tool instead of the SRILM tool for a large-scale language model?
  
  No
  

RUPA SINGH - 2016-11-06

Hi Nickolay,

Thank you for your response. I have installed the SRILM tool and run this command:

ngram-count -text /home/eight/Downloads/chapter20.txt -lm newlm.lm

Output: a new language model called 'newlm.lm' was generated.

Suppose a word, e.g. 'arms', is in the new language model but not in the dictionary. To add that word to the dictionary, can I directly add one line, e.g. arms AA R M Z, or do I have to do something else to update it?

Thank you

Last edit: Nickolay V. Shmyrev 2016-11-06
- Nickolay V. Shmyrev - 2016-11-06
  
  To add that word to the dictionary, can I directly add one line, e.g. arms AA R M Z, or do I have to do something else to update it?
  
  Yes you can
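Before adding lines by hand it helps to know which words are missing. The following is a rough sketch, assuming the plain `word PH1 PH2 ...` dictionary format shown in this thread, with alternative pronunciations written as `word(2)`. The helper names and the sample entries are illustrative only. It lower-cases both sides, which assumes a lower-case dictionary such as cmudict-en-us.dict.

```python
def load_dictionary_words(dict_lines):
    """Collect the words from dictionary lines of the form 'word PH1 PH2 ...'."""
    words = set()
    for line in dict_lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        # Alternative pronunciations are written as word(2), word(3), ...
        word = parts[0].split("(")[0]
        words.add(word.lower())
    return words


def missing_words(corpus_lines, dict_words):
    """Return corpus words that have no dictionary entry, sorted."""
    seen = set()
    for line in corpus_lines:
        seen.update(token.lower() for token in line.split())
    return sorted(seen - dict_words)


# Illustrative sample data, not taken from the files in this thread.
dictionary = [
    "motor M OW T ER",
    "vehicle V IY HH IH K AH L",
    "vehicle(2) V IY IH K AH L",
]
corpus = ["motor vehicle arms"]
print(missing_words(corpus, load_dictionary_words(dictionary)))
```

In practice the two lists would be read from the language model training text and the dictionary file. Each missing word then needs one new dictionary line.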
  

RUPA SINGH - 2016-11-06

Hi, I am so sorry to bother you. I am not familiar with this.
Earlier I used the online LM tool to create both the language model and the dictionary at the same time. http://www.speech.cs.cmu.edu/tools/lmtool-new.html

Now I am using the SRILM tool for a large-scale language model. I used the 'ngram-count' command above to create the language model. Is it possible to also create the phonetic dictionary using SRILM?

Thank you so much.

- Nickolay V. Shmyrev - 2016-11-06
  
  Is it possible to also create the phonetic dictionary using SRILM?
  
  No
  
  You can use https://github.com/cmusphinx/g2p-seq2seq instead.
  

RUPA SINGH - 2016-11-06

Okay, thanks. I will use 'g2p-seq2seq' for the dictionary.
Where can I get the text for a better, large-scale language model?

Thank you

- Nickolay V. Shmyrev - 2016-11-07
  
  Crawl the web
  

RUPA SINGH - 2016-11-07

Hey,
Thank you for your response.
For testing purposes, I created a language model using the SRILM tool, and right now I am using the cmudict-en-us.dict dictionary. I followed the CMUSphinx tutorial and trained the acoustic model.

I have used a command to recognize the audio file:
pocketsphinx_continuous -hmm en-us-adapt -lm chapter.lm.bin -dict cmudict-en-us.dict -mllr mllr_matrix -infile Chapter_2_2.wav

Sentence in transcription:
you to all these factors practice of motor insurance is influenced by the motor vehicle act the provisions of chapter 8 that is regarding the TP insurance was made effective with effect from July 1946 still talking about the motor vehicles act of 19

Output by using the above command:
to the law this that you is that is award various autos the was why is the motor vehicle act
the provisions of chapter the that is regarding the p p p short is was is that effect deal with effect from to amendment in what is the this to in award the motor vehicle sacrament in

The accuracy is not good. I have added some words to the dictionary that are in the language model, for example:
8 EY T
19 W AH N AY N

Some words are already in both the dictionary and the language model, but they are still not recognized. I don't know where I am going wrong. Can you please help?

Thank you

- Nickolay V. Shmyrev - 2016-11-07
  
  To get help on the accuracy you need to provide the data files that you used.
  

RUPA SINGH - 2016-11-07

I created the language model from a text file, data.txt, using the SRILM tool.
Please find the file attached.

data.txt

- Nickolay V. Shmyrev - 2016-11-08
  
  It is better to provide everything as a single archive. You can upload it to Dropbox or Google Drive and post a link here.
  

RUPA SINGH - 2016-11-09

Hi Nickolay,
Thanks for the reply. I have attached a link; kindly go through it.
It contains:
- six wav files: Chapter_2_0.wav, Chapter_2_1.wav, Chapter_2_2.wav,.....
- language model: chapter2.lm, chapter2.lm.bin
- dictionary: cmudict-en-us.dict
- transcription and fileids: motor89.transcription, motor89.fileids
- files generated while adapting the acoustic model: gauden_count, mixw_counts, mllr_matrix, tmat_counts
- en-us-adapt folder

I trained and tested on 89 wav files; here I am attaching 6 of them. If you need the other files as well, please let me know and I will upload them. Once again, thank you very much for the help.
https://drive.google.com/drive/folders/0B_IHphmLx3m7SFB5MUV3YTBXbkk?usp=sharing

- Nickolay V. Shmyrev - 2016-11-09
  
  MLLR adaptation is not effective for a PTM model; you need to use MAP adaptation. You also need to test the adaptation accuracy as described in the adaptation tutorial.
  
  Ideally you also need to use much more adaptation data. You even have to train an Indian English model to get good accuracy.
  
  A special dictionary has to be designed for Indian English for training too. For example, if you replace
  
  8 EY T
  
  with
  
  8 IY T
  
  it will recognize your sample better.
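To check whether a change such as MAP adaptation or a dictionary edit actually helps, accuracy can be measured as word error rate (WER) against the reference transcription. The following is a minimal sketch using a standard word-level edit distance. It is not the scoring script from the adaptation tutorial, and the function name `word_error_rate` is an assumption made for this example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] is the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


reference = "the provisions of chapter 8 that is regarding the tp insurance"
hypothesis = "the provisions of chapter the that is regarding the p p p short"
print(f"WER: {word_error_rate(reference, hypothesis):.2f}")
```

Running it once with the original model and once with the adapted model on the same held-out files shows whether the adaptation improved anything.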
  

RUPA SINGH - 2016-11-09

I am so sorry, I forgot to mention that I am doing this for an Indian English accent. I have used both MLLR and MAP adaptation and created the 'en-us-adapt' folder using MAP.
I couldn't find an Indian English dictionary, which is why I am using UK English pronunciations for the dictionary and trying to append them to the US English dictionary. Can you tell me how to train an Indian English model, and where to get one? I am not able to find it. I will add more data for adaptation.
Thank you for your response.

- Nickolay V. Shmyrev - 2016-11-09
  
  Can you tell me how to train an Indian English model?
  
  The acoustic model training tutorial is here:
  
  http://cmusphinx.sourceforge.net/wiki/tutorialam
  
  You can also provide a sufficient amount of transcribed Indian English data, and I'll train the model for you.
  

RUPA SINGH - 2016-11-09

Thank you so much for your response. Due to time constraints, I will send you the transcribed Indian English data soon. Can you share all the steps you will use to train the model? I need to know them for future use. How can I thank you? I appreciate all of your help.


RUPA SINGH - 2016-11-09

Hi Nickolay,
For now, I am attaching a sample data file of Indian-accented English. How much data do you need to train the model for better accuracy? Let me know and I will provide the larger data set as soon as possible.
Kindly go through the link.
https://drive.google.com/drive/folders/0B_IHphmLx3m7Zk9ZdXpkVXpnUWM?usp=sharing
There are two files containing the same data. One is in simple text format and the other is in transcription file format i.e.
I have also attached some wav files for testing purposes. If you need more files, please let me know.
Thank you.

- Nickolay V. Shmyrev - 2016-11-09
  
  The total duration of the data you provided is just 3 minutes. You need to provide 100-200 hours of transcribed data to train the model.
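The amount of audio in a data set can be checked before uploading it. The following is a small sketch using Python's standard `wave` module, assuming uncompressed PCM WAV files. The helper names and the `Chapter_2_*.wav` glob pattern are illustrative.

```python
import glob
import wave


def wav_duration_seconds(path: str) -> float:
    """Duration of one PCM WAV file in seconds (frames / sample rate)."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def total_hours(paths) -> float:
    """Sum the durations of all given WAV files, in hours."""
    return sum(wav_duration_seconds(p) for p in paths) / 3600.0


if __name__ == "__main__":
    # Assumed file layout: WAV files in the current directory.
    files = sorted(glob.glob("Chapter_2_*.wav"))
    print(f"{len(files)} files, {total_hours(files):.3f} hours")
```

Comparing the printed total with the 100-200 hours mentioned above gives a quick sense of how much more data is needed.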
  

RUPA SINGH - 2016-11-09

Thank you for the help. I have read this tutorial: http://cmusphinx.sourceforge.net/wiki/tutorialam
It gives the number of hours and the number of speakers needed to train a model, and the maximum given is 50 hours. I just want to clear up a doubt: once you train the model on 100-200 hours of transcribed data, will it be a generic model? I am sorry to bother you; I just want to learn this from you and clear up all my doubts. At present I don't have that much data, but I will provide it soon.

Thank you very much..

- Nickolay V. Shmyrev - 2016-11-09
  
  Generic models are trained with 1000-10000 hours of data. Google trains with 6 lakh (600,000) hours; those are generic.
  
