CMU Sphinx / Forums / Help: Creation of dictionary for Arabic language

Speech Recognition Toolkit

Creation of dictionary for Arabic language

Forum: Help

Creator: jam

Created: 2015-03-10

Updated: 2015-03-12

jam - 2015-03-10

Hi there,

I am trying to use CMU Sphinx for the Arabic language with a specific vocabulary, and for that I need to create a dictionary.

I am working on Ubuntu 14.04.2 LTS in a virtual machine. The host OS is Windows 8.

I tried Sequitur g2p (r1668, numpy 1.9.1, swig 3.0.5) and it gave me weird results:
1. Most words are handled correctly, but whenever the character م appears in a word, the result for the whole word is wrong. I see the wrong result both in testing and in dictionary generation (--apply).
2. Some Arabic characters, together with their transcriptions, appear between the results for two words.

The problem might come from the fact that Arabic and Latin characters are in the same file. I need expert advice on this: can this be a problem? Could you have a look at my files?

I then installed Phonetisaurus (0.8a, openfst 1.4.1, openngram 1.2.1) to give it a try, and followed the tutorial: http://code.google.com/p/phonetisaurus/wiki/FSMNLPTutorial#Output_words_-_useful_for_long_lists

Everything works fine until the last command:
~/CMUSphinx/dictionary/phonetisaurus-0.8a/phonetisaurus/script$ phonetisaurus-g2p --model=arabic_test/arabic_test.fst --input=arabic_test.wordlist --isfile --words

This command gives no output and seems to hang.

Could someone help me with this issue?

I will attach my source files for Sequitur and Phonetisaurus (they are the same apart from the file extension). I will also attach a Notepad++ screenshot of the resulting file from Sequitur.

Thanks a lot for your help.

Jam.

Last edit: jam 2015-03-10

sequitur_phonetisaurus_pbs.zip

- Nickolay V. Shmyrev - 2015-03-10
  
  Input dictionary for phonetisaurus must be in a very specific format:
  
  word<tab>phone<space>phone<space>
  
  There must be a tab character (ASCII 0x09, or \t) between the word and the first phone. In your dictionary you use only a space.
  
  If you convert the dictionary to the proper format, it will train fine. You can do it with a simple Python script.
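
A minimal sketch of such a conversion script, assuming the source dictionary is UTF-8 text with one entry per line, where the word and its phones are separated by spaces. The function names and the command-line usage are illustrative and do not come from Phonetisaurus.

```python
import sys


def to_phonetisaurus_line(line):
    """Turn 'word phone phone ...' into 'word<TAB>phone phone ...'."""
    parts = line.split()  # splits on any run of whitespace
    if len(parts) < 2:
        return None  # blank line, or a word without phones: skip it
    word, phones = parts[0], parts[1:]
    return word + "\t" + " ".join(phones)


def convert(src_path, dst_path):
    """Rewrite a space-separated dictionary as a tab-separated one."""
    # Assumption: both files are UTF-8 encoded.
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            converted = to_phonetisaurus_line(line)
            if converted is not None:
                dst.write(converted + "\n")


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: python convert_dict.py input.dic output.dic
    convert(sys.argv[1], sys.argv[2])
```

Only the first separator becomes a tab. The phones stay separated by single spaces, which matches the format described above.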
  

jam - 2015-03-11

Thank you very much, Nickolay.

I will try to do this.


jam - 2015-03-12

Thank you, Nickolay.

It worked!!

Should I remove the tab and the numbers it generates before using the dictionary for training in CMU Sphinx, or can I leave it as it is?

Thanks again.

- Nickolay V. Shmyrev - 2015-03-12
  
  You need to remove the numbers.
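
A rough sketch of how the numbers could be stripped, assuming each output line has the shape word<TAB>number<TAB>phones. That layout is an assumption and should be checked against the actual output before use. The function names are illustrative, and writing a single space after the word is a choice made here, not a documented requirement.

```python
import sys


def to_sphinx_line(line):
    """Drop the numeric column from 'word<TAB>number<TAB>phones'."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        return None  # not in the assumed three-column layout: skip it
    word, phones = fields[0], fields[-1].split()
    if not word or not phones:
        return None  # missing word or empty pronunciation: skip it
    return word + " " + " ".join(phones)


def strip_numbers(src_path, dst_path):
    """Write a 'word phone phone ...' dictionary without the numbers."""
    # Assumption: both files are UTF-8 encoded.
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            cleaned = to_sphinx_line(line)
            if cleaned is not None:
                dst.write(cleaned + "\n")


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: python strip_numbers.py g2p_output.txt arabic.dic
    strip_numbers(sys.argv[1], sys.argv[2])
```

Lines that do not match the assumed layout are skipped rather than guessed at, so it is worth comparing the line counts of the input and output files afterwards.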
  
- Seif Mostafa - 2017-08-22
  
  Hi jam, how should I generate the phones for a word? Should I do it manually, or is there a tool?
  Thanks!
  
