
Creation of dictionary for Arabic language

Help
jam
2015-03-10
2015-03-12
  • jam

    jam - 2015-03-10

    Hi there,

    I am trying to use CMU Sphinx for the Arabic language with a specific vocabulary, and for that I need to create a dictionary.

    I am working on Ubuntu 14.04.2 LTS in a virtual machine. The host OS is Windows 8.

    I tried Sequitur g2p (r1668, numpy 1.9.1, swig 3.0.5) and it gave me weird results:
    1. Most of the words are transcribed correctly, but whenever a word contains the character م, the result for the whole word is wrong. I see the wrong result both in testing and in dictionary generation (--apply).
    2. Some Arabic characters, together with their transcriptions, appear between the results for two words.

    The problem might come from the fact that Arabic and Latin characters are in the same file. I need expert advice on this: can this be a problem? Could you have a look at my files?

    I then installed phonetisaurus (0.8a, openfst 1.4.1, openngram 1.2.1) to give it a try, and followed the tutorial: http://code.google.com/p/phonetisaurus/wiki/FSMNLPTutorial#Output_words_-_useful_for_long_lists

    Everything works fine until the last command:
    ~/CMUSphinx/dictionary/phonetisaurus-0.8a/phonetisaurus/script$ phonetisaurus-g2p --model=arabic_test/arabic_test.fst --input=arabic_test.wordlist --isfile --words

    This command gives no output and seems to hang.

    Could someone help me with this issue?

    I will attach my source files for sequitur and phonetisaurus (they are identical apart from the file extension). I will also attach a Notepad++ screenshot of the resulting file from sequitur.

    Thanks a lot for your help.

    Jam.

     

    Last edit: jam 2015-03-10
    • Nickolay V. Shmyrev

      Input dictionary for phonetisaurus must be in a very specific format:

       word<tab>phone<space>phone<space>
      

      There must be a tab character (ASCII 0x09, \t) between the word and the phones. In your dictionary you use just a space.

      If you convert the dictionary to the proper format it will train fine. You can do it with a simple Python script.
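
      A minimal sketch of such a conversion script, assuming the source dictionary is UTF-8 and each line is a word followed by its phones, all separated by spaces. The function names and the file names in the usage note are hypothetical, not part of phonetisaurus:

```python
def to_tab_separated(line):
    """Return 'word<TAB>phone phone ...' for one dictionary line.

    Returns None for blank lines or lines that have no phones.
    """
    parts = line.split()  # splits on any run of whitespace
    if len(parts) < 2:
        return None
    word, phones = parts[0], parts[1:]
    return word + "\t" + " ".join(phones)


def convert(src_path, dst_path):
    """Rewrite a space-separated dictionary as a tab-separated one."""
    # "utf-8-sig" also accepts files that start with a byte order mark,
    # which some Windows editors add (an assumption about the input).
    with open(src_path, encoding="utf-8-sig") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            converted = to_tab_separated(line)
            if converted is not None:
                dst.write(converted + "\n")
```

      For example, `convert("arabic_test.dic", "arabic_test.train")` would write the tab-separated copy next to the original. Only the first separator becomes a tab; the phones stay separated by single spaces.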

       
  • jam

    jam - 2015-03-11

    Thank you very much, Nickolay.

    I will try to do this.

     
  • jam

    jam - 2015-03-12

    Thank you, Nickolay.

    It worked!!

    Should I remove the tab and the numbers it generates before using the dictionary for training in CMU Sphinx, or can I leave it as it is?

    Thanks again.

     
    • Nickolay V. Shmyrev

      You need to remove the numbers.
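
      A rough sketch of how the numbers could be stripped, assuming each output line has tab-separated fields with the word first, the phones last, and the number in between. That column layout is an assumption about the generated file, so check it against your own output first; the function names are hypothetical:

```python
def to_sphinx_line(line):
    """Turn 'word<TAB>number<TAB>phones' into 'word phone phone ...'.

    Keeps the first and last tab-separated fields and drops anything
    in between. Returns None for lines without a word or phones.
    """
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 2:
        return None
    word = fields[0].strip()
    phones = fields[-1].split()
    if not word or not phones:
        return None
    return word + " " + " ".join(phones)


def strip_numbers(src_path, dst_path):
    """Write a copy of the generated dictionary without the numbers."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            cleaned = to_sphinx_line(line)
            if cleaned is not None:
                dst.write(cleaned + "\n")
```

      The result has one entry per line: the word, then its phones, separated by spaces.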

       
    • Seif Mostafa

      Seif Mostafa - 2017-08-22

      Hi jam, what should I do to get the phones for a word? Should I write them manually myself, or is there a tool?
      Thanks!

       
