How to build a Persian/Farsi dictionary for use with CMUSphinx

Help
rezaee
2016-10-08
2016-11-09
  • rezaee

    rezaee - 2016-10-08

    Hello
    Let's say we want to make a phonetic dictionary for the Farsi digits from 1 to 10. These are the results from espeak with this command: espeak -v fa -x

     j'ek       1
     d'o        2
     s'e        3
     tS'AhAR    4
     p'andZ     5
     S'eS       6
     h'aft      7
     h'aSt      8
     n'oh       9
     d'ah       10
    

    Next I should map them to a phoneset that I can use with CMUSphinx, so I built this mapping for them:

    j y
    e e
    k k
    d d
    o o
    s s
    tS ch
    A aa
    h h
    R r
    p p
    a a
    n n
    dZ j
    S sh
    f f
    t t
    

    Finally I should write my dictionary like this:

    یک y e k
    دو d o
    سه s e
    چهار ch aa h aa r
    پنج p a n j
    شش sh e sh
    هفت h a f t
    هشت h a sh t
    نه n o h
    ده d a h
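
The conversion from espeak output to dictionary phones can be sketched in Python. This is only an illustration under stated assumptions: the mapping table is copied from the post above, the function name `espeak_to_phones` is made up for this example, and the `'` stress mark is simply dropped. Two-character mnemonics such as `tS` and `dZ` are matched before single characters.

```python
# Mapping copied from the post above: espeak mnemonic -> dictionary phone.
ESPEAK_TO_SPHINX = {
    "j": "y", "e": "e", "k": "k", "d": "d", "o": "o", "s": "s",
    "tS": "ch", "A": "aa", "h": "h", "R": "r", "p": "p", "a": "a",
    "n": "n", "dZ": "j", "S": "sh", "f": "f", "t": "t",
}

# Assumption: the stress mark carries no phone of its own and is dropped.
IGNORED = {"'"}


def espeak_to_phones(mnemonics):
    """Convert one espeak -x string into space-separated phones."""
    # Try longer mnemonics first so "tS" is not read as "t" followed by "S".
    keys = sorted(ESPEAK_TO_SPHINX, key=len, reverse=True)
    phones = []
    i = 0
    while i < len(mnemonics):
        if mnemonics[i] in IGNORED:
            i += 1
            continue
        for key in keys:
            if mnemonics.startswith(key, i):
                phones.append(ESPEAK_TO_SPHINX[key])
                i += len(key)
                break
        else:
            # Fail loudly on symbols that the mapping does not cover yet.
            raise ValueError(f"no mapping for {mnemonics[i]!r} in {mnemonics!r}")
    return " ".join(phones)


for word, pron in [("یک", "j'ek"), ("چهار", "tS'AhAR"), ("پنج", "p'andZ")]:
    print(word, espeak_to_phones(pron))
```

A symbol that is missing from the table raises an error instead of being skipped silently, which makes gaps in the mapping easy to spot.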
    

    OK, am I on the right track?
    What's the next step?
    Do I have the phonetic dictionary and phoneset file for my project? Should I go ahead and train my system with these two files to get my language model?

     

    Last edit: rezaee 2016-10-08
    • Nickolay V. Shmyrev

      OK, am I on the right track?

      Yes

      What's the next step?

      Continue with speech data collection and training.

      Do I have the phonetic dictionary and phoneset file for my project?

      You have the dictionary; the phoneset must still be compiled. The phoneset should list the phones.

      Should I go ahead and train my system with these two files to get my language model?

      Yes
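
One way to compile the phoneset is to collect every phone used in the dictionary. The sketch below assumes the usual layout of a word followed by space-separated phones on each line, and it adds the SIL silence phone that the acoustic model training tutorial asks for. The function name is hypothetical.

```python
def phoneset_from_dictionary(dict_lines):
    """Return the sorted list of unique phones used in dictionary lines."""
    phones = set()
    for line in dict_lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank lines and words without a pronunciation
        phones.update(parts[1:])  # everything after the word is a phone
    phones.add("SIL")  # silence phone, needed in addition to the speech phones
    return sorted(phones)


dictionary = ["یک y e k", "دو d o", "چهار ch aa h aa r"]
# One phone per line, as the phoneset file expects.
print("\n".join(phoneset_from_dictionary(dictionary)))
```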

       
      • rezaee

        rezaee - 2016-10-09

        You have the dictionary; the phoneset must still be compiled. The phoneset should list the phones.

        What do you mean by this? How should I compile it? What is the list of phones?
        Do you mean I cannot do my project with these 2 files?

         
        • rezaee

          rezaee - 2016-10-09

          Could you post a phoneset file here so I can see what it looks like?
          Do I need the phoneset for the next steps?

           
          • Nickolay V. Shmyrev

            This question is answered in the "data preparation" section of the acoustic model training tutorial.

             
  • rezaee

    rezaee - 2016-10-08

    Another question: what must I do with these symbols: ' , :
    I didn't include them when writing my phoneset mapping. Will this cause a problem?

     

    Last edit: rezaee 2016-10-08
    • Nickolay V. Shmyrev

      You need to provide the context: the particular words in which symbols like , or : occur. The symbol : usually means a prolonged phone, which you can either use in the phoneset or ignore, depending on how frequently it occurs.
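
To judge how frequently these symbols occur, they can be counted over the espeak output for the whole word list. This is a rough sketch: the function name is made up, and the sample string `s'e:` is an invented input for illustration, not real espeak output.

```python
from collections import Counter


def count_marks(pronunciations, marks="',:"):
    """Count how often each of the given mark characters appears."""
    counts = Counter()
    for pron in pronunciations:
        for ch in pron:
            if ch in marks:
                counts[ch] += 1
    return counts


# "s'e:" is a hypothetical example containing the length mark.
counts = count_marks(["j'ek", "d'o", "s'e:"])
for mark in "',:":
    print(repr(mark), counts[mark])
```

If `:` turns out to be rare, dropping it is the simpler choice; if it is common, it may deserve its own phone in the phoneset.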

       
  • rezaee

    rezaee - 2016-10-09

    Why should we map the phonemes from espeak? Is it because there is a standard that CMUSphinx understands?

     
    • Nickolay V. Shmyrev

      There is no standard, but there are rules described in the tutorial: spaces between phonemes, lowercase, no punctuation in phonemes. These make it easier for the software to process input files.
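
These rules can be checked mechanically before training. The sketch below reads them strictly as "every phone is one or more lowercase ASCII letters", which is an assumption of this example rather than a documented requirement. The function name is hypothetical.

```python
import re

# Assumption: a valid phone is lowercase letters only, with no punctuation.
PHONE_RE = re.compile(r"[a-z]+")


def check_entry(line):
    """Return a list of problems found in one dictionary line."""
    parts = line.split()
    if len(parts) < 2:
        return ["missing phones"]
    problems = []
    for phone in parts[1:]:
        if not PHONE_RE.fullmatch(phone):
            problems.append(f"bad phone: {phone!r}")
    return problems


print(check_entry("یک y e k"))       # clean entry
print(check_entry("چهار tS'AhAR"))   # raw espeak output left unmapped
```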

       
  • rezaee

    rezaee - 2016-10-09

    So the mapping file isn't important for the project; it's only a guide for ourselves when writing the dictionary. And the phoneset is the file that consists of the symbols we used to build our dictionary.

    I have written the following symbols in a file with a ".phone" extension and am writing my dictionary with them. They are all of the phones we need to write Persian words in the dictionary. So this is my phoneset file, I think?

     a
     e
     o
     aa
     i
     u
     b
     p
     t
     s
     j
     ch
     h
     kh
     d
     z
     r
     z
     zh
     s
     sh
     s
     z
     t
     z
     gh
     f
     gh
     k
     g
     l
     m
     n
     v
     h
     y
     ss
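
The list above repeats some entries (z, s, t, gh and h each appear more than once). Assuming the trainer expects each phone only once, a short sketch like the following removes repeats while keeping the original order. The function name is hypothetical.

```python
def unique_phones(phones):
    """Return (phones without repeats, phones that were repeated)."""
    seen = set()
    unique, duplicates = [], []
    for phone in phones:
        phone = phone.strip()
        if not phone:
            continue  # ignore blank lines
        if phone in seen:
            if phone not in duplicates:
                duplicates.append(phone)
        else:
            seen.add(phone)
            unique.append(phone)
    return unique, duplicates


unique, duplicates = unique_phones(["a", "z", "r", "z", "s", "s"])
print(unique)
print(duplicates)
```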
    
     

    Last edit: rezaee 2016-10-09
  • rezaee

    rezaee - 2016-10-09

    What about the filler dictionary?
    How should I build it?
    I read the acoustic model training part of the tutorial, but it wasn't enough for me!
    Can you explain more, please?
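
For reference, the simplest filler dictionary shown in the acoustic model training tutorial contains only silences, with the sentence markers and `<sil>` all mapped to SIL. A sketch that produces that text follows; the function name is made up, and extra fillers such as noise or breath are left out.

```python
def filler_dictionary():
    """Return the text of a silence-only filler dictionary."""
    entries = [("<s>", "SIL"), ("</s>", "SIL"), ("<sil>", "SIL")]
    return "\n".join(f"{word} {phone}" for word, phone in entries) + "\n"


print(filler_dictionary(), end="")
```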

     
  • rezaee

    rezaee - 2016-10-09

    Another question
    I used the online language modeling service to put the sentences of my text file between <s> and </s> by downloading its ".sent" file. Is there a command that does this offline as easily as the online service?

     
    • Nickolay V. Shmyrev

      so this is my phoneset file I think?

      Yes

      I read the acoustic model training part of the tutorial, but it wasn't enough for me!

      You need to ask a more detailed question, then.

      I used the online language modeling service to put the sentences of my text file between <s> and </s> by downloading its ".sent" file. Is there a command that does this offline as easily as the online service?

      SRILM does not require you to insert <s>; it adds them automatically. Otherwise you can write a simple Python script.
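
Such a script could look roughly like this. It is a minimal sketch with a hypothetical function name, assuming one sentence per line of input; blank lines are skipped and runs of whitespace are collapsed.

```python
def wrap_sentences(lines):
    """Wrap each non-empty line in <s> ... </s> markers."""
    wrapped = []
    for line in lines:
        text = " ".join(line.split())  # trim and collapse whitespace
        if text:
            wrapped.append(f"<s> {text} </s>")
    return wrapped


for sentence in wrap_sentences(["یک دو سه\n", "\n", "  چهار   پنج \n"]):
    print(sentence)
```

To process a file, the lines can be read with `open(path, encoding="utf-8")` and the result written out the same way.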

       

      Last edit: Nickolay V. Shmyrev 2016-10-09
  • rezaee

    rezaee - 2016-10-14

    Unfortunately I don't know Python! Is there an existing script to add <s> and </s> () to sentences?

     
    • mehrshad

      mehrshad - 2016-11-09

      Hi,
      Please check your mail...

       
