Arpabet for non-English languages?

Speech Recognition Toolkit

Brought to you by: air, arthchan2003, awb, bhiksha, and 5 others

Forum: Help

Creator: Kristoffer

Created: 2018-03-07

Updated: 2018-03-07

Kristoffer - 2018-03-07

I understand Arpabet is well suited to English. I have some non-English dictionaries with what seem to be X-SAMPA pronunciations. Can these be used out of the box, or do I need to map the characters to Arpabet-friendly tokens, i.e. space-separated, alpha-only tokens?

E.g. I have this for the words "expert" and "institut":

ek$"spEt` In$stI$"t}:t

Please advise!
- Nickolay V. Shmyrev - 2018-03-07
  
  do I need to map the characters to Arpabet friendly tokens?
  
  Yes, you need to map them.
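
A minimal sketch of such a mapping, using the two pronunciations quoted above. The mapping table, the token names (`UU`, `RT`, and so on) and the function `xsampa_to_tokens` are all hypothetical and only illustrate the idea. The sketch also assumes that `$` marks a syllable boundary and that `"` and `%` mark stress, so it drops them.

```python
# Hypothetical mapping from X-SAMPA symbols to alpha-only phone tokens.
# The token names on the right are illustrative, not an official phoneset.
XSAMPA_TO_TOKEN = {
    "}:": "UU",  # long close central rounded vowel
    "t`": "RT",  # retroflex t
    "e": "E",
    "E": "EH",
    "I": "IH",
    "k": "K",
    "n": "N",
    "p": "P",
    "s": "S",
    "t": "T",
}

# Assumed to be syllable-boundary and stress marks; dropped in this sketch.
IGNORED = {"$", '"', "%"}


def xsampa_to_tokens(pron, table=XSAMPA_TO_TOKEN):
    """Split an X-SAMPA string into phone tokens by greedy longest match."""
    keys = sorted(table, key=len, reverse=True)  # try multi-character symbols first
    tokens = []
    i = 0
    while i < len(pron):
        if pron[i] in IGNORED:
            i += 1
            continue
        for key in keys:
            if pron.startswith(key, i):
                tokens.append(table[key])
                i += len(key)
                break
        else:
            # Fail loudly so gaps in the table are noticed early.
            raise ValueError(f"unmapped symbol at position {i}: {pron[i:]!r}")
    return tokens


print(" ".join(xsampa_to_tokens('ek$"spEt`')))     # E K S P EH RT
print(" ".join(xsampa_to_tokens('In$stI$"t}:t')))  # IH N S T IH T UU T
```

Matching the longest symbol first keeps multi-character symbols such as `}:` and ``t` `` from being split into their single-character parts.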
  
Kristoffer - 2018-03-07

Can I create my own set of tokens or do they need to match CMUDict? E.g.:

EH K S P ER T IH N S T IH T UW T

That is, where there is an equivalent tag I must use it? And then, when there is no match, I can invent my own?
- Nickolay V. Shmyrev - 2018-03-07
  
  You can create your own set of tokens.
  
  You can also check http://www.openslr.org/29/ and https://github.com/kaldi-asr/kaldi/tree/master/egs/sprakbanken_swe
  
Kristoffer - 2018-03-07

Thanks. Really appreciate your help :)

Kristoffer - 2018-03-07

Given the resources mentioned here, what tools do I need to create something like en-us-phone.lm.bin but for e.g. Swedish? I understand that the phone LM file is a lot smaller than the original file. Not sure how it was created. Is there a tutorial somewhere to get me started?

How much work do you think would be required in terms of hours/days?
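
This question is not answered in the thread. One plausible starting point, offered as an assumption and not as a description of how en-us-phone.lm.bin was actually built, is to convert a text corpus into phone sequences with the pronunciation dictionary and then train an n-gram language model on those sequences with a standard LM toolkit. The sketch below covers only the conversion step. The dictionary entries, phone tokens and function names are hypothetical.

```python
def load_dictionary(lines):
    """Parse 'word PH1 PH2 ...' lines into a word -> phone-list mapping."""
    lexicon = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        word, phones = parts[0].lower(), parts[1:]
        lexicon.setdefault(word, phones)  # keep the first pronunciation only
    return lexicon


def sentence_to_phones(sentence, lexicon):
    """Return the sentence as one phone string, or None if a word is unknown."""
    phones = []
    for word in sentence.lower().split():
        if word not in lexicon:
            return None
        phones.extend(lexicon[word])
    return " ".join(phones)


# Hypothetical dictionary entries with illustrative phone tokens.
dictionary_lines = [
    "expert E K S P EH RT",
    "institut IH N S T IH T UU T",
]
lexicon = load_dictionary(dictionary_lines)

corpus = ["expert institut", "institut okänd"]  # second sentence has an unknown word
phone_corpus = []
for sentence in corpus:
    converted = sentence_to_phones(sentence, lexicon)
    if converted is not None:
        phone_corpus.append(converted)

print(phone_corpus)  # ['E K S P EH RT IH N S T IH T UU T']
```

Each line of `phone_corpus` would then serve as one training sentence for the n-gram toolkit, with phones taking the place of words.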
