Building a Language Model

Max
2012-04-21
2012-09-22
  • Max

    Max - 2012-04-21

    I would like to write a speech recognition program and would appreciate
    anyone pointing me in the right direction. It needs to recognize about
    6000 English technical words, and it does not need to recognize anything
    else. When a person uses the app, they will enter a list: the first word
    goes into the first structured field, the second word into the next
    field, and so on. I could even have the person say the word "next" or
    something similar to mark the divisions between words, so the program
    will not need to understand context.

    It seems to me that this would be substantially easier than a general
    speech recognition program that needs to understand context, handle a
    large vocabulary, and so on. Is that the case, and how would I build my
    language model? Is there a program that can generate pronunciations for
    these 6000 words, or do I need to find pronunciations on the Internet or
    record people saying them?

    Also, the idea behind my app is for lay users to be able to use speech
    recognition in this technical field, so I don't care whether the
    pronunciations are technically accurate; rather, I would match how lay
    people think the words should be pronounced to the actual words. This is
    why I could potentially use an automated pronunciation procedure to
    create my dictionary.

    Thanks for any insight!!
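
    For what it's worth, the usual CMU Sphinx recipe for a small fixed
    vocabulary is to build a statistical language model from a text corpus
    that simply lists the target words, one per line, and feed that corpus
    to a tool such as the CMU lmtool. A minimal Python sketch, assuming the
    6000 words live in a hypothetical words.txt (both file names are
    placeholders):

        # Turn a plain word list into a one-sentence-per-line corpus file
        # that a language-model tool can consume.
        with open("words.txt") as src, open("corpus.txt", "w") as dst:
            for line in src:
                word = line.strip()
                if word:
                    # Each word becomes its own single-word "sentence".
                    # If a spoken divider like "next" is used, it should
                    # appear in the corpus as well.
                    dst.write(word.upper() + "\n")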

     
  • Max

    Max - 2012-04-22

    Thanks for your reply. I did read the tutorial but did not completely
    understand it. Re-reading the following page

    http://cmusphinx.sourceforge.net/wiki/tutorialdict

    does answer my question about building a dictionary. My takeaway from
    that page is that my task is easier than general speech recognition
    because of the constrained problem described above, but I wanted to
    confirm this. I also was not able to find any section on "wake-up"
    phrases, but the voice-while-driving app uses CMU Sphinx and does this,
    so I think it could work for me as well.
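
    For the "wake-up" or divider word itself, pocketsphinx has a keyphrase
    search mode. A rough sketch with the Python bindings (the older Decoder
    API; the paths, file name, and threshold below are placeholders that
    would need tuning):

        from pocketsphinx import Decoder

        config = Decoder.default_config()
        config.set_string('-hmm', '/path/to/en-us')    # stock acoustic model
        config.set_string('-dict', '/path/to/app.dict')
        config.set_string('-keyphrase', 'next')        # the spoken divider word
        config.set_float('-kws_threshold', 1e-20)      # tune against false alarms

        decoder = Decoder(config)
        decoder.start_utt()
        with open('recording.raw', 'rb') as f:         # 16 kHz, 16-bit mono PCM
            while True:
                buf = f.read(1024)
                if not buf:
                    break
                decoder.process_raw(buf, False, False)
                if decoder.hyp() is not None:
                    print('keyphrase detected')
                    decoder.end_utt()                  # reset and keep listening
                    decoder.start_utt()
        decoder.end_utt()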

    To confirm: it sounds like I do not need to find speakers to say each of
    the 6000 words I want. I can use an automatic grapheme-to-phoneme tool
    that will guess at the pronunciations, and I can use some machine
    learning to improve them once the app is in use.
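
    If automatic pronunciation guessing turns out to be the way to go, a
    grapheme-to-phoneme tool can draft the dictionary entries. A minimal
    sketch using the third-party g2p_en Python package (an assumption here,
    not something from this thread; any g2p tool would do), producing
    Sphinx-style entries with the stress digits stripped:

        # pip install g2p_en
        from g2p_en import G2p

        g2p = G2p()
        for word in ["myocardium", "tachycardia"]:   # sample technical words
            # g2p_en returns ARPAbet phones with stress digits (e.g. AH0);
            # Sphinx dictionaries use stress-free phones, so strip them.
            phones = [p.rstrip("012") for p in g2p(word) if p.strip()]
            print(word.lower(), " ".join(phones))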

     
  • Nickolay V. Shmyrev

    To confirm, it sounds like I do not need to find speakers to say each of the
    6000 words I want

    There is no need to train an acoustic model for new words unless you
    have very different recording conditions. The current acoustic model is
    good enough for dictation. There is also no need to record every word
    that is supposed to be recognized.
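
    In practice, that means pointing the decoder at the stock acoustic model
    while supplying a custom dictionary and language model. A minimal sketch
    with the pocketsphinx Python bindings (older Decoder API; all paths are
    placeholders):

        from pocketsphinx import Decoder

        config = Decoder.default_config()
        config.set_string('-hmm', '/path/to/en-us')     # stock acoustic model
        config.set_string('-lm', '/path/to/app.lm')     # custom language model
        config.set_string('-dict', '/path/to/app.dict') # custom dictionary

        decoder = Decoder(config)
        decoder.start_utt()
        with open('utterance.raw', 'rb') as f:          # 16 kHz, 16-bit mono PCM
            decoder.process_raw(f.read(), False, True)  # whole utterance at once
        decoder.end_utt()

        if decoder.hyp() is not None:
            print(decoder.hyp().hypstr)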

     
  • Max

    Max - 2012-04-24

    Thanks for your replies; I find this helpful. The current acoustic model
    is good even for technical vocabulary? It seemed to me that it would be
    easier to achieve accuracy by building an acoustic/language model just
    for this purpose. Does this mean that I would just need a language model
    for the actual words, and not an acoustic model? The current solutions
    on my smartphone do not do well with a technical vocabulary, so it seems
    that something needs to be done.

     
  • Nickolay V. Shmyrev

    The current acoustic model is good even for technical vocabulary?

    Yes

    It seemed to me that it would be easier to achieve accuracy by building an
    acoustic/language model just for this purpose.

    This issue is covered in the tutorial.

    Does this mean that I would just need a language model for the actual words,
    and not an acoustic model?

    Yes

     
