Adding words to LM

binac
2011-07-04
2012-09-22
  • binac

    binac - 2011-07-04

    I have a dictation system built on pocketsphinx, with a 65k-word vocabulary in
    an n-gram model trained with the CMU LM Toolkit. I need command words for
    correction and text formatting to be included in the main language model, but
    they are missing.
    What is the simplest way to add these words to the existing model and ensure
    that they are recognized in any context during continuous dictation?

     
  • Nickolay V. Shmyrev

    Hello binac

    There is little sense in adding correction words to the language model. You need
    to spot them in the continuous stream, either with a parallel keyword spotter or
    with a spotter for isolated commands embedded into the n-gram decoder. For example,
    you can hardcode a higher probability for the correction word in
    isolated utterances. You can do that by modifying the sphinxbase n-gram code.
    Usually a correction command is an isolated word in a separate utterance, so you
    need to account for that case.

    Yes, a dictation system is not as easy to build as it might seem. You may
    be interested in some of Keith's publications:

    http://keithv.com/pub/

    For example

    http://keithv.com/pub/speechduring/speech_rec_dictation_corrections.pdf

     
  • binac

    binac - 2011-07-05

    Thank you Nickolay

    Keith's publications are very useful to me, but I'm stuck on a simple problem.

    My 3-gram LM for dictation works well for the targeted vocabulary and I don't
    want to change it. However, for some reason it does not contain words like
    "correct" and "select", which I want to use alongside dictation, as in "correct
    some text".

    Is it possible to add these words manually to the LM in ARPA text format and
    perhaps give them higher probabilities there? Or could I build a smaller
    model with these words, if there is some tool for merging models or some
    feature in pocketsphinx for using both models at the same time that I don't
    know about?

    I just want to avoid rebuilding the whole LM.

     
  • Nickolay V. Shmyrev

    You can build a unigram language model from the list of your words and mix it
    with your large language model using mitlm.
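
    As a rough illustration of the first step, here is a minimal Python sketch
    that writes a uniform unigram model in ARPA text format from a word list.
    The function name unigram_arpa, the uniform probabilities, and the -99
    entry for <s> are assumptions made for this example, not something
    prescribed by mitlm or pocketsphinx. Mixing the result with the large model
    is left to the interpolation tool.

```python
import math


def unigram_arpa(words):
    """Return ARPA-format text for a uniform unigram model over `words`.

    Hypothetical helper for illustration only: every word and the
    end-of-sentence token </s> get the same probability.
    """
    vocab = sorted(set(words))
    events = vocab + ["</s>"]
    # Uniform log10 probability shared by all words plus </s>.
    logp = math.log10(1.0 / len(events))

    lines = ["\\data\\", f"ngram 1={len(events) + 1}", "", "\\1-grams:"]
    # <s> is never predicted; -99 is a common ARPA convention (assumed here).
    lines.append("-99.0000 <s>")
    lines += [f"{logp:.4f} {w}" for w in events]
    lines += ["", "\\end\\", ""]
    return "\n".join(lines)


# Example with the two command words mentioned in this thread.
arpa_text = unigram_arpa(["correct", "select"])
```

    Writing arpa_text to a file gives a small model containing only the command
    words. How much weight these words end up with in the mixed model depends on
    the interpolation weights chosen when mixing.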

     
