Best way to include sequences/numbers in a language model?

2016-03-02
  • Konstantin Koss

    Konstantin Koss - 2016-03-02

    Hello,

    so I am successfully generating a domain-specific language model via cmuclmtk from a text corpus that covers a few dozen potential commands and sentences my application is likely to hear.

    Now I want to add a small number of commands that include a number in the sentence structure.

    Would I go in and fill up the corpus with line after line of (e.g.)

    <s> the patient took 1 minute </s>
    <s> the patient took 2 minutes </s>
    <s> the patient took 3 minutes </s>
    …
    <s> the patient took 178 seconds </s>
    

    That seems excessive. And I can imagine that this would immensely bloat the probability of this particular phrase in contrast to all the others, right? (I don’t really understand the inner workings of language models)

    Or would I add a few sample sentences of the above format and then list all combinations of

    <s> 1 minute </s>
    <s> 2 minutes </s>
    …
    <s> 199 minutes </s>
    

    …and again for seconds and other phrases that take numbers?

    Or is it enough to list numbers by themselves and trust that sphinx will correctly insert them into the phrases?

    Also, would it be better to list numbers as words, as in

    <s> one </s>
    <s> two </s>
    <s> ten </s>
    <s> twenty one </s>
    

    ?

    The concise version of my questions:

    1. What is the best practice to add phrases to a text corpus for language model generation that take a variable (be it numbers, months, dates)?
    2. Is probability bloat of repeated phrases something I have to be careful about?
    3. Should symbols be written out or can the lm toolkit handle them?

    Thank you very much for your time,

    Konstantin

     

    Last edit: Nickolay V. Shmyrev 2016-03-02
    • Nickolay V. Shmyrev

      We recommend expanding numbers and dates into words; that gives the model much more flexibility.
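A minimal sketch of that expansion step, applied to corpus lines before they are passed to the language model toolkit. The helper names `number_to_words` and `expand_digits` are hypothetical (they are not part of cmuclmtk), and the sketch assumes English, whole numbers from 0 to 999, and the unhyphenated spelling used in the question ("twenty one"):

```python
import re

ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer in 0..999 as lowercase words (no hyphens, no 'and')."""
    if not 0 <= n <= 999:
        raise ValueError("this sketch only handles 0..999")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"
    hundreds, rest = divmod(n, 100)
    head = f"{ONES[hundreds]} hundred"
    return head if rest == 0 else f"{head} {number_to_words(rest)}"


def expand_digits(line: str) -> str:
    """Replace every run of digits in a corpus line with its spelled-out form."""
    return re.sub(r"\d+", lambda m: number_to_words(int(m.group())), line)


print(expand_digits("<s> the patient took 178 seconds </s>"))
```

The spelling has to match the words in the pronunciation dictionary, so whether to write "twenty one" or "twenty-one" depends on the dictionary in use.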

      You definitely need to care about probabilities: if sentences are rare, you do not need many of them in the training text.

      With numbers expanded to words, you do not need to list every possible number, just some of them.
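One way to act on this is to generate only a handful of example sentences for the numeric phrase instead of enumerating every value. The sketch below is illustrative only: the word list, the units, the template sentence, and the function name `sample_sentences` are assumptions taken from the question's example, and how many sentences is "enough" is not stated in this thread.

```python
import random

# Hypothetical, already spelled-out number words; a real corpus would draw
# these from whatever number-to-words step it uses.
NUMBER_WORDS = [
    "one", "two", "five", "ten", "twenty one",
    "forty five", "ninety", "one hundred seventy eight",
]
UNITS = ["second", "minute"]


def sample_sentences(count: int, seed: int = 0) -> list[str]:
    """Build `count` corpus lines from a small random sample of numbers and units."""
    rng = random.Random(seed)  # fixed seed so the corpus is reproducible
    lines = []
    for _ in range(count):
        number = rng.choice(NUMBER_WORDS)
        unit = rng.choice(UNITS)
        if number != "one":
            unit += "s"  # simple English plural for everything except "one"
        lines.append(f"<s> the patient took {number} {unit} </s>")
    return lines


for line in sample_sentences(5):
    print(line)
```

Keeping `count` small relative to the rest of the corpus keeps this phrase from dominating the n-gram counts, which is the concern raised in the question.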

       
