
cmuclmtk-0.7 text2idngram and <s> tags

  • Halle

    Halle - 2012-01-30

    Hi Nickolay,

    I've been experimenting with generating language models using cmuclmtk-0.7
    instead of mitlm. I had a couple of questions about it.

    My first question is about text2idngram, specifically the hash size that is
    passed to it. I can't use the default of 2000000 because the memory isn't
    there for it on the device. I did some tests of reducing it to various numbers
    and discovered that I could reduce it down to 5000 without the (final .arpa)
    output apparently changing, and furthermore it reduced the time needed by
    about a third. There seems to be nothing but upside to reducing the hash size.
    Is there any downside?
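    For reference, the pipeline I'm running is essentially the one from the
    CMU language model tutorial, with the reduced hash size passed on the
    second step (the corpus file names here are just placeholders):

    text2wfreq < corpus.txt | wfreq2vocab > corpus.vocab
    text2idngram -vocab corpus.vocab -idngram corpus.idngram -hash 5000 < corpus.txt
    idngram2lm -vocab_type 0 -idngram corpus.idngram -vocab corpus.vocab -arpa corpus.arpa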

    My second question is that when I generate an ARPA model using the method
    described at this link: http://cmusphinx.sourceforge.net/wiki/tutoriallm, I
    get some entries in my ARPA file with end + start sentence markers, as in
    the following 3-gram:

    -1.4260 </s> <s> TUTORIAL

    When a language model is made in this way from primarily single words
    surrounded by the sentence markers, for instance from the following
    starting text file:

    <s> CHANGE MODEL </s>
    <s> MONDAY </s>
    <s> TUESDAY </s>
    <s> WEDNESDAY </s>
    <s> THURSDAY </s>
    <s> FRIDAY </s>
    <s> SATURDAY </s>
    <s> SUNDAY </s>
    <s> QUIDNUNC </s>

    Then almost every word will have an entry for this </s> <s> WORD pattern
    under the 3-grams as well as the more expected <s> WORD </s> pattern, like
    so:

    -0.9542 </s> <s> FRIDAY
    -0.9542 </s> <s> MONDAY
    -0.9542 </s> <s> QUIDNUNC
    -0.9542 </s> <s> SATURDAY
    -0.9542 </s> <s> SUNDAY
    -0.9542 </s> <s> THURSDAY
    -0.9542 </s> <s> TUESDAY
    -0.9542 </s> <s> WEDNESDAY
    -0.3010 <s> CHANGE MODEL
    -0.3010 <s> FRIDAY </s>
    -0.3010 <s> MONDAY </s>
    -0.3010 <s> SATURDAY </s>
    -0.3010 <s> SUNDAY </s>
    -0.3010 <s> THURSDAY </s>
    -0.3010 <s> TUESDAY </s>
    -0.3010 <s> WEDNESDAY </s>
    -0.3010 CHANGE MODEL </s>
    -0.3010 FRIDAY </s> <s>
    -0.3010 MODEL </s> <s>
    -0.3010 MONDAY </s> <s>
    -0.3010 SATURDAY </s> <s>
    -0.3010 SUNDAY </s> <s>
    -0.3010 THURSDAY </s> <s>
    -0.3010 TUESDAY </s> <s>
    -0.3010 WEDNESDAY </s> <s>

    Is this expected? Is it problematic for recognition? Thanks for your insight.

     
  • eliasmajic

    eliasmajic - 2012-02-01

    I only have guesses as to the answer, so I'll remain silent on that, but I
    suggest you check out SRILM as well.

    http://www.speech.sri.com/projects/srilm/

     
  • Nickolay V. Shmyrev

    My first question is about text2idngram, specifically the hash size that is
    passed to it. I can't use the default of 2000000 because the memory isn't
    there for it on the device. I did some tests of reducing it to various numbers
    and discovered that I could reduce it down to 5000 without the (final .arpa)
    output apparently changing, and furthermore it reduced the time needed by
    about a third. There seems to be nothing but upside to reducing the hash size.
    Is there any downside?

    I think it can be done, but it may require some rework of the outdated
    cmuclmtk sources (the hash implementation needs to be linked from
    sphinxbase).

    -1.4260 </s> <s> TUTORIAL
    Is this expected? Is it problematic for recognition? Thanks for your insight.

    This shouldn't hurt. But it's not good either. I never had time to fix it
    in cmuclmtk.

    Just have guesses to this answer so I will remain silent but I suggest you
    check out SRILM as well.

    Also mitlm and irstlm; they should be much better candidates.

     
  • Halle

    Halle - 2012-02-01

    I only have guesses as to the answer, so I'll remain silent on that, but I
    suggest you check out SRILM as well.

    I can't use SRILM or IRSTLM because of their licenses; they'd probably be
    usable for me, but unusable for my users. I have implemented a working port
    of mitlm, but I would really like to drop the C++ requirement, since it's
    the only non-C dependency I'm working with and it leads to some awkwardness
    in Xcode that results in many support cases. But if the hash size reduction
    and the </s> <s> issue both have unknown consequences in cmuclmtk, I guess
    I have to stick with mitlm (which I do like, other than the C++ thing).
    Nickolay, what is the potential danger of changing the hash size?

     
  • Nickolay V. Shmyrev

    But if the hash size reduction and the </s> <s> issue both have unknown
    consequences in cmuclmtk, I guess I have to stick with mitlm (which I do
    like, other than the C++ thing).

    The consequences are known, and they do not change anything; the model will
    still be functional. You can safely go with cmuclmtk with a modified
    initial hash size if you want to use it.

     
  • Halle

    Halle - 2012-02-02

    That sounds good; can I ask you to expand on your previous comments a bit
    then?

    I think it can be done, but it may require some rework of the outdated
    cmuclmtk sources (the hash implementation needs to be linked from
    sphinxbase).

    This sounds like there is something I still need to do in order for this to
    work without issue.

    This shouldn't hurt. But it's not good either. I never had time to fix it
    in cmuclmtk.

    Let's talk about the "not good"-ness a bit more. To me it looks like it is
    effectively re-adding the 2-grams into the 3-gram section, because it is
    treating the end/start tags as words, so it probably raises the probability
    of the true 2-grams versus the true 3-grams a bit. Is there a problem with
    my doing a last pass on the .arpa after it is created, just deleting the
    lines in which the </s> <s> pattern appears and then adjusting the n-gram
    counts in the data section? Or will this have a distorting effect on the
    overall probabilities? Whatever I do will have to work as well for 1,000
    words derived from complete sentences as for 10 words derived from a simple
    command-and-control corpus.
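
    To make that concrete, here is a rough sketch in Python of the kind of
    last pass I have in mind. It assumes the </s> <s> sequence can simply be
    matched textually on each n-gram line, in whatever section it appears,
    and that only the "ngram N=count" lines in the data section then need
    patching:

    import re

    def strip_cross_sentence_ngrams(in_path, out_path):
        # Drop every n-gram line containing a </s> <s> sequence and patch
        # the "ngram N=count" lines in the \data\ header to match.
        tag_re = re.compile(r'</s>\s+<s>')
        section_re = re.compile(r'\\(\d+)-grams:')
        count_re = re.compile(r'ngram (\d+)=(\d+)')

        with open(in_path) as f:
            lines = f.readlines()

        removed = {}   # n-gram order -> number of lines dropped
        order = None   # current n-gram section; None while in the header
        kept = []
        for line in lines:
            m = section_re.match(line)
            if m:
                order = int(m.group(1))
            if order is not None and tag_re.search(line):
                removed[order] = removed.get(order, 0) + 1
                continue
            kept.append(line)

        with open(out_path, 'w') as f:
            for line in kept:
                m = count_re.match(line)
                if m and int(m.group(1)) in removed:
                    n, c = int(m.group(1)), int(m.group(2))
                    f.write('ngram %d=%d\n' % (n, c - removed[n]))
                else:
                    f.write(line)

    strip_cross_sentence_ngrams('corpus.arpa', 'corpus-clean.arpa')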

     
  • Nickolay V. Shmyrev

    To me it looks like it is effectively re-adding the 2-grams into the 3-gram
    section, because it is treating the end/start tags as words, so it probably
    raises the probability of the true 2-grams versus the true 3-grams a bit.

    No, it's not like that. The decoder never queries those trigrams with
    </s> <s>, so they have no effect except that they take up memory.

    Is there a problem with my doing a last pass on the .arpa after it is
    created, just deleting the lines in which the </s> <s> pattern appears and
    then adjusting the n-gram counts in the data section? Or will this have a
    distorting effect on the overall probabilities? Whatever I do will have to
    work as well for 1,000 words derived from complete sentences as for 10
    words derived from a simple command-and-control corpus.

    Cleanup like this will work too.

    I would rather understand how those things get into the counts in the
    first place and fix that. It shouldn't be complex, just some painful work
    to clean up cmuclmtk. I can't give you any other advice.

     
