
<UNK> tag in the arpa file

  • UF grad

    UF grad - 2008-08-14

    My training data has only digits, but for some reason the arpa file created by the LM tools contains <UNK>. During decoding with Sphinx3, the decoder reports the following error in the log file:

    297:ERROR: "wid.c", line 282: <UNK> is not a word in dictionary and it is not a class tag.

    Here is the content of the arpa file:

    \1-grams:
    -2.1255 <UNK> 0.0000
    -0.8702 EIGHT 0.2125
    -1.1043 FIVE 0.1381
    -1.1043 FOUR 0.0565
    -1.3473 NINE -0.0028
    -1.1712 OH -0.0283
    -1.4723 ONE -0.0892
    -1.1712 SEVEN 0.0240
    -0.9080 SIX -0.1648
    -0.9494 THREE 0.2020

    Can someone point out what this <UNK> is? Let me know if you want to see the actual training data.

     
    • Rafa

      Rafa - 2008-09-08

      Hi.
      I've run into the same error message, but with a difference: I actually want to use the <UNK> word!

      After browsing the forum, it seems that Sphinx3 doesn't have word spotting implemented, and that's what I need: detecting the word "cat" in the sentence "my cat is pink", i.e., recognizing "<UNK> cat <UNK> <UNK>" or similar.

      One possibility is building a garbage model with all phone pronunciations, but I don't think that's the ideal solution. Ideally, any hypothesis that doesn't score above a certain confidence threshold should be rejected as <UNK>; I suppose that is what is not implemented in Sphinx3.
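
      As a minimal sketch of that idea in Python (purely as post-processing, not anything Sphinx3 provides): replace every word whose confidence falls below a threshold with <UNK>. The (word, confidence) pairs, the reject_low_confidence name, and the 0.5 threshold are all hypothetical; they assume a decoder that exposes word-level confidence scores, which is exactly the part that seems to be missing.

      UNK = "<UNK>"

      def reject_low_confidence(scored_words, threshold=0.5):
          # scored_words: list of (word, confidence) pairs, confidence in [0, 1]
          return [word if conf >= threshold else UNK for word, conf in scored_words]

      # Hypothetical decoder output for "my cat is pink" when only CAT scores well:
      hypothesis = [("MY", 0.21), ("CAT", 0.93), ("IS", 0.30), ("PINK", 0.18)]
      print(" ".join(reject_low_confidence(hypothesis)))
      # -> <UNK> CAT <UNK> <UNK>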

      But what I want to know is why the <UNK> symbol is explicitly supported by Sphinx3 in LMs, including bigrams and trigrams. Such LMs are capable of recognizing <UNK> anywhere inside a sentence without discarding the other words, so it seems that it's all in there!

      So, what is the point of <UNK> tokens being supported in bigrams and trigrams? Is there any way of using them?

      Thank you.

       
    • David Huggins-Daines

      By default, the LM tools create "open vocabulary" language models, which means that the vocabulary contains an unknown word token <UNK> which represents all words that are not part of the vocabulary. Because language model probabilities are smoothed, even if there are no unknown words in training, <UNK> is still assigned a small probability, just like any other vocabulary word that wasn't seen in training.
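
      As a rough illustration, the following Python snippet (not part of the LM tools) reads the \1-grams section of an ARPA file and prints each word with its log10 probability; the file name "digits.arpa" is only an example. Run on the file quoted above, it lists <UNK> alongside the real vocabulary with the small smoothed probability described here: a log10 probability of -2.1255, i.e. roughly 0.0075.

      def read_unigrams(arpa_path):
          unigrams = {}
          in_unigrams = False
          with open(arpa_path) as f:
              for line in f:
                  line = line.strip()
                  if line == r"\1-grams:":
                      in_unigrams = True
                      continue
                  if not in_unigrams or not line:
                      continue
                  if line.startswith("\\"):
                      break  # reached the next section, e.g. \2-grams: or \end\
                  fields = line.split()
                  # each entry is: log10_probability  word  [log10_backoff_weight]
                  unigrams[fields[1]] = float(fields[0])
          return unigrams

      for word, logprob in read_unigrams("digits.arpa").items():
          print(word, logprob)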

      However, the Sphinx decoders don't know or care about words that aren't in the intersection of the dictionary and the LM vocabulary. In Sphinx3, <UNK> is treated like any other word that isn't in the dictionary, and is ignored (that message should really be a warning rather than an error). In PocketSphinx, <UNK> is mapped to the language model ID for unknown words, which in practice means that it is ignored.

      The LM tools should probably create closed vocabulary models by default instead, to avoid this confusion.
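
      Until then, one way to silence the message is to strip <UNK> from an existing ARPA file, since (as noted above) the decoders ignore it anyway. The Python sketch below drops every n-gram line containing <UNK> and adjusts the counts in the \data\ section to match; it leaves the remaining probabilities untouched, so the result is no longer exactly normalized, and the file names are only examples.

      import re

      def strip_unk(in_path, out_path, unk="<UNK>"):
          with open(in_path) as f:
              lines = f.read().splitlines()

          removed = {}          # n-gram order -> number of entries dropped
          kept = []
          order = 0             # 0 while still in the \data\ header
          for line in lines:
              m = re.match(r"\\(\d+)-grams:", line)
              if m:
                  order = int(m.group(1))
              if order and unk in line.split():
                  removed[order] = removed.get(order, 0) + 1
                  continue
              kept.append(line)

          # fix up the "ngram N=count" lines in the \data\ section
          for i, line in enumerate(kept):
              m = re.match(r"ngram (\d+)=(\d+)", line)
              if m:
                  n, count = int(m.group(1)), int(m.group(2))
                  kept[i] = f"ngram {n}={count - removed.get(n, 0)}"

          with open(out_path, "w") as f:
              f.write("\n".join(kept) + "\n")

      strip_unk("digits.arpa", "digits_closed.arpa")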

       

