My training data contains only digits, but for some reason the ARPA file created by the LM tools contains <UNK>. During decoding with Sphinx3, the decoder reports the following error in the log file:
297:ERROR: "wid.c", line 282: <UNK> is not a word in dictionary and it is not a class tag.
Here is the relevant content of the ARPA file:
\1-grams:
-2.1255 <UNK> 0.0000
-0.8702 EIGHT 0.2125
-1.1043 FIVE 0.1381
-1.1043 FOUR 0.0565
-1.3473 NINE -0.0028
-1.1712 OH -0.0283
-1.4723 ONE -0.0892
-1.1712 SEVEN 0.0240
-0.9080 SIX -0.1648
-0.9494 THREE 0.2020
Can someone point out what this <UNK> is? Let me know if you want to see the actual training data.
Hi.
I've run into the same error message, but there's a difference: I actually want to use the <UNK> word!
After browsing the forum it seems that Sphinx3 doesn't have word spotting implemented, and that's exactly what I need: detecting the word "cat" in the sentence "my cat is pink", i.e., recognizing "<UNK> cat <UNK> <UNK>" or something similar.
One possibility is building a garbage model with all phone pronunciations, but I don't think that's the ideal solution. Ideally, any hypothesis that doesn't score above a certain confidence threshold would be rejected as <UNK>. I suppose that is what is not implemented in Sphinx3.
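For illustration, a crude version of such a garbage entry could be generated by listing every phone of the acoustic model as an alternative pronunciation of a single dictionary word; the phone list and the word name below are only placeholders:

# Sketch: write a Sphinx dictionary where one "garbage" word can be
# pronounced as any single phone. Alternative pronunciations use the
# WORD(2), WORD(3), ... convention of Sphinx dictionaries.
phones = ["AA", "AE", "AH", "B", "D", "EH", "IY", "K", "S", "T"]  # placeholder subset
with open("garbage.dict", "w") as f:
    for i, phone in enumerate(phones):
        word = "GARBAGE" if i == 0 else "GARBAGE(%d)" % (i + 1)
        f.write("%s %s\n" % (word, phone))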
But what I want to know is why the <UNK> symbol is explicitly supported in Sphinx3 LMs, including in bigrams and trigrams. Such LMs are capable of modeling <UNK> anywhere inside a sentence without discarding the other words, so it seems everything needed is already in there!
So, what is the purpose of supporting <UNK> tokens in bigrams and trigrams? Is there any way to use it?
Thank you.
By default, the LM tools create "open vocabulary" language models, which means that the vocabulary contains an unknown word token <UNK> which represents all words that are not part of the vocabulary. Because language model probabilities are smoothed, even if there are no unknown words in training, <UNK> is still assigned a small probability, just like any other vocabulary word that wasn't seen in training.
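To give a feel for how small that probability is: the first column of an ARPA file is a base-10 log probability, so the <UNK> entry in the excerpt above corresponds to well under one percent. A quick check in Python:

# The ARPA format stores base-10 log probabilities in the first column
# and (optional) backoff weights in the last column.
log10_prob = -2.1255               # the <UNK> unigram from the excerpt above
prob = 10 ** log10_prob
print("P(<UNK>) = %.4f" % prob)    # prints roughly 0.0075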
However, the Sphinx decoders don't know or care about words that aren't in the intersection of the dictionary and the LM vocabulary. In Sphinx3, <UNK> is treated like any other word that isn't in the dictionary, and is ignored (that message should really be a warning rather than an error). In PocketSphinx, <UNK> is mapped to the language model ID for unknown words, which in practice means that it is ignored.
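If you want to see exactly which LM words the decoder will drop this way, you can compare the two word lists. A small sketch, assuming a standard Sphinx dictionary and an ARPA-format LM like the one above; the file names are placeholders:

# Sketch: list LM unigram words that are missing from the dictionary
# (and would therefore be ignored by the decoder).
def dict_words(path):
    words = set()
    with open(path) as f:
        for line in f:
            if line.strip():
                # Dictionary lines look like "EIGHT EY T"; alternative
                # pronunciations are written "EIGHT(2) ...", so strip that.
                words.add(line.split()[0].split("(")[0])
    return words

def arpa_unigrams(path):
    words = set()
    in_unigrams = False
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line == "\\1-grams:":
                in_unigrams = True
            elif line.startswith("\\"):
                in_unigrams = False
            elif in_unigrams and line:
                words.add(line.split()[1])  # columns: logprob, word, [backoff]
    return words

missing = arpa_unigrams("digits.arpa") - dict_words("digits.dic")
# Note: the sentence markers <s> and </s> may also show up here.
print("LM words not in the dictionary:", sorted(missing))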
The LM tools should probably create closed vocabulary models by default instead, to avoid this confusion.
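In the meantime, one workaround is to strip the <UNK> unigram out of the generated ARPA file yourself. A rough sketch, assuming <UNK> appears only in the 1-gram section (as in the excerpt above) and that the file fits in memory; the file names are placeholders:

# Remove the <UNK> unigram and fix the "ngram 1=" count in the \data\ header.
with open("digits.arpa") as f:
    lines = f.readlines()

removed = sum(1 for line in lines
              if len(line.split()) >= 2 and line.split()[1] == "<UNK>")

out = []
for line in lines:
    parts = line.split()
    if len(parts) >= 2 and parts[1] == "<UNK>":
        continue                           # drop the <UNK> entry
    if line.startswith("ngram 1="):
        count = int(line.strip().split("=")[1])
        line = "ngram 1=%d\n" % (count - removed)
    out.append(line)

with open("digits_closed.arpa", "w") as f:
    f.writelines(out)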