I've placed <s> </s> and <p> </p> pairs into my corpus, and the resulting language model contains some truly strange 3-grams.
As an example: "</s> <s> word".
This strikes me as entirely wrong. I have similar problems with <p> tags.
I do expect to see 3-grams such as "<s> word1 word2" or "<s> solitary_word </s>".
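To make the difference concrete, here is a small Python sketch of the two behaviours. It is only an illustration, not the toolkit's actual code, and the function names `trigrams_flat` and `trigrams_per_sentence` are made up for this post. The first slides a 3-token window over the whole token stream, which produces the "</s> <s> word" kind of 3-gram I'm seeing; the second restarts the window at every <s>, which is what I expected.

```python
from collections import Counter

def trigrams_flat(tokens):
    """Slide a 3-token window over the whole stream, ignoring sentence boundaries."""
    return Counter(zip(tokens, tokens[1:], tokens[2:]))

def trigrams_per_sentence(tokens, start="<s>", end="</s>"):
    """Slide the window only inside each <s> ... </s> span (hypothetical helper)."""
    counts = Counter()
    sentence = []
    for tok in tokens:
        if tok == start:
            # A new sentence begins: drop whatever context came before it.
            sentence = [tok]
        else:
            sentence.append(tok)
            if tok == end:
                counts.update(zip(sentence, sentence[1:], sentence[2:]))
                sentence = []
    return counts

corpus = "<s> hello world </s> <s> goodbye </s>".split()
flat = trigrams_flat(corpus)
scoped = trigrams_per_sentence(corpus)

print(("</s>", "<s>", "goodbye") in flat)    # True: crosses the sentence boundary
print(("</s>", "<s>", "goodbye") in scoped)  # False
print(("<s>", "goodbye", "</s>") in scoped)  # True
```

With the tiny two-sentence corpus above, only the flat version yields a 3-gram that starts with </s>.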
1. Am I correct that context hint markers should not show up in the text in this fashion?
2. If yes, does anyone know how to fix this problem?
WHAT DOESN'T WORK:
* I've tried placing <s>, </s>, <p>, </p> into ccs instead of just <s> or <p>.
* I've tried a ccs with just <p>, or with <p> first in the text.
* I've tried taking <s> and <p> out of the vocabulary, but that gives me <UNK> in the n-grams and no <s>, </s> at all.
TIA.