Moshe Yudkowsky - 2002-07-26

I've placed <s> </s> and <p> </p> pairs into my corpus, and the resulting language model contains some truly strange 3-grams.

As an example: "</s> <s> word".

This strikes me as entirely wrong. I have similar problems with <p> tags.

I do expect to see 3-grams such as "<s> word1 word2" or "<s> solitary_word </s>".
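
A minimal sketch of how a 3-gram like "</s> <s> word" can arise. This is an assumption, not confirmed toolkit behavior: it supposes the n-gram counter slides a window over the corpus as one continuous token stream without resetting at sentence boundaries. The `trigrams` helper and the sample corpus are made up for illustration and are not part of any toolkit.

```python
def trigrams(tokens):
    """Return every 3-token window over a flat token list."""
    return [tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)]


# Hypothetical two-sentence corpus, flattened into a single stream
# (assumption: no reset happens at </s>).
corpus = "<s> hello world </s> <s> goodbye </s>"
tokens = corpus.split()

for gram in trigrams(tokens):
    print(" ".join(gram))
```

Under that assumption, the windows that straddle the sentence boundary ("world </s> <s>" and "</s> <s> goodbye") appear alongside the expected ones such as "<s> hello world" and "<s> goodbye </s>".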

1. Am I correct that context cue markers should not show up in the n-grams in this fashion?

2. If yes, does anyone know how to fix this problem?

WHAT DOESN'T WORK:
* I've tried placing <s>, </s>, <p>, </p> into the ccs file instead of just <s> or <p>.

* I've tried a ccs file with just <p>, or with <p> first in the text.

* I've tried taking <s> and <p> out of the vocabulary, but that gives me <UNK> in the n-grams and no <s> or </s> at all.

TIA.