I've found that an unescaped angle bracket in TREC doc elements can cause the terms following the angle bracket to not be indexed. Please see the attached sample doc and param xml as an example. That seems very reasonable, and changing the corpus.class from trectext to txt seems to get around this problem.
I was hoping that using a fileclass of txt would also be a solution to a problem I raised a while back about angle brackets not being seen as tokens - https://sourceforge.net/p/lemur/discussion/2106523/thread/8470aff4/#8750/53cd. (Unfortunately, the suggestion of converting angle brackets to an OOV term is impractical in our app.) It seems that angle brackets are still stripped from the indexed text even with a fileclass of txt. Is this expected? If so, are there any other suggestions (besides converting/escaping the angle brackets) for avoiding the problem described in https://sourceforge.net/p/lemur/discussion/2106523/thread/8470aff4/#8750/53cd? Thanks.
The tagged tokenizer will consume all characters following an unclosed tag, e.g. <two..., looking for the > at the end.
The txt lexer discards all punctuation, including angle brackets.
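To make the two behaviours above concrete, here is a small Python toy model. It is not the actual Indri lexer code (those are flex grammars); the function names and the alphanumeric token rule are assumptions made purely for illustration.

```python
import re

def tagged_tokenize(text):
    """Toy tag-aware tokenizer (an illustration, not Indri's lexer).

    A '<' opens a tag that runs to the next '>'. If no '>' follows,
    the rest of the input is swallowed as tag content.
    """
    tokens = []
    pos = 0
    while pos < len(text):
        lt = text.find("<", pos)
        if lt == -1:
            # No more tags: tokenize the remainder as ordinary text.
            tokens += re.findall(r"[A-Za-z0-9]+", text[pos:])
            break
        # Tokenize the text that precedes the tag.
        tokens += re.findall(r"[A-Za-z0-9]+", text[pos:lt])
        gt = text.find(">", lt)
        if gt == -1:
            # Unclosed tag: everything after '<' is lost.
            break
        pos = gt + 1
    return tokens

def plain_tokenize(text):
    """Toy plain-text tokenizer: keeps alphanumeric runs, drops punctuation."""
    return re.findall(r"[A-Za-z0-9]+", text)
```

With an input such as "one <two three four", the tag-aware toy keeps only the term before the bracket, while the plain-text toy keeps all four terms but never emits the bracket itself.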
You will have to change the lexer, either TextParser.l (txt class) or TextTokenizer.l (tagged document classes), to accomplish your objective. Note that this is functionally equivalent to substituting an indexable token, as previously suggested.
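For reference, the token-substitution alternative could be sketched as a preprocessing step like the one below. The placeholder strings and the function name are hypothetical, and the sketch assumes plain-text input (the txt class), since in a tagged class it would also rewrite the real markup. The same substitution would have to be applied to query text.

```python
# Hypothetical placeholder tokens; any alphanumeric strings that never
# occur in the real collection would do.
LT_TOKEN = "xxltxx"
GT_TOKEN = "xxgtxx"

def substitute_angle_brackets(text):
    """Replace each angle bracket with an indexable placeholder token."""
    padded = text.replace("<", f" {LT_TOKEN} ").replace(">", f" {GT_TOKEN} ")
    # Collapse the extra whitespace introduced by the padding.
    return " ".join(padded.split())
```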