#1 {SG,X}ML tags not properly recognised

Status: open
Owner: nobody
Labels: None
Priority: 5
Updated: 2009-07-16
Created: 2009-07-16
Creator: Lou Burnard
Private: No

By default, your tokeniser only recognises XML or SGML tags of the form <xxxx>. When a tag contains spaces, such as <tag attr="foo">, it is not recognised as a tag and the tokeniser attempts to tokenise its contents. This is almost certainly not what an application should do, and it is not what the current release of TreeTagger does. Tags containing spaces should be treated just like any other tag and should not be tokenised.
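To illustrate the behaviour I would expect, here is a minimal sketch in Python. It is not taken from your tokeniser or from TreeTagger; the `TAG` pattern and the `tokenise` function are hypothetical names of my own, and the sketch assumes a tag is anything between `<` and `>` that contains no further angle brackets.

```python
import re

# Assumption: a tag starts with "<", ends with ">", and contains no other
# angle brackets. Spaces and attributes inside the tag are allowed.
TAG = re.compile(r"<[^<>]+>")


def tokenise(text):
    """Split text on whitespace, but keep each SGML/XML tag as one token."""
    tokens = []
    pos = 0
    for match in TAG.finditer(text):
        # Ordinary text before the tag is split on whitespace.
        tokens.extend(text[pos:match.start()].split())
        # The tag itself is emitted whole, attributes included.
        tokens.append(match.group())
        pos = match.end()
    # Whatever follows the last tag is split on whitespace as well.
    tokens.extend(text[pos:].split())
    return tokens


print(tokenise('<tag attr="foo">Hello world</tag>'))
```

With this approach <tag attr="foo"> comes out as a single token, exactly like <tag>. The sketch would still go wrong on an attribute value that itself contains ">", so a real fix may need a stricter pattern.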

Apologies if this is a configuration option which I missed!
