#8 lexer improvements

closed-postponed
nobody
None
5
2003-05-31
2002-12-09
Mikhail Zabaluev
No

- weeds out more headers known to contain unique,
irreproducible or otherwise random content;
- accounts for continuation lines after the
above-mentioned headers (References:, for example);
- conforms to the _real_ syntax of MIME boundaries,
as specified in RFC 2046 (e.g. a space is legal
between the boundary characters);
- refines the match for Base64 so that short tail lines
don't get miscounted as words;
- gives a nod to PGP signatures;
- adds uppercase HTML tags to the ignored words list
(but why sophisticate the parser with them to begin
with, instead of keeping them in the default list of
ignored words?).
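
For readers unfamiliar with the boundary rule mentioned in the list above, here is a minimal sketch of the RFC 2046 grammar (`boundary := 0*69<bchars> bcharsnospace`) in Python. It is not taken from the patch or from bogofilter's flex lexer; the names `is_valid_boundary` and `is_delimiter_line` are hypothetical and exist only for this illustration.

```python
import re

# bcharsnospace from RFC 2046, section 5.1.1, written as a character-class body.
BCHARS_NOSPACE = r"0-9A-Za-z'()+_,\-./:=?"

# Up to 69 bchars (space allowed) followed by one non-space bchar: 1..70 chars total.
BOUNDARY_RE = re.compile(rf"[{BCHARS_NOSPACE} ]{{0,69}}[{BCHARS_NOSPACE}]")


def is_valid_boundary(value):
    """Return True if value is a syntactically legal boundary parameter."""
    return BOUNDARY_RE.fullmatch(value) is not None


def is_delimiter_line(line, boundary):
    """Return True if line is a delimiter ("--boundary") or a close
    delimiter ("--boundary--"), ignoring trailing padding and the line ending."""
    stripped = line.rstrip(" \t\r\n")
    return stripped in ("--" + boundary, "--" + boundary + "--")


if __name__ == "__main__":
    print(is_valid_boundary("simple boundary"))   # space inside is legal
    print(is_valid_boundary("trailing space "))   # space at the end is not
```

A lexer that only accepts boundaries without spaces would miss delimiter lines such as `--simple boundary`, which is the case the patch description refers to.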

Discussion

  • Logged In: YES
    user_id=2788

    We haven't yet decided what to do with the patch; it is full of useful changes. The uppercase tags section will be dropped, though: we would do better to make the lexer case-insensitive, because <TaBlE> is still a valid HTML tag and we don't want all 32 lower-/upper-case combinations of TABLE in the lexer. It's too fat already.
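
The case-insensitivity argument can be sketched in a few lines of Python. This is only an illustration of the idea, not bogofilter's lexer (which is written with flex); the ignore list `IGNORED_TAGS` and both function names are hypothetical.

```python
import itertools
import re

# Hypothetical ignore list, stored once in lower case.
IGNORED_TAGS = {"table", "tr", "td", "br"}

# Tag name at the start of an opening or closing tag, in any letter case.
TAG_NAME = re.compile(r"</?([a-z][a-z0-9]*)", re.IGNORECASE)


def is_ignored_tag(token):
    """Return True if token starts like an HTML tag whose name is ignored."""
    match = TAG_NAME.match(token)
    return match is not None and match.group(1).lower() in IGNORED_TAGS


def case_variants(word):
    """Return every lower/upper-case spelling of word as a set."""
    pairs = [(ch.lower(), ch.upper()) for ch in word]
    return {"".join(combo) for combo in itertools.product(*pairs)}


if __name__ == "__main__":
    variants = case_variants("table")
    # Five letters, two cases each: 2**5 spellings, all caught by one rule.
    print(len(variants), all(is_ignored_tag("<" + v + ">") for v in variants))
```

Folding case once at match time keeps the word list at one entry per tag, rather than one entry per spelling.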

     
    • status: open --> open-postponed
     
  • David Relson
    2003-05-31

    • status: open-postponed --> closed-postponed
     
  • David Relson
    2003-05-31

    Logged In: YES
    user_id=30510

    Mikhail,

    Bogofilter's parsing was changed significantly in version
    0.13. Most of the features you've requested are handled.

    PGP doesn't get special treatment, and HTML tags are ignored,
    except for A, IMG, and FONT tags, whose innards are
    tokenized and scored.

    I'm closing this request since it has been substantially
    fulfilled.

    David
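
As a rough illustration of the tag handling described in the comment above, the following Python sketch drops every HTML tag except A, IMG, and FONT, and tokenizes the attribute values of those three along with the ordinary text. It uses the standard-library `html.parser` and is not bogofilter's implementation. Reading "innards" as attribute values is an assumption, as are the token pattern `WORD` and the names `SketchTokenizer` and `tokenize`.

```python
import re
from html.parser import HTMLParser

# Tags whose attribute values are kept (assumed reading of "innards").
KEPT_TAGS = {"a", "img", "font"}

# Simplistic token pattern, chosen for this sketch only.
WORD = re.compile(r"[A-Za-z0-9]+")


class SketchTokenizer(HTMLParser):
    """Collect tokens from text and from the attributes of kept tags."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag names, so <TABLE> and <table> look alike here.
        if tag not in KEPT_TAGS:
            return
        for _name, value in attrs:
            if value:
                self.tokens.extend(WORD.findall(value))

    def handle_data(self, data):
        self.tokens.extend(WORD.findall(data))


def tokenize(html):
    """Return the list of tokens found in an HTML fragment."""
    parser = SketchTokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


if __name__ == "__main__":
    print(tokenize('<TABLE border="1"><A HREF="http://example.com/buy">Click</A></TABLE>'))
```

In this sketch the TABLE attributes never reach the token list, while the link target inside the A tag does.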