#8 lexer improvements

Status: closed-postponed
Owner: nobody
Labels: None
Priority: 5
Updated: 2003-05-31
Created: 2002-12-09
Private: No

- weeds out more headers known to contain unique,
irreproducible, or otherwise random content;
- accounts for continuation lines after the
above-mentioned headers, such as the References: header;
- conforms to the _real_ syntax of MIME boundaries,
as specified in RFC 2046 (e.g. a space is legal
between the boundary characters);
- refines the match for Base64 so that short tail lines
don't get miscounted as words;
- gives a nod to PGP signatures;
- adds uppercase HTML tags to the ignored words list
(but why sophisticate the parser with them to begin
with, instead of keeping them in the default list of
ignored words?).

Discussion

  • Matthias Andree

    Matthias Andree - 2002-12-13

    Logged In: YES
    user_id=2788

    We haven't yet decided what to do with the patch; it is full of useful changes. The uppercase tags section will be dropped, though: we would do better to make the lexer case-insensitive, because <TaBlE> is still a valid HTML tag and we don't want all 32 lower-/upper-case combinations of TABLE in the lexer. It's too fat already.
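
The arithmetic behind the "32 combinations" remark can be checked with a short sketch: a five-letter tag name has 2^5 = 32 case variants, while a single case-insensitive pattern covers all of them. The helper `case_variants` and the pattern `TABLE_TAG_RE` are hypothetical names for illustration only and are not taken from the bogofilter lexer.

```python
import re
from itertools import product


def case_variants(word):
    """Return every lower-/upper-case spelling of word as a set."""
    choices = ((c.lower(), c.upper()) for c in word)
    return {"".join(chars) for chars in product(*choices)}


# One case-insensitive rule instead of 32 literal spellings.
TABLE_TAG_RE = re.compile(r"</?table\b[^>]*>", re.IGNORECASE)

print(len(case_variants("table")))            # 32
print(bool(TABLE_TAG_RE.fullmatch("<TaBlE>")))  # True
```

If the lexer is generated with flex, the analogous switch would be its case-insensitive mode (the `-i` option), rather than a regular-expression flag.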

     
  • Matthias Andree

    Matthias Andree - 2002-12-13
    • status: open --> open-postponed
     
  • David Relson

    David Relson - 2003-05-31
    • status: open-postponed --> closed-postponed
     
  • David Relson

    David Relson - 2003-05-31

    Logged In: YES
    user_id=30510

    Mikhail,

    Bogofilter's parsing has changed significantly with
    release 0.13. Most of the features you've requested are handled.

    PGP doesn't get special treatment, and HTML tags are ignored,
    except for A, IMG, and FONT tags, whose innards are
    tokenized and scored.

    I'm closing this request since it has been substantially
    fulfilled.

    David
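
The tag handling described in the comment above (ignore HTML tags except A, IMG, and FONT, whose innards are tokenized) can be sketched roughly as follows. This is not bogofilter's implementation. It assumes that "innards" means the attribute values inside the tag, uses Python's standard `html.parser` module, and applies a deliberately simple word pattern. `SCORED_TAGS`, `TagTokenizer`, and `tokenize` are hypothetical names.

```python
import re
from html.parser import HTMLParser

# Tags whose attribute values are tokenized; all other tags are ignored.
SCORED_TAGS = {"a", "img", "font"}

# Assumption: a token is simply a run of ASCII letters and digits.
WORD_RE = re.compile(r"[A-Za-z0-9]+")


class TagTokenizer(HTMLParser):
    """Collect tokens from text and from the attributes of selected tags."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <FONT> and <font> are alike.
        if tag not in SCORED_TAGS:
            return
        for _name, value in attrs:
            if value:
                self.tokens.extend(WORD_RE.findall(value))

    def handle_data(self, data):
        # Ordinary text between tags is always tokenized.
        self.tokens.extend(WORD_RE.findall(data))


def tokenize(html):
    """Return the list of tokens found in an HTML fragment."""
    parser = TagTokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


print(tokenize('<TABLE border="1"><A HREF="http://spam.example/buy">Click</A></TABLE>'))
# ['http', 'spam', 'example', 'buy', 'Click']
```

The TABLE tag and its `border` attribute contribute nothing, while the link target inside the A tag is split into tokens alongside the visible text.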

     
