#9 Strip html comments?

closed
nobody
None
5
2003-01-20
2002-12-21
No

Please consider the attached spam I just received (I
removed a few headers for a little privacy :^). It
uses a strategy to avoid filters, which is to interrupt
almost every word with an HTML comment. This seems to
be as bad for bogofilter as the base64 problem -- or
worse, because the secondary effect of polluting the
database with irrelevant words is bigger here.

I think that, ultimately, the ideal of text analysis is
to present to bogofilter the same text that the mail
client presents to the human, but for now it seems to be
good enough to have a way to rejoin tokens artificially
separated by HTML comments during bogofilter's
statistical analysis.

Of course, the headers are still left for
bogofilter's statistical analysis, but...
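
To illustrate the request, here is a minimal sketch of comment stripping applied before tokenizing, using a body line of the kind reported in this ticket. It is not bogofilter's code: the helper name `strip_html_comments` and the regular expression are assumptions made for this example, and a real filter would also have to handle malformed or unterminated comments.

```python
import re

# Hypothetical helper for illustration only; not taken from bogofilter.
# Matches "<!--", then as little text as possible, then "-->".
# re.DOTALL lets a comment span line breaks.
HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)


def strip_html_comments(text: str) -> str:
    """Remove HTML comments so that words split by them are rejoined."""
    return HTML_COMMENT.sub("", text)


sample = "Re<!--accounting-->duce Bo<!--accounting-->dy Fa<!--accounting-->t and"

# Without stripping, a whitespace tokenizer sees fragments mixed with comments.
print(sample.split())

# With stripping, the tokens match what the mail client displays.
print(strip_html_comments(sample).split())
# prints ['Reduce', 'Body', 'Fat', 'and']
```

Replacing each comment with the empty string, rather than with a space, is what rejoins the fragments into whole words.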

Discussion

  • Logged In: YES
    user_id=71708

    I received a few more spam messages built with the
    HTML-comment strategy. Here is a sample line from the body:

    Re<!--accounting-->duce Bo<!--accounting-->dy
    Fa<!--accounting-->t and

     
  • David Relson
    2003-01-20

    • status: open --> closed
     
  • David Relson
    2003-01-20

    Logged In: YES
    user_id=30510

    The ability to strip HTML comments has been added and is
    presently available through CVS. It will be included in the
    next release.