#11 Random HTML tags splitting tokens

Status: closed
Owner: nobody
Labels: None
Priority: 5
Updated: 2003-02-06
Created: 2003-02-04
Creator: Andrew Kisliakov
Private: No

I recently received a spam message that used a technique
similar to using an HTML comment to split a word.
However, this particular specimen split the word with
an incorrectly formed HTML comment tag that lacks
the -- characters, e.g.:

wo<!junk>rd

It would probably be a good idea to generalise the HTML
comment stripping code so that it strips all HTML tags
(after extracting the tokens they contain) before parsing
the remaining text.
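
As a rough illustration of the suggestion above, here is a minimal sketch in Python. It is not bogofilter's actual code (bogofilter is written in C), and the names TAG_RE, WORD_RE and tokenize are hypothetical, made up for this example. The sketch assumes that anything starting with '<' followed by '!', '/' or a letter and running to the next '>' counts as a tag, whether or not it is well formed.

```python
import re

# Hypothetical pattern (an assumption, not bogofilter's lexer rule):
# '<', then '!', '/' or a letter, then everything up to the next '>'.
TAG_RE = re.compile(r"<[!/A-Za-z][^>]*>")

# Deliberately simple notion of a token: a run of letters and digits.
WORD_RE = re.compile(r"[A-Za-z0-9]+")


def tokenize(text):
    """Return (body_tokens, tag_tokens) for a piece of message text.

    Tokens found inside tags are collected first, then each tag is
    removed so that the text on either side of it is joined again.
    """
    tag_tokens = []

    def drop_tag(match):
        # Keep the words that were inside the tag ...
        tag_tokens.extend(WORD_RE.findall(match.group(0)))
        # ... and replace the tag with nothing, rejoining the split word.
        return ""

    stripped = TAG_RE.sub(drop_tag, text)
    return WORD_RE.findall(stripped), tag_tokens


print(tokenize("wo<!junk>rd"))  # (['word'], ['junk'])
```

One caveat with this approach: replacing every tag with nothing also glues together words that were separated only by a tag such as <br>, so a real implementation might want to treat block-level tags as whitespace instead.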

Discussion

  • David Relson
    2003-02-06

    Andrew,

    The code for stripping HTML comments is a recent addition.
    You can expect to see it enhanced as we learn more about
    what's needed.

    I suggest you subscribe to the bogofilter mailing list,
    bogofilter@aotto.com, to keep abreast of developments.

    David

     
  • David Relson
    2003-02-06

    • status: open --> closed