Please consider the attached spam I just received (I
removed a few headers for a little privacy :^). It
uses a strategy to avoid filters, which is to interrupt
almost every word with an html comment. This seems to
be as bad for bogofilter as the base64 problem -- or
worse, because the secondary effect of polluting the
database with irrelevant words is bigger here.
I think that, ultimately, the ideal for text analysis is
to present to bogofilter the same text that the mail
client presents to the human, but for now it seems
good enough to have a way to join tokens artificially
separated by html comments when bogofilter does its
statistical analysis.
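
As a rough sketch of the joining idea (this is not bogofilter code; the
function names and the token pattern are assumptions made up for
illustration), the comments could be deleted, without putting a space in
their place, before the text is split into tokens:

```python
import re

# Matches an HTML comment, including ones that span several lines.
HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)

# Hypothetical token pattern, for illustration only;
# bogofilter's real lexer has its own rules.
TOKEN = re.compile(r"[A-Za-z0-9$'-]+")


def strip_html_comments(text):
    """Delete comments without inserting a space, so split words rejoin."""
    return HTML_COMMENT.sub("", text)


def tokens(text):
    """Lowercase the comment-free text and split it into tokens."""
    return TOKEN.findall(strip_html_comments(text).lower())


sample = "Get your mort<!-- a1 -->gage appro<!--zz-->ved to<!-- q -->day"
print(tokens(sample))
```

Without the stripping step the same pattern would yield fragments such as
"mort", "gage" and "appro", plus the comment contents themselves, which
is the database pollution mentioned above.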
Of course the headers are still left for bogofilter's
statistical analysis, but...