#17 HTML should trump plain text


I now get spam where the message is a multipart MIME
message, with both an HTML section and a text section.
My email client shows the HTML and ignores the text.
This structure (multipart/alternative) was designed so
that folks whose mail clients can't show HTML still have
something to read; the text is supposed to say the same
thing as the HTML. But spammers are abusing it now.

The text will be something very unlikely to trip a spam
filter: a page from a Kipling novel, or Alice in
Wonderland, or poetry, or whatever. Then the HTML
section has the true payload.

I think that by default, SpamProbe should ignore the
text-only part in this sort of message, and check
purely the HTML part. I don't want to train my spam
database that Alice or Kipling are spam indicators, and
I want to increase the chances of catching the spam.
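
To make the proposed default concrete, here is a minimal
sketch in Python using the standard email package. It is
not SpamProbe code (SpamProbe is C++), and the helper names
has_html and parts_to_score are made up for illustration.
It assumes the rule only applies inside a
multipart/alternative container, and that the plain-text
alternative is dropped whenever an HTML alternative exists.

```python
from email import message_from_string
from email.message import Message


def has_html(part: Message) -> bool:
    """True if this part, or anything nested inside it, is text/html."""
    return any(p.get_content_type() == "text/html" for p in part.walk())


def parts_to_score(msg: Message) -> list:
    """Return the leaf MIME parts a filter would tokenize (hypothetical helper)."""
    if not msg.is_multipart():
        return [msg]
    children = msg.get_payload()  # a list of Message objects for multipart
    if msg.get_content_type() == "multipart/alternative" and any(
        has_html(c) for c in children
    ):
        # An HTML alternative exists, so skip the plain-text alternative.
        children = [c for c in children if c.get_content_type() != "text/plain"]
    leaves = []
    for child in children:
        leaves.extend(parts_to_score(child))
    return leaves


# Made-up sample message: innocent text part, spammy HTML part.
RAW = """\
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="b"

--b
Content-Type: text/plain

Alice was beginning to get very tired of sitting by her sister.
--b
Content-Type: text/html

<html><body><b>Buy pills now</b></body></html>
--b--
"""

kept = parts_to_score(message_from_string(RAW))
print([p.get_content_type() for p in kept])  # ['text/html']
```

A message with only a text/plain part passes through
unchanged, so ordinary plain-text mail would still be
scored as before.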

P.S. If you analyze such an email message and can
deduce that the HTML part is egregiously different from
the text part, the email is almost certainly spam.
Normal email will have the same message in the text and
HTML sections. But the two sections won't be
identical, and I'm not sure how you could write code to
compare them. Strip out all HTML tags from the HTML
section, and then use a Bayesian comparison of the
words from the two sections?
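
One rough way to try the comparison, sketched in Python
with only the standard library. Note that this swaps the
Bayesian comparison for something simpler: a Jaccard
overlap of the two word sets (shared words divided by all
words). The function names are made up for illustration,
and where to put the "egregiously different" cutoff is
left open, since that would need tuning on real mail.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


def words(text):
    """Lowercased set of word-like tokens in plain text."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def html_words(html):
    """Strip the tags from an HTML string and return its word set."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return words(" ".join(parser.chunks))


def similarity(plain, html):
    """Jaccard overlap of the word sets: 1.0 is identical, 0.0 is disjoint."""
    a, b = words(plain), html_words(html)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Same message in both parts: full overlap.
print(similarity("Lunch at noon on Friday?",
                 "<p>Lunch at <b>noon</b> on Friday?</p>"))  # 1.0
# Innocent text, unrelated HTML payload: no overlap.
print(similarity("Alice was beginning to get very tired",
                 "<b>Buy pills now</b>"))  # 0.0
```

A low score could then be fed to the filter as one more
piece of evidence rather than as a verdict on its own,
because legitimate mailers sometimes send a shortened
text part.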

Steve R. Hastings