#60 BF should decode &#65

Status: closed
Owner: nobody
Labels: None
Priority: 5
Updated: 2003-10-08
Created: 2003-10-04
Creator: Tim Freeman
Private: No

I received the following spam with a style of HTML
obfuscation I have not seen before:

<font color="#FFFFFD">summon her allies), then the
lesser states will hold aloof and </font><FONT SIZE=3
PTSIZE=12><br>
=<!--89-->========<!--6-->====<!--8-->===========<!--W-->==<!--4-->=<!--i-->=<!--5r-->==================<br>
Get<!--vE--> AN<!--4-->Y RX<!--cs-->
D<!--3-->rugs<!--N-->
You N<!--18-->EED or R<!--1z-->e<!--7-->fills!!<BR>
=<!--NQ-->==========<!--7Q-->====<!--2-->=======<!--9-->=====<!--00-->=<!--8T-->=====<!--M-->====<!--3-->=<!--X1-->========<br>
<font color="#FFFFFA">1,500. It came therefore to
L67,500, and L80,000 more for fitting it up,</font><br>
OUR<!--Tn--> US D<!--2-->octo<!--51-->r<!--mC-->s
wil<!--5d-->l
<!--1-->Wri<!--1i-->t<!--5-->e YOU
a <!--29-->Prescri<!--m-->pti<!--1-->on<!--0-->
for <!--93-->F<!--4U-->REE<BR>
You w<!--e8-->il<!--0-->l<!--X--> <!--k-->get it
NEXT-DAY
via Fed<!--Lg-->-Ex!!</FONT><BR>
<font color="#FFFFF2">The human mind delights in grand
conceptions of supernatural beings.</font><br>
<a
href="http://ww%77.ed%64ytsed.biz/%76%70r%36651/"><!--1C-->Visit<!--NL-->
To<!--F-->d<!--5-->ay</a><BR>
<BR><font color="#FFFFF4">Almanac, if he made a point
of being acquainted with every thing</font><br>
<FONT SIZE=1><a
href="http://www.%65ddyt%73ed.biz/unsubs%63ribe.d%64d">Pl<!--cz-->e<!--a1-->ase<!--6-->
<!--UT-->n<!--5-->o more</a></FONT><p>
<font color="#FFFFF5">have to do with musical
composers, a piano, and a brief revery</font></p></FONT>

The page looks like this in a browser:

==============================================
Get ANY RX Drugs You NEED or Refills!!
==============================================
OUR US Doctors will Write YOU a Prescription for FREE
You will get it NEXT-DAY via Fed-Ex!!
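
To make the trick concrete, here is a minimal Python sketch of how the rendered text above falls out of the raw HTML. It is only an approximation of what a browser does, not bogofilter's code; the helper name `visible_text` is made up for this illustration. Comments are removed with nothing in their place, so the word fragments on either side rejoin, while tags are replaced with a space.

```python
import re

# Comments must go first: the tag pattern below would also match
# "<!--x-->" and replace it with a space, splitting the word again.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
TAG_RE = re.compile(r"<[^>]*>")


def visible_text(html_source: str) -> str:
    """Roughly approximate the text a browser would display."""
    without_comments = COMMENT_RE.sub("", html_source)
    without_tags = TAG_RE.sub(" ", without_comments)
    # Collapse runs of whitespace, including newlines, to single spaces.
    return " ".join(without_tags.split())


sample = (
    "Get<!--vE--> AN<!--4-->Y RX<!--cs-->\n"
    "D<!--3-->rugs<!--N-->\n"
    "You N<!--18-->EED or R<!--1z-->e<!--7-->fills!!<BR>"
)
print(visible_text(sample))  # Get ANY RX Drugs You NEED or Refills!!
```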

Feeding the email into "bogofilter -vvv" shows that it
doesn't see the word "Prescription" there. Instead, it
sees "Pre" and doesn't notice that "&#115;" is the same
as "s". It would be better if it understood the &# HTML
escapes.
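
As an illustration of what "understanding the &# escapes" would mean, the following Python sketch decodes numeric character references with the standard library's `html.unescape`. This is not how bogofilter works internally; the wrapper name `decode_entities` is an assumption made for the example.

```python
import html


def decode_entities(text: str) -> str:
    """Decode HTML character references such as &#115; into characters."""
    return html.unescape(text)


print(decode_entities("Pre&#115;cription"))  # Prescription
```

After decoding, a word-based tokenizer would see the whole word rather than the fragment in front of the escape.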

I observed this with bogofilter 0.15.4-1, which is close to
the current version available in Debian. I haven't tried
0.15.5 yet. My apologies if this bug has recently been fixed.

Incidentally, bogofilter is also distracted by the
almost-white text and sees words like "Almanac" in the
email that aren't visible in the browser. I don't see a
way to solve this, so I'm not officially reporting it here
and now, but I'll mention it just in case someone else sees
a fix. You can't simply ignore nearly-white text, since
then the spammer can write their message in white text
against a dark background.
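
The point about dark backgrounds can be sketched as follows: what matters is the difference between the text color and the background color, not whether the text is close to white. This is a rough Python illustration only. The function names, the default white background, and the threshold of 16 are all assumptions, and a real filter would first have to work out the effective background color, which is the hard part.

```python
from typing import Tuple


def parse_hex_color(value: str) -> Tuple[int, int, int]:
    """Turn a color such as "#FFFFFD" into (red, green, blue) integers."""
    digits = value.lstrip("#")
    return (int(digits[0:2], 16), int(digits[2:4], 16), int(digits[4:6], 16))


def is_low_contrast(foreground: str, background: str = "#FFFFFF",
                    threshold: int = 16) -> bool:
    """True if every color channel differs by less than the threshold."""
    fg = parse_hex_color(foreground)
    bg = parse_hex_color(background)
    return max(abs(f - b) for f, b in zip(fg, bg)) < threshold


print(is_low_contrast("#FFFFFD"))             # True: hidden on a white page
print(is_low_contrast("#FFFFFF", "#000000"))  # False: white on black is readable
```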

Discussion

  • David Relson

    David Relson - 2003-10-04

    Logged In: YES
    user_id=30510

    Tim,

    Bogofilter doesn't currently support decoding of "&"
    characters nor does it attempt to do "eye-space" decoding
    (which is needed to ignore the nearly-white text).

    These tricks are like the first spammer who used "v1agra".
    It worked for a while, but has become a red flag - a sure
    indicator of spam.

    Remember, it "learns" as you train it with spam tricks it's
    not seen before. The next spammer to use the same trick
    should be caught.

    David

     
  • Tim Freeman

    Tim Freeman - 2003-10-04

    bogofilter -vvv output for this spam

     
  • Tim Freeman

    Tim Freeman - 2003-10-04

    Logged In: YES
    user_id=299187

    >Remember, it "learns" as you train it with spam tricks
    >it's not seen before. The next spammer to use the same trick
    >should be caught.

    I don't see any words printed in the wordlist from
    "bogofilter -vvv" that seem likely to match this style of
    spam next time it comes up. As far as I can tell, if
    spammers choose to write mails using entirely &# escapes,
    Bogofilter will not learn anything useful from the body of
    the message.

    I'll attach the wordlist.

     
  • Tim Freeman

    Tim Freeman - 2003-10-04

    Logged In: YES
    user_id=299187

    Oops, I see words like FFFFF2 on the wordlist, so
    I agree that BF could learn to skip emails with
    nearly-white text. I stand by my claim that
    it learns nothing useful from text if enough &# escapes are
    present.

    The simplest fix might be to persuade the parser that "&#"
    is a word, or perhaps that &# followed by digits constitutes
    a word. With the second fix, BF could distinguish between an
    email that uses &# to encode plain ASCII and an email that
    uses &# to encode strange characters that perhaps merit being
    encoded like that.
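
A minimal Python sketch of the second idea, treating "&#" plus digits as a word: each numeric escape becomes its own token, so an escape for a plain ASCII letter is a different token from an escape for an accented character. The function name, the "ascii"/"other" labels, and the optional trailing semicolon are assumptions made for this illustration and are not bogofilter's actual lexer rules.

```python
import re
from typing import List, Tuple

# "&#", one or more digits, and an optional trailing semicolon.
ENTITY_RE = re.compile(r"&#([0-9]+);?")


def entity_tokens(text: str) -> List[Tuple[str, str]]:
    """Return (token, kind) pairs for each numeric escape found in text."""
    tokens = []
    for match in ENTITY_RE.finditer(text):
        code_point = int(match.group(1))
        # Printable ASCII has no need to be escaped in ordinary mail.
        kind = "ascii" if 32 <= code_point < 127 else "other"
        tokens.append((f"&#{code_point}", kind))
    return tokens


print(entity_tokens("Pre&#115;cription caf&#233;"))
# [('&#115', 'ascii'), ('&#233', 'other')]
```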

     
  • Matthias Andree

    Matthias Andree - 2003-10-05

    Logged In: YES
    user_id=2788

    David, we could emit these numeric HTML entities such as
    &#65; as individual tokens, which would allow bogofilter to
    learn &#65; as a spam token. I'd think this is rather
    effective to train for and catch spam ATM.

     
  • Matthias Andree

    Matthias Andree - 2003-10-05

    Logged In: YES
    user_id=2788

    I've posted a patch to emit these entities as tokens on the
    bogofilter-dev >AT< aotto.com mailing list for review.

     
  • Tim Freeman

    Tim Freeman - 2003-10-06

    Logged In: YES
    user_id=299187

    I see that the patch on bogofilter-dev includes this code:

    > +HTML_ENTITY "&#"[[:digit:]]+";"
    > +

    I think this says you only want to recognize the token if it
    has a trailing semicolon.

    Mozilla, at least, doesn't require the trailing semicolon. For
    example, the file

    &#65&#66

    displays as

    AB

    I think bogofilter should be at least as forgiving as the
    browsers. Otherwise the spammers use the loose syntax
    (leaving out the semicolon in this case), bogofilter chokes,
    and the message is readable when it gets through. I haven't
    tried this on other browsers. However, if current versions of
    other browsers aren't forgiving, then maybe future versions
    will be, so I still think Bogofilter shouldn't require the
    trailing semicolon.
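
The difference can be sketched with Python regular expressions standing in for the lexer rule. The strict pattern mirrors the quoted HTML_ENTITY definition; the lenient one makes the semicolon optional. The names `STRICT` and `LENIENT` are made up for this illustration, and this is not the patch itself.

```python
import re

STRICT = re.compile(r"&#[0-9]+;")    # semicolon required, as in the patch
LENIENT = re.compile(r"&#[0-9]+;?")  # semicolon optional, as browsers allow

sample = "&#65&#66"
print(STRICT.findall(sample))   # []
print(LENIENT.findall(sample))  # ['&#65', '&#66']
```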

    My guess on the choice between recognizing &#65; as a token
    or interpreting it as an "A" is that &#65; will only appear
    in spams, whereas if it is interpreted as "A" then the
    resulting words will have to go through the less reliable
    process of normal spam sorting. I vote for recognizing &#65
    as a token. I like the idea of recognizing spam by
    recognizing the tricks spammers use.

    I'm not sure about the process here. Do appends I make to this
    bug get forwarded to bogofilter-dev? If not, let me know and
    I'll resend to bogofilter-dev.

     
  • David Relson

    David Relson - 2003-10-08

    Logged In: YES
    user_id=30510

    Fixed in CVS.

     
  • David Relson

    David Relson - 2003-10-08
    • summary: BF should decode &#65 --> BF should decode &#65
     
  • David Relson

    David Relson - 2003-10-08
    • status: open --> closed
     
