pyzor-users Mailing List for Pyzor (Page 33)

Brought to you by: abyz12345, alex-kiro, anadelonbrin

This project can now be found here.

pyzor-users — general discussion of Pyzor and Pyzor-related topics

Re: Message header issue

From: Frank T. <ft...@ne...> - 2003-06-10 19:57:55

Ted Sudtell, on 2003-06-06, wrote:

> Has this problem been addressed while using Pyzor report?  If so, what
> do I have to do?

> ValueError: unknown Content-Transfer-Encoding: binary

It's symptomatic of a more general issue, that pyzor doesn't like it when
illegal content-transfer encodings are used.  It's in the buglist.

-- 
Frank Tobin			http://www.neverending.org/~ftobin/
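
For readers who hit the same ValueError: the traceback in this thread shows mimetools.decode raising when a part declares Content-Transfer-Encoding: binary. Like 7bit and 8bit, binary means the body has not been transformed, so one tolerant approach is to pass such bodies through unchanged. The sketch below is written for modern Python, not the Python 2.2 that pyzor 0.4 ran on; the name decode_body and the policy of returning raw bytes for unrecognised encodings are assumptions made for illustration, not pyzor's actual fix.

```python
import base64
import quopri

# Encodings that mean "the bytes are already the content".
IDENTITY_ENCODINGS = {"", "7bit", "8bit", "binary"}


def decode_body(data: bytes, encoding: str) -> bytes:
    """Decode a MIME part body (hypothetical helper, not part of pyzor)."""
    name = encoding.strip().lower()
    if name in IDENTITY_ENCODINGS:
        return data
    if name == "base64":
        return base64.b64decode(data)
    if name == "quoted-printable":
        return quopri.decodestring(data)
    # Assumption: an unrecognised encoding is passed through untouched
    # rather than raising, so one odd part cannot abort the whole report.
    return data
```

Whether silently passing unknown encodings through is acceptable depends on how the digest is used; raising a more specific error and skipping the part would be an equally reasonable choice.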

Message header issue

From: Ted S. <wfa...@cu...> - 2003-06-06 15:53:17

Has this problem been addressed while using Pyzor report?  If so, what do I 
have to do?

Thanks

Traceback (most recent call last):
  File "/usr/bin/pyzor", line 4, in ?
    pyzor.client.run()
  File "/usr/lib/python2.2/site-packages/pyzor/client.py", line 934, in run
    ExecCall().run()
  File "/usr/lib/python2.2/site-packages/pyzor/client.py", line 188, in run
    if not apply(dispatch, (self, args)):
  File "/usr/lib/python2.2/site-packages/pyzor/client.py", line 282, in report
    for digest in FileDigester(sys.stdin, self.digest_spec, do_mbox):
  File "/usr/lib/python2.2/site-packages/pyzor/client.py", line 619, in next
    digest = self.digester.next()
  File "/usr/lib/python2.2/site-packages/pyzor/client.py", line 648, in next
    next_msg = self.mbox.next()
  File "/usr/lib/python2.2/mailbox.py", line 34, in next
    return self.factory(_Subfile(self.fp, start, stop))
  File "/usr/lib/python2.2/site-packages/pyzor/client.py", line 679, in __init__
    self.curfile = self.__class__(self.multifile)
  File "/usr/lib/python2.2/site-packages/pyzor/client.py", line 671, in __init__
    mimetools.decode(msg.fp, self.curfile, encoding)
  File "/usr/lib/python2.2/mimetools.py", line 149, in decode
    raise ValueError, \
ValueError: unknown Content-Transfer-Encoding: binary

Re: Pyzor and SpamAssassin and error 256

From: Frank T. <ft...@ne...> - 2003-06-04 05:29:38

Kyle Wheeler, on 2003-06-03, wrote:

> Sometimes this is simply because Pyzor timed out (which happens with
> alarming frequency, and doesn't seem to be handled very well). The Pyzor
> protocol seems based around UDP, which makes it have occasional problems
> with the iptables firewall that I go through (the UDP packets get denied
> on the return trip from the Pyzor server sometimes).

Ah, yes, I forgot about that potential problem; thanks!  Concerning the
frequency of timeouts, I think I might make it so that pyzor re-tries
sending a packet maybe two times.

> Other times I get this error much more consistently for a specific
> message. It will work fine if I pipe the mail directly to Pyzor, but not
> when through SpamAssassin. I had SpamAssassin save off the stderr of
> Pyzor when this happens. Anyone have any ideas as to why it might be
> happening? It *looks* like the spam message is somehow shorter than
> Pyzor is expecting...

This is a known problem in pyzor...I'll be adding some code so that this 
situation is handled more gracefully.

-- 
Frank Tobin			http://www.neverending.org/~ftobin/
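
The retry idea mentioned above (re-send the packet a couple of times before giving up) can be sketched as follows. This is an illustration of the general technique for a UDP request/reply exchange, not pyzor's client code; the function name, the 5-second default timeout and the 8192-byte receive buffer are all assumptions.

```python
import socket


def request_with_retries(address, payload, timeout=5.0, retries=2):
    """Send one UDP datagram and wait for a reply, re-sending on timeout.

    Hypothetical helper: 'retries' is the number of extra attempts after
    the first, so retries=2 means at most three datagrams are sent.
    """
    attempts = retries + 1
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        for _ in range(attempts):
            sock.sendto(payload, address)
            try:
                reply, _peer = sock.recvfrom(8192)
                return reply
            except socket.timeout:
                continue  # reply lost or late: try again
    raise TimeoutError(f"no reply from {address} after {attempts} attempts")
```

Re-sending only helps when packets are dropped occasionally; a firewall that always blocks the return path will still end in a timeout, just later.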

Pyzor and SpamAssassin and error 256

From: Kyle W. <ky...@me...> - 2003-06-03 17:16:08

Hello,

Like Jonathan Michael Hawkins and Jeff Jackson, I'm having problems with
Pyzor and SpamAssassin. Namely, on some messages I get the following
error message:

Pyzor -> report failed: Received error code 256 at
/usr/lib/perl5/site_perl/5.8.0/Mail/SpamAssassin/Reporter.pm line 327.

Sometimes this is simply because Pyzor timed out (which happens with
alarming frequency, and doesn't seem to be handled very well). The Pyzor
protocol seems based around UDP, which makes it have occasional problems
with the iptables firewall that I go through (the UDP packets get denied on
the return trip from the Pyzor server sometimes).

Other times I get this error much more consistently for a specific
message. It will work fine if I pipe the mail directly to Pyzor, but not
when through SpamAssassin. I had SpamAssassin save off the stderr of
Pyzor when this happens. Anyone have any ideas as to why it might be
happening? It *looks* like the spam message is somehow shorter than
Pyzor is expecting...

Here's the stderr:

Traceback (most recent call last):
  File "/usr/bin/pyzor", line 4, in ?
    pyzor.client.run()
  File "/usr/lib/python2.2/site-packages/pyzor/client.py", line 934, in run
    ExecCall().run()
  File "/usr/lib/python2.2/site-packages/pyzor/client.py", line 188, in run
    if not apply(dispatch, (self, args)):
  File "/usr/lib/python2.2/site-packages/pyzor/client.py", line 282, in report
    for digest in FileDigester(sys.stdin, self.digest_spec, do_mbox):
  File "/usr/lib/python2.2/site-packages/pyzor/client.py", line 615, in __init__
    self.digester = iter(get_file_digester(fp, spec, mbox))
  File "/usr/lib/python2.2/site-packages/pyzor/client.py", line 632, in get_file_digester
    return (DataDigester(rfc822BodyCleaner(fp),
  File "/usr/lib/python2.2/site-packages/pyzor/client.py", line 679, in __init__
    self.curfile = self.__class__(self.multifile)
  File "/usr/lib/python2.2/site-packages/pyzor/client.py", line 671, in __init__
    mimetools.decode(msg.fp, self.curfile, encoding)
  File "//usr/lib/python2.2/mimetools.py", line 137, in decode
    return base64.decode(input, output)
  File "//usr/lib/python2.2/base64.py", line 29, in decode
    line = input.readline()
  File "//usr/lib/python2.2/multifile.py", line 80, in readline
    raise Error, 'sudden EOF in MultiFile.readline()'
multifile.Error: sudden EOF in MultiFile.readline()

-- 
Well, I've wrestled with reality for over thirty five years, doctor, and I'm
happy to say I've finally won out over it.
-- Jimmy Stewart, in "Harvey"
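
The traceback in this message ends inside a base64 decode that ran out of input, which fits the guess that the message arrived shorter than expected. One way to degrade gracefully in that situation is to decode whatever complete base64 groups are present and ignore the ragged tail. The sketch below shows that idea in modern Python; decode_base64_lenient is a hypothetical name, and returning an empty result on undecodable input is an assumption, not what pyzor does.

```python
import base64
import binascii


def decode_base64_lenient(data: bytes) -> bytes:
    """Decode as much of a possibly truncated base64 body as possible."""
    compact = b"".join(data.split())             # drop line breaks and spaces
    usable = len(compact) - (len(compact) % 4)   # keep whole 4-character groups
    try:
        return base64.b64decode(compact[:usable])
    except binascii.Error:
        # Assumption: a body that still cannot be decoded yields no bytes
        # instead of an exception.
        return b""
```

A truncated body decoded this way will not hash the same as the complete message, so it avoids the crash but does not make the report match other copies of the spam.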

Re: email being (possibly) incorrectly tagged. Why?

From: Frank T. <ft...@ne...> - 2003-05-31 07:05:52

Jason Sjobeck, on 2003-05-30, wrote:

> The body of the message was empty & the subject said "test 678". 

Pyzor only looks at the body of a message, and given an empty message, will 
actually barf at the moment.

> Since this message could not have possibly ever been seen by any one
> before, how could it be tagged?

An empty message is not unique.  Could you send the output of
  pyzor -d check < message

-- 
Frank Tobin			http://www.neverending.org/~ftobin/
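
To see why an empty message is not unique when only the body is digested, consider the simplified illustration below. It is not pyzor's digest algorithm: the function name body_digest, the choice of SHA-1 and the whitespace normalisation are assumptions picked only to show the effect. Two messages with different headers and subjects but empty bodies produce the same digest, so a report of one matches the other.

```python
import hashlib
from email import message_from_string


def body_digest(raw_message: str) -> str:
    """Hash only the body of a single-part message, ignoring all headers."""
    body = message_from_string(raw_message).get_payload()
    if not isinstance(body, str):
        raise ValueError("multipart messages are outside this sketch")
    # Assumed normalisation: drop all whitespace, so a body made only of
    # blank lines hashes like a truly empty one.
    normalized = "".join(body.split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


first = "Subject: test 678\n\n"
second = "Subject: something else entirely\nX-Spam-Status: Yes\n\n\n"
print(body_digest(first) == body_digest(second))  # True
```

The same reasoning explains why header rewriting by other filters does not change a body-only digest.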

email being (possibly) incorrectly tagged. Why?

From: Jason S. <ja...@sj...> - 2003-05-30 23:21:22

Dear List,

I am using Pyzor on our company's mail gateway server. We are running
postfix, amavisd, SpamAssassin, Razor, and Pyzor. We like all of those
softwares a lot.

I was testing something completely unrelated this afternoon when I sent
a test email to myself from my ISP's email server to my internal
corporate address, examined the headers, and discovered that Pyzor had
tagged my message.

The body of the message was empty & the subject said "test 678".

Since this message was from my account at my ISP to my corp' account, I
can not figure out how any one out in the public internet could have
tagged this message to get Pyzor to tag it. Or am I misunderstanding
something? I thought Pyzor was a completely human activated tagging
system, meaning that someone would have had to have seen the message in
question and reported it as spam. Since this message could not have
possibly ever been seen by any one before, how could it be tagged?

Any tips or advice is most appreciated.

Thanks.

Jason
ICQ : 127795461

A few errors to deal with

From: <lis...@nu...> - 2003-05-29 14:58:30

I've seeded a very large number of spamtraps to a number of spamming 
outfits' "remove" pages.  That was lots of fun. :-)  I'm of course already 
getting spam for those spamtraps.  I noticed some errors in my procmail 
log and I thought I'd ask about them.

From uda...@ms...  Thu May 29 02:01:04 2003
 Subject: ***SPAM*** Thin Summer In 2003...begin Now
  Folder: /var/mail/spool/piehole                                          5359
Pyzor -> report failed: Received error code 256 at
/usr/local/lib/perl5/site_perl/5.8.0/Mail/SpamAssassin/Reporter.pm line 306.


From bou...@bo...  Thu May 29 03:55:31 2003
 Subject: R, Save up to 80 percent on inkjets & no cost shipping
  Folder: /var/mail/spool/piehole                                          4648
Traceback (most recent call last):
  File "/usr/bin/pyzor", line 4, in ?
    pyzor.client.run()
  File "/usr/lib/python2.2/site-packages/pyzor/client.py", line 934, in run
    ExecCall().run()
  File "/usr/lib/python2.2/site-packages/pyzor/client.py", line 188, in run
    if not apply(dispatch, (self, args)):
  File "/usr/lib/python2.2/site-packages/pyzor/client.py", line 282, in report
    for digest in FileDigester(sys.stdin, self.digest_spec, do_mbox):
  File "/usr/lib/python2.2/site-packages/pyzor/client.py", line 615, in __init__
    self.digester = iter(get_file_digester(fp, spec, mbox))
  File "/usr/lib/python2.2/site-packages/pyzor/client.py", line 632, in get_file_digester
    return (DataDigester(rfc822BodyCleaner(fp),
  File "/usr/lib/python2.2/site-packages/pyzor/client.py", line 679, in __init__
    self.curfile = self.__class__(self.multifile)
  File "/usr/lib/python2.2/site-packages/pyzor/client.py", line 671, in __init__
    mimetools.decode(msg.fp, self.curfile, encoding)
  File "//usr/lib/python2.2/mimetools.py", line 149, in decode
    raise ValueError, \
ValueError: unknown Content-Transfer-Encoding: binary
Pyzor -> report failed: Received error code 256 at
/usr/local/lib/perl5/site_perl/5.8.0/Mail/SpamAssassin/Reporter.pm line 306.


I only received those two errors out of about 20 pieces of spam.  I'm 
really not sure what to make of either of them.  Does anyone know what 
they mean?

Justin
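
One possible reading of "error code 256", offered as an assumption and not something confirmed in this thread: Perl reports a child's status in $? as a raw wait status, with the exit code in the high byte. If SpamAssassin prints that raw value, 256 would correspond to pyzor exiting with status 1, which is what a Python script does after an uncaught exception such as the traceback shown. The hypothetical helper below only illustrates the arithmetic.

```python
def split_wait_status(status: int):
    """Split a 16-bit POSIX-style wait status into (exit_code, signal)."""
    exit_code = (status >> 8) & 0xFF   # high byte: the program's exit code
    signal_number = status & 0x7F      # low 7 bits: terminating signal, if any
    return exit_code, signal_number


print(split_wait_status(256))  # (1, 0)
```

Under that reading the number itself says little more than "pyzor failed"; the useful detail is in pyzor's stderr.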

Re: Stripping out SpamAssassin's additions

From: <lis...@nu...> - 2003-05-28 19:42:42

On Wed, 28 May 2003, Frank Tobin wrote:

> lis...@nu..., on 2003-05-28, wrote:
> 
> > Do I need to strip out the extra header lines and Subject changes before
> > reporting spam via Pyzor?
> 
> No; pyzor does not look at headers.

Great.  That's what I needed to know.  Thanks

Justin

Re: Stripping out SpamAssassin's additions

From: Frank T. <ft...@ne...> - 2003-05-28 17:39:45

lis...@nu..., on 2003-05-28, wrote:

> Do I need to strip out the extra header lines and Subject changes before
> reporting spam via Pyzor?

No; pyzor does not look at headers.

-- 
Frank Tobin			http://www.neverending.org/~ftobin/

Stripping out SpamAssassin's additions

From: <lis...@nu...> - 2003-05-28 15:58:00

Do I need to strip out the extra header lines and Subject changes before 
reporting spam via Pyzor?  I'm setting up around 5000 spamtraps that 
are aliases for a single account.  I'm using procmail on that account to 
auto-report via Pyzor.  I'm stripping out SA's header lines with 
spamassassin -r, ReSent and X-Scanned-By lines with grep (although formail 
would have been just as easy), and using sed to undo my changes to the 
Subject line when MIMEDefang scores mail >= 10.  Do I actually need to do 
all that?  I'm under the impression that Razor and Pyzor only take a hash 
of the message body, which I'm not altering.  Clarification would be 
welcomed.  Thanks!

Justin

Re: Pyzor error when reporting

From: Frank T. <ft...@ne...> - 2003-05-28 05:21:45

Jackson, Jeff, on 2003-05-22, wrote:

> In a message from March, Jonathan Michael Hawkins had the same problem
> I'm having in the thread "spamassassin -r < message.txt doesn't work".
> I'm running Red Hat Linux 8.0 with SpamAssassin 2.55 and Pyzor 0.4. When
> I try to submit spam using spamassassin -r < spam.txt, everything is
> reported to Razor and DCC fine, but Pyzor gives me the following error:

Every time?  It's hard for me to debug without any pyzor stderr output
from spam assassin; can that be retrieved?

-- 
Frank Tobin			http://www.neverending.org/~ftobin/
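
When a helper program fails only under a wrapper, capturing its stderr separately from its stdout is usually the quickest way to get the traceback that is asked for here. The sketch below shows the general technique with Python's subprocess module; it is not SpamAssassin's or pyzor's code, run_and_capture is a hypothetical name, and the child command is a stand-in that simply crashes.

```python
import subprocess
import sys


def run_and_capture(argv, stdin_data=b""):
    """Run a helper program; return (exit_code, stdout_text, stderr_text)."""
    done = subprocess.run(
        argv,
        input=stdin_data,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=False,  # a non-zero exit is data here, not an exception
    )
    return (
        done.returncode,
        done.stdout.decode("utf-8", errors="replace"),
        done.stderr.decode("utf-8", errors="replace"),
    )


if __name__ == "__main__":
    # Stand-in child that fails the way a crashing helper would.
    child = [sys.executable, "-c", "raise ValueError('boom')"]
    code, out, err = run_and_capture(child)
    print(code)
    print(err.strip().splitlines()[-1])
```

In practice the argv list would be the real command, for example ["pyzor", "report"], with the saved message passed as stdin_data.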

Pyzor error when reporting

From: Jackson, J. <jef...@rb...> - 2003-05-22 14:36:28

Hi,

Another new Pyzor user here, hoping for a little help...

In a message from March, Jonathan Michael Hawkins had the same problem
I'm having in the thread "spamassassin -r < message.txt doesn't work".
I'm running Red Hat Linux 8.0 with SpamAssassin 2.55 and Pyzor 0.4. When
I try to submit spam using spamassassin -r < spam.txt, everything is
reported to Razor and DCC fine, but Pyzor gives me the following error:

debug: executable for pyzor was found at /usr/bin/pyzor
debug: Pyzor is available: /usr/bin/pyzor
debug: entering helper-app run mode
debug: leaving helper-app run mode
Pyzor -> report failed: Received error code 256 at
/usr/lib/perl5/site_perl/5.8.0/Mail/SpamAssassin/Reporter.pm line 326.

When I use pyzor itself to submit, it works fine:

[filter@prickle filter]$ pyzor report < spam.txt
66.47.67.162:24441    (200, 'OK')

I couldn't find a resolution to the problem in the list archives, and
the previous message thread ends without one...

If anybody can shed any light on the problem, it would be greatly
appreciated.

Jeff Jackson
R.B. Zack & Associates, Inc.
www.rbza.com
General Contractors for the Virtual World,
  Building Business Applications that Work Since 1981.
jef...@rb...
(310) 833-0211 x180

QUIDQUID LATINE DICTUM SIT, PROFUNDUM VIDITUR
(Whatever is said in Latin appears profound)

The information in this e-mail is confidential and may be legally
privileged. It is intended solely for the addressee. Access to this
e-mail by anyone else is unauthorized. If you are not the intended
recipient, any disclosure, copying, distribution or any action taken or
omitted to be taken in reliance on it, is prohibited and may be
unlawful.

Re: setup.py build error

From: Frank T. <ft...@ne...> - 2003-05-19 20:21:27

Leon Roy, on 2003-05-19, wrote:

> NameError: name 'object' is not defined
> 
> I've downloaded the latest 0.4.0 release of pyzor, and tried the
> previous 0.3.1 but both give the same error. I have been able to
> successfully install before on Mandrake, (using debian at the moment).

You need Python 2.2.1 or later.

-- 
Frank Tobin			http://www.neverending.org/~ftobin/
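
The NameError arises because the built-in name object, used as a base class for new-style classes, only exists from Python 2.2 onward, and the reply above gives 2.2.1 as the minimum for pyzor. A script can fail with a clearer message by checking the interpreter version first. The sketch below is a generic illustration; the helper name and where it would be called from (for example the top of setup.py) are assumptions.

```python
import sys

MINIMUM_PYTHON = (2, 2, 1)  # minimum stated in the reply above


def python_is_new_enough(version_info=sys.version_info, minimum=MINIMUM_PYTHON):
    """Return True if the (major, minor, micro) version meets the minimum."""
    return tuple(version_info[:3]) >= minimum


if not python_is_new_enough():
    sys.exit("This program needs Python %d.%d.%d or later" % MINIMUM_PYTHON)
```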

setup.py build error

From: Leon R. <dog...@vi...> - 2003-05-19 16:32:15

Hello,
	when I run: python setup.py build,  I get the following error: 

Traceback (most recent call last):
  File "setup.py", line 5, in ?
    import pyzor
  File "lib/pyzor/__init__.py", line 57, in ?
    class Singleton(object):
NameError: name 'object' is not defined

I've downloaded the latest 0.4.0 release of pyzor, and tried the
previous 0.3.1 but both give the same error. I have been able to
successfully install before on Mandrake, (using debian at the moment).

Any help appreciated,
-Leon.

Re: debug: Pyzor: couldn't grok response "downloading servers from http://pyzor.sourceforge.net/cgi-bin/inform-servers-0-3-x"

From: Frank T. <ft...@ne...> - 2003-05-19 06:47:41

pos...@sj..., on 2003-05-18, wrote:

> debug: Pyzor: couldn't grok response "downloading servers from
> http://pyzor.sourceforge.net/cgi-bin/inform-servers-0-3-x"
> ------------------------------------------------------------
> 
> Have I done something wrong?

No.  SA reads pyzor's stderr (which it shouldn't), and the first time you 
run pyzor, it will spit out a message about downloading the server list.

> One note, when I finished installing the package and ran "pyzor
> discover" at the console it said "no command interpreter" & then I found
> out it was looking for a program named "python2" while my machine only
> had "python", so I simply did a "cp python python2" then it seemed to
> run fine. Is this a mistake? IS this relevant?

Interesting; if you install pyzor from the tarball it should point itself
to the correct python executable.  If you were using someone else's
packaging of pyzor, there could be problems.  Your workaround is fine, but 
I'd recommend symlinking python2 to python, instead of copying.

-- 
Frank Tobin			http://www.neverending.org/~ftobin/
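
A caller that has to cope with informational lines mixed into a helper's output can accept only the lines that look like a response and skip the rest. The sketch below assumes the response shape seen elsewhere in these threads, a host:port pair followed by a (code, 'message') tuple; real pyzor output may carry further fields, which this pattern ignores. The names RESPONSE_LINE and parse_responses are hypothetical.

```python
import re

# Matches lines such as:  66.47.67.162:24441    (200, 'OK')
RESPONSE_LINE = re.compile(
    r"^(?P<host>[^\s:]+):(?P<port>\d+)\s+\((?P<code>\d+),\s*'(?P<text>[^']*)'\)"
)


def parse_responses(output: str):
    """Return (host, port, code, text) tuples, skipping unrelated lines."""
    results = []
    for line in output.splitlines():
        match = RESPONSE_LINE.match(line.strip())
        if match is None:
            continue  # e.g. the one-time "downloading servers from ..." notice
        results.append(
            (match["host"], int(match["port"]), int(match["code"]), match["text"])
        )
    return results
```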

debug: Pyzor: couldn't grok response "downloading servers from http://pyzor.sourceforge.net/cgi-bin/inform-servers-0-3-x"

From: <pos...@sj...> - 2003-05-18 08:45:56

Attachments: error.txt

Dear List,

Thanks for letting me post here. I am very much looking forward to using what
sounds like an incredibly cool piece of software.

We are running the following softwares:
Mandrake Linux 9.0
Postfix 1.11.1
Amavisd-new
SpamAssassin
Razor
Pyzor
clamAV & AntiVir

I just got Pyzor installed and restarted "amavisd" & SpamAssassin in
debug mode so I can see what SpamAssassin is doing when it calls Pyzor.

I am getting the following error:

------------------------------------------------------------
debug: Pyzor is available: /usr/bin/pyzor
debug: entering helper-app run mode
debug: Pyzor: got response: downloading servers from
http://pyzor.sourceforge.net/cgi-bin/inform-servers-0-3-x
debug: leaving helper-app run mode
debug: Pyzor: couldn't grok response "downloading servers from
http://pyzor.sourceforge.net/cgi-bin/inform-servers-0-3-x"
------------------------------------------------------------

Have I done something wrong?

One note, when I finished installing the package and ran "pyzor
discover" at the console it said "no command interpreter" & then I found
out it was looking for a program named "python2" while my machine only
had "python", so I simply did a "cp python python2" then it seemed to
run fine. Is this a mistake? IS this relevant?

Back to the original error message, though, has anyone been here before
and lived to tell the tale? If so, I would be most appreciative to hear
how you resolved the issue.

Thanks very much.

Jason

Post Script: if you could reply directly as well as to the list, that
would be great. Thanks again.

RE: Digesting and Spam-matching

From: Roman S. <rn...@on...> - 2003-05-04 20:07:20

On Fri, 2 May 2003, Keith Jackson wrote:

I am not sure Bayes-formula-based filters are any good
at catching server-wide spam. Last month the volume of spam
increased again and now I see more and more spams 
which use randomization. They put random strings 
right into the text! Humans can distinguish "random"
pieces easily while computers can't. 

Personally I am using SpamOracle and it chokes on 
<!--dfsdfsdf--> based technique.

So, it seems, spammers will be major inspiration for AI people ;-)

In my view, any mail/spam left after pyzor is to be read
by some AI system. Bayes is only first step to that.

And also I do not think pyzor needs to be more adaptive.
It's a simple and bullet-proof tool. And not a high-profile one - 
so spammers are not taking it into account yet ;-)


Sincerely yours, Roman Suzi
-- 
rn...@on... =\= My AI powered by GNU/Linux RedHat 7.3
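
The random-string trick described in this message, hiding filler inside HTML comments such as <!--dfsdfsdf-->, can be blunted by removing comments before the text is tokenised or hashed. The sketch below shows only that one normalisation step; it is a generic illustration, not code from pyzor or SpamOracle, and the names are hypothetical.

```python
import re

# Non-greedy match so each comment is removed separately; DOTALL lets a
# comment span several lines.
HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)


def strip_html_comments(text: str) -> str:
    """Remove HTML comments, which readers never see but filters do."""
    return HTML_COMMENT.sub("", text)


print(strip_html_comments("special <!--dfsdfsdf-->offer"))  # special offer
```

Random strings placed in visible text are a different problem and are not affected by this step.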

RE: Digesting and Spam-matching

From: Keith J. <kja...@cr...> - 2003-05-02 15:17:56

> If 32 bits of search space is sufficient, and the MD5 step isn't providing
> any significant privacy, why bother with it at all?
> Seems like you could reach a similar result just shipping the software with
> a pre-compiled dictionary matching common words to your 4-byte references,
> and pass chains of those around, instead.

True. But either way, this is a small detail. Send each actual token, or
send a hash of each token. I guess it's a warm fuzzy feeling. By my own
arguments then, it's just bloat, and the actual token should be sent. Either
way. I guess I'd do MD5 because it's simple to implement, and would make it
at least mildly difficult for an 'evil server'. As I said before, spam and
security are separate issues. If you want security, use PGP or some other
package designed specifically for it.

> > Secondly, see above comment about not storing duplicates. Storing all
> > email everyone gets is just silly.
>
> Ok, but aren't you degrading your ability to detect spam, then? I thought we
> were designing for the case where nobody sends identical spams (see your
> critique of Pyzor's hash strategy) but instead sends individualized spams
> with minor differences. So we've either got to store all of them, or hope
> that the first spam that we get is similar enough to all of the others to
> avoid detection.

My thoughts were the 'first' spam contains something like:

Keith, you've won!!!

3 tokens: [Keith,], [you've], [won!!!]

Now you get a second spam mail, and it has [Greg,] instead of [Keith,]. Our
'theoretical' matching software will match it with the spam message stored
on the server... and the [Keith,] token will have its 'variable' attribute
incremented by one.

So now the next match will be stronger. It will see that [Keith,], [you've],
[won!!!] matches better with [George,], [you've], [won!!!] because it sees
that that first token is variable. I don't want to store each unique spam, I
want to store the first unique spam, which will then be modified via voting
and whatnot to be more like a spam template, thus the point of doing this on
the server and harnessing the power of distributed computing, vs. a client
trying to figure this stuff out.
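
The template idea described above can be sketched in a few lines. This is a toy model of the proposal, not an existing implementation: the class name, the rule that a position counts as matching once it has varied at least once, and the restriction to messages with the same number of tokens are all assumptions made to keep the example small.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SpamTemplate:
    tokens: List[str]
    # variable[i] counts how often position i differed in a matched message.
    variable: List[int] = field(default_factory=list)

    def __post_init__(self):
        if not self.variable:
            self.variable = [0] * len(self.tokens)

    def score(self, candidate: List[str]) -> float:
        """Fraction of positions that match exactly or are known to vary."""
        if not self.tokens or len(candidate) != len(self.tokens):
            return 0.0
        hits = sum(
            1
            for known, seen, count in zip(self.tokens, candidate, self.variable)
            if known == seen or count > 0
        )
        return hits / len(self.tokens)

    def learn(self, candidate: List[str]) -> None:
        """Record which positions differed in a message judged to match."""
        if len(candidate) != len(self.tokens):
            return  # this sketch only compares same-length messages
        for i, (known, seen) in enumerate(zip(self.tokens, candidate)):
            if known != seen:
                self.variable[i] += 1


template = SpamTemplate("Keith, you've won!!!".split())
greg = "Greg, you've won!!!".split()
george = "George, you've won!!!".split()

before = template.score(george)   # first token still counts as a mismatch
template.learn(greg)              # the [Keith,] position is now marked variable
after = template.score(george)
print(before, after)
```

Real spam varies in length as well as in individual tokens, so a practical matcher would need some form of alignment, which this sketch leaves out.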

Another thought that comes to mind just writing this email, to be even more
optimized, I can hash the whole email on the client side (ignoring headers),
and come up with one hash. Send it to the server, if that hash matches, then
we don't even need to do this more advanced stuff. If it doesn't, then we
can go into this more detailed matching.

> > Again, I'm no expert, but the algorithm being hard shouldn't be a reason
> > for not doing it. I'm sure as I'm sitting here there are PhD's all over
> > that have papers on the net about good algorithms of this type.
>
> I think this is what I find hard to swallow about your suggestion - that
> we just find some smart people from somewhere else to fix the storage
> problem and the spam-matching problem, and then say that the spam problem
> has been solved.
>
> Isn't that like saying "I know how to stop SARS. We'll make some pills
> that everyone will take. That might be expensive, but we'll get some
> smart manufacturing guys to help make it cheap. And we don't have an
> anti-viral drug that works against SARS yet, but there are a bunch of
> smart pharmacology guys who work on that sort of thing all day, they'll
> get that figured out pretty soon. OK, problem solved. Next!"

Granted. But your statement 'And all of this assumes that you have a good
algorithm for deciding whether two hash chains are similar enough to be the
same, or not', makes it sound as though this is a radical new algorithm that
no one has thought of, or that they don't have any good implementations of.
And it's not. I personally could think of a few approaches myself.

> It doesn't really seem reasonable to me to compare hypothetical software
> to actual software - the actual software always has bugs & limitations,
> while the hypothetical software never seems to have any, because it can
> be modified much more quickly than actual software, and is always
> fully debugged & optimized.

You are very right, and I've never said this is better, I'd be a fool to say
so until I see it run. I've said it could be better, and I'm throwing around
ideas, I could be wrong, like you said, until it's fully debugged &
optimized. But the whole point of this is my 'prove me wrong' mentality.
Because if you can prove to me that this won't work, then I won't have to
waste my time trying to implement it.

Keith

Re: Digesting and Spam-matching

From: Greg B. <gbr...@bi...> - 2003-05-02 14:28:42

On Fri, May 02, 2003 at 09:03:11AM -0400, Keith Jackson wrote:
> 
> 5232 bytes = 327 hashes * 128 bits
> THAT, you are correct, is impractical, and not optimized or compressed even
> remotely.
> 
> Just as my thinking aloud in these emails, you could store them as
> references to a dictionary. The reference is going to be less than 128 bits.
> An index number, let's say 4 bytes... an unsigned long. The dictionary won't
> grow forever, it will change based on the messages currently on the server.
> So..
> 
> 1308 bytes = 327 hashes * 4 byte refs
> 
> That's 25% storage of what you are talking about. A 75% gain. And I'm not
> even a wiz at this stuff.

If 32 bits of search space is sufficient, and the MD5 step isn't providing
any significant privacy, why bother with it at all? 

Seems like you could reach a similar result just shipping the software with
a pre-compiled dictionary matching common words to your 4-byte references,
and pass chains of those around, instead. 

> Secondly, see above comment about not storing duplicates. Storing all email
> everyone gets is just silly.

Ok, but aren't you degrading your ability to detect spam, then? I thought we
were designing for the case where nobody sends identical spams (see your
critique of Pyzor's hash strategy) but instead sends individualized spams
with minor differences. So we've either got to store all of them, or hope
that the first spam that we get is similar enough to all of the others to
avoid detection. 

> Again, I'm no expert, but the algorithm being hard shouldn't be a reason for
> not doing it. I'm sure as I'm sitting here there are PhD's all over that
> have papers on the net about good algorithms of this type.

I think this is what I find hard to swallow about your suggestion - that
we just find some smart people from somewhere else to fix the storage 
problem and the spam-matching problem, and then say that the spam problem
has been solved. 

Isn't that like saying "I know how to stop SARS. We'll make some pills 
that everyone will take. That might be expensive, but we'll get some
smart manufacturing guys to help make it cheap. And we don't have an
anti-viral drug that works against SARS yet, but there are a bunch of
smart pharmacology guys who work on that sort of thing all day, they'll
get that figured out pretty soon. OK, problem solved. Next!"

It doesn't really seem reasonable to me to compare hypothetical software
to actual software - the actual software always has bugs & limitations,
while the hypothetical software never seems to have any, because it can
be modified much more quickly than actual software, and is always
fully debugged & optimized. 

--
Greg Broiles
gbr...@pa...

RE: Digesting and Spam-matching

From: Keith J. <kja...@cr...> - 2003-05-02 13:03:57

> If you're only interested in perfect solutions, you're going to be reading
> a lot of spam over the next few years.

No, not perfect. But not easily defeatable, either.

> I think you're optimizing for a pretty narrow class of client - someone
> with a fast, unmetered net connection (so it's not burdensome to have to
> download all of the spam messages and upload the hash chains)

I don't think broadband is that narrow of a class, and becoming less narrow
every day. I see this with a lot of Open Source projects. They design
everything they write to work EVERYWHERE, and sometimes, to make progress,
you just have to say... 'Hey, sorry man, get a better computer, get a faster
connection'. Look at Microsoft. You think they are designing windows for
dialup users? I don't think it'll be too many more revisions before dialup
is some obscure extra add-in you gotta install separately. Hell with Win95
users, hell with dialup users, hell with people with 286 computers. And you
can quote me on that. If we don't say that, we'll never GET anywhere.

> and a relatively fast
> PC (to calculate a few hundred MD5's per message), with
> access to a pretty nice server (to store a few hundred MD5's per message
> per customer). The server becomes more useful as it's got more customers
> (someone's gotta read the spam the first time, and say "hey! that's a
> spam!", unless you're going to depend on spam-trap addresses), so it's
> not so interesting to say "well, four of my friends and I will share a
> little server on an old PC I've got in my garage".

Calculating one MD5 per word of an email is, sorry to say, not that
intensive at all, given today's computers. Your basic math of customer *
messages doesn't work well, either. I get 500 mails a day let's say. I have
to upload a hash chain for each. The server will first.... match it with
known spam. If 450 of my 500 mails are spam.... then those 450 hash chains
are already on the server. I wouldn't store them twice. So that leaves 50
hash chains the server has to store. If the server doesn't see any matches
on those 50 non-spams within let's say a day, it can throw them away. So,
the actual math to describe server space is not customer * messages, it's
number of unique spams sent across the world (within the past two months)
plus the number of customers times the number of legitimate emails they get.
The actual unique spams would be stored for like 2 months, and the
non-matches would be stored a day.

We'd end up with a server with a relatively large database of spam, but I
have a feeling... I may be wrong, that the 'volatile' legitimate email will
be larger. But it doesn't need to have critical storage, or to be kept
around longer than a day.

Now, on top of that, you don't store messages on a server as an md5 hash
chain. When you first parse the message, you add all the hashes to a server
dictionary, not re-adding ones that have been added before. Then, the actual
message is stored as a list of references to the hashes.
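
The storage scheme in the paragraph above (hash each token once, keep the hashes in a shared dictionary, and store each message as a list of references) can be sketched as follows. It is a toy in-memory model of the proposal under stated assumptions: tokens are whitespace-separated words, references are plain list indices, and the class name TokenDictionary is hypothetical.

```python
import hashlib
from typing import Dict, List


class TokenDictionary:
    """Maps each distinct token hash to a small integer reference."""

    def __init__(self) -> None:
        self._index: Dict[bytes, int] = {}

    def intern(self, token: str) -> int:
        digest = hashlib.md5(token.encode("utf-8")).digest()  # 16 bytes
        # setdefault assigns the next free index only on first sight.
        return self._index.setdefault(digest, len(self._index))

    def encode(self, message: str) -> List[int]:
        """Store a message as references instead of full hashes."""
        return [self.intern(token) for token in message.split()]

    def __len__(self) -> int:
        return len(self._index)


shared = TokenDictionary()
print(shared.encode("the cat and the hat"))  # [0, 1, 2, 0, 3]
print(len(shared))                           # 4
```

Repeated words such as "the" cost one dictionary entry in total, however many messages contain them, which is where the saving described in this thread comes from.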

I don't see this as unrealistic or taking millions of dollars of hardware to
accomplish.

> Now, if we think about other people who care about spam - like, say,
> people with dialup access or metered access who don't want to download
> spams only to discard them on the client PC - or ISP's who don't want
> their spool disks filled with spams waiting to be delivered then
> discarded - then there's a pretty significant processing load placed on
> the receiving mailserver, if they're the ones that have to calculate the
> hashes every time a message is received.

Buy broadband and a better PC I say to you. Sorry. See above comment about
Open Source mentality.

> > I'd give a few more bytes for better spam protection.
>
> Have you done the math on this? I don't think we're talking about "a few
> more bytes", I think we're talking about a fair amount of data, if you're
> planning to store an ordered list of 128-bit values for every message
> received over the last 60 days for a few thousand, or tens of thousands,
> of people.

I have done the math for it. And I explained it above. As I said in the
original message, with some creative optimizations. I'm not even a great
optimization guy, and I've come up with a better scheme than storing ordered
lists of 128 bit hashes per message. Have YOU done the math on this? ;) How
many times in my email did I repeat words... like 'a'? or 'the'? etc...

> Your initial message in this thread, stripped of its headers, and counted
> by "wc", had 327 words. Assuming I have a nice way to store the hashes
> that doesn't incur any overhead, that's 5232 bytes of data to remember
> your message. If I get 300 emails per day like yours (probably not too
> far off the mark), and I want to store 60 days' worth of them, that's
> almost 92 megs of data .. for one person's email. Now, I'm probably
> worse than the average, but that's still a few orders of
> magnitude worse than seems practical.

5232 bytes = 327 hashes * 16 bytes (128 bits)
THAT, you are correct, is impractical, and not even remotely optimized or
compressed.

Just thinking aloud, as I have been in these emails: you could store them as
references to a dictionary. The reference is going to be less than 128 bits.
An index number, let's say 4 bytes... an unsigned long. The dictionary won't
grow forever; it will change based on the messages currently on the server.
So..

1308 bytes = 327 hashes * 4 byte refs

That's 25% of the storage you are talking about, a 75% saving. And I'm not
even a whiz at this stuff.
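
The per-message comparison can be checked with a few lines. This is only the arithmetic from the figures in this thread; the 4-byte reference size is the assumption stated above, and the size of the shared dictionary itself is not counted here.

```python
WORDS = 327      # word count quoted in this thread for the original message
MD5_BYTES = 16   # one 128-bit MD5 digest
REF_BYTES = 4    # assumed size of a dictionary index (an unsigned 32-bit value)

as_hashes = WORDS * MD5_BYTES
as_refs = WORDS * REF_BYTES

print(as_hashes, "bytes as raw hashes")
print(as_refs, "bytes as references")
print(f"{as_refs / as_hashes:.0%} of the original size")
```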

Secondly, see above comment about not storing duplicates. Storing all email
everyone gets is just silly.

> (The storage requirements would be reduced quite a bit if you didn't want
> to keep the order of the hashes, but that would also reduce the
> ability of the program to differentiate between messages .. or if we
> didn't care about privacy, and stored the message itself - the same data
> from your message, whose hashes totalled 5232 bytes of data, was only
> 1993 bytes as text.)

Well, the order of hashes is obviously important in this scheme. But I'm sure
there are many other schemes as well.

> And all of this assumes that you have a good algorithm for deciding
> whether two hash chains are similar enough to be the same, or not;

Again, I'm no expert, but the algorithm being hard shouldn't be a reason for
not doing it. I'm sure that as I'm sitting here there are PhDs all over who
have papers on the net about good algorithms of this type.

> I think you're making the problem a lot harder by trying to operate on
> hashes, rather than text, because it's easier to write code that
> ignores garbage strings than code that ignores hashes of garbage
> strings.

Well, this is my problem with pyzor (and some others, now that I'm looking
around). It assumes it knows about the content of the message. So, it won't
even take advantage of the power of distributed spam detection if the
spammers can first defeat the simple front end. My contention is that this
is the WRONG way to do it. If you are doing distributed spam detection, but
only doing it in some conditions that spammers can defeat, then you aren't
doing distributed spam detection at all. I'd rather just have a client side
filter.

> Further, the privacy protection provided by the hash isn't very good -
> what's to keep a nosy server operator from running MD5 over the
> contents of a few good dictionaries, and then substituting the known
> hashes for the contents of the messages you disclose? Sure, they'd
> probably end up with some missing words, but messages written in
> known languages would be revealed pretty quickly. (Unless you use
> some salt, so that when I hash a word I get a different result than
> when you hash that word .. but then it's not possible to compare
> my messages received to your messages received, and notice interesting
> parallels.)

This is the reason I think we need an open source distributed spam
protection system. This kind of feedback is good, and should be considered.
Furthermore, the people who run the servers are out in the open, so to
speak, and not behind a corporate veil. If the user is worried about privacy,
however, my opinion is and always will be: don't use this, or use PGP on
your messages. I'm talking about designing spam protection; I'd be bloating
my software if I worried too much about security. That should be handled at a
different level.

So, I still believe my idea is possible and not impractical.

RE: Digesting and Spam-matching

From: Frank T. <ft...@ne...> - 2003-05-02 05:00:14

Keith Jackson, on 2003-05-01, wrote:

> So, browsing the source, studying the rules, I can compose a spam mail
> that will easily defeat this system. So then, not to be rude or
> putting-down, this is mostly useless. People started blocking key words
> like 'cock'. So the spammers use 'c0ck'. People start using this system,
> the spammers will read your source and get around it. So, rather than a
> real solution to spam, this sounds more like one step in the cat and
> mouse game.

You're somewhat right.  I fully understand the limitations of pyzor.  So
far, however, the simple solution has worked well, and scaled well.  Once
we need more, it won't be that hard to change pyzor so that what is
digested is done more dynamically.  Eventually, however, I fully realize
that we could get to a point where all pieces of spam are drastically
unique, and can't be compared to each other to any significant degree.

Pyzor was designed to solve a problem as it existed and could be solved at
the time.  If the problem is borgish, and adapts to preventions such as
pyzor and razor, then these solutions will likely fade away, which is
perfectly fine with me.  Against a skilled, determined, and funded
attacker, preventing spam could become a *much* more difficult problem.

> I disagree. While not as compact and quick, it will catch more spam,
> which is the primary goal. Google has the whole world cached. I'm sure
> storing spam emails is not that impractical. Besides, it doesn't have to
> keep it forever. Spams that were sent out two months ago, are not likely
> to be looked up, how many people don't check their email in two months?

First of all, Google has $$$, bandwidth, and many machines.  Storing spam
emails isn't that impractical, really.  I'm pretty sure the Razor servers
store the entire piece of spam, but when checking, only a digest is sent
to be queried.  This nicely allows dynamic rules about hashing, which is a
good idea and requires significantly more work on the part of mass mailer
developers to thwart.

> I just don't think pyzor is going to work for me as it stands now. And
> if I were writing a mass mailer program that they advertise via mass
> mail ;) I'd design it to beat this.

If you or I were so inclined, either of us could probably design a mass
mailer that would defeat any preventive measures, especially if said
measures had source available.  The only tricky ones to defeat might be
Bayesian-like systems, which might be the only long-term winners.

-- 
Frank Tobin			http://www.neverending.org/~ftobin/

Re: Digesting and Spam-matching

From: Greg B. <gbr...@bi...> - 2003-05-01 20:39:46

On Thu, May 01, 2003 at 02:46:43PM -0400, Keith Jackson wrote:
> > As I stated in the previous mail, pyzor doesn't hash the
> > entire mail, just
> > a section of it after removing data that would likely contribute to
> > non-uniqueness, such as whitespace, urls, and email
> > addresses.  The exact
> > rules for removing suspicious elements are described in
> > pyzor.client.DataDigester.
> 
> So, browsing the source, studying the rules, I can compose a spam mail that
> will easily defeat this system. So then, not to be rude or putting-down,
> this is mostly useless. People started blocking key words like 'cock'. So
> the spammers use 'c0ck'. People start using this system, the spammers will
> read your source and get around it. So, rather than a real solution to spam,
> this sounds more like one step in the cat and mouse game.

If you're only interested in perfect solutions, you're going to be reading a 
lot of spam over the next few years.

> So, keeping two months, and some efficient storage of hashes, such as not
> duplicating them across spam mail entries, and using compression for
> server/client communication, I don't think it's really too impractical.

I think you're optimizing for a pretty narrow class of client - someone with
a fast, unmetered net connection (so it's not burdensome to have to download
all of the spam messages and upload the hash chains) and a relatively fast
PC (to calculate a few hundred MD5's per message), with access to a pretty
nice server (to store a few hundred MD5's per message per customer). The 
server becomes more useful as it's got more customers (someone's gotta read
the spam the first time, and say "hey! that's a spam!", unless you're 
going to depend on spam-trap addresses), so it's not so interesting to say
"well, four of my friends and I will share a little server on an old PC
I've got in my garage".

Now, if we think about other people who care about spam - like, say, people
with dialup access or metered access who don't want to download spams only
to discard them on the client PC - or ISP's who don't want their spool 
disks filled with spams waiting to be delivered then discarded - then there's
a pretty significant processing load placed on the receiving mailserver,
if they're the ones that have to calculate the hashes everytime a message
is received. 

> I'd give a few more bytes for better spam protection.

Have you done the math on this? I don't think we're talking about "a few
more bytes", I think we're talking about a fair amount of data, if you're
planning to store an ordered list of 128-bit values for every message
received over the last 60 days for a few thousand, or tens of thousands,
of people.

Your initial message in this thread, stripped of its headers, and counted 
by "wc", had 327 words. Assuming I have a nice way to store the hashes
that doesn't incur any overhead, that's 5232 bytes of data to remember
your message. If I get 300 emails per day like yours (probably not too
far off the mark), and I want to store 60 days' worth of them, that's
almost 92 megs of data .. for one person's email. Now, I'm probably 
worse than the average, but that's still a few orders of magnitude worse
than seems practical.
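
For anyone who wants to redo the estimate above, here is the arithmetic as a short script. All inputs are the figures from this message; only the variable names are made up.

```python
WORDS_PER_MESSAGE = 327   # "wc" count of the original message
BYTES_PER_HASH = 16       # one 128-bit MD5 value per word
MESSAGES_PER_DAY = 300
DAYS_RETAINED = 60

per_message = WORDS_PER_MESSAGE * BYTES_PER_HASH
total = per_message * MESSAGES_PER_DAY * DAYS_RETAINED

print(per_message, "bytes per message")
print(total, "bytes for one person over 60 days")
print(round(total / 10**6, 1), "MB (decimal)")
print(round(total / 2**20, 1), "MiB (binary)")
```

Depending on which definition of "meg" is used, the total comes out at roughly 90 to 94 megabytes, in line with the ballpark figure above.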

(The storage requirements would be reduced quite a bit if you didn't want
to keep the order of the hashes, but that would also reduce the 
ability of the program to differentiate between messages .. or if we
didn't care about privacy, and stored the message itself - the same data
from your message, whose hashes totalled 5232 bytes of data, was only
1993 bytes as text.)

And all of this assumes that you have a good algorithm for deciding
whether two hash chains are similar enough to be the same, or not; 
I think you're making the problem a lot harder by trying to operate on
hashes, rather than text, because it's easier to write code that 
ignores garbage strings than code that ignores hashes of garbage 
strings. 
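
To make the comparison problem concrete, here is one possible way to score two hash chains, using the general-purpose sequence matcher from Python's standard library. This is a sketch under assumptions, not a proposal from this thread: the function names, the lower-casing step, and the 0.8 threshold are all arbitrary choices for illustration.

```python
import hashlib
from difflib import SequenceMatcher


def hash_chain(text):
    """Return the ordered list of per-word MD5 hex digests for a message."""
    return [hashlib.md5(w.encode("utf-8")).hexdigest() for w in text.lower().split()]


def chain_similarity(a, b):
    """Order-aware similarity in [0, 1] between two lists of hashes."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


spam_a = hash_chain("cheap pills delivered to your door today xkq7")
spam_b = hash_chain("cheap pills delivered to your door today zz9w")
ham = hash_chain("minutes from the budget meeting are attached")

THRESHOLD = 0.8  # arbitrary cut-off chosen for this sketch

print(chain_similarity(spam_a, spam_b) >= THRESHOLD)
print(chain_similarity(spam_a, ham) >= THRESHOLD)
```

A single random trailing word barely moves the score, but the matcher cannot tell a garbage word from a meaningful one: every hash looks equally opaque, which is the difficulty described above.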

Further, the privacy protection provided by the hash isn't very good -
what's to keep a nosy server operator from running MD5 over the
contents of a few good dictionaries, and then substituting the known
hashes for the contents of the messages you disclose? Sure, they'd
probably end up with some missing words, but messages written in 
known languages would be revealed pretty quickly. (Unless you use
some salt, so that when I hash a word I get a different result than
when you hash that word .. but then it's not possible to compare
my messages received to your messages received, and notice interesting
parallels.)
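
The dictionary attack described above takes only a few lines. The message and the six-word list here are toy stand-ins for a real message and a real dictionary, and the helper names are hypothetical.

```python
import hashlib


def md5_hex(word):
    return hashlib.md5(word.encode("utf-8")).hexdigest()


# What a client would disclose under the proposed scheme: one hash per word.
disclosed = [md5_hex(w) for w in "meet me at the station tonight".split()]

# The operator hashes a word list once and builds a reverse lookup table.
wordlist = ["at", "me", "meet", "station", "the", "train"]
reverse = {md5_hex(w): w for w in wordlist}

# Words missing from the list show up as "?".
recovered = [reverse.get(h, "?") for h in disclosed]
print(" ".join(recovered))


def salted_md5_hex(word, salt):
    """Per-client salt: defeats the lookup table above."""
    return hashlib.md5((salt + word).encode("utf-8")).hexdigest()


# The same word hashed by two clients no longer matches, so their
# messages can no longer be compared on the server.
same = salted_md5_hex("station", "alice") == salted_md5_hex("station", "bob")
print(same)
```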

> I just don't think pyzor is going to work for me as it stands now. And if I
> were writing a mass mailer program that they advertise via mass mail ;) I'd
> design it to beat this.
> 
> I do wish you much luck with this project though. Anything open source for
> fighting spam is a Good Thing (tm).

Good luck to you, too ..

--
Greg Broiles
gbr...@pa...

Re: Open Source SpamNet look-alike

From: Kelson V. <ke...@sp...> - 2003-05-01 18:57:19

"Keith Jackson" <kja...@cr...> wrote:
>I found razor. But, it looked like a wannabe OpenSource Cloudmark.

Just for the record, it's the other way around.  Razor is actually the 
basis for SpamNet.  (This can be verified easily at www.cloudmark.com, 
where you may also notice that the creator of Razor is one of the founders 
of Cloudmark.)

We now return you to your regularly scheduled Pyzor discussion.


Kelson Vibber
www.speed.net

RE: Digesting and Spam-matching

From: Keith J. <kja...@cr...> - 2003-05-01 18:46:48

> As I stated in the previous mail, pyzor doesn't hash the
> entire mail, just
> a section of it after removing data that would likely contribute to
> non-uniqueness, such as whitespace, urls, and email
> addresses.  The exact
> rules for removing suspicious elements are described in
> pyzor.client.DataDigester.
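
As a rough illustration of the approach described in the quoted text, here is a sketch that strips URLs, email addresses, and whitespace before hashing. The regular expressions, the function name sketch_digest, and the choice of SHA-1 are assumptions made for this example; they are NOT the actual rules, which live in pyzor.client.DataDigester and should be read from the source.

```python
import hashlib
import re

# Illustrative patterns only; these are NOT pyzor's actual rules.
URL_RE = re.compile(r"https?://\S+", re.IGNORECASE)
EMAIL_RE = re.compile(r"\S+@\S+")
WHITESPACE_RE = re.compile(r"\s+")


def sketch_digest(body):
    """Hash a message body after stripping URLs, email addresses and whitespace."""
    stripped = URL_RE.sub("", body)
    stripped = EMAIL_RE.sub("", stripped)
    stripped = WHITESPACE_RE.sub("", stripped)
    return hashlib.sha1(stripped.encode("utf-8")).hexdigest()


a = sketch_digest("Buy now!  Visit http://example.com/a1 or mail joe@example.com")
b = sketch_digest("Buy now! Visit http://example.net/zz9 or mail amy@example.org")
c = sketch_digest("Buy n0w! Visit http://example.com/a1 or mail joe@example.com")

print(a == b)  # differing URL, address and spacing are normalized away
print(a == c)  # a one-character change in the remaining text is not
```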

So, browsing the source, studying the rules, I can compose a spam mail that
will easily defeat this system. So then, not to be rude or putting-down,
this is mostly useless. People started blocking key words like 'cock'. So
the spammers use 'c0ck'. People start using this system, the spammers will
read your source and get around it. So, rather than a real solution to spam,
this sounds more like one step in the cat and mouse game.

I'm sorry for wasting everyone's time, and my lack of ability to read python
source :)

> > The cons are that's a lot of data to transmit to the server. If you have an
> > email with lots of short words, you are actually sending more data to the
> > server than the size of the original message.
>
> Yes, it is a lot of data, and given the volume of use on the public
> server, quite impractical.

I disagree. While not as compact and quick, it will catch more spam, which
is the primary goal. Google has the whole world cached. I'm sure storing
spam emails is not that impractical. Besides, it doesn't have to keep it
forever. Spams that were sent out two months ago, are not likely to be
looked up, how many people don't check their email in two months?

So, keeping two months, and some efficient storage of hashes, such as not
duplicating them across spam mail entries, and using compression for
server/client communication, I don't think it's really too impractical.

I'd give a few more bytes for better spam protection.

I just don't think pyzor is going to work for me as it stands now. And if I
were writing a mass mailer program that they advertise via mass mail ;) I'd
design it to beat this.

I do wish you much luck with this project though. Anything open source for
fighting spam is a Good Thing (tm).

Thanks,

Keith

Re: Open Source SpamNet look-alike

From: Thomas G. <gu...@th...> - 2003-05-01 18:16:28

On Wed, Apr 30, 2003 at 08:58:38AM -0400, Keith Jackson wrote:
> I don't want to trust a 'corporation' to manage the servers and
> whatnot. So, I looked around. First I found razor. But, it looked
> like a wannabe OpenSource Cloudmark. This however looks like a true
> Open Source look-alike, which is what I'm looking for.

That's why I use it, too.


> Now the point of this email..
> 
> I've been looking around your site for protocol documentation, and there is
> none. Please tell me the protocol is open? If truly open source, I should be
> able to write a server and/or client that would be compatible with your
> server and/or client. And that's precisely what I would like to do. I'm not
> going to get into an argument over what languages are better, but for my
> purposes, having this in Python just isn't good enough, I intend to write a
> C/C++ implementation, maybe an Outlook add-in, etc.

Python is very good, really. You can extend it with C/C++. You can even
write COM servers or clients (win32). This might help you with the
Outlook plugin.

> If there is a C/C++ implementation already being worked on somewhere else, I
> would be interested in helping on that. If not, I'm going to be starting
> one. 

Why? Are you afraid that Python is too slow? 99% of the time, a better
algorithm is what helps.

I guess you can reverse engineer the protocol by looking at the
code. Python is very readable.

 thomas

-- 
Thomas Guettler <gu...@th...>
http://www.thomas-guettler.de

9 messages has been excluded from this view by a project administrator.
