From: Frank T. <ft...@ne...> - 2003-06-10 19:57:55
|
Ted Sudtell, on 2003-06-06, wrote:

> Has this problem been addressed while using Pyzor report? If so, what
> do I have to do?

> ValueError: unknown Content-Transfer-Encoding: binary

It's symptomatic of a more general issue, that pyzor doesn't like it when illegal content-transfer encodings are used. It's in the buglist.

--
Frank Tobin http://www.neverending.org/~ftobin/
|
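One way the decoding step could tolerate unexpected encoding labels is sketched below. This is only an illustration under stated assumptions: `decode_body` and `IDENTITY_ENCODINGS` are hypothetical names, not pyzor code, and the sketch is written for a current Python rather than the 2.2 `mimetools` module shown in the traceback. Note that RFC 2045 lists `binary` next to `7bit` and `8bit` as an identity encoding, so the sketch passes such bodies through untouched, and it does the same for labels it does not recognise and for base64 that turns out to be truncated.

```python
import base64
import binascii
import quopri

# Labels that mean "the bytes are already the content" (RFC 2045).
IDENTITY_ENCODINGS = {"", "7bit", "8bit", "binary"}


def decode_body(data, encoding):
    """Decode a MIME body part, passing unknown encodings through unchanged.

    Hypothetical helper for illustration; it is not part of pyzor.
    """
    name = (encoding or "").strip().lower()
    if name in IDENTITY_ENCODINGS:
        return data
    if name == "base64":
        try:
            return base64.b64decode(data)
        except binascii.Error:
            # Truncated or corrupt base64: fall back to the raw bytes
            # instead of aborting the whole run.
            return data
    if name == "quoted-printable":
        return quopri.decodestring(data)
    # Unrecognised label: treat the body as opaque data rather than raising.
    return data
```

Whether falling back to the raw bytes is the right policy for a digest is a design choice; skipping the part entirely would be another reasonable option.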
|
From: Ted S. <wfa...@cu...> - 2003-06-06 15:53:17
|
Has this problem been addressed while using Pyzor report? If so, what do I
have to do?
Thanks
Traceback (most recent call last):
  File "/usr/bin/pyzor", line 4, in ?
    pyzor.client.run()
  File "/usr/lib/python2.2/site-packages/pyzor/client.py", line 934, in run
    ExecCall().run()
  File "/usr/lib/python2.2/site-packages/pyzor/client.py", line 188, in run
    if not apply(dispatch, (self, args)):
  File "/usr/lib/python2.2/site-packages/pyzor/client.py", line 282, in report
    for digest in FileDigester(sys.stdin, self.digest_spec, do_mbox):
  File "/usr/lib/python2.2/site-packages/pyzor/client.py", line 619, in next
    digest = self.digester.next()
  File "/usr/lib/python2.2/site-packages/pyzor/client.py", line 648, in next
    next_msg = self.mbox.next()
  File "/usr/lib/python2.2/mailbox.py", line 34, in next
    return self.factory(_Subfile(self.fp, start, stop))
  File "/usr/lib/python2.2/site-packages/pyzor/client.py", line 679, in __init__
    self.curfile = self.__class__(self.multifile)
  File "/usr/lib/python2.2/site-packages/pyzor/client.py", line 671, in __init__
    mimetools.decode(msg.fp, self.curfile, encoding)
  File "/usr/lib/python2.2/mimetools.py", line 149, in decode
    raise ValueError, \
ValueError: unknown Content-Transfer-Encoding: binary
|
|
From: Frank T. <ft...@ne...> - 2003-06-04 05:29:38
|
Kyle Wheeler, on 2003-06-03, wrote:

> Sometimes this is simply because Pyzor timed out (which happens with
> alarming frequency, and doesn't seem to be handled very well). The Pyzor
> protocol seems based around UDP, which makes it have occasional problems
> with the iptables firewall that I go through (the UDP packets get denied
> on the return trip from the Pyzor server sometimes).

Ah, yes, I forgot about that potential problem; thanks! Concerning the frequency of timeouts, I think I might make it so that pyzor re-tries sending a packet maybe two times.

> Other times I get this error much more consistently for a specific
> message. It will work fine if I pipe the mail directly to Pyzor, but not
> when through SpamAssassin. I had SpamAssassin save off the stderr of
> Pyzor when this happens. Anyone have any ideas as to why it might be
> happening? It *looks* like the spam message is somehow shorter than
> Pyzor is expecting...

This is a known problem in pyzor...I'll be adding some code so that this situation is handled more gracefully.

--
Frank Tobin http://www.neverending.org/~ftobin/
|
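The retry idea mentioned above can be sketched roughly as follows. This is an illustration only, not pyzor's client code: `send_with_retries`, its parameters, and the default timeout are assumptions made for the example. The function sends one UDP datagram, waits for a reply, and re-sends after each timeout, so "two retries" means at most three datagrams in total.

```python
import socket


def send_with_retries(payload, address, retries=2, timeout=5.0):
    """Send a UDP datagram and wait for a reply, re-sending on timeout.

    Returns the reply bytes, or None if every attempt timed out.
    Hypothetical helper; the names and defaults are not taken from pyzor.
    """
    attempts = 1 + retries
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        for _ in range(attempts):
            sock.sendto(payload, address)
            try:
                reply, _peer = sock.recvfrom(8192)
            except socket.timeout:
                # No answer in time (lost request or lost reply): try again.
                continue
            return reply
    return None
```

Because UDP gives no delivery guarantee, the server may see the same request more than once when only the reply was lost, so a retrying client works best when requests are safe to repeat.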
|
From: Kyle W. <ky...@me...> - 2003-06-03 17:16:08
|
Hello,
Like Jonathan Michael Hawkins and Jeff Jackson, I'm having problems with
Pyzor and SpamAssassin. Namely, on some messages I get the following
error message:
Pyzor -> report failed: Received error code 256 at
/usr/lib/perl5/site_perl/5.8.0/Mail/SpamAssassin/Reporter.pm line 327.
Sometimes this is simply because Pyzor timed out (which happens with
alarming frequency, and doesn't seem to be handled very well). The Pyzor
protocol seems based around UDP, which makes it have occasional problems
with the iptables firewall that I go through (the UDP packets get denied on
the return trip from the Pyzor server sometimes).
Other times I get this error much more consistently for a specific
message. It will work fine if I pipe the mail directly to Pyzor, but not
when through SpamAssassin. I had SpamAssassin save off the stderr of
Pyzor when this happens. Anyone have any ideas as to why it might be
happening? It *looks* like the spam message is somehow shorter than
Pyzor is expecting...
Here's the stderr:
Traceback (most recent call last):
  File "/usr/bin/pyzor", line 4, in ?
    pyzor.client.run()
  File "/usr/lib/python2.2/site-packages/pyzor/client.py", line 934, in run
    ExecCall().run()
  File "/usr/lib/python2.2/site-packages/pyzor/client.py", line 188, in run
    if not apply(dispatch, (self, args)):
  File "/usr/lib/python2.2/site-packages/pyzor/client.py", line 282, in report
    for digest in FileDigester(sys.stdin, self.digest_spec, do_mbox):
  File "/usr/lib/python2.2/site-packages/pyzor/client.py", line 615, in __init__
    self.digester = iter(get_file_digester(fp, spec, mbox))
  File "/usr/lib/python2.2/site-packages/pyzor/client.py", line 632, in get_file_digester
    return (DataDigester(rfc822BodyCleaner(fp),
  File "/usr/lib/python2.2/site-packages/pyzor/client.py", line 679, in __init__
    self.curfile = self.__class__(self.multifile)
  File "/usr/lib/python2.2/site-packages/pyzor/client.py", line 671, in __init__
    mimetools.decode(msg.fp, self.curfile, encoding)
  File "//usr/lib/python2.2/mimetools.py", line 137, in decode
    return base64.decode(input, output)
  File "//usr/lib/python2.2/base64.py", line 29, in decode
    line = input.readline()
  File "//usr/lib/python2.2/multifile.py", line 80, in readline
    raise Error, 'sudden EOF in MultiFile.readline()'
multifile.Error: sudden EOF in MultiFile.readline()
--
Well, I've wrestled with reality for over thirty five years, doctor, and I'm
happy to say I've finally won out over it.
-- Jimmy Stewart, in "Harvey"
|
|
From: Frank T. <ft...@ne...> - 2003-05-31 07:05:52
|
Jason Sjobeck, on 2003-05-30, wrote:

> The body of the message was empty & the subject said "test 678".

Pyzor only looks at the body of a message, and given an empty message, will actually barf at the moment.

> Since this message could not have possibly ever been seen by any one
> before, how could it be tagged?

An empty message is not unique. Could you send the output of

pyzor -d check < message

--
Frank Tobin http://www.neverending.org/~ftobin/
|
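To see why an empty message is not unique, consider a toy body-only digest. The sketch below is an assumption-laden illustration: `body_digest` is a hypothetical function, SHA-1 over the whitespace-stripped body is chosen only for the demo, and pyzor's actual digest is more involved and is not reproduced here. The point is just that once headers are ignored, every message with an empty body hashes to the same value, no matter who sent it or what the Subject says.

```python
import hashlib


def body_digest(raw_message):
    """Hash only the body of an RFC 822 style message, ignoring headers.

    Illustration only; this is not pyzor's digest algorithm.
    """
    # Headers end at the first blank line; everything after it is the body.
    text = raw_message.replace("\r\n", "\n")
    _headers, _sep, body = text.partition("\n\n")
    # Drop all whitespace so trailing newlines do not change the digest.
    normalized = "".join(body.split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


first = "From: me@example.com\nSubject: test 678\n\n"
second = "From: someone@example.net\nSubject: hello\n\n\n"
third = "Subject: test 678\n\nActual content\n"

print(body_digest(first) == body_digest(second))  # True: both bodies are empty
print(body_digest(first) == body_digest(third))   # False: the bodies differ
```

So if anyone, anywhere, has ever reported an empty message, a later empty message from an unrelated sender would look like a match.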
|
From: Jason S. <ja...@sj...> - 2003-05-30 23:21:22
|
Dear List,

I am using Pyzor on our company's mail gateway server. We are running postfix, amavisd, SpamAssassin, Razor, and Pyzor. We like all of that software a lot.

I was testing something completely unrelated this afternoon when I sent a test email to myself from my ISP's email server to my internal corporate address, examined the headers, and discovered that Pyzor had tagged my message. The body of the message was empty & the subject said "test 678".

Since this message was from my account at my ISP to my corp' account, I can not figure out how any one out in the public internet could have tagged this message to get Pyzor to tag it. Or am I misunderstanding something? I thought Pyzor was a completely human activated tagging system, meaning that someone would have had to have seen the message in question and reported it as spam. Since this message could not have possibly ever been seen by any one before, how could it be tagged?

Any tips or advice is most appreciated. Thanks.

Jason
ICQ : 127795461
|
|
From: <lis...@nu...> - 2003-05-29 14:58:30
|
I've seeded a very large number of spamtraps to a number of spamming
outfits' "remove" pages. That was lots of fun. :-) I'm of course already
getting spam for those spamtraps. I noticed some errors in my procmail
log and I thought I'd ask about them.
>From uda...@ms... Thu May 29 02:01:04 2003
Subject: ***SPAM*** Thin Summer In 2003...begin Now
Folder: /var/mail/spool/piehole
5359
Pyzor -> report failed: Received error code 256 at
/usr/local/lib/perl5/site_perl/5.8.0/Mail/SpamAssassin/Reporter.pm line
306.
>From bou...@bo... Thu May 29 03:55:31 2003
Subject: R, Save up to 80 percent on inkjets & no cost shipping
Folder: /var/mail/spool/piehole
4648
Traceback (most recent call last):
  File "/usr/bin/pyzor", line 4, in ?
    pyzor.client.run()
  File "/usr/lib/python2.2/site-packages/pyzor/client.py", line 934, in run
    ExecCall().run()
  File "/usr/lib/python2.2/site-packages/pyzor/client.py", line 188, in run
    if not apply(dispatch, (self, args)):
  File "/usr/lib/python2.2/site-packages/pyzor/client.py", line 282, in report
    for digest in FileDigester(sys.stdin, self.digest_spec, do_mbox):
  File "/usr/lib/python2.2/site-packages/pyzor/client.py", line 615, in __init__
    self.digester = iter(get_file_digester(fp, spec, mbox))
  File "/usr/lib/python2.2/site-packages/pyzor/client.py", line 632, in get_file_digester
    return (DataDigester(rfc822BodyCleaner(fp),
  File "/usr/lib/python2.2/site-packages/pyzor/client.py", line 679, in __init__
    self.curfile = self.__class__(self.multifile)
  File "/usr/lib/python2.2/site-packages/pyzor/client.py", line 671, in __init__
    mimetools.decode(msg.fp, self.curfile, encoding)
  File "//usr/lib/python2.2/mimetools.py", line 149, in decode
    raise ValueError, \
ValueError: unknown Content-Transfer-Encoding: binary
Pyzor -> report failed: Received error code 256 at
/usr/local/lib/perl5/site_perl/5.8.0/Mail/SpamAssassin/Reporter.pm line
306.
I only received those two errors out of about 20 pieces of spam. I'm
really not sure what to make of either of them. Does anyone know what
they mean?
Justin
|
|
From: <lis...@nu...> - 2003-05-28 19:42:42
|
On Wed, 28 May 2003, Frank Tobin wrote:

> lis...@nu..., on 2003-05-28, wrote:
>
> > Do I need to strip out the extra header lines and Subject changes before
> > reporting spam via Pyzor?
>
> No; pyzor does not look at headers.

Great. That's what I needed to know.

Thanks
Justin
|
|
From: Frank T. <ft...@ne...> - 2003-05-28 17:39:45
|
lis...@nu..., on 2003-05-28, wrote:

> Do I need to strip out the extra header lines and Subject changes before
> reporting spam via Pyzor?

No; pyzor does not look at headers.

--
Frank Tobin http://www.neverending.org/~ftobin/
|
|
From: <lis...@nu...> - 2003-05-28 15:58:00
|
Do I need to strip out the extra header lines and Subject changes before reporting spam via Pyzor?

I'm setting up around 5000 spamtraps that are aliases for a single account. I'm using procmail on that account to auto-report via Pyzor. I'm stripping out SA's header lines with spamassassin -r, ReSent and X-Scanned-By lines with grep (although formail would have been just as easy), and using sed to undo the changes I make to the Subject line when MIMEDefang scores mail >= 10.

Do I actually need to do all that? I'm under the impression that Razor and Pyzor only take a hash of the message body, which I'm not altering. Clarification would be welcomed.

Thanks!
Justin
|
|
From: Frank T. <ft...@ne...> - 2003-05-28 05:21:45
|
Jackson, Jeff, on 2003-05-22, wrote:

> In a message from March, Jonathan Michael Hawkins had the same problem
> I'm having in the thread "spamassassin -r < message.txt doesn't work".
> I'm running Red Hat Linux 8.0 with SpamAssassin 2.55 and Pyzor 0.4. When
> I try to submit spam using spamassassin -r < spam.txt, everything is
> reported to Razor and DCC fine, but Pyzor gives me the following error:

Every time? It's hard for me to debug without any pyzor stderr output from SpamAssassin; can that be retrieved?

--
Frank Tobin http://www.neverending.org/~ftobin/
|
|
From: Jackson, J. <jef...@rb...> - 2003-05-22 14:36:28
|
Hi,

Another new Pyzor user here, hoping for a little help...

In a message from March, Jonathan Michael Hawkins had the same problem I'm having in the thread "spamassassin -r < message.txt doesn't work". I'm running Red Hat Linux 8.0 with SpamAssassin 2.55 and Pyzor 0.4. When I try to submit spam using spamassassin -r < spam.txt, everything is reported to Razor and DCC fine, but Pyzor gives me the following error:

debug: executable for pyzor was found at /usr/bin/pyzor
debug: Pyzor is available: /usr/bin/pyzor
debug: entering helper-app run mode
debug: leaving helper-app run mode
Pyzor -> report failed: Received error code 256 at /usr/lib/perl5/site_perl/5.8.0/Mail/SpamAssassin/Reporter.pm line 326.

When I use pyzor itself to submit, it works fine:

[filter@prickle filter]$ pyzor report < spam.txt
66.47.67.162:24441 (200, 'OK')

I couldn't find a resolution to the problem in the list archives, and the previous message thread ends without one...

If anybody can shed any light on the problem, it would be greatly appreciated.

Jeff Jackson
R.B. Zack & Associates, Inc.
www.rbza.com
General Contractors for the Virtual World,
Building Business Applications that Work Since 1981.
jef...@rb...
(310) 833-0211 x180

QUIDQUID LATINE DICTUM SIT, PROFUNDUM VIDITUR
(Whatever is said in Latin appears profound)

The information in this e-mail is confidential and may be legally privileged. It is intended solely for the addressee. Access to this e-mail by anyone else is unauthorized. If you are not the intended recipient, any disclosure, copying, distribution or any action taken or omitted to be taken in reliance on it, is prohibited and may be unlawful.
|
|
From: Frank T. <ft...@ne...> - 2003-05-19 20:21:27
|
Leon Roy, on 2003-05-19, wrote:

> NameError: name 'object' is not defined
>
> I've downloaded the latest 0.4.0 release of pyzor, and tried the
> previous 0.3.1 but both give the same error. I have been able to
> successfully install before on Mandrake, (using debian at the moment).

You need Python 2.2.1 or later.

--
Frank Tobin http://www.neverending.org/~ftobin/
|
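A version guard near the top of a setup script can turn this kind of NameError into a readable message. The following is a minimal sketch, not something taken from pyzor's setup.py: `check_python` and `MINIMUM` are hypothetical names, the minimum is simply the 2.2.1 figure given in the reply, and the code is written for a current Python.

```python
import sys

MINIMUM = (2, 2, 1)  # the minimum version stated in the reply above


def check_python(version_info=None, minimum=MINIMUM):
    """Exit with a readable message if the interpreter is too old.

    Hypothetical helper for illustration; returns the (major, minor, micro)
    tuple it checked when the interpreter is new enough.
    """
    if version_info is None:
        version_info = sys.version_info
    current = tuple(version_info[:3])
    if current < minimum:
        raise SystemExit(
            "Python %d.%d.%d or later is required; found %d.%d.%d"
            % (minimum + current)
        )
    return current
```

Calling `check_python()` before importing the package would stop the build with the version message instead of a traceback from deep inside the import.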
|
From: Leon R. <dog...@vi...> - 2003-05-19 16:32:15
|
Hello,
when I run: python setup.py build, I get the following error:
Traceback (most recent call last):
  File "setup.py", line 5, in ?
    import pyzor
  File "lib/pyzor/__init__.py", line 57, in ?
    class Singleton(object):
NameError: name 'object' is not defined
I've downloaded the latest 0.4.0 release of pyzor, and tried the
previous 0.3.1 but both give the same error. I have been able to
successfully install before on Mandrake, (using debian at the moment).
Any help appreciated,
-Leon.
|
|
From: Frank T. <ft...@ne...> - 2003-05-19 06:47:41
|
pos...@sj..., on 2003-05-18, wrote:

> debug: Pyzor: couldn't grok response "downloading servers from
> http://pyzor.sourceforge.net/cgi-bin/inform-servers-0-3-x"
> ------------------------------------------------------------
>
> Have I done something wrong?

No. SA reads pyzor's stderr (which it shouldn't), and the first time you run pyzor, it will spit out a message about downloading the server list.

> One note, when I finished installing the package and ran "pyzor
> discover" at the console it said "no command interpreter" & then I found
> out it was looking for a program named "python2" while my machine only
> had "python", so I simply did a "cp python python2" then it seemed to
> run fine. Is this a mistake? IS this relevant?

Interesting; if you install pyzor from the tarball it should point itself to the correct python executable. If you were using someone else's packaging of pyzor, there could be problems. Your workaround is fine, but I'd recommend symlinking python2 to python, instead of copying.

--
Frank Tobin http://www.neverending.org/~ftobin/
|
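The stderr mix-up described above is the kind of thing a caller avoids by capturing the two streams separately and parsing only stdout. The sketch below shows that pattern in Python; it is not SpamAssassin's code (which is Perl), `run_helper` is a hypothetical name, and `CHILD` is a stand-in child process invented for the demo that prints a notice on stderr and a result on stdout.

```python
import subprocess
import sys


def run_helper(argv, stdin_bytes=b""):
    """Run a helper program and return (exit_code, stdout, stderr) separately.

    Hypothetical helper for illustration.
    """
    done = subprocess.run(
        argv,
        input=stdin_bytes,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    return done.returncode, done.stdout, done.stderr


# Stand-in for a helper that chats on stderr and answers on stdout.
CHILD = [
    sys.executable,
    "-c",
    'import sys; sys.stderr.write("downloading servers\\n"); '
    'sys.stdout.write("200 OK\\n")',
]

code, out, err = run_helper(CHILD)
print(code, out.strip(), err.strip())
```

With the streams kept apart, an informational line on stderr can be logged or ignored without ever reaching the code that interprets the reply.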
|
From: <pos...@sj...> - 2003-05-18 08:45:56
|
Dear List,

Thanks for letting me post here. I am looking very forward to using what sounds like an incredibly cool piece of software.

We are running the following software:

Mandrake Linux 9.0
Postfix 1.11.1
Amavisd-new
SpamAssassin
Razor
Pyzor
clamAV & AntiVir

I just got Pyzor installed and restarted "amavisd" & SpamAssassin in debug mode so I can see what SpamAssassin is doing when it calls Pyzor. I am getting the following error:

------------------------------------------------------------
debug: Pyzor is available: /usr/bin/pyzor
debug: entering helper-app run mode
debug: Pyzor: got response: downloading servers from http://pyzor.sourceforge.net/cgi-bin/inform-servers-0-3-x
debug: leaving helper-app run mode
debug: Pyzor: couldn't grok response "downloading servers from http://pyzor.sourceforge.net/cgi-bin/inform-servers-0-3-x"
------------------------------------------------------------

Have I done something wrong?

One note, when I finished installing the package and ran "pyzor discover" at the console it said "no command interpreter" & then I found out it was looking for a program named "python2" while my machine only had "python", so I simply did a "cp python python2" then it seemed to run fine. Is this a mistake? IS this relevant?

Back to the original error message, though, has anyone been here before and lived to tell the tale? If so, I would be most appreciative to hear how you resolved the issue.

Thanks very much.
Jason

Post Script: if you could reply directly as well as to the list, that would be great. Thanks again.
|
|
From: Roman S. <rn...@on...> - 2003-05-04 20:07:20
|
On Fri, 2 May 2003, Keith Jackson wrote:

I am not sure Bayes-formula based filters are any good at catching server-wide spam. Last month the volume of spam increased again, and now I see more and more spams which use randomization. They put random strings right into the text! Humans can distinguish "random" pieces easily while computers can't. Personally I am using SpamOracle and it chokes on the <!--dfsdfsdf--> based technique. So, it seems, spammers will be a major inspiration for AI people ;-)

In my view, any mail/spam left after pyzor is to be read by some AI system. Bayes is only the first step to that.

Also, I do not think pyzor needs to be more adaptive. It's a simple and bullet-proof tool. And not a high-profile one, so spammers are not taking it into account yet ;-)

Sincerely yours, Roman Suzi
--
rn...@on... =\= My AI powered by GNU/Linux RedHat 7.3
|
|
From: Keith J. <kja...@cr...> - 2003-05-02 15:17:56
|
> If 32 bits of search space is sufficient, and the MD5 step isn't providing
> any significant privacy, why bother with it at all?
> Seems like you could reach a similar result just shipping the software with
> a pre-compiled dictionary matching common words to your 4-byte references,
> and pass chains of those around, instead.

True. But either way, this is a small detail. Send each actual token, or send a hash of each token. I guess it's a warm fuzzy feeling. By my own arguments then, it's just bloat, and the actual token should be sent. Either way. I guess I'd do MD5 because it's simple to implement, and would make it at least mildly difficult for an 'evil server'. As I said before, spam and security are separate issues. If you want security, use PGP or some other package designed specifically for it.

> > Secondly, see above comment about not storing duplicates.
> > Storing all email everyone gets is just silly.
>
> Ok, but aren't you degrading your ability to detect spam, then? I thought we
> were designing for the case where nobody sends identical spams (see your
> critique of Pyzor's hash strategy) but instead sends individualized spams
> with minor differences. So we've either got to store all of them, or hope
> that the first spam that we get is similar enough to all of the others to
> avoid detection.

My thoughts were the 'first' spam contains something like:

Keith, you've won!!!

3 tokens: [Keith,], [you've], [won!!!]

Now you get a second spam mail, and it has [Greg,] instead of [Keith,]. With our 'theoretical' matching software it will match the spam message stored on the server... and the [Keith,] token will have its 'variable' attribute incremented by one. So now the next match will be stronger. It will see that [Keith,], [you've], [won!!!] matches better with [George,], [you've], [won!!!] because it sees that that first token is variable.

I don't want to store each unique spam, I want to store the first unique spam, which will then be modified via voting and whatnot to be more like a spam template, thus the point of doing this on the server and harnessing the power of distributed computing, vs. a client trying to figure this stuff out.

Another thought that comes to mind just writing this email: to be even more optimized, I can hash the whole email on the client side (ignoring headers), and come up with one hash. Send it to the server; if that hash matches, then we don't even need to do this more advanced stuff. If it doesn't, then we can go into this more detailed matching.

> > Again, I'm no expert, but the algorithm being hard shouldn't be a reason for
> > not doing it. I'm sure as I'm sitting here there are PhD's all over that
> > have papers on the net about good algorithms of this type.
>
> I think this is what I find hard to swallow about your suggestion - that
> we just find some smart people from somewhere else to fix the storage
> problem and the spam-matching problem, and then say that the spam problem
> has been solved.
>
> Isn't that like saying "I know how to stop SARS. We'll make some pills
> that everyone will take. That might be expensive, but we'll get some
> smart manufacturing guys to help make it cheap. And we don't have an
> anti-viral drug that works against SARS yet, but there are a bunch of
> smart pharmacology guys who work on that sort of thing all day, they'll
> get that figured out pretty soon. OK, problem solved. Next!"

Granted. But your statement 'And all of this assumes that you have a good algorithm for deciding whether two hash chains are similar enough to be the same, or not', makes it sound as though this is a radical new algorithm that no one has thought of, or that they don't have any good implementations of. And it's not. I personally could think of a few approaches myself.

> It doesn't really seem reasonable to me to compare hypothetical software
> to actual software - the actual software always has bugs & limitations,
> while the hypothetical software never seems to have any, because it can
> be modified much more quickly than actual software, and is always
> fully debugged & optimized.

You are very right, and I've never said this is better, I'd be a fool to say so until I see it run. I've said it could be better, and I'm throwing around ideas, I could be wrong, like you said, until it's fully debugged & optimized. But the whole point of this is my 'prove me wrong' mentality. Because if you can prove to me that this won't work, then I won't have to waste my time trying to implement it.

Keith
|
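The "spam template" idea from this message can be made concrete with a toy model. Everything in the sketch is an assumption made for illustration: `SpamTemplate`, `score`, `absorb`, and the 0.6 threshold are hypothetical, the comparison is position-by-position on equal-length token lists, and no existing tool is claimed to work this way. A position counts as agreeing if the tokens are equal or if that position has already been seen to vary.

```python
class SpamTemplate:
    """A stored spam whose token positions can be marked as variable."""

    def __init__(self, tokens):
        self.tokens = list(tokens)
        # How many times each position differed in an accepted match.
        self.variable = [0] * len(self.tokens)

    def score(self, tokens):
        """Fraction of positions that agree with this template."""
        if not self.tokens or len(tokens) != len(self.tokens):
            return 0.0
        hits = sum(
            1
            for mine, theirs, seen_varying in zip(self.tokens, tokens, self.variable)
            if mine == theirs or seen_varying > 0
        )
        return hits / len(self.tokens)

    def absorb(self, tokens, threshold=0.6):
        """If tokens match well enough, record which positions differed."""
        if self.score(tokens) < threshold:
            return False
        for i, (mine, theirs) in enumerate(zip(self.tokens, tokens)):
            if mine != theirs:
                self.variable[i] += 1
        return True


template = SpamTemplate("Keith, you've won!!!".split())
print(template.score("Greg, you've won!!!".split()))    # 2 of 3 positions agree
template.absorb("Greg, you've won!!!".split())
print(template.score("George, you've won!!!".split()))  # first position now variable
```

A real matcher would also have to cope with inserted or deleted words, which is where sequence-alignment style comparisons come in; the toy above deliberately ignores that.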
|
From: Greg B. <gbr...@bi...> - 2003-05-02 14:28:42
|
On Fri, May 02, 2003 at 09:03:11AM -0400, Keith Jackson wrote:

> > 5232 bytes = 327 hashes * 128 bits
>
> THAT, you are correct, is impractical, and not optimized or compressed even
> remotely.
>
> Just as my thinking aloud in these emails, you could store them as
> references to a dictionary. The reference is going to be less than 128 bits.
> An index number, let's say 4 bytes... an unsigned long. The dictionary won't
> grow forever, it will change based on the messages currently on the server.
> So..
>
> 1308 bytes = 327 hashes * 4 byte refs
>
> That's 25% storage of what you are talking about. A 75% gain. And I'm not
> even a wiz at this stuff.

If 32 bits of search space is sufficient, and the MD5 step isn't providing any significant privacy, why bother with it at all?

Seems like you could reach a similar result just shipping the software with a pre-compiled dictionary matching common words to your 4-byte references, and pass chains of those around, instead.

> Secondly, see above comment about not storing duplicates. Storing all email
> everyone gets is just silly.

Ok, but aren't you degrading your ability to detect spam, then? I thought we were designing for the case where nobody sends identical spams (see your critique of Pyzor's hash strategy) but instead sends individualized spams with minor differences. So we've either got to store all of them, or hope that the first spam that we get is similar enough to all of the others to avoid detection.

> Again, I'm no expert, but the algorithm being hard shouldn't be a reason for
> not doing it. I'm sure as I'm sitting here there are PhD's all over that
> have papers on the net about good algorithms of this type.

I think this is what I find hard to swallow about your suggestion - that we just find some smart people from somewhere else to fix the storage problem and the spam-matching problem, and then say that the spam problem has been solved.

Isn't that like saying "I know how to stop SARS. We'll make some pills that everyone will take. That might be expensive, but we'll get some smart manufacturing guys to help make it cheap. And we don't have an anti-viral drug that works against SARS yet, but there are a bunch of smart pharmacology guys who work on that sort of thing all day, they'll get that figured out pretty soon. OK, problem solved. Next!"

It doesn't really seem reasonable to me to compare hypothetical software to actual software - the actual software always has bugs & limitations, while the hypothetical software never seems to have any, because it can be modified much more quickly than actual software, and is always fully debugged & optimized.

--
Greg Broiles gbr...@pa...
|
|
From: Keith J. <kja...@cr...> - 2003-05-02 13:03:57
|
> If you're only interested in perfect solutions, you're going to be reading a
> lot of spam over the next few years.

No, not perfect. But not easily defeatable, either.

> I think you're optimizing for a pretty narrow class of client - someone with
> a fast, unmetered net connection (so it's not burdensome to have to download
> all of the spam messages and upload the hash chains)

I don't think broadband is that narrow of a class, and becoming less narrow every day. I see this with a lot of Open Source projects. They design everything they write to work EVERYWHERE, and sometimes, to make progress, you just have to say... 'Hey, sorry man, get a better computer, get a faster connection'. Look at Microsoft. You think they are designing windows for dialup users? I don't think it'll be too many more revisions before dialup is some obscure extra add-in you gotta install separately. Hell with Win95 users, hell with dialup users, hell with people with 286 computers. And you can quote me on that. If we don't say that, we'll never GET anywhere.

> and a relatively fast
> PC (to calculate a few hundred MD5's per message), with
> access to a pretty nice server (to store a few hundred MD5's per message per
> customer). The server becomes more useful as it's got more customers
> (someone's gotta read the spam the first time, and say "hey! that's a spam!", unless
> you're going to depend on spam-trap addresses), so it's not so interesting to say
> "well, four of my friends and I will share a little server on an old PC
> I've got in my garage".

Calculating one MD5 per word of an email is, sorry to say, not that intensive at all given today's computers.

Your basic math of customer * messages doesn't work well, either. I get 500 mails a day let's say. I have to upload a hash chain for each. The server will first.... match it with known spam. If 450 of my 500 mails are spam.... then those 450 hash chains are already on the server. I wouldn't store them twice. So that leaves 50 hash chains the server has to store. If the server doesn't see any matches on those 50 non-spams within let's say a day, it can throw them away.

So, the actual math to describe server space is not customer * messages, it's number of unique spams sent across the world (within the past two months) plus the number of customers times the number of legitimate emails they get. The actual unique spams would be stored for like 2 months, and the non-matches would be stored a day. We'd end up with a server with a relatively large database of spam, but I have a feeling... I may be wrong, that the 'volatile' legitimate email will be larger. But it doesn't need to have critical storage, or to be kept around longer than a day.

Now, on top of that, you don't store messages on a server as an md5 hash chain. When you first parse the message, you add all the hashes to a server dictionary, not re-adding ones that have been added before. Then, the actual message is stored as a list of references to the hashes. I don't see this as unrealistic or taking millions of dollars of hardware to accomplish.

> Now, if we think about other people who care about spam - like, say, people
> with dialup access or metered access who don't want to download spams only
> to discard them on the client PC - or ISP's who don't want their spool
> disks filled with spams waiting to be delivered then discarded - then there's
> a pretty significant processing load placed on the receiving mailserver,
> if they're the ones that have to calculate the hashes everytime a message
> is received.

Buy broadband and a better PC I say to you. Sorry. See above comment about Open Source mentality.

> > I'd give a few more bytes for better spam protection.
>
> Have you done the math on this? I don't think we're talking about "a few
> more bytes", I think we're talking about a fair amount of data, if you're
> planning to store an ordered list of 128-bit values for every message
> received over the last 60 days for a few thousand, or tens of thousands,
> of people.

I have done the math for it. And I explained it above. As I said in the original message, with some creative optimizations. I'm not even a great optimization guy, and I've come up with a better scheme than storing ordered lists of 128 bit hashes per message. Have YOU done the math on this? ;) How many times in my email did I repeat words... like 'a'? or 'the'? etc...

> Your initial message in this thread, stripped of its headers, and counted
> by "wc", had 327 words. Assuming I have a nice way to store the hashes
> that doesn't incur any overhead, that's 5232 bytes of data to remember
> your message. If I get 300 emails per day like yours (probably not too
> far off the mark), and I want to store 60 days' worth of them, that's
> almost 92 megs of data .. for one person's email. Now, I'm probably
> worse than the average, but that's still a few orders of
> magnitude worse than seems practical.

5232 bytes = 327 hashes * 128 bits

THAT, you are correct, is impractical, and not optimized or compressed even remotely.

Just as my thinking aloud in these emails, you could store them as references to a dictionary. The reference is going to be less than 128 bits. An index number, let's say 4 bytes... an unsigned long. The dictionary won't grow forever, it will change based on the messages currently on the server. So..

1308 bytes = 327 hashes * 4 byte refs

That's 25% storage of what you are talking about. A 75% gain. And I'm not even a wiz at this stuff.

Secondly, see above comment about not storing duplicates. Storing all email everyone gets is just silly.
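The dictionary-of-references storage scheme and its arithmetic can be tried out with a small sketch. The class and method names below (`TokenDictionary`, `intern`, `encode`) are hypothetical, whitespace splitting stands in for whatever tokenizer would really be used, and the packing format is just one possible choice. Each distinct token's 16-byte MD5 digest is stored once; a message is then an ordered run of 4-byte reference numbers.

```python
import hashlib
import struct


class TokenDictionary:
    """Maps each distinct token hash to a small integer reference."""

    def __init__(self):
        self._index = {}   # 16-byte MD5 digest -> reference number
        self.digests = []  # reference number -> digest

    def intern(self, token):
        """Return the reference for a token, adding its digest if new."""
        digest = hashlib.md5(token.encode("utf-8")).digest()
        ref = self._index.get(digest)
        if ref is None:
            ref = len(self.digests)
            self._index[digest] = ref
            self.digests.append(digest)
        return ref

    def encode(self, text):
        """Store a message as an ordered run of 4-byte references."""
        refs = [self.intern(token) for token in text.split()]
        return struct.pack("<%dI" % len(refs), *refs)


words = 327
print(words * 16)  # 5232 bytes as raw 128-bit hashes
print(words * 4)   # 1308 bytes as 4-byte references

store = TokenDictionary()
packed = store.encode("you have won you have won")
print(len(packed), len(store.digests))  # 24 bytes of refs, 3 distinct digests
```

The saving per message is real, but the dictionary itself still costs 16 bytes per distinct token, so the overall benefit depends on how much vocabulary the stored messages share.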
> (The storage requirements would be reduced quite a bit if you didn't want > to keep the order of the hashes, but that would also reduce the > ability of the program to differentiate between messages .. or if we > didn't care about privacy, and stored the message itself - the same data > from your message, whose hashes totalled 5232 bytes of data, was only > 1993 bytes as text.) Well, order of hashes is obviously important in this schema. But I'm sure there are many other schema's as well. > And all of this assumes that you have a good algorithm for deciding > whether two hash chains are similar enough to be the same, or not; Again, I'm no expert, but the algorithm being hard shouldn't be a reason for not doing it. I'm sure as I'm sitting here there are PhD's all over that have papers on the net about good algorithms of this type. > I think you're making the problem a lot harder by trying to operate on > hashes, rather than text, because it's easier to write code that > ignores garbage strings than code that ignores hashes of garbage > strings. Well, this is my problem with pyzor (and some others, now that I'm looking around). It assumes it knows about the content of the message. So, it won't even take advantage of the power of distributed spam detection if the spammers can first defeat the simple front end. My contention is that this is the WRONG way to do it. If you are doing distributed spam detection, but only doing it in some conditions that spammers can defeat, then you aren't doing distributed spam detection at all. I'd rather just have a client side filter. > Further, the privacy protection provided by the hash isn't very good - > what's to keep a nosy server operator from running MD5 over the > contents of a few good dictionaries, and then substituting the known > hashes for the contents of the messages you disclose? Sure, they'd > probably end up with some missing words, but messages written in > known languages would be revealed pretty quickly. 
(Unless you use > some salt, so that when I hash a word I get a different result than > when you hash that word .. but then it's not possible to compare > my messages received to your messages received, and notice interesting > parallels.) This is the reason I think we need an open source distributed spam protection. This kind of feedback is good, and should be considered. Furthermore, the people who run the servers are out in the open, so to speak, and not behind a corporate veil. If the user is worried about privacy, however, my opinion is and always will be: don't use this, or use PGP on your messages. I'm talking about designing spam protection; I'd be bloating my software if I worried too much about security. That should be handled at a different level. So, I still believe my idea is possible and not impractical. |
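The dictionary-of-references storage described in the message above can be sketched roughly as follows. This is only an illustration of the idea as stated in the thread, not code from pyzor or any real server: the names `HashChainStore`, `add_message` and `chain_bytes` are hypothetical, and splitting on whitespace is an assumed way of getting "words".

```python
import hashlib


class HashChainStore:
    """Toy server-side store: each distinct word hash is kept only once."""

    def __init__(self):
        self.index_of = {}   # 16-byte MD5 digest -> small integer reference
        self.messages = []   # each message is a list of integer references

    def add_message(self, text):
        refs = []
        for word in text.split():
            digest = hashlib.md5(word.encode("utf-8")).digest()
            # setdefault hands out the next free index only for unseen digests
            refs.append(self.index_of.setdefault(digest, len(self.index_of)))
        self.messages.append(refs)
        return refs


def chain_bytes(word_count, bytes_per_entry):
    """Size of one stored message chain, ignoring any container overhead."""
    return word_count * bytes_per_entry
```

With this sketch, `chain_bytes(327, 16)` gives the 5232 bytes quoted in the thread and `chain_bytes(327, 4)` gives the 1308 bytes for 4-byte references. Note that the dictionary itself still costs 16 bytes for every distinct hash, so the saving depends on how much words repeat across the messages held on the server.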
|
From: Frank T. <ft...@ne...> - 2003-05-02 05:00:14
|
Keith Jackson, on 2003-05-01, wrote: > So, browsing the source, studying the rules, I can compose a spam mail > that will easily defeat this system. So then, not to be rude or > putting-down, this is mostly useless. People started blocking key words > like 'cock'. So the spammers use 'c0ck'. People start using this system, > the spammers will read your source and get around it. So, rather than a > real solution to spam, this sounds more like one step in the cat and > mouse game. You're somewhat right. I fully understand the limitations of pyzor. So far, however, the simple solution has worked well, and scaled well. Once we need more, it won't be that hard to change pyzor so that what is digested is done more dynamically. Eventually, however, I fully realize that we could get to a point where all pieces of spam are drastically unique, and can't be compared to each other to any significant degree. Pyzor was designed to solve a problem as it existed and could be solved at the time. If the problem is borgish, and adapts to preventions such as pyzor and razor, then these solutions will likely fade away, which is perfectly fine with me. Against a skilled, determined, and funded attacker, preventing spam could become a *much* more difficult problem. > I disagree. While not as compact and quick, it will catch more spam, > which is the primary goal. Google has the whole world cached. I'm sure > storing spam emails is not that impractical. Besides, it doesn't have to > keep it forever. Spams that were sent out two months ago, are not likely > to be looked up, how many people don't check their email in two months? First of all, Google has $$$, bandwidth, and many machines. Storing spam emails isn't that impractical, really. I'm pretty sure the Razor servers store the entire piece of spam, but when checking, only a digest is sent to be queried. 
This nicely allows dynamic rules about hashing, which is a good idea and requires significantly more work on the part of mass mailer developers to thwart. > I just don't think pyzor is going to work for me as it stands now. And > if I were writing a mass mailer program that they advertise via mass > mail ;) I'd design it to beat this. If you or I were so inclined, we could probably both design mass mailers that would defeat any preventive measures, especially if said measures had source available. The only tricky one to defeat might be Bayesian-like systems, which might be the only long-term winner. -- Frank Tobin http://www.neverending.org/~ftobin/ |
|
From: Greg B. <gbr...@bi...> - 2003-05-01 20:39:46
|
On Thu, May 01, 2003 at 02:46:43PM -0400, Keith Jackson wrote: > > As I stated in the previous mail, pyzor doesn't hash the > > entire mail, just > > a section of it after removing data that would likely contribute to > > non-uniqueness, such as whitespace, urls, and email > > addresses. The exact > > rules for removing suspicious elements are described in > > pyzor.client.DataDigester. > > So, browsing the source, studying the rules, I can compose a spam mail that > will easily defeat this system. So then, not to be rude or putting-down, > this is mostly useless. People started blocking key words like 'cock'. So > the spammers use 'c0ck'. People start using this system, the spammers will > read your source and get around it. So, rather than a real solution to spam, > this sounds more like one step in the cat and mouse game. If you're only interested in perfect solutions, you're going to be reading a lot of spam over the next few years. > So, keeping two months, and some efficient storage of hashes, such as not > duplicating them across spam mail entries, and using compression for > server/client communication, I don't think it's really too impractical. I think you're optimizing for a pretty narrow class of client - someone with a fast, unmetered net connection (so it's not burdensome to have to download all of the spam messages and upload the hash chains) and a relatively fast PC (to calculate a few hundred MD5's per message), with access to a pretty nice server (to store a few hundred MD5's per message per customer). The server becomes more useful as it's got more customers (someone's gotta read the spam the first time, and say "hey! that's a spam!", unless you're going to depend on spam-trap addresses), so it's not so interesting to say "well, four of my friends and I will share a little server on an old PC I've got in my garage". 
Now, if we think about other people who care about spam - like, say, people with dialup access or metered access who don't want to download spams only to discard them on the client PC - or ISP's who don't want their spool disks filled with spams waiting to be delivered then discarded - then there's a pretty significant processing load placed on the receiving mailserver, if they're the ones that have to calculate the hashes everytime a message is received. > I'd give a few more bytes for better spam protection. Have you done the math on this? I don't think we're talking about "a few more bytes", I think we're talking about a fair amount of data, if you're planning to store an ordered list of 128-bit values for every message received over the last 60 days for a few thousand, or tens of thousands, of people. Your initial message in this thread, stripped of its headers, and counted by "wc", had 327 words. Assuming I have a nice way to store the hashes that doesn't incur any overhead, that's 5232 bytes of data to remember your message. If I get 300 emails per day like yours (probably not too far off the mark), and I want to store 60 days' worth of them, that's almost 92 megs of data .. for one person's email. Now, I'm probably worse than the average, but that's still a few orders of magnitude worse than seems practical. (The storage requirements would be reduced quite a bit if you didn't want to keep the order of the hashes, but that would also reduce the ability of the program to differentiate between messages .. or if we didn't care about privacy, and stored the message itself - the same data from your message, whose hashes totalled 5232 bytes of data, was only 1993 bytes as text.) 
And all of this assumes that you have a good algorithm for deciding whether two hash chains are similar enough to be the same, or not; I think you're making the problem a lot harder by trying to operate on hashes, rather than text, because it's easier to write code that ignores garbage strings than code that ignores hashes of garbage strings. Further, the privacy protection provided by the hash isn't very good - what's to keep a nosy server operator from running MD5 over the contents of a few good dictionaries, and then substituting the known hashes for the contents of the messages you disclose? Sure, they'd probably end up with some missing words, but messages written in known languages would be revealed pretty quickly. (Unless you use some salt, so that when I hash a word I get a different result than when you hash that word .. but then it's not possible to compare my messages received to your messages received, and notice interesting parallels.) > I just don't think pyzor is going to work for me as it stands now. And if I > were writing a mass mailer program that they advertise via mass mail ;) I'd > design it to beat this. > > I do wish you much luck with this project though. Anything open source for > fighting spam is a Good Thing (tm). Good luck to you, too .. -- Greg Broiles gbr...@pa... |
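The privacy concern raised above (hashing a wordlist and substituting the known hashes back in) can be illustrated with a short sketch. It assumes one MD5 hash per whitespace-separated word, as in the scheme under discussion; the function names `word_hash` and `recover_words` and the sample words are made up for the example.

```python
import hashlib


def word_hash(word, salt=b""):
    """Hex MD5 of a single word, optionally prefixed with a per-client salt."""
    return hashlib.md5(salt + word.encode("utf-8")).hexdigest()


def recover_words(hash_chain, wordlist):
    """Swap each hash for its dictionary word if known, '?' otherwise."""
    lookup = {word_hash(w): w for w in wordlist}
    return [lookup.get(h, "?") for h in hash_chain]


# An unsalted chain leaks every word that appears in the operator's wordlist.
chain = [word_hash(w) for w in "buy cheap xq7z now".split()]
recovered = recover_words(chain, ["buy", "cheap", "now"])

# A per-client salt defeats this lookup, but two clients then produce
# different hashes for the same word, so their chains can no longer be compared.
salted_chain = [word_hash(w, salt=b"client-a") for w in "buy cheap now".split()]
salted_recovered = recover_words(salted_chain, ["buy", "cheap", "now"])
```

Here `recovered` comes out as `['buy', 'cheap', '?', 'now']`, while `salted_recovered` is all `'?'`, which is the trade-off the parenthetical about salt describes.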
|
From: Kelson V. <ke...@sp...> - 2003-05-01 18:57:19
|
"Keith Jackson" <kja...@cr...> wrote: >I found razor. But, it looked like a wannabe OpenSource Cloudmark. Just for the record, it's the other way around. Razor is actually the basis for SpamNet. (This can be verified easily at www.cloudmark.com, where you may also notice that the creator of Razor is one of the founders of Cloudmark.) We now return you to your regularly scheduled Pyzor discussion. Kelson Vibber www.speed.net |
|
From: Keith J. <kja...@cr...> - 2003-05-01 18:46:48
|
> As I stated in the previous mail, pyzor doesn't hash the > entire mail, just > a section of it after removing data that would likely contribute to > non-uniqueness, such as whitespace, urls, and email > addresses. The exact > rules for removing suspicious elements are described in > pyzor.client.DataDigester. So, browsing the source, studying the rules, I can compose a spam mail that will easily defeat this system. So then, not to be rude or putting-down, this is mostly useless. People started blocking key words like 'cock'. So the spammers use 'c0ck'. People start using this system, the spammers will read your source and get around it. So, rather than a real solution to spam, this sounds more like one step in the cat and mouse game. I'm sorry for wasting everyone's time, and my lack of ability to read python source :) > > The cons are that's a lot of data to transmit to the server. If you have an > > email with lots of short words, you are actually sending more data to the > > server than the size of the original message. > > Yes, it is a lot of data, and given the volume of use on the public > server, quite impractical. I disagree. While not as compact and quick, it will catch more spam, which is the primary goal. Google has the whole world cached. I'm sure storing spam emails is not that impractical. Besides, it doesn't have to keep it forever. Spams that were sent out two months ago, are not likely to be looked up, how many people don't check their email in two months? So, keeping two months, and some efficient storage of hashes, such as not duplicating them across spam mail entries, and using compression for server/client communication, I don't think it's really too impractical. I'd give a few more bytes for better spam protection. I just don't think pyzor is going to work for me as it stands now. And if I were writing a mass mailer program that they advertise via mass mail ;) I'd design it to beat this. I do wish you much luck with this project though. 
Anything open source for fighting spam is a Good Thing (tm). Thanks, Keith |
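To make the strip-then-hash idea and the 'c0ck' objection concrete, here is a toy sketch. It is not pyzor's actual algorithm (the real rules live in pyzor.client.DataDigester and are not reproduced here); the regular expressions, the choice of SHA-1 and the name `toy_digest` are all assumptions made for illustration.

```python
import hashlib
import re

# Assumed, simplified patterns; a real digester's rules may differ considerably.
URL_RE = re.compile(r"https?://\S+")
EMAIL_RE = re.compile(r"\S+@\S+")
WHITESPACE_RE = re.compile(r"\s+")


def toy_digest(body):
    """Drop URLs, email addresses and whitespace, then hash what is left."""
    stripped = URL_RE.sub("", body)
    stripped = EMAIL_RE.sub("", stripped)
    stripped = WHITESPACE_RE.sub("", stripped)
    return hashlib.sha1(stripped.encode("utf-8")).hexdigest()


# Two copies of one spam that differ only in URL, address and spacing.
copy_a = "Buy  now at http://a.example/x1 or mail joe@a.example"
copy_b = "Buy now at http://b.example/y2\nor mail  sue@b.example"
# A copy with one character substituted in the remaining text.
copy_c = "Buy n0w at http://a.example/x1 or mail joe@a.example"
```

In this sketch `copy_a` and `copy_b` produce the same digest, while `copy_c` does not, which is the evasion described above: any change to the text that survives stripping yields a different digest.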
|
From: Thomas G. <gu...@th...> - 2003-05-01 18:16:28
|
On Wed, Apr 30, 2003 at 08:58:38AM -0400, Keith Jackson wrote: > I don't want to trust a 'corporation' to manage the servers and > whatnot. So, I looked around. First I found razor. But, it looked > like a wannabe OpenSource Cloudmark. This however looks like a true > Open Source look-alike, which is what I'm looking for. That's why I use it, too. > Now the point of this email.. > > I've been looking around your site for protocol documentation, and there is > none. Please tell me the protocol is open? If truly open source, I should be > able to write a server and/or client that would be compatible with your > server and/or client. And that's precisely what I would like to do. I'm not > going to get into an argument over what languages are better, but for my > purposes, having this in Python just isn't good enough, I intend to write a > C/C++ implementation, maybe an Outlook add-in, etc. Python is very good, really. You can extend it with C/C++. You can even write COM servers or clients (win32). This might help you with the Outlook plugin. > If there is a C/C++ implementation already being worked on somewhere else, I > would be interested in helping on that. If not, I'm going to be starting > one. Why? Are you afraid that Python is too slow? 99% of the time, a better algorithm helps. I guess you can reverse engineer the protocol by looking at the code. Python is very readable. thomas -- Thomas Guettler <gu...@th...> http://www.thomas-guettler.de |