Re: strange false positive

> $ pyzor predigest < long.txt
> YoureceivedthismessagebecauseyouaretotheGoogleGroupsgroup.
> Toposttothisgroup,sendemailto
> Tofromthisgroup,sendemailto
> Tofromthisgroup,sendemailto
> Formoreoptions,visitthisgroupat

I think this is just evidence of a flaw in the Pyzor specification.
The spec takes 3 lines starting 20% of the way into the content and 3
lines starting 60% of the way in, and then normalises those lines to
form the digest material (unless there are fewer than 4 lines, in
which case they are all used).

Here there are 11 lines, so Pyzor will choose to use lines 5, 6, 7 and
7, 8, 9 (since the message is short, the two groups of three end up
overlapping).  Line 9 is one long string of characters, so is
normalised away to nothing.
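
To make the overlap concrete, here is a rough sketch of the selection
step as I've described it above.  It is not the Pyzor source: the
function name, the integer rounding, and doing the selection before
any normalisation are all assumptions, so the exact line numbers it
picks may differ from what Pyzor picks.

```python
def select_lines(lines, spec=((20, 3), (60, 3)), atomic=4):
    """Pick the digest material from a list of lines (illustrative only)."""
    # Assumption: messages with fewer than `atomic` lines are used whole.
    if len(lines) < atomic:
        return list(lines)
    picked = []
    for percent, count in spec:
        # Assumption: the offset is rounded down to a whole line index.
        start = len(lines) * percent // 100
        # Slicing simply stops at the end of a short message.
        picked.extend(lines[start:start + count])
    return picked


message = ["line %d" % n for n in range(1, 6)]   # a 5-line message
print(select_lines(message))
# ['line 2', 'line 3', 'line 4', 'line 4', 'line 5']
```

With only five lines the two windows collide, so "line 4" lands in the
digest material twice.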

You could change anything in the rest of the message (line 1, which is
blank; line 2, the actual content; lines 3 and 4, blank and symbols;
or lines 10 and 11, blank) and the predigest (and therefore the
digest) would not change.  Insert an extra line anywhere before line
10, or change any of lines 5 to 9, and the predigest/digest will
change.
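
A small experiment along these lines, again as a sketch rather than
the real thing: `pick` and `digest` are made-up helpers, the rounding
is assumed, normalisation is skipped entirely, and hashing the joined
lines with SHA-1 is only a stand-in for the real digest step.  I've
used a 20-line dummy message so the window positions are easy to see.

```python
import hashlib


def pick(lines, spec=((20, 3), (60, 3))):
    # Hypothetical helper: take `count` lines from each percentage offset.
    out = []
    for percent, count in spec:
        start = len(lines) * percent // 100
        out.extend(lines[start:start + count])
    return out


def digest(lines):
    # Stand-in digest: hash whatever the windows selected.
    return hashlib.sha1("".join(pick(lines)).encode("utf-8")).hexdigest()


original = ["line %d" % n for n in range(1, 21)]   # 20 lines

edited = list(original)
edited[0] = "completely different text"   # outside both windows

shifted = ["extra"] + original            # one line inserted at the top

print(digest(original) == digest(edited))    # True: the edit is invisible
print(digest(original) == digest(shifted))   # False: the windows moved
```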

There are two things I don't like here:

 * I don't like that in short messages the predigest can have some
material twice.

 * It seems like 6 lines is not a lot to base a unique identifier on,
especially in small messages like this.  I presume that the idea is
that it makes it hard to insert per-email text into a message and have
the digest change.  However, (1) all you need to do is insert a
newline at the right place and the digest *will* change, and (2)
taking only 6 lines seems like it's too conservative and will lead to
more false positives.

I'm still relatively new to Pyzor development, so I'm reluctant to
change fundamentals like the digest specification.  However, I'm not
really convinced it's ideal, so it is something I'm looking at (as I
have time).  I'd like to look at what other digesting methods do (I'm
familiar with some already, but want to look at more), and run some
tests on different specs.  Since it's such a major change (for a start, it
means a separate database) I doubt there will be any changes in the
code in the next few months.

> And there is another odd thing. It works for the shortened message, but
> only when using <.
[...]

I don't know the answer to this.  My guess would be that the cat+|
combination is doing something slightly different from just
redirecting with <.  For example, if there were a different number of blank lines
at the end, then the total number of lines would be different, and the
20% and 60% positions would be in different places, and so the
predigest/digest would change.
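
To show how sensitive the positions are, here is a sketch with the
same caveats as before: `window_starts` is a made-up name and the
rounding is an assumption.  It only illustrates the effect of one
extra trailing blank line, and says nothing about whether cat is
actually what introduces the difference.

```python
def window_starts(text, percents=(20, 60)):
    # Hypothetical helper: index of the first line of each window.
    lines = text.splitlines()
    return [len(lines) * p // 100 for p in percents]


body = "\n".join("line %d" % n for n in range(1, 10)) + "\n"   # 9 lines

print(window_starts(body))          # [1, 5]
print(window_starts(body + "\n"))   # [2, 6]: one more blank line at the end
```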

Cheers,
Tony