From: Tony M. <to...@sp...> - 2009-07-28 10:08:51
> $ pyzor predigest < long.txt
> YoureceivedthismessagebecauseyouaretotheGoogleGroupsgroup.
> Toposttothisgroup,sendemailto
> Tofromthisgroup,sendemailto
> Tofromthisgroup,sendemailto
> Formoreoptions,visitthisgroupat

I think this is just evidence of a flaw in the Pyzor specification. The spec uses 3 lines of the message 20% into the content and 3 lines 60% into the content (and then normalises those lines) as the digest material (unless there are fewer than 4 lines, in which case they are all used).

Here there are 11 lines, so Pyzor will choose to use lines 5, 6, 7 and 7, 8, 9 (since the message is short, the two groups of three end up overlapping). Line 9 is one long string of characters, so is normalised away to nothing.

You could change anything in the rest of the message (line 1 (blank), line 2 (actual content), lines 3 or 4 (blank and symbols), or lines 10 or 11 (blank)) and the predigest (and therefore the digest) would not change. Insert an extra line anywhere before line 10 or change lines 5, 6, 7, 8, or 9, and the predigest/digest will change.

There are two things I don't like here:

* I don't like that in short messages the predigest can have some material twice.
* It seems like 6 lines is not a lot to base a unique identifier on, especially in small messages like this.

I presume that the idea is that it makes it hard to insert per-email text into a message and have the digest change. However, (1) all you need to do is insert a newline at the right place and the digest *will* change, and (2) taking only 6 lines seems like it's too conservative and will lead to more false positives.

I'm still relatively new to Pyzor development, so I'm reluctant to change fundamentals like the digest specification. However, I'm not really convinced it's ideal, so it is something I'm looking at (as I have time). I'd like to look at what other digesting methods do (I'm familiar with some already, but looking at more), and run some tests on different specs.
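
The selection rule described above can be modelled in a few lines of Python. This is only a rough sketch, not Pyzor's code: the names `normalise`, `select_indices` and `predigest` are made up for illustration, and the rounding (floor), the 0-based indexing and the 10-character threshold for a "long string" are all assumptions. The exact line numbers it picks may therefore differ from what Pyzor picks for a real message, including the 11-line one discussed here.

```python
import re

# (percent offset into the message, number of lines), as described in the text.
SPEC = ((20, 3), (60, 3))
# Per the description, messages with fewer lines than this are used whole.
MIN_PIECED_LINES = 4

# Assumption: a run of 10+ non-space characters counts as a "long string".
LONG_RUN = re.compile(r"\S{10,}")
WHITESPACE = re.compile(r"\s+")


def normalise(line):
    """Drop long unbroken runs, then all whitespace (simplified model)."""
    return WHITESPACE.sub("", LONG_RUN.sub("", line))


def select_indices(n_lines, spec=SPEC):
    """Return the 0-based line indices used as digest material."""
    if n_lines < MIN_PIECED_LINES:
        return list(range(n_lines))
    picked = []
    for percent, length in spec:
        start = n_lines * percent // 100  # assumed rounding: floor
        picked.extend(i for i in range(start, start + length) if i < n_lines)
    return picked


def predigest(lines):
    """Join the normalised selected lines, skipping ones that end up empty."""
    chosen = (normalise(lines[i]) for i in select_indices(len(lines)))
    return "\n".join(part for part in chosen if part)


message = [
    "unique greeting",
    "line one here",
    "line two here",
    "line three",
    "line four",
    "line five",
]

print(select_indices(len(message)))  # [1, 2, 3, 3, 4, 5]
print(predigest(message))
```

With six lines, the two groups of three share index 3, so "linethree" appears twice in the predigest. Changing `message[0]` leaves the predigest untouched, while prepending one extra line shifts both groups and changes it.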
Since it's such a major change (for a start, it means a separate database), I doubt there will be any changes in the code in the next few months.

> And there is another odd thing. It works for the shortened message, but
> only when using <. [...]

I don't know the answer to this. My guess would be that the cat+| combination is doing something slightly different than just piping in with <. For example, if there were a different number of blank lines at the end, then the total number of lines would be different, and the 20% and 60% positions would be in different places, and so the predigest/digest would change.

Cheers,
Tony
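
As a small illustration of how sensitive the 20% and 60% positions are to the total line count, the sketch below computes the starting positions for an 11-line message and for the same message with one extra blank line at the end. The helper `start_offsets` is hypothetical, and the floor rounding and 0-based indexing are assumptions rather than Pyzor's confirmed behaviour. It does not show that `cat` plus `|` actually delivers different input than `<`; that part remains a guess.

```python
def start_offsets(n_lines, percents=(20, 60)):
    """0-based index of the first line in each group (assumed floor rounding)."""
    return [n_lines * p // 100 for p in percents]


lines = [f"line {i}" for i in range(1, 12)]  # an 11-line message
padded = lines + [""]                        # same message, one more blank line

print(start_offsets(len(lines)))   # [2, 6]
print(start_offsets(len(padded)))  # [2, 7]
```

Under these assumptions, a single trailing blank line moves the second group down by one line, which would be enough to change the predigest.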