Re: Message Prequalification for Digest

> Running on C#, VB.NET.

You've got both a C# and a Visual Basic implementation?  Couldn't you
just have one and then use its assembly from the other (or any other
.NET) language?

> And on a side note i have
> managed to get Pyzor to partially run on windows by uncommenting out certain
> lines so that it does not throw errors with python26

There are only really two issues with running Pyzor on Windows: it
currently uses signal.alarm to handle timeouts, and it assumes
POSIX-style paths for various files.  The latter is easily fixed (I'll
try to get this for 0.6, although I'm already way behind the time I
wanted to have that done).  The former can be fixed in various ways -
e.g. having platform-specific timeout code, or using threads rather
than signals, or just not having a timeout on platforms without
signal.alarm (leaving handling timeouts to the user).
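
As a rough sketch of the thread-based option (this is not what Pyzor
does today, and call_with_timeout is a made-up name for illustration),
the call can run in a worker thread that the caller waits on for a
limited time:

```python
import threading


def call_with_timeout(func, seconds, *args, **kwargs):
    """Run func(*args, **kwargs); raise TimeoutError after `seconds`.

    Hypothetical helper, not part of Pyzor.  Unlike signal.alarm this
    does not need SIGALRM, but a timed-out worker is not killed; it is
    simply abandoned as a daemon thread.
    """
    outcome = {}

    def worker():
        try:
            outcome["value"] = func(*args, **kwargs)
        except Exception as exc:  # hand the error back to the caller
            outcome["error"] = exc

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(seconds)  # wait at most `seconds` for the worker
    if thread.is_alive():
        raise TimeoutError("no result within %s seconds" % seconds)
    if "error" in outcome:
        raise outcome["error"]
    return outcome["value"]
```

The trade-off is that a timed-out call keeps running in the background
until it returns on its own, which the signal-based approach avoids.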

Can I ask what you're planning to do with your implementation when
it's done?  In particular: are you planning on distributing it?  If
so, then the best solution might be for the Python pyzor to stay
reasonably unfriendly to Windows and just provide links to your
implementation.  (We would also work with you to keep the
implementations reasonably in sync.)

> I have managed to get it down to the basics,  the only thing i cannot find
> an equivalent of how pyzor 'normalizes the html' in .net. I have this regex
> snippet 'html_tag_ptrn = re.compile(r'<.*?>')' in pyzor but using the same
> snippet does not produce the desired results. Any idea?

That regular expression matches anything (other than newlines) within
angle brackets, including <> (i.e. nothing between the brackets).  The
*? makes the match non-greedy, which means it'll stop at the first >
rather than the last.  Again, this is a very crude expression that
will catch things like <this> as well as real tags.  It also
completely ignores the MIME type of the message, so it runs on both
text/plain and text/html.
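
To see what the pattern does in practice, here is a small sketch.  The
sample string is made up, and replacing each match with an empty
string is an assumption about how the pattern gets applied:

```python
import re

html_tag_ptrn = re.compile(r'<.*?>')  # the pattern quoted above
greedy_ptrn = re.compile(r'<.*>')     # same without the ?, for comparison

sample = "<b>bold</b> and <this> and <> and a < b"

# Non-greedy: each match ends at the nearest >, so <b>, </b>, <this>
# and <> are removed one by one.  The lone "<" has no closing > and stays.
stripped = html_tag_ptrn.sub('', sample)

# Greedy: a single match runs from the first < to the last > on the line.
greedy_stripped = greedy_ptrn.sub('', sample)

# "." does not match a newline, so a tag split across lines is left alone.
multiline = html_tag_ptrn.sub('', "<a\nb>")
```

If the .NET port gives different results, the first things I'd compare
are the greedy/non-greedy behaviour and how newlines are treated.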

> So far this is what i have done as best as i can understand
>
> 1. Removes(any) 'words' (sequences of characters separated by whitespace)
> that are 10 or more characters
> 2. Remove anything that is so long it that it looks like a unique identifier

1 & 2 are the same thing, really.  i.e. 2 is done by doing 1.

> 3. Removes anything that looks like an email address

Yes.  This, like the URL regex, isn't crafted amazingly well.  "looks
like an email address" just means any non-whitespace characters that
surround an "@".  I suppose it's good enough and doesn't really affect
the uniqueness much, but it's not the regex I would choose.
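
Going by that description, the pattern amounts to something like
\S+@\S+ (written here from the description, not copied from the Pyzor
source).  A quick sketch with made-up input shows how loosely it
matches:

```python
import re

# Assumed from the description: non-whitespace on both sides of an "@".
email_ptrn = re.compile(r'\S+@\S+')

text = "mail bob@example.com, or see a@b and x @ y"

# Trailing punctuation is swallowed, "a@b" counts as an address, and a
# bare "@" surrounded by spaces is not matched at all.
matches = email_ptrn.findall(text)
cleaned = email_ptrn.sub('', text)
```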

> 4. Removes anything that looks like a URL.

Yes.  This regex is worse than the email one.  When we get to
re-examining the specification, I'd like to change this to something
more accurate.  At the moment, it's any sequence of lower-case letters
followed by a colon and then a sequence of non-whitespace characters.
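
Written out from that description (again, not copied from the source),
the pattern would be roughly [a-z]+:\S+.  A small sketch with made-up
input shows why it is inaccurate:

```python
import re

# Assumed from the description: lower-case letters, a colon, then any
# run of non-whitespace characters.
url_ptrn = re.compile(r'[a-z]+:\S+')

text = "see http://example.com and note:this at 10:30"

# The real URL matches, but so does "note:this".  "10:30" is skipped
# only because no lower-case letter precedes the colon.
matches = url_ptrn.findall(text)
```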

> 6. Removes any whitespace.
> 7. Discards any lines that are fewer than 8 characters in length.

Yes.
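
Putting the steps above together, a minimal sketch of the whole
prequalification might look like the following.  The patterns are
reconstructed from the descriptions in this thread.  The function
names, the order of the substitutions, and the choice to check the
line length after normalising are all assumptions, so compare against
the Pyzor source before relying on it:

```python
import re

# Patterns reconstructed from the descriptions in this thread.
LONG_WORD = re.compile(r'\S{10,}')  # "words" of 10 or more characters
EMAIL = re.compile(r'\S+@\S+')      # anything around an "@"
URL = re.compile(r'[a-z]+:\S+')     # lower-case letters, colon, non-space
HTML_TAG = re.compile(r'<.*?>')     # anything in angle brackets
WHITESPACE = re.compile(r'\s')
MIN_LINE_LENGTH = 8                 # shorter lines are discarded


def normalize(line):
    """Strip long words, addresses, URLs, tags and whitespace (assumed order)."""
    for pattern in (LONG_WORD, EMAIL, URL, HTML_TAG, WHITESPACE):
        line = pattern.sub('', line)
    return line


def prequalify(text):
    """Return the normalised lines that are long enough to be digested."""
    kept = []
    for line in text.splitlines():
        line = normalize(line)
        if len(line) >= MIN_LINE_LENGTH:
            kept.append(line)
    return kept
```

For example, a body containing the lines "Hello there my friend, buy
now", "short" and "Visit http://example.com/offer today please" keeps
the first and third lines (with whitespace and the URL stripped) and
drops the second.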

> 8. Removes extra lines

What do you mean by this?

Cheers,
Tony