From: Tony M. <to...@sp...> - 2009-08-16 03:15:42
> Running on C#, VB.NET.

You've got both a C# and a Visual Basic implementation? Couldn't you just have one and then use its assembly from the other (or any .NET) language?

> And on a side note i have
> managed to get Pyzor to partially run on windows by uncommenting out certain
> lines so that it does not throw errors with python26

There are only really two issues with running Pyzor on Windows: it currently uses signal.alarm to handle timeouts, and it assumes POSIX-style paths for various files. The latter is easily fixed (I'll try to get this done for 0.6, although I'm already way behind the time I wanted to have that done). The former can be fixed in various ways - e.g. having platform-specific timeout code, or using threads rather than signals, or just not having a timeout on platforms without signal.alarm (leaving timeout handling to the user).

Can I ask what you're planning to do with your implementation when it's done? In particular: are you planning on distributing it? If so, then the best solution might be for the Python pyzor to stay reasonably unfriendly to Windows and just provide links to your implementation (and for us to work with you to make sure that the implementations stay reasonably in sync).

> I have managed to get it down to the basics, the only thing i cannot find
> an equivalent of how pyzor 'normalizes the html' in .net. I have this regex
> snippet 'html_tag_ptrn = re.compile(r'<.*?>')' in pyzor but using the same
> snippet does not produce the desired results. Any idea?

That regular expression matches anything (other than newlines) within angle brackets, including <> (i.e. nothing between the brackets). The *? makes it a non-greedy match, which means it'll stop at the first >, rather than the last. This is a very crude expression that will catch things like <this> as well as real tags. It also completely ignores the MIME type of the message, so it runs on both text/plain and text/html.
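
To make the behaviour of that pattern concrete, here is a minimal Python sketch that may help when comparing against a .NET port. Only the pattern itself comes from pyzor; the helper name strip_tags, the greedy comparison pattern, and the choice to replace matches with an empty string are assumptions made for illustration.

```python
import re

# The pattern quoted from pyzor above.
html_tag_ptrn = re.compile(r'<.*?>')

# Greedy variant, shown only for comparison (not part of pyzor).
greedy_ptrn = re.compile(r'<.*>')


def strip_tags(text, pattern=html_tag_ptrn):
    """Hypothetical helper: drop every match of the pattern.

    Replacing with '' is an assumption for this sketch.
    """
    return pattern.sub('', text)


# Non-greedy: each tag is matched separately.
print(repr(strip_tags('<b>bold</b> text')))               # 'bold text'
# Greedy: runs from the first '<' to the last '>' on the line.
print(repr(strip_tags('<b>bold</b> text', greedy_ptrn)))  # ' text'
# Crude: '<this>' and the empty '<>' are removed as well.
print(repr(strip_tags('a <this> b <> c')))                # 'a  b  c'
# '.' does not match a newline, so a tag split across lines survives.
print(repr(strip_tags('<a\nhref="x">link</a>')))          # '<a\nhref="x">link'
```

In .NET, "." also excludes newlines by default and only matches them when RegexOptions.Singleline is set, so that option is one thing worth checking if the two implementations disagree.
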
> So far this is what i have done as best as i can understand
>
> 1. Removes(any) 'words' (sequences of characters separated by whitespace)
> that are 10 or more characters
> 2. Remove anything that is so long it that it looks like a unique identifier

1 & 2 are the same thing, really, i.e. 2 is done by doing 1.

> 3. Removes anything that looks like an email address

Yes. This, like the URL regex, isn't crafted amazingly well. "Looks like an email address" just means any non-whitespace characters that surround an "@". I suppose it's good enough and doesn't really affect the uniqueness much, but it's not the regex I would choose.

> 4. Removes anything that looks like a URL.

Yes. This regex is worse than the email one. When we get to re-examining the specification, I'd like to change this to something more accurate. At the moment, it matches any sequence of lower-case letters followed by a colon and then a sequence of non-whitespace characters.

> 6. Removes any whitespace.
> 7. Discards any lines that are fewer than 8 characters in length.

Yes.

> 8. Removes extra lines

What do you mean by this?

Cheers,
Tony
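
For reference, the steps confirmed above can be sketched in Python roughly as follows. This is a reconstruction from the descriptions in this message, not a copy of the pyzor source: the regular expressions other than the tag pattern, the order in which they are applied, the per-line processing, and all names (normalize_line, normalize, MIN_LINE_LENGTH) are assumptions.

```python
import re

# Patterns written from the prose descriptions above (assumed, not pyzor's own).
LONG_WORD = re.compile(r'\S{10,}')     # steps 1/2: 10+ non-whitespace characters
EMAIL_LIKE = re.compile(r'\S+@\S+')    # step 3: non-whitespace around an "@"
URL_LIKE = re.compile(r'[a-z]+:\S+')   # step 4: lower-case letters, ":", non-whitespace
TAG_LIKE = re.compile(r'<.*?>')        # the tag pattern quoted from pyzor
WHITESPACE = re.compile(r'\s')         # step 6

MIN_LINE_LENGTH = 8                    # step 7


def normalize_line(line):
    """Apply each removal in turn; the order here is an assumption."""
    for pattern in (LONG_WORD, EMAIL_LIKE, URL_LIKE, TAG_LIKE, WHITESPACE):
        line = pattern.sub('', line)
    return line


def normalize(text):
    """Return the normalized lines that are at least MIN_LINE_LENGTH long."""
    kept = []
    for line in text.splitlines():
        line = normalize_line(line)
        if len(line) >= MIN_LINE_LENGTH:
            kept.append(line)
    return kept


sample = "mail a@b.co or see ftp:x <i>hi</i> there\nshort\ntoken 0123456789abcdef end"
print(normalize(sample))  # ['mailorseehithere', 'tokenend']
```

One consequence of doing the long-word removal first is that most real URLs and email addresses are already gone before their own patterns run, since they are usually 10 or more characters long.
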