From: Matt <ma...@gm...> - 2009-08-14 07:08:02
Ok, thanks for the tip. I'm porting to C# and VB.NET.

On a side note, I have managed to get Pyzor to partially run on Windows by commenting out certain lines so that it does not throw errors with python26, which got it down to the basics. The only thing I cannot find is a .NET equivalent of how Pyzor 'normalizes the HTML'. Pyzor has this regex snippet:

html_tag_ptrn = re.compile(r'<.*?>')

but using the same pattern in .NET does not produce the desired results. Any idea?

So far, this is what I have done, as best I can understand it:

1. Remove any 'words' (sequences of characters separated by whitespace) that are 10 or more characters long.
2. Remove anything so long that it looks like a unique identifier.
3. Remove anything that looks like an email address.
4. Remove anything that looks like a URL.
5. Remove anything that looks like an HTML tag. (STUCK HERE!)
6. Remove any whitespace.
7. Discard any lines that are fewer than 8 characters in length.
8. Remove extra lines.

Then apply the following rules:

If the message is more than 4 lines long:
  - Discard the first 20% of the message, then grab the next 3 lines.
  - Skip ahead to 60% of the message, then grab the next 3 lines.
  - Discard the remainder of the message.

If the message is less than 4 lines, use the entire body.

Am I missing anything else?

On Fri, Aug 14, 2009 at 1:02 PM, Tony Meyer <to...@sp...> wrote:
> > During my limited attempts to port the basic check routines over to .NET
>
> Which .NET language are you porting to?
>
> > i noticed that there are no minimum requirements before the hash is
> > calculated.
>
> That's not completely correct. There has to be at least one line
> whose normalised length is 8 characters or more, otherwise there are
> no offsets, and no digest. A message with very little text will have
> a completely different digest to a different message with very little
> text.
>
> However, the basic point is correct - the smaller the message, the
> less unique the hash is. As I've indicated previously, I think the
> digest specification needs re-examining, but I don't think it's
> something that I should or will get to this year.
>
> Cheers,
> Tony
>
> _______________________________________________
> pyzor-users mailing list
> pyz...@li...
> https://lists.sourceforge.net/lists/listinfo/pyzor-users
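
For anyone following the port, here is a minimal Python sketch of the steps listed in this message. It is not Pyzor's actual code: apart from html_tag_ptrn, which is the snippet quoted above, every pattern, constant and function name (normalize, select_lines, DIGEST_SPEC and so on) is an assumption made for illustration. It also assumes that the 20% and 60% figures are offsets into the lines that survive normalization, and that a message of exactly 4 lines is used whole. One thing worth checking in any port: in Python, '.' does not match a newline unless re.DOTALL is set, so '<.*?>' only strips tags that open and close on the same line.

```python
import re

# All patterns except html_tag_ptrn are illustrative guesses based on the
# steps described in this thread, not copies of Pyzor's source.
longstr_ptrn = re.compile(r'\S{10,}')                 # 'words' of 10+ characters
email_ptrn = re.compile(r'\S+@\S+')                   # rough email match
url_ptrn = re.compile(r'[a-z]+:\S+', re.IGNORECASE)   # rough URL match
html_tag_ptrn = re.compile(r'<.*?>')                  # snippet quoted in the thread
ws_ptrn = re.compile(r'\s')                           # any whitespace character

MIN_LINE_LENGTH = 8                # shorter normalized lines are discarded
ATOMIC_NUM_LINES = 4               # assumed: this many lines or fewer -> use all
DIGEST_SPEC = [(20, 3), (60, 3)]   # assumed: (offset in percent, lines to take)


def normalize(line):
    """Apply the removal steps in the order listed, whitespace last."""
    for ptrn in (longstr_ptrn, email_ptrn, url_ptrn, html_tag_ptrn):
        line = ptrn.sub('', line)
    return ws_ptrn.sub('', line)


def select_lines(body):
    """Return the normalized lines that would be fed to the hash."""
    lines = [normalize(raw) for raw in body.splitlines()]
    lines = [ln for ln in lines if len(ln) >= MIN_LINE_LENGTH]
    if len(lines) <= ATOMIC_NUM_LINES:
        return lines
    picked = []
    for offset_pct, count in DIGEST_SPEC:
        start = offset_pct * len(lines) // 100
        picked.extend(lines[start:start + count])
    return picked
```

Calling select_lines(body) on a message body returns the lines that survive normalization and sampling; the hashing step itself is left out. Because the long-word pattern runs first, any tag, address or URL that forms a run of 10 or more non-space characters is already gone before the later patterns see it.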