Re: Non English Spam

> How does Pyzor deal with Non English Spam, for Example Chinese characters or
> other languages which do not use spaces for their sentences? If all
> characters more than 10 characters are removed we are most likely left with
> a very empty body to hash?

Pyzor completely ignores the language.  Non-English languages that do
use whitespace to separate words will generally work fine, although
the average word length in other languages is often longer than in
English, so normalisation may remove content that would be better left
in the digest.  As you suggested, if there is little or no whitespace,
as with many Eastern languages, there may be little content left to
digest.
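
To illustrate the effect, here is a minimal sketch in Python.  It is not
Pyzor's actual normalisation code: the function name drop_long_words and
the constant MAX_WORD_LEN are made up for this example, and the limit of
10 characters is simply taken from the question above.  It only shows
why text without whitespace can end up empty after such a step.

```python
# Hypothetical limit, taken from the question above; not read from Pyzor.
MAX_WORD_LEN = 10


def drop_long_words(text, max_len=MAX_WORD_LEN):
    """Drop every whitespace-delimited run longer than max_len characters."""
    return " ".join(word for word in text.split() if len(word) <= max_len)


english = "Buy cheap watches now and save money today"
# A sentence with no spaces, so the whole line counts as a single "word".
chinese = "现在购买便宜的手表可以节省很多钱不要错过这个机会"

print(repr(drop_long_words(english)))  # every word is short, so all survive
print(repr(drop_long_words(chinese)))  # one 24-character run, so it is dropped
```

With a larger limit (for example drop_long_words(chinese, max_len=40))
the Chinese line is kept, which is the kind of adjustment a modified
normalisation step would make.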

This is something that could be considered when looking into the
specification (probably early next year).  Until then, you can (a)
submit a patch - or at least open a ticket - if this is important to
you, and/or (b) adjust the normalisation settings in your copy of
Pyzor to better match the messages you are trying to identify (of
course, digests only match when they are computed the same way, so
you'll need multiple sources using the same settings for this to be
useful).

Cheers,
Tony