From: Tony M. <to...@sp...> - 2009-08-15 20:15:29
> How does Pyzor deal with Non English Spam, for Example Chinese characters or
> other languages which do not use spaces for their sentences? If all
> characters more than 10 characters are removed we are most likely left with
> a very empty body to hash?

Pyzor completely ignores the language. Non-English languages that use whitespace to separate words will generally work fine, although the average word length in other languages is often longer than in English, so normalisation may remove content that would be better left in the digest.

As you suggested, if there is little or no whitespace, as with many Eastern languages, there may be little content left to digest.

This is something that could be considered when looking into the specification (probably early next year). Until then, you can (a) submit a patch, or at least open a ticket, if this is important to you, and/or (b) adjust the normalisation settings in your copy of Pyzor to better match the messages you are trying to identify (of course, you'll need to have multiple sources doing this in order to match with them).

Cheers,
Tony
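
To make the effect described above concrete, here is a minimal sketch of this kind of normalisation. It is not Pyzor's actual code: the function name `normalise`, the pattern names, and the threshold of 10 characters are assumptions chosen for illustration, based only on the behaviour discussed in this thread (long unbroken runs are dropped before hashing).

```python
import re

# Assumed for illustration: any unbroken run of 10 or more non-whitespace
# characters is treated as a "long string" and removed.
LONG_RUN = re.compile(r"\S{10,}")
WHITESPACE = re.compile(r"\s+")


def normalise(text, long_run=LONG_RUN):
    """Drop long unbroken runs, then strip all remaining whitespace."""
    text = long_run.sub("", text)
    return WHITESPACE.sub("", text)


english = "Cheap watches for sale, visit us today"
# Chinese sample with no whitespace at all: the whole body is one long run.
chinese = "便宜的手表现在特价出售欢迎今天光临我们的商店"

print(repr(normalise(english)))  # short words survive
print(repr(normalise(chinese)))  # nothing is left to digest

# Option (b): a looser, locally adjusted threshold keeps the Chinese body.
looser = re.compile(r"\S{40,}")
print(repr(normalise(chinese, looser)))
```

With the default pattern the English sample keeps all of its words, while the Chinese sample is removed entirely because it is a single run with no whitespace. Raising the threshold keeps the content, but digests computed this way would only match those from other sources using the same setting.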