If a message's clues are viewed while it is on the Exchange
server, and compared to those of the same message after it is
moved to a pst file, the clues are not the same. It appears (I
haven't examined this closely yet; can do on request) that
on Exchange the html part of the message is used, and
in the pst, it isn't.
Probably related to this is the problem that moving a
message back and forth between Exchange and a
pst file (showing clues each time) results in an
ever-increasing number of tokens.
It doesn't appear to be the PR_SEARCH_KEY changing:
>>> key1 = "PR_SEARCH_KEY : '\n\x02\xde\xfd7\xf6\xa7A\x93\xfd\xf3\xb1\xfeA\x16\xf9'"
>>> key2 = "PR_SEARCH_KEY : '\n\x02\xde\xfd7\xf6\xa7A\x93\xfd\xf3\xb1\xfeA\x16\xf9'"
>>> key1 == key2
True
Next thing to try? :)
dump_props on a message in a pst.
Logged In: YES
user_id=552329
The dump_props are attached.
If I just move the messages about, doing 'show clues', then
no training takes place. I think my original comment was
wrong - trying now, I get the same number of tokens no
matter how many times I move the message (although the
Exchange count and pst count are different). Anyway, the log
(at verbose=1) doesn't show anything apart from the "already
trained as ham" message.
If I train a message, the log doesn't show much more. pst first:
"""
Training on message 'Re: comparing 2 images' - trained as spam
Saving bayes database with 4637 spam and 410 good messages
-> C:\Documents and Settings\tameyer.MASSEY\Application Data\SpamBayes\default_bayes_database.db
-> C:\Documents and Settings\tameyer.MASSEY\Application Data\SpamBayes\default_message_database.db
Saved databases in 896.138ms
"""
and moving it back to Exchange:
"""
Training on message 'Re: comparing 2 images' - trained as good
Saving bayes database with 4636 spam and 411 good messages
-> C:\Documents and Settings\tameyer.MASSEY\Application Data\SpamBayes\default_bayes_database.db
-> C:\Documents and Settings\tameyer.MASSEY\Application Data\SpamBayes\default_message_database.db
Saved databases in 850.026ms
"""
Does this help?
Logged In: YES
user_id=14198
The underlying bug seems to be
https://sourceforge.net/tracker/index.php?func=detail&aid=798029&group_id=61702&atid=498103
- however, as it looks like we will be almost
"hand-crafting" the HTML of the message, I will leave this
open, as we may still end up with bugs if the html we
generate isn't identical (token-wise) to the MS one.
Logged In: YES
user_id=552329
Are we going to be able to get identical token streams?
Attached are two 'show clues' messages, for the same
message, on a pst and on Exchange. 26 clues for one, and
28 for the other. This is a plain text message.
The extra two clues arise because Exchange converts the
plain text message to html, so the words in the subject also
appear in the body.
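
To illustrate how that could produce exactly two extra tokens, here is a rough sketch. It does not use the real SpamBayes tokenizer: toy_tokens is a hypothetical stand-in, the subject and body text are made up, and the idea that the generated html carries the subject in a <title> element is only an assumption about where the duplicated words come from.

```python
import re

def toy_tokens(subject, body):
    """Hypothetical stand-in for a tokenizer (not the SpamBayes one).

    Subject words get a 'subject:' prefix; body words are used bare.
    """
    words = re.compile(r"[a-z0-9]+")
    tokens = {"subject:" + w for w in words.findall(subject.lower())}
    # Crudely strip tags so only the visible text of an html body is tokenized.
    text = re.sub(r"<[^>]+>", " ", body)
    tokens |= set(words.findall(text.lower()))
    return tokens

# Made-up message content, for illustration only.
subject = "comparing images"
plain_body = "Here is the result."
# Assumption: the html-ised copy repeats the subject inside the markup.
html_body = (
    "<html><head><title>comparing images</title></head>"
    "<body>Here is the result.</body></html>"
)

pst_tokens = toy_tokens(subject, plain_body)
exchange_tokens = toy_tokens(subject, html_body)

# Tokens present only in the html-ised (Exchange) copy.
extra = sorted(exchange_tokens - pst_tokens)
print(len(pst_tokens), len(exchange_tokens), extra)
```

If something like this is what is happening, the two copies can only give identical token streams if both are tokenized from the same representation of the message.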
Plain text message on Exchange
Same message in the pst (no training was done).