I just downloaded mstor 0.9.9 and discovered that it sometimes gets confused about where a new message starts. mstor sees extra messages that have null headers, so its message count does not agree with Thunderbird or with my own little mbox reader program. I found that if I change the FROM__PATTERN to "From - ", as Thunderbird uses, getMessageCount returns a different (larger) count. That is counterintuitive: I would expect it to be less than or equal to the count obtained with "From " as the FROM__PATTERN, since anything that starts with "From - " also starts with "From ". My first thought was that it was picking up some regular content as the beginning of a new message. I then changed the default buffer size to 1024*1024 and got the right number of messages, i.e. no null messages. My instinct now says that the algorithm that looks for patterns across buffer reads is broken.
Any hints as to what might be going on? Has anyone else seen this? My email files are very big: 150MB and 287MB.
I'll see if I can debug this, but thought I'd ask first. I haven't downloaded from CVS, so I can't say if that exhibits the same behavior. FYI, my free time is very limited for the next few weeks, so don't expect anything real soon.
Thanks in advance for any help/advice on this!
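To make the suspicion about buffer boundaries concrete, here is a minimal sketch of a separator counter whose result does not depend on the buffer size. It is not mstor's code: the class `FromLineCounter` and its methods are hypothetical names, and it assumes the separator only counts when it appears at the very start of a line. The match state is carried over from one read to the next, so a separator split across two reads is still counted exactly once.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of mstor: counts lines that begin with the
// given separator, even when the separator straddles two buffer reads.
class FromLineCounter {

    static int count(InputStream in, String separator, int bufferSize) throws IOException {
        if (separator.isEmpty() || bufferSize <= 0) {
            throw new IllegalArgumentException("need a non-empty separator and a positive buffer size");
        }
        byte[] sep = separator.getBytes(StandardCharsets.US_ASCII);
        byte[] buffer = new byte[bufferSize];
        int matched = 0;          // separator bytes matched so far on the current line
        boolean candidate = true; // false once the current line can no longer match
        int messages = 0;
        int read;
        while ((read = in.read(buffer)) != -1) {
            // 'matched' and 'candidate' survive across reads on purpose
            for (int i = 0; i < read; i++) {
                byte b = buffer[i];
                if (b == '\n') {
                    // the next byte starts a new line, so start matching again
                    candidate = true;
                    matched = 0;
                } else if (candidate) {
                    if (b == sep[matched]) {
                        matched++;
                        if (matched == sep.length) {
                            messages++;
                            candidate = false; // count each line at most once
                        }
                    } else {
                        candidate = false;
                    }
                }
            }
        }
        return messages;
    }

    // Convenience wrapper for in-memory data, so callers need no checked exception.
    static int countBytes(byte[] data, String separator, int bufferSize) {
        try {
            return count(new ByteArrayInputStream(data), separator, bufferSize);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Running something like this over the same mbox file with several buffer sizes should give the same count every time. If a reader's count changes with the buffer size, as described above, that points at state being lost between reads.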
Hi Rich,
I've just applied a few fixes to the regex used in the From_ patterns and also increased the default buffer size to 8192. Results are looking promising so far; however, it may still require some more tweaking (I am not very confident in my regex abilities). :)
regards,
ben
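For anyone following along, here is a rough sketch of what a stricter From_ line check could look like. This is not the regex that went into mstor. It assumes the separator line has the form "From", a sender token (Thunderbird writes "-"), and an asctime-style date; mbox files that add a time zone or other trailing fields would need a looser pattern. The class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not mstor's actual pattern: accepts only lines shaped like
// "From <sender> Www Mmm dd hh:mm:ss yyyy" (an assumed asctime-style date).
class FromLinePattern {

    private static final Pattern FROM_LINE = Pattern.compile(
            "From \\S+ +[A-Z][a-z]{2} [A-Z][a-z]{2} [ \\d]\\d \\d{2}:\\d{2}:\\d{2} \\d{4}\\s*");

    // Expects a single line without its line terminator.
    static boolean isFromLine(String line) {
        return FROM_LINE.matcher(line).matches();
    }
}
```

Requiring the date after the sender token means body text that merely begins with "From " is much less likely to be taken for the start of a new message.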