Mstor and problems reading large mbox files

2008-07-02
2013-05-01
  • Martin Gregorie

    Martin Gregorie - 2008-07-02

    I am developing a mail archive based round an RDBMS. It uses a non-interactive loader program on a daily basis to load the day's mail traffic into the database. It can either use mstor to read from an mbox file or pick up the messages direct from a POP3 server. If invalid messages are discovered, they are written to mbox file(s) for correction and reinput. This is necessary because JavaMail seems to be more fussy than most MUAs when it comes to header syntax. This part of the system, together with a search engine, is working and working well. I actually use a kludge program to fix syntax I know will trip JavaMail (e.g. some copies of Lookout use ; rather than , as a list separator).
    The kludge generates a fixed mbox file, which is fed to the loader.

    However, I've discovered that the database content upsets the Postgres DB restore program, probably because message bodies can contain single quotes. For this reason and to gain DBMS independence I'm currently building an application-specific backup program: the idea is to write the entire database content to one or more mbox files and then use the loader to reload the database from them. The backup program currently writes the file direct from the database without going through JavaMail. For each message it writes:

    From <sender's address>
    Headers from the database
    CRLF
    message body from the database
    CRLF
    CRLF

    I have to construct a fake envelope header because JavaMail doesn't pass it to me when I'm storing mail in the database. The CRLFCRLF at the end is needed; without it mstor doesn't recognise the start of a new message.

    However, I must be missing a trick, because on a recent test run against my live archive the backup read out 31942 messages (Postgres says this is correct), but when I asked the loader to scan the file without loading anything it found 31961 messages, so somehow it thinks there are an additional 19 messages in there. When I used the loader to load these messages, it loaded 31909 messages and wrote 52 rejects out for analysis. Many of them consist of a line - "From - Sun Jun 29 21:38:23 2008" is typical - followed by what looks like the rest of a message body terminated by a blank line. The other rejects all seem to have magically become base64 encoded, with lengths that are not multiples of 4 bytes, but that's my problem, not yours.

    Accordingly, I have two questions:

    1) Should I be doing anything special to lines in the message body that
       start with "From " to prevent mstor from interpreting them? If so,
       does this only apply to paragraphs that start with 'From '?

    2) Is there any way I could use mstor to write the backup file?
       That would be desirable because it should ensure that the file is
       always formatted correctly. So far I haven't found a way to create a
       Message or MimeMessage that contains only the unmodified headers and message
       body as stored in the database. Again, what have I missed?

    Regards,
    Martin Gregorie

     
    • Ben Fortuna

      Ben Fortuna - 2008-07-03

      Hi Martin,

      I have also noticed how intolerant JavaMail is of invalid headers. This is probably something that should be addressed by the Glassfish/JavaMail devs, but in the meantime I guess we have to find workarounds like yours.

      1) Which version of mstor are you using? Prior to 0.9.11, phantom messages were being identified because "From " lines in nested messages were incorrectly treated as message delimiters. I'm pretty confident this was fixed in 0.9.11 though. According to the mbox format, if the pattern "\nFrom " occurs in a message it should be escaped as "\n>From "; however, if you are using mstor to create the mbox file this should be done automatically. You can open the mbox file in a text editor to verify this is working correctly.

      2) Depending on how your data is stored in the database you could investigate creating an input stream from your data to create the MimeMessage using this constructor:

      MimeMessage(javax.mail.Session session, java.io.InputStream in)

      I think you would just need to concatenate the headers and body of the message for this to work.
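A minimal sketch of the concatenation step, using only the JDK. The class and method names (`RawMessageStream`, `of`, `drain`) are hypothetical, and it assumes the stored header block already ends with a line terminator, so that one extra CRLF yields the blank line separating headers from body. The resulting stream is what would be handed to `MimeMessage(Session, InputStream)`; that call is left out so the example has no JavaMail dependency.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class RawMessageStream {

    private static final byte[] CRLF = {'\r', '\n'};

    /**
     * Returns headers + blank line + body as one stream, without copying
     * or re-encoding either part. Assumes headers end with a line terminator.
     */
    public static InputStream of(byte[] headers, byte[] body) {
        List<InputStream> parts = Arrays.asList(
                new ByteArrayInputStream(headers),
                new ByteArrayInputStream(CRLF),
                new ByteArrayInputStream(body));
        return new SequenceInputStream(Collections.enumeration(parts));
    }

    /** Helper for inspection only: reads the whole stream as ISO-8859-1 text. */
    public static String drain(InputStream in) {
        try (InputStream src = in) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = src.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.ISO_8859_1);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Working on byte arrays rather than Strings keeps the stored message untouched by any byte-to-character translation.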

      regards,
      ben

       
    • Martin Gregorie

      Martin Gregorie - 2008-07-03

      Hi Ben,

      Thanks for your response. I knew there was an escape symbol but couldn't remember what it was and had failed to find it in a quick RFC search.

      I'm currently using mstor 0.9.11. I have not seen false 'From ' recognition with this version during normal operation. The fact that it only tripped on 19 out of 31900 messages shows it's a pretty unusual combination. I'd already noticed that the initial 'From ' is missed unless it's preceded by a blank line (which can be CRLFCRLF or LFLF), hence my suggestion that a paragraph starting with 'From ' might be the trigger, since it's pretty unusual to find that in written English. I'll try a test message containing such a paragraph and will let you know if it's the culprit.

      Thanks for the MimeMessage hint - I'll try that and let you know how it goes. It's an easy change since I have headers and content as separate CLOB-type database fields and am already concatenating them to create the backup file.

      Martin

       
    • Martin Gregorie

      Martin Gregorie - 2008-07-07

      Hi Ben,

      I've just tried writing messages via MimeMessage and mstor - it worked well in a short test, thanks.

      However, I have found a problem with handling message content which contains lines starting with 'From '. I hacked together a file containing a single message whose content includes a paragraph starting with 'From ' and tried to load it into my database. It was parsed as two messages. The first, which was loaded, ended just ahead of this paragraph. The second, which was rejected by the loader for having no date, subject or sender, was sent to the rejects bin with the line starting with 'From ' replaced by an envelope header and followed by the rest of the message.

      I manually edited it to change 'From ' to '>From '. The message was now parsed correctly. When I looked at it in the database the envelope header had been discarded (I expected this) but the '>' was retained in front of 'From ', which I didn't expect.

      Then I made a backup using your suggested method of constructing a MimeMessage and writing it via mstor. This looks good: an envelope header was created and written out, followed by the stored message. The original headers were included without changes, again what I'd hoped for, and the '>From ' was retained as you'd expect.

      Judging by this test I think there's a problem with content lines starting with 'From '. I can't test this further because I can't get the test message stored in the database with the masking '>' removed, so I can't easily see if it would be replaced when the message is backed up via mstor. There is nothing in my code that edits the message content before it's stored and, in fact, this would be technically difficult since I go through a byte array rather than a StringBuffer to avoid byte<->character translation: the database holds the content as a CLOB. In PostgreSQL this is implemented as a BYTEA field which you can't do anything to except store it and retrieve it.

      If you'd like the test message and/or the rejected message etc. please contact me direct so I can send it to you: this not being Bugzilla, I can't see any other way of getting it to you.

      Cheers,
      Martin

       
    • Martin Gregorie

      Martin Gregorie - 2008-07-09

      Further to my last: I sent messages from Evolution with a line starting
      'From ' as the start of a paragraph and in the middle of a paragraph.

      Evolution escaped 'From' with '>' but this escape was not removed by mstor/JavaMail because it's still present in the message stored in my database. mstor/JavaMail correctly stored the message in my archive.

      I caused my archive to send the archived message back to me: the received raw text contains the escaped '>From ' though Evolution suppresses the escape character when it displays the message.

      This suggests to me that my problem is due to some old messages that I initially loaded into the archive. These were saved copies output by Pegasus and may not have escaped 'From ' in the content. Looks like I need to analyse the database content and report back.

      Martin

       
    • Martin Gregorie

      Martin Gregorie - 2008-09-02

      Further to my last, I modified my backup program to use MimeMessage(Session, InputStream) as you suggested. It now works when fed an InputStream containing the concatenated headers and body. Thanks. However, I found I had to apply two fixes to the body before the file was an acceptable input mbox file:

      1) byte strings containing '0x0aFrom ' are replaced with '0x0a>From ' as discussed

      2) trailing newlines in the body are removed and replaced with CRLFCRLF

      My mail loader was unable to parse the file until I added the second fix: I'd written 33,000-odd messages to the file but on input mstor/JavaMail only recognised 12,000, which I traced to the lack of a blank line between the body and the following envelope header. After making sure that the body ended with CRLFCRLF, all 33,000 messages were read, but an additional 52 were found and rejected, mostly due to being runts (no subject, date sent or sender found). I haven't been able to determine what's causing this. I searched the mbox for spurious lines starting with 'From' but didn't find any.
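The two body fixes described above might look roughly like this when done directly on bytes. This is a sketch, not the actual backup program: the class and method names are hypothetical, and it also escapes a 'From ' on the very first line of the body, which goes slightly beyond the '0x0aFrom ' rule as stated.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class MboxBodyFixer {

    private static final byte[] FROM = "From ".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] CRLFCRLF = {'\r', '\n', '\r', '\n'};

    /** Fix 1: insert '>' before every "From " that begins a line. */
    public static byte[] escapeFromLines(byte[] body) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(body.length + 16);
        for (int i = 0; i < body.length; i++) {
            // A line starts at offset 0 or right after an LF (0x0a).
            boolean lineStart = (i == 0) || body[i - 1] == '\n';
            if (lineStart && startsWith(body, i, FROM)) {
                out.write('>');
            }
            out.write(body[i]);
        }
        return out.toByteArray();
    }

    /** Fix 2: drop trailing CR/LF bytes, then end the body with CRLF CRLF. */
    public static byte[] normaliseTrailer(byte[] body) {
        int end = body.length;
        while (end > 0 && (body[end - 1] == '\n' || body[end - 1] == '\r')) {
            end--;
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream(end + CRLFCRLF.length);
        out.write(body, 0, end);
        out.write(CRLFCRLF, 0, CRLFCRLF.length);
        return out.toByteArray();
    }

    private static boolean startsWith(byte[] data, int offset, byte[] prefix) {
        if (offset + prefix.length > data.length) {
            return false;
        }
        for (int j = 0; j < prefix.length; j++) {
            if (data[offset + j] != prefix[j]) {
                return false;
            }
        }
        return true;
    }
}
```

Note that lines already stored as '>From ' are left alone, so a reader cannot tell an escaped line from one that was written that way originally; that limitation is inherent in this style of escaping.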

      Three questions:

      1) Have you any suggestions for getting round this? I'm assuming that the envelope header starts with 'From' followed by a space or tab and is case sensitive. Should I be escaping 'from' as well as 'From'?

      2) Should mstor be checking for a trailing blank line in the body when it writes a message and supplying one if it isn't there?

      3) I've assumed that the newline preceding 'From' is either CRLF or LF. Should I also accept CR?
         

      Regards,
      Martin


       
      • Ben Fortuna

        Ben Fortuna - 2008-09-03

        Hi Martin,

        1) Message delimiters (i.e. From_ lines) are case-sensitive, and SHOULD end with a CRLF (some implementations also allow LF). Because the mbox specification isn't really official (i.e. there's no RFC or something like that), there have been different interpretations of what a From_ line should be, however I'm pretty sure everyone agrees that the "From" part is case-sensitive.

        2) I'm pretty sure mstor is adding the required newlines when appending messages to an mbox file. Have a look at the createFrom_Line() method here:

        http://m2.modularity.net.au/projects/mstor/xref/net/fortuna/mstor/data/MessageAppender.html#174

        3) mstor supports lines ending with either CRLF or LF, so that it can read mbox files generated by an implementation that recognises LF as a line terminator. Here's the actual regular expression used:

        static final Pattern FROM__LINE_PATTERN = Pattern.compile("(\\A|\\n{2}|(\\r\\n){2})^From .*$", Pattern.MULTILINE);
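To see what this pattern does and does not treat as a message boundary, it can be run on its own against a small sample. The wrapper class `FromLineCounter` and its `count` method below are hypothetical; only the pattern is taken from above. This is a sketch for experimenting, not a description of how mstor iterates over a file.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FromLineCounter {

    // Pattern copied verbatim from the post above.
    static final Pattern FROM__LINE_PATTERN = Pattern.compile(
            "(\\A|\\n{2}|(\\r\\n){2})^From .*$", Pattern.MULTILINE);

    /** Counts the places the pattern would treat as the start of a message. */
    public static int count(CharSequence mbox) {
        Matcher m = FROM__LINE_PATTERN.matcher(mbox);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String envelope1 = "From a@example.com Sun Jun 29 21:38:23 2008\nSubject: one\n\n";
        String envelope2 = "\n\nFrom b@example.com Sun Jun 29 21:40:00 2008\nSubject: two\n\nBye\n";

        // "From " in the middle of a paragraph: preceded by a single LF only.
        System.out.println(count(envelope1 + "Hello\nFrom here on it is one paragraph." + envelope2));
        // "From " at the start of a paragraph: preceded by a blank line.
        System.out.println(count(envelope1 + "Hello\n\nFrom here on it is a new paragraph." + envelope2));
        // Same paragraph with the '>' escape applied.
        System.out.println(count(envelope1 + "Hello\n\n>From here on it is a new paragraph." + envelope2));
    }
}
```

With these samples the first and third texts give 2 matches and the second gives 3, which is consistent with the earlier observation that only a 'From ' preceded by a blank line is taken as a new message.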

        regards,
        ben

         
        • Martin Gregorie

          Martin Gregorie - 2008-09-03

          Thanks, Ben.

          I can't see any problems with that, though it leaves me puzzled about why I needed to mess about with the end of the body. Could it be more JavaMail pickiness?

          Ditto with the input parsing: though it's probably more JavaMail fussiness, I can live with an error rate of 0.15%, though it does niggle a bit. My loader writes messages it doesn't like to separate files for manual repair and input - in some of these I can spot the problem, in others not.

          One thing I forgot to mention: there's an issue with a few messages (mostly sent by Outlook) whose main header sets Content-Transfer-Encoding to base64 and which get rejected because the body isn't a multiple of 4 bytes. These either complain about the length or that there are missing Base64 padding characters. Have you run into this?
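One way to triage such rejects is to check the encoded length after stripping line breaks and see whether the body is merely missing its '=' padding. The sketch below uses only `java.util.Base64` from the JDK; the class and method names are hypothetical, it is not how JavaMail decodes, and re-padding is only appropriate on the assumption that the data is complete and just the padding was dropped.

```java
import java.util.Base64;

public class Base64BodyCheck {

    /** Removes all whitespace, including the line breaks between encoded lines. */
    public static String strip(String encoded) {
        return encoded.replaceAll("\\s+", "");
    }

    /**
     * Number of '=' characters needed to reach a multiple of 4,
     * or -1 when the length is 1 mod 4, which no valid encoding produces.
     */
    public static int missingPadding(String encoded) {
        switch (strip(encoded).length() % 4) {
            case 0:  return 0;
            case 2:  return 2;
            case 3:  return 1;
            default: return -1;
        }
    }

    /** Re-pads where possible and decodes; throws if the body cannot be repaired. */
    public static byte[] decodeLeniently(String encoded) {
        String s = strip(encoded);
        int pad = missingPadding(s);
        if (pad < 0) {
            throw new IllegalArgumentException("Length " + s.length() + " cannot be valid base64");
        }
        StringBuilder sb = new StringBuilder(s);
        for (int i = 0; i < pad; i++) {
            sb.append('=');
        }
        return Base64.getDecoder().decode(sb.toString());
    }
}
```

A result of -1 from `missingPadding` points at truncation or stray characters rather than missing padding, so those messages would still need manual repair.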

           
