From: Antoni M. <ant...@gm...> - 2008-02-25 15:55:08
I've been working with the MboxCrawler. There are two issues with it.

1. The mbox files from the Ubuntu mailing list archives (brought up by Jose) don't work with the crawler because they use the ' at ' string instead of the proper @ sign, for spam avoidance. This breaks because the DataObjectFactory uses the getFrom() method, which tries to convert that string into a javax.mail.Address instance, and that conversion fails. We'd need to rewrite the DataObjectFactory to work with

String[] getHeader(String id)

and perform the conversion ourselves. This shouldn't be too difficult.

2. I don't quite understand the mapping between the Message structure and the list of data objects. I noticed it when the validator started complaining. It turns out that each part in a multipart email is translated into a separate data object. These data objects don't have any types (only the first one, i.e. the message itself, and the attachments have proper types). What should we do with them? The validator rejects them, and there is no proper class for a message part in NMO (not yet, at least). I'd go for adding a MessagePart or MimePart class to NMO (a subclass of InformationElement). What do you think?

All kinds of comments welcome.

--
Antoni Myłka
ant...@gm...
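The conversion described in point 1 could look roughly like the sketch below. It is only an illustration, not the crawler's actual code: the class name ObfuscatedAddressHeuristic, both method names and the regular expression are assumptions made here. It works on the raw strings that getHeader(String) returns and leaves any value that already contains an '@' untouched.

```java
import java.util.regex.Pattern;

/** Hypothetical helper: restores '@' in headers obfuscated as "user at host.tld". */
final class ObfuscatedAddressHeuristic {

    // Assumed shape of an obfuscated address: local part, " at ", dotted domain.
    // Requiring a dot in the domain avoids rewriting phrases such as "see you at noon".
    private static final Pattern OBFUSCATED = Pattern.compile(
            "([A-Za-z0-9._%+-]+) at ([A-Za-z0-9-]+(?:\\.[A-Za-z0-9-]+)+)");

    private ObfuscatedAddressHeuristic() {
    }

    /** Rewrites " at " to "@" only when the header value contains no '@' at all. */
    static String restoreAtSign(String headerValue) {
        if (headerValue == null || headerValue.indexOf('@') >= 0) {
            return headerValue; // already well-formed (or absent): do not touch it
        }
        return OBFUSCATED.matcher(headerValue).replaceAll("$1@$2");
    }

    /** Applies the heuristic to every value of a header; a null array yields an empty one. */
    static String[] restoreAll(String[] headerValues) {
        if (headerValues == null) {
            return new String[0];
        }
        String[] restored = new String[headerValues.length];
        for (int i = 0; i < headerValues.length; i++) {
            restored[i] = restoreAtSign(headerValues[i]);
        }
        return restored;
    }
}
```

The restored strings could then be handed to whatever address parser the DataObjectFactory ends up using; that wiring is left out because it depends on the factory's internals.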
From: Leo S. <leo...@df...> - 2008-02-25 16:09:26
It was Antoni Mylka who said at the right time 25.02.2008 16:54 the following words:
> I've been working with the MboxCrawler. There are two issues with it.
>
> [...]
>
> 2. I don't quite understand the mapping between the Message structure
> and the list of data objects. I noticed it when the validator started
> complaining. It turns out that each part in a multipart email is
> translated into a separate data object. These data objects don't have
> any types (only the first, the message itself, and the attachments
> have proper types). What to do with them? The validator won't stand
> them and there is no proper class for a message part in NMO (yet at
> least). I'd go for adding a MessagePart or MimePart class in NMO (a
> subclass of InformationElement). What do you think?

I wonder whether they are DataObjects at all... but I am OK with you creating them as InformationElements, go ahead. MessagePart and MimePart are both fine; MimePart sounds more like the RFC.

Regards,
Leo

--
____________________________________________________
DI Leo Sauermann       http://www.dfki.de/~sauermann

Deutsches Forschungszentrum fuer Kuenstliche Intelligenz DFKI GmbH
Trippstadter Strasse 122
P.O. Box 2080          Fon: +49 631 20575-116
D-67663 Kaiserslautern Fax: +49 631 20575-102
Germany                Mail: leo...@df...

Geschaeftsfuehrung:
Prof.Dr.Dr.h.c.mult. Wolfgang Wahlster (Vorsitzender)
Dr. Walter Olthoff
Vorsitzender des Aufsichtsrats:
Prof. Dr. h.c. Hans A. Aukes
Amtsgericht Kaiserslautern, HRB 2313
____________________________________________________
From: Christiaan F. <chr...@ad...> - 2008-02-25 20:52:31
Antoni Mylka wrote:
> 1. The mbox files from the ubuntu mailing list archives - brought up
> by Jose don't work with the crawler because they use the ' at ' string
> instead of the proper @ sign - for spam avoidance. This breaks because
> the DataObjectFactory uses the getFrom() method which tries to convert
> that string into a javax.mail.Address instance - this fails obviously.
> We'd need to rewrite the DataObjectFactory to work with
>
> String [] getHeader(String id)
>
> .. and perform the conversion ourselves. This shouldn't be too difficult.

This will probably work, yes. It will probably violate some spec as well, but as long as you implement it as a heuristic that is applied only when there is no '@' sign present in the address, it will not ruin the behavior on mails that do follow the specs.

I suspect that the mbox JavaMail provider is at fault. The Address Javadoc says "This class represents an Internet email address using the syntax of RFC822". RFC 822 says [1]:

    C.5.5. AT-SIGN

    The string " at " no longer is used as an address delimiter.
    Only at-sign ("@") serves the function.

This indicates that in certain older formats, things were different. Perhaps it's legal to encode addresses this way in mbox. Nevertheless, you should not need to know this; getFrom should just return a correctly parsed address.

I recommend that you look at the source code of the getFrom method though, in order to catch all address format variants. Note that the From header does not contain only a single email address, but a list of one or more senders, each having an optional name and a mandatory address.

An alternative is to catch the exception that occurs when you try to parse the header with the regular getFrom method and only then perform the alternative parsing.

> 2. I don't quite understand the mapping between the Message structure
> and the list of data objects. I noticed it when the validator started
> complaining. It turns out that each part in a multipart email is
> translated into a separate data object. These data objects don't have
> any types (only the first, the message itself, and the attachments
> have proper types). What to do with them? The validator won't stand
> them and there is no proper class for a message part in NMO (yet at
> least). I'd go for adding a MessagePart or MimePart class in NMO (a
> subclass of InformationElement). What do you think?

Can you elaborate on the DataObjects that are produced and that do not correspond to the entire mail or to attachments? I can't think of any. I do know that there are message parts for which no DataObject representation is created; perhaps you mean those?

[1] http://www.faqs.org/rfcs/rfc822.html

Regards,

Chris
--
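The alternative from the reply above (try the regular parsing first, and fall back to the ' at ' heuristic only when that fails and no '@' is present) can be sketched as follows. This is a sketch under stated assumptions, not the JavaMail API: StrictParser and StrictFirstAddressParser are hypothetical names, the strict parser is passed in as a stand-in for whatever getFrom() does internally, and an unchecked exception is used only to keep the example short.

```java
/** Hypothetical sketch: strict parsing first, ' at ' repair only as a fallback. */
final class StrictFirstAddressParser {

    /** Stand-in for the regular, spec-following parser (an assumption, not a JavaMail type). */
    interface StrictParser {
        String[] parse(String headerValue); // expected to throw IllegalArgumentException on bad input
    }

    private StrictFirstAddressParser() {
    }

    static String[] parseWithFallback(String rawHeader, StrictParser strict) {
        try {
            // Well-formed mail takes this path, so its behavior is unchanged.
            return strict.parse(rawHeader);
        } catch (IllegalArgumentException strictFailure) {
            if (rawHeader.indexOf('@') >= 0) {
                throw strictFailure; // broken for some other reason: do not guess
            }
            // Heuristic repair: "user at host.tld" becomes "user@host.tld".
            String repaired = rawHeader.replaceAll("(\\S+) at (\\S+\\.\\S+)", "$1@$2");
            if (repaired.equals(rawHeader)) {
                throw strictFailure; // nothing to repair
            }
            return strict.parse(repaired);
        }
    }
}
```

Keeping the strict attempt first means the heuristic can never change the result for mail that already follows the spec, which is the property the reply asks for.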
From: Jose G. L. <jg...@gs...> - 2008-02-27 15:09:42
Cool!!! I'm going to test this new crawler in a few days :D

Antoni Mylka wrote:
> I've been working with the MboxCrawler. There are two issues with it.
>
> [...]
>
> All kinds of comments welcome.

--
José Gato Luis | Libre Software Engineering Lab (GSyC)
Tel: (+34)-914 888 105 | Universidad Rey Juan Carlos
jg...@gs... | Edif. Departamental II - Despacho 116
http://libresoft.urjc.es/ | c/Tulipán s/n 28933 Móstoles (Madrid)