Hello, Chris
Christiaan Fluit wrote:
> Hi Antoni,
>
> Hmmm, wish I joined this discussion earlier... :) I'm trying to wrap my
> head around all the issues surrounding this complicated subject.
Oh indeed. I've been battling with this for a week.
> There seem to be some problems with content streams returned by the
> MimeSubCrawler that potentially get misclassified. Reading various
> mails, I sense that some decisions I made when creating the original
> ImapCrawler and its DataObjectFactory were (perhaps unknowingly) reversed.
The only change I've intentionally introduced is that I dropped the idea
of the cachedDataObjectsMap. Before, the crawler had to know the URL of the
message and obtain the URLs of its children from the metadata RDF container.
Right now it is a simple list, over which both the AbstractMailCrawler
and the MimeSubCrawler iterate. Notice the "cementery for old code" at
the bottom of the AbstractJavaMailCrawler.
> Here's what I remember from creating the original ImapCrawler, perhaps
> it is of some help.
>
> When I created this crawler, I deliberately chose to let it fully
> process the MIME parts that were typically displayed by mailers as the
> body of a mail, and only let the parts that were displayed as
> attachments be returned as DataObjects so that they could be processed
> by Extractors (with some exceptions for messages forwarded as
> attachment).
I didn't change this approach.
> Reasons for this that I can remember:
>
> - I wanted the body text to be a property of the DataObject that
> represents the entire mail, rather than having a child DataObject
> representing the body (relevant for multipart/mixed and
> multipart/alternative).
This is still so. The refactoring I did allowed me to create explicit
unit tests for this behavior. Take a look at the
crawler.mail.DataObjectFactoryTest class and see what you don't like.
> - in case of multipart/alternative, only one of these parts suffices
> (that's what the "alternative" is about), you can skip the others. In
> that case, I preferred plain text over HTML, i.e. take the easiest format.
This is still the case; see the test.
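To make the preference concrete, here is a minimal sketch of the selection
rule only. It is not the actual DataObjectFactory code: the class and method
names are made up for illustration, and each alternative is represented just
by its Content-Type string.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class AlternativeChooser {

    /**
     * Returns the index of the part to use as the mail body: the first
     * text/plain part if there is one, otherwise the first text/html part,
     * otherwise -1 (no suitable body part).
     */
    public static int chooseBodyPart(List<String> contentTypes) {
        int htmlIndex = -1;
        for (int i = 0; i < contentTypes.size(); i++) {
            // Content-Type values may carry parameters such as "; charset=utf-8"
            String base = contentTypes.get(i).split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
            if (base.equals("text/plain")) {
                return i; // plain text always wins
            }
            if (base.equals("text/html") && htmlIndex < 0) {
                htmlIndex = i; // remember the first HTML part as a fallback
            }
        }
        return htmlIndex;
    }

    public static void main(String[] args) {
        List<String> parts = Arrays.asList("text/html; charset=utf-8", "text/plain; charset=utf-8");
        System.out.println(chooseBodyPart(parts)); // prints 1
    }
}
```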
> In other words, the DataObject hierarchy should in the first place
> reflect the mental model that users have of a mail, rather than the MIME
> hierarchy, which often does not distinguish between bodies and attachments.
>
> When the body part was plain text, I just read the entire content
> stream, added this text as a property in the model and set the
> InputStream of the FileDataObject to 'null', thereby preventing any
> Extractor from redoing that work. This also prevents any of the problems
> you mention, such as plain text bodies being misclassified by the MIME
> type identifier.
I will reevaluate the test I did.
> I'm not sure what I did with HTML parts, i.e. whether the *crawler*
> performed the HTML extraction and put the result in the model. Probably
> not, judging from the fact that I also needed to introduce a "content
> MIME type", so that the entire message could be classified as
> "message/rfc822" and the body as "text/html". Performing HTML text
> extraction for bodies in the (sub)crawler is in my opinion a good
> solution though. It's only these two types of mail bodies that actually
> need special attention, the rest are just plain old DataObjects that the
> Extractor and SubCrawler frameworks can process.
In the ImapCrawler, the textual content wasn't covered at all, or at least
I couldn't find the code for it.
> Does this help?
>
> Some other remarks:
>
>> 1. State that the DataObject stream is only valid within the call to
>> objectNew, and only within the thread that called objectNew. (which is
>> actually true, it has just never been said explicitly)
>
> Sounds good. Somehow I have the feeling that we originally decided
> against this, but I can't remember the use case.
>
> This may solve a lot of issues related to crawling. Please remember
> though that we also support direct retrieval of DataObjects from the
> DataAccessor, e.g. for apps that allow the user to open a result. How is
> this affected by these changes? Perhaps simply a matter of those apps
> still needing to invoke "dataObject.dispose" themselves, or is there
> more to it?
You're right. We can't make it automatic in this respect. DataObjects
returned by Accessors will still need an obligatory call to dispose().
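For an application that opens a result directly, the obligation boils down
to a try/finally. The sketch below uses stand-in interfaces rather than the
real Aperture types; only the existence of dispose() is taken from this
discussion, everything else (Accessor, describe(), openAndRelease) is a
hypothetical name.

```java
public class DisposeExample {

    /** Stand-in for a DataObject; only dispose() matters here. */
    public interface DataObject {
        String describe();
        void dispose();
    }

    /** Stand-in for whatever hands out the object on direct retrieval. */
    public interface Accessor {
        DataObject get(String url);
    }

    /** Uses the object and always disposes it, even if the use fails. */
    public static String openAndRelease(Accessor accessor, String url) {
        DataObject object = accessor.get(url);
        try {
            return object.describe();
        } finally {
            object.dispose(); // the caller's job when no crawler is involved
        }
    }
}
```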
>> A simple workaround would be:
>> - mail crawlers only return full messages, without crawling inside of them
>> - crawling inside a single email is done by the mimesubcrawler
>> - the mime subcrawler extracts the fulltext
>
> Yes! The last point would be restricted to those MIME parts whose plain
> text is added as full-text to the passed DataObject. For the others
> (attachments) you can just create new DataObjects that will be
> recursively processed by the application. Attached messages are no
> exception, I think: they are just new DataObjects that are again
> classified as message/rfc822, again processed by this SubCrawler, etc.
This won't work. I dropped this approach for two reasons:
- if a message has a forwarded message, which has another forwarded
message, then we'd need three mail subcrawlers with three copies of the
message byte stream - the memory consumption rises with each nesting level.
- it proved very difficult to produce a data object that would be
identifiable by the mime type identifier as message/rfc822. There is no
method in the javax.mail API that would return the raw stream of the message
(including the headers). getInputStream and getContent return the
content, not the headers. We'd have to use Part.writeTo and create some
clever hack that would convert an OutputStream into an InputStream with
as little memory as possible.
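One way such a hack could look is a pipe fed by a background thread. This is
only a sketch under assumptions, not something that exists in Aperture: the
StreamWriter interface and asInputStream method are hypothetical names, and
StreamWriter stands in for anything that offers a writeTo(OutputStream)
method.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;

public class WriteToAsInputStream {

    /** Hypothetical stand-in for an object with a writeTo(OutputStream) method. */
    public interface StreamWriter {
        void writeTo(OutputStream out) throws Exception;
    }

    /**
     * Returns an InputStream that yields whatever the writer produces.
     * A background thread runs writeTo(); the pipe holds at most
     * bufferSize bytes at any time, so the writer blocks until the
     * reader catches up.
     */
    public static InputStream asInputStream(final StreamWriter writer, int bufferSize)
            throws IOException {
        PipedInputStream in = new PipedInputStream(bufferSize);
        final PipedOutputStream out = new PipedOutputStream(in);
        Thread producer = new Thread(() -> {
            try {
                writer.writeTo(out);
            } catch (Exception e) {
                // Sketch only: a failure here just ends the stream early.
                // Real code would have to report it to the reader.
            } finally {
                try {
                    out.close(); // signals end of stream to the reader
                } catch (IOException ignored) {
                    // nothing sensible to do at this point
                }
            }
        }, "writeTo-pump");
        producer.setDaemon(true);
        producer.start();
        return in;
    }
}
```

With JavaMail the writer would be something like `out -> part.writeTo(out)`.
The price is a second thread touching the message, which is exactly the kind
of thing the "stream is only valid within the calling thread" rule is meant
to avoid, so I wouldn't call it a clean solution.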
>> This would mean though that the attachments are only visible after an
>> additional processing step, and would incur additional memory overhead
>> (two copies of the MimeMessage at the same time)
>
> Why? How is this different from what the ImapCrawler did before parts of
> its functionality was moved to MimeSubCrawler?
ImapCrawler calls getMessage() on a folder. It gets an IMAPMessage that
encapsulates an open stream. On the first call to getSubject or any
other header-related method, JavaMail downloads all headers from that
stream into an internal buffer. We create a FileDataObject with the
stream that contains the raw bytes of that message (which is non-trivial),
and then the MimeSubCrawler creates a MimeMessage which downloads the
entire content of the stream.
>> 2. everything has correct mime types (before, a plain text email
>> received the text/plain mime type instead of message/rfc822)
>
> I don't see this problem occurring with the older code. Aren't you
> confusing the regular MIME type property with the content MIME type?
Once again, I'll look at my test and get back to you.
>> problem introduced: it's impossible to turn the fulltext extraction
>> off. I consider it a reasonable tradeoff though.
>
> You can turn it off by passing 'null' as InputStream for the DataObject.
>
I meant: the Aperture user can't programmatically force the ImapCrawler
not to return the fulltext.
Will get back to you.
Antoni Mylka
antoni.mylka@...