From: Antoni M. <ant...@gm...> - 2008-08-13 22:27:44
Christiaan Fluit wrote:
> Hi, I'm back :)
>
> Antoni Myłka wrote:
>>>> A simple workaround would be:
>>>> - mail crawlers only return full messages, without crawling inside of them
>>>> - crawling inside a single email is done by the mimesubcrawler
>>>> - the mime subcrawler extracts the fulltext
>>> Yes! The last point would be restricted to those MIME parts whose plain
>>> text is added as full-text to the passed DataObject. For the others
>>> (attachments) you can just create new DataObjects that will be
>>> recursively processed by the application. Attached messages are no
>>> exception, I think: they are just new DataObjects that are again
>>> classified as message/rfc822, again processed by this SubCrawler, etc.
>> This won't work. I dropped this approach
>> - if a message has a forwarded message, which has another forwarded
>> message, then we'd need three mail subcrawlers with three copies of the
>> message byte stream - the memory consumption rises with each nesting level.
>
> ... then so be it.
>
> I wonder how this is different from what we had before, when the
> ImapCrawler created all DataObjects, i.e. including the attachments. How
> were the InputStreams of attachments handled by JavaMail? Were they
> wrappers around a larger InputStream or byte array holding the entire
> mail, were they copies of the byte array, ...?
>
> Perhaps it's just me starting to lose the overview...
>

I don't quite know anymore. The results of my tests with the expanded
crawl report show that the current trunk of the DataSourceFactory
attached two nie:mimeType triples to most messages. You can see that
there are 18 thousand text/plain elements in my mailbox. With the new
version there are only 3 thousand. That's because the MIME type of the
content has been correctly added as an nmo:contentMimeType triple.

As to how the attachment streams were implemented, that's hidden behind
the JavaMail APIs. The fact is that each MimeMessage instance
encapsulates a byte array with the entire message content. This array is
filled lazily, but it is always an in-memory byte array. The mbox mail
provider doesn't try to optimize this at all, and neither does
IMAPMessage: the initialization is lazy, but once you touch anything,
you get the whole message in memory.

>> - it proved very difficult to produce a data object that would be
>> identifiable by the mime type identifier as message/rfc822. There is no
>> method in the javax.mail api that would return raw stream of the message
>> (including the headers). getInputStream and getContent return the
>> content, not the headers. We'd have to use Part.writeTo and create some
>> clever hack that would convert an OutputStream into an InputStream with
>> as little memory as possible.
>
> If I remember correctly, the ImapCrawler used to hand out those MIME
> types, not the MIME type identifier. In other words, both the parent
> DataObjects representing entire mails and the child DataObjects that
> represent attachments had their mimeType and contentMimeType properties
> already set when they were reported to the CrawlerHandler. Therefore you
> didn't need the raw InputStream, only the content stream, which was
> passes as the InputStream of those DataObjects.
>
> Does this help you at all? :) As I said, I'm afraid I've lost oversight
> a bit, so I'm just posting some details from memory that come to mind,
> hoping that it will be of any help.
>

The most important part of the refactoring I did is that the whole
Message->RDF mapping now lives in the DataObjectFactory class. This made
it possible to test the mapping without any crawler. Please examine the
DataObjectFactoryTest class carefully, as it shows exactly what is being
done now. The crawlers are now responsible only for getting messages
from the source; the DataObjectFactory converts them to RDF.

Antoni Mylka
ant...@gm...
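
For reference, the "convert an OutputStream into an InputStream" idea from the quoted text could be sketched with the standard java.io piped streams. This is only an illustration under stated assumptions, not what the crawlers do: the class name RawMessageStreams, the StreamWriter interface and the asInputStream method are hypothetical, and StreamWriter merely stands in for anything with a writeTo(OutputStream) method such as javax.mail.Part (whose writeTo also throws MessagingException, which would have to be wrapped).

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.nio.charset.StandardCharsets;

class RawMessageStreams {

    /** Hypothetical stand-in for an object with a writeTo(OutputStream) method. */
    public interface StreamWriter {
        void writeTo(OutputStream out) throws IOException;
    }

    /**
     * Exposes whatever the source writes as an InputStream. A background
     * thread pushes the bytes through a pipe, so the pipe itself never
     * buffers more than bufferSize bytes at a time.
     */
    public static InputStream asInputStream(final StreamWriter source, int bufferSize)
            throws IOException {
        final PipedOutputStream sink = new PipedOutputStream();
        PipedInputStream result = new PipedInputStream(sink, bufferSize);
        Thread writer = new Thread(() -> {
            // Closing the sink signals end-of-stream to the reader.
            try (OutputStream out = sink) {
                source.writeTo(out);
            } catch (IOException e) {
                // Either the reader closed the pipe early or the source failed.
                // In this sketch the reader then simply sees a truncated stream;
                // real code would have to report the failure somewhere.
            }
        }, "raw-message-writer");
        writer.setDaemon(true);
        writer.start();
        return result;
    }

    public static void main(String[] args) throws IOException {
        byte[] raw = "Subject: test\r\n\r\nbody".getBytes(StandardCharsets.US_ASCII);
        StreamWriter fakeMessage = out -> out.write(raw);
        try (InputStream in = asInputStream(fakeMessage, 16)) {
            ByteArrayOutputStream copy = new ByteArrayOutputStream();
            byte[] chunk = new byte[8];
            int n;
            while ((n = in.read(chunk)) != -1) {
                copy.write(chunk, 0, n);
            }
            System.out.println(copy.size() + " bytes read");
        }
    }
}
```

Note that this only avoids a second full copy on the pipe side. As described above, a MimeMessage may still hold the entire message in its own in-memory byte array, so the overall saving depends on the mail provider.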