From: Christiaan F. <chr...@ad...> - 2006-06-19 10:03:57
I'm running into a performance problem while integrating ImapCrawler in
Aduna AutoFocus. I believe this problem may also occur elsewhere,
depending on how you apply Aperture, hence this mail.

AutoFocus has a wizard that shows the IMAP folder tree so that the user
can choose which folder(s) should be crawled. This wizard uses the
DataAccessor implementation provided by the ImapCrawler class to
retrieve the folder tree so that I can reuse all functionality for
contacting the IMAP server and I can be sure that the folder tree will
correspond fully to the tree that the crawler will see.

The performance problem is in retrieving the metadata of each folder.
This metadata contains a statement for every message in that folder,
stating that that message resides in that folder. Retrieving the message
UIDs leads to a lot of extra unnecessary network traffic when you're
only interested in the folder structure. Luckily the crawler prefetches
all message data at once for each folder, but still...

Another problem is how to differentiate message URIs from nested folder
URIs. There is no statement in the folder metadata stating the data type
of the nested items. When you look at the URIs that we currently use,
you can simply check whether they end with ";TYPE=LIST", but the URL
format is still under consideration and I don't like to see such
assumptions in my client code.

A simple workaround is to add a boolean switch to ImapCrawler that
indicates whether only folder metadata should be generated or also
message metadata. I will add this switch now so I can continue applying
and testing ImapCrawler (I already fixed a number of other issues as
well). The default will be to include all metadata, so existing
applications are not affected.

It seems to me that this problem may also occur with the other
DataAccessor types. For example, every mail DataAccessor will have the
same problem.
Likewise, when you want to create a Windows Explorer-like
component showing a folder tree based on the FileDataAccessor, you may
also want to specify that you're interested in folder metadata only.

A more generic solution is to extend the DataAccessor methods with an
extra boolean parameter or add extra methods with this parameter. As
folders are already given special consideration at the API level (e.g.
the existence of FolderDataObject), I believe this can be justified.

Are there any other solutions imaginable? I thought for a while about
specifying this at the schema level (just indicate which parts you want)
so it can be used to more precisely define the output but this wouldn't
solve this particular problem, as both subfolders and messages are
indicated using the same partOf property.

Chris
--
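
The folder-only switch proposed above could look roughly like the following Java sketch. It is only an illustration under stated assumptions: SketchFolder, FolderMetadataSketch and the plain-string "child partOf parent" statements are hypothetical stand-ins, not Aperture classes, and the Supplier stands in for the network call that fetches message UIDs. The point it shows is that with the switch off, the UID fetch is never triggered.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical stand-in for an IMAP folder; not part of Aperture. */
final class SketchFolder {
    final String uri;
    final List<SketchFolder> subfolders = new ArrayList<>();
    // Assumption: stands in for the expensive network call that fetches UIDs.
    final Supplier<List<Long>> uidFetcher;

    SketchFolder(String uri, Supplier<List<Long>> uidFetcher) {
        this.uri = uri;
        this.uidFetcher = uidFetcher;
    }
}

final class FolderMetadataSketch {
    /**
     * Returns "child partOf parent" statements for one folder.
     * Subfolder statements are always produced; message statements are
     * produced, and UIDs fetched, only when includeMessages is true.
     */
    static List<String> statements(SketchFolder folder, boolean includeMessages) {
        List<String> result = new ArrayList<>();
        for (SketchFolder sub : folder.subfolders) {
            result.add(sub.uri + " partOf " + folder.uri);
        }
        if (includeMessages) {
            // Only this branch touches the (simulated) network.
            for (Long uid : folder.uidFetcher.get()) {
                result.add(folder.uri + "/;UID=" + uid + " partOf " + folder.uri);
            }
        }
        return result;
    }
}
```

A folder wizard would call statements(folder, false) for every node of the tree, while a crawler would keep passing true, which matches the proposed default of including all metadata.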
From: Leo S. <sau...@df...> - 2006-06-19 12:15:47
Christiaan Fluit wrote:
> I'm running into a performance problem while integrating ImapCrawler in
> Aduna AutoFocus. I believe this problem may also occur elsewhere,
> depending on how you apply Aperture, hence this mail.
>
> AutoFocus has a wizard that shows the IMAP folder tree so that the user
> can choose which folder(s) should be crawled. This wizard uses the
> DataAccessor implementation provided by the ImapCrawler class to
> retrieve the folder tree so that I can reuse all functionality for
> contacting the IMAP server and I can be sure that the folder tree will
> correspond fully to the tree that the crawler will see.
>
For exactly that reason we designed the "StructuredAccess" class:
https://gnowsis.opendfki.de/wiki/ApertureHierachicalAccess

"Therefore, this Interface is a parallel add-on for convenience,
independent from ApertureDataCrawler
<https://gnowsis.opendfki.de/wiki/ApertureDataCrawler>. Whereas
ApertureDataCrawler is only for incremental crawling, the
ApertureHierachicalAccess is for one-time crawling when someone (the
user?) wants to see the hierarchy inside. It must be noted, that IF a
DataSource <https://gnowsis.opendfki.de/wiki/DataSource> supports
ApertureHierachicalAccess, then the extracted Data that is stored in
some database has also to have the hierarchical structure visible
somehow. So all data that build the Hierarchy expressed in
ApertureHierachicalAccess should also be returned by
ApertureDataCrawler."

Perhaps you should implement that interface?

> The performance problem is in retrieving the metadata of each folder.
> This metadata contains a statement for every message in that folder,
> stating that that message resides in that folder.
> Retrieving the message UIDs leads to a lot of extra unnecessary
> network traffic when you're only interested in the folder structure.
> Luckily the crawler prefetches all message data at once for each
> folder, but still...
>
HierarchicalAccess would be the answer, if we could reuse some of the
crawler code to do it.

> Another problem is how to differentiate message URIs from nested folder
> URIs. There is no statement in the folder metadata stating the data type
> of the nested items. When you look at the URIs that we currently use,
> you can simply check whether they end with ";TYPE=LIST", but the URL
> format is still under consideration and I don't like to see such
> assumptions in my client code.
>
We could use the URI scheme as defined by the RFC, and I have code that
may help. The scheme is defined by RFC 2192:
http://www.networksorcery.com/enp/rfc/rfc2192.txt

Some examples:
<imap://mi...@mi.../users.*;type=list>

We did it in gnowsis like this:
imap://sau...@ex.../INBOX/;UID=234

This class implements "quite" correct folder and message URIs according
to the RFC. The only thing missing is the UTF-7-IMAP encoding scheme for
funny characters in folder paths. Do you know how to do that?
https://gnowsis.opendfki.de/browser/trunk/gnowsis_email/WEB-INF/src/org/gnowsis/email/config/StoreConfig.java

This class does the inverse, parsing URIs and returning MAILAPI objects
for them:
https://gnowsis.opendfki.de/browser/trunk/gnowsis_email/WEB-INF/src/org/gnowsis/email/config/UrlParser.java

I would recommend we switch the URI generation of the IMAP crawler to
something like the StoreConfig class, so that it is clear how to get the
URIs.

We have a wiki page that gathers some of the problems we faced (there is
not enough text there yet, so please extend it):
https://gnowsis.opendfki.de/wiki/EmailDeveloping
Could you also add any further information you find to this page?

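
As an illustration of URI generation kept in a class of its own, here is a minimal Java sketch that builds folder-list and message URIs in the shapes shown above (";TYPE=LIST" and "/;UID="). It is not the StoreConfig code. ImapUriSketch and its method names are hypothetical, and the set of characters left unescaped is a deliberately conservative assumption: anything outside ASCII letters, digits and "-_.~/" is percent-encoded from its UTF-8 bytes.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper; not the gnowsis StoreConfig or any Aperture class. */
final class ImapUriSketch {
    // Assumption: a conservative set of characters left unescaped.
    private static final String UNESCAPED = "-_.~/";

    /** Percent-encodes a mailbox path from its UTF-8 bytes. */
    static String encodeMailbox(String mailbox) {
        StringBuilder out = new StringBuilder();
        for (byte b : mailbox.getBytes(StandardCharsets.UTF_8)) {
            int c = b & 0xFF;
            boolean alnum = (c >= 'a' && c <= 'z')
                    || (c >= 'A' && c <= 'Z')
                    || (c >= '0' && c <= '9');
            if (alnum || UNESCAPED.indexOf(c) >= 0) {
                out.append((char) c);
            } else {
                out.append('%').append(String.format("%02X", c));
            }
        }
        return out.toString();
    }

    /** URI for listing a folder, e.g. imap://user@host/INBOX;TYPE=LIST */
    static String folderListUri(String userAtHost, String mailbox) {
        return "imap://" + userAtHost + "/" + encodeMailbox(mailbox) + ";TYPE=LIST";
    }

    /** URI for one message, e.g. imap://user@host/INBOX/;UID=234 */
    static String messageUri(String userAtHost, String mailbox, long uid) {
        return "imap://" + userAtHost + "/" + encodeMailbox(mailbox) + "/;UID=" + uid;
    }
}
```

With both URI shapes produced in one place, client code can ask that class whether a URI denotes a folder or a message instead of checking the ";TYPE=LIST" suffix itself.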
> A simple workaround is by adding a boolean switch to ImapCrawler that
> indicates whether only folder metadata should be generated or also
> message metadata. I will add this switch now so I can continue applying
> and testing ImapCrawler (already fixed a number of other issues as
> well). The default will be to include all metadata, so existing
> applications are not affected.
>
I would go for hierarchical access. Again, this is not crawling, it is
prefetching structure information for the user interface, which is
exactly what HierarchicalAccess was made for.

> It seems to me that this problem may also occur with the other
> DataAccessor types. For example, every mail DataAccessor will have the
> same problem. Likewise, when you want to create a Windows Explorer-like
> component showing a folder tree based on the FileDataAccessor, you may
> also want to specify that you're interested in folder metadata only.
>
> A more generic solution is to extend the DataAccessor methods with an
> extra boolean parameter or add extra methods with this parameter. As
> folders are already given special consideration at the API level (e.g.
> the existence of FolderDataObject), I believe this can be justified.
>
> Are there any other solutions imaginable? I thought for a while about
> specifying this at the schema level (just indicate which parts you want)
> so it can be used to more precisely define the output but this wouldn't
> solve this particular problem, as both subfolders and messages are
> indicated using the same partOf property.
>
Well, this all sounds like hacks that make the normal thing complicated
and the complicated thing complicated. A good design rationale is to
make the simple thing simple (= crawling) and the complicated thing
complicated (= hierarchical access), and implementing the other class
for this special purpose is a good way to go. Reuse is nice, but the
code gets unmanageable with so many boolean switches.

After all, implementing HierarchicalAccess should not take more than
700 lines, given that we put the URI generation in a separate class.

Leo

>
> Chris
> --
>
>
> _______________________________________________
> Aperture-devel mailing list
> Ape...@li...
> https://lists.sourceforge.net/lists/listinfo/aperture-devel
>

--
____________________________________________________
DI Leo Sauermann       http://www.dfki.de/~sauermann
DFKI GmbH
P.O. Box 2080          Fon:   +49 631 205-3503
67608 Kaiserslautern   Fax:   +49 631 205-3472
Germany                Mail:  leo...@df...
____________________________________________________
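
On the open question above about the UTF-7-IMAP encoding of funny characters in folder paths: the IMAP4rev1 specification (RFC 3501, section 5.1.3) describes a modified UTF-7 in which printable ASCII stands for itself, "&" is written as "&-", and any other run of characters is encoded as UTF-16BE in base64 without padding, with "," in place of "/", wrapped in "&" and "-". The Java sketch below follows that description. ModifiedUtf7Sketch is a hypothetical name, and only encoding is covered, not decoding.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Hypothetical helper sketching IMAP modified UTF-7 encoding (encode only). */
final class ModifiedUtf7Sketch {

    static String encode(String name) {
        StringBuilder out = new StringBuilder();
        StringBuilder pending = new StringBuilder(); // non-ASCII run awaiting base64
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            if (c >= 0x20 && c <= 0x7E) {
                flush(pending, out);
                if (c == '&') {
                    out.append("&-"); // literal ampersand
                } else {
                    out.append(c);
                }
            } else {
                pending.append(c);
            }
        }
        flush(pending, out);
        return out.toString();
    }

    /** Emits the pending run as &<modified base64 of UTF-16BE>- and clears it. */
    private static void flush(StringBuilder pending, StringBuilder out) {
        if (pending.length() == 0) {
            return;
        }
        byte[] utf16 = pending.toString().getBytes(StandardCharsets.UTF_16BE);
        String b64 = Base64.getEncoder().withoutPadding().encodeToString(utf16);
        out.append('&').append(b64.replace('/', ',')).append('-');
        pending.setLength(0);
    }
}
```

For example, a folder named "Entwürfe" would be sent to the server as "Entw&APw-rfe". Whether the crawler needs this at all depends on whether the mail library in use already applies the encoding to folder names, which is worth checking before adding it.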