Christiaan Fluit wrote:
I'm running into a performance problem while integrating ImapCrawler in 
Aduna AutoFocus. I believe this problem may also occur elsewhere, 
depending on how you apply Aperture, hence this mail.

AutoFocus has a wizard that shows the IMAP folder tree so that the user 
can choose which folder(s) should be crawled. This wizard uses the 
DataAccessor implementation provided by the ImapCrawler class to 
retrieve the folder tree so that I can reuse all functionality for 
contacting the IMAP server and I can be sure that the folder tree will 
correspond fully to the tree that the crawler will see.
  
For exactly that reason we designed the "StructuredAccess" class,
https://gnowsis.opendfki.de/wiki/ApertureHierachicalAccess

"Therefore, this Interface is a parallel add-on for convenience, independent from ApertureDataCrawler. Whereas ApertureDataCrawler is only for incremental crawling, the ApertureHierachicalAccess is for one-time crawling when someone (the user?) wants to see the hierarchy inside. It must be noted, that IF a DataSource supports ApertureHierachicalAccess, then the extracted Data that is stored in some database has also to have the hierarchical structure visible somehow. So all data that build the Hierarchy expressed in ApertureHierachicalAccess should also be returned by ApertureDataCrawler."

Perhaps you should implement that interface?


The performance problem is in retrieving the metadata of each folder. 
This metadata contains a statement for every message in that folder, 
stating that that message resides in that folder. Retrieving the message 
UIDs leads to a lot of extra unnecessary network traffic when you're 
only interested in the folder structure. Luckily the crawler prefetches 
all message data at once for each folder, but still...
  
HierarchicalAccess would be the answer here,
provided we can reuse some of the crawler code to implement it.
Another problem is how to differentiate message URIs from nested folder 
URIs. There is no statement in the folder metadata stating the data type 
of the nested items. When you look at the URIs that we currently use, 
you can simply check whether they end with ";TYPE=LIST", but the URL 
format is still under consideration and I don't like to see such 
assumptions in my client code.
  
We could use the URI scheme as defined by the RFC; I have code that may help.

The scheme is defined by RFC 2192:
http://www.networksorcery.com/enp/rfc/rfc2192.txt

An example from the RFC:
<imap://michael@minbari.org/users.*;type=list>
In gnowsis we built message URIs like this:
imap://sauermann@example.com/INBOX/;UID=234

This class implements "quite" correct folder and message URIs according to the RFC. The only thing missing is the UTF-7-IMAP (modified UTF-7) encoding of non-ASCII characters in folder paths. Do you know how to do that?
https://gnowsis.opendfki.de/browser/trunk/gnowsis_email/WEB-INF/src/org/gnowsis/email/config/StoreConfig.java
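
For reference, the encoding IMAP uses for mailbox names is the "modified UTF-7" described in RFC 3501, section 5.1.3: printable ASCII is kept as-is, "&" becomes "&-", and every other run of characters is written as "&" + Base64 of its UTF-16BE bytes (with "," instead of "/" and no "=" padding) + "-". Below is a minimal Java sketch of the encoding direction only. The class name ModifiedUtf7Sketch is made up for illustration and is not part of gnowsis or Aperture; how the encoded name should then be escaped inside an IMAP URL is a separate question that this sketch does not address.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Hypothetical helper, not part of gnowsis or Aperture. */
final class ModifiedUtf7Sketch {
    private ModifiedUtf7Sketch() {}

    /** Encodes a mailbox name using the modified UTF-7 rules of RFC 3501, 5.1.3. */
    static String encode(String folderName) {
        StringBuilder out = new StringBuilder();
        StringBuilder pending = new StringBuilder(); // current run of non-ASCII chars
        for (int i = 0; i < folderName.length(); i++) {
            char c = folderName.charAt(i);
            if (c >= 0x20 && c <= 0x7e) {
                flush(pending, out);           // close any open Base64 run first
                if (c == '&') {
                    out.append("&-");          // "&" is the shift character, so escape it
                } else {
                    out.append(c);             // printable ASCII represents itself
                }
            } else {
                pending.append(c);             // collect until the run ends
            }
        }
        flush(pending, out);
        return out.toString();
    }

    /** Writes the pending run as "&" + modified Base64 + "-" and clears it. */
    private static void flush(StringBuilder pending, StringBuilder out) {
        if (pending.length() == 0) {
            return;
        }
        byte[] utf16 = pending.toString().getBytes(StandardCharsets.UTF_16BE);
        String base64 = Base64.getEncoder().withoutPadding().encodeToString(utf16);
        out.append('&').append(base64.replace('/', ',')).append('-');
        pending.setLength(0);
    }
}
```

With this sketch, encode("Entw\u00fcrfe") gives "Entw&APw-rfe", and the folder path used as an example in RFC 3501 comes out as "~peter/mail/&U,BTFw-/&ZeVnLIqe-". Decoding is not covered here.
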

This class does the inverse: it parses URIs and returns MAILAPI objects for them:
https://gnowsis.opendfki.de/browser/trunk/gnowsis_email/WEB-INF/src/org/gnowsis/email/config/UrlParser.java

I would recommend we switch the URI generation of the IMAP crawler to something like the StoreConfig class, so that it is clear how to get the URIs.
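
To illustrate what such a central URI class could look like, here is a minimal Java sketch. It is not the actual StoreConfig or UrlParser code; the class and method names are hypothetical. It only covers the two URI shapes quoted above (a folder list URI ending in ";TYPE=LIST" and a message URI ending in "/;UID=<n>"), and it assumes that user, host and folder names are already safe to place in a URI, so no escaping is done.

```java
import java.util.Locale;

/** Hypothetical helper; not the gnowsis StoreConfig/UrlParser classes. */
final class ImapUriSketch {
    private ImapUriSketch() {}

    /** Builds a folder list URI, e.g. imap://user@host/users.*;TYPE=LIST */
    static String folderListUri(String user, String host, String folderPattern) {
        return "imap://" + user + "@" + host + "/" + folderPattern + ";TYPE=LIST";
    }

    /** Builds a message URI, e.g. imap://user@host/INBOX/;UID=234 */
    static String messageUri(String user, String host, String folder, long uid) {
        return "imap://" + user + "@" + host + "/" + folder + "/;UID=" + uid;
    }

    /** True if the URI ends with the ;TYPE=LIST parameter (compared case-insensitively). */
    static boolean isFolderListUri(String uri) {
        return uri.toUpperCase(Locale.ROOT).endsWith(";TYPE=LIST");
    }

    /** True if the URI ends with /;UID=<digits> (compared case-insensitively). */
    static boolean isMessageUri(String uri) {
        return uri.toUpperCase(Locale.ROOT).matches(".*/;UID=\\d+");
    }
}
```

Client code such as a folder wizard would then call isFolderListUri(uri) instead of inspecting the string itself, so a later change to the URI format stays confined to one class.
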

We have a wiki page that gathers some of the problems we faced (there is not much text there yet, so feel free to extend it):
https://gnowsis.opendfki.de/wiki/EmailDeveloping

Could you also add any further information you find to this page?
A simple workaround is to add a boolean switch to ImapCrawler that 
indicates whether only folder metadata should be generated or also 
message metadata. I will add this switch now so I can continue applying 
and testing ImapCrawler (already fixed a number of other issues as 
well). The default will be to include all metadata, so existing 
applications are not affected.
  
I would go for hierarchical access.
Again: this is not crawling, it is prefetching structure information for the user interface,
which is exactly what HierarchicalAccess was made for.
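
To make the idea concrete, here is a rough Java sketch of a folder-only access interface. It is not the interface from the wiki page: the names HierarchicalAccessSketch, InMemoryFolderTree and FolderTreeWalker are invented for illustration, and an in-memory map stands in for the mail server. The point it shows is that a UI can walk the folder tree through a call that returns subfolders only, so no message UIDs are ever fetched.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical interface; not the actual Aperture or gnowsis API. */
interface HierarchicalAccessSketch {
    /** Returns the URIs of the direct subfolders of a folder, never its messages. */
    List<String> listSubFolders(String folderUri);
}

/** In-memory stand-in for a mail store, used only to exercise the interface. */
final class InMemoryFolderTree implements HierarchicalAccessSketch {
    private final Map<String, List<String>> children = new LinkedHashMap<>();

    void addFolder(String parentUri, String childUri) {
        children.computeIfAbsent(parentUri, k -> new ArrayList<>()).add(childUri);
    }

    @Override
    public List<String> listSubFolders(String folderUri) {
        return children.getOrDefault(folderUri, Collections.emptyList());
    }
}

final class FolderTreeWalker {
    private FolderTreeWalker() {}

    /** Collects the root and all folders below it in depth-first order. */
    static List<String> collect(HierarchicalAccessSketch access, String rootUri) {
        List<String> result = new ArrayList<>();
        visit(access, rootUri, result);
        return result;
    }

    private static void visit(HierarchicalAccessSketch access, String uri, List<String> result) {
        result.add(uri);
        for (String child : access.listSubFolders(uri)) {
            visit(access, child, result);
        }
    }
}
```

A wizard would call FolderTreeWalker.collect(access, rootUri), or call listSubFolders lazily as the user expands nodes, while the crawler keeps its own incremental code path.
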
It seems to me that this problem may also occur with the other 
DataAccessor types. For example, every mail DataAccessor will have the 
same problem. Likewise, when you want to create a Windows Explorer-like 
component showing a folder tree based on the FileDataAccessor, you may 
also want to specify that you're interested in folder metadata only.

A more generic solution is to extend the DataAccessor methods with an 
extra boolean parameter or add extra methods with this parameter. As 
folders are already given special consideration at the API level (e.g. 
the existence of FolderDataObject), I believe this can be justified.
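
A minimal sketch of what such an extra boolean parameter could look like, in Java. The class FolderMetadataSketch and its methods are hypothetical and do not correspond to the real DataAccessor or FolderDataObject API; the existing no-argument call keeps returning everything, so current callers would be unaffected.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of a folder's child statements; not the Aperture API. */
final class FolderMetadataSketch {
    final List<String> subFolderUris = new ArrayList<>();
    final List<String> messageUris = new ArrayList<>();

    /** Existing behaviour: children include both subfolders and messages. */
    List<String> childUris() {
        return childUris(false);
    }

    /** New variant: with foldersOnly set, message URIs are left out. */
    List<String> childUris(boolean foldersOnly) {
        List<String> children = new ArrayList<>(subFolderUris);
        if (!foldersOnly) {
            children.addAll(messageUris);
        }
        return children;
    }
}
```

In a real accessor the gain would come from not asking the server for the message UIDs at all when foldersOnly is set, rather than from filtering afterwards as this in-memory sketch does.
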

Are there any other solutions imaginable? I thought for a while about 
specifying this at the schema level (just indicate which parts you want) 
so it can be used to more precisely define the output but this wouldn't 
solve this particular problem, as both subfolders and messages are 
indicated using the same partOf property.
  
Well, this all sounds like hacks that make the normal thing complicated and the complicated thing complicated.
A good design rationale is to make the simple thing simple (= crawling) and the complicated thing complicated (= hierarchical access),
and implementing the other class for this special purpose is a good way to go.

Reuse is nice, but the code gets unmanageable with so many boolean switches.
After all, implementing HierarchicalAccess should not take more than 700 lines, given that we put the URI generation in a separate class.

Leo

Chris
--


_______________________________________________
Aperture-devel mailing list
Aperture-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/aperture-devel
  


-- 
____________________________________________________
DI Leo Sauermann       http://www.dfki.de/~sauermann 
DFKI GmbH
P.O. Box 2080          Fon:   +49 631 205-3503
67608 Kaiserslautern   Fax:   +49 631 205-3472
Germany                Mail:  leo.sauermann@dfki.de
____________________________________________________