Christiaan Fluit wrote:
For exactly that reason we designed the "StructuredAccess" class,
I'm running into a performance problem while integrating ImapCrawler in
Aduna AutoFocus. I believe this problem may also occur elsewhere,
depending on how you apply Aperture, hence this mail.
AutoFocus has a wizard that shows the IMAP folder tree so that the user
can choose which folder(s) should be crawled. This wizard uses the
DataAccessor implementation provided by the ImapCrawler class to
retrieve the folder tree so that I can reuse all functionality for
contacting the IMAP server and I can be sure that the folder tree will
correspond fully to the tree that the crawler will see.
"Therefore, this Interface is a parallel add-on for convenience,
independent from ApertureDataCrawler.
is only for incremental crawling, the ApertureHierachicalAccess
is for one-time crawling when someone (the user ?) wants to see the
It must be noted, that IF a DataSource?
then the extracted Data that is stored in some database has also to
have the hierarchical structure visible somehow. So all data that build
the Hierarchy expressed in ApertureHierachicalAccess
should also be returned by ApertureDataCrawler."
perhaps you should implement that interface?
HierarchicalAccess would be it.
The performance problem is in retrieving the metadata of each folder.
This metadata contains a statement for every message in that folder,
stating that that message resides in that folder. Retrieving the message
UIDs leads to a lot of extra unnecessary network traffic when you're
only interested in the folder structure. Luckily the crawler prefetches
all message data at once for each folder, but still...
If we could reuse some of the crawler code to do it.
we could use the URI scheme as defined by the RFC. I have code that may
Another problem is how to differentiate message URIs from nested folder
URIs. There is no statement in the folder metadata stating the data type
of the nested items. When you look at the URIs that we currently use,
you can simply check whether they end with ";TYPE=LIST", but the URL
format is still under consideration and I don't like to see such
assumptions in my client code.
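
One way to keep that assumption out of client code is to put the check in a single helper. The following is only a sketch: the class ImapUriKind and its method are hypothetical names, not part of Aperture, and it assumes the crawler's URIs follow the RFC 2192 forms (";TYPE=LIST" or ";TYPE=LSUB" for mailbox lists, "/;UID=" for messages).

```java
// Hypothetical helper, not Aperture API: one place that knows the URI layout.
final class ImapUriKind {

    enum Kind { FOLDER_LIST, MESSAGE, OTHER }

    // Classifies an IMAP URI by the markers RFC 2192 uses for
    // mailbox lists (";TYPE=LIST" / ";TYPE=LSUB") and messages ("/;UID=").
    static Kind classify(String uri) {
        if (uri.contains("/;UID=")) {
            return Kind.MESSAGE;
        }
        if (uri.endsWith(";TYPE=LIST") || uri.endsWith(";TYPE=LSUB")) {
            return Kind.FOLDER_LIST;
        }
        return Kind.OTHER;
    }
}
```

If the URI format changes later, only this helper would need to change, not every client.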
That is defined by the RFC 2192:
we did it in gnowsis like this:
This class implements "quite" correct folder & message URIs
according to the RFC. The only thing missing is the UTF-7-IMAP encoding
scheme for funny characters in folder paths - do you know how to do that?
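
For the encoding question, here is a minimal sketch of the modified UTF-7 mailbox name encoding as described in RFC 3501, section 5.1.3. The class name ImapUtf7 is made up for illustration and is not taken from gnowsis or Aperture. It assumes java.util.Base64 (Java 8 or later) is available, and it covers only the encoding direction.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper: encodes a folder name in IMAP's modified UTF-7.
final class ImapUtf7 {

    static String encode(String folderName) {
        StringBuilder out = new StringBuilder();
        StringBuilder pending = new StringBuilder(); // current run of non-printable-ASCII chars
        for (int i = 0; i < folderName.length(); i++) {
            char c = folderName.charAt(i);
            if (c >= 0x20 && c <= 0x7e) {
                flush(pending, out);
                if (c == '&') {
                    out.append("&-"); // a literal '&' is written as "&-"
                } else {
                    out.append(c);    // printable US-ASCII represents itself
                }
            } else {
                pending.append(c);
            }
        }
        flush(pending, out);
        return out.toString();
    }

    // Writes the pending run as '&' + modified base64 of its UTF-16BE bytes + '-'.
    private static void flush(StringBuilder pending, StringBuilder out) {
        if (pending.length() == 0) {
            return;
        }
        byte[] utf16 = pending.toString().getBytes(StandardCharsets.UTF_16BE);
        String b64 = Base64.getEncoder().withoutPadding().encodeToString(utf16);
        out.append('&').append(b64.replace('/', ',')).append('-'); // ',' replaces '/', no padding
        pending.setLength(0);
    }
}
```

The result is the mailbox name as the IMAP server expects it; it would still have to be URL-escaped before being placed in an imap:// URI.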
this class does the inverse - parsing URIs and returning MAILAPI
objects for them:
I would recommend we switch the URI generation of the IMAP crawler to
something like the StoreConfig class, so that it is clear how to get
We have a wiki page that gathers some of the problems we faced (but not
enough text there, extend it)
also add more information you find to this page?
I would go for hierarchical access.
A simple workaround is by adding a boolean switch to ImapCrawler that
indicates whether only folder metadata should be generated or also
message metadata. I will add this switch now so I can continue applying
and testing ImapCrawler (already fixed a number of other issues as
well). The default will be to include all metadata, so existing
applications are not affected.
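
Roughly, the switch could look like the sketch below. All names here (FolderNode, FolderCrawlSketch, setIncludeMessages) are invented for illustration and are not the actual ImapCrawler code; the folder is modelled in memory, so no server access is involved.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory stand-in for an IMAP folder.
final class FolderNode {
    final String name;
    final List<FolderNode> subfolders = new ArrayList<>();
    final List<Long> messageUids = new ArrayList<>();

    FolderNode(String name) {
        this.name = name;
    }
}

// Hypothetical sketch of the switch, not the real ImapCrawler.
final class FolderCrawlSketch {
    // Default is true, so existing callers still get all metadata.
    private boolean includeMessages = true;

    void setIncludeMessages(boolean includeMessages) {
        this.includeMessages = includeMessages;
    }

    // Returns one "child partOf parent" statement per subfolder and,
    // only if the switch is on, one per message in the folder.
    List<String> describe(FolderNode folder) {
        List<String> statements = new ArrayList<>();
        for (FolderNode sub : folder.subfolders) {
            statements.add(sub.name + " partOf " + folder.name);
        }
        if (includeMessages) {
            // In a real crawler this is the step that costs network traffic.
            for (long uid : folder.messageUids) {
                statements.add("UID=" + uid + " partOf " + folder.name);
            }
        }
        return statements;
    }
}
```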
again - this is not crawling, it's prefetching structure information for
the user interface,
exactly what HierarchicalAccess was made for.
well, this all sounds like hacks that make the normal thing complicated
and the complicated thing complicated.
It seems to me that this problem may also occur with the other
DataAccessor types. For example, every mail DataAccessor will have the
same problem. Likewise, when you want to create a Windows Explorer-like
component showing a folder tree based on the FileDataAccessor, you may
also want to specify that you're interested in folder metadata only.
A more generic solution is to extend the DataAccessor methods with an
extra boolean parameter or add extra methods with this parameter. As
folders are already given special consideration at the API level (e.g.
the existence of FolderDataObject), I believe this can be justified.
Are there any other solutions imaginable? I thought for a while about
specifying this at the schema level (just indicate which parts you want)
so it can be used to more precisely define the output but this wouldn't
solve this particular problem, as both subfolders and messages are
indicated using the same partOf property.
A good design rationale is to make the simple thing simple (=crawling)
and the complicated thing complicated (=hierarchical access)
and implementing the other class for this special reason is a good way
reuse is nice, but the code gets unmanageable with so many boolean
after all, implementing HierarchicalAccess should not take more than
700 lines, given we put the URI generation in an extra class.
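
To make the idea concrete, here is a sketch of what a separate folder-tree interface could look like next to the crawler. The real HierarchicalAccess interface is not shown in this thread, so FolderTreeAccess, its method, and the in-memory implementation are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical shape only; not the actual HierarchicalAccess interface.
interface FolderTreeAccess {
    // Returns the URIs of the direct subfolders, never the messages.
    List<String> getSubFolders(String folderUri);
}

// In-memory implementation so the sketch runs without a mail server.
final class InMemoryFolderTree implements FolderTreeAccess {
    private final Map<String, List<String>> children = new LinkedHashMap<>();

    void addFolder(String parentUri, String childUri) {
        children.computeIfAbsent(parentUri, k -> new ArrayList<>()).add(childUri);
    }

    @Override
    public List<String> getSubFolders(String folderUri) {
        // Copy the list so callers cannot modify the tree by accident.
        return new ArrayList<>(children.getOrDefault(folderUri, new ArrayList<>()));
    }
}
```

A wizard would call getSubFolders each time the user expands a node, and the crawler interface would stay free of extra boolean parameters.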
Aperture-devel mailing list
DI Leo Sauermann http://www.dfki.de/~sauermann
P.O. Box 2080 Fon: +49 631 205-3503
67608 Kaiserslautern Fax: +49 631 205-3472
Germany Mail: firstname.lastname@example.org