#11 Incorrect refresh behaviour of ImapCrawler

closed-fixed
None
5
2008-04-13
2006-07-31
No

Based on private communication with Bill Shannon
(creator of JavaMail), I've come up with a different
strategy to incrementally crawl an IMAP folder. This
strategy needs to be implemented:

Strategy

- Retrieve all non-deleted messages using
Folder.search(new FlagTerm(new Flags(Flags.Flag.DELETED), false)).
This is preferable to checking the deleted flag of
each mail individually, as prefetching these flags takes
considerable time and using a search lets the *server*
execute this selection. Folder itself uses a naive,
client-side approach, but that is overridden by a
server-based approach in IMAPFolder.

- Determine the subset of non-expunged messages (no
need to prefetch anything).

- Pre-fetch message UIDs for this set.

- When the UIDValidity of the folder has changed or was
not registered (= initial crawl), we need to crawl all
messages, else we try to incrementally crawl it:

- See if the set of retrieved message UIDs is equal to
the set of stored message UIDs (can be done by
calculating message URIs and looking them up in
AccessData). Also see if the set of subfolders is the same.

- Only report a new FolderDataObject when the set of
messages and/or subfolders has changed.

- When the set of stored message UIDs is different,
process every message and see if it needs to be
reported as a new Message or not. Report all stored
message UIDs that are no longer part of this set as
removed.
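
The decision logic in the steps above can be sketched with plain
collections. This is only an illustration, not the code in ImapCrawler:
the class UidDiff, its plan method and the Result fields are hypothetical
names, and the UIDs and UIDValidity are assumed to have been read from the
folder already (in JavaMail, for example, through UIDFolder.getUID and
UIDFolder.getUIDValidity). How stale entries are cleaned up after a
UIDValidity change is not covered by the description, so the sketch leaves
that set empty.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper: the class, method, and field names are illustrative
// and are not taken from ImapCrawler.
final class UidDiff {

    /** Outcome of comparing the stored state of a folder with the server state. */
    static final class Result {
        final boolean fullCrawl;      // true if UIDValidity changed or was never stored
        final boolean folderChanged;  // true if a new FolderDataObject should be reported
        final Set<Long> added;        // UIDs on the server that are not stored yet
        final Set<Long> removed;      // stored UIDs that are no longer on the server

        Result(boolean fullCrawl, boolean folderChanged,
               Set<Long> added, Set<Long> removed) {
            this.fullCrawl = fullCrawl;
            this.folderChanged = folderChanged;
            this.added = added;
            this.removed = removed;
        }
    }

    /**
     * storedValidity is null when no UIDValidity was registered (initial crawl).
     * currentUids is expected to hold only non-deleted, non-expunged messages.
     */
    static Result plan(Long storedValidity, long currentValidity,
                       Set<Long> storedUids, Set<Long> currentUids,
                       boolean subfoldersChanged) {
        // Initial crawl, or the server invalidated its UIDs: crawl everything.
        if (storedValidity == null || storedValidity.longValue() != currentValidity) {
            // Assumption: stale stored entries are handled elsewhere, so
            // "removed" stays empty and the folder is always reported.
            return new Result(true, true, new TreeSet<>(currentUids), new TreeSet<>());
        }

        // Incremental crawl: compare the two UID sets.
        Set<Long> added = new TreeSet<>(currentUids);
        added.removeAll(storedUids);
        Set<Long> removed = new TreeSet<>(storedUids);
        removed.removeAll(currentUids);

        // Report the folder only if its messages and/or subfolders changed.
        boolean messagesChanged = !added.isEmpty() || !removed.isEmpty();
        return new Result(false, messagesChanged || subfoldersChanged, added, removed);
    }
}
```

With an unchanged UIDValidity of 7, stored UIDs {1, 2, 3} and server UIDs
{2, 3, 4}, plan(7L, 7L, ...) gives an incremental result with added = [4],
removed = [1] and folderChanged = true. When both sets are equal and the
subfolders are unchanged, folderChanged is false and nothing is reported.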

Discussion

  • Christiaan Fluit

    Logged In: YES
    user_id=617090

    Issues this should solve:

    - currently, the FolderDataObject is always reported as
    modified when the server does not support .getUIDNext (e.g.
    Aduna's mail server ;) )

    - less information needs to be prefetched because the check
    for deleted messages happens at the server side

    - crawler behaves correctly when UID validity is no longer
    guaranteed

     
  • Christiaan Fluit

    • assigned_to: nobody --> cfmfluit
     
  • Antoni Mylka

    Antoni Mylka - 2008-04-13
    • status: open --> closed-fixed
     
  • Antoni Mylka

    Antoni Mylka - 2008-04-13

    Logged In: YES
    user_id=1613065
    Originator: NO

    Implemented it exactly as described. It's done in the setCurrentFolder method of the ImapCrawler.

    See
    <http://aperture.svn.sourceforge.net/viewvc/aperture/trunk/aperture/src/java/org/semanticdesktop/aperture/crawler/imap/ImapCrawler.java?revision=1221&view=markup#l_424>
    The algorithm implemented there follows the above description pretty carefully.

    There was a little problem with reporting unmodified folders: the method that reports unmodified folders recursively reports all mails and subfolders as unmodified. A little hack was necessary, see

    <http://aperture.svn.sourceforge.net/viewvc/aperture/trunk/aperture/src/java/org/semanticdesktop/aperture/crawler/imap/ImapCrawler.java?revision=1221&view=markup#l_509>

     
