#43 incremental IMAP crawling broken

1.3.0 - bugs
Antoni Mylka
crawlers (23)

...seems to be broken. It starts again from scratch - nevertheless, the history files are generated.


    • summary: incremental IMAP crawling --> incremental IMAP crawling broken
    • milestone: --> 533940
  • Antoni Mylka

    I assume that "starts again from scratch" means that all messages are reported as new on the second crawl.

    I tested it on Gmail and the DFKI IMAP server:
    - the messages were correctly reported as unmodified
    - the folders were always reported as modified

    I fixed two minor bugs: folders without subfolders are now correctly reported as unmodified, and folders without messages no longer throw any errors. Right now I can crawl a mailbox (all messages are reported as new) and then crawl it once more - all messages and folders are reported as unmodified. It works on Gmail and on the DFKI server (well, Gmail has the problem described in [2005759], but the DFKI server works flawlessly).

    Please test, and make sure that the access data is configured correctly.

  • I have tested it: in the end, it works on our test mail account, which uses an mbox structure, but it does not work on my own account, which is based on mh... maybe there is a relationship.

    In some folders, I always get the following exception:
    02.07.2008 17:17:37 WARNING: Exception while crawling folder imap://reuschling@serv-4100/Mail%2fDFKI%5fPEEK;TYPE=LIST <<<< from AbstractJavaMailCrawler.crawlSingleFolder(..) [Thread 24]
    javax.mail.FolderClosedException: Lost folder connection to server
    at com.sun.mail.imap.IMAPFolder.checkOpened(IMAPFolder.java:328)
    at com.sun.mail.imap.IMAPFolder.getUID(IMAPFolder.java:1814)
    at org.semanticdesktop.aperture.crawler.imap.ImapCrawler.getMessageUri(ImapCrawler.java:901)
    at org.semanticdesktop.aperture.crawler.mail.AbstractJavaMailCrawler.getCurrentFolderObject(AbstractJavaMailCrawler.java:630)
    at org.semanticdesktop.aperture.crawler.mail.AbstractJavaMailCrawler.crawlSingleFolder(AbstractJavaMailCrawler.java:256)
    at org.semanticdesktop.aperture.crawler.mail.AbstractJavaMailCrawler.crawlFolder(AbstractJavaMailCrawler.java:212)
    at org.semanticdesktop.aperture.crawler.imap.ImapCrawler.crawlObjects(ImapCrawler.java:157)
    at org.semanticdesktop.aperture.crawler.base.CrawlerBase.crawl(CrawlerBase.java:216)
    at de.dfki.catwiesel.synchronizer.importer.aperture.imap.ImapImporter.startImport(ImapImporter.java:153)
    at de.dfki.catwiesel.synchronizer.importer.ImporterInputQueue.startImport(ImporterInputQueue.java:110)
    at de.dfki.catwiesel.CatwieselDocumentStore.importData(CatwieselDocumentStore.java:720)
    at org.dynaq.index.Indexer.performCatwieselIndexing(Indexer.java:1150)
    at org.dynaq.index.Indexer.createIndex(Indexer.java:501)
    at org.dynaq.config.ImportConfigView$2.run(ImportConfigView.java:299)
    at java.lang.Thread.run(Thread.java:619)
    02.07.2008 17:17:37 INFO: Given resource was processed completely. Crawled 0 objects (exit code: completed) <<<< from CatwieselCrawlerHandler.crawlStopped(..) [Thread 24]

    ...so maybe some exceptions prevent the history file entries from being written. I tried two notations:
    #mh/DFKI_PEEK (which is one of my folders, of course) and Mail/DFKI_PEEK, but both of them lead to the exception. I also have the gut feeling that this is somehow folder-specific: some very big folders were indexed, some small ones were not...
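
    A rough illustration of the failing call and one way a crawler could guard it in plain JavaMail - only a sketch, the reopen-and-retry idea is my assumption, not the actual Aperture code:

    import javax.mail.Folder;
    import javax.mail.FolderClosedException;
    import javax.mail.Message;
    import javax.mail.MessagingException;
    import com.sun.mail.imap.IMAPFolder;

    public class UidFetchSketch {
        // Fetches the UID of a message, reopening the folder once if the
        // connection to the server was lost in the meantime.
        static long getUidWithRetry(IMAPFolder folder, Message message) throws MessagingException {
            try {
                return folder.getUID(message);
            } catch (FolderClosedException e) {
                // the server dropped the connection: reopen and retry once;
                // note that the message number may shift if mails were expunged
                if (!folder.isOpen()) {
                    folder.open(Folder.READ_ONLY);
                }
                Message fresh = folder.getMessage(message.getMessageNumber());
                return folder.getUID(fresh);
            }
        }
    }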

  • Antoni Mylka

    Once again I tried to reproduce this bug, but failed.
    The "history files" you've sent don't contain the <aperture.timestamp> tags.
    That's a feature we added before the last release. It's really weird, because the only reason I can think of why those tags would be missing is that you have a version of Aperture that's at least two months old - which you say you don't.

    To push the matter further, I would ask you to run the crawl through the imapcrawler script in the bin folder in debug mode.
    I've prepared two scripts: imapcrawlerdebug.bat and .sh. They generate a file called imap-debug-output.txt, which contains all IMAP messages exchanged between the server and the crawler. Please do something like this:

    1. checkout the sources
    2. type ant testbuild
    3. go to the bin folder
    4. run

    imapcrawlerdebug.bat --native imapnative --accessDataFile imapaccessdata.zip --server <server-name> --username <login> --password <password> --sslnocert --folder <folder-name> -v

    Replace <server-name>, <login>, <password> and <folder-name> with appropriate values.

    After running this command, create copies of imap-debug-output.txt and imapaccessdata.zip.

    Then run the same thing a second time with exactly the same arguments.

    You should get a report with unmodified emails (or an error).

    Send me the first versions of imapaccessdata.zip and imap-debug-output.txt together with the second versions of those files.

    REMEMBER TO OBFUSCATE the imap-debug-output, since it will contain your login name, your password and the subject lines of all emails in the given folder. Please choose a folder that's not too sensitive (like a mailing list). I will not upload it anywhere.

    I hope we will see some pattern there...
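
    For reference, the debug output is nothing magical - it's JavaMail's own protocol trace redirected to a file. Presumably the scripts do roughly the equivalent of this sketch (the property and method names are standard JavaMail, the rest is just an illustration, not the scripts' actual code):

    import java.io.FileOutputStream;
    import java.io.PrintStream;
    import java.util.Properties;
    import javax.mail.Session;
    import javax.mail.Store;

    public class ImapDebugSketch {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.setProperty("mail.store.protocol", "imap");

            Session session = Session.getInstance(props);
            // write the full IMAP protocol exchange to a file instead of stderr
            session.setDebug(true);
            session.setDebugOut(new PrintStream(new FileOutputStream("imap-debug-output.txt")));

            Store store = session.getStore("imap");
            store.connect(args[0], args[1], args[2]); // server, login, password
            // ... open and crawl the folder here ...
            store.close();
        }
    }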

    It seems to be some sort of server misconfiguration or a JavaMail bug; it may not be possible for us to provide a workaround at the Aperture level.

    And as far as the exception is concerned:
    1. does it break the crawl?
    2. does it happen only on the second crawl, or on both?

  • Antoni Mylka

    Removed this issue from the 1.2.0.beta group.

  • Antoni Mylka

    • milestone: 533940 -->
  • Antoni Mylka

    Implemented a workaround in r1456. The ignoreUidValidity switch in the IMAP data source lets the crawler ignore the UID validity, which allows it to work correctly on mh-backed IMAP stores, under the condition that no mails have been deleted between crawls (see the sketch below).

    Please confirm.
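
    The semantics of the switch boil down to roughly this sketch - storedUidValidity stands in for whatever the access data remembers from the previous crawl and is purely illustrative:

    import javax.mail.MessagingException;
    import com.sun.mail.imap.IMAPFolder;

    public class UidValiditySketch {
        // Decides whether the UIDs remembered from the previous crawl
        // may be reused for incremental crawling of this folder.
        static boolean canReuseStoredUids(IMAPFolder folder, long storedUidValidity,
                boolean ignoreUidValidity) throws MessagingException {
            long current = folder.getUIDValidity();
            // mh-backed stores hand out a fresh UIDVALIDITY value on every session,
            // so without the switch every crawl looks like a full re-crawl.
            // Caveat: with the switch on, deletions between crawls can map the
            // stored UIDs to the wrong messages.
            return ignoreUidValidity || current == storedUidValidity;
        }
    }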

  • This workaround seems to be the best we can do quickly in this situation. Maybe some day we will find a solution, possibly depending not on the UID but on the date/subject, etc., that also supports deletions.

    The drawback of this workaround is clear: to offer a robust application, you must choose between:

    - ignoreUidValidity without FileAccessData - we are in the same situation as before, NO incremental indexing (maybe with some possible side effects; I don't know how Aperture decides which entry it should update/overwrite, or 'insert as new' - looks like the same problem)
    - ignoreUidValidity with FileAccessData - incremental indexing works, but the wrong emails will be deleted - and deleting mails is a common use case
    - no ignoreUidValidity - no incremental indexing, which sounds like the first situation

    ...and none of these options is robust

    My impression is that this can only be a temporary workaround. Just my 3 cents.

    • status: open --> open-remind
  • Antoni, today I remembered the mh situation, and I have a question:
    Is there a special need to use the server UID, which is in some cases not robust, instead of the email Message-ID (from the email header), which is unique in any case?
    Maybe you chose this approach for performance reasons... but couldn't we fall back to the mail Message-ID in the case where the UID validity is not reliable?

  • Antoni Mylka

    That might be a good idea...

    The use of UIDs is mandated by the IMAP URI scheme defined (in its newest version) in http://www.rfc-editor.org/rfc/rfc5092.txt. Without UIDVALIDITY the UIDs are useless, though, so falling back to Message-IDs might be a good idea. You'd have to accept that the IMAP accessor will have to perform a linear search in this case - roughly like the sketch below.
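
    Something like this, in plain JavaMail (the helper is illustrative, not existing Aperture code); javax.mail.search.MessageIDTerm would let the server do the search instead, but on a linear mail store the cost is much the same:

    import javax.mail.Folder;
    import javax.mail.Message;
    import javax.mail.MessagingException;

    public class MessageIdLookupSketch {
        // Linear search for a message by its Message-ID header - what a
        // fallback to Message-ID based URIs implies when UIDs can't be trusted.
        static Message findByMessageId(Folder folder, String messageId) throws MessagingException {
            for (Message message : folder.getMessages()) {
                String[] ids = message.getHeader("Message-ID");
                if (ids != null && ids.length > 0 && messageId.equals(ids[0])) {
                    return message;
                }
            }
            return null; // not found - presumably deleted
        }
    }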

    I'll forward this question to aperture-dev and see what others think.

    BTW, I assigned this issue to myself, because I haven't received any notification of your question. These trackers are lousy...

  • Antoni Mylka

    • assigned_to: nobody --> mylka
  • Antoni Mylka

    I've tested the crawler on the account you made available to me and I can't reproduce it anymore. It seems that the server has been reconfigured. Please see if it's possible to get another account backed by an 'mh' message store.

  • Antoni Mylka

    I've fixed it. Here's what I did.

    I added a user account on my Ubuntu laptop. Then I set up an .mh_profile file with two mh folders, using the exmh tool. This mh profile had a Path: MhMail entry.

    Then I downloaded the UW IMAP server from its original source. I tried for about 8 hours to do something with the Ubuntu-packaged version of uw-imapd but was unable to get anywhere. I downloaded the source, read the help file about configuring the daemon, set it up to keep the mail in a Mail subfolder of the home folder, and manually added the appropriate paths to inetd.conf, without installing the imapd file anywhere outside the home folder. It took a while to get all required libraries, but eventually it did work. So I had an IMAP server with two folders backed by MH storage.

    It turned out that the only way to reliably test for UID stickiness is to get at the internal Sun JavaMail provider class called IMAPProtocol; it allowed me to issue the EXAMINE <foldername> command and parse the responses myself to detect the UIDNOTSTICKY flag, roughly as in the sketch below.
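
    The check looks roughly like this (simplified - the real code parses the EXAMINE responses more carefully, and quoting of the folder name is glossed over here):

    import javax.mail.MessagingException;
    import com.sun.mail.iap.Argument;
    import com.sun.mail.iap.ProtocolException;
    import com.sun.mail.iap.Response;
    import com.sun.mail.imap.IMAPFolder;
    import com.sun.mail.imap.protocol.IMAPProtocol;

    public class UidStickinessSketch {
        // Issues EXAMINE <foldername> through the underlying IMAPProtocol and
        // looks for the UIDNOTSTICKY response code in the untagged replies.
        static boolean hasStickyUids(final IMAPFolder folder) throws MessagingException {
            return (Boolean) folder.doCommand(new IMAPFolder.ProtocolCommand() {
                public Object doCommand(IMAPProtocol protocol) throws ProtocolException {
                    Argument args = new Argument();
                    args.writeString(folder.getFullName());
                    for (Response response : protocol.command("EXAMINE", args)) {
                        if (response.toString().indexOf("UIDNOTSTICKY") >= 0) {
                            return Boolean.FALSE;
                        }
                    }
                    return Boolean.TRUE;
                }
            });
        }
    }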

    Now the UID stickiness is tested every time. If the UIDs aren't sticky, the IMAP crawler generates URIs based on the Message-ID header and the hash of the message (I extracted the URI-generation code from MboxCrawler into a common method in the MailUtil class; it is now used by both crawlers). The sketch after this paragraph shows the general idea.
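
    In a sketch (the exact URI format produced by MailUtil may differ; this is only an illustration of the Message-ID-plus-hash idea):

    import java.io.ByteArrayOutputStream;
    import java.math.BigInteger;
    import java.net.URLEncoder;
    import java.security.MessageDigest;
    import javax.mail.internet.MimeMessage;

    public class MessageUriSketch {
        // Builds a stable URI for a message from its Message-ID header and a
        // digest of its raw content, instead of the (non-sticky) IMAP UID.
        static String buildUri(String folderUri, MimeMessage message) throws Exception {
            String[] ids = message.getHeader("Message-ID");
            String messageId = (ids != null && ids.length > 0) ? ids[0] : "no-message-id";

            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            message.writeTo(buffer);
            byte[] digest = MessageDigest.getInstance("MD5").digest(buffer.toByteArray());
            String hash = new BigInteger(1, digest).toString(16);

            return folderUri + "/" + URLEncoder.encode(messageId, "UTF-8") + "-" + hash;
        }
    }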

    I also rewrote the DataAccessor implementation to work correctly with non-sticky-UID folders. This implementation is crappy from a performance point of view, but it works. It may be optimized later if need be.

    All kinds of comments welcome.

    Christian, please confirm.

  • Antoni Mylka

    After more than a year of waiting, I consider this issue fixed in rev 1991. Please reopen if it's not.

  • Antoni Mylka

    • milestone: --> 1.3.0 - bugs
    • status: open-remind --> closed-fixed