From: Leo S. <leo...@df...> - 2009-11-16 22:31:55
and just for the sake of fun: IMAP sugks :-)

if you don't know why, read this seminal source code comment, which ended up
in the list of the best source code comments ever:
http://stackoverflow.com/questions/184618/what-is-the-best-comment-in-source-code-you-have-ever-encountered/185998#185998

I can't copy it to this mailing list, because of profanity.
(do I need to say more...)

best
Leo

It was Leo Sauermann who said at the right time 16.11.2009 18:31 the following words:
> It was Antoni Mylka who said at the right time 15.11.2009 19:12 the
> following words:
>> Aperturians
>>
>> Update, on my test folder the new 'useHeadersHash' feature skipped
>> exactly 8 emails that had near-duplicates. All this after fixing the
>> hash conflicts issue 2897819. No original content was lost.
>>
>> I crawled a test folder on an IMAP server I set up on my own laptop,
>> backed by the mh storage. I copied 2638 emails from my own 'people'
>> folder which contained my entire non-work correspondence from the last
>> three years, so it's realistic. These are the results of crawling that
>> folder with the useHeadersHash flag turned on and off.
>>
>>                  on      off
>> time [s]         400     4494
>> objects found    3355    3369 (was 3361 before the 2897819 fix)
>>
>> As I said, no original content was lost and the speedup is considerable.
>> What should be the default?
>>
>
> speed should be default!
>
> both of you argue that the loss is ok, and looking at the flames we
> get in nepomuk-kde for being slow, speed is essential.
>
> best
> Leo
>> Antoni Mylka
>> ant...@gm...
>>
>> Antoni Mylka wrote:
>>
>>> Aperturians
>>>
>>> I've been made aware of a problem with the ImapCrawler. After fixing the
>>> age-old issue 1989505 (http://bit.ly/4ruQDC), incremental crawling of
>>> mh-backed IMAP folders started working correctly but is VERY slow.
>>>
>>> Today I applied a fix. It introduces a flag in ImapDataSource named
>>> useHeadersHash. My question to you is: "What should be its default value?"
>>>
>>> When it is set to false - nothing changes. On mh-backed folders the UIDs
>>> are unreliable and can't be used to generate message URIs. Therefore I
>>> use the form <folderPath>/<Message-ID>. Unfortunately there are many
>>> cases when the same Message-ID appears multiple times in a folder, so I
>>> extended it to <folderPath>/<Message-ID>-<messageHash>. The hash is a
>>> rolling Adler32 hash of the entire content of the message (just as in
>>> mbox).
>>>
>>> When the flag is set to true, the hash is computed as an SHA hash of
>>> the concatenation of values of selected headers (Message-ID, Date and
>>> Received for the time being). One important advantage is that this makes
>>> it possible to prefetch those values with a single call, and computing a
>>> URI doesn't require us to download the entire message. This speeds up
>>> the crawling by a factor of 30 or more (both initial and incremental; in
>>> one of my tests it was 90 seconds vs an hour, with exactly the same
>>> output).
>>>
>>> Sometimes the same message, byte-by-byte identical, appears twice in the
>>> same folder. There were on the order of 10-20 such cases in my own
>>> Thunderbird profile. If this happens - the crawler (both IMAP and mbox)
>>> simply DISREGARDS the second occurrence. Therefore it IS POSSIBLE to have
>>> more emails shown in Thunderbird than in crawl results. This happens
>>> regardless of the flag setting.
>>>
>>> Now comes the problem. It concerns the class of messages with the same
>>> ID, but with different content. There are many cases where this might
>>> happen (someone replying to a list with a cc to the sender, people
>>> sending stuff with a cc to themselves, which later lands in the same
>>> folder etc.). In most cases such email pairs differ by Date or the
>>> server trail in Received headers; this produces a different hash, which
>>> yields a different URI and we have two DISTINCT data objects. The new
>>> scheme works correctly in most cases.
>>> I tested it with a folder
>>> containing all ~2600 non-work-related mails I got in the last three
>>> years and noticed no errors. It is theoretically possible though that
>>> the crawler would skip near-duplicates.
>>>
>>> Therefore the question to you: what is more important - having it faster
>>> by an order of magnitude, or preserving the tiny fraction of
>>> near-duplicate emails? On one hand we have the general public, which
>>> probably doesn't care because
>>> - omitted emails are so few
>>> - they would rarely contain any useful original content.
>>>   My folder had three such pairs.
>>> - In case when one near-duplicate has attachments and the other one
>>>   doesn't - the attachments will never be lost regardless of which mail
>>>   comes first, though if one has one attachment and the other one has
>>>   a different attachment - the second attachment will be lost. None
>>>   of the three omitted emails I found yielded any data loss.
>>> On the other hand we have the forensics use case where every byte
>>> counts, however small the potential for data loss might be.
>>>
>>> For the time being the default is false; if you want the speedup, you need
>>> to ask for it. In my opinion though - true would be a more common-sense
>>> default, good enough for 99% of users.
>>>
>>> All kinds of comments welcome
>>>
>>> Antoni Mylka
>>> ant...@gm...
>>
>> _______________________________________________
>> Aperture-devel mailing list
>> Ape...@li...
>> https://lists.sourceforge.net/lists/listinfo/aperture-devel
-- 
_____________________________________________________
Dr. Leo Sauermann       http://www.dfki.de/~sauermann

Deutsches Forschungszentrum fuer
Kuenstliche Intelligenz DFKI GmbH
Trippstadter Strasse 122
P.O. Box 2080           Fon:   +43 6991 gnowsis
D-67663 Kaiserslautern  Fax:   +49 631 20575-102
Germany                 Mail:  leo...@df...

Management:
Prof.Dr.Dr.h.c.mult. Wolfgang Wahlster (Chairman)
Dr. Walter Olthoff
Chairman of the Supervisory Board:
Prof. Dr. h.c. Hans A. Aukes
Amtsgericht Kaiserslautern, HRB 2313
_____________________________________________________
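
For anyone weighing the trade-off discussed in the quoted thread, the two URI schemes can be sketched roughly as below. This is only an illustration in Python, not Aperture's code: the function names, the choice of SHA-1 (the thread only says "an SHA hash"), the hex formatting of the hash and the exact way header values are concatenated are all assumptions. It shows why the header-based variant needs only the headers, and why two messages with identical Message-ID, Date and Received values but different bodies would collapse into one URI.

```python
import hashlib
import zlib
from email import message_from_bytes

# Headers named in the thread as input to the header-based hash.
HASHED_HEADERS = ("Message-ID", "Date", "Received")


def uri_from_content(folder_path: str, raw_message: bytes) -> str:
    """Flag off (hypothetical sketch): Adler-32 over the entire message.

    The whole message has to be downloaded before the URI can be computed.
    """
    message_id = message_from_bytes(raw_message).get("Message-ID", "")
    checksum = zlib.adler32(raw_message) & 0xFFFFFFFF
    return f"{folder_path}/{message_id}-{checksum:08x}"


def uri_from_headers(folder_path: str, raw_headers: bytes) -> str:
    """Flag on (hypothetical sketch): SHA-1 over selected header values.

    Only the header block is needed, so it can be prefetched cheaply.
    SHA-1 is an assumption; the thread does not name the SHA variant.
    """
    headers = message_from_bytes(raw_headers)
    digest = hashlib.sha1()
    for name in HASHED_HEADERS:
        # A message usually carries several Received headers; hash them all.
        for value in headers.get_all(name, []):
            digest.update(str(value).encode("utf-8", "replace"))
    message_id = headers.get("Message-ID", "")
    return f"{folder_path}/{message_id}-{digest.hexdigest()}"


if __name__ == "__main__":
    head = (
        b"Message-ID: <abc@example.org>\r\n"
        b"Date: Mon, 16 Nov 2009 18:31:00 +0100\r\n"
        b"Received: from a.example.org by b.example.org\r\n"
    )
    first = head + b"\r\noriginal body\r\n"
    second = head + b"\r\nbody with a list footer appended\r\n"
    # Content hash keeps both near-duplicates apart ...
    print(uri_from_content("people", first) != uri_from_content("people", second))
    # ... while the header hash maps them to the same URI.
    print(uri_from_headers("people", first) == uri_from_headers("people", second))
```

With the content hash every byte matters, which suits the forensics case; with the header hash a near-duplicate is only kept when at least one of the three headers differs.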