Re: [Aperture-devel] Incremental crawling of mh-backed folders

and just for the sake of fun:

IMAP sugks :-)

if you don't know why, read this seminal source code comment, which ended
up on the list of the best source comments ever:

http://stackoverflow.com/questions/184618/what-is-the-best-comment-in-source-code-you-have-ever-encountered/185998#185998

I can't copy it to this mailing list because of the profanity.
(do I need to say more...)

best
Leo

It was Leo Sauermann who said at the right time 16.11.2009 18:31 the
following words:
> It was Antoni Mylka who said at the right time 15.11.2009 19:12 the
> following words:
>> Aperturians
>>
>> Update: on my test folder the new 'useHeadersHash' feature skipped
>> exactly 8 emails, which had near-duplicates. All this after fixing the
>> hash conflicts issue 2897819. No original content was lost.
>>
>> I crawled a test folder on an IMAP server I set up on my own laptop,
>> backed by the mh storage. I copied 2638 emails from my own 'people'
>> folder, which contained my entire non-work correspondence from the last
>> three years, so it's realistic. These are the results of crawling that
>> folder with the useHeadersHash flag turned on and off.
>>
>>                  on              off
>> time [s]         400             4494
>> objects found    3355            3369 (was 3361 before the 2897819 fix)
>>
>> As I said, no original content was lost and the speedup is considerable.
>> What should be the default?
>>   
>
> speed should be default!
>
> both of you argue that the loss is ok, and looking at the flames we
> get in nepomuk-kde for being slow, speed is essential.
>
> best
> Leo
>> Antoni Mylka
>> ant...@gm...
>>
>> Antoni Mylka wrote:
>>   
>>> Aperturians
>>>
>>> I've been made aware of a problem with the ImapCrawler. After fixing the
>>> age-old issue 1989505, (http://bit.ly/4ruQDC), incremental crawling of
>>> mh-backed imap folders started working correctly but is VERY slow.
>>>
>>> Today I applied a fix. It introduces a flag in ImapDataSource named
>>> useHeadersHash. My question to you is: "What should be its default value?"
>>>
>>> When it is set to false - nothing changes. On mh-backed folders the UIDs
>>> are unreliable and can't be used to generate message uris. Therefore I
>>> use the form <folderPath>/<Message-ID>. Unfortunately there are many
>>> cases when the same Message-ID appears multiple times in a folder, so I
>>> extended it to <folderPath>/<Message-ID>-<messageHash>. The hash is a
>>> rolling Adler32 hash of the entire content of the message (just as in
>>> mbox).
>>>
>>> When the flag is set to true, the hash is computed as an SHA hash of
>>> the concatenation of values of selected headers (Message-ID, Date and
>>> Received for the time being). One important advantage is that those
>>> values can be prefetched with a single call, so computing a URI doesn't
>>> require us to download the entire message. This speeds up the crawling
>>> by a factor of 30 or more, both initial and incremental. In one of my
>>> tests it was 90 seconds vs an hour, with exactly the same output.
>>>
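The two URI forms described above can be sketched roughly as follows. This is an illustration in Python, not Aperture's Java code: the function names, the hex formatting of the hash and the choice of SHA-1 (the post only says "SHA") are assumptions, and the sample message is made up.

```python
import hashlib
import zlib
from email import message_from_bytes

# Header names come from the thread; everything else here is illustrative.
HASHED_HEADERS = ("Message-ID", "Date", "Received")


def uri_from_content(folder_path: str, raw: bytes) -> str:
    """<folderPath>/<Message-ID>-<Adler32 of the entire raw message>."""
    message_id = message_from_bytes(raw)["Message-ID"]
    checksum = zlib.adler32(raw) & 0xFFFFFFFF
    return f"{folder_path}/{message_id}-{checksum:08x}"


def uri_from_headers(folder_path: str, headers: dict) -> str:
    """<folderPath>/<Message-ID>-<SHA-1 of the selected header values>."""
    digest = hashlib.sha1()  # assumption: the thread only says "SHA"
    for name in HASHED_HEADERS:
        for value in headers.get(name, []):
            # Feeding values one by one equals hashing their concatenation.
            digest.update(value.encode("utf-8"))
    return f"{folder_path}/{headers['Message-ID'][0]}-{digest.hexdigest()}"


# Made-up sample data: header values as lists, since Received may repeat.
HEADERS = {
    "Message-ID": ["<1@example.org>"],
    "Date": ["Mon, 16 Nov 2009 18:31:00 +0100"],
    "Received": ["from mx.example.org by imap.example.org"],
}
RAW = (
    b"Message-ID: <1@example.org>\r\n"
    b"Date: Mon, 16 Nov 2009 18:31:00 +0100\r\n"
    b"Received: from mx.example.org by imap.example.org\r\n"
    b"\r\n"
    b"hello\r\n"
)

print(uri_from_content("people", RAW))      # needs the full message
print(uri_from_headers("people", HEADERS))  # needs only prefetched headers
```

The second function never touches the message body, which is why a crawler using it can build URIs from a header prefetch alone.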
>>> Sometimes the same message, byte-by-byte identical, appears twice in the
>>> same folder. There were on the order of 10-20 such cases in my own
>>> thunderbird profile. If this happens - the crawler (both imap and mbox)
>>> simply DISREGARDS the second occurrence. Therefore it IS POSSIBLE to have
>>> more emails shown in thunderbird than in crawl results. This happens
>>> regardless of the flag setting.
>>>
>>> Now comes the problem. It concerns the class of messages with the same
>>> ID, but with different content. There are many cases where this might
>>> happen (someone replying to a list with a cc to the sender, people
>>> sending stuff with a cc to themselves, which later lands in the same
>>> folder etc.). In most cases such email pairs differ by Date or the
>>> server trail in Received headers. This produces a different hash, which
>>> yields a different URI and we have two DISTINCT data objects. The new
>>> scheme works correctly in most cases. I tested it with a folder
>>> containing all ~2600 non-work-related mails I got in the last three
>>> years and noticed no errors. It is theoretically possible though that
>>> the crawler would skip near-duplicates.
>>>
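A toy model of the skipping behaviour may make the trade-off easier to see. It is only a sketch: the message dictionaries, the key functions and the use of SHA-1 are assumptions for illustration and do not mirror the ImapCrawler code.

```python
import hashlib
import zlib


def crawl(messages, key):
    """Keep the first message for each key; later ones are disregarded."""
    seen = set()
    kept = []
    for msg in messages:
        k = key(msg)
        if k in seen:
            continue  # same URI key as an earlier message: skipped
        seen.add(k)
        kept.append(msg)
    return kept


def header_key(msg):
    """Key built from Message-ID, Date and Received only (flag on)."""
    parts = [msg["Message-ID"], msg["Date"], *msg["Received"]]
    digest = hashlib.sha1("".join(parts).encode("utf-8")).hexdigest()
    return (msg["Message-ID"], digest)


def content_key(msg):
    """Key built from a checksum over headers and body (flag off)."""
    raw = "\n".join(
        [msg["Message-ID"], msg["Date"], *msg["Received"], "", msg["body"]]
    )
    return (msg["Message-ID"], zlib.adler32(raw.encode("utf-8")))


original = {
    "Message-ID": "<42@example.org>",
    "Date": "Sun, 15 Nov 2009 19:12:00 +0100",
    "Received": ["from list.example.org"],
    "body": "original text",
}
# Same ID, different Received trail: both key functions tell it apart.
cc_copy = dict(original, Received=["from mx.example.org"])
# Same ID and identical selected headers, different body: a near-duplicate.
near_duplicate = dict(original, body="original text plus a list footer")

messages = [original, cc_copy, near_duplicate]
kept_fast = crawl(messages, header_key)
kept_full = crawl(messages, content_key)
print(len(kept_fast), len(kept_full))
```

In this model the header-based key keeps two of the three messages and the content-based key keeps all three, which is the theoretical loss discussed above.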
>>> Therefore the question to you: what is more important - having it faster
>>> by an order of magnitude, or preserving the tiny fraction of
>>> near-duplicate emails? On one hand we have the general public, which
>>> probably doesn't care because
>>>  - omitted emails are so few
>>>  - they would rarely contain any useful original content.
>>>    My folder had three such pairs.
>>>  - In case when one near-duplicate has attachments and the other one
>>>    doesn't - the attachments will never be lost regardless of which mail
>>>    comes first, though if one has one attachment and the other one has
>>>    a different attachment - the second attachment will be lost. None
>>>    of the three omitted emails I found yielded any data loss.
>>> On the other hand we have the forensics use case where every byte
>>> counts, however small the potential for data loss might be.
>>>
>>> For the time being the default is false; if you want the speedup, you
>>> need to ask for it. In my opinion though - true would be a more
>>> common-sense default, good enough for 99% of users.
>>>
>>> All kinds of comments welcome
>>>
>>> Antoni Mylka
>>> ant...@gm...
>>>
>>>     
>>
>>
>> _______________________________________________
>> Aperture-devel mailing list
>> Ape...@li...
>> https://lists.sourceforge.net/lists/listinfo/aperture-devel
>>   

-- 
_____________________________________________________
Dr. Leo Sauermann       http://www.dfki.de/~sauermann 

Deutsches Forschungszentrum fuer 
Kuenstliche Intelligenz DFKI GmbH
Trippstadter Strasse 122
P.O. Box 2080           Fon:   +43 6991 gnowsis
D-67663 Kaiserslautern  Fax:   +49 631 20575-102
Germany                 Mail:  leo...@df...

Geschaeftsfuehrung:
Prof.Dr.Dr.h.c.mult. Wolfgang Wahlster (Vorsitzender)
Dr. Walter Olthoff
Vorsitzender des Aufsichtsrats:
Prof. Dr. h.c. Hans A. Aukes
Amtsgericht Kaiserslautern, HRB 2313
_____________________________________________________