From: Matthias A. <mat...@gm...> - 2006-05-18 17:59:14
(collecting replies to two postings)

Sunil Shetye <sh...@bo...> writes in response to Frederic Marchal's
text (not cited):

[Frederic suggested saving the (UID,size) pair to detect message
recycling.]

> There are a few objections to this:
>
> - fetchmail currently does not store the size in the fetchids file, so
>   doing this will require a modification of the fetchids file format.

fetchmail's uid.c code (just revised a bit so as to be more readable)
does the equivalent of (Perl RE syntax):

        __user__|domain_        ___id____
  [ \t]*([^@]\*)@([^ ]+)[ \t]* +([^ \t]*).*

A typical line that can be parsed looks like this:

  some user@re...@po... d3b07384.d113edec

And the parser since 6.3.0 is robust, so we can append a blank or TAB
after the ID, followed by anything:

  some user@re...@po... d3b07384.d113edec foo=bar baz=23|joe=mary

Saving any information such as deleted flags, size, or retrieval date is
thus backwards compatible in the sense that older fetchmail versions
would just ignore the extra information and strip it when rewriting the
ID. Exception: scratch lists are saved verbatim, i.e. for unqueried
hosts, the extra information would be kept.

> - fetchmail does not get the size of all mails right now (in the
>   default setup), only of the mails it is going to download. Your
>   suggestion would imply that fetchmail will have to get UIDs as well
>   as sizes of all mails for comparison. This would imply a lot of
>   delay in getting mails, especially when 'keep' is on.

1. POP3 "LIST" is fast, has affordable traffic requirements, and is part
   of regular fetchmail operation.

2. The size is, as suggested, only needed for messages that are supposed
   to be deleted after retrieval. Discounting --flush, fetchmail marks
   the message for deletion after retrieving it, and at that time, the
   POP3 server's notion of the message size is known. If the size is
   cached in the .fetchids file, that part is covered.

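To make the .fetchids line format described above concrete, here is a
minimal Python sketch. It is not fetchmail's uid.c code. It assumes that
the "\*" in the pattern is meant as a plain "*" quantifier and that extra
fields would be blank-separated key=value tokens, as in the foo=bar
example. The function name, the "size" key and the sample host are made
up for illustration.

```python
import re

# Pattern from the post, reading "\*" as a plain "*" quantifier (assumption).
# The trailing ".*" is captured here so the extra fields can be inspected.
FETCHIDS_LINE = re.compile(r"[ \t]*([^@]*)@([^ ]+)[ \t]* +([^ \t]*)(.*)")


def parse_fetchids_line(line):
    """Split one line into (user, host, uid, extras), or None if unparsable.

    'extras' holds hypothetical key=value tokens found after the ID.
    A reader that ignores everything after the ID would simply drop it.
    """
    match = FETCHIDS_LINE.match(line.rstrip("\n"))
    if match is None:
        return None
    user, host, uid, rest = match.groups()
    extras = {}
    for token in rest.split():
        key, sep, value = token.partition("=")
        if sep:  # keep only tokens of the form key=value
            extras[key] = value
    return user, host, uid, extras


if __name__ == "__main__":
    # Hypothetical sample line; "size" is not an existing fetchmail field.
    print(parse_fetchids_line("alice@pop.example.org d3b07384.d113edec size=1234"))
```

The first three fields come out the same whether or not anything follows
the ID, which is the backwards compatibility argument made above.
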
For new messages, the size will have to be retrieved anyways sooner or
later to support ESMTP SIZE.

> - Some mailservers keep the flags of a mail in the mail itself by
>   adding a header like Status:. So, the size of a mail may actually
>   change when it turns from 'new' to 'old'. Due to size mismatch,
>   mails from such mailservers will get downloaded again.

That is a rather important concern. The refinement of the suggestion
would be to hash all but a few non-constant headers with some decent
hash function. MD5 would be simple, but we shouldn't hardcode anything
here.

> - Some mailservers misreport sizes! I remember one such server being
>   reported:
>
>   <http://lists.ccil.org/pipermail/fetchmail-friends/2002-January/005572.html>

Can we assume that the size is stable even if inaccurate?

Frederic Marchal <fre...@wo...> replied:

> For which a change was proposed if I remember correctly... It was even
> suggested to use a database such as sqlite. I admit I haven't followed
> the status of that proposition. Beside, there is no hurry. It doesn't
> have to be a patch within two days. It may be part of a major version
> change just to make fetchmail even better :-)

sqlite won't happen for fetchmail 6.3.X. Extending .fetchids as
suggested above might.

> processing would remain the same. The fetchids file may even be purged
> of the old UID when the next poll confirms the mails are really gone.

That happens automatically the next time that POP3 chat with the server
completes successfully.

>> - Some mailservers keep the flags of a mail in the mail itself by
>>   adding a header like Status:. So, the size of a mail may actually
>>   change when it turns from 'new' to 'old'. Due to size mismatch,
>>   mails from such mailservers will get downloaded again.
> You got me there :-) In that case, the mail would be downloaded a second
> time and deleted.

The "message of same size" (say, automated messages in a fixed format,
stock reports, particular logs) problem isn't solved though - and they
might even get the same UID if the server used just the MD5 or a file
name with a recycled inode number...

The question is (1) if it's worth the added effort to track sizes, or
(2) if we should rather go Sunil's simple "play it safe and redownload
if in doubt" route; or (3) refine my patch to assume QUIT succeeded the
moment it is handed off to the write() call. That also has the effect of
redownloading messages if the QUIT fails, but will retry the DELEte if
fetchmail doesn't reach the point where it would send QUIT.

> If the deletion fails again, the mail may be downloaded once more if
> its size keeps changing or it may eventually be deleted without any
> further download if the size is not changed any more. If the download
> is the action that causes the connection drop (such as a proxy with a
> virus scanner taking too much time and exceeding the timeout of the
> server), it will be downloaded twice and deleted the third time.

A fetchmail-internal queue might fix this much later on if someone could
document the need for such.

> And this, only if the connection is dropped after the delete command
> is sent AND the size of the mail is changed by the server on the next
> poll.
>
> The worst case occurs if the size of the mail keeps changing at every
> poll. I have no solution here. It's not funny at all to have to deal
> with broken servers :-(

I think I might just charge users for adding more workarounds. Then they
have the cheap route of complaining to the server's operator, or the
expensive one of having the fix. :-P

I think size isn't necessarily the best complement here to detect
changes. I like the hash approach better, but this seems quite intrusive
as well.

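For illustration only, the hash approach could look roughly like the
Python sketch below. It is not existing fetchmail code. It assumes the
message is available as raw bytes with CRLF line endings, treats only
Status and X-Status as volatile (the real list would need more thought),
and leaves the hash function selectable instead of hardcoding it. The
function name and the sha256 default are my own choices for the example.

```python
import hashlib

# Headers assumed to change on the server between polls (assumption).
VOLATILE_HEADERS = frozenset({b"status", b"x-status"})


def stable_message_hash(raw, algorithm="sha256", volatile=VOLATILE_HEADERS):
    """Hash a raw message while skipping the volatile header fields.

    'algorithm' is any name accepted by hashlib.new(), so nothing is
    hardcoded; "md5" can be passed where it is available.
    """
    head, _, body = raw.partition(b"\r\n\r\n")
    digest = hashlib.new(algorithm)
    skipping = False
    for line in head.split(b"\r\n"):
        if line[:1] in (b" ", b"\t"):
            # Folded continuation line: shares the fate of its header.
            if skipping:
                continue
        else:
            name = line.split(b":", 1)[0].strip().lower()
            skipping = name in volatile
            if skipping:
                continue
        digest.update(line + b"\r\n")
    # Separator plus body are hashed unchanged.
    digest.update(b"\r\n" + body)
    return digest.hexdigest()
```

With this, a message that merely gained a "Status: RO" header hashes the
same as before, while any change to the body or to another header gives
a different value. Two automated messages that are byte-for-byte
identical outside the skipped headers would still collide.
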
A hash that excludes some dynamic headers such as [X-]Status and similar
may be needed anyways to emulate UIDL for --keep setups on servers that
don't support UIDL -- but if they lack both UIDL and TOP, the network
impact of --keep will be rather painful. (Perhaps fetchmail should just
refuse --keep on such servers.)

> Now, you may argue that it is a lot of programming for the sole benefit
> of not deleting unread mails when the connection drop... I have nothing
> to say against that. I'm not likely to be the one to program it although
> I really wish I could.
>
> Note that this is only a solution I propose to solve the problem of the
> mails being either duplicated, deleted without delivery or left forever
> on the server when a connection drop. I'm happy with my current
> configuration which work fine with pop3 + uidl and a server that doesn't
> recycle the UID too quickly and a connection quite reliable enough.

If your server puts a timestamp into the UID and has a strictly
monotonically increasing system clock, the risk is pretty low.

--
Matthias Andree