From: Sunil S. <sun...@ro...> - 2011-04-30 18:15:20
|
--- On Fri, 29/4/11, Andrea Righi <rig...@gm...> wrote: > > > > 1. Parse the response to the SELECT command: ... > A0001 SELECT "INBOX" > * FLAGS (\Answered \Flagged \Draft \Deleted \Seen) > * OK [PERMANENTFLAGS (\Answered \Flagged \Draft \Deleted > \Seen \*)] > * OK [UIDVALIDITY 2] > * 518299 EXISTS > * 0 RECENT > * OK [UIDNEXT 652168] > A0001 OK [READ-WRITE] INBOX selected. (Success) Parsing will not give the unseen count information for you. Also, I realised later that this method will not work when 'idle' is enabled. > > > > 2. Send a STATUS command This may not work correctly on open folders. > > > > 3. Send an extended SEARCH command This is not supported by your server. So, at the IMAP protocol level, we have run out of options. I will check if the fetchmail parsing can be improved directly in fetchmail itself. The following is a recap of the problem: ========================================================================== fetchmail gets the unseen message list using the SEARCH command. There is no standard mechanism to get the count and range of unseen messages. fetchmail currently reads the response to the SEARCH command in a buffer of size MSGBUFSIZE=8k. In a normal case when there are only a few new mails, the response fits nicely in this buffer: IMAP> A0010 SEARCH UNSEEN IMAP< * SEARCH 499998 499999 500000 IMAP< A0010 OK SEARCH completed However, when there are too many unseen mails, the buffer response gets truncated (possibly in the middle of a number). IMAP> A0011 SEARCH UNSEEN IMAP< * SEARCH 490001 490002 490003 ... 499998 499999 500000 IMAP< A0011 OK SEARCH completed IMAP_SEARCH_MAX=1k (which in reality should be a function of MSGBUFSIZE and count) tries to avoid the truncation by searching for unseen mail in batches. IMAP> A0012 SEARCH 1:1000 UNSEEN IMAP> A0013 SEARCH 1001:2000 UNSEEN IMAP> A0014 SEARCH 2001:3000 UNSEEN The problems with this approach are: - In the most common case, there are very few unseen mails. This method of searching is inefficient in those cases. - If the mailbox size is huge (say having 500k mails), fetchmail gets caught in an unproductive cycle. ========================================================================== -- Sunil Shetye. |
From: Sunil S. <sun...@ro...> - 2011-05-03 06:05:27
|
Hi, ----- Original Message ----- > So, at the IMAP protocol level, we have run out of options. > > I will check if the fetchmail parsing can be improved directly in > fetchmail itself. Attached are the patches which should solve this issue. The first patch has nothing to do with this issue. It only removes excess calls to strlen(). The second patch works by sending a "SEARCH UNSEEN" command and reading the response in batches (when required). Andrea, please test the patches for your mailbox. Matthias, please evaluate the patches for incorporation in fetchmail. -- Sunil Shetye. |
From: Matthias A. <mat...@gm...> - 2011-05-03 15:12:23
|
Am 03.05.2011 06:05, schrieb Sunil Shetye: > Hi, > > > ----- Original Message ----- >> So, at the IMAP protocol level, we have run out of options. >> >> I will check if the fetchmail parsing can be improved directly in >> fetchmail itself. > > Attached are the patches which should solve this issue. > > The first patch has nothing to do with this issue. It only removes excess calls to strlen(). > > The second patch works by sending a "SEARCH UNSEEN" command and reading the response in batches (when required). > > Andrea, please test the patches for your mailbox. > > Matthias, please evaluate the patches for incorporation in fetchmail. Hi Sunil, the first patch is easy, accepted, thank you. As so the second (with _split and caching), I wonder if it's the right approach. It looks a bit as though we're curing the symptoms instead of the cause. The cause is a buffer too small to handle the response, and the cure would be to make the buffer dynamic and possibly resize it if we don't get a \r?\n in it. A dynamic buffer could also save us the strcpy(argbuf, buf) copy which, IMO, is quite wasteful. What I propose is to have three branches, details below the list: - one for 6.3.X (new name to be devised), for a final release and possibly later critical/security fixes - one for a new 6.4.X or 7.X.X branch (master) where such fixes as the IMAP search ranges can be made - one for the future 7.X.X or 8.X.X branch (next) for radical changes such as the C++ migration and major structural changes. Details: 1 - create a side branch from master for 6.3.X, revert the SSLv2 removal, fix STARTTLS timeouts (the code is almost completed) and clear up the MD5 mess. Release 6.3.20 from the side branch. 2 - master advances to 7.0.X, sporting feature removals from the "next" branch (7.0.0-alpha2) such as IMAP2, POP2, SSLv2, support for pre-SUSv3 systems and other cleanups, but not the C++ migration. Revamp the socket handling, using dynamic buffers, use Rainer's patch to speed up POP3 UID handling with PATRICIA data structures (this gives us the edge over getmail ;-)) - The SSL handling is not done properly, but needs to be. Look at the _ssl_ctx and all that stuff, we're wasting humongous amounts of static memory (FD_SETSIZE is 1024 on my system, and sizeof(SSL) is 512, so we waste 512 kibiB of RAM just for SSL support) just because the socket code sticks to simple ints, rather than using a proper structure of our own. 3 - next advances to 8.0.X including a C++ migration, with major overhaul: - I have started converting the code (on the "next" branch in Git) to C++ with Standard Template Library and Boost. Boost is, for the biggest part, a headers-only library, so it doesn't create additional run-time dependencies, and it is peer-reviewed by the C++ gurus and often the basis for future standards <http://www.boost.org/>. IMAPCapa* is an example of how compact the code is. C++ features such as the string type, or STL containers such as vector, map, set, list/deque/stack; or Boost functions for intervals, will greatly simplify the code. As will exceptions. - I plan to support UID-based fetches in IMAP too, so we can fetch from READ-ONLY folders and need no longer tamper with \Seen flags. In that case, UIDs are strictly monotonically increasing (unless UIDVALIDITY has changed), so we know for certain where we need to start the SEARCH or FETCH, too, so we avoid empty fetches. (Note Boost's Interval type!) - I plan to support pipelined operation for common protocols to avoid the roundtrip delays incurred through lock-step behaviour. It's not possible everywhere (for instance, when doing a binary search for POP3 UIDLs), but should help maxing out the wires or WiFi even in a single thread once we get to downloading. - I want to clean up the code to an extent where we can ultimately fetch from several accounts in parallel. Opinions, comments, remarks, concerns, ideas are solicited. Thanks. Best regards, Matthias |
From: Sunil S. <sun...@ro...> - 2011-05-03 20:25:19
|
Hi Matthias, ----- Original Message ----- > As so the second (with _split and caching), I wonder if it's the right > approach. It looks a bit as though we're curing the symptoms instead of > the cause. The cause is a buffer too small to handle the response, and > the cure would be to make the buffer dynamic and possibly resize it if > we don't get a \r?\n in it. > > > A dynamic buffer could also save us the strcpy(argbuf, buf) copy which, > IMO, is quite wasteful. I had given a thought for implementing dynamic buffers. However, some issues came to my mind: - What if the buffer required is very long? For example, an extreme response like: IMAP< * SEARCH 1 2 3 4 5 6 ... 999995 999996 999997 99998 99999 100000 will require a dynamic buffer of size 575k. Is it okay to allocate such a huge buffer? - Who will free the buffer? Obviously, it will have to be statically allocated in order to free it later. - How long will it remain? Is it safe to free it immediately on the next call? In short, without a proper memory manager, it may not be safe to just allocate a dynamic buffer of such a huge size. Note that the strcpy(argbuf, buf) can still be avoided by directly passing argbuf (if available) to gen_recv(). Also, note that that patch is a bug fix. The current method range search for UNSEEN mails is too slow for huge mailboxes. So, a fix for this issue is required for 6.3.X. > What I propose is to have three branches, details below the list: > > - one for 6.3.X (new name to be devised), for a final release and > possibly later critical/security fixes > > - one for a new 6.4.X or 7.X.X branch (master) where such fixes as the > IMAP search ranges can be made I think, it is necessary to fix the range search issue in 6.3.X itself as fetchmail is now unusable with IMAP when the mailbox is huge and keep is on. -- Sunil Shetye. |
From: Matthias A. <mat...@gm...> - 2011-05-04 01:14:29
|
Am 03.05.2011 20:25, schrieb Sunil Shetye: > Hi Matthias, > > > > ----- Original Message ----- >> As so the second (with _split and caching), I wonder if it's the right >> approach. It looks a bit as though we're curing the symptoms instead of >> the cause. The cause is a buffer too small to handle the response, and >> the cure would be to make the buffer dynamic and possibly resize it if >> we don't get a \r?\n in it. >> >> >> A dynamic buffer could also save us the strcpy(argbuf, buf) copy which, >> IMO, is quite wasteful. > > I had given a thought for implementing dynamic buffers. However, some issues came to my mind: > > - What if the buffer required is very long? For example, an extreme response like: > > IMAP< * SEARCH 1 2 3 4 5 6 ... 999995 999996 999997 99998 99999 100000 > > will require a dynamic buffer of size 575k. Is it okay to allocate such a huge buffer? Possibly yes, although some limit would be in order. Do we really need to optimize 6.3.X with obtrusive changes? This will make merging between master and next harder. > - Who will free the buffer? Obviously, it will have to be statically allocated in order to free it later. > > - How long will it remain? Is it safe to free it immediately on the next call? > In short, without a proper memory manager, it may not be safe to just allocate a dynamic buffer of such a huge size. I'll sidestep these for now. > Note that the strcpy(argbuf, buf) can still be avoided by directly passing argbuf (if available) to gen_recv(). OK. It's a separate microoptimization anyways, not worth discussing. Someone will do it ;-) > Also, note that that patch is a bug fix. The current method range search for UNSEEN mails is too slow for huge mailboxes. So, a fix for this issue is required for 6.3.X. It may be painful, but if it works, it's not a bug unless the documentation promises otherwise (which it does not). For a 25,000 message mailbox with 1% unseen, the amount of SEARCH transactions seems to be reasonably low. I think POP3 + uidl + keep in that situation would burn something between 3 and 30 seconds CPU time on a Phenom II 2.5 GHz CPU w/ GCC 4.5 compiled code while chewing on UIDLs. The fix was contributed by Rainer Weikusat and is in the next branch and in 7.0.0-alpha2. >> What I propose is to have three branches, details below the list: >> >> - one for 6.3.X (new name to be devised), for a final release and >> possibly later critical/security fixes >> >> - one for a new 6.4.X or 7.X.X branch (master) where such fixes as the >> IMAP search ranges can be made > > I think, it is necessary to fix the range search issue in 6.3.X itself as fetchmail is now unusable with IMAP when the mailbox is huge and keep is on. Try POP3 UIDL on a mailbox with the same message (86000) count and let me know if you still think IMAP is "unusable". Hint: check what Rainer and I have been benchmarking when he contributed the PATRICIA implementation. I've avoided it in 6.3.X because I have portability concerns. I will take your patch in spite of my concerns, and test it. I suggest that you check the manual pages of strlcpy and strlcat, these are alternatives to snprintf(buf, siz, "%s%s", s1, s2). glibc is one of the major distros lacking it, but we ship replacements. Also note that <strings.h> has strcasecmp and, more usefully, strncasecmp, and we have our replacements for OSes lacking them, too. |
From: Sunil S. <sun...@ro...> - 2011-05-04 08:46:36
|
----- Original Message ----- > Observations: > > - the patch makes an undocumented assumption that the buffer for a > second call to gen_recv_split is always large enough to copy the > remainder. If it isn't, you lose data. Doesn't happen in fetchmail now, > but is a maintenance concern, so I'll add an assert() for now. > > - you don't guard the prefix length - I've added that. > > - you don't check if the data-to-be-cached fits into the rs structure - > I've added that. Good. All required. > - /** foo */ isn't a typo, but a marker for Doxygen so it actually looks > at the comment for documentation, and extracts it. Also note the > ordering of these comments versus the comma; alternatively you can write > /**< foo */ to document the PREVIOUS argument in a prototype. Okay. Note that the prototype was just copied from gen_recv() and a parameter was added later. > Fixup: match prefix caseblind, add some guards, streamline phase > handling. Required. > Add a few asserts to catch abuse, and use strlcpy/strlcat rather than > snprintf because some implementations of the latter are unsuitable for > detecting buffer exhaustion. Required. > Fixup: remove unused variables. How did I miss that? Thanks for the review. -- Sunil Shetye. |
From: Matthias A. <mat...@gm...> - 2011-05-04 12:24:39
|
Am 04.05.2011 08:46, schrieb Sunil Shetye: > Okay. Note that the prototype was just copied from gen_recv() and a parameter was added later. Oops :) I've fixed the transact.c documentation for Doxygen, and fixed some unsafe macro expansions (missing parentheses) along the way. |
From: Sunil S. <sun...@ro...> - 2011-05-04 12:50:23
Attachments:
0001-correct-call-to-gen_send.patch
|
----- Original Message ----- > And it went unnoticed possibly because you use different compiler flags > (I routinely compile with -pedantic -Wall and often also with -Wextra or > -W, and I routinely also try Intel C++ and Clang, and I take care to > have at least -O1 or usually -O2 set -- my current computer is fast > enough and compiles fetchmail (after configure) in < 2.5 s real time.) Is there a way of enabling those CFLAGS flags as well as the configure flags by default? Every time I switch machines and checkout fetchmail again (which is rare), I seem to miss out setting the same set of flags again. I have enabled the warnings now. I have also added -Wformat=2 and got some warnings. Attached is a minor patch to fix those warnings. -- Sunil Shetye. |
From: Matthias A. <mat...@gm...> - 2011-05-04 15:53:55
|
Am 04.05.2011 12:50, schrieb Sunil Shetye: > ----- Original Message ----- > >> And it went unnoticed possibly because you use different compiler flags >> (I routinely compile with -pedantic -Wall and often also with -Wextra or >> -W, and I routinely also try Intel C++ and Clang, and I take care to >> have at least -O1 or usually -O2 set -- my current computer is fast >> enough and compiles fetchmail (after configure) in < 2.5 s real time.) > > Is there a way of enabling those CFLAGS flags as well as the configure flags by default? Every time I switch machines and checkout fetchmail again (which is rare), I seem to miss out setting the same set of flags again. Hm. These are compiler-specific, so we'd have to test explicitly. > I have enabled the warnings now. I have also added -Wformat=2 and got some warnings. > > Attached is a minor patch to fix those warnings. Good catch, although I have the nasty feeling I've fixed similar bugs in other .c files before. I wonder how these slipped past my attention... |
From: Matthias A. <mat...@gm...> - 2011-05-04 02:08:45
|
Am 04.05.2011 01:14, schrieb Matthias Andree: > I will take your patch in spite of my concerns, and test it. Observations: - the patch makes an undocumented assumption that the buffer for a second call to gen_recv_split is always large enough to copy the remainder. If it isn't, you lose data. Doesn't happen in fetchmail now, but is a maintenance concern, so I'll add an assert() for now. - you don't guard the prefix length - I've added that. - you don't check if the data-to-be-cached fits into the rs structure - I've added that. - /** foo */ isn't a typo, but a marker for Doxygen so it actually looks at the comment for documentation, and extracts it. Also note the ordering of these comments versus the comma; alternatively you can write /**< foo */ to document the PREVIOUS argument in a prototype. My changes have been pushed out to the Git master branch, please check my two Fixup commits: commit 646f36e1c1369fd65c0d641cae48fa4425613462 Author: Matthias Andree <mat...@gm...> Date: Wed May 4 02:02:30 2011 +0200 Fixup: match prefix caseblind, add some guards, streamline phase handling. Add a few asserts to catch abuse, and use strlcpy/strlcat rather than snprintf because some implementations of the latter are unsuitable for detecting buffer exhaustion. M transact.c commit d7d43f53e1da9d5c57961ea26fefa609de1e30e7 Author: Matthias Andree <mat...@gm...> Date: Wed May 4 01:58:46 2011 +0200 Fixup: remove unused variables. M imap.c |
From: Matthias A. <mat...@gm...> - 2011-05-04 11:07:42
|
On 04.05.2011 08:46, Sunil Shetye replied: >> Fixup: remove unused variables. > > How did I miss that? Hi Sunil, As fetchmail 6.3.X is C89-style, variable declarations are far away from their use, so missing "first/last" was easy. Add to that the humongous text length of some code parts... And it went unnoticed possibly because you use different compiler flags (I routinely compile with -pedantic -Wall and often also with -Wextra or -W, and I routinely also try Intel C++ and Clang, and I take care to have at least -O1 or usually -O2 set -- my current computer is fast enough and compiles fetchmail (after configure) in < 2.5 s real time.) > Thanks for the review. Thanks for the counter-review. Best regards Matthias |