[fetchmail-devel] [Bug #18853] IMAP IDLE + SSL: Error: re-poll failed

SourceForge Headquarters 1320 Columbia Street Suite 310 San Diego, CA 92101 +1 (858) 422-6466

Bug #18853, was updated on 2013-Jan-03 22:49
Here is a current snapshot of the bug.

Project: fetchmail
Category: None
Status: Open
Resolution: None
Bug Group: None
Priority: 4
Submitted by: w-f
Assigned to : m-a
Summary: IMAP IDLE + SSL: Error: re-poll failed

Details: I use fetchmail with SSL and IMAP IDLE. The IMAP connection freezes after some time and the re-poll failed after 28 min. It looks like there is a ssl socket timeout before a IMAP IDLE re-poll is issued.

-----
Jan  3 20:31:35 mini2 fetchmail[14842]: IMAP> A0005 IDLE
Jan  3 20:31:35 mini2 fetchmail[14842]: IMAP< + OK
Jan  3 20:59:35 mini2 fetchmail[14842]: IMAP> DONE
Jan  3 20:59:35 mini2 fetchmail[14842]: re-poll failed
Jan  3 20:59:35 mini2 fetchmail[14842]: socket error while fetching from web...@fi...@imap.strato.de
Jan  3 20:59:35 mini2 fetchmail[14842]: 6.3.24 querying imap.strato.de (protocol IMAP) at Thu, 03 Jan 2013 20:59:35 +0100 (CET): poll completed
Jan  3 20:59:35 mini2 fetchmail[14842]: Query status=2 (SOCKET)
-----

The re-poll does not fail, when the idle timeout is reduced from 28 min to 10 minutes. This should also fulfil the RFC 2177, because the RFC suggests to re-issue a IDLE command at LEAST every 29 minutes.

RFC 2177 

   "The server MAY consider a client inactive if it has an IDLE command
   running, and if such a server has an inactivity timeout it MAY log
   the client off implicitly at the end of its timeout period.  Because
   of that, clients using IDLE are advised to terminate the IDLE and
   re-issue it at least every 29 minutes to avoid being logged off.
   This still allows a client to receive immediate mailbox updates even
   though it need only "poll" at half hour intervals."

Other IMAP Clients uses a shorter IDLE re-poll timer or issues a NOOP Command.
Microsoft Outlook:  10 min
Apple Mail:  IMAP NOOP every 1 minute (RFC 3501: "The NOOP command can also be used to reset any inactivity autologout timer on the server.")

------
Patch (Reduce  idle_timeout  to 600)
diff fetchmail-6.3.24-patched/imap.c fetchmail-6.3.24/imap.c 
718,719c718
< 	/* mytimeout = idle_timeout = 1680; */ /* 28 min */
<         mytimeout = idle_timeout = 600;  /* reduce to 10 min to prevent ssl socket timeout */
---
> 	mytimeout = idle_timeout = 1680; /* 28 min */

Enviroment:
fetchmail 6.3.24
OS:  Mac OS 10.7.5
IMAP Server:  Provider Strato (imap.strato.de)

---
Best regards,
Wolfgang

Follow-Ups:

Date: 2013-Dec-13 00:51
By: m-a

Comment:
It's six unanswered probes, of course.
-------------------------------------------------------

Date: 2013-Dec-13 00:47
By: m-a

Comment:
Well, it's quite clear from the tcpdump logs that the server has silently gone down, i. e. there was not the usual FIN / FIN|ACK handshake to shut down the connection orderly.

When fetchmail tried to re-poll after 28 minutes, it sent out data, and then the server responded with a RST - a sign that it knew of no connection.

I can imagine two ways how this may fail:

1. some firewall (including your router) erases its connection record, and does not let the FIN packages from the server back through. Probably not Linux netfilter, my Ubuntu 3.8 kernel on 12.04LTS has a timeout of 5 days for established TCP connections.

2. some firewall on the server's end erases the connection record, so that fetchmail's "DONE" triggers the connection reset from the remote end

fetchmail has been setting the SO_KEEPALIVE option since release 6.3.20, and OpenSSL does not apparently flip that switch.  

However, the default keepalive timer on Linux and FreeBSD kicks in after two hours - so it's not useful for checking IDLE. 

You'd have to lower that, see if running these commands helps, this set is for Linux:

sysctl net.ipv4.tcp_keepalive_time=300
sysctl net.ipv4.tcp_keepalive_intvl=24
sysctl net.ipv4.tcp_keepalive_probes=6

On FreeBSD and maybe on Darwin/MacOS X, try these (milliseconds here):

sysctl -w net.inet.tcp.keepidle=300000
sysctl -w net.inet.tcp.keepintvl=24000
sysctl -w net.inet.tcp.keepcnt=6

On the next connection, sockets will start probing keepalives after 5, rather than 120, minutes, and resent probes at 24, rather than 75, second intervals, and tolerate five (rather than nine) unanswered probes. Meaning that some seven minutes after a silent disconnect, the kernel would detect that the connection is gone and issue an error code.

It would be interesting to see a tcpdump around that time then, to conclude what the new error mode would then be.

Can both of you rule out that your firewall loses the connection early?   The firewall timeouts for erasing TCP connections without traffic must obviously be 30 minutes or longer.

If so, has either of you ever contacted the server's operator and asked if their IMAP server and firewalls are fit to support idle times of up to 30 minutes?

-------------------------------------------------------

Date: 2013-Dec-12 23:13
By: rainforest1155

Comment:
I'm also using Strato and am seeing exactly the same problem. Every 28 min after the poll started, I'm getting the same error like Wolfgang. For example:
Dec 11 19:11:45 beagle fetchmail[32718]: re-poll failed
Dec 11 19:11:45 beagle fetchmail[32718]: socket error while fetching from my...@my...@imap.strato.de
Dec 11 19:11:45 beagle fetchmail[32718]: Query status=2 (SOCKET)

Any chance this issue could be revisited? 

Let me know if I can provide you with any further details. 
system: Debian 7.2
fetchmail 6.3.21-4 

Just note that I'm still a newbie at this. 

Thanks,
Sebastian
-------------------------------------------------------

Date: 2013-Apr-29 21:37
By: w-f

Comment:
the Link with the complete dtrace files is active again.
https://www.hidrive.strato.com/lnk/RBTrVUhJ

Best regards,
Wolfgang
-------------------------------------------------------

Date: 2013-Apr-26 19:34
By: m-a

Comment:
I was about to revisit your bug report; unfortunately, the log file link has expired. Can you re-upload the logs? Thank you.
-------------------------------------------------------

Date: 2013-Jan-06 22:43
By: w-f

Comment:
Hello Matthias,

I did a dtrace for the SSL case and non-SSL case. In both variants the  TCP KeepAlive is enabled.

199552      20     14 socket(0x2, 0x1, 0x6)		 = 5 0
199559       8      2 setsockopt(0x5, 0xFFFF, 0x8)		 = 0 0
199634   21589     72 connect(0x5, 0x7FAF0AC01320, 0x10)		 = 0 0

But there is an other difference:
The non-ssl socket uses the system call recvfrom to wait for data from the imap server during the idle time. The ssl socket uses the system call "read" to wait for data from the imap server during the idle time. The tcp stack sends no KeepAlive Packets during the system call read.

You can download  the complete dtrace files for both variants from https://www.hidrive.strato.com/lnk/RBTrVUhJ

Best regards,
Wolfgang
-------------------------------------------------------

Date: 2013-Jan-06 13:29
By: m-a

Comment:
Wolfgang,

could you dtrace setsockopt and/or all socket option, for (1) the SSL case, (2) the non-SSL case?  I unconditionally enable SO_KEEPALIVE after the socket has been opened, and never see them disabled - perhaps your OpenSSL library behaves differently.

Best regards,
Matthias
-------------------------------------------------------

Date: 2013-Jan-06 13:20
By: m-a

Comment:
Thanks for the detailed information.

However, this pretty much looks like a networking issue. The server just hung up, but there was no FIN|ACK handshake from closing down the connection, so fetchmail was not informed of the shutdown or close until it tried to send the next packet.

This pretty much looks like a networking issue, broken router or NAT or Masquerading. I'll grant that TCP keepalives might help.

Unfortunately, strato does not appear to support STARTTLS, so we cannot test that.
-------------------------------------------------------

Date: 2013-Jan-05 19:50
By: w-f

Comment:
The issue is not an server imap timeout. There are no problems when using an unencrypted imap session.

The dtruss  shows no activity during the idle time:
22867/0xaf6c4:    629137      33     24 write(0x5, "\027\003\0", 0x25)		 = 37 0
22867/0xaf6c4:    629147       9      3 __sysctl(0x7FFF60CF8FE8, 0x2, 0x7FFF60CF8FFF)		 = 0 0
22867/0xaf6c4:    629150       5      0 getuid(0x0, 0x7FFF60CF8C58, 0x0)		 = 273 0
22867/0xaf6c4:    629152       5      0 getgid(0x0, 0x7FFF60CF8C58, 0x0)		 = 1 0
22867/0xaf6c4:    629184       9      3 setitimer(0x0, 0x7FFF60CF7E00, 0x0)		 = 0 0

dtrace: error on enabled probe ID 1791 (ID 778: syscall::read:return): out of scratch space in action #13 at DIF offset 44
dtrace: error on enabled probe ID 1791 (ID 778: syscall::read:return): out of scratch space in action #13 at DIF offset 44
dtrace: error on enabled probe ID 1789 (ID 780: syscall::write:return): out of scratch space in action #13 at DIF offset 44
22867/0xaf6c4:    629354     343     20 sigreturn(0x7FFF60CF7B50, 0x1E, 0x7FFF60CF7B50)		 = 0 Err#-2
22867/0xaf6c4:    629358       8      0 __pthread_canceled(0x0, 0x7FE06C015000, 0x7FFF60CF7C08)		 = -1 Err#22
22867/0xaf6c4:    629372       8      1 setitimer(0x0, 0x7FFF60CF7E00, 0x0)		 = 0 0
22867/0xaf6c4:    629454      66     58 write(0x5, "\027\003\0", 0x1F)		 = 31 0
22867/0xaf6c4:    629481      15      8 __sysctl(0x7FFF60CFB0F8, 0x2, 0x7FFF60CFB10F)		 = 0 0
22867/0xaf6c4:    629485       6      0 getuid(0x0, 0x7FFF60CFAD68, 0x0)		 = 273 0
22867/0xaf6c4:    629487       6      0 getgid(0x0, 0x7FFF60CFAD68, 0x0)		 = 1 0
22867/0xaf6c4:    629533      10      4 setitimer(0x0, 0x7FFF60CF9F30, 0x0)		 = 0 0
22867/0xaf6c4:    629566      13      1 setitimer(0x0, 0x7FFF60CF9F30, 0x0)		 = 0 0
22867/0xaf6c4:    629569       5      0 setitimer(0x0, 0x7FFF60CFBFD0, 0x0)		 = 0 0
22867/0xaf6c4:    629596      14      7 __sysctl(0x7FFF60CFB118, 0x2, 0x7FFF60CFB12F)		 = 0 0
22867/0xaf6c4:    629600       5      0 getuid(0x0, 0x7FFF60CFAD88, 0x0)		 = 273 0
22867/0xaf6c4:    629602       5      0 getgid(0x0, 0x7FFF60CFAD88, 0x0)		 = 1 0
22867/0xaf6c4:    629648      10      2 sigaction(0xE, 0x7FFF60D00098, 0x7FFF60D000C0)		 = 0 0
22867/0xaf6c4:    629653       8      2 setitimer(0x0, 0x7FFF60D00278, 0x0)		 = 0 0
22867/0xaf6c4:    629849      18      9 close(0x5)		 = 0 0
22867/0xaf6c4:    629852       6      1 setitimer(0x0, 0x7FFF60D00298, 0x0)		 = 0 0
22867/0xaf6c4:    629854       6      0 sigaction(0xE, 0x7FFF60D00098, 0x7FFF60D000C0)		 = 0 0
22867/0xaf6c4:    629867       9      3 __sysctl(0x7FFF60CFF1E8, 0x2, 0x7FFF60CFF1FF)		 = 0 0

Also there is no network traffic during the idle time.  And the response for first packet TCP Push from fetchmail to the imap server is a Reset. 

tcp dump IMAPS Session:
------
18:42:48.547781 IP (tos 0x0, ttl 64, id 25398, offset 0, flags [DF], proto TCP (6), length 77)
    mini2.home.fischer-net.net.60975 > imap.strato.de.imaps: Flags [P.], cksum 0x4c5f (incorrect -> 0x2baf), seq 743:780, ack 4504, win 33396, length 37
18:42:48.568631 IP (tos 0x0, ttl 248, id 36786, offset 0, flags [DF], proto TCP (6), length 71)
    imap.strato.de.imaps > mini2.home.fischer-net.net.60975: Flags [P.], cksum 0x5a30 (correct), seq 4504:4535, ack 780, win 5135, length 31
18:42:48.568693 IP (tos 0x0, ttl 64, id 64318, offset 0, flags [DF], proto TCP (6), length 40)
    mini2.home.fischer-net.net.60975 > imap.strato.de.imaps: Flags [.], cksum 0x4c3a (incorrect -> 0xd50a), seq 780, ack 4535, win 33380, length 0

19:10:48.624932 IP (tos 0x0, ttl 64, id 12889, offset 0, flags [DF], proto TCP (6), length 71)
    mini2.home.fischer-net.net.60975 > imap.strato.de.imaps: Flags [P.], cksum 0x4c59 (incorrect -> 0xfd06), seq 780:811, ack 4535, win 33396, length 31
19:10:48.646571 IP (tos 0x0, ttl 248, id 18517, offset 0, flags [DF], proto TCP (6), length 40)
    imap.strato.de.imaps > mini2.home.fischer-net.net.60975: Flags [R.], cksum 0x574c (correct), seq 4535, ack 811, win 0, length 0

19:11:49.041641 IP (tos 0x0, ttl 64, id 1235, offset 0, flags [DF], proto TCP (6), length 64)
    mini2.home.fischer-net.net.61035 > imap.strato.de.imaps: Flags [S], cksum 0x4c52 (incorrect -> 0x0b3e), seq 771336041, win 65535, options [mss 1460,nop,wscale 1,nop,nop,TS val 1029692052 ecr 0,sackOK,eol], length 0
19:11:49.063272 IP (tos 0x0, ttl 248, id 26043, offset 0, flags [DF], proto TCP (6), length 52)
    imap.strato.de.imaps > mini2.home.fischer-net.net.61035: Flags [S.], cksum 0x1718 (correct), seq 2700448559, ack 771336042, win 4356, options [mss 1452,nop,wscale 0,sackOK,eol], length 0
-----

Now for me looks like the issue is, fetchmail opens the Socket for the SSL Session without a TCP KeepAlive.

When I use an unencrypted imap session, the TCP KeepAlive is active. The tcpdump shows the TCP KeepAlive packets during the imap idle time.

Best Regards,
Wolfgang
-------------------------------------------------------

Date: 2013-Jan-03 23:02
By: m-a

Comment:
Have Strato fix their server timeouts instead, the 30 minute timer is a "MUST" clause in RFC-3501 section 5.4, see http://tools.ietf.org/html/rfc3501#section-5.4

If there is a socket timeout, I'd like to see an strace or truss trace proving it, and possibly an accompanying tcpdump or wireshark/tshark dump of the protocol with timestamps, to see when the connection gets closed. Chances are that there is a problem on your end - are you using IPv4 NAT or Masquerading?
-------------------------------------------------------

For detailed info, follow this link:
http://developer.berlios.de/bugs/?func=detailbug&bug_id=18853&group_id=1824

[fetchmail-devel] [Bug #18853] IMAP IDLE + SSL: Error: re-poll failed

Client daemon to move mail from POP and IMAP to your local computer

[fetchmail-devel] [Bug #18853] IMAP IDLE + SSL: Error: re-poll failed