I've been tracking a data corruption/missing http status problem and I think I have enough data for a useful bug report.
The problem centers on the handling of queued requests in a pipeline when preceding requests are failed by libcurl after it has committed to a request/response transaction. In the table below, I show six GET requests pipelined on one connection. 'Time' is in seconds relative to the creation of the connection. The first three requests are processed normally. The fourth request times out while processing the response body. The fifth request times out waiting for the response header. The sixth request is allowed to proceed but appears to be out of sync with the response stream. I haven't dumped the data in verbose mode, but I'd guess that the sixth request is consuming the remainder of the fourth request's response body in some demented fashion.
| Request | Time | Event |
|---|---|---|
| 0 | 0 | HEADEROUT issued |
| 0 | 1 | First HEADERIN data |
| 0 | 13 | Request completed, 200 status |
| 1 | 0 | HEADEROUT issued |
| 1 | 13 | First HEADERIN data |
| 1 | 15 | Request completed, 200 status |
| 2 | 0 | HEADEROUT issued |
| 2 | 15 | First HEADERIN data |
| 2 | 20 | Request completed, 200 status |
| 3 | 0 | HEADEROUT issued |
| 3 | 20 | First HEADERIN data |
| 3 | 30 | Timeout declared (easy error 28) |
| 3 | 30 | Request failed, easy 28 status |
| 4 | 0 | HEADEROUT issued |
| 4 | 30 | Timeout declared (easy error 28), no HEADERIN data seen |
| 4 | 30 | Request failed, easy 28 status |
| 5 | 13 | HEADEROUT issued |
| 5 | 30 | First DATAIN received, NO HEADERIN data for this request |
| 5 | 43 | Connection closed. This may be in response to server socket close. |
| 5 | 43 | Request appears to succeed but CURLINFO_RESPONSE_CODE returns 0. |
The sixth request appears to succeed as far as multi_perform and multi_info_read are concerned, but fetching CURLINFO_RESPONSE_CODE returns 0 for the status.
As a workaround, checking the status as above appears to be useful. I'm not certain that's 100% reliable or that the connection will be broken immediately at that point. This has the potential of going very wrong as far as data integrity goes.
If I understand this correctly, there are a few ways this could be solved.
I wonder if there's a non-pipelined scenario where this is a problem as well: request timeouts on keep-alive connections.
In the non-pipelined case, stopping a single request has no impact on the following ones.
Is that because the connection is closed or because canceling the single request also sinks all response data when the cancel is performed after the request is sent?
If there's outstanding data in any direction when the transfer is terminated, the connection will be closed. I think that behavior is what "bleeds" over into what libcurl does even when pipelining is activated.
What I've been seeing disagrees with the above description, so I dug in with the debugger to investigate. Timeouts are detected in two locations.
The implementations of the two differ greatly; detector 1 seems the more likely to fire, and it is the one that generates the issues I see.
The state and connection management code looks a bit... sensitive, but some changes seem to be in order.
I'm generally able to detect this problem by examining the HTTP status of a completed request. If data is delivered to the wrong request, I will get a '0' status back because the HTTP status line was never read as a header, only as data. But if a timeout is detected at just the wrong time, an HTTP status line for a preceding request could be fed to a later one leading to data corruption.
I'm going to work on moving the signalling logic to the top of multi_runsingle() and preventing the consideration of connections for pipelining that have 'bits.close' set. That will probably be my local fix for 7.38.0.
An example of what I'm investigating for a workaround:
Just keeping people apprised of my work on this. For the second time, HTTP state
internals defeated me. The 'bits.close' state does quite a bit of unexpected
flipping between 'on' and 'off' states during request processing. It ended up
being 'on' too often during a check for connection compatibility with pipelining.
This resulted in pipelining mostly being inhibited.
So, I added a bit whose semantics I control. It's used for only one purpose:
preventing requests being assigned to a connection once that connection is
known to be unusable (poisoned). A better solution compatible with the HTTP
state plumbing will have to come from devs more familiar with it.
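The idea reduces to a single check during connection selection. A self-contained sketch (the struct and field names here are hypothetical stand-ins, not libcurl's actual internals):

```c
#include <stddef.h>

/* Stand-in for a cached connection; 'poisoned' is the dedicated bit
 * described above: set exactly once, when the connection is first known
 * to be unusable, and never flipped back, unlike the multi-purpose
 * 'bits.close' flag. */
struct conn {
  int poisoned;
};

/* Pick a connection that new requests may still be pipelined onto;
 * NULL means the caller has to open a fresh connection. */
static struct conn *pick_pipeline_conn(struct conn **bundle, size_t n)
{
  size_t i;
  for(i = 0; i < n; i++)
    if(bundle[i] && !bundle[i]->poisoned)
      return bundle[i];
  return NULL;
}
```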
I did extensive testing of HTTP pipelining with respect to timeouts.
For this I wrote a testsuite that you can find here: https://github.com/CarloWood/curl_http_pipeline_tester
Running the testsuite (VERBOSE off) on vanilla libcurl
(5b12a4b4e404e5b5a75d3509cce21553edc9a974) I get the following
output:
2014-10-28 18:48:00.944097: : Request #0 added [now running: 1]
<html><body>Reply 1 on connection 1 for request #0</body></html>
2014-10-28 18:48:00.944979: : Request #0 finished [now running: 0]
2014-10-28 18:48:00.945011: : Request #1 added [now running: 1]
2014-10-28 18:48:00.945022: : Request #2 added [now running: 2]
2014-10-28 18:48:00.945032: : Request #3 added [now running: 3]
2014-10-28 18:48:00.945058: : Request #4 added [now running: 4]
<... 1.001327 seconds ...>
2014-10-28 18:48:01.946385: : Request #1 TIMED OUT! [now running: 3]
2014-10-28 18:48:01.946506: : Request #2 TIMED OUT! [now running: 2]
2014-10-28 18:48:01.946561: : Request #4 TIMED OUT! [now running: 1]
2014-10-28 18:48:01.946575: : Request #5 added [now running: 2]
2014-10-28 18:48:01.946585: : Request #6 added [now running: 3]
2014-10-28 18:48:01.946607: : Request #7 added [now running: 4]
<html><body>Reply 2 on connection 1 for request #1</body></html>
<... 0.099109 seconds ...>
2014-10-28 18:48:02.045716: : Request #3 finished [now running: 3]
2014-10-28 18:48:02.045758: : Request #8 added [now running: 4]
<html><body>Reply 1 on connection 2 for request #5</body></html>
<html><body>Reply 1 on connection 3 for request #6</body></html>
<html><body>Reply 1 on connection 4 for request #7</body></html>
<html><body>Reply 1 on connection 5 for request #8</body></html>
<... 0.100185 seconds ...>
2014-10-28 18:48:02.145943: : Request #5 finished [now running: 3]
2014-10-28 18:48:02.145979: : Request #6 finished [now running: 2]
2014-10-28 18:48:02.145991: : Request #7 finished [now running: 1]
2014-10-28 18:48:02.146004: : Request #8 finished [now running: 0]
2014-10-28 18:48:02.146017: : Request #9 added [now running: 1]
<html><body>Reply 2 on connection 2 for request #9</body></html>
<... 0.100499 seconds ...>
2014-10-28 18:48:02.246516: : Request #9 finished [now running: 0]
This shows the data corruption that Monty was talking about, since we still get the output "<html><body>Reply 2 on connection 1 for request #1</body></html>" for the timed-out request #1. Also, in total 5 connections have been used instead of 2 (not 1: we need to create a new connection when the pipe breaks due to the timeout of #1). Requests #5, #6, #7 and #8 each cause a new connection because curl forgot that the server can do pipelining. After receiving the reply on connection 2 for #5 it knows that this server does pipelining again, so request #9 uses connection 2 too.
After applying Monty's patch, I get this result:
2014-10-28 18:57:48.487717: : Request #0 added [now running: 1]
<html><body>Reply 1 on connection 1 for request #0</body></html>
2014-10-28 18:57:48.488689: : Request #0 finished [now running: 0]
2014-10-28 18:57:48.488720: : Request #1 added [now running: 1]
2014-10-28 18:57:48.488740: : Request #2 added [now running: 2]
2014-10-28 18:57:48.488750: : Request #3 added [now running: 3]
2014-10-28 18:57:48.488760: : Request #4 added [now running: 4]
<... 1.001437 seconds ...>
2014-10-28 18:57:49.490197: : Request #1 TIMED OUT! [now running: 3]
2014-10-28 18:57:49.490252: : Request #2 TIMED OUT! [now running: 2]
2014-10-28 18:57:49.490266: : Request #4 TIMED OUT! [now running: 1]
2014-10-28 18:57:49.490281: : Request #5 added [now running: 2]
2014-10-28 18:57:49.490292: : Request #6 added [now running: 3]
2014-10-28 18:57:49.490302: : Request #7 added [now running: 4]
<html><body>Reply 1 on connection 3 for request #3</body></html>
<html><body>Reply 1 on connection 5 for request #5</body></html>
<... 0.100334 seconds ...>
2014-10-28 18:57:49.590636: : Request #3 finished [now running: 3]
2014-10-28 18:57:49.590673: : Request #5 finished [now running: 2]
2014-10-28 18:57:49.590688: : Request #8 added [now running: 3]
2014-10-28 18:57:49.590700: : Request #9 added [now running: 4]
<html><body>Reply 1 on connection 6 for request #6</body></html>
2014-10-28 18:57:49.590815: : Request #6 finished [now running: 3]
<html><body>Reply 1 on connection 7 for request #7</body></html>
2014-10-28 18:57:49.590855: : Request #7 finished [now running: 2]
<html><body>Reply 2 on connection 3 for request #8</body></html>
<html><body>Reply 2 on connection 5 for request #9</body></html>
<... 0.100399 seconds ...>
2014-10-28 18:57:49.691254: : Request #8 finished [now running: 1]
2014-10-28 18:57:49.691289: : Request #9 finished [now running: 0]
As you can see, now we even made 7 connections in total. The reason
is that although the timed-out #2 and #4 do attempt a new connection
now, they time out immediately afterwards and also close those connections, as
if the pipe broke (again)...
We end up as before with 4 connections: 3, 5, 6 and 7.
Last edit: Carlo Wood 2014-10-28
I fixed all http pipelining bugs that I could find with my testsuite.
You can find/get the commits from https://github.com/CarloWood/curl
The testsuite now gives this as output:
2014-10-28 19:07:28.399012: : Request #0 added [now running: 1]
<html><body>Reply 1 on connection 1 for request #0</body></html>
2014-10-28 19:07:28.399972: : Request #0 finished [now running: 0]
2014-10-28 19:07:28.400096: : Request #1 added [now running: 1]
2014-10-28 19:07:28.400111: : Request #2 added [now running: 2]
2014-10-28 19:07:28.400134: : Request #3 added [now running: 3]
2014-10-28 19:07:28.400145: : Request #4 added [now running: 4]
<... 1.000572 seconds ...>
2014-10-28 19:07:29.400717: : Request #1 TIMED OUT! [now running: 3]
2014-10-28 19:07:29.400759: : Request #2 TIMED OUT! [now running: 2]
2014-10-28 19:07:29.400773: : Request #4 TIMED OUT! [now running: 1]
2014-10-28 19:07:29.400786: : Request #5 added [now running: 2]
2014-10-28 19:07:29.400797: : Request #6 added [now running: 3]
2014-10-28 19:07:29.400808: : Request #7 added [now running: 4]
<html><body>Reply 1 on connection 2 for request #3</body></html>
<html><body>Reply 2 on connection 2 for request #5</body></html>
<html><body>Reply 3 on connection 2 for request #6</body></html>
<html><body>Reply 4 on connection 2 for request #7</body></html>
<... 0.100344 seconds ...>
2014-10-28 19:07:29.501152: : Request #3 finished [now running: 3]
2014-10-28 19:07:29.501188: : Request #5 finished [now running: 2]
2014-10-28 19:07:29.501201: : Request #6 finished [now running: 1]
2014-10-28 19:07:29.501213: : Request #7 finished [now running: 0]
2014-10-28 19:07:29.501227: : Request #8 added [now running: 1]
2014-10-28 19:07:29.501237: : Request #9 added [now running: 2]
<html><body>Reply 5 on connection 2 for request #8</body></html>
<... 0.100530 seconds ...>
2014-10-28 19:07:29.601767: : Request #8 finished [now running: 1]
<html><body>Reply 6 on connection 2 for request #9</body></html>
<... 0.037036 seconds ...>
2014-10-28 19:07:29.638803: : Request #9 finished [now running: 0]
After #1 times out, the connection is marked for closing (and it is actually closed, unlike in the previous two cases), and a new connection is created (connection 2), which is used for request #3, which was in the pipe but hadn't timed out yet, as well as for all newly added requests.
In total 2 connections are made, and only 1 connection exists at any time.
There was still a problem left when I tested this in practice: during my tests I was running the server on localhost, so connects were immediate. That caused me to miss the fact that multiple connections were STILL being made during the time between starting a connect and the connection being established.
Pushed a commit to the curl-7_38_0_pipelinefix branch that fixes this
(Don't make many connections when pipelining after a timeout.)
https://github.com/CarloWood/curl/commits/curl-7_38_0_pipelinefix
Hmm, if life was only this simple...
0x7fffe0a9d700 CURLIO 0x7fffc9173c80 * Connection cache is full, closing the oldest one.
0x7fffe0a9d700 CURLIO : 0x7fffc9173c80 Connection closed (#4); 0 connections remaining [0x7fffc8015110].
0x7fffe0a9d700 CURLIO 0x7fffc9173c80 * Closing connection 4
0x7fffe0a9d700 CURLIO 0x7fffc9173c80 * Removed connection 0x7fffd8484e18 to bundle at 0x7fffd849acb8, 0 connections left.
0x7fffe0a9d700 CURLIO 0x7fffc9173c80 * Destroying bundle at 0x7fffd849acb8
0x7fffe0a9d700 CURLIO 0x7fffc9173c80 * The cache now contains 9 members
...
0x7fffe0a9d700 CURLIO 0x7fffc916d0d0 * Created a new bundle at 0x7fffd876fd58
0x7fffe0a9d700 CURLIO 0x7fffc916d0d0 * Added connection 0x7fffd8484e18 to bundle at 0x7fffd876fd58, now 1 connections.
0x7fffe0a9d700 CURLIO 0x7fffc916d0d0 * Added connection 19. The cache now contains 10 members
...
and there we go again
0x7fffe0a9d700 CURLIO 0x7fffc9073d90 * Server doesn't support pipelining
...
0x7fffe0a9d700 CURLIO 0x7fffc916d0d0 * Added connection 0x7fffd8484e18 to bundle at 0x7fffd876fd58, now 1 connections.
0x7fffe0a9d700 CURLIO 0x7fffc9073d90 * Added connection 0x7fffd95f1848 to bundle at 0x7fffd876fd58, now 2 connections.
0x7fffe0a9d700 CURLIO 0x7fffc90574c0 * Added connection 0x7fffd959c2d8 to bundle at 0x7fffd876fd58, now 3 connections.
0x7fffe0a9d700 CURLIO 0x7fffc89ec6a0 * Added connection 0x7fffd95ef888 to bundle at 0x7fffd876fd58, now 4 connections.
0x7fffe0a9d700 CURLIO 0x7fffc89e9c10 * Added connection 0x7fffd93b2588 to bundle at 0x7fffd876fd58, now 5 connections.
0x7fffe0a9d700 CURLIO 0x7fffc8f634b0 * Added connection 0x7fffd9786a38 to bundle at 0x7fffd876fd58, now 6 connections.
0x7fffe0a9d700 CURLIO 0x7fffc8c703b0 * Added connection 0x7fffd93b1a68 to bundle at 0x7fffd876fd58, now 7 connections.
0x7fffe0a9d700 CURLIO 0x7fffc9042600 * Added connection 0x7fffd8f5c558 to bundle at 0x7fffd876fd58, now 8 connections.
0x7fffe0a9d700 CURLIO 0x7fffc916d850 * Added connection 0x7fffd8f5cc28 to bundle at 0x7fffd876fd58, now 9 connections.
0x7fffe0a9d700 CURLIO 0x7fffc903d4b0 * Added connection 0x7fffd8f46b18 to bundle at 0x7fffd876fd58, now 10 connections.
0x7fffe0a9d700 CURLIO 0x7fffc92629e0 * Added connection 0x7fffd9733428 to bundle at 0x7fffd876fd58, now 11 connections.
0x7fffe0a9d700 CURLIO 0x7fffc8cd4020 * Added connection 0x7fffd9733af8 to bundle at 0x7fffd876fd58, now 12 connections.
0x7fffe0a9d700 CURLIO 0x7fffc9174320 * Added connection 0x7fffd9615e78 to bundle at 0x7fffd876fd58, now 13 connections.
0x7fffe0a9d700 CURLIO 0x7fffc916c7c0 * Added connection 0x7fffd9616548 to bundle at 0x7fffd876fd58, now 14 connections.
0x7fffe0a9d700 CURLIO 0x7fffc91768f0 * Added connection 0x7fffd9616c18 to bundle at 0x7fffd876fd58, now 15 connections.
0x7fffe0a9d700 CURLIO 0x7fffc89e67e0 * Added connection 0x7fffd96172e8 to bundle at 0x7fffd876fd58, now 16 connections.
0x7fffe0a9d700 CURLIO 0x7fffc916e610 * Added connection 0x7fffd96179b8 to bundle at 0x7fffd876fd58, now 17 connections.
0x7fffe0a9d700 CURLIO 0x7fffc917d7b0 * Added connection 0x7fffd9599b38 to bundle at 0x7fffd876fd58, now 18 connections.
0x7fffe0a9d700 CURLIO 0x7fffc91b1930 * Added connection 0x7fffd959a208 to bundle at 0x7fffd876fd58, now 19 connections.
0x7fffe0a9d700 CURLIO 0x7fffc9186ac0 * Added connection 0x7fffd975e328 to bundle at 0x7fffd876fd58, now 20 connections.
0x7fffe0a9d700 CURLIO 0x7fffc813ee50 * Added connection 0x7fffd959a8d8 to bundle at 0x7fffd876fd58, now 21 connections.
0x7fffe0a9d700 CURLIO 0x7fffc9283c90 * Added connection 0x7fffd959afa8 to bundle at 0x7fffd876fd58, now 22 connections.
0x7fffe0a9d700 CURLIO 0x7fffc8e1ede0 * Added connection 0x7fffd959b678 to bundle at 0x7fffd876fd58, now 23 connections.
0x7fffe0a9d700 CURLIO 0x7fffc813fde0 * Added connection 0x7fffd8f59428 to bundle at 0x7fffd876fd58, now 24 connections.
0x7fffe0a9d700 CURLIO 0x7fffc916cf20 * Added connection 0x7fffd8f59af8 to bundle at 0x7fffd876fd58, now 25 connections.
0x7fffe0a9d700 CURLIO 0x7fffc8f62d00 * Added connection 0x7fffd8f5a1c8 to bundle at 0x7fffd876fd58, now 26 connections.
0x7fffe0a9d700 CURLIO 0x7fffc9282710 * Added connection 0x7fffd8f5a898 to bundle at 0x7fffd876fd58, now 27 connections.
0x7fffe0a9d700 CURLIO 0x7fffc8e1d8d0 * Added connection 0x7fffd8f5af68 to bundle at 0x7fffd876fd58, now 28 connections.
0x7fffe0a9d700 CURLIO 0x7fffc8abb270 * Added connection 0x7fffd8f5b638 to bundle at 0x7fffd876fd58, now 29 connections.
0x7fffe0a9d700 CURLIO 0x7fffc8c2c3e0 * Added connection 0x7fffd8f5bd08 to bundle at 0x7fffd876fd58, now 30 connections.
0x7fffe0a9d700 CURLIO 0x7fffc91a44f0 * Added connection 0x7fffd9583f88 to bundle at 0x7fffd876fd58, now 31 connections.
Thanks Carlo, but as Steve pointed out on the mailing list the other day, we much prefer having patches posted to the mailing list for review and discussions: http://curl.haxx.se/dev/contribute.html#How_to_get_your_changes_into_the
Can you please post them there?
Hi Daniel,
I'm still not satisfied with the result.
The objective is to have only a single connection. (I know that
you can set a limit on the number of simultaneous requests in the
pipeline, and when THAT limit is reached then of course a new
connection should be created; but for now let's assume that that
limit is not set, or not reached; in that case we want just one
connection to be active -- the whole point of doing pipelining.)
If a bundle has only one connection, then the normal reaction of
libcurl is to destroy the bundle when that connection is closed.
The reasons why a connection can be closed are:
1) One (or more) requests timeout.
2) The server closes the connection (connection reset by peer).
3) The connection cache runs full and the pipeline connection
happens to be the oldest idle connection.
Perhaps there are more reasons that I haven't run into yet.
My first commits solved 1), and a commit that I didn't push yet
should fix 2). Then I ran into 3)... and I'm starting to think
that the attempt to keep the bundle alive each of the (special)
cases is not the right approach.
A better approach is to look at it from the other side: when a
server connection is made and it is not yet known whether that
connection supports pipelining, there are the following
possibilities:
1) Libcurl creates a new connection for every new request (this
is the current behavior).
2) Libcurl creates a single connection, and queues all other
requests until it received all headers for that first request
and decided if the server supports pipelining or not.
If it does, it uses the same connection for the queued
requests. If it does not, then it creates new connections
for each request as under 1).
3) Libcurl assumes every connection can do pipelining and just
sends all requests over the first connection, until it finds
out that this server can't do pipelining - in which case the
pipe 'breaks' and already existing code will cause all requests
to be retried - now knowing that the server does not support
pipelining.
4) Libcurl doesn't wait for headers but uses information from
the user to decide if the server is supposed to support
http pipelining; meaning that it picks strategy 2) or 3) based
on a flag set on the easy handle by the user.
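For illustration, strategy 2) boils down to a dispatch rule like the following (the names and shape are hypothetical; this is a sketch of the queueing idea, not proposed libcurl code):

```c
#include <stddef.h>

/* Until the first response's headers have settled the question, only the
 * probe request goes on the wire; everything else stays queued client-side. */
enum pipelining_verdict { VERDICT_UNKNOWN = -1, VERDICT_NO = 0, VERDICT_YES = 1 };

/* Returns how many of 'queued' requests may be sent now, and via
 * '*new_conns' how many connections that requires. */
static size_t dispatch_now(enum pipelining_verdict v, int probe_in_flight,
                           size_t queued, size_t *new_conns)
{
  if(v == VERDICT_UNKNOWN) {
    if(probe_in_flight) {          /* keep everything queued behind the probe */
      *new_conns = 0;
      return 0;
    }
    *new_conns = queued ? 1 : 0;   /* send exactly one request as the probe */
    return queued ? 1 : 0;
  }
  if(v == VERDICT_YES) {           /* pipeline all of them on one connection */
    *new_conns = queued ? 1 : 0;
    return queued;
  }
  *new_conns = queued;             /* VERDICT_NO: one connection per request */
  return queued;
}
```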
I think that option 4) is by far the best experience for the user;
however, it is the hardest to implement because it requires
implementing both 2) and 3), as well as adding the code for a new flag.
The easiest solution is probably to always pick 3), but that is
almost unacceptably 'dirty' for those connections that do not
support pipelining. My current approach (in the user application)
is to add only a single request and let that finish, so the application
itself can detect whether pipelining is supported, and when it is not,
add that site to the blacklist. Done that way, option 3) would not
lead to 'dirty' failures, but I can hardly say that that seems like
a good general solution; it would be far more favorable to add
a flag for easy handles. Hence, if the choice is not to add
a new flag to easy handles, so that the user can specify preferences
per easy handle (as opposed to being forced to use a
dedicated multi handle for pipelining), then option 2) seems the
only reasonable choice.
Note that just option 2) (instead of 4)) isn't that bad: if pipelining
is supported and the first request would cause a stall of, say,
10 seconds, then in that case too all requests added behind it
in the pipeline would be stalled ("queued" on the server, instead
of client-side in the viewer). The main difference is that if
pipelining is NOT supported, you don't create a lot of parallel
connections right away, and hence there is an extra delay more or
less equal to the response time of the first request (typically less
than a second) for the additional requests; that delay recurs
every time libcurl destroys the bundle structure (i.e., when ALL
connections are closed). I don't see that as something bad: if all
connections are closed then apparently we're done with downloading
the bulk, and an extra sub-second delay on the next burst seems
very acceptable to me. Of course we only need to do this when the
multi handle is flagged to support pipelining in the first place.
The benefit is far larger!
Carlo, you're talking about further improvements and those are not at all related to the bug report you comment on here.
Development and design discussions here will not reach the proper audience. Please take that to the curl-library list, thanks!
Is this still a problem after the recent pipelining fixes?
Haven't had an opportunity to look over the fixes or give them a try. Everything is in github at this point?
Let's put it like this: we did quite some pipelining polishing and that is now in the master branch on github. And I believe this particular bug is fixed in there.
That's not the end of the pipeline fixes though (nor the end of identified pipelining bugs). Carlo has pending patches in his branch but has had to slow down his work for now due to medical reasons so it isn't clear right now how the path forward from here goes or at what speed.
Closing this issue now. If the problems remain, we either reopen this or we create a new bug with more and new info.