Good day, Matthew,
On Mon, 28 Oct 2002, Matthew Bloch wrote:
> I'm just off tomorrow to go fix our UML host machine which, after I started a
> small build going on one of the UMLs, stopped responding to any TCP
> connections. Pings came back fine, and TCP connections to the UMLs *and* to
> the host machine were answered, but no data came back from them. So the host
> kernel seems to have stiffed itself in some way; the IP stack seems okay but
> whatever process is meant to be connected to incoming connections isn't
> responding, or data isn't coming through.
I'm running into the same problem on Zaphod (see
http://www.stearns.org/slartibartfast/uml-coop.current.html for more info):
the TCP handshake completes, but no userspace app can be forked (I'm
guessing on that part). I have a vague suspicion it _might_ be related to
a problem we had on the same host where running ps/w/top (on an already
open ssh connection) never returns; Jeff was kind enough to narrow that
down to a problem with the host kernel task lock.
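For anyone who wants to check a box for this symptom from the outside, here is a minimal sketch of a probe that tells "connection refused" apart from "handshake completes but nothing ever answers". The function name, the payload and the result labels are all made up for illustration; it only uses the standard Python socket module and is not anything from the UML tools.

```python
import socket


def probe(host, port, payload=b"hello\r\n", timeout=5.0):
    """Classify how far a TCP conversation with host:port gets.

    Hypothetical helper: the labels returned here are illustrative only.
    """
    try:
        # The kernel on the far side completes the three-way handshake
        # even if no userspace process ever calls accept().
        sock = socket.create_connection((host, port), timeout=timeout)
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "connect-timeout"
    except OSError:
        return "unreachable"

    with sock:
        sock.settimeout(timeout)
        try:
            sock.sendall(payload)
            data = sock.recv(1)  # wait for any byte from the application
        except socket.timeout:
            # Connected and sent data, but the application never replied.
            return "handshake-ok-no-data"
        except OSError:
            return "reset"
        return "data" if data else "closed-without-data"
```

Calling something like probe("66.59.109.137", 1500) during a hang would show which stage fails. A service that normally waits for the client to speak first needs a payload it will actually answer, so the default here is only a placeholder.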
Here's a tcpdump of the connection attempt:
64.91.163.166.33473 > 66.59.109.137.1500: SWE 1287174553:1287174553(0) win
2048 <mss 512,sackOK,timestamp 14605734 0,nop,wscale 0> (DF)
66.59.109.137.1500 > 64.91.163.166.33473: SE 822818040:822818040(0) ack
1287174554 win 5792 <mss 1460,sackOK,timestamp 87494946
14605734,nop,wscale 0> (DF)
64.91.163.166.33473 > 66.59.109.137.1500: . ack 1 win 2048
<nop,nop,timestamp 14605773 87494946> (DF)
64.91.163.166.33473 > 66.59.109.137.1500: P 1:414(413) ack 1 win 2048
<nop,nop,timestamp 14605773 87494946> (DF) [tos 0x2,ECT]
64.91.163.166.33473 > 66.59.109.137.1500: P 1:414(413) ack 1 win 2048
<nop,nop,timestamp 14605890 87494946> (DF)
66.59.109.137.1500 > 64.91.163.166.33473: SE 822818040:822818040(0) ack
1287174554 win 5792 <mss 1460,sackOK,timestamp 87495286
14605890,nop,wscale 0> (DF)
64.91.163.166.33473 > 66.59.109.137.1500: . ack 1 win 2048
<nop,nop,timestamp 14606088 87495286,nop,nop,sack sack 1 {0:1} > (DF)
64.91.163.166.33473 > 66.59.109.137.1500: P 1:414(413) ack 1 win 2048
<nop,nop,timestamp 14606124 87495286> (DF)
64.91.163.166.33473 > 66.59.109.137.1500: P 1:414(413) ack 1 win 2048
<nop,nop,timestamp 14606592 87495286> (DF)
66.59.109.137.1500 > 64.91.163.166.33473: SE 822818040:822818040(0) ack
1287174554 win 5792 <mss 1460,sackOK,timestamp 87495886
14606592,nop,wscale 0> (DF)
64.91.163.166.33473 > 66.59.109.137.1500: . ack 1 win 2048
<nop,nop,timestamp 14606679 87495886,nop,nop,sack sack 1 {0:1} > (DF)
64.91.163.166.33473 > 66.59.109.137.1500: P 1:414(413) ack 1 win 2048
<nop,nop,timestamp 14607528 87495886> (DF)
66.59.109.137.1500 > 64.91.163.166.33473: SE 822818040:822818040(0) ack
1287174554 win 5792 <mss 1460,sackOK,timestamp 87497086
14607528,nop,wscale 0> (DF)
64.91.163.166.33473 > 66.59.109.137.1500: . ack 1 win 2048
<nop,nop,timestamp 14607852 87497086,nop,nop,sack sack 1 {0:1} > (DF)
64.91.163.166.33473 > 66.59.109.137.1500: P 1:414(413) ack 1 win 2048
<nop,nop,timestamp 14609400 87497086> (DF)
66.59.109.137.1500 > 64.91.163.166.33473: SE 822818040:822818040(0) ack
1287174554 win 5792 <mss 1460,sackOK,timestamp 87499506
14609400,nop,wscale 0> (DF)
64.91.163.166.33473 > 66.59.109.137.1500: . ack 1 win 2048
<nop,nop,timestamp 14610221 87499506,nop,nop,sack sack 1 {0:1} > (DF)
64.91.163.166.33473 > 66.59.109.137.1500: P 1:414(413) ack 1 win 2048
<nop,nop,timestamp 14613144 87499506> (DF)
66.59.109.137.1500 > 64.91.163.166.33473: SE 822818040:822818040(0) ack
1287174554 win 5792 <mss 1460,sackOK,timestamp 87504326
14613144,nop,wscale 0> (DF)
64.91.163.166.33473 > 66.59.109.137.1500: . ack 1 win 2048
<nop,nop,timestamp 14614934 87504326,nop,nop,sack sack 1 {0:1} > (DF)
64.91.163.166.33473 > 66.59.109.137.1500: P 1:414(413) ack 1 win 2048
<nop,nop,timestamp 14620632 87504326> (DF)
64.91.163.166.33473 > 66.59.109.137.1500: P 1:414(413) ack 1 win 2048
<nop,nop,timestamp 14632632 87504326> (DF)
66.59.109.137.1500 > 64.91.163.166.33473: R 822818041:822818041(0) win 0
(DF)
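One thing that stands out in the trace: 66.59.109.137 sends the same SYN-ACK six times, and the gaps between the copies roughly double each time, which reads like ordinary retransmission backoff rather than a fresh reply. A small sketch to pull the gaps out of the timestamp values above; the tick length of 10 ms is my assumption and is not stated anywhere in the trace.

```python
# TSval fields copied from the six SYN-ACK ("SE") packets in the trace above.
synack_tsvals = [87494946, 87495286, 87495886, 87497086, 87499506, 87504326]

# Assumption: one timestamp tick is 10 ms. The trace does not say so.
TICK_SECONDS = 0.01


def gaps(values):
    """Return the differences between consecutive values."""
    return [later - earlier for earlier, later in zip(values, values[1:])]


synack_gaps = gaps(synack_tsvals)
for ticks in synack_gaps:
    print(f"{ticks:5d} ticks  (~{ticks * TICK_SECONDS:.1f} s if 1 tick = 10 ms)")
```

The gaps come out as 340, 600, 1200, 2420 and 4820 ticks, so each wait is about twice the previous one until the RST at the end.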
(To everyone, not just Matthew) Is there any chance that we're
exposing some race in the host kernel with repeated process name changes
or some other facet of UML? </mode straw-grasping=off>
> I can't be sure exactly what's going on until we can attach a monitor &
> keyboard to the server, but we were using a 2.4.19 host kernel, and whatever
> the latest UML patch for 2.4.19 was at the start of the month.
>
> We've got about 6-7hrs tomorrow to sort out our machine; does anyone have
> opinions on the most stable combination of UML host and client kernels to use
> in a production environment (apart from "not at all, bozo" :-) )? Or indeed
> what might have gone wrong? I'm interested in the most likely ways that a
> UML kernel could stiff its host.
No promises, but David Coulson was seeing daily lockups with
2.4.20-pre9 which went away with 2.4.20-pre10. I just updated Zaphod from
2.4.20-pre7-ac2 to 2.4.20-pre11. I have my fingers crossed.
Hell, I have my _toes_ crossed.
Cheers,
- Bill
---------------------------------------------------------------------------
If it happens once, it's a bug. If it happens twice, it's
a feature. If it happens more than twice, it's a design philosophy.
(Courtesy of Slashdot)
--------------------------------------------------------------------------
William Stearns (wstearns@...). Mason, Buildkernel, named2hosts,
and ipfwadm2ipchains are at: http://www.stearns.org
--------------------------------------------------------------------------