From: Massimo R. <rim...@in...> - 2016-09-11 12:29:44
Hi,
this is my first appearance on this mailing list, so I apologize for writing a problem report as my first post.
I am experiencing a rather strange process crash problem inside UML. In short: processes started inside a UML instance randomly (but reproducibly) crash when UML is running on a specific host. Everything works perfectly when the *same* filesystem image and UML kernel are used on other hosts.
The filesystem image has been created using debootstrap (detailed commands in the report below).
Examples of reproducible problems that I have been observing:
* The following command always segfaults:
localedef -i en_US -A /usr/share/locale/locale.alias -f UTF-8 en_US.UTF-8
A line similar to the following is correspondingly logged by the kernel:
localedef[49]: segfault at 0 ip 00000000004079fd sp 0000007fbfce84e0 error 4 in localedef[400000+47000]
* After clearing /var/lib/apt/lists, 'apt-get update' successfully downloads some index files, then hangs forever with a message "[Connecting to]" (yes, without a host name). It is not just the output that misses the server name: a sniffer reveals attempts by apt-get to resolve exactly one more server name than those listed in /etc/apt/sources.list, but this excess server has an empty name (a SRV query is sent for '_http._tcp', which instead should be followed by a valid name as in '_http._tcp.httpredir.debian.org'). Of course, the DNS replies with a failure.
* When systemd is used as init, from time to time I have observed crashes of the udev service or failures to write the journal. Errors similar to the following are logged by the UML kernel:
systemd-journald[58]: Failed to write entry (25 items, 576 bytes), ignoring: Invalid argument
Please consider that the latter symptom has nothing to do with the problem discussed at this URL (which, however, I have also been experiencing in certain experimental settings): https://www.mail-archive.com/deb...@li.../msg1440903.html. Indeed, I am *not* using systemd as init in the tests I am reporting about below.
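For reference, the localedef segfault line quoted above can be decoded by hand. The following is a minimal sketch, assuming that the UML kernel logs the standard x86 page-fault error code bits (bit 0: protection violation vs. page not present, bit 1: write vs. read, bit 2: user vs. kernel mode). The function names are made up for illustration.

```shell
#!/bin/sh
# Offset of the faulting instruction inside the mapped binary:
# instruction pointer minus the start address of the mapping.
segfault_offset() {
    printf '0x%x\n' $(( $1 - $2 ))
}

# Decode an x86 page-fault error code (assumed bit layout, see lead-in).
decode_pf_error() {
    if [ $(( $1 & 1 )) -ne 0 ]; then cause="protection violation"; else cause="page not present"; fi
    if [ $(( $1 & 2 )) -ne 0 ]; then access="write"; else access="read"; fi
    if [ $(( $1 & 4 )) -ne 0 ]; then mode="user"; else mode="kernel"; fi
    echo "${mode}-mode ${access}, ${cause}"
}

# Values copied from the kernel log line quoted above:
# ip 00000000004079fd, mapping localedef[400000+47000], error 4.
segfault_offset 0x4079fd 0x400000   # prints 0x79fd
decode_pf_error 4                   # prints user-mode read, page not present
```

Together with "segfault at 0", this reading suggests a read through a NULL pointer at offset 0x79fd inside the localedef binary, although it does not explain why the pointer is NULL only on one host.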
I have performed tests with several host/UML/filesystem/configuration combinations, as explained in the following.
==============================================================
To document test outcomes, I am using the following abbreviations:
- Host A: Intel i7-6700 3.4GHz, 32GB RAM, running Debian GNU/Linux testing (stretch) with kernel 4.6.0-1-amd64 (supplied by Debian, no modifications)
- Host B: VirtualBox 5.1.0r108711 VM with 1 (virtual) CPU, PIIX3 chipset, no execution cap, 7GB RAM, running Ubuntu 16.04.1 LTS (xenial) with kernel 4.4.0-31-generic (supplied by Ubuntu, no modifications), on top of Host A
- Host C: VirtualBox 5.1.0r108711 VM with 4 (virtual) CPUs, PIIX3 chipset, no execution cap, 2GB RAM, running Ubuntu 16.04 LTS (xenial) with kernel 4.4.0-28-generic (supplied by Ubuntu, no modifications), on top of a host with an Intel i7 2.2GHz CPU, 8GB RAM running Mac OS X 10.11.6 (El Capitan)
- UML 1: 4.3.2 64-bit UML kernel, custom compiled into a static executable from vanilla, with a few patches of little relevance applied (https://github.com/maxonthegit/netkit/tree/master/devel/kernel-patches). For reference, the configuration file is https://github.com/maxonthegit/netkit/blob/master/devel/netkit-kernel-config-4.2.1-x86_64
- UML 2: 4.3.5 64-bit UML kernel, obtained from http://uml.devloop.org.uk
- Network SLIRP: using slirp as transport, compiled from the Debian source package (https://packages.debian.org/source/stretch/slirp) with Debian patches applied (by 'apt-get source') and FULL_BOLT enabled.
UML command line argument: eth0=slirp,,/path/to/slirp
- Network NAT: using tuntap as transport, and enabling netfilter's masquerading on the host. Set up on the host using the following commands:
tunctl -u user -g user -t tun
ifconfig tun 10.0.0.1 up
echo 1 >/proc/sys/net/ipv4/ip_forward
iptables -t nat -A POSTROUTING -j MASQUERADE
A corresponding default route is set up inside UML.
UML command line argument: eth0=tuntap,tun
- Suite: I tried to run UML on filesystem images created using debootstrap for both the 'sid' (unstable) and the 'stretch' (testing) Debian releases. I used the following commands to generate the images:
dd if=/dev/zero of=/path/to/fs/image.img bs=1 count=0 seek=10G
/sbin/mkfs.ext4 -t ext4 -F /path/to/fs/image.img
mount -o loop /path/to/fs/image.img fs-mount-location
debootstrap --include=debconf-utils,locales sid fs-mount-location
cat > fs-mount-location/startup.sh <<EOF
#!/bin/bash
mount -t proc none /proc
mount -t sysfs none /sys
mount -t tmpfs none /run
mountpoint /dev || mount -t devtmpfs none /dev
echo "nameserver 8.8.8.8" > /etc/resolv.conf
ip addr add 10.0.2.15/8 dev eth0
ip link set eth0 up
ip route add default dev eth0
/bin/bash
EOF
chmod +x fs-mount-location/startup.sh
umount fs-mount-location
I have been using the *same* 2 filesystem image files for all the tests, reverting them to the original state after each test from a backup copy.
Test outcomes are as follows:
- OK: everything works flawlessly inside UML.
- FAIL: software crashes are observed inside UML, as in the examples discussed above (localedef, apt-get, sometimes systemd-journald).
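The revert-from-backup step mentioned above can be sketched as follows. This is only an illustration, not necessarily the exact procedure used for the tests: the function name and the example paths are placeholders, and --sparse=always assumes GNU cp.

```shell
#!/bin/sh
# Restore a working image from its pristine backup and verify the copy.
# --sparse=always keeps the restored file sparse, like the image that
# was originally created with 'dd ... count=0 seek=10G'.
restore_image() {
    backup=$1
    image=$2
    cp --sparse=always "$backup" "$image" && cmp -s "$backup" "$image"
}

# Example (placeholder paths):
#   restore_image /path/to/fs/image.img.orig /path/to/fs/image.img
```

The function returns non-zero if either the copy or the byte-by-byte comparison fails, so a test run can be skipped when the image is not in its original state.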
==============================================================
The UML command line was as follows:
kernel-x86_64 umid=test-vm mem=1073741824 ubd0=image.img rw con=null con0=fd:0,fd:1 eth0=<as above> init=/startup.sh
Here are the outcomes:
+------+-----+---------+---------+---------+
| Host | UML | Network | Suite   | OUTCOME |
+------+-----+---------+---------+---------+
| A    | 1   | SLIRP   | sid     | FAIL    |
| A    | 1   | NAT     | sid     | FAIL    | <= strace
| A    | 2   | SLIRP   | sid     | FAIL    |
| A    | 2   | NAT     | sid     | FAIL    |
| B    | 1   | SLIRP   | sid     | OK      |
| B    | 1   | NAT     | sid     | OK      |
| B    | 2   | SLIRP   | sid     | OK      |
| B    | 2   | NAT     | sid     | OK      |
| C    | 1   | SLIRP   | sid     | OK      | <= strace
| C    | 1   | NAT     | sid     | OK      |
| C    | 2   | SLIRP   | sid     | OK      |
| C    | 2   | NAT     | sid     | OK      |
| A    | 1   | SLIRP   | stretch | FAIL    |
| A    | 1   | NAT     | stretch | FAIL    |
| A    | 2   | SLIRP   | stretch | FAIL    |
| A    | 2   | NAT     | stretch | FAIL    |
| B    | 1   | SLIRP   | stretch | OK      |
| B    | 1   | NAT     | stretch | OK      |
| B    | 2   | SLIRP   | stretch | OK      |
| B    | 2   | NAT     | stretch | OK      |
| C    | 1   | SLIRP   | stretch | OK      |
| C    | 1   | NAT     | stretch | OK      |
| C    | 2   | SLIRP   | stretch | OK      |
| C    | 2   | NAT     | stretch | OK      |
+------+-----+---------+---------+---------+
So, one could come to an easy conclusion: something is wrong with host A. But what?
Here are some additional clues:
- I have performed an 'apt-get upgrade' on host A only a few days ago.
- Everything else works perfectly on host A.
- The problem occurs with two different host kernel versions (4.4.0-1-amd64 and 4.6.0-1-amd64).
- If I 'chroot' into the *same* filesystem image(s) used for the UML tests, everything works perfectly on host A.
- Changing the amount of RAM assigned to UML instances does not solve the problem on host A.
- It does not depend on the processes running on host A: I made a test after booting the host with init=/bin/bash and the problem still occurred.
- Although I am assigning the same amount of RAM to all the UML instances (purposely without 'M' or 'G' suffixes to avoid ambiguities), when UML boots it reports slightly different memory amounts:
On host A: Memory: 1024952K/1056060K available (3108K kernel code, 862K rwdata, 1032K rodata, 121K init, 294K bss, 31108K reserved, 0K cma-reserved)
On host C: Memory: 1024824K/1063076K available (3108K kernel code, 862K rwdata, 1032K rodata, 121K init, 294K bss, 38252K reserved, 0K cma-reserved)
- I have collected straces of a working and a failing run of localedef, in the scenarios marked in the table above, using:
strace -r -s 256 -T -yy -o out-strace-file -ff -v /usr/bin/localedef -i en_US -A /usr/share/locale/locale.alias -f UTF-8 en_US.UTF-8
These straces are available at the following addresses:
http://pastebin.com/bfjPERkK (failing, PID 378)
http://pastebin.com/BaWTCvjn (failing, PID 379)
http://pastebin.com/pCTttzPz (working, PID 56)
http://pastebin.com/ejXseFsq (working, PID 57)
After sanitization, it is pretty evident that everything is essentially identical up to the SIGSEGV at a 'brk' call. The following command can be used to compare a sane and a faulty output:
diff -y <(awk '{$1=""; $NF=""; print}' localedef-strace-failing.378 | sed -r 's/\[[0-9]+\]//g') <(awk '{$1=""; $NF=""; print}' localedef-strace-working.56 | sed -r 's/\[[0-9]+\]//g') | less -S
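The sanitization in the command above can also be wrapped in two small helper functions. This is a sketch with made-up function names. It uses temporary files instead of process substitution so that it also runs in a plain POSIX shell, and it assumes mktemp and a sed that accepts -r are available.

```shell
#!/bin/sh
# Strip the parts of each strace line that legitimately differ between runs:
#  - the first field: relative timestamp added by 'strace -r'
#  - the last field: syscall duration added by 'strace -T', e.g. <0.000042>
#  - bracketed numbers, e.g. the socket inode numbers printed by '-yy'
sanitize_strace() {
    awk '{$1=""; $NF=""; print}' "$1" | sed -r 's/\[[0-9]+\]//g'
}

# Side-by-side comparison of two sanitized traces.
# Returns the exit status of diff (0 = identical, 1 = different).
compare_straces() {
    left=$(mktemp)
    right=$(mktemp)
    sanitize_strace "$1" > "$left"
    sanitize_strace "$2" > "$right"
    diff -y "$left" "$right"
    status=$?
    rm -f "$left" "$right"
    return $status
}

# Example:
#   compare_straces localedef-strace-failing.378 localedef-strace-working.56 | less -S
```

Like the original one-liner, this also blanks the last field of lines that carry no duration (for example signal or exit lines), so the last differing lines should be checked against the raw traces as well.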
==============================================================
I have more or less run out of ideas about how to overcome this problem. Any suggestions are welcome. Thank you very much even for just reading so far.
As a side note, this resembles the kind of problem that was previously reported here: https://sourceforge.net/p/user-mode-linux/mailman/message/34978201/
Regards,
Massimo