From: Sean Atkinson <sean@ne...> - 2004-03-26 09:38:37
I'm interested in the possibility of migrating entire domains between
hosts using Xen's suspend/resume feature. My hope is that
virtualisation could ease some of the pain in presenting a consistent
environment between hosts so the domain behaves sensibly when resumed.
I've dedicated partitions on two machines for migrating DOM1 - a small
root file system (<200M) and a larger read-only /usr (1.4G) to minimise
state transfer between migrated domains. I modified the scripts in a
Red Hat 9 installation to boot DOM1 in Xen. This included enabling
XDMCP only in DOM1's GDM so DOM0 can log into it using "X -query dom1",
which is handy.
I also noticed that rc.sysinit's fsck failed on /usr with "Error writing
block 521" since I only gave DOM1 read-only access to the partition. It
turns out that fsck's -a switch was trying to write to the partition
even though it's mounted read-only, so I replaced it with -n and it
works now. I'm not sure what this means for "read only" file systems
not protected by Xen's resource management layer...
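
The -a to -n swap could also be made per device rather than global. Here is a minimal Python sketch of that idea, not the actual rc.sysinit change. `fsck_command` and `read_only_devices` are hypothetical names and the device paths are placeholders. It relies only on fsck's -a (repair automatically) and -n (make no changes) switches.

```python
def fsck_command(device, read_only_devices):
    """Build an fsck argument list for one device.

    Devices the domain cannot write to are checked with -n (report only);
    everything else keeps the usual -a (automatic repair).
    """
    flag = "-n" if device in read_only_devices else "-a"
    return ["fsck", flag, device]


# Hypothetical layout: writable root, read-only /usr.
read_only = {"/dev/sda2"}
for dev in ("/dev/sda1", "/dev/sda2"):
    print(" ".join(fsck_command(dev, read_only)))
```

The list of read-only devices has to be supplied by hand here, since the domain may have no reliable way to discover that restriction itself.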
For my migration I used run level 3 and didn't try leaving any TCP
connections open for SSH or X etc. I suspended DOM1's 160M memory to a
49M file, and copied that with the xc_dom_create.py configuration script
and both partitions block-for-block to another host. Having set up the
new DOM0 to be as similar as possible, I then resumed the domain.
Apparently it almost worked - DOM1 responded to pings, although SSH
attempts just hung. Warnings like "Timer ISR: Time went backwards:
-20221230000000" were also issued in rapid succession, with the 5th
digit decrementing about once every second. Resuming the domain on the
original host worked as normal, with only warnings about init ids
re-spawning too fast after the time jump.
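
One possible reading of that warning: the number has 14 digits, so its 5th digit sits in the 10^9 place. A digit in that place changing about once per second would be consistent with the value being in nanoseconds, which would make the jump a little over five and a half hours. The unit is an assumption, not something the message states, and the helper names below are made up for illustration.

```python
def digit_place_value(raw, position):
    """Place value of the digit at 1-based `position`, counted from the left."""
    digits = str(abs(raw))
    return 10 ** (len(digits) - position)


def describe_offset(raw, units_per_second=10**9):
    """Split the magnitude of a 'Time went backwards' value into h/m/s.

    Assumes the value is in nanoseconds (units_per_second = 10**9); this is
    inferred from the warning's behaviour, not documented anywhere in it.
    Returns whole (hours, minutes, seconds); fractions of a second are dropped.
    """
    total_seconds = abs(raw) // units_per_second
    hours, rest = divmod(total_seconds, 3600)
    minutes, seconds = divmod(rest, 60)
    return hours, minutes, seconds


raw = -20221230000000
print(digit_place_value(raw, 5))  # place value of the 5th digit
print(describe_offset(raw))       # (hours, minutes, seconds)
```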
Perhaps it would be more interesting to perform such an experiment on
more similar hardware - my two hosts aren't very well matched so I
didn't really expect too much, but I should have two identical
machines to test shortly.
If this isn't a completely brain dead thing to be trying, longer term
problems would be migrating IP addresses and maintaining open network
connections, which could all get pretty funky.
A smaller root would help to reduce the size of writable file systems
copied (e.g. /lib uses 80M including static kernel modules). Perhaps it
would even make sense for it to be a RAM disk? Or if networking's fixed,
maybe all writable file systems could be network mounted?
I was also thinking that a little cooperation from a domain before
suspending could help, for example disabling any swap. Memory could be
"deflated" with balloon and unused file system blocks ignored, or more
simply both could be filled with zeroes to help compression.
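
The zero-filling idea can be illustrated with a toy model. This is only a sketch: it uses Python's zlib on a made-up "image" of 4K pages, with random bytes standing in for both live data and stale free pages. It says nothing about Xen's actual suspend format, and real pages are rarely as incompressible as random bytes, so the gap it shows is a best case.

```python
import os
import zlib

PAGE = 4096  # toy page size in bytes


def make_image(used_pages, free_pages, zero_free):
    """Build a toy image as a list of pages.

    Used pages hold random bytes (a stand-in for live data).
    Free pages hold either stale random bytes or zeroes.
    """
    used = [os.urandom(PAGE) for _ in range(used_pages)]
    if zero_free:
        free = [bytes(PAGE) for _ in range(free_pages)]
    else:
        free = [os.urandom(PAGE) for _ in range(free_pages)]
    return used + free


def compressed_size(pages):
    """Size in bytes of the zlib-compressed concatenation of `pages`."""
    return len(zlib.compress(b"".join(pages)))


stale = compressed_size(make_image(64, 192, zero_free=False))
zeroed = compressed_size(make_image(64, 192, zero_free=True))
print("stale free pages: ", stale, "bytes")
print("zeroed free pages:", zeroed, "bytes")
```

With three quarters of the pages free, the zeroed image compresses to little more than the used pages alone, while the stale one barely shrinks at all.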
Any thoughts on all this madness please?
Sean Atkinson <sean@...>