From: James W M. <mcm...@ju...> - 2004-01-20 00:02:58
> What does COW have to do with the LVM driver? They're completely
> independent of each other.

They are independent. The last time we emailed about this you said, as below, that you wanted to move the COW layer up, where I want to move it down :). Above the ubd device is the request queue where the LVM system lives, so it should be possible to have an LVM module that understands the COW format. From what I have seen, it would work best as part of the device-mapper LVM in 2.6 (there is apparently a backport to 2.4).

The setup in that case would be to use losetup to make devices, one for each COW file and each backing file, so only half of the available loop devices can be used for COW files and the other half for the backing files, i.e. 4 COW files with 4 backing files from the usual default of 8 loopback devices. As of Feb 2002 you can have up to 256 loop devices, but I think a hard limit of 128 COW files per system is too small, and if max_loop is left at the default of 8, 4 COW files per system would be a great pain. (Yes, if they all use the exact same backing file you could get 255 or 7.)

Then you set up LVM using the COW plugin, with the loop devices as the parts of the LVM device; pvcreate is the command, I think. I am pretty sure that the COW format does not much resemble the current LVM volume group headers.

Another option is to upgrade the loop driver, much like cryptoloop, but even cryptoloop appears to be moving to LVM instead of the loop infrastructure, which few of the kernel hackers seem to like.

> The COW stuff is inside the ubd driver, rather than below it. I want it
> to be above the ubd driver, so that it is completely outside, and a
> ubd device deals with only one file.
The COW stuff is currently embedded in the ubd driver, which makes new COW formats and ISAM a pain to link in. You said you wanted to move it up, so I was assuming to the LVM layer; I wanted to move it down to a layer below the ubd driver so the request queue looks normal. I want the COW device to look like a single file to the ubd driver, so that none of the COW code is embedded in the ubd driver itself. I also want an architecture that allows new formats to be plugged in easily, so I can add ISAM, for example. User-side plugins seem easier to work on than kernel plugins.

> This will clean up the code, allow stackable COW files as a trivial
> side-effect, and allow COW volumes to be mounted on the host. What
> won't you be able to do if this happens?

Yes, once the ubd driver regards the COW device as a single file, the stacking is more or less free, regardless of whether it sits above or below the ubd driver layer. I think of COW as a sort of lib_moo style layer: on the userspace side it could also be used by the uml_moo program and as a userspace filesystem plugin. I don't really want to think about trying to use fstat or mmap from the middle of the request queue stack, and I don't want those details creeping back into the ubd driver once they are out of there, since what needs mmapping changes based on the format.

What won't LVM do? Well, I don't normally use it, so I could easily be wrong, but booting from LVM appears somewhat different from booting from regular devices; not a lot harder, but different, and it might take an initrd for some setups that don't use one now. I would also like the option of running LVM on top of COW, and while I think LVM can be stacked on LVM, I would prefer to keep them separate so that LVM upgrades don't break COW. The ubd driver and the COW code are closely related, so changing them together is less of a problem. I think it should look like a plain device once you are inside the UML, without the UML userspace being required to set up an LVM (unless I want to do that too).

> This is another thing a
> separate COW driver will make possible. A COW volume
> could be mounted on the host and be passed to UML as a device.

Yes, any of the un-embedded versions should allow this. I was thinking of the userspace filesystem, so we can keep using mmap and the other normal features of an XYZ_user.c part :) and have someone else working on support for the kernel module.

> Just to be clear, you're suggesting copying a COW file on the host, then
> switching the ubd driver to use the copy? I don't see how that helps, because
> the COW file needs to be quiescent during the copy.

Um, not exactly; I should not compose e-mail at midnight. Starting with ./linux ubd0=COW1,backing, I was suggesting creating a new COW file (COW2,COW1 for example) while COW1 is mounted, and then at the right time (after a sync, or when 2.6 sends the barrier request) replacing the fd in the ubd structure with a pointer to COW2, which then processes all of the writes. COW1 is now effectively read-only and can be closed, and the R/O copy from the COW2 creation is used. COW1 and all lower COW files can then be merged, either manually with uml_moo or as a background task at the cow_user level, which after completion can replace the COW1 entry in the COW2 stack. I think LVM snapshots work almost the same way, and in both cases fsck needs to be run on rollback to restore the filesystem.

> This is better than stuffing geometry in the COW header. Are there any actual
> uses that you know of for specifying the geometry of a ubd device?
> Again, interesting, but are there any uses for this?
Well, yes. I want to be able to read real disk images, either from a raw device (yes, we could add a special-case ioctl to check for a real device and read its geometry, but ick, it is easier to specify it) or from a dd'ed image file of a hard disk, which would not work even with the ioctl, so ubd1C102H15S16 is better from my view :). When I do this now I have to rewrite the ubd driver with the new CHS to get the partitions, and I don't really like that much. None of my real hard disks seem to use H128S32, and S63 is much more common; the fake disk geometries are even worse. I would like to have a header where I can specify the geometry so that it is saved with the data and I don't have to keep track of what the real disk had. See DISK & HEADER below: one allows me to merge in the data from the disk image, and the other uses just a header, so it can keep them separate. Then I usually want a COW on top so the image is read-only; I don't want to write to SGI disk images very often, as an example I have actually had, but I do want to mount them.

> > P for padding i.e. 512 byte header
> > padding so that raw devices can be
> > checked without failing due to I/O
> > errors dropping out.
>
> I think this happens automatically since I/O happens with 512
> granularity anyway.

Um, no. The header of the COW file is tried first, and it was not a multiple of 512; that is why the raw device broke. It tried to read the V2 header, got an I/O error, and then did not try the plain-file case because of the I/O error, though it would have worked as a plain file. He just wanted ubd1=/dev/raw1 without a COW file, so there was no header to read, but the device still had the 512-byte alignment restriction, so the read gave an I/O error.

> > M for mmap size index,data
>
> Do you mean the mmap unit?
Um, yes: how big a chunk to mmap in each case. I would guess a default of one page for the index/bitmap would be about right, with no mapping for the data, and other sizes for the expert user or performance tests. My current test maps in all of the bitmap or index and none of the data; other settings would be nice to have.

> > V to select what version header to create
> > V0 is a raw data file don't check for COW
> > V1 the first version header not portable
> > V2 the second one with the wrong math
> > V3 my version with the separated offset length
> > separated allows for using a program to fix the
> > math errors on detection without having to
> > redo the header format
>
> I don't want to support old COW versions. I don't see any point in
> allowing UML to create V1 or V2 COW files.

I was writing it very generally; V1 and V2 are just normal cases and were not hard to put in, and someone always seems to want to do odd things like run broken versions every once in a while. As of now V2 is the current format; V3 is being worked on but is not yet present. In any case I was planning on a warning message to indicate that V1 is obsolete, and I will add the same message later to V2 once V3 is working; it is only a question when creating a new COW file. I had V3,V2,V1,backing as a test case last year, so sometimes you do want it, if only to make sure it still works.

> V0 is needed, but that's the wrong name. Some other switch is needed to
> tell the driver to treat the file as data even if it looks like a
> COW file.

OK, how about D for direct (also uppercase)? V0 is just the free case, since it is entry 0 in the table of formats to try, just like V1/2/3/4 are the same entries in the table; it is very easy if they are sequence numbers.

> > V4 my version of ISAM
>
> This is a new cow_format, not a new version of the COW file (and I realize
> that sounds confusing :-). IOW, you set the cow_format field in the header
> to 1, after getting me to assign 1 as the ISAM cow_format.
Why not? We already have a type field, the cow_version; why do we need another integer to express that it is a new format? Plus the data in the header changes: the current variables for ISAM do not really map to anything similar in a COW file. The index and the COW bitmap resemble each other, but the length computation is very different (sector count / 8 vs. sector count * 8), so there is not much code in common between COW and ISAM. How it is laid out on disk is different, how you find a sector is different, when you need to update the header is different, etc. Also, there is no cow_format field in the current V2 header, and this has a different header layout as well. Additionally, I think DISK & HEADER from the list of other formats would be useful, though they can also be degenerate cases of COW or ISAM with index/bitmap length = 0. But I would need the CHS in the headers in the latter case, and I want them in any case.

The method I was using has separate cow_v1/2/3.c files, each of which is self-contained. The cow_user.c I am building has common_open/close/pread/pwrite that calls into the cow_vX.c and isam.c files through an ops structure linked in, but everything is isolated. I am still basing ISAM on my proposed V3 COW file since I had it working, and I really like offset,length pairs; they seldom break. For example, this would have made the V2 bug where the map overlaps the data a trivial fix: run a program like uml_fix to check the COW file and rewrite it if it had the overlap, with the new offset and length just moving the data to fix the problem. Then new versions of UML that detect the overlap would say ~"overlapped header, run uml_fix", but the header format would not need to change, just the math used to calculate it, which is where we had the error. I also included overlap checks in the header write routine for my V3 and ISAM.

> > L for symlinks as names
>
> What is this? I don't remember seeing this before.
To not follow symlinks like we do now. That would let the person who wanted symlinks pointing to where the backing files are kept avoid having fixed paths show up in the COW file. I regarded his comment as a reasonable request; if you want the COW file to lie, it should pay attention.

> > U for update in place
> > which only really applies to the moo
> > program, but I was thinking should
> > have all the options defined.
>
> Well, if the ubd (or COW) driver isn't going to do update in place (and I don't
> see why that makes sense), then it should just be specific to uml_moo.

First, define it generally so that it is not used differently by uml_moo and UML; second, it might make sense later, like a snapshot uml_moo that merges layers of COW files below the new snapshot to save space. It had not occurred to me that someone would want to generate a COW file on a mounted system at runtime, but now it seems clear that could be very useful. That is really why I like general solutions: they allow strange things you did not think of initially to be put in without too much trouble once it becomes clear why they would be useful.

> > ?? sector size should have a option
> > ?? for page size i.e. use 64k even on i386
> > so COW can be read on say a alpha
>
> Yeah.
>
> > A to set AIO mode
>
> I think this should be auto-detected and used. Maybe there should be a no-AIO
> for debugging. In any case, AIO is a lot less useful with ubd-mmap, unless
> there is an AIO way to wait for a page to be faulted in.

AIO was a pain to auto-detect. The compile-time check will find it in some cases where glibc emulates it with threads, and having glibc try to create threads from inside a UML did not work at all last time, so AIO would need both an off and an on option. Last time I tried it, defaulting to off would prevent breakage, but default on has better performance, so off by default with A for on should be OK. But if it tries to auto-detect, a command to force it off is needed in case glibc is being fancy.
Yes, AIO can be useful even with mmap, since it allows reads and writes to complete out of order on the host. If they are really going to disk you can get a large speedup by having the host reorder the requests properly, just like it does with its own requests, and you can have several requests outstanding at once with AIO, just like you can use the cache on IDE drives with TCQ. The writes would really benefit and would not block the reads where the data is mmapped in. I seem to remember an AIO mmap somewhere.

> > R for readonly so all options are uppercase
> > S for sync data at each read/write
> > ?? for sync on barriers for 2.6
>
> Are you thinking barriers for journalling filesystems? That would
> be useful, but I've done no thinking about that.

The new request queue has what I think are the JBD barriers sent in the queue, and I was thinking it would be nice to sync when we get them: much less overhead than the ubd=sync option, and it should work properly with little effort. Sort of the equivalent of the IDE flush command. And just like the real thing, you may not want it turned on :( even if it is available.

> The paged index would be useful (and that's probably the only way mapping the
> index could be useful on x86). It's not obvious to me that you can keep the
> right parts mapped. Also, the fact that only little pieces of the index
> are mapped at any time makes it a lot harder to make mapping pay. With only
> two words being changed at a time, the writes are pretty damn cheap.
I was no longer planning on doing just two words of update. Currently my version is very inefficient: it does one sector at a time and the common_pread/pwrite puts them together, but I was planning on having runs continue N sectors at a time, and that is harder. For ISAM the index is directly mapped. Hey, it is proof of concept; I don't want to optimize until the infrastructure is better. I want easy to write and clear to read for the first versions; they will work better (no crashes), and we can make it fancy later. But I want to sketch out now what flags I think will be needed later, so as to avoid forgetting something. My earlier version used the paged bitmap, but it was somewhat complex.

> You'd use __get_free_pages for this, not kmalloc. But that's exactly what
> happens, and it works cleanly. You allocate the pages, overmap them from
> the device, and then unmap and replace the original pages on free.

I don't remember the current V2 doing either, but I think it is needed; currently V2 vmalloc's the data area and then just reads it in. I was actually using mmap itself; that way I did not have to worry about writing the bitmap/index out at the end. I also use pread/pwrite, since as far as I know every block access currently does a seek to the block's offset anyway :). I don't know of any real seeks on a block device; they all have block numbers. Yet another upgrade for the future: vmalloc is not liked by Linus, and I was using kmalloc just so I only had one type of memory allocated; __get_free_pages is better in some cases.

> > The types of headers I think should be present
> > COW -- what we have now
> > ISAM -- which will work without sparse files
> > DISK -- just to keep disk image info CHS etc
> > HEADER -- like disk but treats the data in
> > the disk as a separate file sort of like the
> > backing file in a COW setup
> > several different HEADERs could be setup
> > for one image file, each with a different layout
> > for the C/H/S as a example
>
> Those are interesting.
> However, I must again ask who's crying for
> these :-)

Me :), for one, and someone else wanted ISAM to run on a DOS? partition, which will not do sparse files. I want to keep disk images with different CHS geometry.

> I am looking for ubd-mmap data corruption, but I don't see any there. In
> the case that the data can't be mapped (wrong size, wrong alignment,
> etc), then it just falls back to the old read/write code.

It was a while ago when I was trying to figure out the ubd=mmap option; it seems I was wrong :(. I don't even see where I was looking; it might also have been in your separate COW layer. Some examples of where I might have been lost: physmem_subst_mapping seems to be a function which reads in data by mmapping over the buffer in the request and sets up a hash to keep track of what was mapped where, but it has no comments at all, unless you just added some recently. file_io from os/file.c is also quite arcane without comments. The paging in, done by calling down to get one char from each page, looks like it should work, but it makes a read or a write route through several indirect layers, and passing in the copy function pointer is interesting when debugging. I have not figured out why you wanted to put the interfaces in those exact spots, though.