Re: [Tapioca-devel] Archive format

On Wed, 4 Jul 2001, Eric Lee Green wrote:

> I think it makes everything easier except one thing: progress streams.
> We will have to see whether we can get a stream object out of an RMI

Another alternative is a reverse socket connection to a client-side thread
that will be handling the progress updates:

port = GimmeSocket(addr)
thread = ProgressMonitorThread(addr, port)
result = tapioca.exports.SendMeProgress(addr, port)
....

The tricky part here is what to do about attempts to monitor a job that
has already exited.
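
As a rough illustration of the client side, here is a minimal Python sketch
of a monitor thread that listens on an ephemeral port and collects whatever
the server sends back.  The class name comes from the pseudocode above; the
newline-delimited update format, the `updates` list, and the accept timeout
(one possible way to cope with a job that has already exited and never
connects back) are assumptions, not anything we have agreed on.

```python
import socket
import threading

class ProgressMonitorThread(threading.Thread):
    """Listen for one reverse connection and collect progress lines from it."""

    def __init__(self, host="127.0.0.1", timeout=5.0):
        super().__init__(daemon=True)
        self.listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        self.listener.bind((host, 0))          # port 0: let the OS pick a free port
        self.listener.listen(1)
        self.listener.settimeout(timeout)      # give up if nobody connects back
        self.addr, self.port = self.listener.getsockname()
        self.updates = []
        self.timed_out = False

    def run(self):
        try:
            conn, _ = self.listener.accept()
        except socket.timeout:
            # Assumption: no connection within the timeout means the job is gone.
            self.timed_out = True
            return
        finally:
            self.listener.close()
        with conn, conn.makefile("r") as lines:
            for line in lines:                 # loop ends when the peer closes (EOF)
                self.updates.append(line.rstrip("\n"))
```

The address and port to hand to the server are available as `monitor.addr`
and `monitor.port` once the object has been constructed.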

> > 1. Archive volumes are written in fixed blocks of a configurable size.
> > Minimum IO block size is 16k.
>
> This may not work with backups to Jaz disks, DVD-RAM disks, or MO disks.
> Given my current employer, I very much want this to work with DVD-RAM
> disks and MO disks :-). I will experiment tomorrow and find out exactly
> what kind of write sizes these jukeboxes will handle.  I know that the MO
> disks do require all writes to be at least 2K, or a multiple thereof,

Well, my hope is that we could provide a buffer of a particular size, and
have the driver figure out how to transfer that to the device.  As long as
we provide the driver a buffer that is an even multiple of its required
block size, it should be able to handle it.

Otherwise, we may have to do what you suggested, introducing another layer
below the archive IO engine that manages the actual IO size (vs. the
archive block size), doing multiple read/write operations as required.
Actually, we have to do something like this for reading archives from
pipes, since Linux pipes will never give you more than PAGESIZE from a
single read operation, and may give you less.
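
The short-read handling for pipes could look something like the following
Python sketch.  The function name `read_full` is made up for illustration;
the only library call it relies on is `os.read`, which may return fewer
bytes than requested and returns an empty result at EOF.

```python
import os

def read_full(fd, size):
    """Read exactly `size` bytes from fd, or fewer only if EOF arrives first."""
    chunks = []
    remaining = size
    while remaining > 0:
        chunk = os.read(fd, remaining)   # may return fewer bytes than requested
        if not chunk:                    # empty result means EOF
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)
```

The archive IO engine would then ask for one archive block at a time and
treat a short result as the end of the archive.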

> obvious. If I have one system with 100gb of data, and one system with 5gb
> of data, and multiplex the two, then the second stream will finish
> probably before the end of the first volume, except the actual tape writer
> really has no way of knowing that. It knows it hasn't seen any data from
> that particular stream for a while, but doesn't know whether it's just a
> case of the source for that stream being busy or something else.

Well, it should get an EOF on the socket.  That will tell it the stream is
done, at which point, it can mark it as inactive/finished/whatever.

> > we could say that this archive will consist of 8 streams, but only 3 can
> > be written concurrently.  This would allow the other 5 streams to be
> > written as previous streams are completed.
>
> The problem here is that if we want our volume header to include a list of
> *all* streams in the volume, that seems to indicate that either a) the
> central authority assigns stream ID's (thus knows what stream ID's are
> going to be assigned to everything started up), or b) everything starts up
> at the same time, and some are 'put to bed' for a while. I don't like the
> 'put to bed' thing, because sockets can time out, connections can time
> out, IP masq table entries can time out, etc. Nasty.
>
> So I guess we *could* do this. What we need to do, then, is have the
> central authority assign the stream ID's.

It's a very nice feature from an administration standpoint, because I can
say I want to back up my entire workgroup of 30 machines, and multiplex
them 4 at a time, and let the software figure out how to do it.  It saves
me from having to micro-manage the backup, and ensures the backup is
completed in the least amount of time.
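
To make the idea concrete, here is a small Python sketch of a central
authority that assigns all stream IDs up front and then starts streams as
slots free up.  The class and method names are hypothetical, and it ignores
everything about how a stream is actually started on the remote machine.

```python
from collections import deque

class StreamScheduler:
    """Assign stream IDs centrally and limit how many streams run at once."""

    def __init__(self, sources, max_active):
        # Every stream ID is assigned before any data moves, so the volume
        # header can list all streams in the volume.
        self.stream_ids = {sid: src for sid, src in enumerate(sources, start=1)}
        self.pending = deque(self.stream_ids)   # stream IDs not yet started
        self.active = set()
        self.max_active = max_active

    def start_ready(self):
        """Start pending streams until max_active are running; return their IDs."""
        started = []
        while self.pending and len(self.active) < self.max_active:
            sid = self.pending.popleft()
            self.active.add(sid)
            started.append(sid)
        return started

    def finish(self, sid):
        """Mark a stream as finished (e.g. on EOF) and start the next pending ones."""
        self.active.remove(sid)
        return self.start_ready()
```

Because pending streams are not connected until their slot opens, nothing
has to be 'put to bed' while it waits.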

> > 3. Each IO block contains a header that specifies its archive stream ID,
> > sequence number, checksum, etc.
>
> The actual writer probably wants a backup stream ID and backup stream
> sequence number and checksum. It doesn't know anything about archive
> streams, it just knows what I/O blocks look like. Some other widget
> that knows about archive streams actually writes header info.

I think we agree, and are just getting terminology mixed up.... :-(

> We'll need to think about restores, and how to specify what to be
> restored. I'm thinking that the low level restorer thingy will just be fed
> a sequence of block numbers, and will fetch those blocks and put them out.
> Then something upstream will actually strip the proper file info out of
> the stream. This requires a simple SQL statement to grab the block numbers
> for the files and output them. Hmm, that tends to indicate that we want a
> 'last block' field in the database too, for indicating the block that
> contains the end of the file. (and perhaps can indicate a range of blocks
> to the restorer thingy).

Hmm, this would indicate that we can't do anything less than a full
restore without a catalog, even using low-level, command-line accessible,
utilities.  I can accept that.

The alternative is to have the lowlevel archive/backup/whatever scanner
have the ability to also look for paths and path extensions.  Essentially,
to treat the archive/backup/whatever as a filesystem tree.

If we eliminate this concept, then we can also eliminate certain other
unstated requirements that were in my head, like that all entries in a
particular directory be contiguous within a single archive stream.  I.e.,
that /usr/bin comes after /usr and before anything else not in the /usr
hierarchy.

This could lead to some other interesting features, like multi-threading
backups on the agent and processing different filesystems in parallel.
Heck, we could even consider breaking a single backup-object (file,
database, etc) into segments that could be interspersed with other
segments within the archive stream!  Although, I think that is complexity
we can avoid at this point.

As far as the database goes though, I think we primarily want to store the
starting block of a new backup object (file, database, etc), rather than
all of the blocks or range of blocks that contain it.  I think if we try
to describe all of the blocks that make it up, we will essentially end up
indexing every single archive block in the database, and that is a *lot*
of data.

I think the restorer gets a starting block, an archive stream id, a data
stream id, and reads the media pulling out the blocks it needs.  There
will be an indicator in the archive that a particular data stream is now
finished, so we don't even need the ending block.
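
A minimal Python sketch of that restorer follows.  The `Block` layout
(block number, two stream IDs, payload, end-of-stream flag) is purely an
assumption for illustration; the real header fields are still undecided.

```python
from typing import NamedTuple

class Block(NamedTuple):
    number: int
    archive_stream: int
    data_stream: int
    payload: bytes
    end_of_stream: bool = False    # hypothetical 'data stream finished' indicator

def restore_stream(blocks, start_block, archive_stream, data_stream):
    """Yield the payloads of one data stream, starting at start_block."""
    for block in blocks:
        if block.number < start_block:
            continue               # before the starting block from the catalog
        if (block.archive_stream, block.data_stream) != (archive_stream, data_stream):
            continue               # block belongs to another multiplexed stream
        yield block.payload
        if block.end_of_stream:
            return                 # no ending block number needed
```

The catalog only has to supply the three values passed in; everything else
is found by reading forward on the media.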

This does mean that the multiplexing writer thingy needs to be fair in how
it chooses which archive stream to service next, so that we don't have 10GB
of data to read through to find the next block, unless the system it came
from had some kind of problem.
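
One simple notion of 'fair' is plain round-robin.  The Python sketch below
assumes each stream is an iterator of ready chunks, which glosses over the
real problem of sources that are slow or blocked; the function name is
hypothetical.

```python
from collections import deque

def multiplex_round_robin(streams):
    """streams: dict of stream_id -> iterable of chunks.

    Yield (stream_id, chunk) pairs, taking one chunk from each stream in turn.
    """
    queue = deque((sid, iter(chunks)) for sid, chunks in streams.items())
    while queue:
        sid, it = queue.popleft()
        try:
            chunk = next(it)
        except StopIteration:
            continue               # stream is finished; drop it from the rotation
        yield sid, chunk
        queue.append((sid, it))    # back of the line until its next turn
```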

There could be some efficiency gained on a restore from being able to use
QFA seeks to skip ahead, but I don't think it's worth the extra catalog
storage.

> I'm not sure that there's much we can do about corruption in today's world
> except report it, write the block somewhere in case a human needs to see
> it, and continue. If we've trashed the archive somehow (via a defective
> program or whatever), I certainly don't want to restore corrupted data!

Of course, we could always do ECC for the headers (not data), but then the
format gets to be *really* complex (it's going to be complex enough as it
is!), and I suspect our performance would get hit pretty hard.

> The platform and type of stream is indicated by the writer object. An NT
> writer object will have a different object ID than a Linux EXT2 writer
> object or a MacOS writer object, because they have different data formats.
> Writer objects may have, e.g., a pathname translator function, associated
> with them. We don't need to put more indicator ID's into the headers,
> because these indicator ID's are associated with the object that is
> creating this stream. Sort of, if I have a code 0x5324 that is a "Oracle
> NT Database Dump Object", this object may have a pathname code of 0x53
> associated with it ("SQL Database/Table Names"), an originating OS code of
> 0x05 (Win32), etc... but all we need in the records on tape is the object
> id of the creating object.
>
> This means we have a central repository of object information, but that's
> easy enough to accomplish (we're going to have a SQL database, after
> all!).

I think I got it....what if we could interconnect the writer objects for a
restore?  Let me explain: during a backup, the process is simple and
obvious (from the data flow perspective):

NTFS file --> NTFS_File_Obj --> ArchiveObj -> ....
EXT2FS file --> EXT2_File_Obj --> ArchiveObj -> ....
Oracle DB --> Oracle_DB_Obj ....

Let's think about those File and DB objects for a second -- they need to
read object specific data (ACLs, extended bits, file attributes, datafile
locations, etc), format it into a form they can read later, and read and
format the data from the file or database.  All of that gets sent off to
the archiving process.

For a restore to the original location, the process looks like:

... --> ArchiveObj -{NTFS_File_Stream}-> NTFS_File_Obj --> NTFS file
...

In the above, the NTFS_File_Obj gets (via push or pull) the same data
stream(s) it originally stored via the archive object (we hope!).  Since
it wrote that data, it knows how to decode it to get the filename, file
attributes, and the data section.  So, it has no problem recreating the
original file, with the original attributes, etc.

The problem is how to handle the foreign data case:

... -> ArchiveObj -{NTFS_File_Stream} -> EXT2_File_Obj -> EXT2FS file
or
... -> ArchiveObj -{NTFS_File_Stream} -> NTFS_File_Obj -> EXT2FS file

One way is as you suggest, to have a translator object to convert the
NTFS_File_Stream into an EXT2_File_Stream, that can be read by the
EXT2_File_Obj.  But that involves a set of objects that have to be
maintained in sync with the writer objects.  Instead, what if we leverage
the knowledge and code already in the NTFS_File_Obj, and tell it to write
its data via an EXT2_File_Obj?  It looks like this:

... -> ArchiveObj -{NTFS_File_Stream} -> NTFS_File_Obj
        --> EXT2_File_Obj -> EXT2FS file

In this model, the NTFS_File_Obj still does the decoding of the data
stream it wrote.  But rather than create and write a file directly on the
filesystem, it was given a file object to do that.  It assigns a path (via
the platform independent representation) to the EXT2 file object, tells the
object to open for writing, and writes only the file data to it.  We could
even make these things look like C++ stream objects, by giving them '<<'
and '>>' operators!
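
Here is a toy Python sketch of that chaining.  The class and method names,
and the dict used to stand in for a decoded NTFS_File_Stream, are all
assumptions made up for the example; only the division of labour between
the two objects is the point.

```python
import os
import tempfile

class Ext2FileObj:
    """Platform writer: creates a plain file and writes raw data to it."""

    def open_for_write(self, path):
        self._fh = open(path, "wb")

    def write(self, data):
        self._fh.write(data)

    def close(self):
        self._fh.close()

class NtfsFileObj:
    """Decodes the stream it originally wrote and writes via a target object."""

    def restore(self, stream, target, root):
        # Hypothetical decoded stream: path components, attributes, data chunks.
        path = os.path.join(root, *stream["path"])
        target.open_for_write(path)
        for chunk in stream["data"]:
            target.write(chunk)    # file data only; NTFS attributes are not passed on
        target.close()
        return path
```

The NTFS object never touches the filesystem itself, so the same decoding
code serves any target that offers the three methods above.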

This can also be generalized for the restore to original case:

... -> ArchiveObj -{NTFS_File_Stream} -> NTFS_File_Obj(1)
        -{calls and functions}-> NTFS_File_Obj(1) -> NTFS file

In the above, the first NTFS_File_Obj can see that it is 'writing' to
itself, and thus in addition to the steps above, can call its own
functions to set ACLs, owner, permissions, etc.

The second object is determined by what system/filesystem we are restoring
to.  Obviously, that can only be done once we know where the file is
going, so the first step is to initialize the first object, get its path,
and determine the second, platform specific object, based on that.  But
the important thing is the 'translation' is done by the writer/reader
objects themselves, without needing another class of objects to handle
that.

Now, some connections make no sense.  For example, restoring
file data to an Oracle database object would be nonsensical.  So there
are a couple of simple rules to implement in code:

1. File objects connect only to other file objects.
2. All other objects connect to file objects, or themselves.
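
The two rules fit in a few lines of Python.  The "file:" / "db:" naming of
object kinds is an assumption made up for this sketch.

```python
def can_connect(source_kind, target_kind):
    """Return True if a reader of source_kind may write through target_kind.

    Kinds are hypothetical strings such as "file:ntfs" or "db:oracle".
    """
    if source_kind.startswith("file:"):
        # Rule 1: file objects connect only to other file objects.
        return target_kind.startswith("file:")
    # Rule 2: all other objects connect to file objects, or to themselves.
    return target_kind.startswith("file:") or target_kind == source_kind
```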

> I think an I/O block size has to be passed to the low-level stream
> creator, because it is in the best position to figure out the best way
> of spanning (or padding) at I/O block boundaries. An after-the-fact
> 'chunker' is not as good there.

Except that means that the lowlevel stream objects need to know about the
archive format, header sizes, etc.  That is not good.

I think the stream creators create exactly that, a stream of data (here's
some data, write it to the archive).  The archive object inserts whatever
headers and structures it needs to be able to validate and return that
data back to the stream object for a restore.  It also takes care of
overflowing data across block boundaries (with new structures, etc).

Think about the way we deal with a TCP socket, and pretend that the
archive object presents the same functionality to a file stream.  When
writing a TCP socket, we don't need to know the MTU of the ethernet device
it will eventually get carried over.  Other layers deal with structuring,
fragmenting, and padding.

The archive works the same way: if a file stream is too big to fit in a
single archive block, it gets fragmented.  Too small, it gets combined
with other file streams (which doesn't happen with ethernet, but oh well).
But as far as we (the low-level file stream thingy) are concerned, we have
a very simple send/recv type interface.
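
A rough Python sketch of the archive side of that interface is below.  The
4-byte record header (stream ID plus payload length), the use of a
zero-length record to mark padding, and all of the names are assumptions
for illustration only; the real format will need sequence numbers,
checksums, and so on.

```python
import struct

RECORD_HEADER = struct.Struct(">HH")   # hypothetical: stream ID, payload length

class ArchiveWriter:
    """Pack data from many streams into fixed-size archive blocks."""

    def __init__(self, block_size):
        if block_size <= RECORD_HEADER.size:
            raise ValueError("block_size must exceed the record header size")
        self.block_size = block_size
        self.current = bytearray()
        self.blocks = []   # finished blocks; a real writer would hand them to the device

    def send(self, stream_id, data):
        """Accept any amount of data, fragmenting it across blocks as needed."""
        view = memoryview(data)
        while len(view) > 0:
            room = self.block_size - len(self.current) - RECORD_HEADER.size
            if room <= 0:
                self._finish_block()
                continue
            piece = view[:room]
            self.current += RECORD_HEADER.pack(stream_id, len(piece))
            self.current += piece
            view = view[len(piece):]

    def _finish_block(self):
        padding = self.block_size - len(self.current)
        self.blocks.append(bytes(self.current) + b"\0" * padding)
        self.current = bytearray()

    def close(self):
        if self.current:
            self._finish_block()

def records(block):
    """Yield (stream_id, payload) for each record in one block."""
    offset = 0
    while offset + RECORD_HEADER.size <= len(block):
        stream_id, length = RECORD_HEADER.unpack_from(block, offset)
        if length == 0:                # zero length marks padding
            break
        offset += RECORD_HEADER.size
        yield stream_id, block[offset:offset + length]
        offset += length
```

The caller of `send` never sees the block size, which is the same position
a TCP writer is in with respect to the MTU.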

We could also reverse the relationship between the lowlevel streams and
the archiver, and put the archiver in control, with the streams providing
the interface to the file/database:

<file/db stream to archiver> I'm a new stream to be archived
<archiver to stream> Put 29768 bytes of data at this address
<stream to archiver> I put 29768 bytes of data at that address
....
<stream to archiver> I have no more data (EOF).
....

This is essentially what I implemented in the fileio module in BRU, for
dealing with all types of files (regular, sparse, compressed, etc).  This
allowed me to move the knowledge about how to read/write different files
out of the middle layers, which made them much, much simpler.
Unfortunately, the middle layer still knew everything about the archive
format, and was responsible for creating/decoding archive blocks to
provide to the archive layer.
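
The archiver-driven exchange sketched in the message list above might look
like this in Python.  This is not the BRU code; `FileStream`, `fill`, and
`archive_stream` are hypothetical names, and the source is assumed to be
any binary file-like object that supports `readinto`.

```python
import io

class FileStream:
    """Knows how to read one kind of source; knows nothing about archive blocks."""

    def __init__(self, source):
        self._source = source          # any binary file-like object

    def fill(self, buffer):
        """Put up to len(buffer) bytes into buffer; return the count, 0 at EOF."""
        return self._source.readinto(buffer) or 0

def archive_stream(stream, buffer_size):
    """Archiver side: keep handing the stream a buffer until it reports EOF."""
    buffer = bytearray(buffer_size)
    view = memoryview(buffer)
    chunks = []
    while True:
        count = stream.fill(view)      # "put N bytes of data at this address"
        if count == 0:                 # "I have no more data (EOF)"
            break
        chunks.append(bytes(view[:count]))
    return chunks
```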

The point is that having any layer try to format/manage data for another
layer is difficult to maintain, and violates the black-box encapsulation
rules.

> Can we come up with catchier names than this? I do agree we have a
> terminology problem here. 'archive streams', 'backup streams'. No?

> Tomorrow I guess I get to work on terminology and see if I can come up
> with some definitions that make sense.

Let's see (brainstorming here),

Addict - (A)ttention (D)eficit (D)isorder (I)nfli(C)ting (T)ask
         (aka. the multiplexor)
Archie - A 'stream' of data representing an object in an archive
Archive - A collection of Archies representing a single backup
          set (constrained to a single system?)
Backup - A collection of one or more archives, on one or more media
          volumes.
Vault - The collection of all Backups done by Tapioca.

Hmm, none of these have anything to do with pudding, desserts, or even
food.  Anybody else?

-- 
Richard Fish, Unix/Linux Software Engineer, rj...@fi...