Re: [Tapioca-devel] Archive format

Note: Sorry about the delay in commenting on this message. Had a slight
crisis at work (a member of the team quit and we had to go into emergency
mode to fill in for her). Things are settling down a little now so we
should be able to move forward a bit.

I am currently working on (generating code for) the network block protocol
for the plumbing.  That can go on in parallel with the other work. I have
also found a different crypto toolkit, "Crypto++", which is smaller than
OpenSSL and is native C++, so it will work better with the Plumber (which
is C++). Now all I need is a working C++ compiler. The one that comes with
Red Hat 7.1 is busted: it will not create executables and won't even
compile "hello_world.cpp". I'm waiting for the fixed one to download from
ftp.redhat.com.

On Sat, 7 Jul 2001, Richard Fish wrote:
> On Wed, 4 Jul 2001, Eric Lee Green wrote:
> > This may not work with backups to Jaz disks, DVD-RAM disks, or MO disks.
> > Given my current employer, I very much want this to work with DVD-RAM
> > disks and MO disks :-). I will experiment tomorrow and find out exactly
> > what kind of write sizes these jukeboxes will handle.  I know that the MO
> > disks do require all writes to be at least 2K, or a multiple thereof,
>
> Well, my hope is that we could provide a buffer of a particular size, and
> have the driver figure out how to transfer that to the device.  As long as

I did some experiments, and it turns out that the Linux drivers, at least,
will handle breaking up any write to the MO drive into 2K chunks. But each
write has to be a multiple of 2K.

This indicates that we are going to need to maintain some information
about block devices that's different from what we maintain about tape
devices, and bear it in mind when we are deciding what block size to
use for a particular backup, but I don't see any real problem there. The
central authority knows what devices the data is going to be saved to, and
can tailor a buffer size that's a multiple of block size for all of them.
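
A minimal sketch of how the central authority might tailor that buffer
size, assuming it already has a list of the write sizes each target device
requires. The function name and the "preferred size" parameter are made up
for illustration; nothing here is settled design.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper (not from any existing Tapioca code): returns the
// smallest buffer size that is >= `preferred` and a multiple of every
// device's required write size. All write sizes must be nonzero.
std::size_t pick_buffer_size(const std::vector<std::size_t>& write_sizes,
                             std::size_t preferred) {
    std::size_t unit = 1;
    for (std::size_t size : write_sizes) {
        unit = std::lcm(unit, size);  // std::lcm needs C++17
    }
    // Round `preferred` up to the next multiple of `unit`, at least one unit.
    std::size_t count = (preferred + unit - 1) / unit;
    if (count == 0) {
        count = 1;
    }
    return count * unit;
}
```

With a 2K MO drive and a 512-byte device, asking for 32768 bytes gives
back 32768, since that is already a multiple of both.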

> > obvious. If I have one system with 100gb of data, and one system with 5gb
> > of data, and multiplex the two, then the second stream will finish
> > probably before the end of the first volume, except the actual tape writer
> > really has no way of knowing that. It knows it hasn't seen any data from
> > that particular stream for a while, but doesn't know whether it's just a
> > case of the source for that stream being busy or something else.
>
> Well, it should get an EOF on the socket.  That will tell it the stream is
> done, at which point, it can mark it as inactive/finished/whatever.

The multiplexor gets an EOF, but the actual tape writer, downstream of it,
doesn't. It just knows it's getting blocks, and plunks them to disk or
tape or whatever.

> It's a very nice feature from an administration standpoint, because I can
> say I want to back up my entire workgroup of 30 machines, and multiplex
> them 4 at a time, and let the software figure out how to do it.  It avoids
> me having to micro-manage the backup, and ensure the backup is completed
> in the least amount of time.

I like this. It does mean that we must have the central controller assign
all archive IDs, rather than the client.

> > We'll need to think about restores, and how to specify what to be
> > restored. I'm thinking that the low level restorer thingy will just be fed
> > a sequence of block numbers, and will fetch those blocks and put them out.
> > Then something upstream will actually strip the proper file info out of
> > the stream. This requires a simple SQL statement to grab the block numbers
> > for the files and output them. Hmm, that tends to indicate that we want a
> > 'last block' field in the database too, for indicating the block that
> > contains the end of the file. (and perhaps can indicate a range of blocks
> > to the restorer thingy).
>
> Hmm, this would indicate that we can't do anything less than a full
> restore without a catalog, even using low-level, command-line accessible,
> utilities.  I can accept that.

Naw. If you feed a block range of -1:-1 to the restorer, it just dumps the
whole volume set to the output. Upstream, you can use a filter that
filters it by filename or whatever criteria you want, but the low level
tape reader knows nothing about files. It can't know anything about files,
because files depend upon knowing the details of archive streams. All it
knows about is blocks. But another widget upstream can strip out the file
data that we're interested in.
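
Roughly what I have in mind for the low level reader, as a sketch only. The
volume is faked here as an in-memory vector of blocks, and the names
(BlockRange, restore_blocks) are placeholders I made up. The only thing
taken from the description above is that -1:-1 means "dump everything".

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

using Block = std::vector<unsigned char>;

struct BlockRange {
    std::int64_t first;
    std::int64_t last;  // inclusive
};

// Hypothetical low-level restorer: it knows only about numbered blocks and
// hands each one to `emit`. It never looks inside a block.
void restore_blocks(const std::vector<Block>& volume, BlockRange range,
                    const std::function<void(const Block&)>& emit) {
    const std::int64_t count = static_cast<std::int64_t>(volume.size());
    std::int64_t first = range.first;
    std::int64_t last = range.last;
    if (first == -1 && last == -1) {  // -1:-1 means "the whole volume set"
        first = 0;
        last = count - 1;
    }
    if (first < 0) {
        first = 0;
    }
    for (std::int64_t i = first; i <= last && i < count; ++i) {
        emit(volume[static_cast<std::size_t>(i)]);
    }
}
```

Whatever receives the blocks from `emit` is free to filter by filename or
anything else; the reader itself stays ignorant of files.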

This works pretty much like 'tar' already works. If you restore a tape
with 'tar' and just want file '/foo/bar', it'll read through the entire
darned tape looking for /foo/bar, even after it's already restored
/foo/bar.

Obviously we want to use a catalog and only dump the blocks of interest,
but I really would prefer that the restore scanner thingy not have to know
anything about filesystems. Paths and path extensions vary greatly
depending upon operating system, and some things don't even really have a
path (like an Oracle database dump stream using, hmm, what was the name of
that standard protocol for getting a dump stream from databases?).

> The alternative is to have the lowlevel archive/backup/whatever scanner
> have the ability to also look for paths and path extensions.  Essentially,
> to treat the archive/backup/whatever as a filesystem tree.

I'd prefer that the tape/disk/punch card/etc. low level drivers not need
to know anything about files, and that this all be decided upstream. They
know about archive streams -- that should be easy enough, we've already
specified that any given i/o block contains data only from a single
archive stream, we can put that ID at the top of the i/o block and filter
on it -- but because we want to be able to restore any kind of data, not
just filesystem tree data, I'd prefer that we *not* embed any kind of
filesystem logic. For example, I can foresee an LDAP data sourcer that
would produce a stream of LDAP-formatted data. There's no reasonable way
to treat that as a filesystem dump or restore, and I wouldn't even try.
Let some specific LDAP-knowledgeable widget handle knowing what an LDAP key
looks like and how to set LDAP variables in the destination LDAP
directory.
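
To make the "ID at the top of the i/o block" idea concrete, here is a small
sketch. The 4-byte big-endian ID field is purely my assumption for the
example (the real header layout has not been decided), and the function
names are invented.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using Block = std::vector<unsigned char>;

// Assumption for this sketch: the first 4 bytes of every i/o block hold
// the archive stream ID, most significant byte first.
constexpr std::size_t kIdBytes = 4;

// Builds a block by putting the stream ID in front of the payload.
Block tag_block(std::uint32_t stream_id, const Block& payload) {
    Block block;
    block.reserve(kIdBytes + payload.size());
    for (int shift = 24; shift >= 0; shift -= 8) {
        block.push_back(static_cast<unsigned char>((stream_id >> shift) & 0xFF));
    }
    block.insert(block.end(), payload.begin(), payload.end());
    return block;
}

// Reads the stream ID back. The block must be at least kIdBytes long.
std::uint32_t stream_id_of(const Block& block) {
    std::uint32_t id = 0;
    for (std::size_t i = 0; i < kIdBytes; ++i) {
        id = (id << 8) | block[i];
    }
    return id;
}

// Keeps only the blocks of one stream, without parsing their contents.
std::vector<Block> filter_stream(const std::vector<Block>& blocks,
                                 std::uint32_t wanted) {
    std::vector<Block> kept;
    for (const Block& block : blocks) {
        if (block.size() >= kIdBytes && stream_id_of(block) == wanted) {
            kept.push_back(block);
        }
    }
    return kept;
}
```

The filter never needs to know whether the payload is filesystem data,
LDAP data, or anything else.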

> If we eliminate this concept, then we can also eliminate certain other
> unstated requirements that were in my head, like that all entries in a
> particular directory be contiguous within a single archive stream.  I.e,
> that /usr/bin comes after /usr and before anything else not in the /usr
> hierarchy.

Correct. We want to keep the actual low-level i/o engine as simple and
stupid as possible, because I envision that we will actually have dozens
of these guys eventually -- e.g., one that will back up to CD-R's, one
that will back up to DVD-RAM disks, one that will back up to a sequence of
2GB files, etc. This also indicates that we don't want to think in terms
of "block numbers" -- we want to think in terms of "location identifiers".
For example, for the sequence of 2GB files case, the actual location of a
file could be "/data/storage/345:32768", meaning that the data we want is
at location 32768 in file /data/storage/345.
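
A sketch of parsing such a location identifier for the sequence-of-files
case. The "container:offset" split on the last colon is my reading of the
example above; other media would want their own format, and the names
Location and parse_location are invented.

```cpp
#include <cstdint>
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical parsed form of a "file:offset" location identifier.
struct Location {
    std::string container;  // e.g. "/data/storage/345"
    std::uint64_t offset;   // e.g. 32768
};

// Splits on the LAST colon so that container names may contain colons.
// Returns std::nullopt if the text is not of the form "name:digits".
std::optional<Location> parse_location(const std::string& text) {
    const std::string::size_type colon = text.rfind(':');
    if (colon == std::string::npos || colon == 0 || colon + 1 == text.size()) {
        return std::nullopt;
    }
    const std::string digits = text.substr(colon + 1);
    for (char c : digits) {
        if (c < '0' || c > '9') {
            return std::nullopt;
        }
    }
    try {
        return Location{text.substr(0, colon),
                        static_cast<std::uint64_t>(std::stoull(digits))};
    } catch (const std::out_of_range&) {
        return std::nullopt;  // offset too large to represent
    }
}
```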

> This could lead to some other interesting features, like multi-threading
> backups on the agent and processing different filesystems in parallel.
> Heck, we could even consider breaking a single backup-object (file,
> database, etc) into segments that could be interspersed with other
> segments within the archive stream!  Although, I think that is complexity
> we can avoid at this point.

See my sequence of 2GB files case :-). I do agree that this would allow
some neat stuff like the multi-threading backups on the agent! But we'll
leave that for future versions :-).

> As far as the database goes though, I think we primarily want to store the
> starting block of a new backup object (file, database, etc), rather than
> all of the blocks or range of blocks that contain it.  I think if we try
> to describe all of the blocks that make it up, we will essentially end up
> indexing every single archive block in the database, and that is a *lot*
> of data.

We can't think in terms of blocks. We must think in terms of location.
As for the describing all blocks, that's actually fairly easy. You just
have two locations: the location containing the beginning of the file, and
the location containing the ending of the file. When you go to the
restore, you feed that range to the restorer, after doing a 'sort' and
'uniq' on the whole mess. I think that the ending block is a pretty easy
one for the 'processor' widget for a particular type of archive to figure
out.
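
The 'sort' and 'uniq' step could look something like this sketch. Locations
are simplified to plain integers here, which is an assumption (real
location identifiers may not be simple numbers). I have also added merging
of overlapping or adjacent ranges, which goes a bit beyond plain
sort/uniq but saves the restorer from reading the same block twice.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Inclusive [first, last] range of locations for one file.
using Range = std::pair<std::uint64_t, std::uint64_t>;

// Hypothetical restore planner: sorts the ranges, drops exact duplicates,
// then merges ranges that overlap or touch.
std::vector<Range> plan_restore(std::vector<Range> ranges) {
    std::sort(ranges.begin(), ranges.end());
    ranges.erase(std::unique(ranges.begin(), ranges.end()), ranges.end());

    std::vector<Range> merged;
    for (const Range& r : ranges) {
        if (!merged.empty() &&
            (r.first <= merged.back().second ||
             r.first - merged.back().second == 1)) {
            merged.back().second = std::max(merged.back().second, r.second);
        } else {
            merged.push_back(r);
        }
    }
    return merged;
}
```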

Here's what happens. A processor widget gets a block on its input. It
records what start-of-files are in that block, and what end-of-files are
in that block. It then writes the block. It eventually gets a location
back for that block. It looks at all end-of-files for that block, goes and
fetches the start-of-file data for each of them (hash
tables are great, eh!), and writes out the database record. Quite
simple. You forget that C++ has standard vector and hash table classes
as part of the STL! (Of course we have to remember to zap the data out of
the hash table after we write it!). I'm starting to feel a *LOT* better
about using C++ than I was feeling a few weeks ago. Having these kinds of
classes sitting there for use means we can think of doing things like
this, like we'd do them in Python, without worrying about all the code
we'd have to write to do it.
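
Here is a rough sketch of that bookkeeping. It uses std::unordered_map,
the hash table from the modern standard library, and it collapses "see the
block" and "get the location back" into one call for brevity. The class
and field names are invented, and the "database" is just a vector.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical catalog row: where a backup object starts and ends.
struct CatalogRecord {
    std::string name;
    std::uint64_t start_location;
    std::uint64_t end_location;
};

class Processor {
public:
    // Called once the writer has reported the location of a block.
    // `starts` / `ends` name the objects that begin / end in that block.
    void block_written(std::uint64_t location,
                       const std::vector<std::string>& starts,
                       const std::vector<std::string>& ends) {
        for (const std::string& name : starts) {
            pending_[name] = location;
        }
        for (const std::string& name : ends) {
            const auto it = pending_.find(name);
            if (it == pending_.end()) {
                continue;  // end without a start: ignored in this sketch
            }
            records_.push_back({name, it->second, location});
            pending_.erase(it);  // zap it once the record is written
        }
    }

    const std::vector<CatalogRecord>& records() const { return records_; }
    std::size_t pending() const { return pending_.size(); }

private:
    std::unordered_map<std::string, std::uint64_t> pending_;
    std::vector<CatalogRecord> records_;  // stand-in for the SQL database
};
```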

> This does mean that the multiplexing writer thingy needs to be fair in how
> it chooses what archive stream to service next, so that we don't have 10GB
> of data to read through to find the next block, unless the system it came
> from had some kind of problem.

I agree. It should attempt to do a round-robin service on all of the
available inputs.
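
A sketch of the round-robin choice, assuming the multiplexor can ask each
input whether it currently has a block ready (represented here as a
vector of flags). The class name is made up.

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical round-robin picker for the multiplexor. `ready[i]` says
// whether input i has a block waiting. The scan starts just after the
// input served last time, so no ready input is starved.
class RoundRobin {
public:
    std::optional<std::size_t> pick(const std::vector<bool>& ready) {
        const std::size_t n = ready.size();
        for (std::size_t step = 0; step < n; ++step) {
            const std::size_t i = (next_ + step) % n;
            if (ready[i]) {
                next_ = (i + 1) % n;
                return i;
            }
        }
        return std::nullopt;  // nothing ready right now
    }

private:
    std::size_t next_ = 0;
};
```

An input that is merely busy gets skipped this turn and picked up again
as soon as it has data.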

> There could be some efficiency gained on a restore from being able to use
> QFA seeks to skip ahead, but I don't think it's worth the extra catalog
> storage.

Again, I agree. The range is enough. The fact that we had to read 30
blocks to restore 5 blocks of data is trivial for most cases.

> > The platform and type of stream is indicated by the writer object. An NT
> > writer object will have a different object ID than a Linux EXT2 writer
> > object or a MacOS writer object, because they have different data formats.
> > Writer objects may have, e.g., a pathname translator function, associated
> > with them. We don't need to put more indicator ID's into the headers,
> > because these indicator ID's are associated with the object that is
> > creating this stream. Sort of, if I have a code 0x5324 that is a "Oracle
> > NT Database Dump Object", this object may have a pathname code of 0x53
> > associated with it ("SQL Database/Table Names"), an originating OS code of
> > 0x05 (Win32), etc... but all we need in the records on tape is the object
> > id of the creating object.
> >
> > This means we have a central repository of object information, but that's
> > easy enough to accomplish (we're going to have a SQL database, after
> > all!).
>
> I think I got it....what if we could interconnect the writer objects for a
> restore.  Let me explain: during a backup, the process is simple and
> obvious (from the data flow perspective):
>
> NTFS file --> NTFS_File_Obj --> ArchiveObj -> ....
> EXT2FS file --> EXT2_File_Obj --> ArchiveObj -> ....
> Oracle DB --> Oracle_DB_Obj ....
>
> Let's think about those File and DB objects for a second -- they need to
> read object specific data (ACLs, extended bits, file attributes, datafile
> locations, etc), format it into a form they can read later, and read and
> format the data from the file or database.  All of that gets sent off to
> the archiving process.
>
> For a restore to the original location, the process looks like:
>
> ... --> ArchiveObj -{NTFS_File_Stream}-> NTFS_File_Obj --> NTFS file
> ...
>
> In the above, the NTFS_File_Obj gets (via push or pull) the same data
> stream(s) it originally stored via the archive object (we hope!).  Since
> it wrote that data, it knows how to decode it to get the filename, file
> attributes, and the data section.  So, it has no problem recreating the
> original file, with the original attributes, etc.
>
> The problem is how to handle the foreign data case:
>
> ... -> ArchiveObj -{NTFS_File_Stream} -> EXT2_File_Obj -> EXT2FS file
> or
> ... -> ArchiveObj -{NTFS_File_Stream} -> NTFS_File_Obj -> EXT2FS file
>
> One way is as you suggest, to have a translator object to convert the
> NTFS_File_Stream into an EXT2_File_Stream, that can be read by the
> EXT2_File_Obj.  But that involves a set of objects that have to be
> maintained in sync with the writer objects.  Instead, what if we leverage
> the knowledge and code already in the NTFS_File_Obj, and tell it to write
> its data via an EXT2_File_Obj.  It looks like this:
>
> ... -> ArchiveObj -{NTFS_File_Stream} -> NTFS_File_Obj
>         --> EXT2_File_Obj -> EXT2FS file

You still must have a converter object here to, e.g., map from an NTFS
filename to an EXT2FS filename. Whether the conversion is done at a low
level via calling a method for an object or via an external translator
component, it's going to have to be done.  Note that the converter object
can most certainly use the C++ classes used by the writer object to handle
most of that. I certainly wasn't suggesting that we had to maintain two
totally separate sets of classes, just that there was a step involved in
translating from one format to another. We might think further on how to
make this translation process easier, but I do not think we need to worry
about it for the moment. For the moment it will probably suffice to say
"you can restore NT files only to NT systems, and LDAP directories only to
LDAP directories". The beauty of a component architecture is that you can
always add more components into the stream later. We want something that
works in a relatively short time. I'm tired of tarring up my notebook to
my desktop and then backing up my desktop. I want a real network backup
again, and no, I'm not going to install Arkeia to do it. I want something
Open Source!

> the important thing is the 'translation' is done by the writer/reader
> objects themselves, without needing another class of objects to handle
> that.

No problem with that, just noting that we must have format translation
somewhere if we are going to do cross-platform restores, whether it is as
part of writer objects or as separate components that sit in the stream.
Separate components (built with those objects, sitting in a pipeline
between source and destination) are easier to hack into the pipeline
later, but impose a performance penalty. Still, how often are we going to
do cross-platform restores, and is the performance penalty going to be
severe enough that we really care?

> Now, there are some connections that make no sense.  For example, restoring
> file data to an oracle database object would be nonsensical.  So there
> are a couple simple rules to implement in code:
>
> 1. File objects connect only to other file objects.
> 2. All other objects connect to file objects, or themselves.

All I would suggest is that we keep things simple wherever possible.
"Release early, release often" is the goal. If we get working code on the
site, then we can possibly get other contributors to do things like, e.g.,
hack on translation objects. Remember, this isn't you and me and Randy
sitting in our cubicles anymore, this is an Open Source project, and the
more we can parallelize the development, the more hackers we can attract
to working on it and have them make meaningful contributions. But nobody's
going to participate until we can do actual network backups and restores.
Thus I suggest that we defer talk on format translations, and concentrate
on the low level block format.

> > I think an I/O block size has to be passed to the low-level stream
> > creator, because it is in the best position to figure out the best way
> > of spanning (or padding) at I/O block boundaries. An after-the-fact
> > 'chunker' is not as good there.
>
> Except that means that the lowlevel stream objects need to know about the
> archive format, header sizes, etc.  That is not good.

No it doesn't. It needs to know that x bytes are reserved for a header,
but then it just fills in the rest of the buffer with its data and passes
it on to the next component, and so on down the line, with each component
filling in its own piece of the puzzle as desired. This does mean that it
needs to "know" that it owns, say, bytes 256 to (n-12) of a 32768-byte
buffer block, but that's not too difficult.
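
In code, "owning" a slice of the buffer is about this much work. The 256
and 12 come from the example numbers above and are not a decided format;
I am reading "bytes 256 to (n-12)" as the half-open region [256, n-12).
The function name is invented.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Example numbers only: bytes reserved for whoever fills in the header
// and trailer further down the line.
constexpr std::size_t kHeaderBytes = 256;
constexpr std::size_t kTrailerBytes = 12;

// Hypothetical stream creator step: copies as much of `data` as fits into
// the payload region of `block` and returns the number of bytes copied.
// The header and trailer regions are left untouched.
std::size_t fill_payload(std::vector<unsigned char>& block,
                         const std::vector<unsigned char>& data) {
    if (block.size() < kHeaderBytes + kTrailerBytes) {
        return 0;  // no room for any payload
    }
    const std::size_t room = block.size() - kHeaderBytes - kTrailerBytes;
    const std::size_t count = std::min(room, data.size());
    std::copy(data.begin(),
              data.begin() + static_cast<std::ptrdiff_t>(count),
              block.begin() + static_cast<std::ptrdiff_t>(kHeaderBytes));
    return count;
}
```

For a 32768-byte block that leaves 32500 bytes of payload per block.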

> I think the stream creators create exactly that, a stream of data (here's
> some data, write it to the archive).

By "archive" do you mean the sum total of data being written to tape? Or
do you mean a single backup stream?

For simplicity's sake, it makes sense to organize each backup stream into
tape-block-sized (or some kind of io-block-sized) chunks, each of which is
tagged with what backup stream wrote it. This way the low level tape
reader can read back stuff based on what backup stream it came from,
without knowing anything about what's actually in those io-block-sized
chunks.

I think I see where you're going. You're saying that the stream processor
-- the thingy that takes the raw stream of objects from the stream creator
and does any processing necessary and figures out what data needs to be
logged by the database to, e.g., indicate start and end of files -- should
be the one that actually chunks it into IO-buffer-sized blocks, because that
makes it easier for it to decide where files start and stop (and span
things across block boundaries) without having to re-parse the
IO-buffer-sized block back into its component raw stream blocks. Okay.
That's fair. I think that'll work. It'll also make the agent side code
smaller, which makes it more feasible that we could possibly create rescue
disks for this thing, since the stream processor

This way filesystem readers can handle the spanning in a way that makes
sense. They know, e.g., that if they have a 'filename' block that's 105
characters long, it makes no sense to span that across a buffer boundary
that's 30 characters away, so they can create a 'padding' block of 30
bytes, and continue on to the next block.
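
The padding decision itself is tiny. A sketch, with invented names, of what
a reader would do for a sub-block it does not want to split:

```cpp
#include <cstddef>

// Result of placing one sub-block that should not be split.
struct Placement {
    std::size_t padding;    // bytes of padding to write first
    bool starts_new_block;  // true if the sub-block goes in the next buffer
};

// Hypothetical helper: `length` is the size of the sub-block (e.g. a
// filename), `remaining` is the space left in the current buffer.
// A sub-block bigger than a whole buffer would still have to be spanned;
// that case is not handled here.
Placement place_unsplittable(std::size_t length, std::size_t remaining) {
    if (length <= remaining) {
        return {0, false};  // fits as-is, no padding needed
    }
    return {remaining, true};  // pad out the buffer, continue in the next
}
```

For the example above, place_unsplittable(105, 30) asks for 30 bytes of
padding and a fresh block.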

There is no such thing as an 'archive object', by the way, just an
'archive block' object which the various writers know how to read and
write off of tape or disk or whatever media we're backing up to. I think
we need to get away from the whole notion of an "archive". What we have is
a "backup", which consists of one or more "backup streams", each of which
could represent any kind of data (some could be filesystem backup streams,
another might be a backup of the NT registry, etc.). Each kind of backup
stream type must have its own routines for deciding what sub-blocks within
the overall 'archive block' mean, and thus for chunking data.

> The archive object inserts whatever
> headers and structures it needs to be able to validate and return that
> data back to the stream object for a restore.  It also takes care of
> overflowing data across block boundaries (with new structures, etc).

Again, I state that we should ban the word 'archive' in favor of the term
'backup' and 'backup stream', where a 'backup' consists of blocks from
multiple 'backup streams'. That will probably save some confusion.

> The point is, that having any layer try to format/manage data for another
> layer is difficult to maintain, and violates the black-box encapsulation
> rules.

True, but somebody does have to know how to chunk things. We can't do
anything about that :-(. I do believe that moving the chunking out to the
processors, rather than putting it in the creators, is probably a good
thing because that results in smaller creators (and the creators are the
things that live out on the remote systems... the smaller we can get these
things, the more probable we can make rescue disks with them).

> > Can we come up with catchier names than this? I do agree we have a
> > terminology problem here. 'archive streams', 'backup streams'. No?
>
> > Tomorrow I guess I get to work on terminology and see if I can come up
> > with some definitions that make sense.
>
> Let's see (brainstorming here),
>
> Addict - (A)ttention (D)eficit (D)isorder (I)nfli(C)ting (T)ask
>          (aka. the multiplexor)
> Archie - A 'stream' of data representing an object in an archive
> Archive - A collection of Archies representing a single backup
>           set (constrained to a single system?)
> Backup - A collection of one or more archives, on one or more media
>           volumes.
> Vault - The collection of all Backups done by Tapioca.
>
> Hmm, none of these have anything to do with pudding, desserts, or even
> food.  Anybody else?

My head hurts. Let's move on to what a backup header (the thing at the
start of backup volumes) looks like, then what a Unix filesystem stream
looks like. Note that because we've defined this so flexibly, we can later
add, e.g., an ext2 filesystem stream thingy, and by bumping the object ID
still be able to read the archive using the previous-generation
object. So I don't think we have to be quite as careful about the actual
(non-io-related) contents of the backup blocks as we thought, as long as
we get the backup block and backup header formats fixed in stone. From
thence onwards, as long as we can pull out the object ID of the entity to
be used to restore the stream out of the backup header, we can restore
this, and if we change the format of a backup stream, we just bump our
object ID and retain a copy of the old restorer at the original object ID
so that we can continue to restore streams of that format. Yes, a pain,
and to be avoided whenever possible, but backwards compatibility cruft is
inevitable. You of all people should know that :-).
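
A sketch of the "bump the object ID, keep the old restorer" idea. The
registry class, the restorer signature (string in, string out), and the
ID values used in the usage note are all invented for illustration; the
only thing taken from the above is that the backup header gives us an
object ID and we look up the matching restorer by it.

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>

// Stand-in restorer type: takes raw stream data, returns restored data.
using Restorer = std::function<std::string(const std::string&)>;

// Hypothetical registry keyed by the object ID found in the backup header.
class RestorerRegistry {
public:
    void add(std::uint32_t object_id, Restorer restorer) {
        table_[object_id] = std::move(restorer);
    }

    // Throws std::out_of_range if no restorer was registered for the ID.
    const Restorer& lookup(std::uint32_t object_id) const {
        const auto it = table_.find(object_id);
        if (it == table_.end()) {
            throw std::out_of_range("no restorer registered for this object ID");
        }
        return it->second;
    }

private:
    std::map<std::uint32_t, Restorer> table_;
};
```

If the old stream format lives at, say, ID 0x0100 and the changed format is
registered at 0x0101, both stay in the table and a backup written in the
old format still finds its original restorer.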

Eric Lee Green                             mailto:er...@ba...
               BadTux: http://www.badtux.org