Re: [Tapioca-devel] Archive format

On Wed, 4 Jul 2001, Richard Fish wrote:
> On Mon, 2 Jul 2001, Eric Lee Green wrote:
>
> > I've put the preliminary architecture documentation up at
> > http://tapioca.sourceforge.net . That should give you a good idea of how
> > this is structured. Aside from the 'pudding', the rest of the
> > architecture should be somewhat familiar, except for the parts inflicted
> > by Java (like the tape server being multi-threaded rather than
> > multi-process).
>
> Looks good.  One question I have though is the implementation of the
> tapicom protocol.

Right. I deliberately did not go into that. We're still quite a ways away
from caring about tapicom -- I want to get some network backups done
without having to worry about chunking things into tape-sized bits myself
to work around the limits of 'tar'! My home network needs something a
little higher power than 'tar', sigh...

> I have just updated my java reference library, and see that java has a
> built in RPC mechanism called RMI (Remote Method Invocation).  It's
> simpler than CORBA or COM, although it only works between java objects.

Yes. This is definitely worth investigating for Tapicom, since the client
and the server will both be Java (as vs. the pudding, which is C or C++).

> Maybe we should consider using RMI instead?  The advantage here to me is
> that we don't have to write a lot of code to format, write, read, and
> parse data for every operation we perform, while also trying to figure out
> the error detection, reporting, and handling mechanisms.  If we want to
> add a user to the database, we just call something like:
>
> result = tapioca.exports.NewUser(authObj, userObj)
>
> If the call fails, it raises an exception that can get handled by the
> client.
>
> Anyway, I think it's worth considering, since it makes the data and error
> handling much easier.

I think it makes everything easier except one thing: progress streams.
We will have to see whether we can get a stream object out of an RMI
call. The other alternative is to make progress streams client-driven
rather than server-driven (i.e., clients robotically call a
'tapioca.exports.GetCurrStatus(currBackup)' to get the current status
info). I'm kind of reluctant to do that, because if the server is
bogged down, this will bog it down even worse (as vs. the case of the
updates simply getting slower, in a server-driven process).
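
To make the server-driven option concrete, here is a rough sketch of what
a progress callback could look like with RMI-style interfaces. All of the
names (Tapicom, ProgressListener, startBackup, the "secret" token) are made
up for illustration, and the objects are wired together in-process so the
sketch runs without a registry. A real version would have to export both
sides (for example via UnicastRemoteObject), which is exactly the part we
still need to check.

```java
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.util.ArrayList;
import java.util.List;

final class TapicomSketch {
    /** Hypothetical callback the client hands to the server (server-driven progress). */
    interface ProgressListener extends Remote {
        void progress(String backupId, int percentDone) throws RemoteException;
    }

    /** Hypothetical remote interface; none of these names exist in Tapioca yet. */
    interface Tapicom extends Remote {
        int newUser(String authToken, String userName) throws RemoteException;
        void startBackup(String backupId, ProgressListener listener) throws RemoteException;
    }

    /** In-process stand-in for the server, so the sketch runs without a registry. */
    static final class LocalTapicom implements Tapicom {
        private final List<String> users = new ArrayList<>();

        @Override
        public int newUser(String authToken, String userName) throws RemoteException {
            if (!"secret".equals(authToken)) {
                // Failures surface as exceptions that the client can handle.
                throw new RemoteException("authentication failed");
            }
            users.add(userName);
            return users.size();
        }

        @Override
        public void startBackup(String backupId, ProgressListener listener) throws RemoteException {
            // The server pushes updates at its own pace; the client never polls.
            for (int pct = 0; pct <= 100; pct += 25) {
                listener.progress(backupId, pct);
            }
        }
    }

    /** Runs a fake backup and returns the progress values the listener received. */
    static List<Integer> demo() {
        Tapicom server = new LocalTapicom();
        List<Integer> seen = new ArrayList<>();
        try {
            server.newUser("secret", "eric");
            server.startBackup("backup-1", (id, pct) -> seen.add(pct));
        } catch (RemoteException e) {
            throw new IllegalStateException(e);
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The point of the callback shape is that a bogged-down server just sends
fewer updates, instead of having to answer a steady stream of status polls.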

> > Now it's time to decide on the archive format. We need to get this right
> > because changing it after we've started making backups will be a major
> > pain in the %$#@@#.
>
> Agreed!!!!
>
> > Some criteria:
> >
> > 1. Must be able to multiplex (and demultiplex) multiple streams into the
> > same tape file. This indicates that the input needs to be blocked in a
> > structured way, rather than just being a plain old stream of data, and
> > that each block needs to be tagged with a stream ID.
>
> Yep.  Some proposed requirements:
>
> 1. Archive volumes are written in fixed blocks of a configurable size.
> Minimum IO block size is 16k.

This may not work with backups to Jaz disks, DVD-RAM disks, or MO disks.
Given my current employer, I very much want this to work with DVD-RAM
disks and MO disks :-). I will experiment tomorrow and find out exactly
what kind of write sizes these jukeboxes will handle.  I know that the MO
disks do require all writes to be at least 2K, or a multiple thereof,
because I (ab)used 'dd' to 'erase' some disks last week when I was testing
the progress window that I wrote for a web app (progress windows are
*EASY* with Tomcat and Java Server Pages, because you can store the backup
thread object itself into the session object -- which can update its own
progress info, which the refresh will catch when the window does its
auto-refresh thing every 5 seconds -- I bet Randy is green with envy,
considering the stuff he had to go through to handle progress windows!).
(Hmm, perhaps client-driven status displays aren't so bad after all?)
(Hmm, or maybe not...)

> 2. Each volume begins with a volume header block that describes the volume
> and all still-active archive streams being multiplexed to that archive.
> Note that this means that additional streams cannot be added to an archive
> that is in process.  However, it doesn't guarantee that this specific
> volume actually contains any data for that archive stream.  For example,

This sounds quite good. As for the 'does not guarantee', that's pretty
obvious. If I have one system with 100gb of data, and one system with 5gb
of data, and multiplex the two, then the second stream will finish
probably before the end of the first volume, except the actual tape writer
really has no way of knowing that. It knows it hasn't seen any data from
that particular stream for a while, but doesn't know whether it's just a
case of the source for that stream being busy or something else.

> we could say that this archive will consist of 8 streams, but only 3 can
> be written concurrently.  This would allow the other 5 streams to be
> written as previous streams are completed.

This would require some smarts on the part of the multiplexor, where we
feed it how many pipes we want active at startup, and have a monitoring
thread in the central authority that will monitor the deceasing of jobs
and add jobs to the pudding and tell the multiplexor (via a control
channel of some sort) "oh, by the way, you now have an additional stream
on pipe /var/lib/tapioca/tmp/tp2134.532 to multiplex into the backup."

The problem here is that if we want our volume header to include a list of
*all* streams in the volume, that seems to indicate that either a) the
central authority assigns stream ID's (thus knows what stream ID's are
going to be assigned to everything started up), or b) everything starts up
at the same time, and some are 'put to bed' for a while. I don't like the
'put to bed' thing, because sockets can time out, connections can time
out, IP masq table entries can time out, etc. Nasty.

So I guess we *could* do this. What we need to do, then, is have the
central authority assign the stream ID's.
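
A minimal sketch of that idea, assuming the central authority hands out
sequential integer IDs up front and only lets a fixed number of streams
run at once. The class and method names are hypothetical, and the control
channel to the multiplexor is reduced to a return value.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class StreamSchedulerSketch {
    private final Set<Integer> unfinished = new LinkedHashSet<>(); // goes into each volume header
    private final Deque<Integer> waiting = new ArrayDeque<>();     // assigned an ID, not yet started
    private final List<Integer> active = new ArrayList<>();        // currently being multiplexed

    /** Assigns IDs 1..totalStreams immediately and starts at most maxActive of them. */
    StreamSchedulerSketch(int totalStreams, int maxActive) {
        for (int id = 1; id <= totalStreams; id++) {
            unfinished.add(id);
            if (active.size() < maxActive) {
                active.add(id);
            } else {
                waiting.add(id);
            }
        }
    }

    /** Stream IDs a new volume header would list: everything not yet finished. */
    synchronized List<Integer> volumeHeaderIds() {
        return new ArrayList<>(unfinished);
    }

    synchronized List<Integer> activeIds() {
        return new ArrayList<>(active);
    }

    /**
     * Called by the monitoring thread when a job ends. Returns the ID of the
     * stream that should now be added to the multiplexor, or -1 if none is waiting.
     */
    synchronized int streamFinished(int streamId) {
        if (!active.remove(Integer.valueOf(streamId))) {
            throw new IllegalArgumentException("stream " + streamId + " is not active");
        }
        unfinished.remove(streamId);
        Integer next = waiting.poll();
        if (next == null) {
            return -1;
        }
        active.add(next);
        return next;
    }

    public static void main(String[] args) {
        StreamSchedulerSketch s = new StreamSchedulerSketch(8, 3);
        System.out.println("active: " + s.activeIds());
        System.out.println("next to start: " + s.streamFinished(2));
        System.out.println("volume header: " + s.volumeHeaderIds());
    }
}
```

Nothing is 'put to bed' here: a waiting stream has an ID but no open
connection until its turn comes.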

> 2. Each IO block contains data for exactly one archive stream.

Uhm, let's decide on terminology. The whole multiplexed/duplexed/whatever
stream of data is an archive stream? Or is it a backup stream? And the
individual backups, are these backup streams? Or archive streams?

We have a terminology confusion here :-). But I know what you mean here.

> 3. Each IO block contains a header that specifies its archive stream ID,
> sequence number, checksum, etc.

The actual writer probably wants a backup stream ID and backup stream
sequence number and checksum. It doesn't know anything about archive
streams, it just knows what I/O blocks look like. Some other widget
that knows about archive streams actually writes header info.
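
For discussion, here is one way the writer's view of an I/O block could
look. The field layout (4-byte stream ID, 8-byte sequence number, 4-byte
payload length, 4-byte CRC32), the 16k block size, and all the names are
assumptions of mine, not a decided format.

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

final class IoBlockSketch {
    static final int BLOCK_SIZE = 16 * 1024;       // assumed; meant to be configurable
    static final int CRC_OFFSET = 4 + 8 + 4;       // after streamId, sequence, payloadLength
    static final int HEADER_SIZE = CRC_OFFSET + 4; // plus the CRC field itself

    /** What a reader recovers from one block. */
    static final class Decoded {
        final int streamId;
        final long sequence;
        final byte[] payload;

        Decoded(int streamId, long sequence, byte[] payload) {
            this.streamId = streamId;
            this.sequence = sequence;
            this.payload = payload;
        }
    }

    /** CRC32 over the header fields (skipping the CRC slot) and the payload. */
    static int checksum(byte[] block, int payloadLength) {
        CRC32 crc = new CRC32();
        crc.update(block, 0, CRC_OFFSET);
        crc.update(block, HEADER_SIZE, payloadLength);
        return (int) crc.getValue();
    }

    /** Packs one payload into a fixed-size block; unused space stays zero. */
    static byte[] encode(int streamId, long sequence, byte[] payload) {
        if (payload.length > BLOCK_SIZE - HEADER_SIZE) {
            throw new IllegalArgumentException("payload does not fit in one block");
        }
        ByteBuffer buf = ByteBuffer.allocate(BLOCK_SIZE);
        buf.putInt(streamId).putLong(sequence).putInt(payload.length);
        buf.putInt(0);    // CRC slot, filled in below
        buf.put(payload);
        byte[] block = buf.array();
        buf.putInt(CRC_OFFSET, checksum(block, payload.length));
        return block;
    }

    /** Unpacks a block, rejecting it if the length or checksum looks wrong. */
    static Decoded decode(byte[] block) {
        if (block.length < HEADER_SIZE) {
            throw new IllegalArgumentException("block shorter than header");
        }
        ByteBuffer buf = ByteBuffer.wrap(block);
        int streamId = buf.getInt();
        long sequence = buf.getLong();
        int length = buf.getInt();
        int storedCrc = buf.getInt();
        if (length < 0 || length > block.length - HEADER_SIZE) {
            throw new IllegalStateException("corrupt block: bad payload length");
        }
        if (checksum(block, length) != storedCrc) {
            throw new IllegalStateException("corrupt block: checksum mismatch");
        }
        byte[] payload = new byte[length];
        buf.get(payload);
        return new Decoded(streamId, sequence, payload);
    }
}
```

Note that the writer only deals with stream ID, sequence number, and
checksum here; whatever sits above it decides what the payload means.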

We'll need to think about restores, and how to specify what to be
restored. I'm thinking that the low level restorer thingy will just be fed
a sequence of block numbers, and will fetch those blocks and put them out.
Then something upstream will actually strip the proper file info out of
the stream. This requires a simple SQL statement to grab the block numbers
for the files and output them. Hmm, that tends to indicate that we want a
'last block' field in the database too, for indicating the block that
contains the end of the file. (and perhaps can indicate a range of blocks
to the restorer thingy).
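
As a sketch of the 'range of blocks' idea, assuming the catalog stores a
first and last block number per file: the upstream piece could merge the
per-file extents into a sorted list of ranges before feeding them to the
low-level restorer. BlockRange and plan() are hypothetical names, and the
SQL lookup is replaced by an in-memory list.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class RestorePlanSketch {
    /** Inclusive range of block numbers, as the catalog might record per file. */
    static final class BlockRange {
        final long first;
        final long last;

        BlockRange(long first, long last) {
            if (first > last) {
                throw new IllegalArgumentException("first block after last block");
            }
            this.first = first;
            this.last = last;
        }

        @Override
        public String toString() {
            return first + "-" + last;
        }
    }

    /** Sorts the per-file extents and merges overlapping or adjacent ones. */
    static List<BlockRange> plan(List<BlockRange> fileExtents) {
        List<BlockRange> sorted = new ArrayList<>(fileExtents);
        sorted.sort(Comparator.comparingLong((BlockRange r) -> r.first));
        List<BlockRange> merged = new ArrayList<>();
        for (BlockRange r : sorted) {
            int lastIndex = merged.size() - 1;
            if (lastIndex >= 0 && r.first <= merged.get(lastIndex).last + 1) {
                BlockRange prev = merged.get(lastIndex);
                merged.set(lastIndex, new BlockRange(prev.first, Math.max(prev.last, r.last)));
            } else {
                merged.add(r);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<BlockRange> extents = new ArrayList<>();
        extents.add(new BlockRange(10, 12));
        extents.add(new BlockRange(40, 41));
        extents.add(new BlockRange(13, 20));
        System.out.println(plan(extents));
    }
}
```

The restorer then reads each range in one forward pass and never has to
know which files the blocks belong to.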

I'm tired of calling the thingies in the pudding thingies. "component" is
so, uhm, generic. What do we call them?

> 4. The payload section of each IO block starts with a byte that indicates
> a structure type: data stream header, data stream resource, data stream
> data, etc.  This is followed by the structure of the indicated type, and
> any data it carries.

Should I/O blocks also carry an 'originator type' ID? Or is it enough to
stick this in the volume header? In any event, payload has to be marked
somehow with what kind of thingy created the payload, so that we can
restore the payload using the correct kind of thingy.

(Spoons. Tapioca. Pudding. Spoons stir the pudding, no? Sorry, just
getting silly here :-)

> 5. Within each IO block the structures are variable length, ensuring
> efficient space usage and performance.
>
> The downside to variable length structures and data is handling
> corruption.  With a fixed-length, BRU-style format, if a file data block
> was corrupted, we could (and did!) just write the data section out to the
> file, and note the error.  On a file header block, we could have just
> invented a file name, and wrote out the appropriate data section.
>
> But with variable length records, we can't even trust that the encoded
> length is correct, and we don't know for certain if there is another
> stream header structure in the block, and if so, where. We either have to
> write very complex (and error-prone) code to try and make sense of the
> corrupted data, or throw the whole block away and move on.

I'm not sure that there's much we can do about corruption in today's world
except report it, write the block somewhere in case a human needs to see
it, and continue. If we've trashed the archive somehow (via a defective
program or whatever), I certainly don't want to restore corrupted data!
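
The 'report it, save it, continue' policy could be as simple as the sketch
below. It assumes each block's expected CRC32 is available to the reader;
the quarantine is just an in-memory list of block indexes here, where a
real restorer would dump the raw block to a file for a human to look at.
All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.zip.CRC32;

final class CorruptionPolicySketch {
    /** Outcome of a restore pass: payloads that verified, and indexes of blocks that did not. */
    static final class Result {
        final List<byte[]> good = new ArrayList<>();
        final List<Integer> quarantined = new ArrayList<>();
    }

    static long crcOf(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    /** Verifies every block; bad ones are reported and set aside, never restored. */
    static Result restore(List<byte[]> payloads, List<Long> expectedCrcs) {
        Result result = new Result();
        for (int i = 0; i < payloads.size(); i++) {
            byte[] payload = payloads.get(i);
            if (crcOf(payload) == expectedCrcs.get(i).longValue()) {
                result.good.add(payload);
            } else {
                System.err.println("block " + i + ": checksum mismatch, quarantined");
                result.quarantined.add(i);
            }
        }
        return result;
    }
}
```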

> > 3. The tape format should not require doing a MT_TELL for every bloody
> > block written to tape, only for blocks that actually need it (i.e, blocks
> > that contain the beginning of a piece of data logged into the database).
> > This tends to indicate that blocks need tagging with a "type" field.
>
> Hmm, how do we handle the catalog rebuild (reading the archive) case?  We
> don't know *if* we need to do an MT_TELL or not at the time we *need* to
> do the MT_TELL, before reading the block.

For rebuilding the catalog the MT_TELL thing is okay. It is buggy writers
that are a pain. We need some way to work around idiotic firmware or
idiotic device drivers that flush a tape drive's buffers upon write
(thus limiting performance to about 8K/sec!).

Now that I think about it, you're right, the MT_TELL is not a big deal on
restores or verifies. Writes are where it hinders performance. If we can
come up with some (optional) scheme to limit # of times we need to call
it, that'd be a big performance boost with some drives.
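
One possible shape for such a scheme, sketched under the assumption that
the first byte of each block is a type tag and that only 'stream start'
blocks need a position logged. TapeDevice is a made-up abstraction whose
tell() stands in for the MT_TELL query; the fake drive just counts calls.

```java
import java.util.ArrayList;
import java.util.List;

final class SparseTellSketch {
    /** Hypothetical tape abstraction; tell() stands in for the MT_TELL position query. */
    interface TapeDevice {
        void write(byte[] block);
        long tell();
    }

    /** Fake drive that counts how often the position query is issued. */
    static final class CountingTape implements TapeDevice {
        long position = 0;
        int tellCalls = 0;

        @Override
        public void write(byte[] block) {
            position++;
        }

        @Override
        public long tell() {
            tellCalls++;
            return position;
        }
    }

    static final byte TYPE_DATA = 0;          // plain data, no catalog entry
    static final byte TYPE_STREAM_START = 1;  // beginning of something logged in the database

    /** Writes all blocks and returns the positions recorded for the catalog. */
    static List<Long> writeAll(TapeDevice tape, List<byte[]> blocks) {
        List<Long> catalogPositions = new ArrayList<>();
        for (byte[] block : blocks) {
            if (block[0] == TYPE_STREAM_START) {
                // Only pay for the position query when the catalog needs it.
                catalogPositions.add(tape.tell());
            }
            tape.write(block);
        }
        return catalogPositions;
    }
}
```

With five blocks of which two start a stream, the drive sees two position
queries instead of five.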

> > 4. The format should be able to handle two things other than raw
> >   data blocks:
> >    a) producing location information suitable for logging into the
> >     central authority's location database for use in future restores,
> >     and
> >    b) holding any OS-specific data needed to fully restore the file.
>
> Right.  And of course, we also want to be able to restore the data portion
> of a stream (file or otherwise) on a different OS.

That will require a filter of some sort in some cases, if only to convert
file names and permissions to a reasonable thing.

> > 5. The stream format will have to hold data about what kind of writer
> > produced the data in the file, so that the file logger can properly
> > account for the differences in display format and pass that data upstream
> > to the user interface. We don't want to force Unix filename format onto
> > Windows or Mac or etc.!
>
> Yep.  Yet another lesson we learned!  In the archive, paths should
> probably be encoded into some platform-independent form, so that it can be
> reconstructed for the platform we are restoring to.  But we still want an
> indicator of the original platform, for catalog and display purposes.

> Oh, and let's not forget, the converters from the independent path to the
> native path format will need to check for and handle invalid characters in
> the path.

Yep :-)

> > Similarly, if we're backing up a database file
> > dump stream (one possible data source) we don't want to have to pretend
> > that it contains Unix-structured data, and we need to know it came from
> > a database stream dumper rather than from a filesystem dumper, so that
> > when we go to restore it we know what restorer to use!
>
> Yep.  I'm finding it useful to think about streams as having 'names', not
> paths.

>
> >     So each type of data stream creator will need a unique creator ID of
> > some sort to tell us what kind of widget created the data stream, and this
> > gets put into the header so that we can grab it and know what to restore
> > this data stream with.
>
> I've been thinking a lot about this, and am having difficulty.  On the one
> hand, we want a general ID that indicates very generally what this stream
> is (directory, file, pipe data, database, etc), and a general way of
> accessing it for cross platform support.

> On the other hand, we also want to be able to identify a file as coming
> from an ext2 filesystem, so we can backup and restore the extended ext2
> bits.  In other cases, platform and filesystem specific ACL's need to be
> handled.

>
> So it looks like we need at least 3 different indicator ID's for stream
> headers.  The first (1 byte?) to indicate the type of stream (directory,
> file, pipe output, command output, etc), the second (1 byte?) to indicate
> the platform of the given type (Windows, Mac, Unix for directories and
> files; Oracle or MySQL for database streams, etc).  The third (2 bytes?)
> further classifies the type of stream based upon its original writer
> object.

The platform and type of stream is indicated by the writer object. An NT
writer object will have a different object ID than a Linux EXT2 writer
object or a MacOS writer object, because they have different data formats.
Writer objects may have, e.g., a pathname translator function, associated
with them. We don't need to put more indicator ID's into the headers,
because these indicator ID's are associated with the object that is
creating this stream. Sort of, if I have a code 0x5324 that is an "Oracle
NT Database Dump Object", this object may have a pathname code of 0x53
associated with it ("SQL Database/Table Names"), an originating OS code of
0x05 (Win32), etc... but all we need in the records on tape is the object
id of the creating object.

This means we have a central repository of object information, but that's
easy enough to accomplish (we're going to have a SQL database, after
all!).
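
Roughly what I have in mind, with an in-memory map standing in for the SQL
table. The one entry uses the example codes from above; the class and
field names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

final class WriterRegistrySketch {
    /** Everything we can derive from a writer object ID without storing it on tape. */
    static final class WriterInfo {
        final String description;
        final int pathnameCode;
        final int osCode;

        WriterInfo(String description, int pathnameCode, int osCode) {
            this.description = description;
            this.pathnameCode = pathnameCode;
            this.osCode = osCode;
        }
    }

    private final Map<Integer, WriterInfo> byObjectId = new HashMap<>();

    void register(int objectId, WriterInfo info) {
        byObjectId.put(objectId, info);
    }

    /** Looks up the attributes for the object ID found in a stream header. */
    WriterInfo lookup(int objectId) {
        WriterInfo info = byObjectId.get(objectId);
        if (info == null) {
            throw new IllegalArgumentException(
                    "unknown writer object ID 0x" + Integer.toHexString(objectId));
        }
        return info;
    }

    /** Registry holding only the example entry from this message. */
    static WriterRegistrySketch example() {
        WriterRegistrySketch registry = new WriterRegistrySketch();
        // 0x53 = "SQL Database/Table Names", 0x05 = Win32
        registry.register(0x5324, new WriterInfo("Oracle NT Database Dump Object", 0x53, 0x05));
        return registry;
    }
}
```

So the tape record carries 0x5324 and nothing else; the pathname and OS
codes come out of the lookup.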

> This way, if we are running on an NT system where we don't have the Unix
> file object class, we can use more general file object class to process

I would suggest that we have a set of translation objects that can be
thrown into the pudding where necessary to do any kinds of translations
that we feel are needed. The NT restorer object thus doesn't have to know
about anything except NT file stuff. It doesn't have to know that up
higher, an "EXT2->NT" translator sat on the stream and re-wrote the header
data to make it reasonable.
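
A small sketch of a translator of that kind, limited to path names. The
rewrite rules (swap '/' for '\', replace characters NT rejects with '_')
are my own simplification, and the names are hypothetical; a real one
would also have to deal with permissions and the rest of the header.

```java
import java.util.List;
import java.util.function.UnaryOperator;

final class TranslatorSketch {
    /** Hypothetical EXT2->NT path rewrite: swap separators, replace characters NT rejects. */
    static String ext2ToNt(String unixPath) {
        StringBuilder out = new StringBuilder();
        for (char c : unixPath.toCharArray()) {
            if (c == '/') {
                out.append('\\');
            } else if (c < 0x20 || "<>:\"|?*\\".indexOf(c) >= 0) {
                // Legal in an ext2 file name, not in an NT one.
                out.append('_');
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    /** Applies translators in order; the restorer only ever sees the final result. */
    static String translate(String path, List<UnaryOperator<String>> translators) {
        String result = path;
        for (UnaryOperator<String> translator : translators) {
            result = translator.apply(result);
        }
        return result;
    }
}
```

Because translators are just stages in a list, the NT restorer at the end
never learns that the stream started life on ext2.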

> They also don't have to worry about processing data produced by other file
> objects, since they would only get data that they (or their subclasses)
> created.

The translator stuff works there too.

> Think about the archive API like a filesystem streams API.  The file
> object just reads/writes data, and doesn't worry about how that gets
> stored on the media.

I think an I/O block size has to be passed to the low-level stream
creator, because it is in the best position to figure out the best way
of spanning (or padding) at I/O block boundaries. An after-the-fact
'chunker' is not as good there.

> > 8. Checksumming streams: We should probably only worry about checksumming
> >   buffer-sized chunks of data, not individual blocks of structured data.

> We need some new terms.  We are using 'streams' in the sense of archive
> streams and interplexing, and 'streams' in the sense of data streams to be
> archived.  How about calling them something like d-streams and a-streams.
> A d-stream is a stream of data to be archived.  An a-stream is an archive
> stream, and contains one or more d-streams.  An archive is made up of one
> or more a-streams.

Can we come up with catchier names than this? I do agree we have a
terminology problem here. 'archive streams', 'backup streams'. No?

Tomorrow I guess I get to work on terminology and see if I can come up
with some definitions that make sense.

-- 
Eric Lee Green                             mailto:er...@ba...
               BadTux: http://www.badtux.org
  GnuPG public key at http://badtux.org/eric/eric.gpg