Re: [Tapioca-devel] Archive format

On Mon, 2 Jul 2001, Eric Lee Green wrote:

> I've put the preliminary architecture documentation up at
> http://tapioca.sourceforge.net . That should give you a good idea of how
> this is structured. Aside from the 'pudding', the rest of the architecture
> should be somewhat familiar, except for the parts inflicted by Java (like
> the tape server being multi-threaded rather than multi-process).

Looks good.  One question I have though is the implementation of the
tapicom protocol.

I have just updated my Java reference library, and see that Java has a
built-in RPC mechanism called RMI (Remote Method Invocation).  It's
simpler than CORBA or COM, although it only works between Java objects.

It appears that RMI can be easily integrated with any type of socket,
including an SSL-socket.

Maybe we should consider using RMI instead?  The advantage here to me is
that we don't have to write a lot of code to format, write, read, and
parse data for every operation we perform, while also trying to figure out
the error detection, reporting, and handling mechanisms.  If we want to
add a user to the database, we just call something like:

result = tapica.exports.NewUser(authObj, userObj)

If the call fails, it raises an exception that can get handled by the
client.
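
To make that a bit more concrete, here is a rough sketch of what such an
interface could look like.  Only java.rmi.Remote and RemoteException are
real RMI pieces; UserAdmin, UserAdminException, InMemoryUserAdmin, the
string auth token, and the "secret" value are names I made up for
illustration.  Exporting the object and looking it up in a registry are
left out, so this only shows the calling convention, not the transport.

```java
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical application-level failure (bad credentials, duplicate user).
class UserAdminException extends Exception {
    UserAdminException(String message) {
        super(message);
    }
}

// Hypothetical remote interface. Every remotely callable method must declare
// RemoteException so that transport failures reach the caller.
interface UserAdmin extends Remote {
    void newUser(String authToken, String userName)
            throws RemoteException, UserAdminException;
}

// Hypothetical server-side implementation, kept in memory and not exported.
class InMemoryUserAdmin implements UserAdmin {
    private final Set<String> users = new HashSet<String>();

    public void newUser(String authToken, String userName) throws UserAdminException {
        if (!"secret".equals(authToken)) {
            throw new UserAdminException("authentication failed");
        }
        if (!users.add(userName)) {
            throw new UserAdminException("user already exists: " + userName);
        }
    }
}

class RmiSketch {
    // Client-side call: failures arrive as exceptions rather than as reply
    // codes we would have to format, send, read, and parse ourselves.
    static String tryAddUser(UserAdmin admin, String token, String name) {
        try {
            admin.newUser(token, name);
            return "ok";
        } catch (UserAdminException e) {
            return "rejected: " + e.getMessage();
        } catch (RemoteException e) {
            return "transport error: " + e.getMessage();
        }
    }
}
```

The client code reads the same whether the UserAdmin object is local or a
remote stub, which is most of the appeal.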

Anyway, I think it's worth considering, since it makes the data and error
handling much easier.

> Now it's time to decide on the archive format. We need to get this right
> because changing it after we've started making backups will be a major
> pain in the %$#@@#.

Agreed!!!!

> Some criteria:
>
> 1. Must be able to multiplex (and demultiplex) multiple streams into the
> same tape file. This indicates that the input needs to be blocked in a
> structured way, rather than just being a plain old stream of data, and
> that each block needs to be tagged with a stream ID.

Yep.  Some proposed requirements:

1. Archive volumes are written in fixed blocks of a configurable size.
Minimum IO block size is 16k.

2. Each volume begins with a volume header block that describes the volume
and all still-active archive streams being multiplexed to that archive.
Note that this means that additional streams cannot be added to an archive
that is in process.  However, it doesn't guarantee that this specific
volume actually contains any data for that archive stream.  For example,
we could say that this archive will consist of 8 streams, but only 3 can
be written concurrently.  This would allow the other 5 streams to be
written as previous streams are completed.

3. Each IO block contains data for exactly one archive stream.

4. Each IO block contains a header that specifies its archive stream ID,
sequence number, checksum, etc.

5. The payload section of each IO block starts with a byte that indicates
a structure type: data stream header, data stream resource, data stream
data, etc.  This is followed by the structure of the indicated type, and
any data it carries.

6. Within each IO block the structures are variable length, ensuring
efficient space usage and performance.
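
As a strawman for the IO block header and the structure type byte, here is
a minimal sketch.  The field widths, their order, the 12-byte header size,
and the type codes are all assumptions of mine, picked only to show the
shape; nothing here is a decided format.

```java
import java.nio.ByteBuffer;

// Hypothetical IO block header: 2-byte archive stream ID, 2-byte flags,
// 4-byte sequence number, 4-byte checksum. The layout is an assumption.
class IoBlockHeader {
    static final int SIZE = 12;

    final int streamId;   // unsigned 16-bit
    final int flags;      // unsigned 16-bit
    final long sequence;  // unsigned 32-bit
    final long checksum;  // unsigned 32-bit

    IoBlockHeader(int streamId, int flags, long sequence, long checksum) {
        this.streamId = streamId;
        this.flags = flags;
        this.sequence = sequence;
        this.checksum = checksum;
    }

    void writeTo(ByteBuffer buf) {
        buf.putShort((short) streamId);
        buf.putShort((short) flags);
        buf.putInt((int) sequence);
        buf.putInt((int) checksum);
    }

    static IoBlockHeader readFrom(ByteBuffer buf) {
        int streamId = buf.getShort() & 0xFFFF;
        int flags = buf.getShort() & 0xFFFF;
        long sequence = buf.getInt() & 0xFFFFFFFFL;
        long checksum = buf.getInt() & 0xFFFFFFFFL;
        return new IoBlockHeader(streamId, flags, sequence, checksum);
    }
}

class IoBlock {
    // Assumed structure-type codes for the first payload byte.
    static final int STREAM_HEADER = 1;
    static final int STREAM_RESOURCE = 2;
    static final int STREAM_DATA = 3;

    // Builds one fixed-size block: header, structure-type byte, body, zero padding.
    static byte[] build(IoBlockHeader header, int structureType, byte[] body, int blockSize) {
        ByteBuffer buf = ByteBuffer.allocate(blockSize);
        header.writeTo(buf);
        buf.put((byte) structureType);
        buf.put(body);
        return buf.array();
    }

    // Reads the structure-type byte that opens the payload.
    static int structureType(byte[] block) {
        return block[IoBlockHeader.SIZE] & 0xFF;
    }
}
```

A demultiplexer would only need readFrom() to route a block by streamId,
without looking at the payload at all.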

The downside to variable length structures and data is handling
corruption.  With a fixed-length, BRU-style format, if a file data block
was corrupted, we could (and did!) just write the data section out to the
file, and note the error.  On a file header block, we could have just
invented a file name, and written out the appropriate data section.

But with variable length records, we can't even trust that the encoded
length is correct, and we don't know for certain if there is another
stream header structure in the block, and if so, where. We either have to
write very complex (and error-prone) code to try and make sense of the
corrupted data, or throw the whole block away and move on.

One thing we might want to consider is some kind of ECC encoding for the
structure headers.  I don't have any clear idea of how much ECC to do, or
even, where it should go in the format.  I also don't have any clear idea
of the performance impact of ECC.

> 3. The tape format should not require doing a MT_TELL for every bloody
> block written to tape, only for blocks that actually need it (i.e, blocks
> that contain the beginning of a piece of data logged into the database).
> This tends to indicate that blocks need tagging with a "type" field.

Hmm, how do we handle the catalog rebuild (reading the archive) case?  We
don't know *if* we need to do an MT_TELL or not at the time we *need* to
do the MT_TELL, before reading the block.

I suppose we could calculate the tape block size when we open the volume
(a=MT_TELL, read_block, b=MT_TELL, tape_block_size = b-a).  Then we could
read the block and, if it contained stream headers, fudge the QFA position.

> 4. The format should be able to handle two things other than raw
>   data blocks:
>    a) producing location information suitable for logging into the
>     central authority's location database for use in future restores,
>     and
>    b) holding any OS-specific data needed to fully restore the file.

Right.  And of course, we also want to be able to restore the data portion
of a stream (file or otherwise) on a different OS.

> 5. The stream format will have to hold data about what kind of writer
> produced the data in the file, so that the file logger can properly
> account for the differences in display format and pass that data upstream
> to the user interface. We don't want to force Unix filename format onto
> Windows or Mac or etc.!

Yep.  Yet another lesson we learned!  In the archive, paths should
probably be encoded into some platform-independent form, so that it can be
reconstructed for the platform we are restoring to.  But we still want an
indicator of the original platform, for catalog and display purposes.

Oh, and let's not forget, the converters from the independent path to the
native path format will need to check for and handle invalid characters in
the path.
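
Roughly what I have in mind, as a minimal sketch: store the path as a list
of components, and rebuild it with the target platform's separator while
replacing characters the target cannot accept.  The class and method names,
the '_' replacement character, and passing the invalid set as a string are
all assumptions for illustration.  Roots, drive letters, and character set
conversion are ignored here.

```java
import java.util.ArrayList;
import java.util.List;

class PortablePath {
    // Splits a native path into its components; empty components
    // (leading or doubled separators) are dropped.
    static List<String> fromNative(String path, char separator) {
        List<String> parts = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (char c : path.toCharArray()) {
            if (c == separator) {
                if (current.length() > 0) {
                    parts.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            parts.add(current.toString());
        }
        return parts;
    }

    // Rebuilds a path for the target platform. Any character that is the
    // target separator or listed in invalidChars is replaced with '_'.
    static String toNative(List<String> parts, char separator, String invalidChars) {
        StringBuilder out = new StringBuilder();
        for (String part : parts) {
            if (out.length() > 0) {
                out.append(separator);
            }
            for (char c : part.toCharArray()) {
                boolean bad = c == separator || invalidChars.indexOf(c) >= 0;
                out.append(bad ? '_' : c);
            }
        }
        return out.toString();
    }
}
```

For example, a Unix name like home/fish/a:b?.txt would be stored as three
components and come back as home\fish\a_b_.txt when restored with a
backslash separator and an invalid set containing ':' and '?'.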

> Similarly, if we're backing up a database file
> dump stream (one possible data source) we don't want to have to pretend
> that it contains Unix-structured data, and we need to know it came from
> a database stream dumper rather than from a filesystem dumper, so that
> when we go to restore it we know what restorer to use!

Yep.  I'm finding it useful to think about streams as having 'names', not
paths.

>     So each type of data stream creator will need a unique creator ID of
> some sort to tell us what kind of widget created the data stream, and this
> gets put into the header so that we can grab it and know what to restore
> this data stream with.

I've been thinking a lot about this, and am having difficulty.  On the one
hand, we want a general ID that indicates very generally what this stream
is (directory, file, pipe data, database, etc), and a general way of
accessing it for cross platform support.

On the other hand, we also want to be able to identify a file as coming
from an ext2 filesystem, so we can backup and restore the extended ext2
bits.  In other cases, platform and filesystem specific ACLs need to be
handled.

So it looks like we need at least 3 different indicator IDs for stream
headers.  The first (1 byte?) to indicate the type of stream (directory,
file, pipe output, command output, etc), the second (1 byte?) to indicate
the platform of the given type (Windows, Mac, Unix for directories and
files; Oracle or MySQL for database streams, etc).  The third (2 bytes?)
further classifies the type of stream based upon its original writer
object.
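
Taking the guessed widths at face value (1 + 1 + 2 bytes), the three IDs
could be packed like this.  It is only a sketch: the class name, the
numeric codes, and the big-endian byte order are assumptions, not a
proposal for actual values.

```java
import java.nio.ByteBuffer;

// Hypothetical 4-byte stream classification: type, platform, writer.
class StreamKind {
    // Assumed example codes; real values would need to be assigned centrally.
    static final int TYPE_DIRECTORY = 1;
    static final int TYPE_FILE = 2;
    static final int PLATFORM_UNIX = 1;
    static final int PLATFORM_WINDOWS = 2;

    final int type;      // 1 byte: directory, file, pipe output, ...
    final int platform;  // 1 byte: meaning depends on the type
    final int writer;    // 2 bytes: the writer object that produced the stream

    StreamKind(int type, int platform, int writer) {
        this.type = type;
        this.platform = platform;
        this.writer = writer;
    }

    byte[] encode() {
        return ByteBuffer.allocate(4)
                .put((byte) type)
                .put((byte) platform)
                .putShort((short) writer)
                .array();
    }

    static StreamKind decode(byte[] raw) {
        ByteBuffer buf = ByteBuffer.wrap(raw);
        int type = buf.get() & 0xFF;
        int platform = buf.get() & 0xFF;
        int writer = buf.getShort() & 0xFFFF;
        return new StreamKind(type, platform, writer);
    }
}
```

A reader that does not recognize the writer ID can still fall back on the
type and platform bytes.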

This way, if we are running on an NT system where we don't have the Unix
file object class, we can use the more general file object class to process
the data.  In other words, the low-level specific file object readers and
writers don't have to worry about whether the host platform supports them
or not.  They will only be available on the platforms that support them.
They also don't have to worry about processing data produced by other file
objects, since they would only get data that they (or their subclasses)
created.

File Stream Object class hierarchy:

GenericFile - Available everywhere.  Processes all types of file data
  |-- NTFile - Available on NT.  Processes file data for all NT filesystems
  | |-NTFSFile - Available on NT, when backing up or restoring to NTFS filesystems
  |
  |-- UnixFile -
  ....
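
The fallback could work something like the sketch below: look for a
handler for the writer that produced the stream, and walk up the hierarchy
until one is available on this host.  The FileHandlers registry, the
numeric IDs, and the describe() method are hypothetical; only the class
names come from the tree above.

```java
import java.util.HashMap;
import java.util.Map;

class GenericFile {
    String describe() { return "GenericFile"; }
}

class NTFile extends GenericFile {
    String describe() { return "NTFile"; }
}

class NTFSFile extends NTFile {
    String describe() { return "NTFSFile"; }
}

class UnixFile extends GenericFile {
    String describe() { return "UnixFile"; }
}

// Hypothetical registry of the handlers available on the current host.
class FileHandlers {
    static final int GENERIC = 0;
    static final int NT = 1;
    static final int NTFS = 2;
    static final int UNIX = 3;

    // Parent of each writer ID, mirroring the class hierarchy.
    private static final int[] PARENT = { GENERIC, GENERIC, NT, GENERIC };

    private final Map<Integer, GenericFile> available = new HashMap<Integer, GenericFile>();

    FileHandlers() {
        available.put(GENERIC, new GenericFile()); // available everywhere
    }

    void register(int writerId, GenericFile handler) {
        available.put(writerId, handler);
    }

    // Returns the most specific available handler for the given writer ID.
    GenericFile handlerFor(int writerId) {
        if (writerId < 0 || writerId >= PARENT.length) {
            return available.get(GENERIC); // unknown writer: most general handler
        }
        int id = writerId;
        while (!available.containsKey(id)) {
            id = PARENT[id]; // always ends at GENERIC, which is registered
        }
        return available.get(id);
    }
}
```

On a Unix host that registers only UnixFile, a stream written by NTFSFile
ends up with GenericFile; on an NT host with NTFile but no NTFSFile, it
ends up with NTFile.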

> 6. For volume changes, the full header information should be replicated
>   on the new volume, along with what volume we're working on etc. so that
>   if we have a tape that is a volume 2, we have more of a chance of
>   associating it with the correct volume 1 if we have to do this by
>   hand.
>
> 7. Fixed-size blocks, or variable-sized blocks? Fixed-sized blocks, like
>  'tar' uses, are easy to deal with, and can be easily packed into
>  larger buffers (as long as said larger buffers are a multiple of the
>   blocksize in length).  However, each block adds overhead. If the
>   block size is too small, overhead becomes too much of a percentage of
>   the block. If the block size is too large, then we have too much
>   wasted space at the end of the block.
>
>   Variable-sized blocks could be used, but we could require that these
>   be packed into a fixed-size buffer of some large size (perhaps
>   64K or 128K) such that each buffer begins with a block and no block
>   spans buffers. This is a pain, but results in less wasted space and
>   thus better performance in the end. Note that if we limit the
>   variable-sized blocks to 32k in size, we can represent the size of the
>   block with only 2 bytes in the block's header.

I think variable sized works best, as we don't worry about wasting space,
which is really the biggest overhead in fixed size blocks.  And 2 bytes
gives us a maximum 'chunk' size of 64k.  Note that we can pack variable
sized chunks into a fixed size IO block.  For example, assume the
following 128k fixed IO blocks (these sizes are arbitrary):

IO header   - 32-bytes
Stream header(1) - 196-bytes (encodes name length, resource length,
                           and data lengths, etc)
  - Stream name - 34-bytes
  - Stream resource - 132-bytes
Stream data(1) - 2345-bytes
  - header 12-bytes
  - data   2333-bytes
Stream header(2) - 96-bytes
  - name - 42-bytes
  - resource - 23-bytes
Stream data(2) - 48401-bytes
.....
Stream data(4) - 836-bytes
  - header 12-bytes
  - data 824-bytes
# END OF BLOCK 2 here, stream 4 not finished
# Start of BLOCK 3,
IO Header - 32 bytes
Stream data(4) - 65504-bytes
  - header 12-bytes
  - data - 65492-bytes
Stream data(4) - 1804-bytes # last block of stream 4
  - header 12-bytes
  - data - 1792-bytes
....
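
The splitting of stream data across IO blocks in the layout above can be
sketched as follows.  The 32-byte IO header and 12-byte data header are the
arbitrary sizes from the example; ChunkPlanner, its parameters, and the
returned {block, dataBytes} pairs are names I invented for the sketch.  It
assumes an IO block is large enough to hold its header plus at least one
data chunk.

```java
import java.util.ArrayList;
import java.util.List;

class ChunkPlanner {
    static final int IO_HEADER = 32;    // arbitrary size, from the example layout
    static final int DATA_HEADER = 12;  // arbitrary size, from the example layout

    // Plans how dataLength bytes are cut into chunks. Each entry is
    // {block index relative to the current block, data bytes in the chunk}.
    // maxChunk is the largest chunk size (header included) we allow.
    static List<int[]> plan(int blockSize, int maxChunk, int freeInFirstBlock, long dataLength) {
        List<int[]> chunks = new ArrayList<int[]>();
        int block = 0;
        int free = freeInFirstBlock;
        long left = dataLength;
        while (left > 0) {
            if (free <= DATA_HEADER) {
                // No room for a data header plus at least one byte:
                // move on to a fresh IO block.
                block++;
                free = blockSize - IO_HEADER;
            }
            int room = Math.min(free, maxChunk) - DATA_HEADER;
            int data = (int) Math.min(left, (long) room);
            chunks.add(new int[] { block, data });
            free -= DATA_HEADER + data;
            left -= data;
        }
        return chunks;
    }
}
```

With 128k blocks, a 65504-byte chunk limit, 836 bytes free in the current
block, and 68108 bytes of stream data, this yields chunks of 824 bytes in
the current block and 65492 and 1792 bytes in the next one, which is the
stream 4 case shown above.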

Think about the archive API like a filesystem streams API.  The file
object just reads/writes data, and doesn't worry about how that gets
stored on the media.

This doesn't really cause any API problems.  In fact, it ensures that the
only thing that can know about the archive format is the archive object.
We just need to figure out whether the archive object processes stream
objects, or if the stream objects process themselves via the archive
object.

> 8. Checksumming streams: We should probably only worry about checksumming
>   buffer-sized chunks of data, not individual blocks of structured data.
>   Setup time for the CRC calculations can thus be reduced, as can the
>   overhead of the CRC checksum itself.

Yes.  This should be done on an IO block, and stored in the IO block
header.  It is useful to be able to tell the other objects whether the
block this data came from validated, so they know how much
trust to put into it, but we don't want to have several checksums floating
around.
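
A minimal sketch of a per-IO-block checksum, assuming CRC-32 (via
java.util.zip.CRC32) and a 4-byte checksum field at offset 8 of the IO
header.  Both the algorithm choice and the offset are assumptions on my
part.  The checksum field is counted as zeros while computing, so the
stored value does not feed into itself.

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

class BlockChecksum {
    static final int CRC_OFFSET = 8; // assumed position of the checksum field
    static final int CRC_LENGTH = 4;

    // CRC-32 over the whole block, with the checksum field treated as zeros.
    static long compute(byte[] block) {
        CRC32 crc = new CRC32();
        crc.update(block, 0, CRC_OFFSET);
        crc.update(new byte[CRC_LENGTH], 0, CRC_LENGTH);
        int rest = CRC_OFFSET + CRC_LENGTH;
        crc.update(block, rest, block.length - rest);
        return crc.getValue();
    }

    // Stores the checksum into the header just before the block is written.
    static void seal(byte[] block) {
        ByteBuffer.wrap(block).putInt(CRC_OFFSET, (int) compute(block));
    }

    // Tells the caller whether the block validated, so upper layers know
    // how much trust to put in structures decoded from it.
    static boolean isValid(byte[] block) {
        long stored = ByteBuffer.wrap(block).getInt(CRC_OFFSET) & 0xFFFFFFFFL;
        return stored == compute(block);
    }
}
```

The archive object would call isValid() once per block read and pass the
result along with whatever it hands to the stream objects.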

>
> 9. I think Mr. Fish mentioned that we probably want an "end of file" block
> in file streams so that we know we have reached the end of a file.  This
> simplifies some programming, I guess. Did I misread the message?

I don't think that's required.  We can flag a file EOF when we see a new
stream header.  It also means we can pack the stream data all the way to
the end of the IO block.

If we want, we can just add a flags byte to the stream data structure that
indicates that "this is the last data block".

> Okay, I think this is enough to think about. I am especially curious to
> know what you think about the notion of putting variable-sized blocks into
> bigger buffer-sized blocks. I think this solves many problems (we never
> really know how much OS-specific data is going to be in file headers, for
> example), but is somewhat more complex than fixed-size blocks like 'tar',
> and yes, there is still some overhead in some cases (if we don't have
> enough space at the end of a buffer for a block, that space is wasted).

I like it.  We actually waste less space than the fixed-chunk case.  BRU
wasted an average of 512-bytes per file for small files, and ~850 bytes
per file for larger files.  Packing multiple things into a fixed IO block
prevents this.

There will be some space wasted: for example, we can't (more specifically,
don't want to!) split a stream header across an IO buffer boundary.  But
if we are writing stream data, we can adjust/buffer the data to exactly
fill the buffer, and then just start the next IO buffer with the leftover
stream data (and a new stream data header indicating the size).

> Comments?

We need some new terms.  We are using 'streams' in the sense of archive
streams and interplexing, and 'streams' in the sense of data streams to be
archived.  How about calling them something like d-streams and a-streams.
A d-stream is a stream of data to be archived.  An a-stream is an archive
stream, and contains one or more d-streams.  An archive is made up of one
or more a-streams.

-- 
Richard Fish, Unix/Linux Software Engineer, rj...@fi...