From: Eric L. G. <er...@ba...> - 2001-07-19 06:37:52
Note: Sorry about the delay in commenting on this message. Had a slight crisis at work (a member of the team quit and we had to go into emergency mode to fill in for her). Things are settling down a little now, so we should be able to move forward a bit. I am currently working on (generating code for) the network block protocol for the plumbing. That can go on in parallel to the other work. I have also found a different crypto toolkit which is smaller than OpenSSL and which is native C++, so it will work better with the Plumber (which is C++). The crypto toolkit is called "Crypto++". Now all I need is a working C++ compiler (the one that comes with Red Hat 7.1 is busted: it will not create executables and won't even compile "hello_world.cpp". I'm waiting for the fixed one to download from ftp.redhat.com).

On Sat, 7 Jul 2001, Richard Fish wrote:
> On Wed, 4 Jul 2001, Eric Lee Green wrote:
> > This may not work with backups to Jazz disks, DVD-RAM disks, or MO disks.
> > Given my current employer, I very much want this to work with DVD-RAM
> > disks and MO disks :-). I will experiment tomorrow and find out exactly
> > what kind of write sizes these jukeboxes will handle. I know that the MO
> > disks do require all writes to be at least 2K, or a multiple thereof,
>
> Well, my hope is that we could provide a buffer of a particular size, and
> have the driver figure out how to transfer that to the device. As long as

I did some experiments, and it turns out that the Linux drivers, at least, will handle breaking up any writes to the MO drive into 2K chunks. But the writes have to be a multiple of 2K. This indicates that we are going to need to maintain some information about block devices that's different from what we maintain about tape devices, and bear it in mind when we are deciding what block size to use for a particular backup, but I don't see any real problem there.
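
A rough sketch of how a block size could be chosen once the per-device write granularity is known (2K for the MO drives, per the experiment above). Everything here is illustrative: `pick_buffer_size`, its parameters, and the idea of taking the least common multiple are assumptions, not something the project has settled on.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Greatest common divisor (Euclid's algorithm).
static std::size_t gcd_sz(std::size_t a, std::size_t b) {
    while (b != 0) {
        std::size_t t = a % b;
        a = b;
        b = t;
    }
    return a;
}

// Hypothetical helper: returns the smallest buffer size that is at least
// min_bytes and is a whole multiple of every device's write granularity.
// Zero entries are ignored; overflow is not checked in this sketch.
std::size_t pick_buffer_size(const std::vector<std::size_t>& write_units,
                             std::size_t min_bytes) {
    std::size_t unit = 1;
    for (std::size_t w : write_units) {
        if (w == 0) continue;
        unit = unit / gcd_sz(unit, w) * w;  // least common multiple so far
    }
    std::size_t multiples = (min_bytes + unit - 1) / unit;  // round up
    if (multiples == 0) multiples = 1;
    return multiples * unit;
}
```

With a 2K MO drive and a 512-byte disk, asking for at least 32768 bytes gives back 32768 unchanged, since that is already a multiple of both.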
The central authority knows what devices the data is going to be saved to, and can tailor a buffer size that's a multiple of the block size for all of them.

> > obvious. If I have one system with 100gb of data, and one system with 5gb
> > of data, and multiplex the two, then the second stream will finish
> > probably before the end of the first volume, except the actual tape writer
> > really has no way of knowing that. It knows it hasn't seen any data from
> > that particular stream for a while, but doesn't know whether it's just a
> > case of the source for that stream being busy or something else.
>
> Well, it should get an EOF on the socket. That will tell it the stream is
> done, at which point, it can mark it as inactive/finished/whatever.

The multiplexor gets an EOF, but the actual tape writer, downstream of it, doesn't. It just knows it's getting blocks, and plunks them to disk or tape or whatever.

> It's a very nice feature from an administration standpoint, because I can
> say I want to backup my entire workgroup of 30 machines, and multiplex
> them 4 at a time, and let the software figure out how to do it. It avoids
> me having to micro-manage the backup, and ensure the backup is completed
> in the least amount of time.

I like this. It does mean that we must have the central controller assign all archive IDs rather than the client.

> > We'll need to think about restores, and how to specify what to be
> > restored. I'm thinking that the low level restorer thingy will just be fed
> > a sequence of block numbers, and will fetch those blocks and put them out.
> > Then something upstream will actually strip the proper file info out of
> > the stream. This requires a simple SQL statement to grab the block numbers
> > for the files and output them. Hmm, that tends to indicate that we want a
> > 'last block' field in the database too, for indicating the block that
> > contains the end of the file.
> > (and perhaps can indicate a range of blocks
> > to the restorer thingy).
>
> Hmm, this would indicate that we can't do anything less than a full
> restore without a catalog, even using low-level, command-line accessible,
> utilities. I can accept that.

Naw. If you feed a block range of -1:-1 to the restorer, it just dumps the whole volume set to the output. Upstream, you can use a filter that filters it by filename or whatever criteria you want, but the low level tape reader knows nothing about files. It can't know anything about files, because files depend upon knowing the details of archive streams. All it knows about is blocks. But another widget upstream can strip out the file data that we're interested in.

This works pretty much like 'tar' already works. If you restore a tape with 'tar' and just want file '/foo/bar', it'll read through the entire darned tape looking for /foo/bar, even after it's already restored /foo/bar. Obviously we want to use a catalog and only dump the blocks of interest, but I really would prefer that the restore scanner thingy not have to know anything about filesystems. Paths and path extensions vary greatly depending upon operating system, and some things don't even really have a path (like an Oracle database dump stream using, hmm, what was the name of that standard protocol for getting a dump stream from databases?).

> The alternative is to have the lowlevel archive/backup/whatever scanner
> have the ability to also look for paths and path extensions. Essentially,
> to treat the archive/backup/whatever as a filesystem tree.

I'd prefer that the tape/disk/punch card/etc. low level drivers not need to know anything about files, and that this all be decided upstream.
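
The split between a block-only reader and a separate filter could look roughly like the sketch below. The names `Block`, `read_range`, and `filter_blocks` are made up for illustration, and an in-memory vector stands in for the tape or disk; only the -1:-1 "dump the whole volume set" convention comes from the discussion above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One raw I/O block as the low-level reader sees it: opaque bytes only.
struct Block {
    std::vector<unsigned char> bytes;
};

// Hypothetical low-level reader. Returns the blocks numbered first..last
// (inclusive) from the volume set. The range -1:-1 means "dump everything".
std::vector<Block> read_range(const std::vector<Block>& volume_set,
                              long first, long last) {
    if (first == -1 && last == -1) {
        return volume_set;
    }
    std::vector<Block> out;
    if (first < 0 || last < first) {
        return out;  // malformed range: nothing to read
    }
    const long count = static_cast<long>(volume_set.size());
    for (long i = first; i <= last && i < count; ++i) {
        out.push_back(volume_set[static_cast<std::size_t>(i)]);
    }
    return out;
}

// A separate widget that does the choosing. The predicate carries all the
// format-specific knowledge; the reader above never sees it.
template <typename Pred>
std::vector<Block> filter_blocks(const std::vector<Block>& in, Pred keep) {
    std::vector<Block> out;
    for (const Block& b : in) {
        if (keep(b)) out.push_back(b);
    }
    return out;
}
```

The reader stays ignorant of filenames, LDAP keys, or anything else; whatever knows the stream format supplies the predicate.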
They know about archive streams -- that should be easy enough, we've already specified that any given i/o block contains data only from a single archive stream, we can put that ID at the top of the i/o block and filter on it -- but because we want to be able to restore any kind of data, not just filesystem tree data, I'd prefer that we *not* embed any kind of filesystem logic. For example, I can foresee an LDAP data sourcer that would produce a stream of LDAP-formatted data. There's no reasonable way to treat that as a filesystem dump or restore, and I wouldn't even try. Let some specific LDAP-knowledgeable widget handle knowing what an LDAP key looks like and how to set LDAP variables in the destination LDAP directory.

> If we eliminate this concept, then we can also eliminate certain other
> unstated requirements that were in my head, like that all entries in a
> particular directory be contiguous within a single archive stream. I.e.,
> that /usr/bin comes after /usr and before anything else not in the /usr
> hierarchy.

Correct. We want to keep the actual low-level i/o engine as simple and stupid as possible, because I envision that we will actually have dozens of these guys eventually -- e.g., one that will back up to CD-R's, one that will back up to DVD-RAM disks, one that will back up to a sequence of 2GB files, etc. This also indicates that we don't want to think in terms of "block numbers" -- we want to think in terms of "location identifiers". For example, for the sequence of 2GB files case, the actual location of a file could be "/data/storage/345:32768", meaning that the data we want is at location 32768 in file /data/storage/345.

> This could lead to some other interesting features, like multi-threading
> backups on the agent and processing different filesystems in parallel.
> Heck, we could even consider breaking a single backup-object (file,
> database, etc) into segments that could be interspersed with other
> segments within the archive stream!
> Although, I think that is complexity
> we can avoid at this point.

See my sequence of 2GB files case :-). I do agree that this would allow some neat stuff like the multi-threading backups on the agent! But we'll leave that for future versions :-).

> As far as the database goes though, I think we primarily want to store the
> starting block of a new backup object (file, database, etc), rather than
> all of the blocks or range of blocks that contain it. I think if we try
> to describe all of the blocks that make it up, we will essentially end up
> indexing every single archive block in the database, and that is a *lot*
> of data.

We can't think in terms of blocks. We must think in terms of location. As for describing all the blocks, that's actually fairly easy. You just have two locations: the location containing the beginning of the file, and the location containing the end of the file. When you go to restore, you feed that range to the restorer, after doing a 'sort' and 'uniq' on the whole mess.

I think that the ending block is a pretty easy one for the 'processor' widget for a particular type of archive to figure out. Here's what happens. A processor widget gets a block on its input. It records what start-of-files are in that block, and what end-of-files are in that block. It then writes the block. It eventually gets a location back for that block. It looks at all end-of-files for that block, goes and fetches the start-of-files data for all end-of-files in that block (hash tables are great, eh!), and writes out the database record. Quite simple. You forget that C++ has standard vector and hash table classes as part of the STL! (Of course we have to remember to zap the data out of the hash table after we write it!). I'm starting to feel a *LOT* better about using C++ than I was feeling a few weeks ago.
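
The processor bookkeeping described above might be sketched like this. It is only an illustration: the `Processor` class, `CatalogRecord`, and the use of plain strings for file names and location identifiers are assumptions, and `std::unordered_map` (the hash table in today's standard library) stands in for whatever hash table class ends up being used.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// What would eventually become one row in the catalog database.
struct CatalogRecord {
    std::string file;
    std::string start_loc;  // location of the block holding the file's start
    std::string end_loc;    // location of the block holding the file's end
};

class Processor {
public:
    // Called once the writer reports where a block landed. `starts` and
    // `ends` name the files that begin or end inside that block.
    void block_written(const std::string& location,
                       const std::vector<std::string>& starts,
                       const std::vector<std::string>& ends) {
        for (const std::string& f : starts) {
            pending_[f] = location;  // remember where each file began
        }
        for (const std::string& f : ends) {
            auto it = pending_.find(f);
            if (it == pending_.end()) continue;  // start never seen; skip
            records_.push_back(CatalogRecord{f, it->second, location});
            pending_.erase(it);  // zap it once the record is written
        }
    }

    const std::vector<CatalogRecord>& records() const { return records_; }
    std::size_t pending() const { return pending_.size(); }

private:
    std::unordered_map<std::string, std::string> pending_;
    std::vector<CatalogRecord> records_;  // stand-in for the SQL insert
};
```

A file that starts and ends inside the same block gets a record whose two locations are identical, because starts are recorded before ends are resolved.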
Having these kinds of classes sitting there for use means we can think of doing things like this the way we'd do them in Python, without worrying about all the code we'd have to write to do it.

> This does mean that the multiplexing writer thingy needs to be fair in how
> it chooses what archive stream to service next, so that we don't have 10GB
> of data to read through to find the next block, unless the system it came
> from had some kind of problem.

I agree. It should attempt to do a round-robin service on all of the available inputs.

> There could be some efficiency gained on a restore from being able to use
> QFA seeks to skip ahead, but I don't think it's worth the extra catalog
> storage.

Again, I agree. The range is enough. The fact that we had to read 30 blocks to restore 5 blocks of data is trivial for most cases.

> > The platform and type of stream is indicated by the writer object. An NT
> > writer object will have a different object ID than a Linux EXT2 writer
> > object or a MacOS writer object, because they have different data formats.
> > Writer objects may have, e.g., a pathname translator function, associated
> > with them. We don't need to put more indicator ID's into the headers,
> > because these indicator ID's are associated with the object that is
> > creating this stream. Sort of, if I have a code 0x5324 that is a "Oracle
> > NT Database Dump Object", this object may have a pathname code of 0x53
> > associated with it ("SQL Database/Table Names"), an originating OS code of
> > 0x05 (Win32), etc... but all we need in the records on tape is the object
> > id of the creating object.
> >
> > This means we have a central repository of object information, but that's
> > easy enough to accomplish (we're going to have a SQL database, after
> > all!).
>
> I think I got it....what if we could interconnect the writer objects for a
> restore.
> Let me explain: during a backup, the process is simple and
> obvious (from the data flow perspective):
>
>   NTFS file --> NTFS_File_Obj --> ArchiveObj -> ....
>   EXT2FS file --> EXT2_File_Obj --> ArchiveObj -> ....
>   Oracle DB --> Oracle_DB_Obj ....
>
> Let's think about those File and DB objects for a second -- they need to
> read object specific data (ACLs, extended bits, file attributes, datafile
> locations, etc), format it into a form they can read later, and read and
> format the data from the file or database. All of that gets sent off to
> the archiving process.
>
> For a restore to the original location, the process looks like:
>
>   ... --> ArchiveObj -{NTFS_File_Stream}-> NTFS_File_Obj --> NTFS file
>   ...
>
> In the above, the NTFS_File_Obj gets (via push or pull) the same data
> stream(s) it originally stored via the archive object (we hope!). Since
> it wrote that data, it knows how to decode it to get the filename, file
> attributes, and the data section. So, it has no problem recreating the
> original file, with the original attributes, etc.
>
> The problem is how to handle the foreign data case:
>
>   ... -> ArchiveObj -{NTFS_File_Stream} -> EXT2_File_Obj -> EXT2FS file
>  or
>   ... -> ArchiveObj -{NTFS_File_Stream} -> NTFS_File_Obj -> EXT2FS file
>
> One way is as you suggest, to have a translator object to convert the
> NTFS_File_Stream into a EXT2_File_Stream, that can be read by the
> EXT2_File_Obj. But that involves a set of objects that have to be
> maintained in sync with the writer objects. Instead, what if we leverage
> the knowledge and code already in the NTFS_File_Obj, and tell it to write
> its data via an EXT2_File_Obj. It looks like this:
>
>   ... -> ArchiveObj -{NTFS_File_Stream} -> NTFS_File_Obj
>          --> EXT2_File_Obj -> EXT2FS file

You still must have a converter object here to, e.g., map from an NTFS filename to an EXT2FS filename.
Whether the conversion is done at a low level via calling a method for an object or via an external translator component, it's going to have to be done. Note that the converter object can most certainly use the C++ classes used by the writer object to handle most of that. I certainly wasn't suggesting that we had to maintain two totally separate sets of classes, just that there was a step involved in translating from one format to another.

We might think further on how to make this translation process easier, but I do not think we need to worry about it for the moment. For the moment it will probably suffice to say "you can restore NT files only to NT systems, and LDAP directories only to LDAP directories". The beauty of a component architecture is that you can always add more components into the stream later. We want something that works in a relatively short time. I'm tired of tarring up my notebook to my desktop then backing up my desktop. I want a real network backup again and no, I'm not going to install Arkeia to do it, I want something Open Source!

> the important thing is the 'translation' is done by the writer/reader
> objects themselves, without needing another class of objects to handle
> that.

No problem with that, just noting that we must have format translation somewhere if we are going to do cross-platform restores, whether it is as part of writer objects or as separate components that sit in the stream. Separate components (built with those objects, sitting in a pipeline between source and destination) are easier to hack into the pipeline later, but impose a performance penalty. Still, how often are we going to do cross-platform restores, and is the performance penalty going to be severe enough that we really care?

> Now, there are some connections that make no sense. For example, restoring
> file data to an oracle database object would be non-sensical. So there
> are a couple simple rules to implement in code:
>
> 1. File objects connect only to other file objects.
> 2. All other objects connect to file objects, or themselves.

All I would suggest is that we keep things simple wherever possible. "Release early, release often" is the goal. If we get working code on the site, then we can possibly get other contributors to do things like, e.g., hack on translation objects. Remember, this isn't you and me and Randy sitting in our cubicles anymore, this is an Open Source project, and the more we can parallelize the development, the more hackers we can attract to working on it and have them make meaningful contributions. But nobody's going to participate until we can do actual network backups and restores. Thus I suggest that we defer talk on format translations, and concentrate on the low level block format.

> > I think an I/O block size has to be passed to the low-level stream
> > creator, because it is in the best position to figure out the best way
> > of spanning (or padding) at I/O block boundaries. An after-the-fact
> > 'chunker' is not as good there.
>
> Except that means that the lowlevel stream objects need to know about the
> archive format, header sizes, etc. That is not good.

No it doesn't. It needs to know that x bytes are reserved for a header, but then it just fills in the rest of the buffer with its data and passes it on to the next component, which passes it on down the line, with each component filling in its own piece of the puzzle as desired. This does mean that it needs to "know" that it owns, say, bytes 256 to (n-12) of a 32768-byte buffer block, but that's not too difficult.

> I think the stream creators create exactly that, a stream of data (here's
> some data, write it to the archive).

By "archive" do you mean the sum total of data being written to tape? Or do you mean a single backup stream?
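
To make the "bytes 256 to (n-12) of a 32768-byte buffer" idea concrete, here is a minimal sketch. The three sizes are the example figures from the paragraph above, not a decided format; `fill_payload`, `stamp_stream_id`, and the 4-byte big-endian ID at the top of the block are purely hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

const std::size_t kBlockSize = 32768;   // example I/O block size
const std::size_t kHeaderBytes = 256;   // reserved for downstream components
const std::size_t kTrailerBytes = 12;   // reserved at the end of the block
const std::size_t kPayloadCapacity = kBlockSize - kHeaderBytes - kTrailerBytes;

// The stream creator's part: copy as much data as fits into the region it
// owns, leaving header and trailer untouched. Returns the bytes consumed.
std::size_t fill_payload(std::vector<unsigned char>& block,
                         const std::vector<unsigned char>& data) {
    if (block.size() != kBlockSize) block.resize(kBlockSize, 0);
    const std::size_t n = std::min(data.size(), kPayloadCapacity);
    std::copy(data.begin(), data.begin() + n, block.begin() + kHeaderBytes);
    return n;
}

// A later component's part: write the stream ID into the first four header
// bytes (big-endian). It never needs to look at the payload.
void stamp_stream_id(std::vector<unsigned char>& block, unsigned long id) {
    if (block.size() != kBlockSize) block.resize(kBlockSize, 0);
    for (int i = 0; i < 4; ++i) {
        block[i] = static_cast<unsigned char>((id >> (8 * (3 - i))) & 0xFF);
    }
}
```

Each component only needs the offsets of its own region, so the creator can stay ignorant of what the header actually contains.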
For simplicity's sake, it makes sense to organize each backup stream into tape-block-sized (or some kind of io-block-sized) chunks, each of which is tagged with what backup stream wrote it. This way the low level tape reader can read back stuff based on what backup stream it came from, without knowing anything about what's actually in those io-block-sized chunks.

I think I see where you're going. You're saying that the stream processor -- the thingy that takes the raw stream of objects from the stream creator and does any processing necessary and figures out what data needs to be logged by the database to, e.g., indicate start and end of files -- should be the one that actually chunks it into IO-buffer-sized blocks, because that makes it easier for it to decide where files start and stop (and span things across block boundaries) without having to re-parse the IO-buffer-sized block back into its component raw stream blocks. Okay. That's fair. I think that'll work. It'll also make the agent side code smaller, which makes it more feasible that we could possibly create rescue disks for this thing, since the stream processor

This way filesystem readers can handle the spanning in a way that makes sense. They know, e.g., that if they have a 'filename' block that's 105 characters long, it makes no sense to span that across a buffer boundary that's 30 characters away, so they can create a 'padding' block of 30 bytes, and continue on to the next block.

There is no such thing as an 'archive object', by the way, just an 'archive block' object which the various writers know how to read and write off of tape or disk or whatever media we're backing up to. I think we need to get away from the whole notion of an "archive". What we have is a "backup", which consists of one or more "backup streams", each of which could represent any kind of data (some could be filesystem backup streams, another might be a backup of the NT registry, etc.).
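
The padding rule from the 105-character filename example can be sketched as a small layout routine. This is a simplification under stated assumptions: `layout_items`, `Slot`, and `Layout` are invented names, every item is treated as unsplittable, padding is counted as raw bytes with no sub-block header of its own, and items larger than one block (which would need real spanning) are simply rejected.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

struct Slot {
    std::size_t block;   // index of the I/O block the item lands in
    std::size_t offset;  // byte offset of the item within that block's payload
};

struct Layout {
    std::vector<Slot> slots;    // one entry per item, in input order
    std::size_t padding_bytes;  // total bytes spent on padding
};

// Places unsplittable items into blocks of `capacity` payload bytes. When an
// item does not fit in what is left of the current block, the remainder is
// padded out and the item starts at the top of the next block.
Layout layout_items(const std::vector<std::size_t>& item_sizes,
                    std::size_t capacity) {
    Layout result;
    result.padding_bytes = 0;
    std::size_t block = 0;
    std::size_t used = 0;
    for (std::size_t size : item_sizes) {
        if (size > capacity) {
            throw std::invalid_argument("item larger than one block");
        }
        if (size > capacity - used) {
            result.padding_bytes += capacity - used;  // the 'padding' block
            ++block;
            used = 0;
        }
        Slot s = {block, used};
        result.slots.push_back(s);
        used += size;
    }
    return result;
}
```

With a 200-byte payload area and items of 170, 105, and 50 bytes, the 105-byte item finds only 30 bytes left, so 30 bytes are padded and it starts the next block, with the 50-byte item following it there.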
Each kind of backup stream type must have its own routines for deciding what sub-blocks within the overall 'archive block' mean, and thus for chunking data.

> The archive object inserts whatever
> headers and structures it needs to be able to validate and return that
> data back to the stream object for a restore. It also takes care of
> overflowing data across block boundaries (with new structures, etc).

Again, I state that we should ban the word 'archive' in favor of the terms 'backup' and 'backup stream', where a 'backup' consists of blocks from multiple 'backup streams'. That will probably save some confusion.

> The point is, that having any layer try to format/manage data for another
> layer is difficult to maintain, and violates the black-box encapsulation
> rules.

True, but somebody does have to know how to chunk things. We can't do anything about that :-(. I do believe that moving the chunking out to the processors, rather than putting it in the creators, is probably a good thing because that results in smaller creators (and the creators are the things that live out on the remote systems... the smaller we can get these things, the more probable it is that we can make rescue disks with them).

> > Can we come up with catchier names than this? I do agree we have a
> > terminology problem here. 'archive streams', 'backup streams'. No?
> >
> > Tomorrow I guess I get to work on terminology and see if I can come up
> > with some definitions that make sense.
>
> Let's see (brainstorming here),
>
>   Addict - (A)ttention (D)eficit (D)isorder (I)nfli(C)ting (T)ask
>            (aka. the multiplexor)
>   Archie - A 'stream' of data representing an object in an archive
>   Archive - A collection of Archies representing a single backup
>             set (constrained to a single system?)
>   Backup - A collection of one or more archives, on one or more media
>            volumes.
>   Vault - The collection of all Backups done by Tapioca.
>
> Hmm, none of these have anything to do with pudding, desserts, or even
> food.
> Anybody else?

My head hurts. Let's move on to what a backup header (the thing at the start of backup volumes) looks like, then what a Unix filesystem stream looks like. Note that because we've defined this so flexibly, we can later add, e.g., an ext2 filesystem stream thingy or etc., and by bumping the object ID still be able to read the archive using the previous-generation object. So I don't think we have to be quite as careful about the actual (non-io-related) contents of the backup blocks as we thought, as long as we get the backup block and backup header formats fixed in stone. From thence onwards, as long as we can pull the object ID of the entity to be used to restore the stream out of the backup header, we can restore it, and if we change the format of a backup stream, we just bump our object ID and retain a copy of the old restorer at the original object ID so that we can continue to restore streams of that format. Yes, a pain, and to be avoided whenever possible, but backwards compatibility cruft is inevitable. You of all people should know that :-).

Eric Lee Green mailto:er...@ba...
BadTux: http://www.badtux.org