Thread: [mbackup-devel] xml as a tape header: the cons
|
From: James O'K. <jo...@mi...> - 2000-08-22 19:08:54
|
The only thing I have against using xml as a tape format is the extra tape
space that it would use. I wrote this quick example:

<?xml version="1.0" ?>
<mbackup:header>
<mbackup:date>966968837</mbackup:date>
<mbackup:filename>/usr/local/bin/foobar</mbackup:filename>
<mbackup:hostname>uhura.midnightlinux.com</mbackup:hostname>
</mbackup:header>

I'm not sure if that's true valid xml, but I think it's close enough for
this example. I also have this more traditional header format:

96696883724uhura.midnightlinux.com21usr/local/bin/foobar

9 digit time, 2 digit next field length, hostname, 2 digit next field
length, filename.

Both give the same information. The xml one is 215 bytes and the other one
is 58 bytes.

I also gathered some data that is typical of the data we use at work:

[root@cadillac round5]# du -a|wc -l
306050
[root@cadillac round5]# du -s
4070060 .

306,050 files using about 4 gigs of space.
With the xml header, there is 65M of header data to label 4gig of data.
With the other header, there is 17M of header data to label 4gig of data.

Given this is a simple header and doesn't have all the data a full header
might have, but I feel that the xml header will grow faster than the other
header, even if we don't use the mbackup namespace part.

Just something to consider...

-james
|
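For anyone who wants to redo the arithmetic above, here is a rough Python sketch. The 215 and 58 byte figures and the file count are taken from the message; the helper names are made up for illustration, and build_flat_header is only one reading of the "9 digit time, 2 digit length, field" layout, so its output may differ by a byte or two from the hand-typed sample.

```python
FILES = 306_050  # file count reported by `du -a | wc -l` in the message


def total_header_bytes(per_file_bytes: int, files: int = FILES) -> int:
    """Total header bytes when every file carries one header of this size."""
    return per_file_bytes * files


def build_flat_header(timestamp: int, hostname: str, filename: str) -> str:
    """Length-prefixed layout: time, then a 2-digit length before each field."""
    return f"{timestamp}{len(hostname):02d}{hostname}{len(filename):02d}{filename}"


# Per-file sizes as quoted in the message (not re-measured here).
xml_total = total_header_bytes(215)
flat_total = total_header_bytes(58)

print(f"xml headers:  {xml_total / 1e6:.1f} MB")
print(f"flat headers: {flat_total / 1e6:.1f} MB")
print(build_flat_header(966968837, "uhura.midnightlinux.com", "/usr/local/bin/foobar"))
```

Swapping in a different per-file size (say, a fuller header) shows how quickly the total grows with the file count.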
|
From: John H. <Jo...@mw...> - 2000-08-22 21:28:59
|
> The only thing I have against using xml as a tape format is the extra tape
> space that it would use. I wrote this quick example:
>
> <?xml version="1.0" ?>
> <mbackup:header>
> <mbackup:date>966968837</mbackup:date>
> <mbackup:filename>/usr/local/bin/foobar</mbackup:filename>
> <mbackup:hostname>uhura.midnightlinux.com</mbackup:hostname>
> </mbackup:header>
>
> I'm not sure if that's true valid xml, but I think it's close enough for
> this example. I also have this more traditional header format:
>
> 96696883724uhura.midnightlinux.com21usr/local/bin/foobar
>
> 9 digit time, 2 digit next field length, hostname, 2 digit next field
> length, filename.
>
> Both give the same information. The xml one is 215 bytes and the other one
> is 58 bytes.
>
> I also gathered some data that is typical of the data we use at work:
>
> [root@cadillac round5]# du -a|wc -l
> 306050
> [root@cadillac round5]# du -s
> 4070060 .
>
> 306,050 files using about 4 gigs of space.
> With the xml header, there is 65M of header data to label 4gig of data.
> With the other header, there is 17M of header data to label 4gig of data.
>
> Given this is a simple header and doesn't have all the data a full header
> might have, but I feel that the xml header will grow faster than the other
> header, even if we don't use the mbackup namespace part.
>
> Just something to consider...
>
> -james

I've just checked out our web/ftp server.

For /etc, which is classic for small files, 6390k / 1333 files = 4.8K on average.
Over the whole system, 5.5GB / 106639 files = 52.2KB.

I was thinking of a much more verbose and detailed header, so let's say 1k per file.
Nah, let's say 2k.

So for a full backup this is going to take 213MB.

But that's trivial, not worth worrying over! For a DDS-3 that's an extra 3 mins
of backup time. Of course, for a drive supporting compression, it's even smaller.

The advantages of expandability and flexibility remain huge.

Implementation wise, the client would create the XML header with the data and
pass it, locally or over a network, to the server process.
When it comes into the server it's parsed into a DOM object. (I don't see any
reason why we need a validating parser, though.)
Then a pointer to the DOM gets passed along with the data to the filters.

One possibility might be a compression filter that examines the DOM to see if
the data is in a compressed native format (with Netware 4+, for example, you can
open a compressed file and keep it compressed).
It might then compress the data and add a new element to the DOM

<mbackup:compression>gzip</mbackup:compression>

and at the point where it goes out to the device it gets converted to an XML document.

Regards
|
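To make the filter idea above a bit more concrete, here is a minimal Python sketch using the standard library's non-validating minidom parser. Several things are assumptions, not anything settled in this thread: the namespace URI is a placeholder (the mbackup prefix has to be bound to something before a namespace-aware parser will accept it), the filter name is invented, and "already compressed" is detected simply by the presence of a compression element.

```python
import gzip
from xml.dom import minidom

# Placeholder namespace URI; the thread never defines one for the mbackup prefix.
MBACKUP_NS = "urn:example:mbackup"

HEADER_XML = (
    f'<mbackup:header xmlns:mbackup="{MBACKUP_NS}">'
    "<mbackup:date>966968837</mbackup:date>"
    "<mbackup:filename>/usr/local/bin/foobar</mbackup:filename>"
    "<mbackup:hostname>uhura.midnightlinux.com</mbackup:hostname>"
    "</mbackup:header>"
)


def compression_filter(dom, data: bytes) -> bytes:
    """Compress data unless the header already records a compression method."""
    if dom.getElementsByTagNameNS(MBACKUP_NS, "compression"):
        return data  # something upstream already compressed it; pass through
    tag = dom.createElementNS(MBACKUP_NS, "mbackup:compression")
    tag.appendChild(dom.createTextNode("gzip"))
    dom.documentElement.appendChild(tag)
    return gzip.compress(data)


raw = b"example file contents " * 100
dom = minidom.parseString(HEADER_XML)   # server side: header text becomes a DOM
payload = compression_filter(dom, raw)  # filter sees the DOM alongside the data
tape_header = dom.toxml()               # serialized again just before the device
print(tape_header)
```

Each filter only touches the elements it cares about, which is the expandability argument in a nutshell; the cost is a DOM lookup where a struct field access used to be.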
|
From: James O'K. <jo...@mi...> - 2000-08-22 21:49:38
|
On Wed, 23 Aug 2000, John Huttley wrote:
> Implementation wise, the client would create the XML header with the data and
> pass it, locally or over a network, to the server process.

I'm not sure I understand this. Do you plan to change the file_tag struct
into a pointer to an xml formatted string? If that's the plan, I'm
curious about the overhead of making each module understand how to read
and add to the xml format. Right now, to change something in the file_tag
struct you can just do a file_tag->current_size = 100. If we change that
to xml, won't we have to file_tag->xml_set_current_size(100)? I'm not sure
I see how that is better.

On the other hand, if we're just talking about having the tape-writing
module create the xml header just as we write to tape then we're on the
same page.

I was also planning on using XML to talk between modules and the GUI such
as libglade and possibly using XML for talking between client and server,
but I'm not sure if this would be the same XML format as the data on tape.

-james
|
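The second option above (keep the struct in memory, emit XML only when writing to tape) could look roughly like this in Python. FileTag is a hypothetical stand-in for the C file_tag struct; only current_size is actually named in this thread, the other fields are borrowed from the earlier header example, and to_tape_header is an invented name.

```python
from dataclasses import dataclass
from xml.sax.saxutils import escape


@dataclass
class FileTag:
    """Hypothetical stand-in for the C file_tag struct."""
    filename: str
    hostname: str
    date: int
    current_size: int = 0


def to_tape_header(tag: FileTag) -> str:
    """Render the in-memory tag as an XML header at tape-writing time."""
    fields = {
        "date": tag.date,
        "filename": tag.filename,
        "hostname": tag.hostname,
        "current_size": tag.current_size,
    }
    body = "".join(
        f"<mbackup:{name}>{escape(str(value))}</mbackup:{name}>"
        for name, value in fields.items()
    )
    return f"<mbackup:header>{body}</mbackup:header>"


tag = FileTag("/usr/local/bin/foobar", "uhura.midnightlinux.com", 966968837)
tag.current_size = 100  # modules keep using plain field access
print(to_tape_header(tag))
```

With this split, only the tape-writing module needs to know about XML; every other module still does a simple field assignment.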
|
From: John H. <Jo...@mw...> - 2000-08-23 05:23:35
|
{{ Nick you know all about XML, is this reasonable??}}
----- Original Message -----
From: James O'Kane <jo...@mi...>
To: <mba...@li...>
Sent: Wednesday, 23 August 2000 09:49
Subject: Re: [mbackup-devel] xml as a tape header: the cons
> On Wed, 23 Aug 2000, John Huttley wrote:
> > Implementation wise, the client would create the XML header with the data and
> > pass it, locally or over a network, to the server process.
>
> I'm not sure I understand this. Do you plan to change the file_tag struct
> into a pointer to an xml formatted string? If that's the plan, I'm
> curious about the overhead of making each module understand how to read
> and add to the xml format. Right now, to change something in the file_tag
> struct you can just do a file_tag->current_size = 100. If we change that
> to xml, won't we have to file_tag->xml_set_current_size(100)? I'm not sure
> I see how that is better.
> On the other hand, if we're just talking about having the tape-writing
> module create the xml header just as we write to tape then we're on the
> same page.
Not quite. The existing file_tag only makes sense in the context of a unix
system. If the intent is to back up _data_ as against _files_, then a general
purpose interface is required.

However, this does not mean that we need to pass around an XML formatted string
and have every filter parse it and rewrite it. Your example shows it as the
worst case, because you have picked an attribute that is already supported in
the file_tag. A simple struct dereference is hard to beat!

There is certainly going to be overhead in having each filter understand DOM,
but it is, I think, survivable. It does not increase with the size of the
file, unlike compression, for example.

Suppose we were backing up a netware 3 file system.
Our header needs to look like this:
<stream_code>NETWARE 3</stream_code>   <!-- Actually I suppose we specify the DTD -->
<stream_id>3</stream_id>
<object_id>489894</object_id>
<!-- Quick ref to this file, makes subsequent headers much shorter -->
<length>1234567</length>
<owner>someone</owner>
<cdate>2000-03-01 23:45:01</cdate>   <!-- ISO standard dates. -->
<mdate>2000-08-21 20:45:01</mdate>
<!-- trustees are unlimited in number -->
<trustee type="user">someuser1<rights>RWCM</rights></trustee>
<trustee type="user">someuser2<rights>SRWCEMFA</rights></trustee>
<trustee type="group">somegroup1<rights>RF</rights></trustee>
<trustee type="group">admingroup1<rights>SRWCEMFA</rights></trustee>
<type>directory<IRmask>SRF</IRmask></type>
<attribute>DPR</attribute>
<server>MYSERVER</server>
<volume>VOL1</volume>
<path>home\john\documents</path>
<namespace>long
  <filename>mydatadirectory</filename>
  <!-- if resource fork size is a problem, we can write them out as data in their own block -->
  <OS2EA encoding="BASE64">
    kilobytes of base 64 data
  </OS2EA>
</namespace>
<namespace>mac
  <filename>mydatadirectory</filename>
  <resource_fork encoding="BASE64">
    even more kilobytes of base 64 data
  </resource_fork>
</namespace>
<namespace>DOS
  <filename>MYDAT~1</filename>
</namespace>
<block>
  <sequence>1</sequence>
  <length>524288</length>   <!-- we are writing out in 512kb blocks.. -->
  <offset>0</offset>
</block>
============
Then the next block can just have
<stream_code>NETWARE 3</stream_code>
<stream_id>3</stream_id>
<object_id>489894</object_id>
<block>
  <sequence>2</sequence>
  <length>524288</length>
  <offset>524288</offset>
</block>
=============
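As a rough illustration of the short per-block headers, the Python sketch below generates one for each 512 KB block of the 1234567 byte object used in the example. The function name is invented, it emits only the short form (not the full first header with trustees and namespaces), and it assumes the final block is simply shorter than the rest; nothing in this thread says whether short blocks would be padded instead.

```python
BLOCK_SIZE = 512 * 1024  # 524288 bytes, the block size used in the example


def block_headers(stream_code, stream_id, object_id, total_length,
                  block_size=BLOCK_SIZE):
    """Yield one short XML header per block, numbering sequences from 1."""
    offset = 0
    sequence = 1
    while offset < total_length:
        # Assumption: the last block carries whatever bytes remain, unpadded.
        length = min(block_size, total_length - offset)
        yield (
            f"<stream_code>{stream_code}</stream_code>\n"
            f"<stream_id>{stream_id}</stream_id>\n"
            f"<object_id>{object_id}</object_id>\n"
            "<block>\n"
            f"  <sequence>{sequence}</sequence>\n"
            f"  <length>{length}</length>\n"
            f"  <offset>{offset}</offset>\n"
            "</block>"
        )
        offset += length
        sequence += 1


headers = list(block_headers("NETWARE 3", 3, 489894, 1234567))
for header in headers:
    print(header)
    print("============")
```

Because every block repeats only the stream and object ids, the per-block cost stays small no matter how verbose the first header is.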
When this hits the filters, they may act on the data to change the size of the
blocks (compression etc.) and add additional elements to the header.

As you can see, there isn't much in there that matches the existing file_tag
structure. Netware 4 and higher is even worse.

XML/DOM headers are better in that we can do more with them. Not as simple or as
fast, though.
> I was also planning on using XML to talk between modules and the GUI such
> as libglade and possibly using XML for talking between client and server,
> but I'm not sure if this would be the same XML format as the data on tape.
Good old glade! I never did work it out. I'm sure, though, that the XML will be
quite different.
Regards
John
|