From: Kern S. <ke...@si...> - 2003-03-17 08:44:48
Hello Chuck,

On Sun, 2003-03-16 at 23:33, Chuck Hemker wrote:
> Sorry this took me so long to respond. I wanted to check a few things and
> then didn't get back to it.

No problem -- I'm not in a rush.

> On 07-Mar-03 Kern Sibbald wrote:
> > On Fri, 2003-03-07 at 14:48, Chuck Hemker wrote:
> >> 1. It would be nice if Bacula and its utilities supported
> >> write-protected tapes. I wanted to use the scan option of btape the
> >> other day to test a blocking problem, and the documentation said it
> >> was a dangerous command, so I write-protected the tape and it couldn't
> >> open it. Also, if I have a tape that I want to restore something from,
> >> I would like to be able to write-protect it.
> >
> > Well, the tools ARE supposed to open the tape read-only if they can. In
> > some cases, such as btape, the tape always needs to be opened
> > read/write because you can do both operations. Can you be specific
> > about which tools are broken? I'll attempt to fix them if I can.
>
> With btape, when I was debugging my tape blocking problem, I used the
> scan command to tell me what size the tape blocks were. However, btape
> can be a dangerous command, so I wanted to have the tape write-protected.
> For this command, I would say let it open a read-only tape, and have it
> error out with a tape write error if you attempt to write to it.

I'll note what you are asking for, but I don't think it is so easy to
change btape, so I'm not planning any change for the moment.

> I don't know if it's fixed in the CVS, but with 1.29 here are some
> examples.
>
> With bacula-sd:
>
> ../sbin/console
> Connecting to Director ibmpcserver325:9101
> 1000 OK: ibmpcserver325-dir Version: 1.29 (22 January 2003)
> *mount
> Using default Catalog name=MyCatalog DB=bacula
> The defined Storage resources are:
>      1: 8mmDrive
> Item 1 selected automatically.
> 3901 open device failed: ERR=dev.c:266 stored: unable to open device
> /dev/nst0: ERR=Read-only file system
> *
>
> With bls:
>
> ../sbin/bls -b test /dev/nst0
> bls: butil.c:143 Using device: /dev/nst0 for reading.
> bls: Fatal Error at dev.c:273 because:
> dev.c:266 stored: unable to open device /dev/nst0: ERR=Read-only file system
> bls: Fatal Error at device.c:227 because:
> dev open failed: dev.c:266 stored: unable to open device /dev/nst0:
> ERR=Read-only file system
> bls: bls Fatal error: butil.c:98 Cannot open /dev/nst0
>
> With btape:
>
> ../sbin/btape /dev/nst0
> Tape block granularity is 1024 bytes.
> btape: butil.c:143 Using device: /dev/nst0 for writing.
> btape: Fatal Error at dev.c:273 because:
> dev.c:266 stored: unable to open device /dev/nst0: ERR=Read-only file system
> btape: Fatal Error at device.c:227 because:
> dev open failed: dev.c:266 stored: unable to open device /dev/nst0:
> ERR=Read-only file system
> btape: btape Fatal error: butil.c:98 Cannot open /dev/nst0

Yes, I forgot about this case. You've set AlwaysOpen=yes. In that case,
Bacula must always keep the drive open, so it has no choice but to open
the tape read/write to handle both restores and backups. If you don't have
AlwaysOpen set, then Bacula does open the tape read-only for restores and
read/write for backups.

As for bls: this should be fixed in the CVS unless I made a programming
error. I haven't explicitly tested it, but I have added all the necessary
code so that bls now opens the tape read-only.
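For reference, the behavior discussed above (open read/write when the job
may need to write, fall back to read-only, and let any later write fail)
is easy to sketch. This is only an illustration -- the function name is
mine, not the actual dev.c code:

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Sketch only, not the real dev.c code.  Try to open a device
 * read/write; if that fails (e.g. EROFS on a write-protected tape or
 * EACCES on an unwritable file), fall back to read-only so that read
 * operations such as btape's scan still work.  Sets *read_only to 1
 * when the fallback was taken.  Returns the open fd, or -1 on error.
 */
int open_device_prefer_rw(const char *path, int *read_only)
{
   int fd = open(path, O_RDWR);
   *read_only = 0;
   if (fd < 0) {
      fd = open(path, O_RDONLY);   /* fall back to read-only */
      *read_only = 1;
   }
   return fd;
}
```

A later write() on the read-only descriptor then fails at write time,
which is essentially the "error out on a tape write" behavior you asked
for.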
> >> Then a smart restore-bsr-creating program could figure out what tape
> >> files needed to be restored, and the restore could fast forward to the
> >> nearest tape mark. What do you think?
> >
> > Yes, I have *always* planned to implement a maximum tape file size. In
> > fact, the record is already permitted in the Device resource (Maximum
> > File Size), but there is no code implemented. It shouldn't be hard, as
> > you indicate: just check the limit, which is stored in
> > dev->max_file_size, in block.c. The only trick is to implement the
> > jobmedia record update, which shouldn't be too hard -- just use the
> > same code as is done at the end of the Job.
> >
> > If you want to take a stab at this, I would really appreciate it.
> > Otherwise, it is on my list and I should get to it before 1.30 is
> > finished.
> >
> > I think the code in restore is already perfectly aware of the file
> > position, so it *should* automatically work; after all, restore needs
> > to know about changing Volumes, which also changes file numbers.

I think I already mentioned that MaximumFileSize is now implemented. I'm
not sure I have tested it, but all the code is in. It was in 1.29 too, but
there it could run one buffer over the size you specified; in 1.30, the
maximum file size is always less than or equal to the size you specify.

> >> This assumes:
> >> a. fast forwarding to tape marks is much faster than fast forwarding
> >>    over records.
> >> b. fast forwarding over records isn't that much faster than reading
> >>    them and discarding the data.
>
> For tapes, the reason I suggested adding a jobmedia record was that when
> I looked at what I needed for a fast restore of one file, I needed a list
> of all of the media and tape files that the file was on. My first thought
> was to record (either in the file table or another table) the tape file
> that a file was on. However, I realized that the jobmedia table had all
> of the information I needed (and not too much more).
> The only addition I needed was a jobmedia record for each intermediate
> tape file. Maybe this isn't the best way to do it, but it sounded good to
> me. Let me know what you think.

Currently Bacula does not keep track of where on the tape each file
resides. It does know where the job containing the file starts and ends on
the tape, so it forward spaces to the beginning of the job and begins
reading. As a consequence, adding another jobmedia record would not
currently help. The reason I did it that way is that the File record
(containing info on a particular file) is already the biggest consumer of
catalog space, and adding additional fields for each file is quite
expensive. This may be a project some day, but I find that a restore is
done only once in a blue moon, so optimization here would rarely be used.
If one needs restore speed, then it would probably be best to keep disk
archives for at least the critical period of a few days.

> By the way: one of the problems I've seen with Legato NetWorker and
> multi-session tapes was that if you're doing a large restore from a
> 4-session tape, it has to read 4 times the data off the tape to do the
> restore. What I was thinking long term was having the option to do some
> sort of disk caching on the storage server, so you could have a tape
> something like this:
>
>   tape mark
>   data from session 1
>   tape mark
>   data from session 2
>   tape mark
>   data from session 1
>   tape mark
>   data from session 2
>
> This way, to do a large restore, you could restore data, fast forward
> over a tape mark, restore data, fast forward over a tape mark, ...
>
> Just an idea I had.

Yes, this is also one of the things I am worried about in running multiple
simultaneous jobs -- the restore time will go up considerably. I have
planned a solution to the problem which contains most of the elements of
your idea, and is somewhat like NetWorker disk caching, yet different.
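Incidentally, the forward spacing to the start of a job mentioned above is
just the standard MTFSF tape operation. A rough sketch, assuming a
Linux-style tape driver; the helper name is mine, not actual Bacula code:

```c
#include <sys/ioctl.h>
#include <sys/mtio.h>

/*
 * Sketch: forward-space 'count' file marks on an open tape device, as
 * one would do to position to the start of a job (or, in Chuck's
 * scheme, to skip to the next intermediate tape mark) before reading.
 * Returns 0 on success, -1 on error (bad fd, not a tape, etc.).
 */
int fsf_device(int fd, int count)
{
   struct mtop op;
   op.mt_op = MTFSF;       /* forward space over file marks */
   op.mt_count = count;
   return ioctl(fd, MTIOCTOP, &op) < 0 ? -1 : 0;
}
```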
My idea is to add a new Migration job type to Bacula, which will copy all
the data for a specified job to different media, then modify the database
so that the old job is "purged" of file records and the new job contains
all the old file records. This is then a move of the job data from one
place to another. The physical job data of the original job will not be
removed or deleted, but the catalog will now point only to the second
"Migration" job.

Now what will this give us? First you do a backup to disk. Then you do a
Migration to tape. This leaves you with a disk image that you could then
delete if you want, and a tape image. The cool part of this is that the
tape will write at full speed, and one can arrange for it all to be in a
single file. This will also eliminate the problem certain shops have where
their backups stop because the end of tape is reached, and yet they cannot
mount a new tape and permit it to run during work hours.

> >> For backups to disk, I came up with several ideas, but I'm not sure
> >> how good any of them might be. If you want, I'll mention some of them.
> >
> > Yes, please do mention them.
>
> For disks:
>
> Assumptions:
> 1. It can position anywhere in the file quickly.

True.

> Other things worth thinking about:
>
> 2. To save space in the catalog, you could store internal media
>    addressing info (file positions, ...) in either a parallel file on
>    the backup disk or in a header of the backup file.

The exact byte address of the beginning of every Job *is* currently saved
in the catalog. It just is not currently used -- it needs only a little
bit of code (probably one or two lines to do the lseek()).

> 3. It might be better with multisession CDs or DVDs to have a directory
>    with different files on the disk for different backup runs (with a
>    max size for the directory).

Currently, if you are backing up to disk, it is possible and desirable to
have each job go to a different file -- most people currently do it this
way.
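The lseek() I mentioned really is about two lines. A sketch of positioning
a disk Volume to a job's saved start address; illustrative only, the
function name is mine, and addr would come from the catalog's JobMedia
record:

```c
#include <sys/types.h>
#include <unistd.h>

/*
 * Sketch: position an open disk Volume to the byte address of the
 * start of a job, as recorded in the catalog, instead of reading the
 * Volume from the beginning.  Returns 0 on success, -1 on error.
 */
int position_to_job(int fd, off_t addr)
{
   return lseek(fd, addr, SEEK_SET) == addr ? 0 : -1;
}
```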
However, there is currently no explicit way to set a max size for the
directory -- except that you can fix a max size for each file and fix the
maximum number of files, which in effect accomplishes the same thing. This
is already implemented in 1.29.

> This way someone could back up to the directory, write the first
> session, delete the file, back up again, write the second session, ...
> I have not played with multisession CDs yet, so I may not have the
> details right, but it's something to think about.

This is also currently possible. Just set a max file size and a max number
of files, and Bacula will stop and ask for a new Volume. Write the CD,
purge the file, and continue.

> Various options for backing up to disk (in no particular order):
>
> 1. Add a file position to the file table and record the file position of
>    the first block of the file.

This already exists.

> 2. Same as tapes, with a tape mark index file (a file with the tape mark
>    number and file position) on the disk.

Using the catalog with the Job start position, Bacula could simply lseek()
to the beginning of the job and start reading -- no problem. It has always
been planned -- just a question of priorities.

> 3. Separate the tape files into separate disk files?
> 4. Have an index file on the disk recording where each file is.
> 5. Or some combination of the above.

Given the low frequency of restores, any additional indexing scheme is,
for the moment, pretty low on my priorities. For tapes it could be VERY
useful and enormously speed up the restore time.

...

> >> 5. I created client and job records for my notebook the other day,
> >> and now "status all" hangs (and creates an error in messages) when
> >> the notebook is not connected. I wonder how difficult keeping track
> >> of which clients and storage servers are active would be -- either
> >> for a "status active" command or a "list active" command, so a GUI
> >> could see who's active and then do a status of only them.
> >
> > Yes, this is a problem; though the timeout is only 15 seconds, it can
> > be annoying. It should not hang indefinitely, though -- if it does,
> > I'd like to hear more. Any specific suggestions would be welcome.
> >
> > One feature that would be really nice would be to have some Schedule
> > that says try running this Job every 20 minutes if it fails to
> > connect. That would allow you to have portables, and if they were
> > connected to the network for 20 minutes or more, they would be backed
> > up; otherwise, there would be no errors reported. This could be handy
> > also for Windows machines that are frequently turned off. I've got it
> > on my list.
>
> I wasn't really thinking about starting jobs:
>
> 1. I haven't scheduled anything yet.
> 2. I probably won't schedule the notebook anyway. I'll probably just do
>    a run job by hand. I can because I'm the Bacula admin. :)
>
> For the future, maybe some sort of run_client command would help. When
> someone brings their notebook in, they could do something like:
>
>   run_client client-name now/later
>
> It would connect to the dird with a restricted password and tell it to
> either run the job now or as scheduled later. The dird would check to
> make sure it's allowed and either run it or schedule it for later. The
> client would disconnect, and the dird job would then connect to the
> filed when requested.

Yes, this is a good idea. I'll think a bit more about a simplified client
initiation of jobs. My current thinking is simply to use the Console --
you can run it from anywhere on the network.

> What I was thinking of (I'm used to the NetWorker status interface)
> would be to have a GUI client with status windows looking something like
> this:
>
>   ---------------------------------------
>   | dird status                         |
>   | job queue info                      |
>   | ...                                 |
>   ---------------------------------------
>   | tape drive status                   |
>   | ...                                 |
>   ---------------------------------------
>   | running client status               |
>   | ...                                 |
>   ---------------------------------------
>
> with updates every few (?) seconds. This would allow the admin to watch
> Bacula and keep an eye on what it's doing.
>
> One way to implement this would be to do "status all" and parse the
> output. However, what I noticed was that "status all" had to wait for
> clients that are not connected.
>
> I'm not sure what the best way to fix this would be. A few options I
> came up with were:
>
> 1. Have Bacula know which bacula-fd and bacula-sd it currently has jobs
>    running on, and implement a "status active" command. This would also
>    help text console users if they have a large number of clients.
> 2. Have the job info for running jobs displayed by status include which
>    bacula-fd and bacula-sd it's talking to, and have the client check
>    them individually.
> 3. Have the GUI do a .clients and a .storage (?) to get a list of things
>    to poll, then do a status of each individually, skipping most of the
>    time the ones that timed out the last time.

The current timeout for a connection from the Console is 15 seconds; that
can be set to zero if necessary. I think the solution to the problem is
simply either to periodically "poll" each device with a zero timeout, or
to fire off three threads that connect and then do a new form of the
status command that automatically repeats the status every x seconds. This
is a bit like the current Win32 client: if you right-click on the Bacula
tray icon and select Status, it produces a status dialog box that is
automatically refreshed every 5 seconds.

> It's just something to think about. It'll be a while before I would have
> a chance to think about implementing anything like this. And maybe it's
> just because I'm new to the program that when I'm setting it up I want
> to know what it's doing. :)
>
> By the way, in two places (there may be a few more) the storage status
> seems to be not as clear as it could be (and they are places where
> things take a while):
>
> 1. Positioning to the end of data, getting ready for an append:
>
>      Device /dev/nst0 is mounted with Volume bacula-8
>      Device is being initialized.
>      Total Bytes=184,837,126 Blocks=2,867 Bytes/block=64,470
>      Positioned at File=0 Block=0
>
>    Maybe something like:
>
>      Device is being positioned
>      Device is being positioned for append
>      Device is being positioned to file x
>
> 2. The other is during the rewind after hitting EOT. (Sorry, I don't
>    have the current message.) I noticed this because the tape drive's
>    in-use light stops flashing and it makes slightly different noises,
>    but Bacula hasn't yet sent the message to mount the next tape.

I agree; I have noted this as something to "refine".

Best regards,

Kern