From: Chuck H. <n2...@am...> - 2003-03-16 22:34:12
Sorry it took me so long to respond. I wanted to check a few things and then
didn't get back to it.

On 07-Mar-03 Kern Sibbald wrote:
> On Fri, 2003-03-07 at 14:48, Chuck Hemker wrote:
>> 1. It would be nice if bacula and its utilities would support
>> write-protected tapes. I wanted to use the scan option of btape the other
>> day to test a blocking problem, and the documentation said it was a
>> dangerous command, so I write protected the tape and then it couldn't open
>> it. Also, if I put a tape in that I want to restore something off of, I
>> would like to be able to write protect it.
>
> Well, the tools ARE supposed to open the tape read-only if they can.
> In some cases such as btape, it always needs it read/write because you
> can do both operations. Can you be specific about what tools are
> broken? I'll attempt to fix them if I can.

With btape, when I was debugging my tape blocking problem I used the scan
command in btape to tell me what size the tape blocks were. However, btape
can be a dangerous command, so I wanted to have the tape write protected.
For this command, I would say let it open a read-only tape, and have it
error out on a tape write error if you attempt to write to it.

I don't know if it's fixed in the CVS, but with 1.29 here are some examples.

With the console (the error comes from bacula-sd):

    ../sbin/console
    Connecting to Director ibmpcserver325:9101
    1000 OK: ibmpcserver325-dir Version: 1.29 (22 January 2003)
    *mount
    Using default Catalog name=MyCatalog DB=bacula
    The defined Storage resources are:
         1: 8mmDrive
    Item 1 selected automatically.
    3901 open device failed: ERR=dev.c:266 stored: unable to open device /dev/nst0: ERR=Read-only file system
    *

With bls:

    ../sbin/bls -b test /dev/nst0
    bls: butil.c:143 Using device: /dev/nst0 for reading.
    bls: Fatal Error at dev.c:273 because:
    dev.c:266 stored: unable to open device /dev/nst0: ERR=Read-only file system
    bls: Fatal Error at device.c:227 because:
    dev open failed: dev.c:266 stored: unable to open device /dev/nst0: ERR=Read-only file system
    bls: bls Fatal error: butil.c:98 Cannot open /dev/nst0

With btape:

    ../sbin/btape /dev/nst0
    Tape block granularity is 1024 bytes.
    btape: butil.c:143 Using device: /dev/nst0 for writing.
    btape: Fatal Error at dev.c:273 because:
    dev.c:266 stored: unable to open device /dev/nst0: ERR=Read-only file system
    btape: Fatal Error at device.c:227 because:
    dev open failed: dev.c:266 stored: unable to open device /dev/nst0: ERR=Read-only file system
    btape: btape Fatal error: butil.c:98 Cannot open /dev/nst0

>> 2. I was thinking about trying to speed up restores of small numbers of
>> files. I noticed that for tapes most of the information is already in the
>> catalog. However, it would be nice to have an option to set the maximum
>> size of a tape file. When it hit that limit, it would write a tape mark
>> and a JobMedia record and then continue. Then a smart bsr-creating
>> restore program could figure out which tape files needed to be restored,
>> and the restore could fast forward to the nearest tape mark. What do you
>> think?
>
> Yes, I have *always* planned to implement a maximum tape file size. In
> fact, the record is already permitted in the Device resource (Maximum
> File Size), but there is no code implemented. It shouldn't be hard as
> you indicate, just check the limit which is stored in dev->max_file_size
> in block.c. The only trick is to implement the jobmedia() record
> update, which shouldn't be too hard -- just use the same code as is
> done at the end of the Job.
>
> If you want to take a stab at this, I would really appreciate it.
> Otherwise, it is on my list and I should get to it before 1.30 is
> finished.
> I think the code in restore is already perfectly aware of the
> file position, so it *should* automatically work; after all, restore
> needs to know about changing Volumes, which also changes file numbers.
>
>> This assumes:
>> a. fast forwarding to tape marks is much faster than fast forwarding
>> records.
>> b. fast forwarding records isn't that much faster than reading them and
>> discarding the data.

For tapes, the reason I suggested adding a JobMedia record is that when I
looked at what I needed to do a fast restore of one file, I needed a list
of all of the media and tape files that the file was on. My first thought
was to record (either in the File table or another table) the tape file
that each file was on. However, I realized that the JobMedia table had all
of the information I needed (and not much more). The only addition needed
was a JobMedia record for the intermediate tape files. Maybe this isn't the
best way to do it, but it sounded good to me. Let me know what you think.

By the way: one of the problems I've seen with Legato Networker and
multi-session tapes is that if you're doing a large restore from a
four-session tape, it has to read four times the data off the tape to do
the restore. What I was thinking long term was having the option to do some
sort of disk caching on the storage server so you could have a tape
something like this:

    tape mark
    data from session 1
    tape mark
    data from session 2
    tape mark
    data from session 1
    tape mark
    data from session 2

This way, to do a large restore, you could restore data, fast forward a
tape mark, restore data, fast forward a tape mark, ... Just an idea I had.

>> For backups to disk, I came up with several ideas, but I'm not sure how
>> good any of them might be. If you want I'll mention some of them.
>
> Yes, please do mention them.

For disks:

Assumptions:
1. It can position anywhere in the file quickly.

Other things worth thinking about:
2. To save space in the catalog, you could store internal media addressing
info (file positions, ...) in either a parallel file on the backup disk or
in a header of the backup file.
3. It might be better with multi-session CDs or DVDs to have a directory
with different files on the disk for different backup runs (with a maximum
size for the directory). This way someone could back up to the directory,
write the first session, delete the file, back up again, write the second
session, ... I have not played with multi-session CDs yet, so I may not
have the details right, but it's something to think about.

Various options for backing up to disk (in no particular order):
1. Add a file position to the File table and record the file position of
the first block for each file.
2. Same as tapes, with a tape mark index file (a file listing tape mark
numbers and file positions) on the disk.
3. Separate the tape files into separate disk files.
4. Have an index file on the disk recording where each file is.
5. Or some combination of the above.

>> 3. I was thinking of writing a more intelligent restore user interface.
>> For getting the information from the catalog, I was wondering if it
>> would be better to talk to the catalog directly using the routines in
>> the cats directory or to use the sqlquery command in the console?
>
> Until now, I have kept all the Catalog code in the Director (with the
> exception of dbcheck and bscan). This is because at some point I would
> like to add user level security and access. If we have code spread
> everywhere this will be more difficult. The other advantage is that
> any code you add to the Director is automatically available to both
> the tty console program and the GNOME program. The major disadvantage
> is it increases the size of the code -- however, compared to Networker
> the Bacula Director is really tiny.
:-) I was thinking that breaking the restore interface at this point was
going to be messy, but thinking about it again, it sounds like it might
work well.

>> 4. When you're working on the console, you might want to remember that
>> at some point in time, someone might want to write a GUI interface that
>> talks to it. Non-interactive commands and easy-to-parse output would
>> make that easier.
>
> Yes, I have had this in mind from the beginning, and I have two
> solutions. One is to set some sort of "batch" or non-interactive flag
> to eliminate unwanted queries. The second is I attempt (in most cases)
> to provide full command lines so that no prompt will be issued, and
> finally, something that is not used yet is that in addition to all
> the standard commands such as "list ..." there is already built in
> a set of .xxx commands, which are meant ONLY to be used by a GUI
> interface. They are as yet very incomplete, but if you type
> .jobs, you will get a simple list of all Job names, likewise for
> .filesets, .clients. Also, if you do .die, the director segment
> faults (used for testing the automatic traceback). Take a look
> at <bacula-source>/src/dird/ua_dotcmds.c for more details. This
> is the primary solution I had planned for the problem you mention.

I didn't realize they were there. As long as the rest of the needed
commands have a way to specify everything together, it should work well.

>> 5. I created client and job records for my notebook the other day, and
>> now "status all" hangs (and logs an error in messages) when the notebook
>> is not connected. I wonder how difficult keeping track of which clients
>> and storage servers are active would be, either for a "status active"
>> command or a "list active", so a GUI could see who's active and then do
>> a status of only them.
>
> Yes, this is a problem, though the timeout is only 15 seconds, it can be
> annoying. It should not hang indefinitely though. If it does, I'd like
> to hear more.
> Any specific suggestions would be welcome.
>
> One feature that would be really nice would be to have some Schedule
> that says try running this Job every 20 minutes if it fails to connect.
> That would allow you to have portables, and if they were connected to
> the network for 20 minutes or more, they would be backed up; otherwise,
> there would be no errors reported. This could be handy also for Windows
> machines that are frequently turned off. I've got it on my list.

I wasn't really thinking about starting jobs:
1. I haven't scheduled anything yet.
2. I probably won't schedule the notebook anyway. I'll probably just do a
run job by hand. I can because I'm the bacula admin. :)

For the future, maybe some sort of run_client command would help. When
someone brings their notebook in, they could do something like:

    run_client client-name now/later

It would connect to the dird with a restricted password and tell it to
either run the job now or run it as scheduled later. The dird would check
that it's allowed and either run it or schedule it for later. The client
disconnects, and the dird job would then connect to the filed when
requested.

What I was thinking of (I'm used to the Networker status interface) would
be a GUI client with status windows looking something like this:

    ---------------------------------------
    | dird status                         |
    | job queue info                      |
    | ...                                 |
    ---------------------------------------
    | tape drive status                   |
    | ...                                 |
    ---------------------------------------
    | running client status               |
    | ...                                 |
    ---------------------------------------

with updates every few (?) seconds. This would allow the admin to watch
bacula and keep an eye on what it's doing.

One way to implement this would be to do "status all" and parse the
output. However, what I noticed was that "status all" had to wait for
clients that are not connected. I'm not sure what the best way to fix this
would be. A few options I came up with were:
1. Have bacula know which bacula-fd and bacula-sd daemons it has jobs
currently running on and implement a "status active" command. This would
also help text console users if they have a large number of clients.
2. Have the job info for running jobs displayed by status include which
bacula-fd and bacula-sd each job is talking to, and have the client check
them individually.
3. Have the GUI do a .clients and a .storage (?), get a list of things to
poll, and then do a status of each individually, skipping most of the time
the ones that timed out the last time.

It's just something to think about. It'll be a while before I would have a
chance to think about implementing anything like this. And maybe it's just
because I'm new to the program that when I'm setting it up I want to know
what it's doing. :)

By the way, in two places (there may be a few more) the storage status
seems to be less clear than it could be (and they are places where things
take a while):

1. Positioning to end of data, getting ready for an append:

    Device /dev/nst0 is mounted with Volume bacula-8
    Device is being initialized.
    Total Bytes=184,837,126 Blocks=2,867 Bytes/block=64,470
    Positioned at File=0 Block=0

Maybe something like:

    Device is being positioned
    Device is being positioned for append
    Device is being positioned to file x

2. The other is during the rewind after hitting EOT. (Sorry, I don't have
the current message.) I noticed this because the tape drive's in-use light
stops flashing and the drive makes slightly different noises, but bacula
hasn't yet sent the message to mount the next tape.