From: Kern S. <ke...@si...> - 2008-02-10 09:12:47
|
Hello, Nice investigation work! This is certainly an interesting bug. Fortunately, the only apparent consequences are that getting to the first migration record is slow (i.e. the index is not properly setup). In principle, this bug shouldn't exist because migration uses all the same output routines that a backup uses -- except that as you point out, a migration job skips creating session start/end records because they are simply copied. Your preliminary patch probably corrects a good number of problems, but it also suffers from two things that I can see: 1. dev can change during execution of the code if the device is switched. Therefore, at this top level routine, it must be reloaded each time it is referenced -- this points out a bug that I should declare it volatile in the dcr packet to prevent the compiler from being too smart in optimization. 2. the same bookkeeping that is done in creating a session start label is done in the session end label routine to get an accurate end of all the records. If this is not done properly, it will be more serious in that Bacula might stop prematurely during a restore. The cleanest solution that I can think for this problem is to make the migration code call the session label subroutine, and let the subroutine sort it out. I.e. we need to ensure that for migration jobs the subroutine doesn't actually write a record this can probably be best done by adding a new argument that tells it to write or not. Another possibility would be to move the index bookkeeping code out of the session label routine -- it would make it less likely for such problems to occur in the future at the expense of some additional complications ... What do you think? On Sunday 10 February 2008 04.52:26 Peter Much wrote: > Abstract: > --------- > Migration jobs do not insert the jobmedia.startfile attribute > in the catalog. > > > Impact: > ------- > Very slow restore. > > > Example: > -------- > bacula=> select jobid,name,type,jobfiles,poolid,priorjobid from job > where jobid=1728 or jobid=1729 or jobid=1748 or jobid=1749 > order by jobid; > jobid | name | type | jobfiles | poolid | priorjobid > -------+--------------------+------+----------+--------+------------ > 1728 | HomeDirsDisp | B | 4398 | 5 | 0 > 1729 | HomeDirsDisp | M | 4398 | 2 | 0 > 1748 | MoveToTapeInternal | g | 0 | 5 | 0 > 1749 | HomeDirsDisp | B | 4398 | 5 | 1729 > (4 rows) > > -> job 1728 saves directly to pool 5. Job 1729 saves the same data > to pool 2, and migrating job 1748/1749 moves it on to pool 5. > > bacula=> select jobid,firstindex,lastindex,startfile,endfile, > startblock,endblock from jobmedia > where jobid=1728 or jobid=1729 or jobid=1748 or jobid=1749 > order by jobid; > jobid | firstindex | lastindex | startfile | endfile | startblock | > endblock > -------+------------+-----------+-----------+---------+------------+------- >---- 1728 | 1 | 4398 | 55 | 55 | 0 | > 3466 1729 | 1 | 4398 | 0 | 0 | 217 | > 223662044 1749 | 1 | 4398 | 0 | 56 | 0 > | 3466 (3 rows) > > -> Job 1728 sets startfile attribute, migrate job 1749 sets just 0. > > > Comment: > -------- > 1. I have tried to fix the StartFile issue, so that my restores > run better. I have not further checked how the other data are > derived on Migration and if there may be more problems. > Usually I do such, but I was tired at that point. ;) Yes, it looks like you fixed the StartFile issue, but the same problem may exist on the end file position. > > 2. The existence of jcr->dcr is checked twice in that function, > once right after entry, and once at label "ok_out:". Could it > be something "stealing" this object inflight? The DCR should have a fixed address and should not be stolen inflight. The bulk of the code was copied from append.c, so it is possible that there are some inefficiencies. > If so, then my earlier using of "dev" as a shorthand for > jcr->dcr->dev is illegitim, and things need to be written > more cumbersomely. In many routines, I set dev at the top of the subroutine and use it as shorthand. However, at this high a level, one cannot do that because dev can change during the backup if the device changes. E.g. when reading files that were written to two devices, dev will change, and (currently not implemented) during backup if the end of a tape is reached and Bacula decides to change drives, dev will also change. > (I have obviousely not gone into the layout of the whole > software conception already; therefore these patches should > be considered more as suggestions to where the problem seems > to arise, and not so much as readymade fixes!) Well, it is a very good start. Would you please submit a bug report on this with basically the text that you wrote here (you can obviously change it taking into account what I have written, if you wish) -- that will guarantee that the problem is resolved and ensure that everyone who is interested sees it ... Best regards, Kern > > Patch: > ------ > --- src/stored/mac.c.orig Wed Oct 3 13:36:47 2007 > +++ src/stored/mac.c Sun Feb 10 04:28:54 2008 > @@ -89,7 +89,7 @@ > } > > Dmsg3(200, "Found %d volumes names for %s. First=%s\n", > jcr->NumReadVolumes, - jcr->VolList->VolumeName, Type); > + Type, jcr->VolList->VolumeName); > > /* Ready devices for reading and writing */ > if (!acquire_device_for_read(jcr->read_dcr) || > @@ -98,8 +98,26 @@ > goto bail_out; > } > > - Dmsg2(200, "===== After acquire pos %u:%u\n", jcr->dcr->dev->file, > jcr->dcr->dev->block_num); + dev = jcr->dcr->dev; > + Dmsg2(200, "===== After acquire pos %u:%u\n", dev->file, > dev->block_num); > > + /* On MIGRATE we must setup some dcr jobmedia data. On BACKUP > + * this is done in write_session_label(). > + */ > + switch(jcr->JobType) { > + case JT_MIGRATE: > + if (dev->is_tape()) { > + jcr->dcr->StartBlock = dev->block_num; > + jcr->dcr->StartFile = dev->file; > + } else { > + jcr->dcr->StartBlock = (uint32_t)dev->file_addr; > + jcr->dcr->StartFile = (uint32_t)(dev->file_addr >> 32); > + } > + break; > + default: > + break; > + } > + > set_jcr_job_status(jcr, JS_Running); > dir_send_job_status(jcr); > > @@ -117,7 +135,6 @@ > > ok_out: > if (jcr->dcr) { > - dev = jcr->dcr->dev; > if (ok || dev->can_write()) { > /* Flush out final partial block of this session */ > if (!write_block_to_device(jcr->dcr)) { > > ------------------------------------------------------------------------- > This SF.net email is sponsored by: Microsoft > Defy all challenges. Microsoft(R) Visual Studio 2008. > http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/ > _______________________________________________ > Bacula-devel mailing list > Bac...@li... > https://lists.sourceforge.net/lists/listinfo/bacula-devel |