From: Kern S. <ke...@si...> - 2015-04-25 05:50:57
In my last email I forgot to mention that, as you point out, the problem can
also result from a design issue. The resolution of problems that stem from
design issues falls under my point 2. If we have a good test case that shows
the problem, even if it results from a design decision, most of the time we
can find a solution -- in some cases we have added new directives, but in
most cases a bit more programming/logic can fix the problem.

One of the biggest issues I have with the current SD algorithm is that during
the drive reservation process (prior to starting the SD job), once a write
drive is assigned it cannot be changed. Changing a drive while multiple
simultaneous jobs are writing is a non-trivial problem. There are solutions,
but they require rather profound changes to the SD, which I have been
planning for at least 5 years -- all the underlying code and algorithms now
exist, so it is a matter of time.

Best regards,
Kern

On 24.04.2015 22:07, Josh Fisher wrote:
> I guess it is semantics, but I was just pointing out that it was not a
> coding issue, but rather a design issue/choice.
>
> You can divide the jobs into different pools and then give jobs in the same
> pool different priorities. The pools allow multiple jobs (from different
> pools) to run concurrently, while the priorities serialize the jobs within
> each pool. Far from desirable, but it does work.
>
> In any case, I agree that all of the ways of using multiple drives
> concurrently seem unwieldy. It would be nice if both device and volume
> assignment were done as a single atomic operation every time that a job
> selected a volume. In other words, when the job needs a volume, it looks
> for both an AVAILABLE volume and an AVAILABLE device at the same time, and
> only one job at a time can make a volume-device selection. That is easier
> said than done, of course.
>
> On 4/24/2015 1:09 PM, Clark, Patricia A. wrote:
>> To avoid hijacking the question and to address whether it's a bug or not:
>>
>> Why it's a bug - a new backup job requesting media that is unavailable
>> because it is already in use (whether for a backup or a recovery) is a bug
>> when other perfectly good media is available. One should not need to
>> create separate pools; otherwise you would need a separate pool for each
>> job to ensure this situation never happens. The real issue here is how and
>> when the communication happens between the director and the storage
>> daemon. If both of these jobs start within a short period of each other
>> (usually on the same schedule), that's when the second job will request
>> media that has already been assigned by the SD but not yet communicated to
>> the director before the second job starts. That gap is what creates the
>> contention for media. I have also had tapes pulled out from underneath a
>> job, resulting in a "NULL" volume name and failed jobs. So, if not
>> separate pools, then there is using separate schedules for each job, which
>> is also not desirable. I have used offset schedules for groups of jobs in
>> order to reduce the number of contentions. If nothing else, if media is
>> not available within a reasonable period of time of the request, the
>> director and/or the SD should decide to look for another volume.
>>
>> Patti Clark
>> Linux System Administrator
>> R&D Systems Support Oak Ridge National Laboratory
>>
>> On 4/24/15, 11:02 AM, "Josh Fisher" <jf...@pv...> wrote:
>>
>>> On 4/24/2015 9:14 AM, Clark, Patricia A. wrote:
>>>> This is a known bug that has been reported, but still exists.
>>>> The job wants the tape in use by another job that is using it in drive 0.
>>>
>>> I'm not convinced that this is a bug. By design, Bacula allows more than
>>> one job to simultaneously write to the same volume. When a job looks for
>>> the next volume to write on, it cannot exclude volumes that are already
>>> in use by another job. Note that this is not just at job startup, but any
>>> time a volume is needed. What causes the catch-22 is that each job is
>>> assigned a single device (tape drive) only once, at job startup. If two
>>> jobs, each writing to a different device, require the same volume, then
>>> one job must wait until the volume can be moved into its assigned device.
>>> So it is not a bug in the implementation, but rather a design choice.
>>>
>>> From the perspective of using a multiple-drive changer, it would seem
>>> that it is a bug to allow multiple jobs to simultaneously write to the
>>> same volume, but Bacula must work with all kinds of hardware. If the
>>> implementation were changed to disallow simultaneous writes to the same
>>> volume, then concurrent jobs with a single-drive changer would be
>>> impossible.
>>>
>>> Bacula does allow resolving this issue through the use of pools. By
>>> segregating jobs that are to be run concurrently into different pools,
>>> the situation where two jobs want the same volume at the same time is
>>> avoided altogether. So is this a bug, or is it a configuration error?
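
A minimal sketch of the pool segregation discussed in the quoted thread
(Josh's suggestion, and the "separate pools" option Patti mentions), with
each concurrently running job writing into its own pool so the two jobs
never request the same volume. The resource names (PoolA, PoolB, job-a,
job-b, client-a-fd, Autochanger, "Nightly") and the retention value are
illustrative assumptions, not taken from the thread; jobs that share a pool
would additionally be given distinct Priority values to serialize them, as
Josh describes.

    # Two pools, so volumes are never shared between the two job groups.
    Pool {
      Name = PoolA
      Pool Type = Backup
      Recycle = yes
      AutoPrune = yes
      Volume Retention = 30 days
    }

    Pool {
      Name = PoolB
      Pool Type = Backup
      Recycle = yes
      AutoPrune = yes
      Volume Retention = 30 days
    }

    # One job per pool; because each job draws volumes only from its own
    # pool, the two jobs can run concurrently on different drives without
    # contending for the same tape. Maximum Concurrent Jobs must also be
    # raised in the Director, Storage, and Client resources (omitted here)
    # for the jobs to actually overlap.
    Job {
      Name = "job-a"
      Type = Backup
      Client = client-a-fd
      FileSet = "SetA"
      Schedule = "Nightly"
      Storage = Autochanger
      Pool = PoolA
      Priority = 10
      Messages = Standard
    }

    Job {
      Name = "job-b"
      Type = Backup
      Client = client-b-fd
      FileSet = "SetB"
      Schedule = "Nightly"
      Storage = Autochanger
      Pool = PoolB
      Priority = 10
      Messages = Standard
    }

As Patti points out, taken to its extreme this means a separate pool for
every job, which is why the thread treats it as a workaround rather than a
fix.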
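
A sketch of the offset-schedule workaround Patti mentions above, staggering
start times so that the second group of jobs does not ask for media before
the SD's assignment for the first group has been communicated back to the
Director. The schedule names and run times are illustrative assumptions:

    Schedule {
      Name = "Nightly-Early"
      Run = Full 1st sun at 22:05
      Run = Incremental mon-sat at 22:05
    }

    Schedule {
      Name = "Nightly-Late"
      Run = Full 1st sun at 22:35
      Run = Incremental mon-sat at 22:35
    }

    # Point one group of jobs at "Nightly-Early" and the other at
    # "Nightly-Late" via each Job's Schedule directive.

As noted in the thread, staggering only reduces the number of contentions;
it does not close the reservation gap between the Director and the SD.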