From: Kern S. <ke...@si...> - 2015-04-25 05:50:57
In my last email I forgot to mention that, as you point out, the problem can
also result from a design issue. The resolution of problems that stem from
design issues falls under my point 2. If we have a good test case that shows
the problem, even if it results from a design decision, most of the time we
can find a solution -- in some cases we have added new directives, but in
most cases a bit more programming/logic can fix the problem.

One of the biggest issues I have with the current SD algorithm is that during
the drive reservation process (prior to starting the SD job), once a write
drive is assigned it cannot be changed. Changing a drive while multiple
simultaneous jobs are writing is a non-trivial problem. There are solutions,
but they require rather profound changes to the SD, which I have been
planning for at least 5 years -- all the underlying code and algorithms now
exist, so it is a matter of time.

Best regards,
Kern

On 24.04.2015 22:07, Josh Fisher wrote:
> I guess it is semantics, but I was just pointing out that it was not a
> coding issue, but rather a design issue/choice.
>
> You can divide the jobs into different pools and then give jobs in the same
> pool different priorities. The pools allow multiple jobs (from different
> pools) to run concurrently, while the priorities serialize the jobs within
> each pool. Far from desirable, but it does work.
>
> In any case, I agree that all of the ways of using multiple drives
> concurrently seem unwieldy. It would be nice if both device and volume
> assignment were done as a single atomic operation every time that a job
> selected a volume. In other words, when the job needs a volume, it looks
> for both an AVAILABLE volume and an AVAILABLE device at the same time, and
> only one job at a time can make a volume-device selection. That is easier
> said than done, of course.
>
> On 4/24/2015 1:09 PM, Clark, Patricia A. wrote:
>> To avoid hijacking the question and to address whether it's a bug or not:
>>
>> Why it's a bug - a new backup job requesting media that is unavailable
>> because it is already in use (whether for a backup or a recovery) is a bug
>> when other perfectly good media is available. One should not need to
>> create separate pools; otherwise you would need a separate pool for each
>> job to ensure this situation never happens. The real issue here is how and
>> when the communication happens between the director and the storage
>> daemon. If both of these jobs start within a short period of each other
>> (usually on the same schedule), that's when the second job will request
>> media that has already been assigned by the SD but not yet communicated to
>> the director before the second job starts. That gap is what creates the
>> contention for media. I have also had tapes pulled out from underneath a
>> job, resulting in a "NULL" volume name and failed jobs. So, if not
>> separate pools, then there is using separate schedules for each job, which
>> is also not desirable. I have used offset schedules for groups of jobs in
>> order to reduce the number of contentions. If nothing else, if media is
>> not available within a reasonable period of time of the request, the
>> director and/or the SD should decide to look for another volume.
>>
>> Patti Clark
>> Linux System Administrator
>> R&D Systems Support Oak Ridge National Laboratory
>>
>> On 4/24/15, 11:02 AM, "Josh Fisher" <jf...@pv...> wrote:
>>
>>> On 4/24/2015 9:14 AM, Clark, Patricia A. wrote:
>>>> This is a known bug that has been reported, but still exists.
>>>> The job wants the tape in use by another job that is using it in drive 0.
>>>
>>> I'm not convinced that this is a bug. By design, Bacula allows more than
>>> one job to simultaneously write to the same volume. When a job looks for
>>> the next volume to write on, it cannot exclude volumes that are already
>>> in use by another job. Note that this is not just at job startup, but any
>>> time a volume is needed. What causes the catch-22 is that each job is
>>> assigned a single device (tape drive) only once, at job startup. If two
>>> jobs, each writing to a different device, require the same volume, then
>>> one job must wait until the volume can be moved into its assigned device.
>>> So it is not a bug in the implementation, but rather a design choice.
>>>
>>> From the perspective of using a multiple-drive changer, it would seem
>>> that it is a bug to allow multiple jobs to simultaneously write to the
>>> same volume, but Bacula must work with all kinds of hardware. If the
>>> implementation were changed to disallow simultaneous writes to the same
>>> volume, then concurrent jobs with a single-drive changer would be
>>> impossible.
>>>
>>> Bacula does allow resolving this issue through the use of pools. By
>>> segregating jobs that are to be run concurrently into different pools,
>>> the situation where two jobs want the same volume at the same time is
>>> avoided altogether. So is this a bug, or is it a configuration error?
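
A minimal sketch of the pool segregation discussed in the quoted thread
(Josh's suggestion, and the "separate pools" option Patti mentions), with
each concurrently running job writing into its own pool so the two jobs
never request the same volume. The resource names (PoolA, PoolB, job-a,
job-b, client-a-fd, Autochanger, "Nightly") and the retention value are
illustrative assumptions, not taken from the thread; jobs that share a pool
would additionally be given distinct Priority values to serialize them, as
Josh describes.

    # Two pools, so volumes are never shared between the two job groups.
    Pool {
      Name = PoolA
      Pool Type = Backup
      Recycle = yes
      AutoPrune = yes
      Volume Retention = 30 days
    }

    Pool {
      Name = PoolB
      Pool Type = Backup
      Recycle = yes
      AutoPrune = yes
      Volume Retention = 30 days
    }

    # One job per pool; because each job draws volumes only from its own
    # pool, the two jobs can run concurrently on different drives without
    # contending for the same tape. Maximum Concurrent Jobs must also be
    # raised in the Director, Storage, and Client resources (omitted here)
    # for the jobs to actually overlap.
    Job {
      Name = "job-a"
      Type = Backup
      Client = client-a-fd
      FileSet = "SetA"
      Schedule = "Nightly"
      Storage = Autochanger
      Pool = PoolA
      Priority = 10
      Messages = Standard
    }

    Job {
      Name = "job-b"
      Type = Backup
      Client = client-b-fd
      FileSet = "SetB"
      Schedule = "Nightly"
      Storage = Autochanger
      Pool = PoolB
      Priority = 10
      Messages = Standard
    }

As Patti points out, taken to its extreme this means a separate pool for
every job, which is why the thread treats it as a workaround rather than a
fix.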
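
A sketch of the offset-schedule workaround Patti mentions above, staggering
start times so that the second group of jobs does not ask for media before
the SD's assignment for the first group has been communicated back to the
Director. The schedule names and run times are illustrative assumptions:

    Schedule {
      Name = "Nightly-Early"
      Run = Full 1st sun at 22:05
      Run = Incremental mon-sat at 22:05
    }

    Schedule {
      Name = "Nightly-Late"
      Run = Full 1st sun at 22:35
      Run = Incremental mon-sat at 22:35
    }

    # Point one group of jobs at "Nightly-Early" and the other at
    # "Nightly-Late" via each Job's Schedule directive.

As noted in the thread, staggering only reduces the number of contentions;
it does not close the reservation gap between the Director and the SD.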