From: Clark, P. A. <cl...@or...> - 2015-04-24 17:09:15
|
To avoid hijacking the question and to address whether it's a bug or not:

Why it's a bug: a new backup job requesting media that is unavailable because it is already in use, whether for a backup or a recovery, is a bug when other perfectly good media is available. One should not need to create separate pools; otherwise you will need a separate pool for each job to ensure this situation never happens. The real issue here is how and when the communication happens between the director and the storage daemon. If both jobs start within a short period of each other (usually on the same schedule), that's when the second job will request media that has already been assigned by the SD but not communicated to the director prior to the second job starting. That gap is what creates the contention for media. I have also had tapes pulled out from underneath a job, resulting in a "NULL" volume name and failed jobs. So, if not separate pools, then there's using separate schedules for each job, also not desirable. I have used offset schedules for groups of jobs in order to reduce the number of contentions. If nothing else, if media is not available within a reasonable period of the request, the director and/or the SD should decide to look for another.

Patti Clark
Linux System Administrator
R&D Systems Support Oak Ridge National Laboratory

On 4/24/15, 11:02 AM, "Josh Fisher" <jf...@pv...> wrote:

>On 4/24/2015 9:14 AM, Clark, Patricia A. wrote:
>> This is a known bug that has been reported, but still exists. The job
>> wants the tape in use by another job that is using it in drive 0.
>
>I'm not convinced that this is a bug. By design, Bacula allows more than
>one job to simultaneously write to the same volume. When a job looks for
>the next volume to write on, it cannot exclude volumes that are already
>in use by another job. Note that this is not just at job start up, but
>any time a volume is needed.
>What causes the catch-22 is that each job is assigned a single device
>(tape drive) only once at job start up. If two jobs, each writing to a
>different device, require the same volume, then one job must wait until
>the volume can be moved into its assigned device. So it is not a bug in
>the implementation, but rather a design choice.
>
>From the perspective of using a multiple drive changer it would seem
>that it is a bug to allow multiple jobs to simultaneously write to the
>same volume, but Bacula must work with all kinds of hardware. If the
>implementation were changed to disallow simultaneous writes to the same
>volume, then concurrent jobs with a single drive changer would be
>impossible.
>
>Bacula does allow resolving this issue through the use of pools. By
>segregating jobs that are to be run concurrently into different pools,
>the situation where two jobs want the same volume at the same time is
>avoided altogether. So is this a bug, or is it a configuration error?
>
>> Your options are:
>>
>> 1. Let it wait until the job(s) using the tape in drive 0 finishes.
>> The pitfall here is if the tape becomes full.
>> 2. Cancel the job(s) requesting the tape in drive 1. Don't restart
>> the job, but start a new job. It may or may not decide to use a
>> different tape.
>> 3. Cancel the job(s) using the tape in drive 0. Bacula should move
>> the tape from drive 0 to drive 1 once all of the connections to the
>> tape and drive have been released.
>> 4. If, for some strange reason, there are no jobs using the tape in
>> drive 0, try releasing drive 0 in bconsole - this will put the tape
>> back into its slot and Bacula should mount it for you.
>>
>> You may need to use a combination of #4 and one of the other options.
>> If none of the above corrects the issue, you may need to restart both
>> your director and storage daemons and start again.
>>
>> Patti Clark
>> Linux System Administrator
>> R&D Systems Support Oak Ridge National Laboratory
>>
>> From: More, Ankush <ank...@ca...>
>> Date: Friday, April 24, 2015 at 3:29 AM
>> To: Radosław Korzeniewski <rad...@ko...>
>> Cc: bacula-users <bac...@li...>
>> Subject: Re: [Bacula-users] Device is BLOCKED
>>
>> Hi,
>>
>> Yes, I tried to mount from bconsole-->mount, but the error is still the same.
>> I would appreciate it if someone could quickly help resolve this issue.
>>
>> Device "Drive-1" (/dev/nst1) open but no Bacula volume is currently mounted.
>> Device is BLOCKED waiting for mount of volume "NY5039L4",
>> Pool: Billable
>> Media type: LTO-4
>> Slot 1 is loaded in drive 0.
>> Total Bytes Read=0 Blocks Read=0 Bytes/block=0
>> Positioned at File=0 Block=0
>>
>> Thank you,
>> Ankush
>>
>> From: Radosław Korzeniewski [mailto:rad...@ko...]
>> Sent: 23 April 2015 20:07
>> To: More, Ankush
>> Cc: bac...@li...
>> Subject: Re: [Bacula-users] Device is BLOCKED
>>
>> Hello,
>>
>> 2015-04-23 13:28 GMT+02:00 More, Ankush <ank...@ca...>:
>> Hi Team,
>>
>> We have Bacula 7.x with a tape autochanger.
>> I am getting the error below in "status", and the backup stops (list jobs shows it as running).
>> I noticed when I run "/usr/libexec/bacula/mtx-changer" that tape "NY5039L4" is mounted in the drive.
>>
>> From Bacula's point of view, mtx-changer can show you that a tape is loaded, not mounted.
>>
>> Then why does Bacula show BLOCKED?
>> How do I resolve this issue?
>>
>> Bacula is asking you to mount a tape. Did you do this? You can mount a tape with the mount command in bconsole.
>>
>> Is there any parameter?
>>
>> Device "Drive-1" (/dev/nst1) is waiting for:
>> Volume: NY5216L4
>> Pool: Billable
>> Media type: LTO-4
>> Device is BLOCKED waiting for mount of volume "NY5039L4",
>> Pool: Billable
>> Media type: LTO-4
>> Slot 1 is loaded in drive 1.
>> Total Bytes Read=64,512 Blocks Read=1 Bytes/block=64,512
>> Positioned at File=0 Block=0
>>
>> Thank you,
>> Ankush
>>
>> This message contains information that may be privileged or confidential and is the property of the Capgemini Group. It is intended only for the person to whom it is addressed. If you are not the intended recipient, you are not authorized to read, print, retain, copy, disseminate, distribute, or use this message or any part thereof. If you receive this message in error, please notify the sender immediately and delete all copies of this message.
>>
>> ------------------------------------------------------------------------------
>> BPM Camp - Free Virtual Workshop May 6th at 10am PDT/1PM EDT
>> Develop your own process in accordance with the BPMN 2 standard
>> Learn Process modeling best practices with Bonita BPM through live exercises
>> http://www.bonitasoft.com/be-part-of-it/events/bpm-camp-virtual-event?utm_source=Sourceforge_BPM_Camp_5_6_15&utm_medium=email&utm_campaign=VA_SF
>> _______________________________________________
>> Bacula-users mailing list
>> Bac...@li...
>> https://lists.sourceforge.net/lists/listinfo/bacula-users
>>
>> --
>> Radosław Korzeniewski
>> rad...@ko...
>>
>> ------------------------------------------------------------------------------
>> One dashboard for servers and applications across Physical-Virtual-Cloud
>> Widest out-of-the-box monitoring support with 50+ applications
>> Performance metrics, stats and reports that give you Actionable Insights
>> Deep dive visibility with transaction tracing using APM Insight.
>> http://ad.doubleclick.net/ddm/clk/290420510;117567292;y
>> _______________________________________________
>> Bacula-users mailing list
>> Bac...@li...
>> https://lists.sourceforge.net/lists/listinfo/bacula-users
|
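For reference, Patti's option #4 (releasing the idle drive from bconsole) looks roughly like the following. This is only a sketch: the storage name "Autochanger" and the slot/drive numbers are illustrative, so substitute your own resource names.

```
# In bconsole: release drive 0 so the changer returns its tape to the
# slot, then mount the wanted volume in the drive that is waiting.
*release storage=Autochanger drive=0
*mount storage=Autochanger slot=1 drive=1
```

If the release succeeds, `status storage` should no longer show the device as BLOCKED once the mount completes.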
From: Josh F. <jf...@pv...> - 2015-04-24 20:06:38
|
I guess it is semantics, but I was just pointing out that it was not a coding issue, but rather a design issue/choice.

You can divide the jobs into different pools and then give jobs in the same pool different priorities. The pools allow multiple jobs (from different pools) to run concurrently, while the priorities serialize the jobs within each pool. Far from desirable, but it does work.

In any case, I agree that all of the ways of using multiple drives concurrently seem unwieldy. It would be nice if both device and volume assignment were done as a single atomic operation every time that a job selected a volume. In other words, when the job needs a volume, it looks for both an AVAILABLE volume and an AVAILABLE device at the same time, and only one job at a time can make a volume-device selection. That is easier said than done, of course.

On 4/24/2015 1:09 PM, Clark, Patricia A. wrote:
> [...]
|
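As a rough illustration of the pool/priority arrangement Josh describes: resource names below are invented, and only the directives relevant here are shown (a real Job also needs Client, FileSet, Schedule, etc.). Check the Priority semantics in your Director version: by default Bacula does not run jobs of different priorities concurrently, so this sketch follows Josh's description rather than guaranteeing concurrency.

```
# bacula-dir.conf sketch: jobs meant to run concurrently go into
# different pools, so they can never contend for the same volume.
Pool {
  Name = GroupA-Pool
  Pool Type = Backup
}
Pool {
  Name = GroupB-Pool
  Pool Type = Backup
}

# Jobs sharing a pool get distinct priorities so they serialize.
Job {
  Name = BackupHostA1
  Pool = GroupA-Pool
  Priority = 10
}
Job {
  Name = BackupHostA2
  Pool = GroupA-Pool
  Priority = 11
}
Job {
  Name = BackupHostB1
  Pool = GroupB-Pool
  Priority = 10
}
```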
From: Kern S. <ke...@si...> - 2015-04-25 05:50:57
|
In my last email I did forget to mention that, as you point out, the problem can also result from a design issue, and the resolution of such problems falls under my point 2. If we have a good test case that shows the problem, even if it results from a design decision, most of the time we can find a solution -- in some cases we have added new directives, but in most cases a bit more programming/logic can fix the problem.

One of the biggest issues that I have with the current SD algorithm is that during the drive reservation process (prior to starting the SD job), once a write drive is assigned, it cannot be changed. Changing a drive when multiple simultaneous jobs are writing is a non-trivial problem. There are solutions, but they require rather profound changes to the SD, which I have been planning for at least 5 years -- all the underlying code and algorithms now exist, so it is a matter of time.

Best regards,
Kern

On 24.04.2015 22:07, Josh Fisher wrote:
> [...]
|
From: Ana E. M. A. <emi...@gm...> - 2015-04-27 01:04:45
|
I'm glad to read such good news. Thank you, Kern.

I have been trying to understand this issue that a Bacula user has been facing. As Kern said, it is really difficult to replicate. We noticed that his backups worked fine for days and then suddenly a "DEVICE is blocked" appeared. Some details about his configuration:

1) 3 pools used by 20 or more concurrent jobs;
2) an autochanger with 10 drives (to avoid interleaving, each device was configured with Maximum Concurrent Jobs = 1);
3) jobs with different priorities and various scheduled times;
4) groups of jobs using different pools.

He noticed that he was having issues with slot mess. That is, before his backups started, the output from mtx-changer listall showed the media/slot information as it was in Bacula's Catalog. Then, after a day of backup jobs running, mtx-changer listall showed different information from the Catalog.

The issue here seemed to be the autochanger timeout configuration. He had an autochanger with a 900-second timeout, so we set the maximum changer/rewind/open wait directives to 900 seconds and adjusted the mtx-changer script accordingly. It seems that this solved the problem with the slot mess.

We thought that this was also causing the "DEVICE is blocked" issue, but we cannot confirm that for now.

He also made some schedule and pool modifications. Now all the jobs have the same priority and the same schedule, and will use just one pool on a given day.

We are going to monitor this new configuration, and maybe we can post the results here.

Best regards,
Ana

On Sat, Apr 25, 2015 at 2:50 AM, Kern Sibbald <ke...@si...> wrote:
> In my last email, I did forget to mention that as you point out, the
> problem can also result from a design issue. And the resolution of
> those problems from design issues fall into my point 2.
If we have a
> good test case that shows the problem, even if it results from a design
> decision, most of the time we can find a solution -- in some cases, we
> have added new directives, but in most cases, a bit more
> programming/logic can fix the problem.
> [...]
|
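The wait directives Ana mentions are set per Device in bacula-sd.conf. A sketch matching the 900-second changer (the device name, media type, and device path are examples):

```
# bacula-sd.conf sketch: raise Bacula's changer-related timeouts to at
# least the library's own worst-case load/unload time (here 900 s).
Device {
  Name = Drive-1
  Media Type = LTO-4
  Archive Device = /dev/nst1
  Autochanger = yes
  Maximum Changer Wait = 900
  Maximum Rewind Wait = 900
  Maximum Open Wait = 900
}
```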
From: Kern S. <ke...@si...> - 2015-04-27 06:07:27
|
<html> <head> <meta content="text/html; charset=utf-8" http-equiv="Content-Type"> </head> <body bgcolor="#FFFFFF" text="#000000"> <div class="moz-cite-prefix">Hello Ana,<br> <br> One of the big race conditions that is not yet solved, because it takes a major rewrite, that is waiting on me having some free time is the case where two jobs attempt to use the same drive at the same time for different Volumes. This leads to a BLOCKED condition on one of the jobs until the other job finishes.<br> <br> The workaround for that problem is for jobs that can contend for the same drive but use different Volumes (pools), ensure that they do not all start at the same time. That is if you start 50-100 jobs at the same time, and there are 20 that run concurrently in the SD, then you increase the changes of a initial drive assignment conflict.<br> <br> If instead you start those jobs with 1-2 minute intervals, you will not have that particular issue. Generally, it just requires slightly different schedules.<br> <br> Best regards,<br> Kern<br> <br> On 27.04.2015 03:04, Ana Emília M. Arruda wrote:<br> </div> <blockquote cite="mid:CAA...@ma..." type="cite"> <div dir="ltr"> <div><br> </div> <div> <div class="gmail_default" style="font-family:tahoma,sans-serif">I'm glad to read so good news. Thank you Kern.</div> </div> <div><br> </div> <div> <div class="gmail_default" style="font-family:tahoma,sans-serif">I have been trying to understand this issue that a Bacula user has been facing. As Kern said, it is really difficult to replicate it. We noticed that his backups worked fine for days and suddenly a "DEVICE is blocked" appeared. 
Some details about his configuration:</div> </div> <div class="gmail_default" style="font-family:tahoma,sans-serif"><br> </div> <div class="gmail_default" style="font-family:tahoma,sans-serif">1) 3 pools being used by 20 or more concurrent jobs;</div> <div class="gmail_default" style="font-family:tahoma,sans-serif">2) an autochanger with 10 drives (to avoid interleaving, each device was configured with maximum concurrent jobs = 1)</div> <div class="gmail_default" style="font-family:tahoma,sans-serif">3) jobs with different priorities and various scheduled times.</div> <div class="gmail_default" style="font-family:tahoma,sans-serif">4) groups of jobs using different pools</div> <div class="gmail_default" style="font-family:tahoma,sans-serif"><br> </div> <div class="gmail_default" style="font-family:tahoma,sans-serif">He noticed that he was having issues with slot mess. That is, before his backups started, he had the output from mtx-changer listall showing the media/slots information as it was in Bacula's Catalog. Then, after a day of backup jobs run he noticed that mtx-changer listall show different information from the Catalog. </div> <div class="gmail_default" style="font-family:tahoma,sans-serif"><br> </div> <div class="gmail_default" style="font-family:tahoma,sans-serif">The issue here seemed to be the autochanger timeout configuration. He had an autochanger with a 900 seconds timeout. So we configured the maximum changer/rewind/open wait directives configured for 900 seconds and the mtx-changer script. It seems that this solved the problem with the slot mess.</div> <div class="gmail_default" style="font-family:tahoma,sans-serif"><br> </div> <div class="gmail_default" style="font-family:tahoma,sans-serif">We thought that this was causing the issue with DEVICE is blocked. 
But we cannot confirme that by now.<br> </div> <div class="gmail_default" style="font-family:tahoma,sans-serif"><br> </div> <div class="gmail_default" style="font-family:tahoma,sans-serif">Also he did some schedules and pools modifications. Now all the jobs have the same priority, same time schedule and will use just one pool in a specific day.</div> <div class="gmail_default" style="font-family:tahoma,sans-serif"><br> </div> <div class="gmail_default" style="font-family:tahoma,sans-serif">We are going to monitor this new configuration and maybe we can post here the results.</div> <div class="gmail_default" style="font-family:tahoma,sans-serif"><br> </div> <div class="gmail_default" style="font-family:tahoma,sans-serif">Best regards,</div> <div class="gmail_default" style="font-family:tahoma,sans-serif">Ana</div> </div> <div class="gmail_extra"><br> <div class="gmail_quote">On Sat, Apr 25, 2015 at 2:50 AM, Kern Sibbald <span dir="ltr"><<a moz-do-not-send="true" href="mailto:ke...@si..." target="_blank">ke...@si...</a>></span> wrote:<br> <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">In my last email, I did forget to mention that as you point out, the<br> problem can also result from a design issue. And the resolution of<br> those problems from design issues fall into my point 2. If we have a<br> good test case that shows the problem, even if it results from a design<br> decision, most of the time we can find a solution -- in some cases, we<br> have added new directives, but in most cases, a bit more<br> programming/logic can fix the problem.<br> <br> One of the biggest issues that I have with the current SD algorithm is<br> that during the drive(s) reservation process (prior to starting the SD<br> job) once a write drive is assigned, it cannot be changed. Changing a<br> drive when multiple simultaneous jobs are writing is a non-trivial<br> problem. 
There are solutions, but they require rather profound changes<br> to the SD, which I have been planning for at least 5 years -- all the<br> underlying code and algorithms now exist so it is a matter of time.<br> <br> Best regards,<br> Kern<br> <div class="HOEnZb"> <div class="h5"><br> On <a moz-do-not-send="true" href="tel:24.04.2015%2022" value="+12404201522">24.04.2015 22</a>:07, Josh Fisher wrote:<br> > I guess it is semantics, but I was just pointing out that it was not a<br> > coding issue, but rather a design issue/choice.<br> ><br> > You can divide the jobs into different pools and then give jobs in the<br> > same pools different priorities. The pools allow multiple jobs (from<br> > different pools) to run concurrently, while the priorities serialize the<br> > jobs within each pool. Far from desirable, but it does work.<br> ><br> > In any case, I agree that all of the ways of using multiple drives<br> > concurrently seem unwieldy. It would be nice if both device and volume<br> > assignment were done as a single atomic operation every time that a job<br> > selected a volume. In other words, when the job needs a volume, it looks<br> > for both an AVAILABLE volume and an AVAILABLE device at the same time,<br> > and only one job at a time can make a volume-device selection. That is<br> > easier said than done, of course.<br> ><br> > On 4/24/2015 1:09 PM, Clark, Patricia A. wrote:<br> >> To avoid hijacking the question and to address whether it's a bug or not:<br> >><br> >> Why it's a bug - request for media that is unavailable because it is<br> >> already in use whether for a backup or recovery by a new backup job is a<br> >> bug when other perfectly good media is available. One should not need to<br> >> create separate pools otherwise you will need a separate pool for each job<br> >> to ensure this situation never happens. The real issue here is how and<br> >> when the communication happens between the director and the storage<br> >> daemon. 
If both of these jobs start within a short period of each other<br> >> (usually on the same schedule), that's when the second job will request<br> >> media that has already been assigned by the SD, but not communicated to<br> >> the director prior to the second job starting. That gap is what creates<br> >> the contention for media. I have also had tapes pulled out from<br> >> underneath a job resulting in "NULL" volume name and failed jobs. So, if<br> >> not separate pools, then there's using separate schedules for each job,<br> >> also not desirable. I have used offset schedules for groups of jobs in<br> >> order to reduce the number of contentions. If nothing else, if media is<br> >> not available within a reasonable period of time of the request, the<br> >> director and/or the SD should decide to look for another.<br> >><br> >> Patti Clark<br> >> Linux System Administrator<br> >> R&D Systems Support Oak Ridge National Laboratory<br> >><br> >><br> >><br> >> On 4/24/15, 11:02 AM, "Josh Fisher" <<a moz-do-not-send="true" href="mailto:jf...@pv...">jf...@pv...</a>> wrote:<br> >><br> >>> On 4/24/2015 9:14 AM, Clark, Patricia A. wrote:<br> >>>> This is a known bug that has been reported, but still exists. The job<br> >>>> wants the tape in use by another job that is using it in drive 0.<br> >>> I'm not convinced that this is a bug. By design, Bacula allows more than<br> >>> one job to simultaneously write to the same volume. When a job looks for<br> >>> the next volume to write on, it cannot exclude volumes that are already<br> >>> in use by another job. Note that this is not just at job start up, but<br> >>> any time a volume is needed. What causes the catch-22 is that each job<br> >>> is assigned a single device (tape drive) only once at job start up. If<br> >>> two jobs, each writing to a different device, require the same volume,<br> >>> then one job must wait until the volume can be moved into its assigned<br> >>> device.
So it is not a bug in the implementation, but rather a design<br> >>> choice.<br> >>><br> >>> From the perspective of using a multiple drive changer it would seem<br> >>> that it is a bug to allow multiple jobs to simultaneously write to the<br> >>> same volume, but Bacula must work with all kinds of hardware. If the<br> >>> implementation were changed to disallow simultaneous writes to the same<br> >>> volume, then concurrent jobs with a single drive changer would be<br> >>> impossible.<br> >>><br> >>> Bacula does allow resolving this issue through the use of pools. By<br> >>> segregating jobs that are to be run concurrently into different pools,<br> >>> the situation where two jobs want the same volume at the same time is<br> >>> avoided altogether. So is this a bug, or is it a configuration error?<br> >>><br> >>><br> ><br> > ------------------------------------------------------------------------------<br> > One dashboard for servers and applications across Physical-Virtual-Cloud<br> > Widest out-of-the-box monitoring support with 50+ applications<br> > Performance metrics, stats and reports that give you Actionable Insights<br> > Deep dive visibility with transaction tracing using APM Insight.<br> > <a moz-do-not-send="true" href="http://ad.doubleclick.net/ddm/clk/290420510;117567292;y" target="_blank">http://ad.doubleclick.net/ddm/clk/290420510;117567292;y</a><br> > _______________________________________________<br> > Bacula-users mailing list<br> > <a moz-do-not-send="true" href="mailto:Bac...@li...">Bac...@li...</a><br> > <a moz-do-not-send="true" href="https://lists.sourceforge.net/lists/listinfo/bacula-users" target="_blank">https://lists.sourceforge.net/lists/listinfo/bacula-users</a><br> ><br> </div> </div> </blockquote> </div> <br> </div> </blockquote> <br> </body> </html> |
From: Josh F. <jf...@pv...> - 2015-04-27 13:17:06
|
On 4/25/2015 1:50 AM, Kern Sibbald wrote: > In my last email, I did forget to mention that, as you point out, the > problem can also result from a design issue. And the resolution of > those problems from design issues falls into my point 2. If we have a > good test case that shows the problem, even if it results from a design > decision, most of the time we can find a solution -- in some cases, we > have added new directives, but in most cases, a bit more > programming/logic can fix the problem. > > One of the biggest issues that I have with the current SD algorithm is > that during the drive(s) reservation process (prior to starting the SD > job) once a write drive is assigned, it cannot be changed. Changing a > drive when multiple simultaneous jobs are writing is a non-trivial > problem. There are solutions, but they require rather profound changes > to the SD, which I have been planning for at least 5 years -- all the > underlying code and algorithms now exist so it is a matter of time. Thank you, Kern. That is good news! Have you considered using a single device-volume pair assignment, rather than both a device assignment and a separate volume assignment? I have found that the easiest way to avoid thread-related issues is to minimize the number of things that must be serialized. Since a job, at any given instant, will always require both a device and a volume, it might make sense to assign both at the same time as a single atomic operation. The device-volume pair assignment code can be serialized by a single mutex, and I believe that would greatly simplify the device and volume assignment code, as well as allow for changing a job's device in a safe manner. Any time that a job requires a volume to write on, whether at job start-up or at the end of the previous volume, it requests a device-volume pair to continue writing on. Since only one job at a time can enter the assignment code, both device and volume state are guaranteed to be static while checking device and volume criteria, making a device-volume pair selection, and unloading/loading the device as needed. In turn, a successful request guarantees that the device-volume pair returned is valid for the job, and an unsuccessful request guarantees that the job needs to wait for an appendable volume. I believe that treating device and volume as a single unit would greatly simplify the assignment code. A single mutex for device-volume pairing should eliminate any chance of a race condition. |
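[Editorial note: the atomic device-volume pairing described above can be sketched in a few lines. The following is an illustrative toy model only, not Bacula source code; every name in it (Device, Volume, request_pair, release_pair) is hypothetical.]

```python
# Sketch of a single-mutex, atomic device-volume pairing. This is a toy
# simulation, NOT Bacula code: all class and function names are invented.
import threading
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Device:
    name: str
    loaded: Optional[str] = None   # name of the volume currently loaded
    in_use: bool = False

@dataclass
class Volume:
    name: str
    appendable: bool = True
    in_use: bool = False

_pairing_lock = threading.Lock()   # the single mutex serializing selection

def request_pair(devices: List[Device],
                 volumes: List[Volume]) -> Optional[Tuple[Device, Volume]]:
    """Atomically select an available device AND an appendable, unused volume.

    Because only one thread at a time can hold _pairing_lock, device and
    volume state cannot change between the checks and the assignment, so a
    successful return is guaranteed valid, and None means the job must wait.
    """
    with _pairing_lock:
        dev = next((d for d in devices if not d.in_use), None)
        vol = next((v for v in volumes if v.appendable and not v.in_use), None)
        if dev is None or vol is None:
            return None            # caller waits and retries later
        if dev.loaded != vol.name:
            dev.loaded = vol.name  # stands in for a real unload/load of media
        dev.in_use = vol.in_use = True
        return dev, vol

def release_pair(dev: Device, vol: Volume) -> None:
    """Return a device-volume pair to the free pool, under the same mutex."""
    with _pairing_lock:
        dev.in_use = vol.in_use = False
```

In this model the catch-22 discussed in the thread cannot arise: no job is ever handed a volume that is busy on another job's device, because volume and device availability are checked and claimed inside the same critical section.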
From: Kern S. <ke...@si...> - 2015-04-25 05:43:18
|
Your analysis of the situation sounds correct to me. The big problems for developers are: 1. Race conditions such as you mention are difficult to reproduce. If we have a script that will reproduce it every time, or nearly every time, it is relatively easy, though sometimes a lot of work, to fix the problem. 2. Yes, as you point out, the Dir and the SD should look for another drive/volume. However, again, this needs a script to duplicate the problem; in addition, this is more a development issue than a bug (though that could be disputed), and thus it is a question of priorities and of finding someone with the desire and time to program. One of the good things about Bacula Systems is that paying customers report many, if not most, of these problems, and in those cases either the customer is willing to produce a script that reproduces the problem or the Bacula Systems support team does so, so these problems are being fixed over time. In each Bacula community release (the next in June-July), *all* of the Bacula Enterprise bug/race-condition fixes are backported to the community, as well as many of the new Enterprise features. So the situation is not as bad as it may at first appear (at least in my opinion). Best regards, Kern On 24.04.2015 19:09, Clark, Patricia A. wrote: > To avoid hijacking the question and to address whether it's a bug or not: > > Why it's a bug - request for media that is unavailable because it is > already in use whether for a backup or recovery by a new backup job is a > bug when other perfectly good media is available. One should not need to > create separate pools; otherwise you will need a separate pool for each job > to ensure this situation never happens. The real issue here is how and > when the communication happens between the director and the storage > daemon.
If both of these jobs start within a short period of each other > (usually on the same schedule), that's when the second job will request > media that has already been assigned by the SD, but not communicated to > the director prior to the second job starting. That gap is what creates > the contention for media. I have also had tapes pulled out from > underneath a job resulting in "NULL" volume name and failed jobs. So, if > not separate pools, then there's using separate schedules for each job, > also not desirable. I have used offset schedules for groups of jobs in > order to reduce the number of contentions. If nothing else, if media is > not available within a reasonable period of time of the request, the > director and/or the SD should decide to look for another. > > Patti Clark > Linux System Administrator > R&D Systems Support Oak Ridge National Laboratory > > > > On 4/24/15, 11:02 AM, "Josh Fisher" <jf...@pv...> wrote: > >> On 4/24/2015 9:14 AM, Clark, Patricia A. wrote: >>> This is a known bug that has been reported, but still exists. The job >>> wants the tape in use by another job that is using it in drive 0. >> I'm not convinced that this is a bug. By design, Bacula allows more than >> one job to simultaneously write to the same volume. When a job looks for >> the next volume to write on, it cannot exclude volumes that are already >> in use by another job. Note that this is not just at job start up, but >> any time a volume is needed. What causes the catch-22 is that each job >> is assigned a single device (tape drive) only once at job start up. If >> two jobs, each writing to a different device, require the same volume, >> then one job must wait until the volume can be moved into its assigned >> device. So it is not a bug in the implementation, but rather a design >> choice.
>> >> From the perspective of using a multiple drive changer it would seem >> that it is a bug to allow multiple jobs to simultaneously write to the >> same volume, but Bacula must work with all kinds of hardware. If the >> implementation were changed to disallow simultaneous writes to the same >> volume, then concurrent jobs with a single drive changer would be >> impossible. >> >> Bacula does allow resolving this issue through the use of pools. By >> segregating jobs that are to be run concurrently into different pools, >> the situation where two jobs want the same volume at the same time is >> avoided altogether. So is this a bug, or is it a configuration error? >> >> >>> Your options are: >>> >>> 1. Let it wait until the job(s) using the tape in drive 0 finishes. >>> The pitfall here is if the tape becomes full. >>> 2. Cancel the job(s) requesting the tape in drive 1. Don't restart >>> the job, but start a new job. It may or may not decide to use a >>> different tape. >>> 3. Cancel the job(s) using the tape in drive 0. Bacula should move >>> the tape from drive 0 to drive 1 once all of the connections to the tape >>> and drive have been released. >>> 4. If, for some strange reason, there are no jobs using the tape in >>> drive 0, try releasing drive 0 in bconsole - this will put the tape back >>> into its slot and Bacula should mount it for you. >>> >>> You may need to use a combination of #4 and one of the other options. >>> If none of the above corrects the issue, you may need to restart both >>> your director and storage daemons and start again.
>>> >>> Patti Clark >>> Linux System Administrator >>> R&D Systems Support Oak Ridge National Laboratory >>> >>> From: <More>, Ankush >>> <ank...@ca...<mailto:ank...@ca...>> >>> Date: Friday, April 24, 2015 at 3:29 AM >>> To: Radosław Korzeniewski >>> <rad...@ko...<mailto:rad...@ko...>> >>> Cc: bacula-users >>> <bac...@li...<mailto:bac...@li...urceforge >>> .net>> >>> Subject: Re: [Bacula-users] Device is BLOCKED >>> >>> Hi, >>> >>> Yes, I tried to mount from bconsole-->mount, but the error is still the same. >>> I would appreciate it if someone could quickly help to resolve this issue. >>> >>> Device "Drive-1" (/dev/nst1) open but no Bacula volume is currently >>> mounted. >>> Device is BLOCKED waiting for mount of volume "NY5039L4", >>> Pool: Billable >>> Media type: LTO-4 >>> Slot 1 is loaded in drive 0. >>> Total Bytes Read=0 Blocks Read=0 Bytes/block=0 >>> Positioned at File=0 Block=0 >>> >>> Thank you, >>> Ankush >>> From: Radosław Korzeniewski [mailto:rad...@ko...] >>> Sent: 23 April 2015 20:07 >>> To: More, Ankush >>> Cc: >>> bac...@li...<mailto:bac...@li...urceforge. >>> net> >>> Subject: Re: [Bacula-users] Device is BLOCKED >>> >>> Hello, >>> >>> 2015-04-23 13:28 GMT+02:00 More, Ankush >>> <ank...@ca...<mailto:ank...@ca...>>: >>> Hi Team, >>> >>> We have bacula 7.x with a tape auto-changer. >>> I am getting the below error in "status" and the backup stops ("list jobs" shows it >>> as running). >>> I notice that when I run "/usr/libexec/bacula/mtx-changer", tape >>> "NY5039L4" is mounted in the drive. >>> >>> From Bacula's point of view, mtx-changer can show you that a tape is >>> loaded, not mounted. >>> >>> Then why does Bacula show BLOCKED? >>> How do I resolve this issue? >>> >>> Bacula is asking you to mount a tape. Did you do this? You can mount a >>> tape with the mount command in bconsole. >>> >>> Is there any parameter?
>>> >>> Device "Drive-1" (/dev/nst1) is waiting for: >>> Volume: NY5216L4 >>> Pool: Billable >>> Media type: LTO-4 >>> Device is BLOCKED waiting for mount of volume "NY5039L4", >>> Pool: Billable >>> Media type: LTO-4 >>> Slot 1 is loaded in drive 1. >>> Total Bytes Read=64,512 Blocks Read=1 Bytes/block=64,512 >>> Positioned at File=0 Block=0 >>> >>> Thank you, >>> Ankush >>> >>> -- >>> Radosław Korzeniewski >>> rad...@ko...<mailto:rad...@ko...> |
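[Editorial note: the offset-schedule workaround Patricia describes earlier in the thread can be expressed with two Schedule resources in bacula-dir.conf. The resource names and times below are invented for illustration; the exact Run directive syntax should be checked against the Bacula Director documentation for your version.]

```
# Hypothetical bacula-dir.conf fragment: two job groups offset by 30
# minutes so they do not request media at the same instant.
Schedule {
  Name = "NightlyGroupA"
  Run = Full 1st sun at 23:05
  Run = Incremental mon-sat at 23:05
}

Schedule {
  Name = "NightlyGroupB"       # offset to reduce media contention
  Run = Full 1st sun at 23:35
  Run = Incremental mon-sat at 23:35
}
```

Each Job resource then points at one of the two schedules via its Schedule directive. As noted in the thread, this reduces the window for contention but does not eliminate it.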