From: <pm...@ci...> - 2008-03-29 06:13:38
On Fri, Mar 28, 2008 at 02:27:38PM +0100, Kern Sibbald wrote:
! Thanks for the patch. Though I did it slightly differently, I have
! implemented the concept in the current development trunk and in the 2.2.9
! beta release.
! PS: If you get a chance, I would appreciate it if you would try the 2.2.9-b3
! and make sure the problem does not exist any more.

Oh wow, thanks for the info. And good timing this is - I just finished most of the issues with my OS upgrade (same game: bugs to fix and to report). :)

2.2.9-b3 is installed and running, and at first glance it looks good. As far as I can see, my first patch (rev. 6377) did not yet make it into this release.

Now concerning this one: the problem has not appeared so far, and from reading the source I would not expect it to. You made a more drastic change, stopping *all* non-writing jobs from updating the media.storageid column - I like it better this way; it is more consistent. (Since I did not know precisely what this column is used for, I tried to make the smallest possible modification. If you open this up to the most straightforward solution, then that is just great. :))

But, well, I am not really sure whether I should already speak up about this - though staying silent might also be wrong. At the moment I do not have any evidence yet, and it might all be the weirdness of strange coincidences. To put it short: it looks like the SD has caught some kind of Alzheimer's; it has difficulties remembering which Volume is in which drive.

Just telling the story as it is: since we found the problem with the concurrency counters going negative in jobq.c, my installation started to become real fun - meaning it started to work about the way I think it should. Then I rebuilt the whole installation (OS plus applications) of my backend cluster to a new version, plus a major tidy-up of everything. In the end I reinstalled Bacula (2.2.8) and ran all the necessary jobs plus a big migration.
It ran for nearly a day, because the machine was still loaded with compiling other applications, so there were hundreds of scheduled jobs during the migration; if there still were a conflict, it should have shown up. Actually there is one I know about, and we will have to look at it in due time. But apart from that, it worked its way through smoothly and unravelled everything cleanly.

Today, I ran a small migration with 2.2.9-b3 as a test, and I got half a dozen failed scheduled jobs from conflicting mounts in the autochanger, and even a completely stalled drive that needed manual intervention. It seems not to recognize when a volume that it wants to use in one drive is already mounted in another drive. (*) The good thing is: it recovered, rescheduled the failed jobs, and did not need a restart.

(*) The virtual autochanger script that you supply ('disk-changer') does not care about conflicting mounts - it will happily mount the same Volume into two drives at the same time (and if one access is a read and the other a write, that may even work). My script denies this - like a real autochanger would.

Basically, that problem was already present before - but it appeared rather seldom, and I was still trying to figure out the exact circumstances. So this might just be a strange coincidence. I am sorry that right now I do not have the time to work my way through the SVN changes looking for something that might match my observation. So I only speak up, with the plea to take this "with a grain of salt", as it might well be a non-issue.

rgds,
PMc
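PS: In case it helps to see what I mean by "denies this", the idea in my changer wrapper is roughly the following. This is only a simplified sketch of the check, not the actual script: the state directory, file layout, and function names are invented here for illustration (a real changer script must of course also implement the loaded/slots/etc. commands that the SD expects).

```shell
#!/bin/sh
# Sketch: refuse to load a volume slot that is already loaded in
# another drive. One state file per drive; its content is the slot
# number currently loaded there. (Layout invented for illustration.)

STATE_DIR="${STATE_DIR:-/tmp/changer-state}"
mkdir -p "$STATE_DIR"

# slot_in_use SLOT DRIVE
# Prints the number of another drive already holding SLOT and returns
# success; returns failure if no other drive holds it.
slot_in_use() {
    slot="$1"; drive="$2"
    for f in "$STATE_DIR"/drive-*; do
        [ -f "$f" ] || continue
        other="${f##*drive-}"
        [ "$other" = "$drive" ] && continue
        if [ "$(cat "$f")" = "$slot" ]; then
            echo "$other"
            return 0
        fi
    done
    return 1
}

# do_load SLOT DRIVE - record the mount, or refuse a conflicting one.
do_load() {
    slot="$1"; drive="$2"
    if holder=$(slot_in_use "$slot" "$drive"); then
        echo "ERROR: slot $slot already loaded in drive $holder" >&2
        return 1   # a real autochanger would refuse here, too
    fi
    echo "$slot" > "$STATE_DIR/drive-$drive"
}

# do_unload DRIVE - forget the mount.
do_unload() {
    drive="$1"
    rm -f "$STATE_DIR/drive-$drive"
}
```

So a second "load" of the same slot into a different drive fails instead of silently producing two mounts of one Volume, which is what triggers the conflicts described above when the supplied disk-changer is used.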