From: Mantis B. T. <no...@bu...> - 2010-04-26 18:20:54
A NOTE has been added to this issue.
======================================================================
http://bugs.bacula.org/view.php?id=1553
======================================================================
Reported By:                mnalis
Assigned To:
======================================================================
Project:                    bacula
Issue ID:                   1553
Category:                   Storage Daemon
Reproducibility:            have not tried
Severity:                   crash
Priority:                   normal
Status:                     feedback
======================================================================
Date Submitted:             2010-04-21 09:02 BST
Last Modified:              2010-04-26 19:20 BST
======================================================================
Summary:                    SD crashed running several jobs
Description:
The SD was running several full jobs, and it crashed. Attached are produced traceback and bactrace files.
======================================================================
Relationships       ID      Summary
----------------------------------------------------------------------
has duplicate       0001561 Storage Daemon crashes sporadically.
======================================================================

----------------------------------------------------------------------
 (0005285) kern (administrator) - 2010-04-21 11:45
 http://bugs.bacula.org/view.php?id=1553#c5285
----------------------------------------------------------------------
No, this is not related to bug 1530 as far as I can tell.

Did you run a restore or multiple restores sometime just prior to the crash? If so, I have a fix, if not it will require more work.

----------------------------------------------------------------------
 (0005286) mnalis (reporter) - 2010-04-21 12:03
 http://bugs.bacula.org/view.php?id=1553#c5286
----------------------------------------------------------------------
Yes, one restore was running (made on host rigel, restoring to host birdun), and I canceled it at some time (as it was not restoring correct files, I'm just about to file another bug for that).
The backups seemed to continue running after the cancel, but I didn't wait or check in depth (so it is possible that the problems started there already).

----------------------------------------------------------------------
 (0005287) kern (administrator) - 2010-04-21 13:09
 http://bugs.bacula.org/view.php?id=1553#c5287
----------------------------------------------------------------------
Unfortunately my first message was not correct. I do not have a fix for this, and it is a problem that will take a lot of investigation and time to fix.

----------------------------------------------------------------------
 (0005288) mnalis (reporter) - 2010-04-21 13:34
 http://bugs.bacula.org/view.php?id=1553#c5288
----------------------------------------------------------------------
Oh, OK. Do you perhaps know if it is related to restores after all (that is, if we do not do restores, the SD will not crash)? If so, we can restart bacula after doing restores in order to ensure backups will run (as a workaround, not the solution).

Let me know if I can do something to give you more helpful data in case it crashes again (apart from new bactrace and traceback files).

----------------------------------------------------------------------
 (0005291) kern (administrator) - 2010-04-21 18:13
 http://bugs.bacula.org/view.php?id=1553#c5291
----------------------------------------------------------------------
My first message was incorrect, so I have no reason to believe that it is related to restores.
----------------------------------------------------------------------
 (0005298) kern (administrator) - 2010-04-21 22:40
 http://bugs.bacula.org/view.php?id=1553#c5298
----------------------------------------------------------------------
"Let me know if I can do something to give you more helpful data in case it crashes again"

You must have some configuration that is different from other users since no one else has reported this problem, so:
- you could start by attaching your bacula-dir.conf and your bacula-sd.conf. We don't need to see *all* the jobs defined but at least one that is representative so we can determine if you have some unusual directive set.
- how often does this happen, and how many jobs are you usually running?
- about the only other way to resolve it is to have a log of every block() and unblock() in the SD, but that would require adding more code in the lock manager or possibly turning on some existing lock tracing code, and as I said before it requires a significant effort ...

----------------------------------------------------------------------
 (0005303) mnalis (reporter) - 2010-04-22 14:26
 http://bugs.bacula.org/view.php?id=1553#c5303
----------------------------------------------------------------------
- Our configs are somewhat big... They are attached in conf_20100422.tgz (with passwords/IPs changed). There are several different classes of jobs; you can take clients.d/rigel.conf, clients.d/VPU/vpu-zrinski-fd.conf and clients.d/ares.conf as representatives of most classes.
- this is our second crash of the SD in recent versions of bacula, and the first of this type, so I would say we don't have much statistical data yet (the first SD crash, reported in bug#1530, is probably a different bug as you said). I will report more if/when they happen. We have about 150 jobs running every night, 7 active LTO drives, and run the jobs in high concurrency (director concurrency set to 60, data spooling enabled, SD concurrency 6 per LTO drive)
- I understand.
Anyway, if you decide to give us the patch and/or options for extra debugging, we can put it in and let you know what it says when/if the crash repeats. Or, we can just wait and see how often the SD crashes and then you can decide if it is worth chasing those crashes or if it only happens once in a blue moon and there are more important issues to fix. Or maybe our configs will give you a clue about the possible culprit.

Let us know how you want us to proceed, and thanks for your efforts.

----------------------------------------------------------------------
 (0005323) mnalis (reporter) - 2010-04-26 09:55
 http://bugs.bacula.org/view.php?id=1553#c5323
----------------------------------------------------------------------
SD crashed again -- also GIT Branch-5.0 as of 20100420, but with the added TCP_KEEPIDLE patch, and by error it was recompiled without debug symbols, so I do not know if this information helps at all. I've attached bactrace and (botched) traceback files in logs_20100426_sd_crash.tgz anyway.

I'll get the latest branch-5.0 to recompile now...

----------------------------------------------------------------------
 (0005324) kern (administrator) - 2010-04-26 19:20
 http://bugs.bacula.org/view.php?id=1553#c5324
----------------------------------------------------------------------
Please try the attached patch. We believe that this patch will fix your problem.
Issue History
Date Modified    Username       Field                    Change
======================================================================
2010-04-21 09:02 mnalis         New Issue
2010-04-21 09:02 mnalis         File Added: logs_20100420_sd_crash.tgz
2010-04-21 11:45 kern           Note Added: 0005285
2010-04-21 11:45 kern           Status                   new => feedback
2010-04-21 12:03 mnalis         Note Added: 0005286
2010-04-21 12:06 mnalis         Issue Monitored: mnalis
2010-04-21 13:09 kern           Note Added: 0005287
2010-04-21 13:34 mnalis         Note Added: 0005288
2010-04-21 18:13 kern           Note Added: 0005291
2010-04-21 22:40 kern           Note Added: 0005298
2010-04-22 14:26 mnalis         Note Added: 0005303
2010-04-22 14:27 mnalis         File Added: conf_20100422.tgz
2010-04-26 08:17 ebollengier    Relationship added       has duplicate 0001561
2010-04-26 09:55 mnalis         Note Added: 0005323
2010-04-26 09:55 mnalis         File Added: logs_20100426-sd_crash.tgz
2010-04-26 19:19 kern           File Added: bug-1553.patch
2010-04-26 19:20 kern           Note Added: 0005324
======================================================================