From: <bac...@li...> - 2007-03-29 07:20:59
A NOTE has been added to this issue.
======================================================================
http://bugs.bacula.org/view.php?id=794
======================================================================
Reported By:                jusiponen
Assigned To:
======================================================================
Project:                    bacula
Issue ID:                   794
Category:                   Storage Daemon
Reproducibility:            sometimes
Severity:                   crash
Priority:                   normal
Status:                     feedback
======================================================================
Date Submitted:             03-02-2007 03:14 EST
Last Modified:              03-29-2007 03:20 EDT
======================================================================
Summary:                    Storage daemon crashes semi-regularly
Description:
The storage daemon started to crash 1-2 times a week since I upgraded
from 1.x series to 2.0.2.

I see two job names in the traceback, but since this is the first
traceback available (I didn't have debugging installed before) I can't
say if these are related to the sd crash (will they be seen in future
traces, is one of these jobs responsible for the crash, or is a third
job - name not seen in the trace - crashing the sd). The jobs seen in
the trace run on clients with bacula-client-2.0.2-1.x86_64 RPM package
installed.

After the crash Bacula always errors the tape (2 more files on tape
than in catalog).

I'll post more tracebacks if (or rather when...) the crash happens
again.
======================================================================

----------------------------------------------------------------------
 kern - 03-02-07 06:06
----------------------------------------------------------------------
Are you sure you have a 64 bit machine? Have you run btape "test"?

Bacula is crashing itself because there is an internal logic error
(failed assert), which indicates to me that either you have a Bacula
build that does not correspond to your system, or you have some really
serious configuration error.

----------------------------------------------------------------------
 jusiponen - 03-02-07 06:48
----------------------------------------------------------------------
Server is i686, the clients mentioned in the trace are x86_64. The
server has run pre 2.x Bacula releases just fine, but it IS old (an IBM
xSeries 340, about 5-6 yr old): hardware failure is entirely possible.

I ran the tape test when I first installed the server and there were no
errors then. Should I run it again?

----------------------------------------------------------------------
 kern - 03-02-07 09:57
----------------------------------------------------------------------
Oops, I missed that the client was 64 bits -- that shouldn't make any
difference. However, you have two potential problem areas:

1. If you are running an old Dir/SD and newer clients, we don't support
that, so before going any further, you need to get the Dir and SD on
the same version, and at least the same version as the most recent
client.

2. The fact that you are running on an older IBM machine could be
causing some problems -- at a minimum I would recommend running memtest
for 1-10 hours.

I'm going to close the bug. Once you get everything on the same version
(some of the Clients can be older), if you are still having crashes,
please re-open it, and provide a new traceback.

Every time you change a Bacula Dir/SD version or do a major OS upgrade,
you should run btape "test" again to be sure, especially if you start
having problems with the SD.

----------------------------------------------------------------------
 kern - 03-02-07 09:58
----------------------------------------------------------------------
Closed until bug reporter gets everything running on the same version.

----------------------------------------------------------------------
 jusiponen - 03-28-07 03:12
----------------------------------------------------------------------
1. The server computer has been replaced. I had to recycle a SCSI card
for the tape drive / changer (Overland PowerLoader) from the old
server. Probably not the culprit here (see 3 and 4), but for the sake
of completeness I felt I should mention this.

2. Bacula server (SD and DIR on the same machine) has been upgraded to
2.0.3. Some of the clients have been upgraded too, but they are still a
mix of versions from 1.38.x to 2.0.3.

3. "btape test" passes without issues

4. Nothing suspicious (such as SCSI errors) appears in
/var/log/messages

And the SD keeps on crashing.

----------------------------------------------------------------------
 kern - 03-28-07 04:51
----------------------------------------------------------------------
The SD is dying in exactly the same place as previously. There is an
internal logic error.

You said that the server was replaced, but you didn't specify what
hardware/OS you are running. Are you running on an i686 or an IBM
xSeries? If you are running on an IBM xSeries, I don't really want to
go any further with this as it is not an officially supported
architecture and I don't have one.

If you are running a real i686 then please upload your bacula-dir.conf
and bacula-sd.conf files (you might want to modify your passwords), and
the details of your CPU (i.e. how many CPUs, real SMP?, or multi-core,
...)

Please also upload the *exact* history for say 24 hours of what
happened prior to the crash -- i.e. all the bconsole commands you
entered, and all the Jobs that ran (preferably including the job report
output).

----------------------------------------------------------------------
 jusiponen - 03-28-07 05:38
----------------------------------------------------------------------
CPU info:
---------
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 8
model name      : Pentium III (Coppermine)
stepping        : 6
cpu MHz         : 930.391
cache size      : 256 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 mmx fxsr sse
bogomips        : 1862.34

OS:
---
CentOS release 4.4 (Final)

IBM xSeries:
------------
You are confusing IBM architectures: xSeries servers are normal Intel
systems.

Commands entered on bconsole:
-----------------------------
26-MAR-2007: Out of tapes, so the usual "umount Tape, update slots,
mount Tape"
27-MAR-2007: No commands entered, AFAIK.
28-MAR-2007: Ran a restore job on the previous day.

I've uploaded configuration files (client passwords have been deleted,
all other passwords replaced with zyxxy) and Bacula log file.

----------------------------------------------------------------------
 kern - 03-29-07 03:20
----------------------------------------------------------------------
Thanks for the clear feedback.

OK on the IBM xSeries, I now know what that means. The machine is a bit
old, but it should be perfectly OK (I have a very similar Dell).

Your comment about doing a restore may have pointed toward a Bacula
logic problem or race condition ...

Do you think you have always run a restore prior to each crash? This is
something that could quite possibly lead to the internal logic error.
In particular, did you do a restore while one or more other drives were
open for writing?

Issue History
Date Modified   Username       Field                    Change
======================================================================
03-02-07 03:14  jusiponen      New Issue
03-02-07 03:14  jusiponen      File Added: sd_traceback-02-MAR-2007.txt
03-02-07 06:06  kern           Note Added: 0002277
03-02-07 06:06  kern           Status                   new => feedback
03-02-07 06:48  jusiponen      Note Added: 0002278
03-02-07 09:57  kern           Note Added: 0002279
03-02-07 09:58  kern           Note Added: 0002280
03-02-07 09:58  kern           Status                   feedback => closed
03-02-07 09:58  kern           Resolution               open => unable to duplicate
03-28-07 03:12  jusiponen      Status                   closed => feedback
03-28-07 03:12  jusiponen      Resolution               unable to duplicate => reopened
03-28-07 03:12  jusiponen      Note Added: 0002329
03-28-07 03:13  jusiponen      File Added: sd_traceback-26-MAR-2007.txt
03-28-07 03:13  jusiponen      File Added: sd_traceback-27-MAR-2007.txt
03-28-07 03:13  jusiponen      File Added: sd_traceback-28-MAR-2007.txt
03-28-07 04:51  kern           Note Added: 0002330
03-28-07 05:37  jusiponen      File Added: etc_bacula.tgz
03-28-07 05:38  jusiponen      Note Added: 0002333
03-29-07 03:20  kern           Note Added: 0002338
======================================================================