From: <bac...@li...> - 2008-09-24 15:20:41
A NOTE has been added to this issue.
======================================================================
http://bugs.bacula.org/view.php?id=1162
======================================================================
Reported By:                csierra
Assigned To:
======================================================================
Project:                    bacula
Issue ID:                   1162
Category:                   Director
Reproducibility:            random
Severity:                   crash
Priority:                   normal
Status:                     feedback
======================================================================
Date Submitted:             09-24-2008 12:09 BST
Last Modified:              09-24-2008 16:20 BST
======================================================================
Summary:                    Director crashes with double free from jcr.c:343
Description:
We are running a test configuration with around 70 machines doing
concurrent jobs to different storages. The number of concurrent jobs is 20.
Backups are written to files using Max Volume Jobs = 1. Volumes are
named using variable resolution:

Label Format = "${JobName}-${Level}"

Everything works fine for a week or two, but sometimes the director
crashes with the following message:

24-sep 03:54 myhost-dir: ABORTING due to ERROR in smartall.c:194 double free from jcr.c:343

The system is compiled with the following options:

./configure --with-mysql --enable-static-fd --with-readline --with-working-dir=/var/lib/bacula/

Debug symbols were included to trace the problem down.

We have two jobs for each client, one with a fixed fileset and the other
generated by a script on the client.

We have registered a RunScript for failed backups which deletes the
volume of the failed job, so that it is not reused by another job whose
name would not match the volume name. The script runs the following:

-----
#!/bin/bash
# Try to delete the created volume in case one has been created. This way we
# prevent the next backup from using the empty volume left by this faulty job.
sudo bconsole -c /etc/bacula/bconsole.conf << EOF
delete Volume=${5}-${6}
yes
EOF
-----

This script seems to work well. I include it here only in case it sheds
some light on the problem.

Libraries linked by the binary:

ldd /sbin/bacula-dir
        linux-gate.so.1 =>  (0xb7f5e000)
        libmysqlclient_r.so.15 => /usr/lib/libmysqlclient_r.so.15 (0xb7d73000)
        libz.so.1 => /usr/lib/libz.so.1 (0xb7d5f000)
        libpthread.so.0 => /lib/tls/i686/cmov/libpthread.so.0 (0xb7d47000)
        libstdc++.so.6 => /usr/lib/libstdc++.so.6 (0xb7c5d000)
        libm.so.6 => /lib/tls/i686/cmov/libm.so.6 (0xb7c36000)
        libgcc_s.so.1 => /lib/libgcc_s.so.1 (0xb7c2a000)
        libc.so.6 => /lib/tls/i686/cmov/libc.so.6 (0xb7ae9000)
        libcrypt.so.1 => /lib/tls/i686/cmov/libcrypt.so.1 (0xb7abb000)
        libnsl.so.1 => /lib/tls/i686/cmov/libnsl.so.1 (0xb7aa3000)
        /lib/ld-linux.so.2 (0xb7f5f000)
======================================================================

----------------------------------------------------------------------
 kern - 09-24-08 13:38
----------------------------------------------------------------------
Hmmm. It seems that although we have a specific lock for the last jobs
list, it is not used consistently, and so you are getting hit by a race
condition. I believe I have now fixed it, and will attach a patch once the
tests run successfully.

----------------------------------------------------------------------
 kern - 09-24-08 16:12
----------------------------------------------------------------------
Please try the patch that is attached to this bug report. I believe that it
will fix your problem. Feedback would be appreciated (yes, I know it doesn't
happen very often ...).

If the problem occurs again after I close this bug, please don't hesitate
to re-open it.

----------------------------------------------------------------------
 csierra - 09-24-08 16:20
----------------------------------------------------------------------
Thank you very much... You are fast I must say!! :-)

We have scheduled the backups every day, and the crash used to happen
about once a week. So we will have to wait... In this case no news would be
good news...

Anyway... after a prudent amount of time I will get back to you on this
with the result.

Thanks again.

Issue History
Date Modified   Username       Field                    Change
======================================================================
09-24-08 12:09  csierra        New Issue
09-24-08 12:10  csierra        File Added: bacula_report.log
09-24-08 13:38  kern           Note Added: 0003655
09-24-08 13:38  kern           Status                   new => acknowledged
09-24-08 16:10  kern           File Added: 2.4.2-jobend-crash.patch
09-24-08 16:12  kern           Note Added: 0003658
09-24-08 16:12  kern           Status                   acknowledged => feedback
09-24-08 16:12  kern           Fixed in Version          => 2.4.3
09-24-08 16:20  csierra        Note Added: 0003659
======================================================================
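
For readers following kern's diagnosis in note 0003655 (a lock exists for the last jobs list, but not every access takes it), the general pattern can be sketched in a few lines of Python. This is not Bacula code and not the attached patch. The names `LastJobs`, `Entry`, `DoubleFree` and `run_jobs`, the bounded-list trimming, and the exception that stands in for the smartall abort are all assumptions made for illustration. The only point it shows is that every reader and writer of the shared list goes through the same lock, so an entry can be released only once.

```python
import threading


class DoubleFree(Exception):
    """Hypothetical stand-in for the 'double free' abort seen in the report."""


class Entry:
    """One finished-job record (illustrative, not Bacula's structure)."""

    def __init__(self, job_id):
        self.job_id = job_id
        self.freed = False

    def free(self):
        # Releasing the same entry twice is the bug we want to rule out.
        if self.freed:
            raise DoubleFree(f"double free of job {self.job_id}")
        self.freed = True


class LastJobs:
    """Bounded list of finished jobs; every access takes the same lock."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self._lock = threading.Lock()
        self._entries = []
        self.freed_count = 0

    def add(self, job_id):
        with self._lock:
            self._entries.append(Entry(job_id))
            # Trim while still holding the lock, so no other thread can
            # select and free the same oldest entry at the same time.
            while len(self._entries) > self.max_entries:
                victim = self._entries.pop(0)
                victim.free()
                self.freed_count += 1

    def snapshot(self):
        # Readers take the lock too; skipping it here is the kind of
        # inconsistency the note describes.
        with self._lock:
            return [entry.job_id for entry in self._entries]


def run_jobs(n_threads=20, jobs_per_thread=200, max_entries=10):
    """Simulate concurrent jobs finishing and registering themselves."""
    last_jobs = LastJobs(max_entries)
    errors = []

    def worker(thread_index):
        try:
            for i in range(jobs_per_thread):
                last_jobs.add(thread_index * jobs_per_thread + i)
        except DoubleFree as exc:
            errors.append(exc)

    threads = [
        threading.Thread(target=worker, args=(t,)) for t in range(n_threads)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return last_jobs, errors
```

Calling `run_jobs()` returns the list object and any `DoubleFree` errors that were raised. With the lock taken on every path the error list stays empty. If one path skipped the lock, two threads could pick the same entry to release, which would be consistent with the intermittent, roughly weekly crash described in the report.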