From: Mantis B. T. <no...@bu...> - 2014-07-10 00:31:32
|
A NOTE has been added to this issue. ====================================================================== http://bugs.bacula.org/view.php?id=2071 ====================================================================== Reported By: jpschewe Assigned To: kern ====================================================================== Project: Bacula Bug Reports Issue ID: 2071 Category: File Daemon Reproducibility: always Severity: crash Priority: normal Status: acknowledged ====================================================================== Date Submitted: 2014-06-15 18:49 BST Last Modified: 2014-07-10 01:31 BST ====================================================================== Summary: Bacula crashed trying to do a restore Description: I'm running version 5.2.6 of bacula on Linux. This is the latest available for Ubuntu. I backed up a windows laptop and then tried to restore to a Linux desktop. The Linux desktop is also running the storage daemon and the director. I tried the restore again and got the same error. This is the end of the restore log. 15-Jun 10:46 jon-fd JobId 13614: drwxrwxrwx 1 root root 0 2014-06-13 13:49:55 /mnt/vhs/jen-restore/C:/Users/Jennifer Schewe/AppData/Local/Mozilla/Firefox/Profiles/2rmh2bnm.default-1400546053153/Cache/1/18/ 15-Jun 10:46 jon-fd JobId 13614: -rwxrwxrwx 1 root root 4456 2014-06-13 13:50:21 ???? /mnt/vhs/jen-restore/C:/Users/Jennifer Schewe/AppData/Local/Mozilla/Firefox/Profiles/2rmh2bnm.default-1400546053153/Cache/1/19/2173115-Jun 10:53 jon-dir JobId 13614: Fatal error: bsock.c:503 Packet size too big from "Client: jon-fd:jon:9102. Terminating connection. 15-Jun 10:53 jon-dir JobId 13614: Fatal error: Socket error on Store end command: ERR=No data available 15-Jun 10:53 jon-dir JobId 13614: Error: Bacula jon-dir 5.2.6 (21Feb12): Build OS: x86_64-pc-linux-gnu ubuntu 14.04 JobId: 13614 Job: RestoreFiles.2014-06-15_10.41.57_04 Restore Client: jon-fd Start time: 15-Jun-2014 10:41:59 End time: 15-Jun-2014 10:53:25 Files Expected: 27,962 Files Restored: 0 Bytes Restored: 0 Rate: 0.0 KB/s FD Errors: 1 FD termination status: SD termination status: OK Termination: *** Restore Error *** 15-Jun 10:53 jon-dir JobId 13614: Error: Bacula jon-dir 5.2.6 (21Feb12): Build OS: x86_64-pc-linux-gnu ubuntu 14.04 JobId: 13614 Job: RestoreFiles.2014-06-15_10.41.57_04 Restore Client: jon-fd Start time: 15-Jun-2014 10:41:59 End time: 15-Jun-2014 10:53:25 Files Expected: 27,962 Files Restored: 0 Bytes Restored: 0 Rate: 0.0 KB/s FD Errors: 2 FD termination status: SD termination status: OK Termination: *** Restore Error *** Here is the gdb traceback email I received. [New LWP 17020] [New LWP 3705] [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1". 0x00007f3d4eec0c33 in select () at ../sysdeps/unix/syscall-template.S:81 81 ../sysdeps/unix/syscall-template.S: No such file or directory. $1 = 762212202 $2 = 25469064 $3 = 25469128 /etc/bacula/scripts/btraceback.gdb:4: Error in sourced command file: No symbol "catalog_db" in current context. ====================================================================== ---------------------------------------------------------------------- (0006942) kern (administrator) - 2014-07-05 17:01 http://bugs.bacula.org/view.php?id=2071#c6942 ---------------------------------------------------------------------- Clearly Bacula got a fatal error, but I see no evidence of a crash (daemon stops running). The gdb output is not valid. It looks like you are trying to restore Windows data to a Linux machine. I principal this is possible, but from the information you have provided, there isn't much I can do. To get more information, first, proved that Bacula actually crashed by showing that it is running before the failure and not running after the failure. If it crashes we will need a valid gdb output, which if you have configured Bacula correctly will be emailed to you. If it is not a crash then possibly by setting debug level 200 and posting the trace file, we might be able to see something. ---------------------------------------------------------------------- (0006948) jpschewe (reporter) - 2014-07-07 02:19 http://bugs.bacula.org/view.php?id=2071#c6948 ---------------------------------------------------------------------- I'm not sure why the gdb output isn't valid. That's what was emailed to me when bacula-fd crashed. I just uplaoded a file named "typescript". This file is a trace of bacula-fd run under gdb with -f -d 200. I snipped the middle out of the file so that it's small enough for upload. The restore failure from the director still gives the same packet too large error. I saw the error message in gdb and asked it for a stack trace. ---------------------------------------------------------------------- (0006949) kern (administrator) - 2014-07-07 09:02 http://bugs.bacula.org/view.php?id=2071#c6949 ---------------------------------------------------------------------- I've deleted your upload since it did not contain useful data. When you run Bacula under a debugger you must include the -s option on the command line otherwise the debugger will stop on every signal. The -s option causes Bacula to ignore non-fatal signals. Are you running the same version of Bacula for the Dir, SD, and FD? ---------------------------------------------------------------------- (0006950) jpschewe (reporter) - 2014-07-07 12:56 http://bugs.bacula.org/view.php?id=2071#c6950 ---------------------------------------------------------------------- I've attached 2 files. One is running under gdb and one is not. The one running under gdb doesn't get the segfault, but I do see a thread exiting when the restore failed. The one not running under gdb clearly gets a seg fault. When not running under gdb I get this email: [New LWP 15657] [New LWP 15288] [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1". 0x00007f67c8654c33 in select () at ../sysdeps/unix/syscall-template.S:81 81 ../sysdeps/unix/syscall-template.S: No such file or directory. $1 = 762212202 $2 = 19411032 $3 = 19411096 /etc/bacula/scripts/btraceback.gdb:4: Error in sourced command file: No symbol "catalog_db" in current context. In both cases I get an email from the director that the restore failed with packet size too big. 07-Jul 06:16 jon-fd JobId 13732: -rwxrwxrwx 1 root root 4396 2014-07-01 10:23:05 /mnt/external/jpschewe/test-restore/C:/Users/Jennifer Schewe/AppData/Local/Mozilla/Firefox/Profiles/hjfcjzwn.default/Cache/6/08/1E6B4m01 07-Jul 06:16 jon-fd JobId 13732: -rwxrwxrwx 1 root root 5161 2014-06-30 16:09:44 /mnt/external/jpschewe/test-restore/ÿÿÿúC:/Users/Jennifer Schewe/AppData/Local/Mozilla/Firefox/Profiles/hjfcjzwn.default/Cache/6/08/2110E07-Jul 06:18 jon-dir JobId 13732: Fatal error: bsock.c:503 Packet size too big from "Client: jon-fd:jon:9102. Terminating connection. 07-Jul 06:18 jon-dir JobId 13732: Fatal error: Socket error on Store end command: ERR=No data available 07-Jul 06:18 jon-dir JobId 13732: Error: Bacula jon-dir 5.2.6 (21Feb12): Build OS: x86_64-pc-linux-gnu ubuntu 14.04 JobId: 13732 Job: RestoreFiles.2014-07-07_06.15.33_09 Restore Client: jon-fd Start time: 07-Jul-2014 06:15:35 End time: 07-Jul-2014 06:18:41 Files Expected: 36,914 Files Restored: 0 Bytes Restored: 0 Rate: 0.0 KB/s FD Errors: 1 FD termination status: SD termination status: Running Termination: *** Restore Error *** 07-Jul 06:18 jon-dir JobId 13732: Error: Bacula jon-dir 5.2.6 (21Feb12): Build OS: x86_64-pc-linux-gnu ubuntu 14.04 JobId: 13732 Job: RestoreFiles.2014-07-07_06.15.33_09 Restore Client: jon-fd Start time: 07-Jul-2014 06:15:35 End time: 07-Jul-2014 06:18:41 Files Expected: 36,914 Files Restored: 0 Bytes Restored: 0 Rate: 0.0 KB/s FD Errors: 2 FD termination status: SD termination status: Running Termination: *** Restore Error *** ---------------------------------------------------------------------- (0006951) jpschewe (reporter) - 2014-07-07 12:58 http://bugs.bacula.org/view.php?id=2071#c6951 ---------------------------------------------------------------------- And yes I'm running the same version of bacula for the storage, director and file deamons when I'm restoring. They are all at vesion 5.2.6 and on the same host. The windows client used to create the backups that I'm trying to restore is 5.2.10. ---------------------------------------------------------------------- (0006952) jpschewe (reporter) - 2014-07-07 15:33 http://bugs.bacula.org/view.php?id=2071#c6952 ---------------------------------------------------------------------- I just added gdb-ignore-sigusr.trace.snipped. It seems that the -s option to bacula-sd ignores all signals including SIGSEGV. This means that gdb doesn't get a chance to be notified of the segmentation violation. So I ran bacula-sd without -s in gdb and then told gdb to ignore SIGUSR1 and SIGUSR2. This provided me with a clean trace showing that there is an error in cleaning up memory. I'm not sure how this relates to the server message of packet too large, but hopefully provides you enough information to tell me where I might debug next. ---------------------------------------------------------------------- (0006953) kern (administrator) - 2014-07-07 17:56 http://bugs.bacula.org/view.php?id=2071#c6953 ---------------------------------------------------------------------- OK, nice job of getting gdb to work. As far as I can see, somehow Bacula's memory has become corrupted. This probably also caused the previous error that you saw concerning the packet size. The problem is that the detection of the problem is a long time after the damage is done. It is essentially impossible to track down and fix if I cannot reproduce it, because tracking it down involves enabling memory checking code in lots of places then once it triggers (currently only at the end of the job), you add more and more until you isolate the problem. This can take several days. So, to resolve it, I will need to have the bsr file that is generated for the restore, and the volume which contains the data. Since the Volume is probably very large (judging from the size of the restore) this can be a problem. Of course, you also may not appreciate sending me your Volume data. Another option would be to try to reduce it to the minimal restore size (perhaps one file) that causes the problem. A final option is to see if the restore works without restoring the Windows data or to restore it on a Windows machine. Restoring Windows data on a Linux machine uses code that picks through Windows meta data to find the real data and restore it. This code is very likely to be the source of the bug. But tracking down the exact place can really be problematic. ---------------------------------------------------------------------- (0006954) jpschewe (reporter) - 2014-07-07 18:40 http://bugs.bacula.org/view.php?id=2071#c6954 ---------------------------------------------------------------------- I can restore Linux data to a Linux computer. I've also restored a smaller portion of this same data to a LInux computer. So I suspect it's some particular file that's actually the issue. I'll see if I can narrow this down and post more details. ---------------------------------------------------------------------- (0006955) jpschewe (reporter) - 2014-07-08 12:50 http://bugs.bacula.org/view.php?id=2071#c6955 ---------------------------------------------------------------------- OK, I've managed to make it crash when only restoring 1 file in 1 directory and I'm willing to share that data. So I would like to make sure I have a reasonable test plan here before I put this together. Here's what I'm thinking, please correct any issues you see. Create a disk file pool called "test" Create a job that only backs up this bad directory on Windows using the pool "test" Run the job Test restoring to my Linux client and verify that it fails Gather up the bsr and the disk volume and transfer to you Does this sound reasonable? ---------------------------------------------------------------------- (0006956) kern (administrator) - 2014-07-08 12:58 http://bugs.bacula.org/view.php?id=2071#c6956 ---------------------------------------------------------------------- Yes, that sounds like a good way to be able to resolve the problem, and I am interested in fixing it. Thanks for taking all the trouble to simplify the failure case. Please do send me the bsr and the disk volume. If it is small enough, you may post it to this bug report. Otherwise, please feel free to send it to my personal email address: kern at sibbald dot com. That address should allow receiving a rather large email. Another option is that you can post it on some site where I can download it. ---------------------------------------------------------------------- (0006957) jpschewe (reporter) - 2014-07-10 01:31 http://bugs.bacula.org/view.php?id=2071#c6957 ---------------------------------------------------------------------- I just uploaded a tar file with the bsr file and the disk volume containing the backup. When I run a restore with this I get the files created and the director reports that the restore succeeded, but the file daemon crashed. Issue History Date Modified Username Field Change ====================================================================== 2014-06-15 18:49 jpschewe New Issue 2014-07-05 17:01 kern Note Added: 0006942 2014-07-05 17:01 kern Status new => feedback 2014-07-07 02:16 jpschewe File Added: typescript 2014-07-07 02:19 jpschewe Note Added: 0006948 2014-07-07 02:19 jpschewe Status feedback => new 2014-07-07 09:00 kern File Deleted: typescript 2014-07-07 09:02 kern Note Added: 0006949 2014-07-07 09:02 kern Status new => feedback 2014-07-07 12:52 jpschewe File Added: no-gdb.trace.snipped 2014-07-07 12:52 jpschewe File Added: gdb.trace.snipped 2014-07-07 12:56 jpschewe Note Added: 0006950 2014-07-07 12:56 jpschewe Status feedback => new 2014-07-07 12:58 jpschewe Note Added: 0006951 2014-07-07 15:30 jpschewe File Added: gdb-ignore-sigusr.trace.snipped 2014-07-07 15:33 jpschewe Note Added: 0006952 2014-07-07 17:56 kern Note Added: 0006953 2014-07-07 17:56 kern Status new => feedback 2014-07-07 18:40 jpschewe Note Added: 0006954 2014-07-07 18:40 jpschewe Status feedback => new 2014-07-08 12:50 jpschewe Note Added: 0006955 2014-07-08 12:58 kern Note Added: 0006956 2014-07-08 12:59 kern Assigned To => kern 2014-07-08 12:59 kern Status new => acknowledged 2014-07-10 01:30 jpschewe File Added: bacula-crash.tar.gz 2014-07-10 01:31 jpschewe Note Added: 0006957 ====================================================================== |