From: Holger P. <wb...@pa...> - 2011-08-12 15:46:39
Hi,

Carl Wilhelm Soderstrom wrote on 2011-08-11 13:16:38 -0500 [[BackupPC-users] BackupPC spontaneous server exit with SIGPIPE]:
> BackupPC 3.1.0 on Debian, kernel is 2.6.32, filesystem is xfs.
>
> I have a BackupPC server that has twice now had the BackupPC server process
> spontaneously exit. It *did* shut down nicely, interestingly enough. The
> server log (/var/lib/backuppc/log/LOG) is this:
> [...]
> 2011-08-11 09:00:01 Next wakeup is 2011-08-11 10:00:00
> 2011-08-11 09:51:47 Got signal PIPE... cleaning up

As you probably know, SIGPIPE means that the server is trying to write to a
socket that has been closed by the peer. BackupPC doesn't expect that
situation and doesn't attempt to handle it, because it only ever replies to
messages just received (from so-called "clients"; it doesn't seem to ever
write to backup jobs!) and to new incoming connections, and the software
speaking to the daemon (BackupPC_serverMesg) doesn't close the socket without
waiting for a reply. I suppose it would be easy to implement a malicious
"client" that causes the server to crash, and apparently some error condition
can also lead to this.

> It looks like there's no jobs running when the server exits; but obviously
> when it gets restarted (by me) it runs a BackupPC_link job on the host it
> was working on before it exited... so it's like the host 'hung' on doing
> something in the backup, exited (timeout of some sort?), then when restarted
> resumes where it left off and finishes all parts of the backup successfully.

The strange thing is that it doesn't seem to start BackupPC_link before the
crash. The daemon logs "Finished full backup on host1.example.com" when it
receives the corresponding message from BackupPC_dump. It closes the socket
whenever it detects an EOF on it (which would usually be immediately
afterwards, but there is really no way to tell from the log file) and then
queues a BackupPC_link if BackupPC_dump previously told it to.
So, it would appear that for some reason the socket isn't closed.

> This isn't the per-host LOG (/var/lib/backuppc/pc/host1.example.com/LOG) or
> XferLOG (/var/lib/backuppc/pc/host1.example.com/LOG), so it's not obvious
> how to turn up the debug level or otherwise try to figure out what's going
> on here.

I would insert debugging output into the daemon's code. You could try to add

    print(LOG $bpc->timeStamp, qq{About to reply "$reply" to command "$cmd" from $Clients{$client}{clientName}\n});

before line 1515 of BackupPC (syswrite),

    print(LOG $bpc->timeStamp, "About to send seed to UNIX socket\n");

before line 1557 (also syswrite), and

    print(LOG $bpc->timeStamp, qq{About to send seed to "$name:$port"\n});

before line 1578 (you'll have guessed, syswrite again). You might have
noticed that I left out the syswrite in line 1500. I just don't think it's
that one. [Line numbers are for the upstream tarball; I didn't check the
Debian package. These are the second through fourth 'syswrite' occurrences.]
Don't forget to restart BackupPC after changing the code ;-).

> It's happened twice, but I've never seen this host do it before. There's
> nothing in dmesg or syslog to indicate a problem. Only thing I can think of
> is that BackupPC is running into some sort of filesystem corruption that is
> causing it to fail out rather than risk corrupting it further.

It seems to be in some way related to a hanging backup job, but I can't see
how the job itself could cause it. I don't see any code meant to "crash"
BackupPC in the event of filesystem corruption, but then, I'm not looking for
it (yet) :-).

> Upgrading this host to BackupPC 3.2.1 may be possible (need to get buy-in
> from some other admins before doing that); but before I do that
> I'm going to fsck the disks. I've never seen BackupPC do this before on any
> of the installations I admin and it's done it twice in a few days, so I'm
> suspecting hardware problems.

I'm suspecting something weird :-).
For the moment, I don't see any better option than trying to reproduce it
with debugging output in place. Of course, it could also be something sending
a gratuitous SIGPIPE to BackupPC. Does anyone know of any circumstances where
a SIGPIPE would be sent to a process *group*?

Regards,
Holger
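
P.S. In case it helps to see the mechanism in isolation, here is a minimal
stand-alone sketch of what I mean by "log just before the syswrite". It is
not BackupPC code: the helper logged_syswrite, the socketpair standing in for
the daemon/client connection, and the in-memory log are all made up for
illustration. It closes the peer first and then writes, which should raise
SIGPIPE and make syswrite fail with EPIPE, with the log line already written
so you can tell which write triggered it.

```perl
#!/usr/bin/perl
# Sketch only: NOT BackupPC code. All names below are hypothetical.
use strict;
use warnings;
use Socket;
use Errno qw(EPIPE);

# Write a timestamped log line first, then attempt the write.
# Returns what syswrite returns: bytes written, or undef with $! set.
sub logged_syswrite {
    my ($log, $sock, $what, $data) = @_;
    print {$log} scalar(localtime), " About to write $what\n";
    return syswrite($sock, $data);
}

# A connected pair of UNIX sockets stands in for daemon <-> client.
socketpair(my $daemon, my $client, AF_UNIX, SOCK_STREAM, PF_UNSPEC)
    or die "socketpair: $!";

# Record the signal instead of exiting, so the failed write can be inspected.
my $got_sigpipe = 0;
$SIG{PIPE} = sub { $got_sigpipe = 1 };

# Collect log output in a string; a real daemon would use its log handle.
my $log_text = '';
open(my $log, '>', \$log_text) or die "open: $!";

close($client);    # the peer goes away before the reply is sent

my $n = logged_syswrite($log, $daemon, 'reply to closed peer', "ok\n");
my $write_failed_epipe = (!defined($n) && $! == EPIPE) ? 1 : 0;
close($log);

print $log_text;
print "SIGPIPE received: $got_sigpipe, EPIPE on write: $write_failed_epipe\n";
```

The handler here only sets a flag. Judging from the "Got signal PIPE...
cleaning up" line in the log above, the daemon instead treats PIPE as a
reason to shut down, which is why the last log line before the signal is the
interesting one.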