From: John R. <rou...@re...> - 2007-10-26 17:13:36
This is a resend. The original apparently went missing.

On Thu, Oct 25, 2007 at 02:10:12PM -0500, Les Mikesell wrote:
> John Rouillard wrote:
> > 2007-10-25 10:54:00 Aborting backup up after signal PIPE
> > 2007-10-25 10:54:01 Got fatal error during xfer (aborted by signal=PIPE)
>
> This means the problem is on the other end of the link - or at least
> that the ssh driving it exited.

Hmm, ok, what happens if I add -v to the remote rsync args? Will the
extra output in the rsync stream screw things up? Maybe I can use:

  rsync .... -v ... 2> /tmp/rsync.log

to get debugging at the rsync level without sending the debugging
output to BackupPC.

I'll also try adding -o ServerAliveInterval=30 and -vvv to see if that
improves the reliability of the ssh session and generates output,
since -v sends debugging output to stderr and I can grab that with:

  ssh -v concord 2> /tmp/log

Does BackupPC need to use stderr to the remote system for anything?

> > lastlog got digests fdb1c560d9ba822ab4ffa635d4b5f67f vs
> > fdb1c560d9ba822ab4ffa635d4b5f67f
> > create 400 0/0 65700 lastlog
> > Can't write 33932 bytes to socket
> > Sending csums, cnt = 16, phase = 1
> > Read EOF: Connection reset by peer
>
> The process on the remote side is gone at this point.

I'll buy that, but I expect some death message. A dying gasp if you
will.

> >If I am reading this right, the last file handled before the signal is
> >/var/log/lastlog which is << 2GB (65K approx). When the signal occurs,
> >I guess /var/log/ldap is the file in progress.
> >
> >The ldap file is 22GB in size:
> >
> > [rouilj@ops02 log]$ ls -l ldap
> > -rw------- 1 root root 22978928497 Oct 25 18:46 ldap
> >
> >Could the size be the issue?
>
> Yes, it sounds very likely that whatever is sending the file on the
> remote side can't handle files larger than 2 gigs.

I just did an "sudo rsync -e ssh ops02.mht1:/var/log/ldap ." and it
completed without a problem. All 22 GB of the file transferred fine 8-(.
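(Aside: a minimal sketch of the stderr-capture idea above, assuming the command can be wrapped by a small shell function. The name run_logged and the temporary log path are made up for illustration, and none of this has been tried under BackupPC. It leaves stdout alone, since that would carry the rsync protocol stream, and appends stderr plus the exit status to a log file, so a dying process at least leaves its exit code behind. The demo runs a local stand-in command, not a real rsync.)

```shell
#!/bin/sh
# Hypothetical wrapper (not part of BackupPC, rsync, or ssh): run a command,
# pass stdout through untouched, and append stderr and the exit status
# to the log file given as the first argument.
run_logged() {
    log=$1
    shift
    printf '%s start: %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$*" >>"$log"
    "$@" 2>>"$log"
    status=$?
    printf '%s exit status: %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$status" >>"$log"
    return "$status"
}

# Local demonstration: the stand-in command writes a payload to stdout,
# a message to stderr, and then exits non-zero.
demo_log=$(mktemp)
demo_out=$(run_logged "$demo_log" sh -c 'echo payload; echo "dying gasp" >&2; exit 3')
demo_status=$?

echo "stdout seen by caller: $demo_out"
echo "exit status seen by caller: $demo_status"
cat "$demo_log"
```

In principle the same function could wrap the "sudo /usr/bin/rsync --server --sender ..." command on the remote side, but whether BackupPC tolerates that is an open question here.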
However now I have the same sigpipe issue on another host that has
been backing up fine (3 full and 3 incremental) until now:

  incr backup started back to 2007-10-25 17:28:40 (backup #6) for directory /var/spool/nagios
  Running: /usr/bin/ssh -q -x -l backup ops01.mht1.renesys.com sudo /usr/bin/rsync --server --sender --numeric-ids --perms --owner --group -D --links --hard-links --times --block-size=2048 --recursive --one-file-system --checksum-seed=32761 . /var/spool/nagios/
  Xfer PIDs are now 24197
  Rsync command pid is 24197
  Got remote protocol 28
  Negotiated protocol version 28
  Checksum caching enabled (checksumSeed = 32761)
  Got checksumSeed 0x7ff9
  Got file list: 11 entries
  Child PID is 24213
  Xfer PIDs are now 24197,24213
  Sending csums, cnt = 11, phase = 0
  create d2775 306/200 4096 .
  create p 660 306/521 0 nagios.cmd
  create d2775 306/200 4096 tmp
  tmp/host-perfdata got digests 46a0099d178d1b97aa39e454ae083d3f vs 46a0099d178d1b97aa39e454ae083d3f
  Skipping tmp/service-perfdata.0000.bz2 (same attr)
  Skipping tmp/service-perfdata.0001.gz (same attr)
  Skipping tmp/service-perfdata.4.gz (same attr)
  Skipping tmp/service-perfdata.5.gz (same attr)
  Sending csums, cnt = 0, phase = 1
  create 664 306/200 916165956 tmp/host-perfdata
  tmp/nagios_daemon_pids got digests 7bfc0cffe0f114dd6eea7514c44422cd vs 7bfc0cffe0f114dd6eea7514c44422cd
  create 664 306/200 6 tmp/nagios_daemon_pids
  tmp/old_list got digests 0e258a7527fe053eea032e6d58f1de7c vs 0e258a7527fe053eea032e6d58f1de7c
  create 664 306/200 48 tmp/old_list
  Read EOF:
  Tried again: got 0 bytes
  Can't write 4 bytes to socket
  finish: removing in-process file tmp/service-perfdata
  delete 644 306/200 343155581 tmp/service-perfdata.0001.gz
  delete 664 306/200 343250131 tmp/service-perfdata.5.gz
  delete 644 306/200 186949772 tmp/service-perfdata.0000.bz2
  delete 664 306/200 341890997 tmp/service-perfdata.4.gz
  delete 644 306/200 1427879157 tmp/service-perfdata
  Child is aborting
  Done: 4 files, 916199608 bytes
  Got fatal error during xfer (aborted by signal=PIPE)
  Backup aborted by user signal

Is there anything I can do to get better diagnostics? If rsync
--server --sender exits with an error, how well does the File::RsyncP
module do at grabbing stderr (or stdout, which it would see as a
breaking of the protocol) and sending it back to the xfer log? Is
there a flag/option I can set in File::RsyncP? (Time to perldoc
File::RsyncP I guess.)

> >Also is there a way to tail the xfer logs in realtime while the daemon
> >is controling the backup? So I don't have to wait for the backup to
> >finish?
>
> You aren't going to see a problem in the log file - the other end is
> crashing.

Well I have two backups still running (3+ hours later) and I am trying
to find out what file they are stuck on. Nothing that I can see should
be hanging the rsync this long compared to when I run an rsync
directly.

-- 
				-- rouilj

John Rouillard
System Administrator
Renesys Corporation
603-643-9300 x 111
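P.S. For an aborted run like the one logged above, the "finish: removing in-process file" line names the file that was mid-transfer when the pipe broke. Here is a rough sketch for pulling that name out of a log, assuming the xfer log is already available as plain text with one entry per line. The function name in_process_file is made up for illustration, and this does nothing for backups that are still running.

```shell
#!/bin/sh
# Hypothetical helper: print the file named in the last
# "removing in-process file" line of a plain-text xfer log.
in_process_file() {
    awk '
        /removing in-process file / {
            # Strip everything up to and including the marker text,
            # leaving only the path that follows it.
            sub(/^.*removing in-process file /, "")
            last = $0
        }
        END { if (last != "") print last }
    ' "$1"
}

# Demo on a few lines copied from the log in this message.
sample_log=$(mktemp)
cat >"$sample_log" <<'EOF'
create 664 306/200 48 tmp/old_list
Read EOF:
Tried again: got 0 bytes
Can't write 4 bytes to socket
finish: removing in-process file tmp/service-perfdata
Child is aborting
EOF

stuck_file=$(in_process_file "$sample_log")
echo "$stuck_file"
```

On the sample lines this prints tmp/service-perfdata, which matches the 1427879157-byte file in the delete list, the largest file in that directory.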