Re: [BackupPC-users] Debugging a SIGPIPE error killing my backups

This is a resend; the original apparently went missing.

On Thu, Oct 25, 2007 at 02:10:12PM -0500, Les Mikesell wrote:
> John Rouillard wrote:
> >  2007-10-25 10:54:00 Aborting backup up after signal PIPE
> >  2007-10-25 10:54:01 Got fatal error during xfer (aborted by signal=PIPE)
> 
> This means the problem is on the other end of the link - or at least 
> that the ssh driving it exited.

Hmm, ok, what happens if I add -v to the remote rsync args? Will the
extra output in the rsync stream screw things up? Maybe I can use:

  rsync .... -v ... 2> /tmp/rsync.log

to get debugging at the rsync level without sending the debugging
output to BackupPC.
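
Something like the following is what I have in mind on the client. It
is only a rough sketch: the function name run_with_stderr_log and the
log path are made up for illustration, and I have not tried it under
BackupPC yet. The idea is that stdout (the rsync protocol stream)
passes through untouched while stderr is appended to a local file.
Whether rsync in --server mode really writes its -v output to stderr
is something I still need to check.

```shell
#!/bin/sh
# Hypothetical helper: run a command, leave stdout alone (it carries the
# rsync protocol stream), and append anything written to stderr to a log.
run_with_stderr_log() {
    _rsl_log=$1      # first argument: log file to append stderr to
    shift            # remaining arguments: the command to run
    "$@" 2>>"$_rsl_log"
}

# Intended use from a wrapper script on the client (paths are assumptions):
#   run_with_stderr_log /tmp/rsync.log /usr/bin/rsync -v "$@"
```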

I'll also try adding -o ServerAliveInterval=30 and -vvv to the ssh
command to see whether that improves the reliability of the ssh
session and generates useful output. Since -v sends ssh's debugging
output to stderr, I can grab that with:

 ssh -v concord 2> /tmp/log

Does BackupPC need stderr from the remote system for anything?

> >  lastlog got digests fdb1c560d9ba822ab4ffa635d4b5f67f vs
> >  fdb1c560d9ba822ab4ffa635d4b5f67f
> >    create   400       0/0       65700 lastlog
> >  Can't write 33932 bytes to socket
> >  Sending csums, cnt = 16, phase = 1
> >  Read EOF: Connection reset by peer
> 
> The process on the remote side is gone at this point.

I'll buy that, but I expect some death message. A dying gasp if you
will.

> >If I am reading this right, the last file handled before the signal is
> >/var/log/lastlog which is << 2GB (65K approx). When the signal occurs,
> >I guess /var/log/ldap is the file in progress.
> >
> >The ldap file is 22GB in size:
> >
> >  [rouilj@ops02 log]$ ls -l ldap 
> >  -rw-------  1 root root 22978928497 Oct 25 18:46 ldap
> >
> >Could the size be the issue?
> 
> Yes, it sounds very likely that whatever is sending the file on the 
> remote side can't handle files larger than 2 gigs.

I just did a "sudo rsync -e ssh ops02.mht1:/var/log/ldap ." and it
completed without a problem. All 22 GB of the file transferred fine
8-(. However, I now have the same SIGPIPE issue on another host that
had been backing up fine (3 full and 3 incremental) until now:

  incr backup started back to 2007-10-25 17:28:40 (backup #6) for
  directory /var/spool/nagios
  Running: /usr/bin/ssh -q -x -l backup ops01.mht1.renesys.com sudo
  /usr/bin/rsync --server --sender --numeric-ids --perms --owner --group
  -D --links --hard-links --times --block-size=2048 --recursive
  --one-file-system --checksum-seed=32761 . /var/spool/nagios/
  Xfer PIDs are now 24197
  Rsync command pid is 24197
  Got remote protocol 28
  Negotiated protocol version 28
  Checksum caching enabled (checksumSeed = 32761)
  Got checksumSeed 0x7ff9
  Got file list: 11 entries
  Child PID is 24213
  Xfer PIDs are now 24197,24213
  Sending csums, cnt = 11, phase = 0
    create d2775   306/200        4096 .
    create p 660   306/521           0 nagios.cmd
    create d2775   306/200        4096 tmp
  tmp/host-perfdata got digests 46a0099d178d1b97aa39e454ae083d3f vs
  46a0099d178d1b97aa39e454ae083d3f
  Skipping tmp/service-perfdata.0000.bz2 (same attr)
  Skipping tmp/service-perfdata.0001.gz (same attr)
  Skipping tmp/service-perfdata.4.gz (same attr)
  Skipping tmp/service-perfdata.5.gz (same attr)
  Sending csums, cnt = 0, phase = 1
    create   664   306/200   916165956 tmp/host-perfdata
  tmp/nagios_daemon_pids got digests 7bfc0cffe0f114dd6eea7514c44422cd vs
  7bfc0cffe0f114dd6eea7514c44422cd
    create   664   306/200           6 tmp/nagios_daemon_pids
  tmp/old_list got digests 0e258a7527fe053eea032e6d58f1de7c vs
  0e258a7527fe053eea032e6d58f1de7c
    create   664   306/200          48 tmp/old_list
  Read EOF: 
  Tried again: got 0 bytes
  Can't write 4 bytes to socket
  finish: removing in-process file tmp/service-perfdata
    delete   644   306/200   343155581 tmp/service-perfdata.0001.gz
    delete   664   306/200   343250131 tmp/service-perfdata.5.gz
    delete   644   306/200   186949772 tmp/service-perfdata.0000.bz2
    delete   664   306/200   341890997 tmp/service-perfdata.4.gz
    delete   644   306/200  1427879157 tmp/service-perfdata
  Child is aborting
  Done: 4 files, 916199608 bytes
  Got fatal error during xfer (aborted by signal=PIPE)
  Backup aborted by user signal

Is there anything I can do to get better diagnostics? If rsync
--server --sender exits with an error, how well does the File::RsyncP
module do at grabbing stderr (or stdout, which it would see as a break
in the protocol) and sending it back to the xfer log?

Is there a flag/option I can set in File::RsyncP?

(Time to perldoc File::RsyncP I guess.)

> >Also is there a way to tail the xfer logs in realtime while the daemon
> >is controling the backup? So I don't have to wait for the backup to
> >finish?
> 
> You aren't going to see a problem in the log file - the other end is 
> crashing.

Well, I have two backups still running (3+ hours later) and I am trying
to find out what file they are stuck on. Nothing that I can see should
be hanging the rsync this long compared to when I run an rsync
directly.
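
In the meantime, one thing I can try on the client is to look at which
regular files the remote rsync process has open. A rough sketch,
assuming a Linux client with /proc and readlink available; open_files
is a name I made up, and it needs to run as root (or as the owner of
the process) to read another process's fd directory:

```shell
#!/bin/sh
# Hypothetical helper: print the regular files that process $1 currently
# has open, by resolving the symlinks under /proc/<pid>/fd (Linux only).
open_files() {
    for _of_fd in /proc/"$1"/fd/*; do
        # Skip descriptors that vanish or cannot be read.
        _of_target=$(readlink "$_of_fd") || continue
        # Sockets and pipes resolve to names like "socket:[1234]"; keep
        # only absolute paths that are regular files.
        case $_of_target in
            /*)
                if [ -f "$_of_target" ]; then
                    printf '%s\n' "$_of_target"
                fi
                ;;
        esac
    done
    return 0
}

# Intended use, with the pid of the "rsync --server --sender" process:
#   open_files 12345
```

If the same large file keeps showing up over several minutes, that is
probably the one the transfer is sitting on.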

-- 
				-- rouilj

John Rouillard
System Administrator
Renesys Corporation
603-643-9300 x 111