From: cppjavaperl <cpp...@ya...> - 2005-03-11 20:23:28
--- cppjavaperl <cpp...@ya...> wrote:
> --- Craig Barratt <cba...@us...> wrote:
> > cppjavaperl writes:
> > >
> > > Anyway, my first full backup was *almost* successful. I had 144
> > > files that failed on phase 0, with the message:
> > >
> > >     "md4 doesn't match: will retry in phase 1; file removed"
> > >
> > > OK, I see from earlier messages on this mailing list that Craig
> > > (Barratt) said this is OK -- [hello Craig, and thanks for all
> > > your efforts on such a great piece of software!] -- and that the
> > > files will be retried in the second phase (phase 1).
> > >
> > > So things went pretty well, but not perfectly. 138 of the files
> > > were successful in phase 1. However, 6 of the files also failed
> > > in phase 1, and thus were not backed up. For each one I get the
> > > error:
> > >
> > >     "fatal error: md4 doesn't match on retry; file removed"
> > >
> > > Here are the permissions, sizes, filenames, etc. for those 6
> > > files:
> > >
> > > -rw-r--r-- 1 krol users 205871492 2005-01-15 02:09 initial.repos
> > > -rwxrwxrwx 1 krol users   6306304 2001-07-26 09:58 US_co_dp_Lowres.MAP
> > > -rwxrwxrwx 1 krol users  82740262 2003-05-05 12:38 MX5_EvalData.Exe
> > > -rw-r--r-- 1 root root   39384395 2004-12-22 06:36 kernel-source-2.6.5-7.111.19.i586.rpm
> > > -rw-r--r-- 1 root root   40232872 2005-02-17 16:49 kernel-source-2.6.5-7.145.i586.rpm
> > > -rw-r--r-- 1 root root   50861983 2004-12-22 06:14 OpenOffice_org-1.1.1-23.4.i586.patch.rpm
> > >
> > > Note that the smallest file is only a little over 6MB, so I
> > > don't think this is a "large file" problem. The initial.repos
> > > file (the largest, at 205MB) is a dump of a Subversion
> > > repository. The kernel-source* and OpenOffice_org* files are
> > > updates downloaded using YOU (YaST Online Update). These are
> > > decent-sized files, but 50MB or so is not that unusual these
> > > days.
> > >
> > > I haven't tried running an incremental yet (to see if the files
> > > will back up on the second try). I think I'll hold off on that
> > > for a bit to see if anyone here has any comments, suggestions,
> > > etc.
> >
> > This is not good. You are right -- statistically there should be
> > almost no chance that a 6MB file would fail even on the first
> > pass.
> >
> > Is this the very first full, or is it a second or later full?
> >
> > The only three things I can think of are:
> >
> >  - Some bug in File::RsyncP or BackupPC. Do you have checksum
> >    caching enabled (ie: --checksum-seed=32761 in
> >    $Conf{RsyncArgs})? If so, try turning it off.
> >
> >  - The file is being changed during the transfer; but the mtimes
> >    don't suggest this.
> >
> >  - Some flakey hardware or memory problem on either the client or
> >    server. Sounds unlikely, but I'm grasping for ideas here.
>
> ^^ I won't say that is out of the realm of possibility. One of the
> tests I'm going to do is run memcheck from the SuSE 9.1 boot disk --
> just to be sure there isn't a problem with the memory. It is an
> older box (500 MHz Pentium II) with two 128MB DIMMs in it. Perhaps
> one of the DIMMs gives errors only occasionally, or only at a very
> few addresses. (more on this below)
>
> > To debug, it would be great if you changed $Conf{BackupFilesOnly}
> > to point just at the 6MB file. Then repeat to verify that you
> > still get the error.
>
> ^^ Did this, and it backed up successfully.
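(A quick aside in case anyone wants to reproduce that single-file
test: it amounts to a one-line change in config.pl, which is Perl,
plus a quick check for the checksum caching Craig mentions. The path
and file name below are examples, not necessarily what your install
uses:

    # In config.pl, restrict the next backup to the one failing
    # file, e.g.:
    #   $Conf{BackupFilesOnly} = ['/home/krol/initial.repos'];

    # And check whether checksum caching is enabled in the rsync
    # arguments:
    grep -n 'checksum-seed' /etc/BackupPC/config.pl

)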
> My next move was to reset the config.pl file to my original
> settings and do an incremental backup.
>
> I had forgotten that I saved some large files on the drive (2GB
> each) since the first backup, and when the server tried to back
> these up it got the following messages:
>
> Remote[28]: ^Nλ^H^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@285 Welt 86
> Don't understand 'A:OM' from child
> Don't understand 'A:_LBMF' from child
> Don't understand 'T:*/*=0.56/60' from child
> Don't understand 'Z:286 Welt 87' from child
> Don't understand 'A:KP,SL' from child
> Don't understand 'T:*/*=0.575/60' from child
> Don't understand 'Z:287 Welt 88' from child
> .
> .
> (the whole rest of the XferLOG looks like this)
>
> I know for a fact that it got quite a ways into the backup of one
> of the 2GB files before choking here.
>
> I had actually seen this behavior before; it was one of the reasons
> I re-installed BackupPC and the required Perl modules, hoping that
> something had simply gotten corrupted. But now it rears its ugly
> head again. I may be remembering this wrong, but it seems like this
> only happens on or after the second backup. If I remove the machine
> from the list of backed-up hosts, restart BackupPC, remove the
> pc/<hostname> directory, then re-add the machine to the hosts file,
> restart BackupPC again, and do another full backup, I think I
> usually get other errors besides the "Don't understand" ones. I
> could be wrong on that, though -- it may have been happening then
> too. The "Don't understand ... from child" error seems like a
> really nasty problem (judging by my short look at the code). It
> seems like File::RsyncP is getting really lost.
>
> Another thing to keep in mind is that I'm backing up about 16GB
> total.
>
> I also got these kinds of errors ("Don't understand...") when using
> rsync 2.6.3 (the latest stable release, built from source on both
> boxes), and I also tried the latest prerelease beta, with the same
> results.
>
> > Try copying the file to a new name (so the file is identical but
> > has a new name). Does that get the error the first time? Second
> > time? I'd be curious to see the XferLOG file with
> > $Conf{XferLogLevel} set to 9. Email that to me offlist.
>
> I will try upping the XferLogLevel and doing the incremental again
> also.
>
> > Also, if you run vanilla rsync to pull these files via rsyncd, do
> > you get these same errors?
>
> I will try this also, but I suspect that they will sync fine (based
> on what I remember from my earlier testing). It seems to happen
> only after the machine has done a lot of work (usually it is well
> into the backup).
>
> I think doing the memtest is definitely in order. I tried replacing
> the NIC, in case that was the problem, but it still happened (I did
> that before the initial backup, after the re-install). I haven't
> seen any particular error messages in any of the logs in /var/log
> that make me think it is a related hardware problem.

[snip]
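(One note of context for the steps below: when I say "manually
rsync" or "scp the file over", I mean single-file pulls along these
lines. The rsyncd module name, host names, and paths are
placeholders, not my real ones:

    # Pull one file from the client's rsync daemon (box A):
    rsync -avz boxA::share/initial.repos /tmp/initial.repos

    # The same file over ssh:
    scp krol@boxA:/home/krol/initial.repos /tmp/initial.repos

    # Then compare against the md5 sum of the original on box A:
    md5sum /tmp/initial.repos

)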
OK, this thing just keeps getting weirder and weirder. Let me see if
I can lay out all the steps I've gone through.

1) The NIC on the client box died. It just stopped communicating. I
   thought I had found the solution. Replaced the NIC and prepared
   to retest.

2) memcheck had shown that one of my memory modules on the server
   had a single address with a memory error. I ran the test for
   about an hour, and this same error showed up 6 times.

3) Rebooted the box and used the kernel boot parameters

       memmap=exactmap memmap=640K@0 memmap=193M@1M memmap=60M@196M

   to remove the failing memory from use by Linux.

4) Re-ran an incremental backup. Things looked good at first, but
   then I got the "Don't understand" errors again at the end, in the
   phase 1 retry.

5) Removed the bad memory module from the backup server, just in
   case the memmap trick doesn't work like I think it should. Re-ran
   memcheck overnight -- the remaining 128MB module tested out fine
   (no errors).

6) Rebooted the backup server and tried to rsync the initial.repos
   file manually using rsync -avz -- it worked! Small twinge of hope
   appears.

7) Manually rsynced one of the 2GB files that failed before --
   failed, doh!

8) OK, I tried to scp the 2GB file over, to see if there was any way
   to make the network copy this file successfully. scp failed with
   "Disconnecting: Corrupted MAC on input". That sounds like a
   hardware problem, so I replaced the NIC on the backup server
   *again*. Retried scp: same error: "Disconnecting: Corrupted MAC
   on input".

9) Some of the info I found on this error suggested it might be a
   problem with ssh. I am running openssh-3.8p1-37.17 (a YaST Online
   Update package). So I tried to bypass ssh: I have a web server
   running on the client box, so I put the 2GB file where it could
   be hit from a URL and grabbed it with wget. The wget succeeded,
   but the md5 sum of the file is different from the original. I
   retried this, again with a successful wget, but again with a
   non-matching md5 sum (and the second sum didn't match the first,
   either, as I suspected it wouldn't). (The commands for this test
   and the next are sketched at the end of this message.)

10) Seems like a long shot, but the client box (perhaps I should
    start calling it "box A", and the BackupPC server "box B") is
    also a Samba server, so I ran smbmount on box B to mount a
    folder shared from box A. Then I used plain old cp to copy the
    2GB file from the shared folder on box A to the BackupPC server
    (box B). The cp succeeded. So I ran md5sum to check the value
    (expecting it to be bogus) -- but it wasn't! It was the
    *correct* md5 sum. Just to be sure, I copied the file with cp
    again (saving it under a different file name), ran md5sum on the
    second copy, and it's the same! I even did a cmp on the two
    copies, and of course they showed up identical.

So now I'm not sure what to do next. It seems like anything that
uses TCP is having problems. So I decided to try copying the file
with scp from another SuSE 9.1 box (box C). Guess what: scp gives
the *same* error copying from box A to box C: "Disconnecting:
Corrupted MAC on input". Hm. I think maybe I've found something. So
I smbmounted the drive shared from box A onto a directory on box C,
performed a plain cp of the file, and checked the md5 sum -- it
matches!

Next, since I have the correct file on box B thanks to the SMB-based
cp from step 10 above, I decided to scp the file from box B to box
C. Guess what... the scp was successful, and the md5 sum is correct.
It seems that whatever the problem is, it is a problem with box A.
And, of course, box A is the machine I am using to write this post.
So for now I will submit the post and reboot this machine to run
memcheck on it. I had done that before, but didn't get all the way
through it, since this box has 1.5GB of RAM. We'll see if that
produces anything.
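(For completeness, the comparisons in steps 9 and 10 went roughly
like this. Host names, the URL, and all paths are illustrative, not
the real ones:

    # Step 9: fetch the same file twice over HTTP and compare sums.
    # Both wgets "succeeded" but produced two different, wrong md5
    # sums:
    wget -O copy1.dat http://boxA/pub/bigfile.dat
    wget -O copy2.dat http://boxA/pub/bigfile.dat
    md5sum copy1.dat copy2.dat

    # Step 10: mount box A's Samba share on box B, then copy with
    # plain cp. Two copies made this way were identical and had the
    # correct md5 sum:
    smbmount //boxA/share /mnt/boxA -o username=krol
    cp /mnt/boxA/bigfile.dat /tmp/copy1.dat
    cp /mnt/boxA/bigfile.dat /tmp/copy2.dat
    md5sum /tmp/copy1.dat /tmp/copy2.dat
    cmp /tmp/copy1.dat /tmp/copy2.dat

)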