From: cppjavaperl <cpp...@ya...> - 2005-03-11 20:23:28
--- cppjavaperl <cpp...@ya...> wrote:
> --- Craig Barratt <cba...@us...> wrote:
> > cppjavaperl writes:
> > >
> > > Anyway, my first full backup was *almost* successful. I had 144
> > > files that failed on phase 0, with the message:
> > >
> > >     "md4 doesn't match: will retry in phase 1; file removed"
> > >
> > > OK, I see from earlier messages on this mailing list that Craig
> > > (Barratt) said this is OK -- [hello Craig, and thanks for all
> > > your efforts on such a great piece of software!] -- and that the
> > > files will be retried in the second phase (phase 1).
> > >
> > > So things went pretty well, but not perfectly. 138 of the files
> > > were successful in phase 1. However, 6 of the files also failed
> > > in phase 1, and thus were not backed up. For each one I get the
> > > error:
> > >
> > >     "fatal error: md4 doesn't match on retry; file removed"
> > >
> > > Here are the permissions, sizes, filenames, etc. for those 6
> > > files:
> > >
> > > -rw-r--r-- 1 krol users 205871492 2005-01-15 02:09 initial.repos
> > > -rwxrwxrwx 1 krol users   6306304 2001-07-26 09:58 US_co_dp_Lowres.MAP
> > > -rwxrwxrwx 1 krol users  82740262 2003-05-05 12:38 MX5_EvalData.Exe
> > > -rw-r--r-- 1 root root   39384395 2004-12-22 06:36 kernel-source-2.6.5-7.111.19.i586.rpm
> > > -rw-r--r-- 1 root root   40232872 2005-02-17 16:49 kernel-source-2.6.5-7.145.i586.rpm
> > > -rw-r--r-- 1 root root   50861983 2004-12-22 06:14 OpenOffice_org-1.1.1-23.4.i586.patch.rpm
> > >
> > > Note that the smallest file is only a little over 6MB, so I
> > > don't think this is a "large file" problem. The initial.repos
> > > file (the largest, at 205MB) is a dump of a Subversion
> > > repository. The kernel-source* and OpenOffice_org* files are
> > > updates downloaded using YOU (YaST Online Update). These are
> > > decent-sized files, but 50MB or so is not that unusual these
> > > days.
> > >
> > > I haven't tried running an incremental yet (to see if the files
> > > will back up on the second try). I think I'll hold off on that
> > > for a bit to see if anyone here has any comments, suggestions,
> > > etc.
> >
> > This is not good. You are right -- statistically there should be
> > almost no chance that a 6MB file would fail even on the first
> > pass.
> >
> > Is this the very first full, or is it a second or later full?
> >
> > The only three things I can think of are:
> >
> >  - Some bug in File::RsyncP or BackupPC. Do you have checksum
> >    caching enabled (ie: --checksum-seed=32761 in
> >    $Conf{RsyncArgs})? If so, try turning it off.
> >
> >  - The file is being changed during the transfer; but the mtimes
> >    don't suggest this.
> >
> >  - Some flakey hardware or memory problem on either the client or
> >    server. Sounds unlikely, but I'm grasping for ideas here.
>
> ^^ I won't say that is out of the realm of possibility. One of the
> tests I'm going to do is run memcheck from the SuSE 9.1 boot disk --
> just to be sure there isn't a problem with the memory. It is an
> older box (500 MHz Pentium II) with two 128MB DIMMs in it. Perhaps
> one of the DIMMs gives errors only occasionally, or only at a very
> few addresses. (more on this below)
>
> > To debug, it would be great if you changed $Conf{BackupFilesOnly}
> > to point just at the 6MB file. Then repeat to verify that you
> > still get the error.
>
> ^^ Did this, and it backed up successfully.
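(A quick aside in case anyone wants to reproduce that single-file
test: it amounts to a one-line change in config.pl, which is Perl,
plus a quick check for the checksum caching Craig mentions. The path
and file name below are examples, not necessarily what your install
uses:

    # In config.pl, restrict the next backup to the one failing
    # file, e.g.:
    #   $Conf{BackupFilesOnly} = ['/home/krol/initial.repos'];

    # And check whether checksum caching is enabled in the rsync
    # arguments:
    grep -n 'checksum-seed' /etc/BackupPC/config.pl

)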
> My next move was to reset the config.pl file to my original
> settings and do an incremental backup.
>
> I had forgotten that I saved some large files on the drive (2GB
> each) since the first backup, and when the server tried to back
> these up it got the following messages:
>
> Remote[28]: ^Nλ^H^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@285 Welt 86
> Don't understand 'A:OM' from child
> Don't understand 'A:_LBMF' from child
> Don't understand 'T:*/*=0.56/60' from child
> Don't understand 'Z:286 Welt 87' from child
> Don't understand 'A:KP,SL' from child
> Don't understand 'T:*/*=0.575/60' from child
> Don't understand 'Z:287 Welt 88' from child
> .
> .
> (the whole rest of the XferLOG looks like this)
>
> I know for a fact that it got quite a ways into the backup of one
> of the 2GB files before choking here.
>
> I had actually seen this behavior before; it was one of the reasons
> I re-installed BackupPC and the required Perl modules, hoping that
> something had simply gotten corrupted. But now it rears its ugly
> head again. I may be remembering this wrong, but it seems like this
> only happens on or after the second backup. If I remove the machine
> from the list of backed-up hosts, restart BackupPC, remove the
> pc/<hostname> directory, then re-add the machine to the hosts file,
> restart BackupPC again, and do another full backup, I think I
> usually get other errors besides the "Don't understand" ones. I
> could be wrong on that, though -- it may have been happening then
> too. The "Don't understand ... from child" error seems like a
> really nasty problem (judging by my short look at the code). It
> seems like File::RsyncP is getting really lost.
>
> Another thing to keep in mind is that I'm backing up about 16GB
> total.
>
> I also got these kinds of errors ("Don't understand...") when using
> rsync 2.6.3 (the latest stable release, built from source on both
> boxes), and I also tried the latest prerelease beta, with the same
> results.
>
> > Try copying the file to a new name (so the file is identical but
> > has a new name). Does that get the error the first time? Second
> > time? I'd be curious to see the XferLOG file with
> > $Conf{XferLogLevel} set to 9. Email that to me offlist.
>
> I will try upping the XferLogLevel and doing the incremental again
> also.
>
> > Also, if you run vanilla rsync to pull these files via rsyncd, do
> > you get these same errors?
>
> I will try this also, but I suspect that they will sync fine (based
> on what I remember from my earlier testing). It seems to happen
> only after the machine has done a lot of work (usually it is well
> into the backup).
>
> I think doing the memtest is definitely in order. I tried replacing
> the NIC, in case that was the problem, but it still happened (I did
> that before the initial backup, after the re-install). I haven't
> seen any particular error messages in any of the logs in /var/log
> that make me think it is a related hardware problem.

[snip]
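(One note of context for the steps below: when I say "manually
rsync" or "scp the file over", I mean single-file pulls along these
lines. The rsyncd module name, host names, and paths are
placeholders, not my real ones:

    # Pull one file from the client's rsync daemon (box A):
    rsync -avz boxA::share/initial.repos /tmp/initial.repos

    # The same file over ssh:
    scp krol@boxA:/home/krol/initial.repos /tmp/initial.repos

    # Then compare against the md5 sum of the original on box A:
    md5sum /tmp/initial.repos

)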
OK, this thing just keeps getting weirder and weirder. Let me see if
I can lay out all the steps I've gone through.

1) The NIC on the client box died. It just stopped communicating. I
   thought I had found the solution. Replaced the NIC and prepared
   to retest.

2) memcheck had shown that one of my memory modules on the server
   had a single address with a memory error. I ran the test for
   about an hour, and this same error showed up 6 times.

3) Rebooted the box and used the kernel boot parameters

       memmap=exactmap memmap=640K@0 memmap=193M@1M memmap=60M@196M

   to remove the failing memory from use by Linux.

4) Re-ran an incremental backup. Things looked good at first, but
   then I got the "Don't understand" errors again at the end, in the
   phase 1 retry.

5) Removed the bad memory module from the backup server, just in
   case the memmap trick doesn't work like I think it should. Re-ran
   memcheck overnight -- the remaining 128MB module tested out fine
   (no errors).

6) Rebooted the backup server and tried to rsync the initial.repos
   file manually using rsync -avz -- it worked! Small twinge of hope
   appears.

7) Manually rsynced one of the 2GB files that failed before --
   failed, doh!

8) OK, I tried to scp the 2GB file over, to see if there was any way
   to make the network copy this file successfully. scp failed with
   "Disconnecting: Corrupted MAC on input". That sounds like a
   hardware problem, so I replaced the NIC on the backup server
   *again*. Retried scp: same error: "Disconnecting: Corrupted MAC
   on input".

9) Some of the info I found on this error suggested it might be a
   problem with ssh. I am running openssh-3.8p1-37.17 (a YaST Online
   Update package). So I tried to bypass ssh: I have a web server
   running on the client box, so I put the 2GB file where it could
   be hit from a URL and grabbed it with wget. The wget succeeded,
   but the md5 sum of the file is different from the original. I
   retried this, again with a successful wget, but again with a
   non-matching md5 sum (and the second sum didn't match the first,
   either, as I suspected it wouldn't). (The commands for this test
   and the next are sketched at the end of this message.)

10) Seems like a long shot, but the client box (perhaps I should
    start calling it "box A", and the BackupPC server "box B") is
    also a Samba server, so I ran smbmount on box B to mount a
    folder shared from box A. Then I used plain old cp to copy the
    2GB file from the shared folder on box A to the BackupPC server
    (box B). The cp succeeded. So I ran md5sum to check the value
    (expecting it to be bogus) -- but it wasn't! It was the
    *correct* md5 sum. Just to be sure, I copied the file with cp
    again (saving it under a different file name), ran md5sum on the
    second copy, and it's the same! I even did a cmp on the two
    copies, and of course they showed up identical.

So now I'm not sure what to do next. It seems like anything that
uses TCP is having problems. So I decided to try copying the file
with scp from another SuSE 9.1 box (box C). Guess what: scp gives
the *same* error copying from box A to box C: "Disconnecting:
Corrupted MAC on input". Hm. I think maybe I've found something. So
I smbmounted the drive shared from box A onto a directory on box C,
performed a plain cp of the file, and checked the md5 sum -- it
matches!

Next, since I have the correct file on box B thanks to the SMB-based
cp from step 10 above, I decided to scp the file from box B to box
C. Guess what... the scp was successful, and the md5 sum is correct.
It seems that whatever the problem is, it is a problem with box A.
And, of course, box A is the machine I am using to write this post.
So for now I will submit the post and reboot this machine to run
memcheck on it. I had done that before, but didn't get all the way
through it, since this box has 1.5GB of RAM. We'll see if that
produces anything.
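(For completeness, the comparisons in steps 9 and 10 went roughly
like this. Host names, the URL, and all paths are illustrative, not
the real ones:

    # Step 9: fetch the same file twice over HTTP and compare sums.
    # Both wgets "succeeded" but produced two different, wrong md5
    # sums:
    wget -O copy1.dat http://boxA/pub/bigfile.dat
    wget -O copy2.dat http://boxA/pub/bigfile.dat
    md5sum copy1.dat copy2.dat

    # Step 10: mount box A's Samba share on box B, then copy with
    # plain cp. Two copies made this way were identical and had the
    # correct md5 sum:
    smbmount //boxA/share /mnt/boxA -o username=krol
    cp /mnt/boxA/bigfile.dat /tmp/copy1.dat
    cp /mnt/boxA/bigfile.dat /tmp/copy2.dat
    md5sum /tmp/copy1.dat /tmp/copy2.dat
    cmp /tmp/copy1.dat /tmp/copy2.dat

)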