From: Jeffrey J. K. <bac...@ko...> - 2011-01-26 03:40:48
Jeffrey J. Kosowsky wrote at about 20:07:37 -0500 on Sunday, January 23, 2011:
 > Jeffrey J. Kosowsky wrote at about 19:19:54 -0500 on Sunday, January 23, 2011:
 > > I was testing some of my md5sum routines and I kept getting weird
 > > results on ARM-based computers.
 > >
 > > Specifically, the pool file md5sums were different depending on
 > > whether I computed them under Fedora 12 on an x86 machine vs. under
 > > Debian Lenny on an ARM-based computer.
 > >
 > > This obviously creates issues if you want to move your backup drive
 > > between different CPUs.
 > >
 > > I narrowed it down to Digest::MD5 by doing the following one-liner:
 > >
 > >   perl -e 'use Digest::MD5 qw(md5_hex); $file="testfile"; $size=(stat($file))[7]; $body=`cat $file`; print md5_hex($size,$body) . "\n";'
 > >
 > > This should be the same as:
 > >
 > >   perl -e '$file="testfile"; $size=(stat($file))[7]; $body=`cat $file`; print $size, $body;' | md5sum
 > >
 > > For maybe 1% of the files in my pool, the ARM machine gave the wrong
 > > answer when using Digest::MD5.
 > >
 > > So, something must be wacko in the perl implementation of Digest::MD5
 > > on ARM machines!
 >
 > Well, what do you know, Perl 5.10.0 (at least in Debian, but I think
 > upstream too) is broken on ARM processors. Something about 32-bit
 > alignment. You need to upgrade to 5.10.1 -- and now I wasted a day on
 > this... And now I need to write code to fix my pool - YUCK!

Well, I went through my pool carefully and it seems like the error affects
close to HALF of my pool files. This is a real mess and a PITA.

BUT, I wrote a perl routine that goes through the pool and/or cpool and
corrects all the entries. Specifically, it:

1. Goes through the pool and calculates the actual MD5sum path for each
   file (using my zFile2MD5 routine if it is in the cpool, which avoids
   decompressing the entire file).

2. If the calculated partial-file MD5sum differs from the current
   filename, then the routine finds the first empty spot in the chain of
   the corrected MD5sum.
   If there is already a chain there (of at least one file), the routine
   compares files (again using my faster zcompare routine if compressed)
   to see if there already is a match. If there is a match, then it is
   flagged for later correction by a program like my
   BackupPC_fixLinks.pl program. While strictly speaking there is no
   danger in having more than one copy of the same file in a chain (and
   it is necessary when nlinks > MAXLINKS), it is not efficient, so it
   is detected and flagged. Note, though, that in general you shouldn't
   have many such collisions, since if the MD5sum was broken once it was
   probably broken the whole time (unless you switched back and forth
   between broken and non-broken Perl versions).

3. The program then renames (i.e. moves) the file and intelligently
   fills in any holes in the old chain in a way that minimizes chain
   renumbering and preserves the relative ordering of chain numbering.

Note that the routine can in general be used to check and fix the
integrity of the pool/cpool, so it may be more generally useful.

The program uses routines from my jLib.pm module and requires the
latest version, which I have not yet posted (but will email it if
anybody needs it). Here, though, is the perl code for the routine
itself:

---------------------------------------------------------------------------
#!/usr/bin/perl
#============================================================= -*-perl-*-
#
# BackupPC_fixPoolMdsums: Rename/move pool files if md5sum path name invalid
#
# DESCRIPTION
#   See 'usage' for a more detailed description of what it does
#
# AUTHOR
#   Jeff Kosowsky
#
# COPYRIGHT
#   Copyright (C) 2011  Jeff Kosowsky
#
#   This program is free software; you can redistribute it and/or modify
#   it under the terms of the GNU General Public License as published by
#   the Free Software Foundation; either version 2 of the License, or
#   (at your option) any later version.
#
#   This program is distributed in the hope that it will be useful,
#   but WITHOUT ANY WARRANTY; without even the implied warranty of
#   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
#   GNU General Public License for more details.
#
#   You should have received a copy of the GNU General Public License
#   along with this program; if not, write to the Free Software
#   Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
#
#========================================================================
#
# Version 0.1, released January 2011
#
#========================================================================

use strict;
use warnings;
use lib "/usr/share/BackupPC/lib";
use BackupPC::Lib;
use BackupPC::jLib 0.4.0;  # Requires version >= 0.4.0
use File::Glob ':glob';
use Getopt::Long qw(:config no_ignore_case bundling);

my $bpc = BackupPC::Lib->new or die("BackupPC::Lib->new failed\n");
%Conf = $bpc->Conf();  # Global variable defined in jLib.pm (do not use 'my')

my $TopDir = $Conf{TopDir};
$TopDir =~ s|/+|/|;
$TopDir =~ s|/*$|/|;  # End with just one slash

my $compress = $Conf{CompressLevel};
my $pool     = $compress > 0 ? "cpool" : "pool";
my $compare  = $compress > 0 ? \&zcompare2 : \&jcompare;
my $file2md5 = $compress > 0 ? \&zFile2MD5 : \&File2MD5;
my $md5      = Digest::MD5->new;
my $MAXLINKS = $bpc->{Conf}{HardLinkMax};

# Option variables:
my $nodups;
my $outfile;
my $verbose = 0;
#$dryrun = 1;  # Global variable defined in jLib.pm (do not use 'my')
$dryrun = 0;   # Global variable defined in jLib.pm (do not use 'my')

usage() unless(
    GetOptions(
        "dryrun|d!"   => \$dryrun,
        "nodups|n"    => \$nodups,   # Treat dups as errors
        "outfile|o=s" => \$outfile,
        "verbose|v+"  => \$verbose,  # Verbosity (repeats allowed)
    ));

my ($OUT);
die "ERROR: '$outfile' already exists!\n" if -e $outfile;
open($OUT, '>', "$outfile")
    or die "ERROR: Can't open '$outfile' for writing! ($!)\n";

chdir $TopDir;
my @partialbackups = glob("pc/*/NewFileList");
die("Error: Pool conflicts will occur if NewFileList present:\n  "
    . join("\n  ", @partialbackups) . "\n") if @partialbackups;

system("$bpc->{InstallDir}/bin/BackupPC_serverMesg status jobs >/dev/null 2>&1");
die "Dangerous to run when BackupPC is running!!!\n" unless ($? >> 8) == 1;

my $total      = 0;
my $errors     = 0;
my $fixed      = 0;
my $chaindups  = 0;
my $norename   = 0;
my $norenumber = 0;

scan_pool($pool);

printf("Total=%d Errors=%d [Fixed=%d, NotFixed=%d]%s\n",
       $total, $errors, $fixed, ($errors - $fixed), $dryrun ? " DRY-RUN" : "");
printf("Chaindups=%d NoRename=%d NoRenumber=%d\n",
       $chaindups, $norename, $norenumber);
exit;

#######################################################################
# Run through the pool looking for misnamed md5sum paths
sub scan_pool
{
    my ($fpool) = @_;
    my ($dh, @fstat);

    return unless glob("$fpool/[0-9a-f]");  # No entries in pool
    my @hexlist = ('0', '1', '2', '3', '4', '5', '6', '7',
                   '8', '9', 'a', 'b', 'c', 'd', 'e', 'f');
    my ($idir, $jdir, $kdir);
    foreach my $i (@hexlist) {
        print STDERR "\n**$fpool/$i: " if $verbose >= 2;
        $idir = $fpool . "/" . $i . "/";
        foreach my $j (@hexlist) {
            print STDERR "$j " if $verbose >= 2;
            $jdir = $idir . $j . "/";
            foreach my $k (@hexlist) {
                $kdir = $jdir . $k . "/";
                unless(opendir($dh, $kdir)) {
                    warn "Can't open pool directory: $kdir\n" if $verbose >= 4;
                    next;
                }
                # Sort directory entries so that chains are ordered lowest to
                # highest - This preserves sequential order between source and
                # target chains PLUS ensures that we fill holes correctly and
                # most efficiently
                my @entries = sort {poolname2number($a) cmp poolname2number($b)}
                    (readdir($dh));
                close($dh);
                warn "POOLDIR: $kdir (" . ($#entries - 1) . " files)\n"
                    if $verbose >= 3;

                my $chaindeletes = 0;
                my $chainstart;
                my $lastdigest = '';
                foreach (@entries) {
                    next if /^\.\.?$/;  # skip dot entries (. and ..)
                    my $origfile = ${kdir} . $_;
                    unless(m|^([0-9a-f]+)(_[0-9]*)?|) {
                        warn "ERROR: '$origfile' is not a valid pool entry\n";
                        next;
                    }
                    $total++;
                    my $origdigest = $1;
                    my $newdigest =
                        $file2md5->($bpc, $md5, $origfile, -1, $compress);
                    if($newdigest eq "-1") {
                        warn "ERROR: Can't calculate md5sum name for: $origfile\n";
                        next;
                    }
                    if($newdigest ne $origdigest) {
                        $errors++;
                        if($origdigest ne $lastdigest) { # New chain
                            # So go back and renumber last chain to remove holes
                            renumber_pool_chain($chainstart, $chaindeletes)
                                if $chaindeletes > 0;
                            $lastdigest = $origdigest; # Reset to new chain base
                            $chaindeletes = 0;
                            $chainstart = $origfile; # lowest element of chain
                                # since we are sorting directory in chain order
                        }
                        if(fix_entry($origfile, $newdigest) == 1) { $chaindeletes++ }
                    }
                }
                # Check in case chain still going when 'foreach' ran out of
                # entries
                renumber_pool_chain($chainstart, $chaindeletes)
                    if $chaindeletes > 0;
            }
        }
    }
}

# Rename/move pool chain entry $source to first open position
# in $digest chain if permitted. Renumber source chain as
# needed after the move
sub fix_entry
{
    my ($source, $digest) = @_;

    my $i = -1;
    my @dups = ();
    my $poolpath = my $poolbase = $bpc->MD52Path($digest, $compress);
    while( -f $poolpath ) { # Iterate through pool chain with same md5sum
        if((stat(_))[3] < $MAXLINKS &&
           ! $compare->($source, $poolpath)) { # Matches existing pool entry
            push(@dups, $i);
        }
        $poolpath = $poolbase . "_" . ++$i;
    }
    my $dups = @dups ? ' CHAINDUPS(' . join(',', @dups) . ')' : '';
    $poolpath =~ m|^$TopDir/?(.*)|;
    my $target = $1;
#    print "$source $target [$errors/$total]$dups\n";

    if(@dups) {
        warn "WARN: $dups: $source->$target\n" if $verbose >= 1;
        $chaindups++;
        if($nodups) { # Don't fix dups - no changes to pool
            print $OUT "$source $target $dups\n";
            return 0;
        }
    }
    if(-e $target || !jrename($source, $target)) { # Not renamed
        warn "ERROR: Can't rename: $source->$target\n" if $verbose >= 1;
        print $OUT "$source $target NO_RENAME$dups\n";
        $norename++;
        return -1;
    }
#    unless(delete_pool_file($source) == 1) { # Renamed but source chain not renumbered
#        warn "ERROR: Can't renumber after rename: $source --> $target\n"
#            if $verbose >= 1;
#        print $OUT "$source $target NO_RENUMBER$dups\n";
#        $norenumber++;
#        return -2;
#    }

    # Fixed without errors
    print $OUT "$source $target FIXED$dups\n";
    $fixed++;
    return 1;
}

sub usage
{
    print STDERR <<EOF;
usage: $0 [options] --outfile|-o <outfile>

Options:
  --dryrun|-d     Dry-run; negate with --nodryrun
  --nodups|-n     Don't rename/remove if a file with the same contents
                  is found in the target chain (see below for details)
  --verbose|-v    Verbose (repeat for more verbosity)

DESCRIPTION:
  Find and fix md5sum pool name errors in pool and cpool.

DETAILS:
  Recurses through the pool and cpool trees to test whether the md5sum
  name of each pool file is correct relative to the file data. If not,
  the program attempts to rename (i.e. move) it to its proper md5sum
  name. If there already are pool files with the new name, it is moved
  to the end of the target chain. After the move, the source chain is
  renumbered (if needed) to fill in the holes left by the move. Note
  that the relative ordering of each chain is preserved.

  If the contents of the file match the contents of any of the files in
  the target chain, the duplicate suffix numbers are noted. If the
  --nodups|-n flag is set, then the pool file is not renamed in this
  case; the program just notes where it would have gone if there were
  no chain dups.

  Note: it is not generally an error to have two pool entries in the
  same chain with the same data (in fact, it occurs intentionally when
  you exceed MAXLINKS); it just may waste some space. My routine
  BackupPC_fixLinks.pl can correct such duplicates later if that is an
  issue. In any case, if all your misnumbering was consistent, you
  won't have this situation anyway.

  <outfile> records all the changes made plus appends a status code:
    FIXED = pool file moved/renamed and original chain renumbered if
            needed.
    DUPS(n1,n2,n3) = Signals duplicates in the target chain and lists
            the suffixes (-1 = no suffix). Whether or not the file was
            actually moved in this case (and hence whether the md5sum
            was fixed) depends on the value of the --nodups flag.
    NO_RENAME = Signals an error in renaming/moving the pool file. The
            md5sum name was thus not corrected.
    NO_RENUMBER = The pool file was renamed/moved *but* there was an
            error in renumbering the source chain to fill in the hole
            left by the move.
EOF
exit(1);
}
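P.S. If you want to check whether your own machine's Digest::MD5 is
affected before trusting your pool, here is a rough shell sketch of the
same comparison I did above (my own, not part of BackupPC; it assumes
perl and the coreutils md5sum are on your PATH, and a pass on one small
file is suggestive, not proof, since the bug only hit some files):

```shell
# Compare Perl's Digest::MD5 digest of a sample file against the
# coreutils md5sum binary. Disagreement means your Perl build has the
# broken Digest::MD5.
tmp=$(mktemp)
printf 'hello world' > "$tmp"
perl_sum=$(perl -MDigest::MD5=md5_hex -0777 -ne 'print md5_hex($_)' "$tmp")
sys_sum=$(md5sum "$tmp" | cut -d' ' -f1)
rm -f "$tmp"
if [ "$perl_sum" = "$sys_sum" ]; then
    echo "OK: Digest::MD5 agrees with md5sum"
else
    echo "BROKEN: $perl_sum != $sys_sum"
fi
```

Running this against a handful of real pool files (rather than a tiny
test string) gives better coverage, since the alignment bug did not
affect every input.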
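And for anyone not familiar with the pool layout the script walks:
BackupPC (at least the 3.x layout) files each pool entry under the
first three hex digits of its digest, and collisions get _0, _1, ...
suffixes forming the "chain" referred to throughout. A toy illustration
of the mapping (md52path here is a hypothetical shell stand-in for
BackupPC::Lib's MD52Path, not a real BackupPC command):

```shell
# Sketch of the digest-to-path mapping: the first three hex characters
# of the digest pick the three directory levels under the pool root.
md52path() {
    d=$2
    printf '%s/%s/%s/%s/%s\n' "$1" \
        "$(printf '%s' "$d" | cut -c1)" \
        "$(printf '%s' "$d" | cut -c2)" \
        "$(printf '%s' "$d" | cut -c3)" \
        "$d"
}

md52path cpool d41d8cd98f00b204e9800998ecf8427e
# -> cpool/d/4/1/d41d8cd98f00b204e9800998ecf8427e
# A second file with the same partial-file digest would sit beside it as
#   cpool/d/4/1/d41d8cd98f00b204e9800998ecf8427e_0
```

So when the script "renumbers a chain", it is shuffling those _N
suffixes within one such directory entry to close the hole left by a
moved file.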