
#174 corrupted dumps of 2TB+ filesystems....

Status: open
Owner: nobody
Labels: None
Priority: 9
Updated: 2024-09-08
Created: 2022-06-21
Creator: Greg Oster
Private: No

Hi

While validating a dump made of a 16TB filesystem, I observed that the validation failed badly. Long story short: any backup made of a 2TB+ filesystem is likely incomplete and will contain corrupted data/files.

Enclosed is a set of patches that at least partially addresses this issue -- the changes are necessary, but possibly not sufficient. I've validated dump/restore on a test set of data that was failing before, and have a larger dump/restore in progress for further testing/validation. These diffs are against the most recent version of dump (0.4b47), and testing was done on an Ubuntu 20.04.4 (x86_64) system. The filesystem being dumped looked like:
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/VolGroup-Foo 19T 16T 2.2T 88% /u1

Basically, the corruption occurs on files whose logical block addresses don't fit in 32 bits. The requested address overflows and pulls in a block that doesn't belong to the original file. Also note that we need to use ext2fs_block_iterate3() instead of ext2fs_block_iterate2(), as the latter cannot cope with 64-bit block numbers.
I'm trying to build a sparse filesystem to make replication easy, but neither debugfs nor dd seems to want to deal with 64-bit offsets/sizes either.
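
To illustrate the failure mode (a minimal sketch of the wrap-around, not dump's actual code): a block number above 2^32 that passes through a 32-bit variable -- e.g. the blk_t used by the ext2fs_block_iterate2() callbacks -- is silently truncated, so dump reads a block from somewhere else in the filesystem:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint64_t real_blk = 5000000000ULL;      /* a block address beyond 2^32 */
        uint32_t wrapped  = (uint32_t)real_blk; /* what a 32-bit blk_t retains */

        printf("real %llu -> wrapped %u\n",
               (unsigned long long)real_blk, wrapped);
        /* prints: real 5000000000 -> wrapped 705032704; dump would then read
           block 705032704, which belongs to some other file (or to nothing) */
        return 0;
    }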

In the meantime, at an absolute minimum, the existing code should be modified to detect that the filesystem being dumped is larger than 2TB and refuse to run.
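
For what it's worth, a minimal sketch of such a guard (assuming dump's already-open ext2_filsys handle "fs" and the device name "disk"; a real patch would use dump's own error reporting rather than a bare exit):

    /* sketch only: refuse to run on filesystems big enough to hit the
       32-bit block-number overflow, until the 64-bit fixes are merged */
    blk64_t  blocks = ext2fs_blocks_count(fs->super);
    uint64_t bytes  = (uint64_t)blocks * fs->blocksize;

    if (bytes > (2ULL << 40)) {     /* 2TB, the conservative cut-off */
        fprintf(stderr, "DUMP: %s is larger than 2TB; refusing to dump "
                        "(32-bit block-number overflow)\n", disk);
        exit(1);
    }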

Thanks.

Later...
Greg Oster

3 Attachments

Discussion

  • Greg Oster - 2022-06-21

    My math here might be wrong... if 4294967296 (2^32) is the maximum 32-bit logical block address, then the corruption wouldn't be seen until LBAs exceed that value. I.e. for a block size of 4K, that would mean filesystems larger than 16TB... which would help explain why this hasn't been reported before.
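
    (For reference: 2^32 blocks x 4096 bytes/block = 2^44 bytes = 16TiB, so with 4K blocks only block addresses past the 16TiB mark can wrap.)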

    Later...
    Greg Oster

     
  • Greg Oster - 2022-06-23

    8.5TB of data was successfully dumped/restored with the submitted patches in use. Without the patches, a dump/restore of this data set produces thousands of validation errors.

    Later...
    Greg Oster

     
  • Greg Oster - 2022-06-29

    These changes are necessary, but not sufficient. A multi-tape dump looks like it is corrupting a file that spans two tapes. The error seen is:
    Incorrect block for <filename> at 11432470600 blocks
    Incorrect block for <filename> at 11432470601 blocks
    ...
    Incorrect block for <filename> at 11432470790 blocks
    Incorrect block for <filename> at 11432470791 blocks

    When the 16TB restore finishes I'll know if this is the only file that is corrupt. [UPDATE: 16TB restore finished. 'diff' showed that only the one file above (which spanned tapes) was corrupt.]

    I suspect that to fix this we'll need to modify compat/include/protocols/dumprestore.h to bump int32_t c_firstrec; to an int64_t. But such a change will need to be made in a backwards-compatible way so old backups aren't rendered obsolete.
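
    For illustration only, one shape such a compatibility scheme could take (the flag bit and field names below are invented, not the actual dumprestore.h layout):

    #include <stdint.h>

    /* sketch: keep the 32-bit field so old restore binaries keep working, and
       add a 64-bit copy guarded by a new c_flags bit (a real value would have
       to be chosen so it doesn't collide with the existing DR_* bits) */
    #define DR_FIRSTREC64   0x0800              /* hypothetical flag bit */

    struct hdr_compat {
        int32_t  c_flags;                       /* existing flags word */
        int32_t  c_firstrec;                    /* existing field, wraps */
        int64_t  c_firstrec64;                  /* new field, from spare space */
    };

    /* restore prefers the 64-bit value when the writer marked it valid, so
       tapes written by old dump binaries still restore unchanged */
    static int64_t first_record(const struct hdr_compat *h)
    {
        if (h->c_flags & DR_FIRSTREC64)
            return h->c_firstrec64;
        return (uint32_t)h->c_firstrec;         /* legacy value, treat as unsigned */
    }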

    Fixing the above should also allow an easy fix for the outstanding dump progress "% done" issue.

    Later...
    Greg Oster

     

    Last edit: Greg Oster 2022-07-01
  • Tim Woodall - 2024-09-04

    Thanks for this! I've managed to generate a test case that doesn't require terabytes of data, only about 3GB of disk space, to reproduce:

    (This assumes you have no loop devices in use - it will trash them if there are!)

    mkdir -p d1.mnt
    mkdir -p d2.mnt
    mkdir -p big
    
    rm -f d1
    truncate -s 3G d1
    losetup -f d1
    mkfs.ext4 /dev/loop0
    losetup -d /dev/loop0
    
    rm -f d2
    truncate -s 1G d2
    losetup -f d2
    mkfs.ext4 /dev/loop0
    losetup -d /dev/loop0
    
    mount -o loop d1 d1.mnt/
    mount -o loop d2 d2.mnt/
    
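    # the 15T PV files are sparse, so they fit inside the small 3G/1G images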
    truncate -s 15T d1.mnt/pv
    truncate -s 15T d2.mnt/pv
    losetup -f d1.mnt/pv
    losetup -f d2.mnt/pv
    
    vgcreate vg30T /dev/loop2 /dev/loop3
    
    lvcreate -n big -l 7860000 vg30T
    
    mkfs.ext4 /dev/vg30T/big
    
    mount /dev/vg30T/big big
    
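    # pre-allocate ~20T so the next file's blocks land above 2^32 (16TiB with 4K blocks)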
    fallocate -l 10T big/bigfile1
    fallocate -l 10T big/bigfile2
    
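    # bigblock is a small real file written after the pre-allocation, so its data blocks sit beyond 2^32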
    dd if=/dev/urandom of=big/bigblock bs=1K count=10K
    
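    # confirm bigblock's block numbers really are above 4294967296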
    debugfs -R "stat bigblock" /dev/vg30T/big | cat
    
    rm big/bigfile1 big/bigfile2
    
    sync
    
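    # dump and immediately compare against the live copy; without the patch, bigblock miscompares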
    dump -v -0 /dev/vg30T/big -f - | restore -C -D big/ -f -
    
    umount big
    vgchange -an vg30T
    
    losetup -d /dev/loop3
    losetup -d /dev/loop2
    umount d1.mnt
    umount d2.mnt
    
    rm -f d1
    rm -f d2
    

    And this is the result without this patch:

    dump -v -0 /dev/vg30T/big -f - | restore -C -D big/ -f -
      DUMP: Date of this level 0 dump: Wed Sep  4 19:18:43 2024
      DUMP: Dumping /dev/vg30T/big (an unlisted file system) to standard output
      DUMP: Excluding inode 8 (journal inode) from dump
      DUMP: Excluding inode 7 (resize inode) from dump
      DUMP: Label: none
      DUMP: Writing 10 Kilobyte records
      DUMP: mapping (Pass I) [regular files]
      DUMP: mapping (Pass II) [directories]
      DUMP: estimated 133100 blocks.
      DUMP: Volume 1 started with block 1 at: Wed Sep  4 19:18:44 2024
    Dump   date: Wed Sep  4 19:18:43 2024
    Dumped from: the epoch
    Level 0 dump of an unlisted file system on dirac.home.woodall.me.uk:/dev/vg30T/big
    Label: none
      DUMP: dumping (Pass III) [directories]
      DUMP: dumping directory inode 2
      DUMP: dumping directory inode 11
      DUMP: dumping (Pass IV) [regular files]
      DUMP: dumping regular inode 14
    filesys = big/
    ./bigblock: tape and disk copies are different
      DUMP: Volume 1 completed at: Wed Sep  4 19:18:46 2024
      DUMP: Volume 1 133090 blocks (129.97MB)
      DUMP: Volume 1 took 0:00:02
      DUMP: Volume 1 transfer rate: 66545 kB/s
      DUMP: 133090 blocks (129.97MB)
      DUMP: finished in 2 seconds, throughput 66545 kBytes/sec
      DUMP: Date of this level 0 dump: Wed Sep  4 19:18:43 2024
      DUMP: Date this dump completed:  Wed Sep  4 19:18:46 2024
      DUMP: Average transfer rate: 66545 kB/s
      DUMP: DUMP IS DONE
    Some files were modified!  1 compare errors
    

    There's another serious bug related to EXT2_EXTENT_FLAGS_UNINIT which I've got a fix for (and which might be the cause of bug 175).

    There's also an issue with the verify of long symlinks (it doesn't affect the restore, only a verify like the one done in the test case above).

    (There's also a longstanding bug related to verify and the counting of extended attributes, for which there's a fix in the Debian package that doesn't appear to be here.)

     
  • Greg Oster - 2024-09-04

    You're most welcome! Thanks for coming up with a small test case -- I switched from 'dump' to 'restic' for backups at about the same time that I reported the issue, and so haven't needed to chase this problem further.
    Later...
    Greg Oster

     
  • Tim Woodall - 2024-09-08

    There was a minor bug in the original patches which I've fixed in the attached patch.
    I had to add a bit of extra logging to actually show that the test case was dumping an EA block >2^32:

    dumping EA (block) in inode #13 block=4312435892

     
  • Tim Woodall - 2024-09-08

    A somewhat modified test case that runs a bit quicker.

     
