Hi!
I observed a test failure when running the testsuite for dd-rescue-1.99.20 and, at first, couldn't reproduce it. After some further prodding, I found that test_sparse.sh fails for me, but usually only after many iterations, and the failure mode isn't always the same.
This is one of them:
~/bugs/dd-rescue/dd_rescue-1.99.20 $ make -j32 -l32 && make -j32 -l32 check
~/bugs/dd-rescue/dd_rescue-1.99.20 $ while true ; do LD_LIBRARY_PATH=. ./test_sparse.sh "-L ./libddr_crypt.so=AES192-CTR:weakrnd:pbkdf2:pass=ABC:" "encrypt" "decrypt" 8388612 || break ; done
[...]
./dd_rescue -L ./libddr_crypt.so=AES192-CTR:weakrnd:pbkdf2:pass=ABC:decrypt testfile.copy2 testfile.copy
dd_rescue: (info): Using softbs=128.0kiB, hardbs=4.0kiB
dd_rescue: (warning): crypt (-1): Don't specify sensitive data on the command line!
dd_rescue: (info): expect to copy 8192.0kiB from testfile.copy2
dd_rescue: (info): crypt ( 0): Derived salt from testfile.copy2=0000000000800004
dd_rescue: (info): crypt ( 0): Generate KEY and IV from same passwd/salt
dd_rescue: (info): read testfile.copy2 (8192.0kiB): EOF
dd_rescue: (info): Summary for testfile.copy2 -> testfile.copy
dd_rescue: (info): ipos: 8192.0k, opos: 8192.0k, xferd: 8192.0k
errs: 0, errxfer: 0.0k, succxfer: 8192.0k
+curr.rate: 502978kB/s, avg.rate: 502978kB/s, avg.load: 83.6%
>-...-....-....-....-....-....-....-....--< 100% TOT: 0:00:00
8196 testfile.copy2
./dd_rescue -a -b 16k -L ./libddr_crypt.so=AES192-CTR:weakrnd:pbkdf2:pass=ABC:encrypt testfile testfile.copy2
dd_rescue: (info): Using softbs=16.0kiB, hardbs=4.0kiB
dd_rescue: (warning): crypt (-1): Don't specify sensitive data on the command line!
dd_rescue: (info): expect to copy 8192.0kiB from testfile
dd_rescue: (info): crypt ( 0): Derived salt from testfile.copy2=0000000000800004
dd_rescue: (info): crypt ( 0): Generate KEY and IV from same passwd/salt
dd_rescue: (info): read testfile (8192.0kiB): EOF
dd_rescue: (info): Summary for testfile -> testfile.copy2
dd_rescue: (info): ipos: 8192.0k, opos: 8192.0k, xferd: 8192.0k
errs: 0, errxfer: 0.0k, succxfer: 8192.0k
+curr.rate: 446918kB/s, avg.rate: 446918kB/s, avg.load: 75.7%
>-----------------------------------------< 100% TOT: 0:00:00
dd_rescue: (warning): crypt ( 0): Enc alignment error! (8388612-0)=8388612 4/4
dd_rescue: libddr_crypt.c:1532: crypt_blk_cb: Assertion `left < BLKSZ-state->inbuf' failed.
./test_sparse.sh: line 14: 3107568 Aborted (core dumped) ./dd_rescue $1 $2$3 ${TESTFILE} ${TESTFILE}.copy2
ERROR 198: Error with sparse
This particular build was with GCC 13 but I first observed it when testing GCC 15 (not yet released):
```
$ ./dd_rescue --version
dd_rescue Version 1.99.20, kurt@garloff.de, GNU GPL v2/v3
(DD_RESCUE_1_99_20)
(compiled Feb 14 2025 02:37:06 by gcc-13 (Gentoo Hardened 13.3.1_p20250207 p2) 13.3.1 20250207)
(features: O_DIRECT dl/libfallocate fallocate splice fitrim xattr rdrnd sha aes avx2)
dd_rescue is free software. It's protected by the terms of GNU GPL v2 or v3
(at your option).
```
Backtrace for the crash:
```
Using host libthread_db library "/usr/lib64/libthread_db.so.1".
Core was generated by `./dd_rescue -a -b 16k -L ./libddr_crypt testfile testfile.copy2'.
Program terminated with signal SIGABRT, Aborted.
44 return INTERNAL_SYSCALL_ERROR_P (ret) ? INTERNAL_SYSCALL_ERRNO (ret) : 0;
(gdb) bt
at libddr_crypt.c:1436
fst=fst@entry=0x5d84626e4200 <_fstate.2>, prg=prg@entry=0x5d84626e41c0 <_progress.1>, dop=0x5d84626e4270 <_dpopts.4>) at dd_rescue.c:1904
dst=0x5d84626e4260 <_dpstate.3>, closelog=closelog@entry=0 '\000') at dd_rescue.c:1482
```
Sorry, I can't seem to edit the first post to fix the formatting.
It usually takes 30s or so for the loop running on a fast machine (AMD Ryzen 3950X) to hit a failure, sometimes longer (1m, maybe up to 3m).
Another example of a failure is:
This is with `linux-6.13.2` and I've hit it on a `btrfs` and a `tmpfs` partition. Let me know what more information I can share or obtain for you. Happy to try things too. Thanks!

Thanks for the report!
It looks like it hits sporadically when using 16k blocks during both de- and encryption. Interestingly, not even a 2nd plugin (like de/compression) seems to be involved, which I would have suspected, as that complicates things and I had a bit of work before releasing 1.99.20 to get all corner cases right there. I'll let you know what I can find.
Hmm, tested on two machines (both x86-64, AMD Zen 3 and Zen 4):
Ubuntu 24.04 (gcc-13.3) on bare metal and openSUSE-15.6 (with a self-compiled gcc-15) in a VM.
Running this for ~20 mins on either machine did not yield any error ...
Any hints on what may be special in your setup? Can you reproduce this on several different setups (distributions, CPUs, compilers, ...)?
Thanks for taking a look Kurt.
I'd actually started off assuming it was either a GCC bug or at least specific to GCC 15, but then managed to reproduce it with GCC 13 too. I put the fact that I hadn't hit it before down to not being guaranteed to hit it on every run (i.e. I assumed it's not a new issue).
Let me try on a few other machines and environments and get back to you. I'll first try on my other Gentoo machines, then try some other distros in Docker or a chroot (I'm a Gentoo developer so I don't have others to hand ;)), or a VM if I must.
I have at least reproduced it on another machine already (which is also Gentoo, `linux-6.13.2`, and is Zen 4) so at least I haven't wasted your time with a hardware problem!

Trying reproduction in many loops on many devices.
Nothing.
Except for one finding, which I suspect not to be a dd_rescue bug:
On an ARM64 SBC (Orange Pi5B, kernel 5.10.160), I typically see two corrupted bytes (3 bytes apart or so) after running for a minute or so. This only happens when testing on NFS, not when writing to the local filesystem. Other NFS clients (x86-64, kernel 6.12.x mostly) do not show this behavior. Weird.
Bizarre. Is there anything I can do to get more information other than figuring out environments it does, and doesn't, happen in? Happy to run with custom patches or build with whatever options.
Are there .qcow2 images for Gentoo available for download somewhere?
Maybe things reproduce in a Gentoo VM ...
OK, Gentoo .qcow2 for cloud-init downloaded from https://www.gentoo.org/downloads/
Compiling in and running on a Gentoo 2.17 VM (kernel 6.6.47, gcc-14.2.1) on x86-64 (Zen3) for several hours did not yield any error.
local-19353 ~/dd_rescue # ./dd_rescue --version
dd_rescue Version 1.99.20, kurt@garloff.de, GNU GPL v2/v3
(DD_RESCUE_1_99_20-2-g1dd2a7a)
(compiled Mar 1 2025 16:43:33 by gcc (Gentoo 14.2.1_p20241221 p7) 14.2.1 20241221)
(features: O_DIRECT dl/libfallocate fallocate splice fitrim xattr rdrnd sha vaes avx2)
dd_rescue is free software. It's protected by the terms of GNU GPL v2 or v3
(at your option).
So I am a bit clueless about what's going on on your side, Sam.
(Sorry for delay, I've been working on some other bits this week.)
Don't stress over it more for now and I'll try to reproduce in a VM. I'm sorry about the mystery :(
On one of the runs when it differs:
I managed to reproduce it on another machine. Just need to try reproduce in a VM next.
Meanwhile, I tried ASAN and UBSAN with `make EXTRA_CFLAGS="-Og -ggdb3 -fsanitize=address,undefined"` and then `make EXTRA_CFLAGS="-Og -ggdb3 -fsanitize=address,undefined"` and got a tonne of errors.

In particular, for the loop above, I get...
Can you reproduce that if you run it in a loop with a `ddrescue` built with `-fsanitize=address,undefined`?

I can reproduce the `-fsanitize` error.
Did I really get the precedence rules of C wrong? Kind of embarrassing after 30 years ...
Change `!*retry++` to `!(*retry)++` at the two places in `real_writeblock()` and try again ...

I pushed this fix to DD_RESCUE_1_99_BRANCH.
I'm still looking at a few unaligned warnings that the sanitizer uncovered.
At first look, these are hard to avoid, as lzo is not designed to guarantee any alignment (larger than one byte) of compressed content.
Notes:
I plan to release 1.99.21 in a few days, so we have an official release soon with this fixed.
For the time being, picking up the attached patch would seem helpful.
Patch is also included already in my home:garloff:storage OBS project.
Nice! I'll pull that in to our packaging now and re-test with it.
Good news: ASAN is now happy. UBSAN still has some alignment issues but I haven't looked into those.
Bad news: I can still reproduce the assert (or failed file comparison, depending on luck).
Assert:
Backtrace (which is somewhat wrong -- look at the assert arguments wrt the line it then blames; no idea if gdb is confused there or what, but the assert mentioned in `assertion=` is the one I see often):

Then the file comparison one:
When I get that comparison failure, the difference is always in the last byte:
Last edit: Sam James 2025-03-05
I'd asked a friend to try to reproduce it, both to help get to the bottom of it and to make sure I'm not wasting your time somehow -- he couldn't reproduce it at first, but then pulled a fresh Gentoo stage3 tarball, chrooted in (just extracted to some temporary location), and could reproduce it pretty quickly in a loop.
I'm still trying to think of ideas, but could you try the loop on a tmpfs mount?
No luck after 3 hrs on the Orange Pi (aarch64, gcc-11, on NFS).
x86-64 gcc-14 on an openSUSE VM on NFS has now been running for 10 mins without issue.
Will leave it running for a while. Will start tmpfs tests in parallel, though I would expect tmpfs to be less likely to fail ...
On Gentoo stage3 tarball: Is that significantly different from the qcow2 I downloaded a week ago?
The qcow2 should be functionally identical for our purposes here (other than some small amount of changes in stable packages in the last week), as it's a stage3 + a kernel shoved in (more or less).
Compiled with the -fsanitize options .... Nothing thus far (after 1hr), neither on tmpfs nor NFS.
Stopping after 11hrs.
Running in a Gentoo VM again.
The binary that you use to reproduce this:
OK, care to give complete instructions for the reproduction?
is what I do without any success.
Somehow I need to get this reproduced!
Ah, sorry, I thought I had but I can't find it indeed.
This fails for me:
... but I haven't reproduced it more than once so far this morning. Bad (good) luck, I guess.
A friend who can reproduce it is also going to set up a clean VM for you to SSH into.
Last edit: Sam James 2025-03-14