static fails: missing $(OBJECTS2)
Closed, fixed.
Ah, you can tell that the static build is not part of my test routine. I took your patch and applied it to git, see commit 0e68146. Thanks! Let me know if you find anything else.
Makefile fix: static target missed some objects.
static fails: missing $(OBJECTS2)
Remove extra -g root
Use local -march setting
Avoid raising a second SIGILL.
Handle compilers that don't support arm-v8.5-a+rng.
Add rdrand Makefile target.
Don't BSWAP32 on aarch64 ASLR init.
Update documentation.
Bump version to 1.99.22.
Use CPU rng to initialize PRNG in aarch64.
Detect avail of aarch-v8.5-a rng rndr for random numbers.
Fix rdrand elif clause syntax
Use SCHED_YIELD macro.
Autodetect SRCDIR and add build insns.
Link addtl test scripts to build dir.
Use rdrand64 on x86-64 rather than rdrand32.
clang found this for me ...
Use -@ and short option for --sparse_nonslow
Option --sparse_nonslow=maxreadtime: Do write 0s.
More precise comment on the floating averages.
Add explanation of code and floating averages.
Better startup behavior for currrate and avgrate.
Use harmonic mean to calc current speed.
test_sparse.sh sporadic failures
Closing
Great, thanks!
Thanks a lot for your patience & work on this. I can't reproduce it anymore. I think we're good to close!
So I believe that your setup did produce a significant amount of interrupted or short writes. (This may be specific to filesystem code or CPU or memory utilisation.) dd_rescue had code to handle it, but it had issues. These have been addressed and I have tests for it now. I'm fairly optimistic that we are in good shape here. Let me know if you have different experience, otherwise I'd like to close this ticket. Thanks for your tenacity, helping to harden the code!
Hi Sam, Just inject the key found at https://github.com/garloff.keys If the bug still reproduces ... -- Kurt On 14.03.25 16:13, Sam James wrote: OK, @garloff https://sourceforge.net/u/garloff/profile/, a friend's setup a VM where they can reproduce it easily and made it available for SSH. Where should I email the credentials to? [tickets:#9] https://sourceforge.net/p/ddrescue/tickets/9/ test_sparse.sh sporadic failures Status: open Milestone: 1.0 Created: Fri Feb 14, 2025 02:41 AM UTC by Sam James...
Better language on -H percent.
Thank you! I won't declare victory yet but it's looking promising.
I released 1.99.21 with all the fixes included. Let me know if you find further issues or if this one is not yet completely solved for you.
Avoid defining READ and WRITE. Used on Android.
Merge branch 'DD_RESCUE_1_99_BRANCH' of ssh://git.code.sf.net/p/ddrescue/code into DD_RESCUE_1_99_BRANCH
Fix SRCDIR handling when creating .dep.
Merge branch 'DD_RESCUE_1_99_BRANCH' of ssh://git.code.sf.net/p/ddrescue/code into DD_RESCUE_1_99_BRANCH
salt is actually not very sensitive, don't warn.
Avoid using C23 feature.
The code on the git branch DD_RESCUE_1_99_BRANCH now should have impeccable handling for interrupted and short IO calls. Testing welcome!
Clear errno if it's EAGAIN or EINTR.
Testing and fixing short and interrupted read/write calls.
OK, @garloff, a friend has set up a VM where they can reproduce it easily and made it available over SSH. Where should I email the credentials to?
Ah, sorry, I thought I had but I can't find it indeed. This fails for me: wget https://www.garloff.de/kurt/linux/ddrescue/dd_rescue-1.99.20.tar.bz2 tar xvf dd_rescue-1.99.20.tar.bz2 cd dd_rescue-1.99.20 make -j all make check while true; do ./test_sparse.sh "-L ./libddr_crypt.so=AES192-CTR:weakrnd:pbkdf2:pass=ABC:" "encrypt" "decrypt" 8388612 || break; done ... but I haven't reproduced it more than once so far this morning. Bad (good) luck, I guess. A friend who can reproduce it is also going to...
Note that the final patch will look slightly different in my current view. I will just turn the errno == 0 in the while statement into (errno == 0 || errno == EAGAIN || errno == EINTR). We do not handle short writes and interrupted writes in dd_rescue currently. That will be fixed also (and may be the reason for the issues you observe).
OK, so here's what I have been able to find: readblock() was actually not safe against an EINTR/EAGAIN followed by a short read. You can add errno = 0; before the mypread() call in readblock() in dd_rescue.c:1827 and let me know if this fixes things for you.
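For readers following the ticket, here is a minimal sketch of the pattern discussed in these comments: reset errno before each attempt, retry on EINTR/EAGAIN, and continue after a short read. It is not the dd_rescue code. The names read_fully, pread_fn and flaky_pread are hypothetical, and the fault-injecting reader only imitates the kind of IO irregularities described in the thread.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical hook with the same shape as pread(2), so a test can
 * swap in a reader that misbehaves on purpose. */
typedef ssize_t (*pread_fn)(int fd, void *buf, size_t len, off_t off);

/* Read len bytes starting at off. Retries interrupted calls and keeps
 * going after short reads. Returns the number of bytes read (less than
 * len only at end of file) or -1 with errno set on a real error.
 * Note: retrying EAGAIN immediately would busy-wait on a non-blocking
 * descriptor; a real tool would wait or yield first. */
static ssize_t read_fully(pread_fn rd, int fd, void *buf, size_t len, off_t off)
{
    size_t done = 0;

    assert(rd != NULL);
    while (done < len) {
        errno = 0; /* do not carry a stale EINTR/EAGAIN into this attempt */
        ssize_t n = rd(fd, (char *)buf + done, len - done, off + (off_t)done);
        if (n > 0) {
            done += (size_t)n;  /* possibly short: loop for the rest */
            continue;
        }
        if (n == 0)
            break;              /* end of file */
        if (errno == EINTR || errno == EAGAIN)
            continue;           /* transient: try again */
        return -1;              /* real error, errno is set */
    }
    errno = 0;
    return (ssize_t)done;
}

/* Fault-injecting reader (assumption, for testing only): serves bytes
 * from a fixed buffer, fails every other call with EINTR and never
 * returns more than 3 bytes at once. The fd argument is ignored. */
static const char flaky_src[] = "0123456789";
static unsigned flaky_calls;

static ssize_t flaky_pread(int fd, void *buf, size_t len, off_t off)
{
    size_t size = sizeof(flaky_src) - 1;
    size_t n;

    (void)fd;
    if (flaky_calls++ % 2 == 0) {
        errno = EINTR;
        return -1;
    }
    if ((size_t)off >= size)
        return 0;
    n = size - (size_t)off;
    if (n > len)
        n = len;
    if (n > 3)
        n = 3;
    memcpy(buf, flaky_src + off, n);
    return (ssize_t)n;
}
```

Calling read_fully(flaky_pread, -1, buf, 10, 0) should deliver all ten bytes despite the injected interruptions and short reads. Passing the reader as a function pointer is just one way to make such a loop testable without special hardware or filesystems.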
OK, care to give complete instructions for the reproduction? git checkout DD_RESCUE_1_99_BRANCH ./autogen.sh make -j all make check while true; do ./test_sparse.sh "-L ./libddr_crypt.so=AES192-CTR:weakrnd:pbkdf2:pass=ABC:" "encrypt" "decrypt" 8388612 || break; done is what I do without any success. Somehow I need to get this reproduced!
It looks like strace can do some fault injection too. Maybe I'll try that?
Here's lscpu from two machines where I can hit it. (Note this machine currently is booted with mitigations=off and I've found this has affected timing-related bugs before): $ uname -a Linux mop 6.13.6 #1 SMP PREEMPT_DYNAMIC Sat Mar 8 14:02:16 GMT 2025 x86_64 AMD Ryzen 9 3950X 16-Core Processor AuthenticAMD GNU/Linux $ lscpu Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Address sizes: 43 bits physical, 48 bits virtual Byte Order: Little Endian CPU(s): 32 On-line CPU(s) list: 0-31 Vendor ID:...
Just regular configure and make (building from tarball, no package manager; I originally reproduced it in our PM, but I wanted to rule all of that out) with nothing exported in the environment or passed to configure or make. It takes a few minutes at most.
OK, creating a signal handler that only logs the signal for SIGWINCH and sending lots of SIGWINCH does not seem to cause any trouble, so I may have to go the IO injection route.
Thinking on this some more ... In my mind, the most likely theory is that I fail to do the right thing when IO irregularities happen. IO calls can be interrupted (-EINTR) or may return with only having done part of the work (short reads/writes). We have logic to handle this, but the logic may have bugs. Such bugs tend to go unnoticed, as interrupted and short IO happens very rarely. I will look at the code with that focus again. If I don't see anything suspicious, I will create a wrapper that injects...
Better help for test_aes.
Nothing within four hours. Testing here is on a Zen3 CPU which supports VAES. The codepath is different if the CPU does support AES only or no crypto extensions. On what CPUs with what crypto capabilities can you reproduce the issue? (Note: If this does occur occasionally, there is also a possibility that it's kernel related, not handling all the extended vector registers correctly on a context switch -- though it's admittedly unlikely that this would go unnoticed ...)
Running in a Gentoo VM again. The binary that you use to reproduce this: - How do you compile it? Passing any compiler flags, e.g. from the Gentoo build system? - How long does it take to reproduce the issue?
Stopping after 11hrs.
The qcow2 should be functionally identical for our purposes here (other than some small amount of changes in stable packages in the last week), as it's a stage3 + a kernel shoved in (more or less).
Compiled with the -fsanitize options .... Nothing thus far (after 1hr), neither on tmpfs nor NFS.
No luck with 3hrs on Orange Pi (aarch64, gcc-11, on NFS). x86-64 gcc-14 on an openSUSE VM on NFS has been running for 10mins now without issue. Will leave it running for a while. Will start tmpfs tests in parallel, though I would suspect tmpfs to be less likely to fail ... On Gentoo stage3 tarball: Is that significantly different from the qcow2 I downloaded a week ago?
Good news: ASAN is now happy. UBSAN still has some alignment issues but I haven't looked into those. Bad news: I can still reproduce the assert (or failed file comparison, depending on luck). Assert: ./dd_rescue -a -b 16k -L ./libddr_crypt.so=AES192-CTR:weakrnd:pbkdf2:pass=ABC:encrypt testfile testfile.copy2 dd_rescue: (info): Using softbs=16.0kiB, hardbs=4.0kiB dd_rescue: (warning): crypt (-1): Don't specify sensitive data on the command line! dd_rescue: (info): expect to copy 8192.0kiB from testfile...
I'd asked a friend to try to reproduce, both to help get to the bottom of it, and also to make sure I'm not wasting your time somehow -- he couldn't reproduce at first, but then pulled a fresh Gentoo stage3 tarball, chrooted in (just extracted to some temporary location), and could reproduce it pretty quickly in a loop. I'm still trying to think of ideas, but could you try the loop on a tmpfs mount?
Nice! I'll pull that in to our packaging now and re-test with it.
Notes: 1. I found no further places in the code where I may have made the same mistake. 2. The issues I saw on OrangePi5B before which I attributed to NFS are fixed by this. I have a theory here: NFS causes short (incomplete) writes with some likelihood, while many local filesystems are unlikely to do so. So that's how this got triggered. (You needed a second retry to cause stack corruption.) I plan to release 1.99.21 in a few days, so we have an official release soon with this fixed. For the time...
I pushed this fix to DD_RESCUE_1_99_BRANCH. I'm still looking at a few unaligned warnings that the sanitizer uncovered. At first look, these are hard to avoid, as lzo is not designed to guarantee any alignment (larger than one byte) of compressed content.
When counting retries, we would have incremented a pointer.
Pass CFLAGS also to gcc linker call.
I can reproduce the -fsanitize error. Did I really get the precedence rules of C wrong? Kind of embarrassing after 30yrs ... Change !*retry++ to !(*retry)++ at the two places in real_writeblock() and try again ...
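To illustrate the precedence point for readers of this ticket: postfix ++ binds tighter than unary *, so !*retry++ tests the counter and then advances the pointer, while !(*retry)++ increments the counter itself. The sketch below is standalone and only assumes that retry is a pointer to an int counter; the helper names are made up and this is not the code from real_writeblock().

```c
#include <assert.h>
#include <stddef.h>

/* Buggy form. The expression parses as !(*(retry++)): it reads the
 * counter, then moves the POINTER forward; the counter never changes.
 * The pointer is passed by address so the caller can observe the move. */
static int first_try_buggy(int **retry_p)
{
    int *retry = *retry_p;
    int first;

    assert(retry != NULL);
    first = !*retry++;   /* pointer now points one past the counter */
    *retry_p = retry;
    return first;
}

/* Fixed form. The parentheses make ++ apply to the counter itself. */
static int first_try_fixed(int *retry)
{
    assert(retry != NULL);
    return !(*retry)++;
}
```

With the buggy form the counter stays at 0, so every call looks like the first try, and any later access through the moved pointer touches memory next to the counter. That fits the stack corruption after a second retry described elsewhere in this ticket.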
On one of the runs when it differs: $ diffoscope testfile testfile.copy --- testfile +++ testfile.copy @@ -524282,8 +524282,8 @@ 007fff90: 8257 553a b086 0b31 c88c 558f 5400 71bf .WU:...1..U.T.q. 007fffa0: 983e 49c9 74f1 8220 5777 b11b 119f 9000 .>I.t.. Ww...... 007fffb0: 1aed a523 8120 ab20 c94a 4e9c e0d7 aab8 ...#. . .JN..... 007fffc0: c5f1 945d 399d 0fd2 1e28 6106 e09d a777 ...]9....(a....w 007fffd0: c6bd 6382 b708 4633 c526 90c7 3443 5e7f ..c...F3.&..4C^. 007fffe0: 215c 5e10 abe8 dc1a 7be0 ae61...
(Sorry for delay, I've been working on some other bits this week.) Don't stress over it more for now and I'll try to reproduce in a VM. I'm sorry about the mystery :(
Compiling in and running on a Gentoo 2.17 VM (kernel 6.6.47, gcc-14.2.1) on x86-64 (Zen3) for several hours did not yield any error. local-19353 ~/dd_rescue # ./dd_rescue --version dd_rescue Version 1.99.20, kurt@garloff.de, GNU GPL v2/v3 (DD_RESCUE_1_99_20-2-g1dd2a7a) (compiled Mar 1 2025 16:43:33 by gcc (Gentoo 14.2.1_p20241221 p7) 14.2.1 20241221) (features: O_DIRECT dl/libfallocate fallocate splice fitrim xattr rdrnd sha vaes avx2) dd_rescue is free software. It's protected by the terms of GNU...
OK, gentoo .qcow2 for cloud-init downloaded from https://www.gentoo.org/downloads/
Are there .qcow2 images for Gentoo available for download somewhere? Maybe things reproduce in a Gentoo VM ...
Bizarre. Is there anything I can do to get more information other than figuring out environments it does, and doesn't, happen in? Happy to run with custom patches or build with whatever options.
Trying reproduction in many loops on many devices. Nothing. Except for one finding, which I suspect not to be a dd_rescue bug: On an ARM64 SBC (Orange Pi5B, kernel 5.10.160), I see typically two corrupted bytes (3 bytes apart or so) after running for a minute or so. This only happens when testing on NFS, not when writing to the local filesystem. Other NFS clients (x86-64, kernel 6.12.x mostly) do not show this behavior. Nor do I see it on the OrangePi5 when using a local filesystem. Weird.
More specific warnings for passed secrets.
Better support for different SRCDIR.
Thanks for taking a look Kurt. I'd actually started off assuming it was either a GCC bug or at least specific to GCC 15, but then managed to reproduce with GCC 13 too and wrote off the fact I hadn't hit it before as related to how I'm just not guaranteed to hit it every time (i.e. I assumed it's not a new issue). Let me try on a few other machines and environments and get back to you. I'll first try on my other Gentoo machines then try some other distros in Docker or a chroot (I'm a Gentoo developer...
Hmm, tested on two machines, (both x86-64, AMD Zen 3 and Zen 4). Ubuntu 24.04 (gcc-13.3) bare metal and openSUSE-15.6 (with a self-compiled gcc-15) in a VM. Running this for ~20mins on either machine did not yield any error ... Any hints on what may be special in your setup? Can you reproduce this on several different setups (distributions, CPUs, compilers, ... ?)
Thanks for the report! It looks like it hits sporadically when using 16k blocks both during de- and encryption. Interestingly not even a 2nd plugin (like de/compression) seems to be involved, which I would have suspected, as that complicates things and I had a bit of work before releasing 1.99.20 to get all corner cases right there. I'll let you know what I can find.
Sorry, I can't seem to edit the first post to fix the formatting. It usually takes 30s or so for the loop running on a fast machine (AMD Ryzen 3950X) to hit a failure, sometimes longer (1m, maybe up to 3m). Another example of a failure is: ./dd_rescue -a -b 16k -L ./libddr_crypt.so=AES192-CTR:weakrnd:pbkdf2:pass=ABC:decrypt testfile.copy2 testfile.copy dd_rescue: (info): Using softbs=16.0kiB, hardbs=4.0kiB dd_rescue: (warning): crypt (-1): Don't specify sensitive data on the command line! dd_rescue:...
test_sparse.sh sporadic failures
Fix memory corruption (!).