Menu

#9 test_sparse.sh sporadic failures

1.0
closed
nobody
None
2025-03-23
2025-02-14
Sam James
No

Hi!

I observed a test failure when running the testsuite for dd-rescue-1.99.20 and at first, couldn't reproduce it again. Upon some further prodding, I found that test_sparse.sh fails for me but only usually after many iterations, and the failure mode isn't always the same.

This is one of them:

~/bugs/dd-rescue/dd_rescue-1.99.20 $ make -j32 -l32 && make -j32 -l32 check
~/bugs/dd-rescue/dd_rescue-1.99.20 $ while true ; do LD_LIBRARY_PATH=. ./test_sparse.sh "-L ./libddr_crypt.so=AES192-CTR:weakrnd:pbkdf2:pass=ABC:" "encrypt" "decrypt" 8388612 || break ; done
[...]
./dd_rescue -L ./libddr_crypt.so=AES192-CTR:weakrnd:pbkdf2:pass=ABC:decrypt testfile.copy2 testfile.copy
dd_rescue: (info): Using softbs=128.0kiB, hardbs=4.0kiB
dd_rescue: (warning): crypt (-1): Don't specify sensitive data on the command line!
dd_rescue: (info): expect to copy 8192.0kiB from testfile.copy2
dd_rescue: (info): crypt ( 0): Derived salt from testfile.copy2=0000000000800004
dd_rescue: (info): crypt ( 0): Generate KEY and IV from same passwd/salt
dd_rescue: (info): read testfile.copy2 (8192.0kiB): EOF
dd_rescue: (info): Summary for testfile.copy2 -> testfile.copy
dd_rescue: (info): ipos:      8192.0k, opos:      8192.0k, xferd:      8192.0k
                   errs:      0, errxfer:         0.0k, succxfer:      8192.0k
             +curr.rate:   502978kB/s, avg.rate:   502978kB/s, avg.load: 83.6%
             >-...-....-....-....-....-....-....-....--< 100%  TOT:  0:00:00
8196    testfile.copy2
./dd_rescue -a -b 16k -L ./libddr_crypt.so=AES192-CTR:weakrnd:pbkdf2:pass=ABC:encrypt testfile testfile.copy2
dd_rescue: (info): Using softbs=16.0kiB, hardbs=4.0kiB
dd_rescue: (warning): crypt (-1): Don't specify sensitive data on the command line!
dd_rescue: (info): expect to copy 8192.0kiB from testfile
dd_rescue: (info): crypt ( 0): Derived salt from testfile.copy2=0000000000800004
dd_rescue: (info): crypt ( 0): Generate KEY and IV from same passwd/salt
dd_rescue: (info): read testfile (8192.0kiB): EOF
dd_rescue: (info): Summary for testfile -> testfile.copy2
dd_rescue: (info): ipos:      8192.0k, opos:      8192.0k, xferd:      8192.0k
                   errs:      0, errxfer:         0.0k, succxfer:      8192.0k
             +curr.rate:   446918kB/s, avg.rate:   446918kB/s, avg.load: 75.7%
             >-----------------------------------------< 100%  TOT:  0:00:00
dd_rescue: (warning): crypt ( 0): Enc alignment error! (8388612-0)=8388612 4/4
dd_rescue: libddr_crypt.c:1532: crypt_blk_cb: Assertion `left < BLKSZ-state->inbuf' failed.
./test_sparse.sh: line 14: 3107568 Aborted                    (core dumped) ./dd_rescue $1 $2$3 ${TESTFILE} ${TESTFILE}.copy2
ERROR 198: Error with sparse

This particular build was with GCC 13 but I first observed it when testing GCC 15 (not yet released):
```
$ ./dd_rescue --version

dd_rescue Version 1.99.20, kurt@garloff.de, GNU GPL v2/v3
(DD_RESCUE_1_99_20)
(compiled Feb 14 2025 02:37:06 by gcc-13 (Gentoo Hardened 13.3.1_p20250207 p2) 13.3.1 20250207)
(features: O_DIRECT dl/libfallocate fallocate splice fitrim xattr rdrnd sha aes avx2)
dd_rescue is free software. It's protected by the terms of GNU GPL v2 or v3
(at your option).
```

Backtrace for the crash:
`` Using host libthread_db library "/usr/lib64/libthread_db.so.1". Core was generated by./dd_rescue -a -b 16k -L ./libddr_crypt testfile testfile.copy2'.
Program terminated with signal SIGABRT, Aborted.

0 __pthread_kill_implementation (threadid=139194786694976, signo=6, no_tid=0) at pthread_kill.c:44

44 return INTERNAL_SYSCALL_ERROR_P (ret) ? INTERNAL_SYSCALL_ERRNO (ret) : 0;
(gdb) bt

0 __pthread_kill_implementation (threadid=139194786694976, signo=6, no_tid=0) at pthread_kill.c:44

1 __pthread_kill_internal (threadid=139194786694976, signo=6) at pthread_kill.c:89

2 __GI___pthread_kill (threadid=139194786694976, signo=signo@entry=6) at pthread_kill.c:100

3 0x00007e98cfa1f7a2 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26

4 0x00007e98cfa01243 in __GI_abort () at abort.c:73

5 0x00007e98cfa011b6 in __assert_fail_base (fmt=<optimized out="">, assertion=<optimized out="">, file=<optimized out="">, line=1532, function=<optimized out="">) at assert.c:118</optimized></optimized></optimized></optimized>

6 0x00007e98cfa160bc in __assert_fail (assertion=<optimized out="">, file=<optimized out="">, line=<optimized out="">, function=<optimized out="">) at assert.c:127</optimized></optimized></optimized></optimized>

7 0x00007e98cfd3b931 in crypt_blk_cb (fst=0x5d84626e4200 <_fstate.2>, bf=<optimized out="">, towr=0x7ffcb53a841c, eof=1, recall=<optimized out="">, stat=<optimized out="">)</optimized></optimized></optimized>

at libddr_crypt.c:1436

8 0x00005d84626d6f0e in call_plugins_block (op=<optimized out="">, fst=0x5d84626e4200 <_fstate.2>, skip=0x7ffcb53a8424, eof=1, towr=0x7ffcb53a841c, bf=<optimized out="">) at dd_rescue.c:537</optimized></optimized>

9 call_plugins_block (bf=<optimized out="">, towr=0x7ffcb53a841c, eof=1, skip=0x7ffcb53a8424, op=<optimized out="">, fst=0x5d84626e4200 <_fstate.2>) at dd_rescue.c:520</optimized></optimized>

10 0x00005d84626d8b8d in writeblock (towrite=towrite@entry=0, shouldwrite=shouldwrite@entry=0x7ffcb53a8484, op=op@entry=0x5d84626e42a0 <_opts.5>,

fst=fst@entry=0x5d84626e4200 <_fstate.2>, prg=prg@entry=0x5d84626e41c0 <_progress.1>, dop=0x5d84626e4270 <_dpopts.4>) at dd_rescue.c:1904

11 0x00005d84626d8fc5 in real_cleanup (op=0x5d84626e42a0 <_opts.5>, fst=0x5d84626e4200 <_fstate.2>, prg=0x5d84626e41c0 <_progress.1>, dop=<optimized out="">,</optimized>

dst=0x5d84626e4260 <_dpstate.3>, closelog=closelog@entry=0 '\000') at dd_rescue.c:1482

12 0x00005d84626d929a in cleanup (closelog=closelog@entry=0 '\000') at dd_rescue.c:1552

13 0x00005d84626de1c5 in main (argc=<optimized out="">, argv=<optimized out="">) at dd_rescue.c:3638</optimized></optimized>

```

Related

Tickets: #9

Discussion

1 2 > >> (Page 1 of 2)
  • Sam James

    Sam James - 2025-02-14

    Sorry, I can't seem to edit the first post to fix the formatting.

    It usually takes 30s or so for the loop running on a fast machine (AMD Ryzen 3950X) to hit a failure, sometimes longer (1m, maybe up to 3m).

    Another example of a failure is:

    ./dd_rescue -a -b 16k -L ./libddr_crypt.so=AES192-CTR:weakrnd:pbkdf2:pass=ABC:decrypt testfile.copy2 testfile.copy
    dd_rescue: (info): Using softbs=16.0kiB, hardbs=4.0kiB
    dd_rescue: (warning): crypt (-1): Don't specify sensitive data on the command line!
    dd_rescue: (info): expect to copy 8192.0kiB from testfile.copy2
    dd_rescue: (info): crypt ( 0): Derived salt from testfile.copy2=0000000000800004
    dd_rescue: (info): crypt ( 0): Generate KEY and IV from same passwd/salt
    dd_rescue: (info): read testfile.copy2 (8192.0kiB): EOF
    dd_rescue: (info): Summary for testfile.copy2 -> testfile.copy
    dd_rescue: (info): ipos:      8192.0k, opos:      8192.0k, xferd:      8192.0k
                       errs:      0, errxfer:         0.0k, succxfer:      8192.0k
                 +curr.rate:   608890kB/s, avg.rate:   608890kB/s, avg.load: 86.3%
                 >-----------------------------------------< 100%  TOT:  0:00:00
    dd_rescue: (warning): crypt ( 0): Adjusted offset 8388608 -> 8388612 (524288), towr 0
    dd_rescue: (warning): crypt ( 0): Dec alignment error! (8388612-0)=8388612 4/4
    8196    testfile.copy2
    8196    testfile
    5584    testfile.copy
    testfile testfile.copy differ: byte 8388609, line 26117
    ERROR 97: Sparse compare error
    

    This is with linux-6.13.2 and I've hit it on a btrfs and a tmpfs partition. Let me know what more information I can share or obtain for you. Happy to try things too. Thanks!

     
  • Kurt Garloff

    Kurt Garloff - 2025-02-15

    Thanks for the report!
    It looks like it hits sporadically when using 16k blocks both during de- and encryption. Interestingly not even a 2nd plugin (like de/compression) seems to be involved, which I would have suspected, as that complicates things and I had a bit of work b/f releasing 1.99.20 to get all corner cases right there. I'll let you know what I can find.

     
  • Kurt Garloff

    Kurt Garloff - 2025-02-15

    Hmm, tested on two machines, (both x86-64, AMD Zen 3 and Zen 4).
    Ubuntu 24.04 (gcc-13.3) bare metal and openSUSE-15.6 (with a self-compiled gcc-15) in a VM.
    Running this for ~20mins on either machine did not yield any error ...
    Any hints on what may be special in your setup? Can you reproduce this on several different setups (distributions, CPUs, compilers, ... ?)

     
  • Sam James

    Sam James - 2025-02-16

    Thanks for taking a look Kurt.

    I'd actually started off assuming it was either a GCC bug or at least specific to GCC 15, but then managed to reproduce with GCC 13 too and wrote off the fact I hadn't hit it before as related to how I'm just not guaranteed to hit it every time (i.e. I assumed it's not a new issue).

    Let me try on a few other machines and environments and get back to you. I'll first try on my other Gentoo machines then try some other distros in Docker or a chroot (I'm a Gentoo developer so I don't have others to hand ;)) or VM if must.

    I have at least reproduced it on another machine already (which is also Gentoo, linux-6.13.2, and is Zen 4) so at least I haven't wasted your time with a hardware problem!

     
  • Kurt Garloff

    Kurt Garloff - 2025-02-16

    Trying reproduction in many loops on many devices.
    Nothing.
    Except for one finding, which I suspect not to be a dd_rescue bug:
    On an ARM64 SBC (Orange Pi5B, kernel 5.10.160), I see typically two corrupted bytes (3 bytes apart or so) after running for a minute or so. This only happens when testing on NFS, not when writing to the local filesystem. Other NFS clients (x86-64, kernel 612.x mostly) do not show this behavior. Nor do I see it on the OrangePi5 when using a local filesystem. Weird.

     
  • Sam James

    Sam James - 2025-02-17

    Bizarre. Is there anything I can do to get more information other than figuring out environments it does, and doesn't, happen in? Happy to run with custom patches or build with whatever options.

     
  • Kurt Garloff

    Kurt Garloff - 2025-02-23

    Are there .qcow2 images for Gentoo available for download somewhere?
    Maybe things reproduce in a Gentoo VM ...

     
  • Kurt Garloff

    Kurt Garloff - 2025-03-01

    OK, gentoo .qcow2 for cloud-init downloaded from https://www.gentoo.org/downloads/

     
  • Kurt Garloff

    Kurt Garloff - 2025-03-01

    Compiling in and running on a Gentoo 2.17 VM (kernel 6.6.47, gcc-14.2.1) on x86-64 (Zen3) for several hours did not yield any error.
    local-19353 ~/dd_rescue # ./dd_rescue --version

    dd_rescue Version 1.99.20, kurt@garloff.de, GNU GPL v2/v3
    (DD_RESCUE_1_99_20-2-g1dd2a7a)
    (compiled Mar 1 2025 16:43:33 by gcc (Gentoo 14.2.1_p20241221 p7) 14.2.1 20241221)
    (features: O_DIRECT dl/libfallocate fallocate splice fitrim xattr rdrnd sha vaes avx2)
    dd_rescue is free software. It's protected by the terms of GNU GPL v2 or v3
    (at your option).

    So I am a bit clueless what's going on on your side, Sam.

     
  • Sam James

    Sam James - 2025-03-01

    (Sorry for delay, I've been working on some other bits this week.)

    Don't stress over it more for now and I'll try to reproduce in a VM. I'm sorry about the mystery :(

     
  • Sam James

    Sam James - 2025-03-04

    On one of the runs when it differs:

    $ diffoscope testfile testfile.copy
    --- testfile
    +++ testfile.copy
    @@ -524282,8 +524282,8 @@
     007fff90: 8257 553a b086 0b31 c88c 558f 5400 71bf  .WU:...1..U.T.q.
     007fffa0: 983e 49c9 74f1 8220 5777 b11b 119f 9000  .>I.t.. Ww......
     007fffb0: 1aed a523 8120 ab20 c94a 4e9c e0d7 aab8  ...#. . .JN.....
     007fffc0: c5f1 945d 399d 0fd2 1e28 6106 e09d a777  ...]9....(a....w
     007fffd0: c6bd 6382 b708 4633 c526 90c7 3443 5e7f  ..c...F3.&..4C^.
     007fffe0: 215c 5e10 abe8 dc1a 7be0 ae61 a717 cf3c  !\^.....{..a...<
     007ffff0: d1f8 25fa 5ff9 43d9 dbcb 1b7e f333 c8e2  ..%._.C....~.3..
    -00800000: bbad 4856                                ..HV
    +00800000: 0000 0000                                ....
    

    I managed to reproduce it on another machine. Just need to try reproduce in a VM next.

    Meanwhile, I tried ASAN and UBSAN with make EXTRA_CFLAGS="-Og -ggdb3 -fsanitize=address,undefined" and then make EXTRA_CFLAGS="-Og -ggdb3 -fsanitize=address,undefined" and got a tonne of errors.

    In particular, for the loop above, I get...

    $ time ( while true ; do LD_LIBRARY_PATH=. ./test_sparse.sh "-L ./libddr_crypt.so=AES192-CTR:weakrnd:pbkdf2:pass=ABC:" "encrypt" "decrypt" 8388612 || break ; done ; )
    [...]
    ./dd_rescue -a -b 16k -L ./libddr_crypt.so=AES192-CTR:weakrnd:pbkdf2:pass=ABC:decrypt testfile.copy2 testfile.copy
    dd_rescue: (info): Using softbs=16.0kiB, hardbs=4.0kiB
    dd_rescue: (warning): crypt (-1): Don't specify sensitive data on the command line!
    dd_rescue: (info): expect to copy 8192.0kiB from testfile.copy2
    dd_rescue: (info): crypt ( 0): Derived salt from testfile.copy2=0000000000800004
    dd_rescue: (info): crypt ( 0): Generate KEY and IV from same passwd/salt
    =================================================================
    ==2385574==ERROR: AddressSanitizer: stack-buffer-overflow on address 0x7ba1988004a1 at pc 0x5647ba223646 bp 0x7fff72ba6030 sp 0x7fff72ba6020
    READ of size 1 at 0x7ba1988004a1 thread T0
        #0 0x5647ba223645 in real_writeblock /tmp/dd_rescue-1.99.20/dd_rescue.c:1801
        #1 0x5647ba21e591 in writeblock /tmp/dd_rescue-1.99.20/dd_rescue.c:1943
        #2 0x5647ba225845 in dowrite /tmp/dd_rescue-1.99.20/dd_rescue.c:2098
        #3 0x5647ba226437 in dowrite_sparse /tmp/dd_rescue-1.99.20/dd_rescue.c:2189
        #4 0x5647ba22920b in copyfile_softbs /tmp/dd_rescue-1.99.20/dd_rescue.c:2447
        #5 0x5647ba23987a in main /tmp/dd_rescue-1.99.20/dd_rescue.c:3628
        #6 0x7fa19ae0394a in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58
        #7 0x7fa19ae039f9 in __libc_start_main_impl ../csu/libc-start.c:360
        #8 0x5647ba212014 in _start (/tmp/dd_rescue-1.99.20/dd_rescue+0x3014)
    
    Address 0x7ba1988004a1 is located in stack of thread T0 at offset 33 in frame
        #0 0x5647ba21e05d in writeblock /tmp/dd_rescue-1.99.20/dd_rescue.c:1879
    
      This frame has 3 object(s):
        [32, 33) 'retry' (line 1892) <== Memory access at offset 33 overflows this variable
        [48, 52) 'redo' (line 1884)
        [64, 68) 'towrite' (line 1877)
    HINT: this may be a false positive if your program uses some custom stack unwind mechanism, swapcontext or vfork
          (longjmp and C++ exceptions *are* supported)
    SUMMARY: AddressSanitizer: stack-buffer-overflow /tmp/dd_rescue-1.99.20/dd_rescue.c:1801 in real_writeblock
    Shadow bytes around the buggy address:
      0x7ba198800200: f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 00 00 00 00
      0x7ba198800280: f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 00 00 00 00
      0x7ba198800300: f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5
      0x7ba198800380: f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 00 00 00 00
      0x7ba198800400: f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 f5 00 00 00 00
    =>0x7ba198800480: f1 f1 f1 f1[01]f2 04 f2 04 f3 f3 f3 00 00 00 00
      0x7ba198800500: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      0x7ba198800580: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      0x7ba198800600: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      0x7ba198800680: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      0x7ba198800700: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    Shadow byte legend (one shadow byte represents 8 application bytes):
      Addressable:           00
      Partially addressable: 01 02 03 04 05 06 07
      Heap left redzone:       fa
      Freed heap region:       fd
      Stack left redzone:      f1
      Stack mid redzone:       f2
      Stack right redzone:     f3
      Stack after return:      f5
      Stack use after scope:   f8
      Global redzone:          f9
      Global init order:       f6
      Poisoned by user:        f7
      Container overflow:      fc
      Array cookie:            ac
      Intra object redzone:    bb
      ASan internal:           fe
      Left alloca redzone:     ca
      Right alloca redzone:    cb
    ==2385574==ABORTING
    ERROR 65: Error with sparse
    
    real    0m0.862s
    user    0m0.576s
    sys     0m0.285s
    

    Can you reproduce that if you run it in a loop with a ddrescue built with -fsanitize=address,undefined?

     
  • Kurt Garloff

    Kurt Garloff - 2025-03-05

    I can reproduce the -fsanitizer error.
    Did I really get the precendence rules of C wrong? Kind of embarassing after 30yrs ...
    Change !*retry++ to !(*retry)++ at the two places in real_writeblock() and try again ...

     
  • Kurt Garloff

    Kurt Garloff - 2025-03-05

    I pushed this fix to DD_RESCUE_1_99_BRANCH.
    I'm still looking at a few unaligned warnings that the sanitizer uncovered.
    At first look, these are hard to avoid, as lzo is not designed to guarantee any alignment (larger than one byte) of compressed content.

     
  • Kurt Garloff

    Kurt Garloff - 2025-03-05

    Notes:

    1. I found no further places in the code where I may have made the same mistake.
    2. The issues I saw on OrangePi5B before which I attributed to NFS are fixed by this. I have a theory here: NFS causes short (incomplete) writes with some likelihood, while many local filesystems are unlikely to do so. So that's how this got triggered. (You needed a second retry to cause stack corruption.)

    I plan to release 1.99.21 in a few days, so we have an official release soon with this fixed.
    For the time being, picking up the attached patch would seem helpful.
    Patch is also included already in my home:garloff:storage OBS project.

     
  • Sam James

    Sam James - 2025-03-05

    Nice! I'll pull that in to our packaging now and re-test with it.

     
  • Sam James

    Sam James - 2025-03-05

    Good news: ASAN is now happy. UBSAN still has some alignment issues but I haven't looked into those.

    Bad news: I can still reproduce the assert (or failed file comparison, depending on luck).

    Assert:

    ./dd_rescue -a -b 16k -L ./libddr_crypt.so=AES192-CTR:weakrnd:pbkdf2:pass=ABC:encrypt testfile testfile.copy2
    dd_rescue: (info): Using softbs=16.0kiB, hardbs=4.0kiB
    dd_rescue: (warning): crypt (-1): Don't specify sensitive data on the command line!
    dd_rescue: (info): expect to copy 8192.0kiB from testfile
    dd_rescue: (info): crypt ( 0): Derived salt from testfile.copy2=0000000000800004
    dd_rescue: (info): crypt ( 0): Generate KEY and IV from same passwd/salt
    dd_rescue: (info): read testfile (8192.0kiB): EOF
    dd_rescue: (info): Summary for testfile -> testfile.copy2
    dd_rescue: (info): ipos:      8192.0k, opos:      8192.0k, xferd:      8192.0k
                       errs:      0, errxfer:         0.0k, succxfer:      8192.0k
                 +curr.rate:  2159769kB/s, avg.rate:  2159769kB/s, avg.load:199.6%
                 >-----------------------------------------< 100%  TOT:  0:00:00
    dd_rescue: (warning): crypt ( 0): Enc alignment error! (8388612-0)=8388612 4/4
    dd_rescue: libddr_crypt.c:1532: crypt_blk_cb: Assertion `left < BLKSZ-state->inbuf' failed.
    ./test_sparse.sh: line 14: 3580176 Aborted                 (core dumped) ./dd_rescue $1 $2$3 ${TESTFILE} ${TESTFILE}.copy2
    ERROR 198: Error with sparse
    
    real    0m22.070s
    user    0m9.072s
    sys     0m12.846s
    

    Backtrace (which is somewhat wrong -- look at the assert arguments wrt the line it then blames; no idea if gdb is confused there or what, but the assert mentioned in assertion= is the one I see often):

    #0  __pthread_kill_implementation (threadid=140563215513408, signo=6, no_tid=0) at pthread_kill.c:44
    #1  __pthread_kill_internal (threadid=140563215513408, signo=6) at pthread_kill.c:89
    #2  __GI___pthread_kill (threadid=140563215513408, signo=signo@entry=6) at pthread_kill.c:100
    #3  0x00007fd76c61fd56 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
    #4  0x00007fd76c601655 in __GI_abort () at abort.c:73
    #5  0x00007fd76c6015be in __assert_fail_base (fmt=<optimized out>, assertion=0x7fd76c86ac69 "left < BLKSZ-state->inbuf", file=<optimized out>, line=<optimized out>,
        function=function@entry=0xc <error: Cannot access memory at address 0xc>) at assert.c:118
    #6  0x00007fd76c6160d0 in __assert_fail (assertion=<optimized out>, file=<optimized out>, line=<optimized out>, function=0xc <error: Cannot access memory at address 0xc>)
        at assert.c:127
    #7  0x00007fd76c84e1b3 in crypt_blk_cb (fst=0x559930eef940 <_fstate.2>, bf=<optimized out>, towr=0x7ffc2e8e353c, eof=1, recall=<optimized out>, stat=<optimized out>)
        at libddr_crypt.c:1436
    #8  0x0000559930edc26c in call_plugins_block (op=<optimized out>, bf=<optimized out>, towr=0x7ffc2e8e353c, eof=1, skip=0x7ffc2e8e3544, fst=0x559930eef940 <_fstate.2>) at dd_rescue.c:537
    #9  call_plugins_block (bf=<optimized out>, towr=0x7ffc2e8e353c, eof=1, skip=0x7ffc2e8e3544, op=<optimized out>, fst=0x559930eef940 <_fstate.2>) at dd_rescue.c:520
    #10 0x0000559930eddf1f in writeblock (towrite=towrite@entry=0, shouldwrite=shouldwrite@entry=0x7ffc2e8e35a4, op=op@entry=0x559930eef9e0 <_opts.5>,
        fst=fst@entry=0x559930eef940 <_fstate.2>, prg=prg@entry=0x559930eef900 <_progress.1>, dop=0x559930eef9b0 <_dpopts.4>) at dd_rescue.c:1904
    #11 0x0000559930ede35b in real_cleanup (op=0x559930eef9e0 <_opts.5>, fst=0x559930eef940 <_fstate.2>, prg=0x559930eef900 <_progress.1>, dop=<optimized out>,
        dst=0x559930eef9a0 <_dpstate.3>, closelog=closelog@entry=0 '\000') at dd_rescue.c:1482
    #12 0x0000559930ede62a in cleanup (closelog=closelog@entry=0 '\000') at dd_rescue.c:1552
    #13 0x0000559930eda392 in main (argc=<optimized out>, argv=<optimized out>) at dd_rescue.c:3638
    

    Then the file comparison one:

    ./dd_rescue -a -b 16k -L ./libddr_crypt.so=AES192-CTR:weakrnd:pbkdf2:pass=ABC:decrypt testfile.copy2 testfile.copy
    dd_rescue: (info): Using softbs=16.0kiB, hardbs=4.0kiB
    dd_rescue: (warning): crypt (-1): Don't specify sensitive data on the command line!
    dd_rescue: (info): expect to copy 8192.0kiB from testfile.copy2
    dd_rescue: (info): crypt ( 0): Derived salt from testfile.copy2=0000000000800004
    dd_rescue: (info): crypt ( 0): Generate KEY and IV from same passwd/salt
    dd_rescue: (info): read testfile.copy2 (8192.0kiB): EOF
    dd_rescue: (info): Summary for testfile.copy2 -> testfile.copy
    dd_rescue: (info): ipos:      8192.0k, opos:      8192.0k, xferd:      8192.0k
                       errs:      0, errxfer:         0.0k, succxfer:      8192.0k
                 +curr.rate:   598459kB/s, avg.rate:   598459kB/s, avg.load:198.5%
                 >-----------------------------------------< 100%  TOT:  0:00:00
    dd_rescue: (warning): crypt ( 0): Adjusted offset 8388608 -> 8388612 (524288), towr 0
    dd_rescue: (warning): crypt ( 0): Dec alignment error! (8388612-0)=8388612 4/4
    8196    testfile.copy2
    8196    testfile
    5584    testfile.copy
    testfile testfile.copy differ: byte 8388609, line 26134
    ERROR 97: Sparse compare error
    
    real    0m9.622s
    user    0m3.683s
    sys     0m5.029s
    

    When I get that comparison failure, the difference is always in the last byte:

    --- testfile
    +++ testfile.copy
    @@ -524282,8 +524282,8 @@
     007fff90: a7cc e70f 1ccc 5b3d 6915 662c 1ad4 bb18  ......[=i.f,....
     007fffa0: 22ee 8f57 39f7 a5bc a3be fae3 8160 3401  "..W9........`4.
     007fffb0: be95 225d 2204 e3cb b55a 5204 6b8f 466d  .."]"....ZR.k.Fm
     007fffc0: d68d 2612 a8de 1d51 190b 12b2 9be9 7f65  ..&....Q.......e
     007fffd0: 6ca6 78d6 42bc 4360 345d 207f 2bdf 93e4  l.x.B.C`4] .+...
     007fffe0: f1a9 2f0c dcf9 be66 88f2 25d0 7d65 50a1  ../....f..%.}eP.
     007ffff0: 5f84 d860 ae5d f593 38eb 9d71 92bd c878  _..`.]..8..q...x
    -00800000: bba0 ec30                                ...0
    +00800000: 0000 0000                                ....
    
     

    Last edit: Sam James 2025-03-05
  • Sam James

    Sam James - 2025-03-05

    I'd asked a friend to try reproduce, both to help get to the bottom of it, and also make sure I'm not wasting your time somehow -- he couldn't reproduce at first, but then pulled a fresh Gentoo stage3 tarball, chrooted in (just extracted to some temporary location), and could pretty quickly in a loop.

    I'm still trying to think of ideas, but could you try the loop on a tmpfs mount?

     

    Last edit: Sam James 2025-03-05
  • Kurt Garloff

    Kurt Garloff - 2025-03-05

    No luck with 3hrs on Orange Pi (aarch64, gcc-11, on NFS).
    x86-64 gcc-14 on a openSUSE VM on NFS is running now since 10mins without issue.
    Will leave it running for a while. Will start tmpfs tests in parallel, though I would suspect tmpfs to less likely fail ...
    On Gentoo stage3 tarball: Is that significantly different from the qcow2 I downloaded a week ago?

     
    • Sam James

      Sam James - 2025-03-05

      The qcow2 should be functionally identical for our purposes here (other than some small amount of changes in stable packages in the last week), as it's a stage3 + a kernel shoved in (more or less).

       

      Last edit: Sam James 2025-03-05
  • Kurt Garloff

    Kurt Garloff - 2025-03-05

    Compiled with the -fsanitize options .... Nothing thus far (after 1hr), neither on tmpfs nor NFS.

     
  • Kurt Garloff

    Kurt Garloff - 2025-03-06

    Stopping after 11hrs.

     
  • Kurt Garloff

    Kurt Garloff - 2025-03-13

    Running in a Gentoo VM again.
    The binary that you use to reproduce this:

    • How do you compile it? Passing any compiler flags, e.g. from the Gentoo build system?
    • How long does it take to reproduce the issue?
     
    • Sam James

      Sam James - 2025-03-14
      • Just regular configure and make (building from tarball, no package manager; I originally reproduced it in our PM, but I wanted to rule all of that out) with nothing exported in the environment or passed to configure or make
      • It takes a few minutes at most
       
      • Kurt Garloff

        Kurt Garloff - 2025-03-14

        OK, care to give complete instructions for the reproduction?

        git checkout DD_RESCUE_1_99_BRANCH
        ./autogen.sh
        make -j all
        make check
        while true; do ./test_sparse.sh "-L ./libddr_crypt.so=AES192-CTR:weakrnd:pbkdf2:pass=ABC:" "encrypt" "decrypt" 8388612 || break; done
        

        is what I do without any success.
        Somehow I need to get this reproduced!

         
        • Sam James

          Sam James - 2025-03-14

          Ah, sorry, I thought I had but I can't find it indeed.

          This fails for me:

          wget https://www.garloff.de/kurt/linux/ddrescue/dd_rescue-1.99.20.tar.bz2
          tar xvf dd_rescue-1.99.20.tar.bz2
          cd dd_rescue-1.99.20
          make -j all
          make check
          while true; do ./test_sparse.sh "-L ./libddr_crypt.so=AES192-CTR:weakrnd:pbkdf2:pass=ABC:" "encrypt" "decrypt" 8388612 || break; done
          

          ... but I haven't reproduced it more than once so far this morning. Bad (good) luck, I guess.

          A friend who can reproduce it is also going to setup a clean VM for you to SSH into.

           

          Last edit: Sam James 2025-03-14
1 2 > >> (Page 1 of 2)

Log in to post a comment.

MongoDB Logo MongoDB