SnapRAID 12.0 segfaults when running (diff/sync..) on Alpine Linux/musl libc

Help
saintdev
2022-01-06
2022-01-10
  • saintdev

    saintdev - 2022-01-06

    Hi, SnapRAID is segfaulting for me on Alpine Linux (which uses the musl libc) starting with 12.0. I didn't have this behavior with 11.6. I recompiled SnapRAID myself with debug symbols and see the same behavior as with the system package, and it always segfaults on the same path. I've attached the log output, along with a backtrace and the contents of the directory in question.

     
  • Andrea Mazzoleni

    Hi saintdev,

    Which Alpine version is it? Does SnapRAID pass the "make check" test after building?

    Also try building it starting with "./configure --enable-debug". It should give even more debug information inside gdb.

    Anyway, from your gdb log, it looks like the issue is inside the readdir() function of the musl library. I'm not sure yet, but it's possible that this function is not thread safe there and crashes due to the multithreading added in 12.0. POSIX doesn't require readdir() to be thread safe, even though most other libraries, like glibc, make it so.

    Ciao,
    Andrea

     
    • saintdev

      saintdev - 2022-01-08

      Which Alpine version is it?

      3.15

      Does SnapRAID pass the "make check" test after building?

      Yes it does.

      Also try building it starting with "./configure --enable-debug". It should give even more debug information inside gdb.

      The version I captured the logs from was built with --enable-debug. Also, I just realized that gdb's log capture does not capture the gdb commands as well, so it makes the log a bit more difficult to parse. Below is a fixed log.

      (gdb) run
      Starting program: /usr/bin/snapraid diff --log snapraid-segfault.log
      [New LWP 28782]
      [New LWP 28783]
      [New LWP 28784]
      
      Thread 4 "snapraid" received signal SIGSEGV, Segmentation fault.
      [Switching to LWP 28784]
      0x000055555557e247 in scan_dir (scan=scan@entry=0x7fffd1617860, level=level@entry=10, is_diff=is_diff@entry=1, dir=dir@entry=0x7fffc3a2dfa0 "/media/disk2/home/nate/.kde4/share/apps/nepomuk/repository/main/data/virtuosobackend/", sub=sub@entry=0x7fffc3a2ffa0 "home/nate/.kde4/share/apps/nepomuk/repository/main/data/virtuosobackend/") at cmdline/scan.c:1247
      1247    cmdline/scan.c: No such file or directory.
      (gdb) info threads
        Id   Target Id            Frame 
        1    LWP 28778 "snapraid" 0x00007ffff7fbc413 in ?? () from /lib/ld-musl-x86_64.so.1
        2    LWP 28782 "snapraid" 0x00007ffff7f834c5 in readdir64 () from /lib/ld-musl-x86_64.so.1
        3    LWP 28783 "snapraid" 0x00007ffff7fadf68 in fstatat64 () from /lib/ld-musl-x86_64.so.1
      * 4    LWP 28784 "snapraid" 0x000055555557e247 in scan_dir (scan=scan@entry=0x7fffd1617860, level=level@entry=10, is_diff=is_diff@entry=1, 
          dir=dir@entry=0x7fffc3a2dfa0 "/media/disk2/home/nate/.kde4/share/apps/nepomuk/repository/main/data/virtuosobackend/", sub=sub@entry=0x7fffc3a2ffa0 "home/nate/.kde4/share/apps/nepomuk/repository/main/data/virtuosobackend/")
          at cmdline/scan.c:1247
      (gdb) bt
      #0  0x000055555557e247 in scan_dir (scan=scan@entry=0x7fffd1617860, level=level@entry=10, is_diff=is_diff@entry=1, dir=dir@entry=0x7fffc3a2dfa0 "/media/disk2/home/nate/.kde4/share/apps/nepomuk/repository/main/data/virtuosobackend/", 
          sub=sub@entry=0x7fffc3a2ffa0 "home/nate/.kde4/share/apps/nepomuk/repository/main/data/virtuosobackend/") at cmdline/scan.c:1247
      #1  0x000055555557ea06 in scan_dir (scan=scan@entry=0x7fffd1617860, level=level@entry=9, is_diff=is_diff@entry=1, dir=dir@entry=0x7fffc3a310d0 "/media/disk2/home/nate/.kde4/share/apps/nepomuk/repository/main/data/", 
          sub=sub@entry=0x7fffc3a330d0 "home/nate/.kde4/share/apps/nepomuk/repository/main/data/") at cmdline/scan.c:1509
      #2  0x000055555557ea06 in scan_dir (scan=scan@entry=0x7fffd1617860, level=level@entry=8, is_diff=is_diff@entry=1, dir=dir@entry=0x7fffc3a34200 "/media/disk2/home/nate/.kde4/share/apps/nepomuk/repository/main/", 
          sub=sub@entry=0x7fffc3a36200 "home/nate/.kde4/share/apps/nepomuk/repository/main/") at cmdline/scan.c:1509
      #3  0x000055555557ea06 in scan_dir (scan=scan@entry=0x7fffd1617860, level=level@entry=7, is_diff=is_diff@entry=1, dir=dir@entry=0x7fffc3a37330 "/media/disk2/home/nate/.kde4/share/apps/nepomuk/repository/", 
          sub=sub@entry=0x7fffc3a39330 "home/nate/.kde4/share/apps/nepomuk/repository/") at cmdline/scan.c:1509
      #4  0x000055555557ea06 in scan_dir (scan=scan@entry=0x7fffd1617860, level=level@entry=6, is_diff=is_diff@entry=1, dir=dir@entry=0x7fffc3a3a460 "/media/disk2/home/nate/.kde4/share/apps/nepomuk/", 
          sub=sub@entry=0x7fffc3a3c460 "home/nate/.kde4/share/apps/nepomuk/") at cmdline/scan.c:1509
      #5  0x000055555557ea06 in scan_dir (scan=scan@entry=0x7fffd1617860, level=level@entry=5, is_diff=is_diff@entry=1, dir=dir@entry=0x7fffc3a3d590 "/media/disk2/home/nate/.kde4/share/apps/", 
          sub=sub@entry=0x7fffc3a3f590 "home/nate/.kde4/share/apps/") at cmdline/scan.c:1509
      #6  0x000055555557ea06 in scan_dir (scan=scan@entry=0x7fffd1617860, level=level@entry=4, is_diff=is_diff@entry=1, dir=dir@entry=0x7fffc3a406c0 "/media/disk2/home/nate/.kde4/share/", sub=sub@entry=0x7fffc3a426c0 "home/nate/.kde4/share/")
          at cmdline/scan.c:1509
      #7  0x000055555557ea06 in scan_dir (scan=scan@entry=0x7fffd1617860, level=level@entry=3, is_diff=is_diff@entry=1, dir=dir@entry=0x7fffc3a437f0 "/media/disk2/home/nate/.kde4/", sub=sub@entry=0x7fffc3a457f0 "home/nate/.kde4/")
          at cmdline/scan.c:1509
      #8  0x000055555557ea06 in scan_dir (scan=scan@entry=0x7fffd1617860, level=level@entry=2, is_diff=is_diff@entry=1, dir=dir@entry=0x7fffc3a46920 "/media/disk2/home/nate/", sub=sub@entry=0x7fffc3a48920 "home/nate/") at cmdline/scan.c:1509
      #9  0x000055555557ea06 in scan_dir (scan=scan@entry=0x7fffd1617860, level=level@entry=1, is_diff=is_diff@entry=1, dir=dir@entry=0x7fffc3a49a50 "/media/disk2/home/", sub=sub@entry=0x7fffc3a4ba50 "home/") at cmdline/scan.c:1509
      #10 0x000055555557ea06 in scan_dir (scan=scan@entry=0x7fffd1617860, level=level@entry=0, is_diff=1, dir=dir@entry=0x7ffff7f3b0b0 "/media/disk2/", sub=sub@entry=0x5555555c5c89 "") at cmdline/scan.c:1509
      #11 0x000055555557ebf0 in scan_disk (arg=0x7fffd1617860) at cmdline/scan.c:1590
      #12 0x00007ffff7fba221 in ?? () from /lib/ld-musl-x86_64.so.1
      #13 0x0000000000000000 in ?? ()
      

      Hopefully that is a little clearer. It looks like the segfault is happening at cmdline/scan.c:1247. Meanwhile, Frederik's backtrace below segfaults in the same file at line 692.

      From these backtraces, it looks like this is a recursive scan of the directories. Since they both segfault at the beginning of a function at depth 10 (level=level@entry=10), I wonder if this is a stack overflow. Alpine ships with a much smaller default thread stack size than most other platforms, so it may be overflowing the stack while descending directories.
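
A rough check of the stack-overflow hypothesis, using only addresses from saintdev's backtrace above: the `dir` buffers of consecutive scan_dir() levels sit 0x3130 (12592) bytes apart, which suggests that each recursion level costs roughly 12 KiB of stack. The sketch below does that arithmetic in C. The 128 KiB limit is the musl thread stack size quoted further down in this thread, treated here as an assumption, and all function and macro names are made up for illustration.

```c
#include <stdio.h>

/* `dir` argument addresses copied from the gdb backtrace above.
 * Each buffer lives in the stack frame of the calling scan_dir(). */
#define DIR_LEVEL_10 0x7fffc3a2dfa0ULL
#define DIR_LEVEL_9  0x7fffc3a310d0ULL
#define DIR_LEVEL_1  0x7fffc3a49a50ULL

/* Assumed default thread stack size on musl: 128 KiB. */
#define MUSL_DEFAULT_STACK (128ULL * 1024ULL)

/* Distance between the buffers of two adjacent levels, taken here as
 * the approximate stack cost of one scan_dir() recursion level. */
unsigned long long frame_stride(void)
{
    return DIR_LEVEL_9 - DIR_LEVEL_10;
}

/* Estimated stack bytes needed for `levels` nested scan_dir() calls. */
unsigned long long estimated_usage(unsigned levels)
{
    return levels * frame_stride();
}

/* Prints the estimate next to the assumed limit. */
void print_estimate(unsigned levels)
{
    printf("stride: %llu bytes, %u levels: %llu bytes, limit: %llu bytes\n",
           frame_stride(), levels, estimated_usage(levels),
           MUSL_DEFAULT_STACK);
}
```

Calling print_estimate(11) from a main() covers levels 0 through 10. Eleven frames come to 138512 bytes, which is more than 131072, so the estimate is consistent with a crash on entry to level 10. The frames of scan_disk() and of the thread start-up code are ignored here.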

       
  • Frederik

    Frederik - 2022-01-07

    Hello everyone,

    I have the same issue. 12.0 throws a segmentation fault, while 11.5 works fine on the same host.
    I will try to compile it manually with debug enabled.

    / # snapraid diff -v
    Loading state from /config/snapraid.content...
      139007 files
           0 hardlinks
           0 symlinks
         464 empty dirs
    Comparing...
    Excluding content '/mnt/HDD3000/snapraid.content'
    Excluding content '/mnt/HDD3000/snapraid.content.lock'
    Excluding content '/mnt/HDD1500/snapraid.content'
    Segmentation fault (core dumped)
    
    / # snapraid sync -v
    Self test...
    Loading state from /config/snapraid.content...
      139007 files
           0 hardlinks
           0 symlinks
         464 empty dirs
    Scanning...
    Excluding content '/mnt/HDD3000/snapraid.content'
    Excluding content '/mnt/HDD3000/snapraid.content.lock'
    Excluding content '/mnt/HDD1500/snapraid.content'
    Segmentation fault (core dumped)
    
    / # snapraid --version
    snapraid v12.0 by Andrea Mazzoleni, http://www.snapraid.it
    
    / # grep '^VERSION' /etc/os-release
    VERSION_ID=3.13.0
    
     
  • Frederik

    Frederik - 2022-01-07

    I compiled snapraid with the "./configure --enable-debug" option, but my gdb output is much shorter. Most likely I am missing the necessary arguments, as I am using gdb for the first time.

    If I move the affected .jpg file, another file is affected instead. But it is always thread 3 for me.

    /usr/local/bin # gdb --args snapraid diff -v
    GNU gdb (GDB) 10.1
    Copyright (C) 2020 Free Software Foundation, Inc.
    License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
    This is free software: you are free to change and redistribute it.
    There is NO WARRANTY, to the extent permitted by law.
    Type "show copying" and "show warranty" for details.
    This GDB was configured as "x86_64-alpine-linux-musl".
    Type "show configuration" for configuration details.
    For bug reporting instructions, please see:
    <https://www.gnu.org/software/gdb/bugs/>.
    Find the GDB manual and other documentation resources online at:
        <http://www.gnu.org/software/gdb/documentation/>.
    
    For help, type "help".
    Type "apropos word" to search for commands related to "word"...
    Reading symbols from snapraid...
    (gdb) set verbose on
    (gdb) run
    Starting program: /usr/local/bin/snapraid diff -v
    Using PIE (Position Independent Executable) displacement 0x555555554000 for "/usr/local/bin/snapraid".
    Reading symbols from /lib/ld-musl-x86_64.so.1...
    (No debugging symbols found in /lib/ld-musl-x86_64.so.1)
    Reading symbols from system-supplied DSO at 0x7ffff7f65000...
    (No debugging symbols found in system-supplied DSO at 0x7ffff7f65000)
    Loading state from /config/snapraid.content...
      139007 files
           0 hardlinks
           0 symlinks
         464 empty dirs
    Comparing...
    [New LWP 8505]
    [New LWP 8506]
    Excluding content '/mnt/HDD3000/snapraid.content'
    Excluding content '/mnt/HDD3000/snapraid.content.lock'
    Excluding content '/mnt/HDD1500/snapraid.content'
    Reading in symbols for cmdline/scan.c...
    
    Thread 3 "snapraid" received signal SIGSEGV, Segmentation fault.
    [Switching to LWP 8506]
    0x000055555557bb0b in scan_file (scan=scan@entry=0x7fffe9032c20, is_diff=is_diff@entry=1, 
        sub=sub@entry=0x7fffe6966fa0 "path/to/some/file.jpg", st=st@entry=0x7fffe6965f10, 
        physical=physical@entry=0) at cmdline/scan.c:692
    692     {
    (gdb)
    
     
  • Andrea Mazzoleni

    Hi Frederik,

    Make sure to clean up everything before rebuilding, for example with:

    make distclean
    ./configure --enable-debug
    make
    gdb --args ./snapraid diff -v
    run

    When it crashes in gdb, type "bt" to get the full stack backtrace.

    Note also the use of "./snapraid" instead of "snapraid", to make sure you run the binary you just built.

    Ciao,
    Andrea

     
  • Frederik

    Frederik - 2022-01-07

    Hi Andrea,

    Thank you very much for these tips! I have now done the following on a fresh alpine:latest Docker image:

    export SNAPRAID_VERSION=12.0
    cd
    wget https://github.com/amadvance/snapraid/releases/download/v$SNAPRAID_VERSION/snapraid-$SNAPRAID_VERSION.tar.gz
    tar xzvf snapraid-$SNAPRAID_VERSION.tar.gz
    cd snapraid-$SNAPRAID_VERSION
    ./configure --prefix=/usr --sysconfdir=/etc --mandir=/usr/share/man --localstatedir=/var --enable-debug
    make
    make install
    ./snapraid --version   # returned 12.0
    gdb --args ./snapraid diff -v
    (gdb) run
    (gdb) bt
    

    I get a nearly identical backtrace to saintdev's:

    (gdb) run
    Starting program: /root/snapraid-12.0/snapraid diff -v
    Loading state from /config/snapraid.content...
      139009 files
           0 hardlinks
           0 symlinks
         466 empty dirs
    Comparing...
    [New LWP 2485]
    Excluding content '/mnt/HDD3000/snapraid.content'
    Excluding content '/mnt/HDD3000/snapraid.content.lock'
    [New LWP 2486]
    Excluding content '/mnt/HDD1500/snapraid.content'
    
    Thread 3 "snapraid" received signal SIGSEGV, Segmentation fault.
    [Switching to LWP 2486]
    0x000055555557bb0b in scan_file (scan=scan@entry=0x7fffe8feac20, is_diff=is_diff@entry=1, sub=sub@entry=0x7fffe691ffa0 "A/B/C/D/E/F/G/H/I/photo.jpg", st=st@entry=0x7fffe691ef10, physical=physical@entry=0) at cmdline/scan.c:692
    692     {
    (gdb) bt
    #0  0x000055555557bb0b in scan_file (scan=scan@entry=0x7fffe8feac20, 
        is_diff=is_diff@entry=1, 
        sub=sub@entry=0x7fffe691ffa0 "A/B/C/D/E/F/G/H/I/photo.jpg", st=st@entry=0x7fffe691ef10, 
        physical=physical@entry=0) 
        at cmdline/scan.c:692
    #1  0x000055555557cad7 in scan_dir (scan=scan@entry=0x7fffe8feac20, level=level@entry=9, 
        is_diff=is_diff@entry=1, 
        dir=dir@entry=0x7fffe69220d0 "/mnt/HDD1500/A/B/C/D/E/F/G/H/I/", 
        sub=sub@entry=0x7fffe69240d0 "A/B/C/D/E/F/G/H/I/") 
        at cmdline/scan.c:1454
    #2  0x000055555557ce7f in scan_dir (scan=scan@entry=0x7fffe8feac20, level=level@entry=8, 
        is_diff=is_diff@entry=1, 
        dir=dir@entry=0x7fffe6925200 "/mnt/HDD1500/A/B/C/D/E/F/G/H/", 
        sub=sub@entry=0x7fffe6927200 "A/B/C/D/E/F/G/H/") 
        at cmdline/scan.c:1509
    #3  0x000055555557ce7f in scan_dir (scan=scan@entry=0x7fffe8feac20, level=level@entry=7, 
        is_diff=is_diff@entry=1, 
        dir=dir@entry=0x7fffe6928330 "/mnt/HDD1500/A/B/C/D/E/F/G/", 
        sub=sub@entry=0x7fffe692a330 "A/B/C/D/E/F/G/") 
        at cmdline/scan.c:1509
    #4  0x000055555557ce7f in scan_dir (scan=scan@entry=0x7fffe8feac20, level=level@entry=6, 
        is_diff=is_diff@entry=1, 
        dir=dir@entry=0x7fffe692b460 "/mnt/HDD1500/A/B/C/D/E/F/", 
        sub=sub@entry=0x7fffe692d460 "A/B/C/D/E/F/")
        at cmdline/scan.c:1509
    #5  0x000055555557ce7f in scan_dir (scan=scan@entry=0x7fffe8feac20, level=level@entry=5, 
        is_diff=is_diff@entry=1,
        dir=dir@entry=0x7fffe692e590 "/mnt/HDD1500/A/B/C/D/E/", 
        sub=sub@entry=0x7fffe6930590 "A/B/C/D/E/") 
        at cmdline/scan.c:1509
    #6  0x000055555557ce7f in scan_dir (scan=scan@entry=0x7fffe8feac20, level=level@entry=4, 
        is_diff=is_diff@entry=1, 
        dir=dir@entry=0x7fffe69316c0 "/mnt/HDD1500/A/B/C/D/", 
        sub=sub@entry=0x7fffe69336c0 "A/B/C/D/") 
        at cmdline/scan.c:1509
    #7  0x000055555557ce7f in scan_dir (scan=scan@entry=0x7fffe8feac20, level=level@entry=3, 
        is_diff=is_diff@entry=1, 
        dir=dir@entry=0x7fffe69347f0 "/mnt/HDD1500/A/B/C/", 
        sub=sub@entry=0x7fffe69367f0 "A/B/C/") 
        at cmdline/scan.c:1509
    #8  0x000055555557ce7f in scan_dir (scan=scan@entry=0x7fffe8feac20, level=level@entry=2, 
        is_diff=is_diff@entry=1, 
        dir=dir@entry=0x7fffe6937920 "/mnt/HDD1500/A/B/", 
        sub=sub@entry=0x7fffe6939920 "A/B/") 
        at cmdline/scan.c:1509
    #9  0x000055555557ce7f in scan_dir (scan=scan@entry=0x7fffe8feac20, level=level@entry=1, 
        is_diff=is_diff@entry=1, 
        dir=dir@entry=0x7fffe693aa50 "/mnt/HDD1500/A/", 
        sub=sub@entry=0x7fffe693ca50 "A/") 
        at cmdline/scan.c:1509
    #10 0x000055555557ce7f in scan_dir (scan=scan@entry=0x7fffe8feac20, level=level@entry=0, 
        is_diff=1, 
        dir=dir@entry=0x7ffff7f41080 "/mnt/HDD1500/", 
        sub=sub@entry=0x5555555c3c89 "") 
        at cmdline/scan.c:1509
    #11 0x000055555557d069 in scan_disk (arg=0x7fffe8feac20) 
        at cmdline/scan.c:1590
    #12 0x00007ffff7fba221 in ?? () from /lib/ld-musl-x86_64.so.1
    #13 0x0000000000000000 in ?? ()
    (gdb) Quit
    
     
  • saintdev

    saintdev - 2022-01-09

    It looks like this is a stack overflow.

    I recompiled with LDFLAGS="-Wl,-z,stack-size=1024768", which the article I linked to in my other reply suggests as one of the possible fixes. This fixed the segfault, and I can now diff/sync with no more crashes.
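
The linker flag above is understood to raise the default stack size that musl gives to threads of that binary. A similar effect can be had for a single thread from C with pthread_attr_setstacksize(). The sketch below is illustrative only and is not taken from SnapRAID's source: default_thread_stack_size(), run_with_stack(), worker() and both size constants are hypothetical names and values chosen for the example.

```c
#include <pthread.h>
#include <stddef.h>

#define WANTED_STACK ((size_t)1024 * 1024) /* 1 MiB, hypothetical target */
#define BIG_LOCAL    ((size_t)512 * 1024)  /* more than a 128 KiB stack holds */

/* Returns the stack size a new thread gets when no size is requested,
 * or 0 if the attribute object cannot be initialised. */
size_t default_thread_stack_size(void)
{
    pthread_attr_t attr;
    size_t size = 0;

    if (pthread_attr_init(&attr) != 0)
        return 0;
    pthread_attr_getstacksize(&attr, &size);
    pthread_attr_destroy(&attr);
    return size;
}

/* Thread body: touches one byte per 4 KiB of a large local buffer,
 * then reports success through the int that `arg` points to. */
static void *worker(void *arg)
{
    volatile char buf[BIG_LOCAL];

    for (size_t i = 0; i < sizeof buf; i += 4096)
        buf[i] = 1;
    *(int *)arg = buf[0];
    return NULL;
}

/* Runs worker() on a thread with an explicit stack size.
 * Returns 1 on success and -1 if the thread could not be set up. */
int run_with_stack(size_t stack_size)
{
    pthread_attr_t attr;
    pthread_t tid;
    int result = 0;
    int rc;

    if (pthread_attr_init(&attr) != 0)
        return -1;
    if (pthread_attr_setstacksize(&attr, stack_size) != 0) {
        pthread_attr_destroy(&attr);
        return -1;
    }
    rc = pthread_create(&tid, &attr, worker, &result);
    pthread_attr_destroy(&attr);
    if (rc != 0)
        return -1;
    pthread_join(tid, NULL);
    return result;
}
```

Printing default_thread_stack_size() on an Alpine system and on a glibc system is a quick way to see the difference discussed here. run_with_stack(WANTED_STACK) should complete on both, whereas running the same worker on a 128 KiB stack would be expected to overflow. Older toolchains may need -pthread when compiling.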

     
  • Andrea Mazzoleni

    Hi saintdev,

    Yes, it looks like a stack overflow issue. musl gives threads only 128 kB of stack, compared to the typical 1 MB.

    Please try the beta version at http://beta.snapraid.it/

    It greatly reduces stack usage, so it should fit within musl's limit too.

    Ciao,
    Andrea
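
One common way to reduce stack usage in a recursive directory scan is to keep path buffers on the heap instead of declaring large arrays in each stack frame. The thread does not say what the beta actually changed, so the sketch below only illustrates that general technique. count_files() is a hypothetical helper and does not correspond to any SnapRAID function.

```c
#define _POSIX_C_SOURCE 200809L
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>

/* Recursively counts the regular files below `dir`.
 * Returns -1 if `dir` cannot be opened or memory runs out. */
long count_files(const char *dir)
{
    DIR *d = opendir(dir);
    struct dirent *e;
    long total = 0;

    if (d == NULL)
        return -1;

    while ((e = readdir(d)) != NULL) {
        struct stat st;
        size_t len;
        char *path;

        if (strcmp(e->d_name, ".") == 0 || strcmp(e->d_name, "..") == 0)
            continue;

        /* The path is heap-allocated and sized to fit, rather than a
         * fixed `char path[PATH_MAX]` repeated in every stack frame. */
        len = strlen(dir) + 1 + strlen(e->d_name) + 1;
        path = malloc(len);
        if (path == NULL) {
            closedir(d);
            return -1;
        }
        snprintf(path, len, "%s/%s", dir, e->d_name);

        if (lstat(path, &st) == 0) {
            if (S_ISDIR(st.st_mode)) {
                long sub = count_files(path);
                if (sub > 0)      /* unreadable subdirectories are skipped */
                    total += sub;
            } else if (S_ISREG(st.st_mode)) {
                total++;
            }
        }
        free(path);
    }
    closedir(d);
    return total;
}
```

With this layout each recursion level keeps only a few pointers, counters and one struct stat on the stack, so the depth of the directory tree has far less effect on stack consumption. The trade-off is one malloc()/free() pair per directory entry.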

     
  • saintdev

    saintdev - 2022-01-09

    Hi Andrea,

    That seems to have fixed it. Diff/sync with the 12.1 beta works without crashing!

    Thanks for all your work!

    As a note, I noticed that make check took significantly longer for the beta than it did for 12.0. Not sure if that is expected behavior or not.

     

    Last edit: saintdev 2022-01-09
  • Frederik

    Frederik - 2022-01-10

    Hi everyone,

    For me the beta is also working correctly, with no more crashes.
    The build time on GitHub Actions was even a little quicker, including "make check" (29 vs 35 mins), but of course I cannot see the resources used there.

    Best regards

     
