Hi, SnapRAID is segfaulting for me on Alpine Linux (which uses the musl libc) starting with 12.0. I didn't have this behavior with 11.6. I recompiled SnapRAID myself with debug symbols and see the same behavior as the system package, and it always segfaults on the same path. I've attached the log output, along with a back-trace and the contents of the directory in question.
Which Alpine version is it? Does SnapRAID pass the "make check" test after building?
Also try building it starting with "./configure --enable-debug". It should give even more debug information inside gdb.
Anyway, from your gdb log, it looks like the issue is inside the readdir() function of the musl library. I'm not sure yet, but it's possible that this function is not thread safe and crashes because of the multithreading added in 12.0. POSIX doesn't require readdir() to be thread safe, even though most other libraries, like glibc, provide that property.
Ciao,
Andrea
Does SnapRAID pass the "make check" test after building?
Yes it does.
Also try building it starting with "./configure --enable-debug". It should give even more debug information inside gdb.
The version I captured the logs from was built with --enable-debug. Also, I just realized that gdb's log capture does not capture the gdb commands as well, so it makes the log a bit more difficult to parse. Below is a fixed log.
(gdb) run
Starting program: /usr/bin/snapraid diff --log snapraid-segfault.log
[New LWP 28782]
[New LWP 28783]
[New LWP 28784]

Thread 4 "snapraid" received signal SIGSEGV, Segmentation fault.
[Switching to LWP 28784]
0x000055555557e247 in scan_dir (scan=scan@entry=0x7fffd1617860, level=level@entry=10, is_diff=is_diff@entry=1, dir=dir@entry=0x7fffc3a2dfa0 "/media/disk2/home/nate/.kde4/share/apps/nepomuk/repository/main/data/virtuosobackend/", sub=sub@entry=0x7fffc3a2ffa0 "home/nate/.kde4/share/apps/nepomuk/repository/main/data/virtuosobackend/") at cmdline/scan.c:1247
1247    cmdline/scan.c: No such file or directory.
(gdb) info threads
  Id   Target Id             Frame
  1    LWP 28778 "snapraid"  0x00007ffff7fbc413 in ?? () from /lib/ld-musl-x86_64.so.1
  2    LWP 28782 "snapraid"  0x00007ffff7f834c5 in readdir64 () from /lib/ld-musl-x86_64.so.1
  3    LWP 28783 "snapraid"  0x00007ffff7fadf68 in fstatat64 () from /lib/ld-musl-x86_64.so.1
* 4    LWP 28784 "snapraid"  0x000055555557e247 in scan_dir (scan=scan@entry=0x7fffd1617860, level=level@entry=10, is_diff=is_diff@entry=1, dir=dir@entry=0x7fffc3a2dfa0 "/media/disk2/home/nate/.kde4/share/apps/nepomuk/repository/main/data/virtuosobackend/", sub=sub@entry=0x7fffc3a2ffa0 "home/nate/.kde4/share/apps/nepomuk/repository/main/data/virtuosobackend/") at cmdline/scan.c:1247
(gdb) bt
#0  0x000055555557e247 in scan_dir (scan=scan@entry=0x7fffd1617860, level=level@entry=10, is_diff=is_diff@entry=1, dir=dir@entry=0x7fffc3a2dfa0 "/media/disk2/home/nate/.kde4/share/apps/nepomuk/repository/main/data/virtuosobackend/", sub=sub@entry=0x7fffc3a2ffa0 "home/nate/.kde4/share/apps/nepomuk/repository/main/data/virtuosobackend/") at cmdline/scan.c:1247
#1  0x000055555557ea06 in scan_dir (scan=scan@entry=0x7fffd1617860, level=level@entry=9, is_diff=is_diff@entry=1, dir=dir@entry=0x7fffc3a310d0 "/media/disk2/home/nate/.kde4/share/apps/nepomuk/repository/main/data/", sub=sub@entry=0x7fffc3a330d0 "home/nate/.kde4/share/apps/nepomuk/repository/main/data/") at cmdline/scan.c:1509
#2  0x000055555557ea06 in scan_dir (scan=scan@entry=0x7fffd1617860, level=level@entry=8, is_diff=is_diff@entry=1, dir=dir@entry=0x7fffc3a34200 "/media/disk2/home/nate/.kde4/share/apps/nepomuk/repository/main/", sub=sub@entry=0x7fffc3a36200 "home/nate/.kde4/share/apps/nepomuk/repository/main/") at cmdline/scan.c:1509
#3  0x000055555557ea06 in scan_dir (scan=scan@entry=0x7fffd1617860, level=level@entry=7, is_diff=is_diff@entry=1, dir=dir@entry=0x7fffc3a37330 "/media/disk2/home/nate/.kde4/share/apps/nepomuk/repository/", sub=sub@entry=0x7fffc3a39330 "home/nate/.kde4/share/apps/nepomuk/repository/") at cmdline/scan.c:1509
#4  0x000055555557ea06 in scan_dir (scan=scan@entry=0x7fffd1617860, level=level@entry=6, is_diff=is_diff@entry=1, dir=dir@entry=0x7fffc3a3a460 "/media/disk2/home/nate/.kde4/share/apps/nepomuk/", sub=sub@entry=0x7fffc3a3c460 "home/nate/.kde4/share/apps/nepomuk/") at cmdline/scan.c:1509
#5  0x000055555557ea06 in scan_dir (scan=scan@entry=0x7fffd1617860, level=level@entry=5, is_diff=is_diff@entry=1, dir=dir@entry=0x7fffc3a3d590 "/media/disk2/home/nate/.kde4/share/apps/", sub=sub@entry=0x7fffc3a3f590 "home/nate/.kde4/share/apps/") at cmdline/scan.c:1509
#6  0x000055555557ea06 in scan_dir (scan=scan@entry=0x7fffd1617860, level=level@entry=4, is_diff=is_diff@entry=1, dir=dir@entry=0x7fffc3a406c0 "/media/disk2/home/nate/.kde4/share/", sub=sub@entry=0x7fffc3a426c0 "home/nate/.kde4/share/") at cmdline/scan.c:1509
#7  0x000055555557ea06 in scan_dir (scan=scan@entry=0x7fffd1617860, level=level@entry=3, is_diff=is_diff@entry=1, dir=dir@entry=0x7fffc3a437f0 "/media/disk2/home/nate/.kde4/", sub=sub@entry=0x7fffc3a457f0 "home/nate/.kde4/") at cmdline/scan.c:1509
#8  0x000055555557ea06 in scan_dir (scan=scan@entry=0x7fffd1617860, level=level@entry=2, is_diff=is_diff@entry=1, dir=dir@entry=0x7fffc3a46920 "/media/disk2/home/nate/", sub=sub@entry=0x7fffc3a48920 "home/nate/") at cmdline/scan.c:1509
#9  0x000055555557ea06 in scan_dir (scan=scan@entry=0x7fffd1617860, level=level@entry=1, is_diff=is_diff@entry=1, dir=dir@entry=0x7fffc3a49a50 "/media/disk2/home/", sub=sub@entry=0x7fffc3a4ba50 "home/") at cmdline/scan.c:1509
#10 0x000055555557ea06 in scan_dir (scan=scan@entry=0x7fffd1617860, level=level@entry=0, is_diff=1, dir=dir@entry=0x7ffff7f3b0b0 "/media/disk2/", sub=sub@entry=0x5555555c5c89 "") at cmdline/scan.c:1509
#11 0x000055555557ebf0 in scan_disk (arg=0x7fffd1617860) at cmdline/scan.c:1590
#12 0x00007ffff7fba221 in ?? () from /lib/ld-musl-x86_64.so.1
#13 0x0000000000000000 in ?? ()
Hopefully that is a little clearer. It looks like the segfault is happening at cmdline/scan.c:1247. Meanwhile, Frederik's below is segfaulting in the same file at line 692.
From these backtraces, it looks like this is a recursive scan of the directories. Since both segfault at the beginning of a function at depth 10 (level=level@entry=10), I wonder if this is a stack overflow? Alpine ships with a much smaller default thread stack size than most other platforms, so it may be overflowing the stack while descending directories.
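The stack overflow guess can be roughly checked against the backtrace itself. The dir pointers of adjacent scan_dir frames (levels 10 down to 1) are always 0x3130 bytes apart, which suggests each recursion level costs about 12592 bytes of stack. The sketch below does that arithmetic in C. The helper names are hypothetical, and the 128 KiB stack size is an assumption taken from the musl figure quoted elsewhere in this thread, not something read from the log.

```c
#include <assert.h>
#include <stdint.h>

/* Addresses of the `dir` argument at levels 10 and 9, copied from the backtrace. */
#define DIR_LEVEL10 UINT64_C(0x7fffc3a2dfa0)
#define DIR_LEVEL9  UINT64_C(0x7fffc3a310d0)

/* Assumption: a 128 KiB default thread stack, as quoted for musl in this thread. */
#define THREAD_STACK (UINT64_C(128) * 1024)

/* Estimate the stack bytes used per recursion level from the same
 * local buffer seen in two adjacent frames. */
uint64_t frame_stride(uint64_t inner, uint64_t outer)
{
    assert(outer > inner); /* the stack grows downward on x86_64 */
    return outer - inner;
}

/* Number of whole frames of the given size that fit in a stack. */
uint64_t frames_that_fit(uint64_t stack_size, uint64_t stride)
{
    assert(stride > 0);
    return stack_size / stride;
}
```

With these numbers, frame_stride(DIR_LEVEL10, DIR_LEVEL9) is 12592 and frames_that_fit(THREAD_STACK, 12592) is 10. Levels 0 through 10 are 11 frames, or 138512 bytes, which is more than 131072. This ignores the scan_disk frame and any per-thread bookkeeping, so it is only an estimate, but it is consistent with a crash on entry to level 10.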
The Alpine version is 3.15.
Hello together,
I have the same issue. 12.0 throws a segmentation fault, while 11.5 works fine on the same host.
I will try to compile it manually with debug enabled.
I compiled snapraid with the "./configure --enable-debug" option, but my gdb output is much shorter. Most likely I am missing the necessary arguments, as this is my first time using gdb.
If I move the affected .jpg file, another file is affected. But it is always thread 3 for me.
Hi Frederik,
Make sure to clean up everything before rebuilding, for example with:
make distclean
./configure --enable-debug
make
gdb --args ./snapraid diff -v
run
When it crashes in gdb, type "bt" to get the full stack backtrace.
Note also the use of "./snapraid" instead of "snapraid", to make sure you run the binary you just built.
Ciao,
Andrea
Hi Andrea,
thank you very much for these tips! I have now done the following on a fresh alpine:latest Docker image:
I get nearly the same backtrace as saintdev:
It looks like this is a stack overflow.
I recompiled with LDFLAGS="-Wl,-z,stack-size=1024768", as suggested as one of the possible fixes in the article I linked to in my other reply. This fixed the segfault, and I can now diff/sync with no more crashes.
Hi saintdev,
Yes. It looks like a stack overflow issue. musl gives only 128 kB to threads, compared to the typical 1 MB.
Please try the beta version at http://beta.snapraid.it/
It greatly reduces the stack usage, so it should also fit within the musl limit.
Ciao,
Andrea
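For anyone who wants to see the stack-size dependency without relinking, a program can also ask for a larger stack per thread through the standard pthread attributes. The following is only a minimal sketch under stated assumptions: it is not what SnapRAID or the beta does, and descend(), run_with_stack(), default_thread_stack(), and the 12 KiB buffer are hypothetical stand-ins chosen to resemble the frame size seen in the backtrace.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical per-level buffer, close to the frame size seen in the backtrace. */
#define LEVEL_BUF 12288

/* Recurse `levels` times, touching a large local buffer at each level.
 * Returns the number of levels actually entered. */
static int descend(int levels)
{
    volatile char buf[LEVEL_BUF];

    buf[0] = (char)levels;
    buf[LEVEL_BUF - 1] = buf[0];
    if (levels <= 1)
        return 1;
    return 1 + descend(levels - 1);
}

static void *worker(void *arg)
{
    int *depth = arg;

    *depth = descend(*depth);
    return NULL;
}

/* Run descend(levels) on a thread created with an explicit stack size.
 * Returns the depth reached, or -1 if the thread could not be set up. */
int run_with_stack(int levels, size_t stack_size)
{
    pthread_attr_t attr;
    pthread_t tid;
    int depth = levels;

    assert(levels >= 1);
    if (pthread_attr_init(&attr) != 0)
        return -1;
    if (pthread_attr_setstacksize(&attr, stack_size) != 0 ||
        pthread_create(&tid, &attr, worker, &depth) != 0) {
        pthread_attr_destroy(&attr);
        return -1;
    }
    pthread_attr_destroy(&attr);
    if (pthread_join(tid, NULL) != 0)
        return -1;
    return depth;
}

/* Stack size reported by a default-initialized attribute object. */
size_t default_thread_stack(void)
{
    pthread_attr_t attr;
    size_t size = 0;

    if (pthread_attr_init(&attr) != 0)
        return 0;
    pthread_attr_getstacksize(&attr, &size);
    pthread_attr_destroy(&attr);
    return size;
}
```

Calling run_with_stack(11, 1024 * 1024) completes all 11 levels because the thread gets 1 MiB regardless of the libc default, and default_thread_stack() shows what that default is on the current system. Requesting the size in code is an alternative to the linker flag; reducing the per-level stack use, as the beta is said to do, avoids the dependency altogether.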
Hi Andrea,
That seems to have fixed it. Diff/sync with the 12.1 beta works without crashing!
Thanks for all your work!
As a note, I noticed that "make check" took significantly longer for the beta than it did for 12.0. Not sure if that is expected behavior or not.
Last edit: saintdev 2022-01-09
Hi all,
for me the beta is also working correctly, with no more crashes.
The build time on GitHub Actions was even a little quicker, including "make check" (29 vs 35 mins), but of course I cannot see the resources used there.
Best regards