
#57 mksquashfs segfault

Milestone: v1.0 (example)
Status: open
Owner: nobody
Labels: None
Priority: 5
Updated: 2014-11-20
Created: 2014-11-06
Private: No

I started using mksquashfs as an archiving tool. Quite often it segfaults during batch jobs, but whenever I try to reproduce the problem it works fine. I upgraded to version 4.3, but the problem remains. Here's the backtrace from a core file:

Core was generated by `mksquashfs . ../1.sqsh -processors 1 -no-progress -no-xattrs -noD'.
Program terminated with signal 11, Segmentation fault.
#0 0x00007fffecf3938a in _int_free () from /lib64/libc.so.6
(gdb) bt
#0 0x00007fffecf3938a in _int_free () from /lib64/libc.so.6
#1 0x00007fffecf3cb5c in free () from /lib64/libc.so.6
#2 0x0000000000417fea in cache_block_put ()
#3 0x00000000004079f2 in deflator ()
#4 0x00007fffed6d0806 in start_thread () from /lib64/libpthread.so.0
#5 0x00007fffecf9ce8d in clone () from /lib64/libc.so.6
#6 0x0000000000000000 in ?? ()

Another is only slightly different:

Core was generated by `mksquashfs . ../1.sqsh -processors 1 -no-progress -no-xattrs -noD'.
Program terminated with signal 6, Aborted.
#0 0x00007fffecef3885 in raise () from /lib64/libc.so.6
(gdb) bt
#0 0x00007fffecef3885 in raise () from /lib64/libc.so.6
#1 0x00007fffecef4e61 in abort () from /lib64/libc.so.6
#2 0x00007fffecf3230f in __libc_message () from /lib64/libc.so.6
#3 0x00007fffecf37b18 in malloc_printerr () from /lib64/libc.so.6
#4 0x00007fffecf3cb5c in free () from /lib64/libc.so.6
#5 0x0000000000417fea in cache_block_put ()
#6 0x00000000004079f2 in deflator ()
#7 0x00007fffed6d0806 in start_thread () from /lib64/libpthread.so.0
#8 0x00007fffecf9ce8d in clone () from /lib64/libc.so.6
#9 0x0000000000000000 in ?? ()

I'm running on a Linux cluster with a Lustre filesystem. The kernel version is:

$ uname -r
3.0.101-0.31.1.20140612-nasa

Typical archive sizes are 150 GB; the core dumps are usually 2-5 GB; system RAM is 24 GB. Let me know if there's anything else I can provide to narrow down the problem, though I don't have root on the system.

Discussion

  • Phillip Lougher - 2014-11-07

    A couple of initial thoughts:

    1. You state you upgraded to 4.3, but the problem remains. This suggests the problem also occurs in version 4.2.

    Now that is really odd - version 4.2 has been publicly available since early 2011, and version 4.3 has been publicly available since May 2014. Yet this is the first report of a major SegV fault I've received in all of those 4 years!

    This suggests there's something about your use-case which is triggering a fault previously untriggered in 4 years.

    You are doing something fairly unusual in using the -noD (no data compression) option, but it is used often enough that there must be something else in your use-case triggering this.

    2. The backtraces are not much use... Both of them show that when the deflator thread freed a data buffer, the free code hit memory corruption. But there is no way of knowing when, why or where that corruption happened - it could be anywhere in the code.

    These faults - memory corruption, double frees, etc. - are notoriously difficult to track down, given that triggering them often depends on a specific input, libraries, parallelisation and so on.

    At the very least I need to discover exactly why this is triggering for you now, but hasn't triggered before in over 4 years.

    So, to go any further, I need a reproducer - a set of input files which are known to trigger this fault, and knowledge of how to reproduce your environment, distro version, libraries etc. Or access to a system which has the set of reproducer data and environment.

    I know this could be quite difficult to achieve. But without that there is no way I can debug this; it is worse than the proverbial needle in a haystack.

     
  • Brian Hawkins - 2014-11-07

    Thanks for the quick and thoughtful response. You're correct that I had the same problem with version 4.2. I'd guess the failure rate is under 10%. I use the -noD option because my data has very high entropy, making compression a poor use of time.

    I wouldn't be surprised if this wound up being a problem particular to the system, as it's a fairly unique one (Pleiades at NASA Ames). As you pointed out, it would be difficult to provide you access. Do you think a system log or maybe valgrind output would be helpful?

     
  • Phillip Lougher - 2014-11-19

    as it's a fairly unique one (Pleiades at NASA Ames).

    Do you mean this?

    Pleiades Supercomputer http://www.nas.nasa.gov/hecc/resources/pleiades.html

    Total cores: 198,432 (32,768 additional GPU cores)
    Total memory: 616 TB

    Yep, that certainly counts as fairly unique, though that is rather an understatement!

    My first thought is that this evidently runs Linux (the page says SUSE Linux), but how does this system present itself to a program running in user-space? Does it report 198,432 cores with a total of 616 TB of RAM available to all cores (shared memory)? And if so, is that really usable? Or is it a segmented-memory system, with most of the memory private to each node (requiring explicit synchronisation) and only a "smaller" amount of memory shared across all cores with memory coherence, and is it reported as such? Mksquashfs assumes all memory is accessible across all cores and is kept coherent using memory barriers (as part of the pthread operations).

    For maximum performance, Mksquashfs adapts itself to the self-reported characteristics of the host it is running on; if the host reports 198,432 cores with 616 TB of memory, it will try to use 198,432 cores and 25% of that memory (Mksquashfs version 4.3). I can imagine library code has never been tested to that limit, and Mksquashfs code certainly hasn't. The largest machine I've run it on has 24 cores, and it's been tested/developed on a 4 core machine with 8 Gbytes (*). I can imagine a whole host of problems with statically sized tables, sized according to the number of cores and amount of memory, which given the above exceed addressing limits and/or code assumptions.
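    As an illustration (a minimal sketch, not Mksquashfs' actual source), this is roughly how a program on Linux discovers the self-reported processor count and physical memory that Mksquashfs scales itself to:

    /* sketch: query the host's self-reported cores and memory on Linux */
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        long procs = sysconf(_SC_NPROCESSORS_ONLN);
        long pages = sysconf(_SC_PHYS_PAGES);
        long page_size = sysconf(_SC_PAGE_SIZE);
        long long mem_bytes = (long long) pages * page_size;

        printf("online processors: %ld\n", procs);
        printf("physical memory:   %lld MB\n", mem_bytes >> 20);
        /* Mksquashfs 4.3 then defaults to using about 25% of that memory */
        return 0;
    }

    On a machine that reports hundreds of thousands of cores and hundreds of terabytes of RAM, everything downstream of those two numbers gets stressed in ways it never has been before.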

    You could try artificially limiting the amount of memory and cores Mksquashfs thinks it can use with the -processors and -mem options, i.e.

    %mksquashfs some-dir some-file.sqsh -processors 24 -mem 24G

    which will bring its usage to within the ballpark of what it is known to work with; you can then work upwards until it goes bang.

    You could try system logs etc., but I suspect the problem is the sheer size of the system. System logs might be interesting, but only indirectly, insofar as they might show how Mksquashfs is trying to scale to the reported system. Such information is probably better obtained directly, by just knowing how the system appears to user space. That, and perhaps the results from artificially limiting Mksquashfs' usage, may point to a way forward.

    If this information is sensitive it may be better to take this offline: if you give me your email address I can contact you directly, or alternatively you can email me at phillip@lougher.demon.co.uk

    (*) Part and parcel of the issue is that Squashfs is still a hobbyist filesystem written in my spare time for no money, despite the widespread use of the system. Most people assume I get paid for work on Squashfs, but that sadly is not the case.

     
  • Brian Hawkins - 2014-11-20

    Yes, that's the one. From a user point of view it's just a bunch of Linux boxes on an InfiniBand network. I've typically been running mksquashfs on nodes with 12 cores and 24 GB RAM, probably not too far removed from normal experience.

    I'm most suspicious of the Lustre filesystem, which is probably pretty rare outside HPC environments. My hunch is that a disk sometimes fills up behind the scenes and interacts poorly with mksquashfs somehow. However, I'm not able to test this easily (I retried a failed job 100 times in a loop with no failures). A sysadmin assures me this case just results in an ENOSPC error, and a quick glance at the mksquashfs source suggests that errno is checked pretty diligently, so perhaps I'm barking up the wrong tree.
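    For what it's worth, the failure mode the sysadmin describes would normally surface through something like this (a minimal sketch of the kind of errno check I mean, not mksquashfs' actual code):

    /* sketch: a full output filesystem shows up as ENOSPC from write() */
    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    void checked_write(int fd, const void *buf, size_t count)
    {
        if (write(fd, buf, count) == -1) {
            if (errno == ENOSPC)
                fprintf(stderr, "FATAL: no space left on output filesystem\n");
            else
                perror("write");
            exit(1);
        }
    }

    so a quietly filling disk should produce a clean error message rather than memory corruption.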

    Since the backtraces aren't very helpful and I can't provide a failing test case, feel free to close the bug. Thanks again for your assistance.

     
    • Phillip Lougher - 2014-11-20

      I've typically been running mksquashfs on nodes with 12 cores and 24 GB RAM, probably not too far removed from normal experience.

      Hmm, if that's the case then the scale/size of the system may be a total red herring, with no bearing on the bug, and there's a chance it could be reproducible elsewhere.

      Unfortunately the concentration on the size of the system is because that's the only thing I've got to go on: it is the one thing that is known to be different, and by Occam's Razor it is going to be the likeliest cause. However, I'm also aware we may be hitting the "streetlight effect" (http://en.wikipedia.org/wiki/Streetlight_effect) here.

      I'm most suspicious of the Lustre filesystem, which is probably pretty rare outside HPC environments.

      Yes, I've never used Lustre but I'm aware of it, or at least the techniques behind it - I did post-doc research into scalable distributed high-performance video storage servers in the 90s (1993 - 1996). There are some ancient pages from 1995 that I resurrected some time ago (http://www.lougher.demon.co.uk/research/scams.html).

      I have never seen any bad reports about Lustre, and though it is conceivable it is silently dropping data through running out of space, or simply backing up on I/O, such things in my experience tend to show up in the output filesystem, not as memory corruption in the front-end reader/deflator compression threads as we're seeing here.

      So we're really back to needing a failing test case. One thing has occurred to me that may be a way out of the impasse. You're using no data compression, so the output blocks are the same size as the input blocks. As such Mksquashfs doesn't care what's in the data blocks, because its behaviour will be the same irrespective of content (with one caveat regarding sparse block handling and duplicates). If you can get a test case which fails when using the -no-sparse option (so sparse block detection isn't being performed) and the -no-duplicates option (so duplicate file detection isn't being performed), then it should be possible for me to reproduce the test case knowing only the filenames/directory hierarchy and the sizes of the files. In other words, I'll be able to reproduce the behaviour of Mksquashfs without knowing the data content (the files will be the same size but filled with an arbitrary non-zero value).

      Unfortunately I still need to know the filenames/directory hierarchy, because that affects the order of scanning and filesystem generation (depth first, in C sort order), and that ordering could be a significant factor in triggering the bug. Less likely, but even the length of the filenames might be significant.

      If the above filenames/directory hierarchy and file sizes are not sensitive (or you can produce a test case that has no sensitive elements there), then this may be a way forward.
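      To illustrate, such a content-free reproducer could be generated mechanically. The sketch below is hypothetical (the manifest format and filler byte are inventions here, not anything Mksquashfs defines): it reads lines of "size<TAB>relative/path", recreates the directories, and fills each file to the recorded size with a constant non-zero byte. A manifest could be captured on the real data with something like find . -type f -printf '%s\t%P\n' > manifest.txt.

      /* hypothetical sketch: rebuild a tree from "size<TAB>path" manifest
         lines, filling each file with a constant non-zero byte (0xAA) */
      #include <errno.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <sys/stat.h>

      static void make_parents(char *path)   /* "mkdir -p" for the dirname */
      {
          for (char *p = path + 1; *p; p++)
              if (*p == '/') {
                  *p = '\0';
                  if (mkdir(path, 0755) == -1 && errno != EEXIST) {
                      perror(path);
                      exit(1);
                  }
                  *p = '/';
              }
      }

      int main(int argc, char *argv[])
      {
          char line[4096], path[4096];
          long long size;
          FILE *manifest = argc == 2 ? fopen(argv[1], "r") : NULL;

          if (!manifest) {
              fprintf(stderr, "usage: %s manifest.txt\n", argv[0]);
              return 1;
          }

          while (fgets(line, sizeof(line), manifest)) {
              /* each line: decimal size, a tab, then the relative path */
              if (sscanf(line, "%lld\t%4095[^\n]", &size, path) != 2)
                  continue;
              make_parents(path);
              FILE *f = fopen(path, "w");
              if (!f) {
                  perror(path);
                  return 1;
              }
              for (long long i = 0; i < size; i++)   /* slow but simple filler */
                  fputc(0xAA, f);
              fclose(f);
          }
          fclose(manifest);
          return 0;
      }

      Running mksquashfs with -noD -no-sparse -no-duplicates over the regenerated tree should then exercise the same code paths as it does over your real data.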

       
