#86 singlepulse + threads = errors and segfaults

version_1.0
open
nobody
None
5
2020-05-27
2020-04-15
Mike Keith
No

Running dspsr with threading and single-pulses seems to produce "random" errors or segfaults when writing out files.

dspsr -a psrfits -s -K /fred/oz005/search/J0437-4715/2019-07-20-03:14:15/1284/2019-07-20-03:14:15.sf

works as expected.

but

dspsr -a psrfits -s -K -t 2 /fred/oz005/search/J0437-4715/2019-07-20-03:14:15/1284/2019-07-20-03:14:15.sf

produces errors such as:

dspsr: blocksize=4096 samples or 255.996 MB
dsp::UnloaderShare::nonblocking_unload error division 0
Error::stack
    dsp::Archiver::unload
    Pulsar::Archive::unload
    FITSArchive::unload_file
    FITSArchive::unload_file
Error::FailedCall
Error::message
    fits_execute_template (/home/mkeith/soft/share/psrheader.fits): null pointer arg (parser)
THREAD ERROR: 
Error::stack
    dsp::SingleThread::run
    dsp::SingleThread::run
    dsp::Transformation[Subint<Fold>]::operation
    dsp::Subint::transformation
    dsp::UnloaderShare::Submit::unload
    dsp::Archiver::unload
    Pulsar::Archive::unload
    FITSArchive::unload_file
    FITSArchive::unload_file
Error::FailedCall
Error::message
    fits_execute_template (/home/mkeith/soft/share/psrheader.fits): null pointer arg (parser)

or a massive memory dump...

*** Error in `dspsr': double free or corruption (fasttop): 0x00007f6cd03eb310 ***
======= Backtrace: =========
/lib64/libc.so.6[0x3f2ee7f7c4]
/lib64/libc.so.6[0x3f2ee84ac0]
/lib64/libc.so.6(realloc+0x1d2)[0x3f2ee86162]
/apps/users/pulsar/skylake/software/psr_cfitsio/3.450/lib/libcfitsio.so.7(ngp_line_from_file+0xfc)[0x7f6ce963e0ac]
/apps/users/pulsar/skylake/software/psr_cfitsio/3.450/lib/libcfitsio.so.7(ngp_read_line+0x63)[0x7f6ce963ea13]
/apps/users/pulsar/skylake/software/psr_cfitsio/3.450/lib/libcfitsio.so.7(fits_execute_template+0x20f)[0x7f6ce96403bf]
/home/mkeith/soft/lib/libpsrbase.so.0(_ZNK6Pulsar11FITSArchive11unload_fileEPKc+0x437)[0x7f6ce9eb0257]
/home/mkeith/soft/lib/libpsrbase.so.0(_ZNK6Pulsar7Archive6unloadEPKc+0x890)[0x7f6ce9dd5d60]
/home/mkeith/soft/lib/libdspsr.so.0(_ZN3dsp8Archiver6unloadEPKNS_11PhaseSeriesE+0x694)[0x7f6cea93b334]
/home/mkeith/soft/lib/libdspsr.so.0(_ZN3dsp13UnloaderShare6Submit6unloadEPKNS_11PhaseSeriesE+0x46)[0x7f6cea9574d6]
/home/mkeith/soft/lib/libdspsr.so.0(_ZN3dsp6SubintINS_4FoldEE14transformationEv+0x4bc)[0x7f6cea96958c]
/home/mkeith/soft/lib/libdspsr.so.0(_ZN3dsp14TransformationINS_10TimeSeriesENS_11PhaseSeriesEE9operationEv+0xc19)[0x7f6cea954029]
/home/mkeith/soft/lib/libdspbase.so.0(_ZN3dsp9Operation7operateEv+0x4f)[0x7f6cea6787bf]
/home/mkeith/soft/lib/libdspdsp.so.0(_ZN3dsp12SingleThread3runEv+0x43b)[0x7f6cea840d8b]
/home/mkeith/soft/lib/libdspdsp.so.0(_ZN3dsp11MultiThread6threadEPv+0xd0)[0x7f6cea84ba50]
/lib64/libpthread.so.0[0x3f2fa07e65]
/lib64/libc.so.6(clone+0x6d)[0x3f2eefe88d]

or other errors related to cfitsio.

dspsr: b68528e15e83386c9b6b6093d6faa5939db59210
psrchive: 1c239d662708e317738a495a6d8ee384dabeb999

Discussion

  • Mike Keith

    Mike Keith - 2020-05-15

    I have done a tiny bit more research into this bug, which still exists as of 0abe673a7566691b527df4ab28e22bb6a3a6dd02. In particular it "goes away" if you

    • Don't use psrfits as an output option, OR
    • Use -A to append to a single archive.

    I suspect the issue is that something is not being initialised correctly in the output archiver, which causes a null or invalid pointer to be passed to cfitsio. I'm not sure why single pulses cause it in particular though...

     

    Last edit: Mike Keith 2020-05-15
  • Willem van Straten

    Hi Mike,

    Thanks for digging into this. I think that both of these clues point to the CFITSIO library not being thread safe. A Google search for "cfitsio thread safe" comes up with

    https://www.google.com/search?q=cfitsio+thread+safe&oq=cfitsio+thread+safe&aqs=chrome..69i57j33.3635j0j7&sourceid=chrome&ie=UTF-8

    which says that, during compilation, the CFITSIO library must be configured with the --enable-reentrant command line option. Perhaps it's worth giving this a try?

    When -A is used, the different threads simply add their single-pulse profiles to a grand total archive in memory, which is written out, using a single thread, only at the end of execution.

    Cheers,
    Willem
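The -A behaviour described above can be sketched roughly as follows. This is only an illustration of the pattern (worker threads add to an in-memory total, and one thread writes at the end), not the dspsr implementation; `TotalArchive`, `worker` and `run` are hypothetical names, and the "pulses" are dummy data.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a grand-total archive held in memory.
struct TotalArchive {
    std::mutex lock;
    std::vector<double> profile;   // summed pulse profile
    std::size_t pulses = 0;        // number of single pulses added so far

    explicit TotalArchive(std::size_t nbin) : profile(nbin, 0.0) {}

    // Called from worker threads: touches memory only, never the FITS library.
    void add(const std::vector<double>& single_pulse) {
        assert(single_pulse.size() == profile.size());
        std::lock_guard<std::mutex> guard(lock);
        for (std::size_t i = 0; i < profile.size(); ++i)
            profile[i] += single_pulse[i];
        ++pulses;
    }
};

// Each worker adds a fixed number of dummy single pulses to the total.
void worker(TotalArchive& total, int npulse) {
    std::vector<double> pulse(total.profile.size(), 1.0);
    for (int i = 0; i < npulse; ++i)
        total.add(pulse);
}

// Runs nthread workers and returns the pulse count seen after all have joined.
std::size_t run(std::size_t nthread, int npulse, std::size_t nbin) {
    TotalArchive total(nbin);
    std::vector<std::thread> threads;
    for (std::size_t t = 0; t < nthread; ++t)
        threads.emplace_back(worker, std::ref(total), npulse);
    for (auto& th : threads)
        th.join();
    // Only at this point, with one thread left, would the file be written.
    return total.pulses;
}
```

With two workers adding 1000 pulses each, `run(2, 1000, 8)` returns 2000. Because file output happens only after the join, no two threads can be inside the output library at the same time, which would explain why -A hides the problem.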

     
  • Willem van Straten

    OzStar has a module named cfitsio/3.450-reentrant

    By default,

    $ module list cfitsio
    
    Currently Loaded Modules Matching: cfitsio
      1) psr_cfitsio/3.450
    

    I'll ask if psr_cfitsio is configured with --enable-reentrant ...

     
  • Mike Keith

    Mike Keith - 2020-05-16

    I had the same thought, but copying a .sf file to my laptop and using a locally compiled version of cfitsio with --enable-reentrant didn't seem to fix it, though I will try with the new module on OzStar. I started debugging on my laptop but it really is not very clear what's going on. The errors are very "random" and seem to come from bits of memory being overwritten (e.g. sometimes it has an array of pointers to dereference but they have 'silly values' like 0x16), so I think the actual error occurs well before the segfault. As far as I can tell the errors are always in cfitsio, and I think (but haven't really checked) multiple threads are indeed calling cfitsio routines at the same time. Maybe there are some cfitsio functions that are not in fact thread safe even with this option. I'll try to dig a bit more, but I got kind of sick of it yesterday!
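One way to check the suspicion that several threads are inside the same routine at once is a small counting probe around the suspect call. The sketch below is an assumption about how that could be done and is not code from dspsr or psrchive; `ConcurrencyProbe`, `suspect_call` and `probe_threads` are hypothetical names, and the sleep stands in for the real library call.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>
#include <vector>

// Records the largest number of threads seen inside a guarded region at once.
class ConcurrencyProbe {
public:
    class Scope {
    public:
        explicit Scope(ConcurrencyProbe& p) : probe(p) {
            int now = ++probe.inside;
            int seen = probe.peak.load();
            // Raise the recorded peak if this entry exceeds it.
            while (now > seen && !probe.peak.compare_exchange_weak(seen, now)) {}
        }
        ~Scope() { --probe.inside; }
    private:
        ConcurrencyProbe& probe;
    };

    int max_concurrent() const { return peak.load(); }

private:
    std::atomic<int> inside{0};
    std::atomic<int> peak{0};
};

// Hypothetical stand-in for a library call that should never overlap.
void suspect_call(ConcurrencyProbe& probe) {
    ConcurrencyProbe::Scope scope(probe);
    std::this_thread::sleep_for(std::chrono::microseconds(50));
}

// Calls suspect_call from nthread threads and returns the peak overlap seen.
int probe_threads(int nthread, int ncall) {
    assert(nthread > 0 && ncall >= 0);
    ConcurrencyProbe probe;
    std::vector<std::thread> threads;
    for (int t = 0; t < nthread; ++t)
        threads.emplace_back([&probe, ncall] {
            for (int i = 0; i < ncall; ++i)
                suspect_call(probe);
        });
    for (auto& th : threads)
        th.join();
    return probe.max_concurrent();
}
```

A peak above 1 means calls overlapped. A peak of 1 only means no overlap was observed in that particular run, so it does not prove the calls are serialised.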

     
  • Willem van Straten

    Me too. My trial recompilation with --enable-reentrant didn't seem to fix the problem.

    I'm going to search the CFITSIO documentation to see if any routines have been flagged as "never thread safe" and I'll double-check the psrchive usage of cfitsio to ensure that there are no static variables or other smoking guns.

     
  • Mike Keith

    Mike Keith - 2020-05-27

    For the record: we believe this is due to calls to the cfitsio routine fits_execute_template, which is not thread safe. I have made a patched version of cfitsio that fixes it, but it is also likely that we don't really need to call fits_execute_template every time we write a file, and perhaps it would be better for psrchive to cache an empty psrfits file in memory.
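The caching idea could look something like the sketch below: the non-thread-safe step runs once, and every file written afterwards starts from a private copy of the cached result. This is an assumption about one possible shape of the fix, not the psrchive change or the cfitsio patch; `parse_template_unsafe` is a hypothetical stand-in for the template step, and the header lines are dummy strings rather than a real PSRFITS template.

```cpp
#include <cassert>
#include <string>
#include <thread>
#include <vector>

// Counts how often the unsafe step runs. Deliberately not atomic: the step
// must never be entered by two threads at once.
static int template_calls = 0;

// Hypothetical stand-in for the non-thread-safe template step.
std::vector<std::string> parse_template_unsafe() {
    ++template_calls;
    return {"SIMPLE  = T", "BITPIX  = 8", "NAXIS   = 0"};
}

// Returns the cached header. Since C++11, a function-local static is
// initialised exactly once even if several threads arrive here together.
const std::vector<std::string>& cached_header() {
    static const std::vector<std::string> header = parse_template_unsafe();
    return header;
}

// Each new file starts from its own copy of the cached header.
std::vector<std::string> new_file_header(const std::string& name) {
    std::vector<std::string> header = cached_header();
    header.push_back("FILENAME= " + name);
    return header;
}

// Builds nfile headers in each of nthread threads and returns how many
// times the unsafe step ran.
int write_many(int nthread, int nfile) {
    std::vector<std::thread> threads;
    for (int t = 0; t < nthread; ++t)
        threads.emplace_back([t, nfile] {
            for (int i = 0; i < nfile; ++i) {
                std::vector<std::string> header = new_file_header(
                    "pulse_" + std::to_string(t) + "_" + std::to_string(i) + ".ar");
                (void)header;   // a real writer would now write the file
            }
        });
    for (auto& th : threads)
        th.join();
    return template_calls;
}
```

Here `write_many(4, 100)` returns 1, since the template step runs once no matter how many threads or files are involved. A simpler but slower alternative would be to hold a single mutex around every call to the unsafe routine.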

     
