Running dspsr with threading and single-pulses seems to produce "random" errors or segfaults when writing out files.
dspsr -a psrfits -s -K /fred/oz005/search/J0437-4715/2019-07-20-03:14:15/1284/2019-07-20-03:14:15.sf
works as expected.
but
dspsr -a psrfits -s -K -t 2 /fred/oz005/search/J0437-4715/2019-07-20-03:14:15/1284/2019-07-20-03:14:15.sf
produces errors such as:
dspsr: blocksize=4096 samples or 255.996 MB
dsp::UnloaderShare::nonblocking_unload error division 0
Error::stack
dsp::Archiver::unload
Pulsar::Archive::unload
FITSArchive::unload_file
FITSArchive::unload_file
Error::FailedCall
Error::message
fits_execute_template (/home/mkeith/soft/share/psrheader.fits): null pointer arg (parser)
THREAD ERROR:
Error::stack
dsp::SingleThread::run
dsp::SingleThread::run
dsp::Transformation[Subint<Fold>]::operation
dsp::Subint::transformation
dsp::UnloaderShare::Submit::unload
dsp::Archiver::unload
Pulsar::Archive::unload
FITSArchive::unload_file
FITSArchive::unload_file
Error::FailedCall
Error::message
fits_execute_template (/home/mkeith/soft/share/psrheader.fits): null pointer arg (parser)
or a massive memory dump...
*** Error in `dspsr': double free or corruption (fasttop): 0x00007f6cd03eb310 ***
======= Backtrace: =========
/lib64/libc.so.6[0x3f2ee7f7c4]
/lib64/libc.so.6[0x3f2ee84ac0]
/lib64/libc.so.6(realloc+0x1d2)[0x3f2ee86162]
/apps/users/pulsar/skylake/software/psr_cfitsio/3.450/lib/libcfitsio.so.7(ngp_line_from_file+0xfc)[0x7f6ce963e0ac]
/apps/users/pulsar/skylake/software/psr_cfitsio/3.450/lib/libcfitsio.so.7(ngp_read_line+0x63)[0x7f6ce963ea13]
/apps/users/pulsar/skylake/software/psr_cfitsio/3.450/lib/libcfitsio.so.7(fits_execute_template+0x20f)[0x7f6ce96403bf]
/home/mkeith/soft/lib/libpsrbase.so.0(_ZNK6Pulsar11FITSArchive11unload_fileEPKc+0x437)[0x7f6ce9eb0257]
/home/mkeith/soft/lib/libpsrbase.so.0(_ZNK6Pulsar7Archive6unloadEPKc+0x890)[0x7f6ce9dd5d60]
/home/mkeith/soft/lib/libdspsr.so.0(_ZN3dsp8Archiver6unloadEPKNS_11PhaseSeriesE+0x694)[0x7f6cea93b334]
/home/mkeith/soft/lib/libdspsr.so.0(_ZN3dsp13UnloaderShare6Submit6unloadEPKNS_11PhaseSeriesE+0x46)[0x7f6cea9574d6]
/home/mkeith/soft/lib/libdspsr.so.0(_ZN3dsp6SubintINS_4FoldEE14transformationEv+0x4bc)[0x7f6cea96958c]
/home/mkeith/soft/lib/libdspsr.so.0(_ZN3dsp14TransformationINS_10TimeSeriesENS_11PhaseSeriesEE9operationEv+0xc19)[0x7f6cea954029]
/home/mkeith/soft/lib/libdspbase.so.0(_ZN3dsp9Operation7operateEv+0x4f)[0x7f6cea6787bf]
/home/mkeith/soft/lib/libdspdsp.so.0(_ZN3dsp12SingleThread3runEv+0x43b)[0x7f6cea840d8b]
/home/mkeith/soft/lib/libdspdsp.so.0(_ZN3dsp11MultiThread6threadEPv+0xd0)[0x7f6cea84ba50]
/lib64/libpthread.so.0[0x3f2fa07e65]
/lib64/libc.so.6(clone+0x6d)[0x3f2eefe88d]
or other errors related to cfitsio.
dspsr: b68528e15e83386c9b6b6093d6faa5939db59210
psrchive: 1c239d662708e317738a495a6d8ee384dabeb999
I have done a tiny bit more research into this bug, which still exists as of 0abe673a7566691b527df4ab28e22bb6a3a6dd02. In particular it "goes away" if you
I suspect the issue is that something is not being initialised correctly in the output archiver, which causes a null or invalid pointer to be passed to cfitsio. I'm not sure why single pulses cause it in particular though...
Last edit: Mike Keith 2020-05-15
Hi Mike,
Thanks for digging into this. I think that both of these clues point to the fact that the CFITSIO library must not be thread safe. A google search for "cfitsio thread safe" comes up with
https://www.google.com/search?q=cfitsio+thread+safe&oq=cfitsio+thread+safe&aqs=chrome..69i57j33.3635j0j7&sourceid=chrome&ie=UTF-8
which says that, during compilation, the CFITSIO library must be configured with the --enable-reentrant command line option. Perhaps it's worth giving this a try?
When -A is used, the different threads simply add their single-pulse profiles to a grand total archive in memory, which is written out, using a single thread, only at the end of execution.
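To illustrate why the -A path would avoid concurrent file writes, here is a minimal C++ sketch of the pattern described above: worker threads add profiles to a shared total under a lock, and only one thread touches the output at the end. The names `TotalProfile` and `accumulate_demo` are hypothetical and are not taken from dspsr or psrchive.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for the in-memory "grand total" archive.
struct TotalProfile {
    std::vector<double> bins;
    std::mutex lock;

    explicit TotalProfile(std::size_t nbin) : bins(nbin, 0.0) {}

    // Called by worker threads: add one single-pulse profile to the total.
    void add(const std::vector<double>& pulse) {
        assert(pulse.size() == bins.size());
        std::lock_guard<std::mutex> guard(lock);
        for (std::size_t i = 0; i < bins.size(); ++i) bins[i] += pulse[i];
    }
};

// Each of nthread workers adds npulse flat profiles of amplitude 1.0.
// Returns the value of bin 0 once every worker has been joined.
double accumulate_demo(unsigned nthread, unsigned npulse, std::size_t nbin) {
    TotalProfile total(nbin);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < nthread; ++t) {
        workers.emplace_back([&total, npulse, nbin] {
            const std::vector<double> pulse(nbin, 1.0);
            for (unsigned p = 0; p < npulse; ++p) total.add(pulse);
        });
    }
    for (auto& w : workers) w.join();
    // Only at this point, on a single thread, would the total be written out.
    return total.bins[0];
}
```

With this layout no file I/O happens inside the worker threads, which would be consistent with the problem not showing up when -A is used.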
Cheers,
Willem
OzStar has a module named cfitsio/3.450-reentrant
By default,
I'll ask if psr_cfitsio is configured with --enable-reentrant ...
I had the same thought - but copying a .sf file to my laptop and using a locally compiled version of cfitsio with --enable-reentrant didn't seem to fix it, though I will try with the new module on ozstar. I started debugging on my laptop but it really is not very clear what's going on. The errors are very "random" and seem to be from bits of memory being overwritten (e.g. sometimes it has an array of pointers to dereference but they have 'silly values' like 0x16 - so I think the actual error occurs well before the segfault), but as far as I can tell they are always in cfitsio, and I think (but haven't really checked) multiple threads are indeed calling cfitsio routines at the same time. Maybe there are some cfitsio functions that are not in fact thread safe even with this option. I'll try to dig a bit more, but I got kind of sick of it yesterday!
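One way to test the "multiple threads in cfitsio at once" hypothesis would be to serialise the suspect calls behind a single lock and see whether the errors disappear. The C++ sketch below shows the idea only; `unsafe_format` is a hypothetical stand-in for a library routine that keeps state in a static buffer, not a real cfitsio function, and `io_lock` is an assumed global mutex.

```cpp
#include <cassert>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for a routine that uses a static scratch buffer
// and is therefore unsafe to call from several threads at once.
std::string unsafe_format(int id) {
    static std::string scratch;  // shared by every caller
    scratch.clear();
    scratch += "file-";
    scratch += std::to_string(id);
    return scratch;              // copied out before the function returns
}

// Assumed global lock that serialises every call into the unsafe routine.
std::mutex io_lock;

std::string locked_format(int id) {
    std::lock_guard<std::mutex> guard(io_lock);
    return unsafe_format(id);
}

// Runs nthread workers that each make ncall locked calls and check the
// result. Returns true if no call ever saw another thread's data.
bool serialized_demo(int nthread, int ncall) {
    assert(nthread > 0 && ncall > 0);
    std::vector<int> bad(nthread, 0);  // one counter per thread, no sharing
    std::vector<std::thread> workers;
    for (int t = 0; t < nthread; ++t) {
        workers.emplace_back([t, ncall, &bad] {
            for (int c = 0; c < ncall; ++c) {
                const int id = t * ncall + c;
                if (locked_format(id) != "file-" + std::to_string(id)) ++bad[t];
            }
        });
    }
    for (auto& w : workers) w.join();
    for (int b : bad) {
        if (b != 0) return false;
    }
    return true;
}
```

If wrapping the unload step in a lock like this made the crashes go away, that would point at a specific non-reentrant routine rather than at memory corruption elsewhere.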
Me too. My trial recompilation with --enable-reentrant didn't seem to fix the problem.
I'm going to search the CFITSIO documentation to see if any routines have been flagged as "never thread safe" and I'll double-check the psrchive usage of cfitsio to ensure that there are no static variables or other smoking guns.
For the record - We believe this is due to calls to the cfitsio routine fits_execute_template, which is not thread safe. I have made a patched version of cfitsio that fixes it, but it is also likely that we don't really need to call fits_execute_template every time we write a file, and perhaps it would be better for psrchive to cache an empty psrfits file in memory.
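The caching idea could look roughly like the C++ sketch below: the template is built exactly once, and every writer then starts from a private copy. This is only an illustration under stated assumptions. `build_template`, `cached_template` and `make_file` are hypothetical names, the template text is a placeholder rather than a real PSRFITS header, and none of this is existing psrchive code.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Counts how often the (stand-in) template parser actually runs.
std::atomic<int> template_builds{0};

// Hypothetical stand-in for the non-thread-safe template parsing step.
std::string build_template() {
    ++template_builds;
    return "SIMPLE = T\nEND\n";  // placeholder text, not a real header
}

// Builds the template on first use only; later callers reuse the cache.
const std::string& cached_template() {
    static std::once_flag once;
    static std::string cache;
    std::call_once(once, [] { cache = build_template(); });
    return cache;
}

// Each output file starts from a private copy of the cached template.
std::string make_file(int subint) {
    std::string file = cached_template();
    file += "SUBINT = " + std::to_string(subint) + "\n";
    return file;
}

// Creates nfile files on each of nthread threads and returns how many
// times the template was built in total.
int cache_demo(int nthread, int nfile) {
    std::vector<std::thread> workers;
    for (int t = 0; t < nthread; ++t) {
        workers.emplace_back([nfile] {
            for (int i = 0; i < nfile; ++i) {
                const std::string file = make_file(i);
                assert(!file.empty());
            }
        });
    }
    for (auto& w : workers) w.join();
    return template_builds.load();
}
```

Because `std::call_once` guarantees a single execution, the unsafe parsing step never runs concurrently, and the per-file cost drops to a string copy.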