I am getting a repeatable segmentation fault deep in a script. It appears to be associated with running the stats matrix command on a datablock. I admit that I am having trouble distilling the bug down into a minimal repeatable example. I recompiled gnuplot without optimizations and ran it through lldb. See below for a stack trace.
I can only run my script on gnuplot 5.2.4 due to [bugs:#2115].
I'm hoping that something in the backtrace catches a maintainer's eye and either points them to the bug or to a better way of exposing it. If not, I can continue trying to get to a minimal repeatable example.
GNUPLOT_LIB=${GNUPLOT_LIB}:${PWD}/gnuplotrc:${PWD}/histograms lldb -- /Users/tegtmeye/usr/local/bin/gnuplot -d doit_script.gp
(lldb) target create "/Users/tegtmeye/usr/local/bin/gnuplot"
Current executable set to '/Users/tegtmeye/usr/local/bin/gnuplot' (x86_64).
(lldb) settings set -- target.run-args "-d" "doit_script.gp"
(lldb) run
Process 9735 launched: '/Users/tegtmeye/usr/local/bin/gnuplot' (x86_64)
line 48: warning: matrix contains missing or undefined values
Process 9735 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
frame #0: 0x0000000100030267 gnuplot`df_read_matrix(rows=0x00007ffeefbfe670, cols=0x00007ffeefbfe66c) at datafile.c:877
874 }
875
876 /* skip leading spaces */
-> 877 while (isspace((unsigned char) *s) && NOTSEP)
878 ++s;
879
880 /* skip blank lines and comments */
Target 0: (gnuplot) stopped.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
* frame #0: 0x0000000100030267 gnuplot`df_read_matrix(rows=0x00007ffeefbfe670, cols=0x00007ffeefbfe66c) at datafile.c:877
frame #1: 0x0000000100027026 gnuplot`df_determine_matrix_info(fin=0x0000000000000000) at datafile.c:2481
frame #2: 0x0000000100024594 gnuplot`df_open(cmd_filename="$LBLOCK1", max_using=2, plot=0x0000000000000000) at datafile.c:1409
frame #3: 0x00000001000ec5cf gnuplot`statsrequest at stats.c:808
frame #4: 0x000000010001d739 gnuplot`stats_command at command.c:2248
frame #5: 0x000000010001933b gnuplot`command at command.c:629
frame #6: 0x00000001000190d8 gnuplot`do_line at command.c:419
frame #7: 0x00000001000194db gnuplot`do_string_and_free(cmdline="stats $LBLOCK1 matrix nooutput name 'GMT_BLOCK_CAT_TMP'") at command.c:466
frame #8: 0x000000010001ad2e gnuplot`eval_command at command.c:1022
frame #9: 0x000000010001933b gnuplot`command at command.c:629
frame #10: 0x00000001000190d8 gnuplot`do_line at command.c:419
frame #11: 0x000000010007e16d gnuplot`load_file(fp=0x00007fffa8e1a1f8, name="block_cat.gp", calltype=2) at misc.c:424
frame #12: 0x000000010001aaf7 gnuplot`call_command at command.c:968
frame #13: 0x000000010001933b gnuplot`command at command.c:629
frame #14: 0x00000001000190d8 gnuplot`do_line at command.c:419
frame #15: 0x00000001000194db gnuplot`do_string_and_free(cmdline="call 'block_cat.gp' GMT_PAIRED_HISTOGRAM_BLOCK_ALL LBLOCK1 RBLOCK1") at command.c:466
frame #16: 0x000000010001ad2e gnuplot`eval_command at command.c:1022
frame #17: 0x000000010001933b gnuplot`command at command.c:629
frame #18: 0x00000001000190d8 gnuplot`do_line at command.c:419
frame #19: 0x000000010007e16d gnuplot`load_file(fp=0x00007fffa8e1a0c8, name="histograms/paired_histogram.gp", calltype=2) at misc.c:424
frame #20: 0x000000010001aaf7 gnuplot`call_command at command.c:968
frame #21: 0x000000010001933b gnuplot`command at command.c:629
frame #22: 0x00000001000190d8 gnuplot`do_line at command.c:419
frame #23: 0x000000010007e16d gnuplot`load_file(fp=0x00007fffa8e1a030, name="doit_script.gp", calltype=4) at misc.c:424
frame #24: 0x0000000100092e5e gnuplot`main(argc=1, argv=0x00007ffeefbff5e8) at plot.c:654
frame #25: 0x00007fff7602ded9 libdyld.dylib`start + 1
frame #26: 0x00007fff7602ded9 libdyld.dylib`start + 1
Thanks in advance
While there may of course be a real bug somewhere, I am suspicious of that stack trace. I get similar traces from correct code running under a debugger when the compiler optimization level is set to -O2 or higher. The compiler optimizes string comparisons to use SSE instructions that load a register from a memory chunk that can extend beyond the end of the string, leaving one or more potentially uninitialized bytes in the SSE register. The debugger (or valgrind) flags this as an error but so far as I know it is a non-fatal side effect of overly eager optimization.
Hi Ethan, thanks for taking a look at this. This definitely causes gnuplot to crash for my particular use case.
For the stack trace above, I pulled down the latest version of the repository, checked out the 5.2.4 tag, and built it with optimizations off to get better debugging messages, i.e.:
Here is a dump of the last 5 stack frames after the error. Again, if this doesn't make you suspect anything, I can try to get to a distilled reproducible test case.
Mike
Your output contains an earlier warning message:
line 48: warning: matrix contains missing or undefined values
That may be where the actual problem is first encountered. Missing/NaN values are supposed to be handled by the plotting routines, so that is a warning rather than a fatal error. I'm not sure that the stats code handles this condition gracefully, but let's defer that for the moment because the program apparently then segfaults before it gets back to the stats code. Nevertheless if you figure out why your script is triggering the warning it might suggest how to fix the problem before it gets as far as the code that is crashing.
Any chance you can trap and dump the content of the data block that the "stats" command is trying to analyze? That might be a huge clue. Can you add a command line "print $datablock" right before the call to "stats $datablock"?
I did some additional digging and I'm starting to suspect a memory corruption issue that only gets tickled in the stats code.
I've attached some contrived scripts that cause the crash (~90% of the time) on my machine. I apologize for not being able to get it down smaller and that they are snippets from what I'm currently working on.
See comments below.
'datafile.txt'
A tab-separated file containing text in the first column and numbers in the remaining 4 columns. Some rows contain NaNs in the numeric columns.
'block_load.gp'
A script that gets called to load a file and put it into a datablock. The column data is read in using strcol rather than as numeric values.
'block_select.gp'
A script that gets called to select columns out of one datablock and put into another datablock
'doit_script.gp'
A simple driver to demonstrate the crash. It calls 'block_load.gp' on one column in the datafile and then calls 'block_select.gp' to select the same column. Finally, it calls stats on the resulting datablock.
Some comments:
1) 'datafile.txt': Reducing the number of columns in the input datablock appears to make the crash disappear. I'm not seeing any non-ASCII or other strange characters in the file. The crash does not appear to depend on which column is pulled out of the datafile into the datablock.
2) 'block_load.gp': Changing how the data is loaded makes the crash go away, i.e. strcol -> column. Not calling 'block_load.gp' but simply putting the exact commands it runs into the driver script also makes the crash go away.
3) Calling print on any of the datablocks, copying the output to a new file, and re-running on that data makes the crash go away.
4) I've called the stats command using 'matrix'. Not using 'matrix' makes the crash go away. I think the 'warning: matrix contains missing or undefined values' warning comes from the fact that I've only fed a single column to the stats matrix command.
Again, this is on 5.2.4 due to the argument bug in 5.2.{5,6}
Let me know if you have problems duplicating or have any other questions,
Mike
Is there a reason that you can build 5.2.4 from the repository but cannot build from the current tip of branch-5-2-stable in which the ARGV bug is fixed? Even if I track down the issue using the 5.2.4 source we're still going to need to confirm the fix works when applied to the current tip.
Last edit: Ethan Merritt 2019-02-14
Hi Ethan,
Apologies on the late reply--I've been unable to spend additional time on this until now. I'll catch up to the later content but I wanted to let you know that the current tip won't compile for me.
and
Compilation error info:
I hadn't been keeping track of the various branches outside of the releases and I didn't know the stability. I'll try the current tip and see what happens.
Hmmm.
I don't get a segfault using these scripts on any of 5.2.4, 5.2.6, or the current development version. Nor can I trap an error of any sort at the line your trace indicates. Valgrind does indicate a potential problem a few lines earlier:
==10508== Conditional jump or move depends on uninitialised value(s)
==10508== at 0x445E57: df_read_matrix (datafile.c:870)
but so far that has not enlightened me.
I think I understand what is strange about your data block, however, even though I don't know why that would trigger the segfault. The data block contains 216 lines containing
<value> <tab>
I assume the intent is to hold a single column of values. However the output I see from the stats command starts with:
Apparently the "matrix" modifier has caused it to interpret the datablock as holding ***two*** columns of values, but the second column is entirely blank. So the mean and stddev and everything else beyond that is undefined. I don't get a segfault but also I don't get anything useful from the analysis.
So I tentatively suspect the top-level problem is a disagreement about whether the lines in a tab-separated file end with <tab><cr> or just <cr>. gnuplot is seeing a trailing <tab> and interpreting that as being followed by a blank column of data. That may not match the convention your scripts are assuming. For that matter gnuplot may not be consistent about this, since I think the output from "with table" also generates a trailing field separator.
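To illustrate the point about the trailing separator, here is a minimal sketch in Python (not gnuplot code). The helpers `count_fields` and `matrix_shape` are hypothetical and only mimic the general idea of splitting a record on an explicit separator versus on runs of whitespace; they are not gnuplot's actual parser, and the sample values are made up.

```python
def count_fields(line, sep=None):
    """Count the fields in one record.

    sep=None mimics whitespace-separated input: runs of blanks collapse and
    leading/trailing blanks are ignored. An explicit sep treats every
    occurrence as a field boundary, so a trailing separator produces an
    extra, empty field. Hypothetical helper, not gnuplot's parser.
    """
    line = line.rstrip("\r\n")
    if sep is None:
        return len(line.split())
    return len(line.split(sep))


def matrix_shape(lines, sep=None):
    """Return (columns, rows), taking the widest record as the column count."""
    widths = [count_fields(l, sep) for l in lines if l.strip()]
    return (max(widths), len(widths)) if widths else (0, 0)


# 216 records of the form "<value><tab>", like the data block described above
# (the values themselves are invented for the illustration).
block = ["%d\t\n" % i for i in range(216)]

cols_tab, nrows = matrix_shape(block, sep="\t")
cols_ws, _ = matrix_shape(block)
print(cols_tab, nrows, cols_tab * nrows)  # 2 216 432
print(cols_ws, nrows, cols_ws * nrows)    # 1 216 216
```

With an explicit tab separator every record appears to have a second, empty field, so a single column of 216 values looks like a 2 x 216 matrix in which half the entries are blank.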
Of course the program should not segfault in any case, but at this point I suspect that even if I prevent the segfault it still will not behave as you intend.
Last edit: Ethan Merritt 2019-02-14
Mike:
Can you turn off the tab separator setting before the "stats" command to see if that avoids the segfault? I am pretty sure the stats output will still be nonsense but I'd like to see what happens if the trailing separator is not a factor. I.e. change the last bit of doit_script.gp to
Ethan,
Good catch on the initial datablock... I guess I could have removed the 'block_select.gp' script as it is irrelevant to the issue, i.e. doit_script.gp is now:
Few updates.
As above, I couldn't compile the latest tip but I can confirm that I still get the same behavior and segfault on 5.2.6.
I can confirm that running stats right after 'block_load' does indeed report as 432 records, MATRIX: [2 X 216] as you found.
Inserting "unset datafile separator" before the stats call correctly reports 216 records as MATRIX: [1 X 216]. The stats values are still undefined, however, and gnuplot still segfaults.
Before I apply your patches and see what happens: these scripts are a little stripped down to get to a minimal working example, but my workflow has evolved to mostly manipulating datablocks to feed to downstream plot scripts. I generally like this workflow because I'm generating about 300 custom plots at a time in an automated fashion. The original basis for the datablock scripts was to deal with gnuplot's idiosyncrasies in how it addresses data in files. The issues regarding extra columns would explain a lot.
After much poking at code bits here and there I have 3 fixes.
The first one is sufficient to make your test scripts work here in the sense that "stats" reports meaningful values. It adds a test for NaN values in matrix data fed to "stats". Individual "plot" modes do various things to handle NaN input but the parallel code in "stats" did not do anything to handle it. It is debatable whether the statistical measures output by "stats" are correct in the presence of NaN input values. Should the standard error divide by N=size of matrix or divide by (N - number_of_NaN values)? And so on. It may be worth adding additional warning messages to the output.
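The question about which N to use can be made concrete with a small sketch in Python. It is not the gnuplot "stats" code. The function `matrix_stats` is hypothetical, and it assumes a population standard deviation and a standard error of the form stddev / sqrt(count); both are assumptions made only for the illustration.

```python
import math


def matrix_stats(values):
    """Mean and population std. dev. over the non-NaN entries, plus two
    candidate standard errors. Illustrative only; not gnuplot's stats code.
    """
    total = len(values)
    valid = [v for v in values if not math.isnan(v)]
    n = len(valid)
    if n == 0:
        nan = math.nan
        return {"total": total, "valid": 0, "mean": nan, "stddev": nan,
                "stderr_total": nan, "stderr_valid": nan}
    mean = sum(valid) / n
    stddev = math.sqrt(sum((v - mean) ** 2 for v in valid) / n)
    return {
        "total": total,
        "valid": n,
        "mean": mean,
        "stddev": stddev,
        # Convention 1: N = size of the matrix, NaN entries included
        "stderr_total": stddev / math.sqrt(total),
        # Convention 2: N = size of the matrix minus the number of NaN entries
        "stderr_valid": stddev / math.sqrt(n),
    }


data = [1.0, 2.0, float("nan"), 3.0]   # made-up sample with one NaN
result = matrix_stats(data)
naive_mean = sum(data) / len(data)     # NaN propagates when it is not filtered
print(result["mean"], math.isnan(naive_mean))
print(result["stderr_total"], result["stderr_valid"])
```

Without any NaN test a single NaN makes the plain mean undefined, which matches the undefined output described earlier in the thread. With the test in place, the two conventions give different standard errors whenever at least one entry is NaN.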
The second fix removes the trailing field separator from records output by "plot with table". This does not change the mean/stddev/etc reported by "stats" but does make it recognize the correct dimensions of the matrix.
The third has no apparent effect when I run your tests here, but does prevent the "depends on uninitialised value" warning when run under valgrind. I am guessing that this will also prevent your segfault.
I want to test these some more before committing them to the git source, but it would help if you could confirm that applying the attached patch (fixes 1 + 3) to the 5.2.4 source fixes your problem.
I also append the patch text below if that is easier for you to test.
Hi again Ethan,
Applied your patches.
First, the segfault issue appears to be fixed, so that is a good thing. Thanks!
So yes, the actual results are not what I was expecting because of the way the number of columns is interpreted. The "block_load.gp" script essentially just loads a datafile and plots it to a block, but as you suggest above, the unexpected behavior really appears to come from how columns are parsed and used. To me, simply plotting one column of data from an input file should result in one column of data in the output block, but that doesn't seem to be the case. In my scripts, standardizing on a 'tab' separator was an attempt to deal with the inconsistencies. For example, given a datafile:
Now read in the first column and call stats
However setting the datafile separator to tab yields different results:
So it appears that parsing columns with the default whitespace (which should include tab) behaves differently than setting the separator to tab.
So how would you like to proceed? The top-level issue of the segfault appears to be fixed. Would you like to close this bug and open another one regarding the tab/newline/column inconsistency?
Ethan-
Disregard my previous message. While swapping between different versions of gnuplot I was accidentally running the above examples on 5.2.4, not your patched version. It appears things might be fixed, including the segfault.
Again, thanks for taking care of this.
Mike
Thanks for testing. I have committed all three changes in both 5.2 and 5.3.