Hi!
I was trying to speed up the compilation of some 1000 programs by launching up to 8 cobc sessions in parallel, only to encounter failures that do not occur when I compile the same programs serially.
The stdout output looks like:
~~~
cobc (GnuCOBOL) 3.1.2.0
Built Dec 14 2021 16:58:39 Packaged Dec 23 2020 12:04:58 UTC
C version "4.8.5 20150623 (Red Hat 4.8.5-36)"
Error: cobc failed to give a parse tree, stopping
~~~
I checked: each program gets a unique /tmp/cob*_0.cob file, and the word "parallel" does not appear in the output of "cobc --help". If it matters, my environment is "3.10.0-957.el7.x86_64 #1 SMP Thu Oct 4 20:48:51 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux", and I used a Korn shell script for the parallel launches.
~~~
$ ksh --version
version sh (AT&T Research) 93u+ 2012-08-01
~~~
Suggestions? Thanks!
Jonathan
When running the testsuite in parallel I also commonly use at most 14 cores, and when compiling with GnuCOBOL on production environments I've also seen 32 parallel compiles.
The temporary files could conflict, but that is very unlikely.
Where does the "cobc failed to give a parse tree" message come from?
How did you start the compiles in parallel?
Just a note: I personally would suggest using make for that, as it is quite reliable; you define only "sequentially" what needs to be done, and you can run it in parallel as you like, with make checking the dependencies, if any are defined.
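Simon's make idea could be sketched roughly like this. Everything here is illustrative, not from the thread: the Makefile.parallel name and the rule layout are my guesses, only the cob2 wrapper and the .cob naming come from this discussion; .RECIPEPREFIX (GNU make 3.82+) is used just to avoid literal-tab recipes in the heredoc.

```shell
# Hypothetical sketch: declare each .cob -> .out compile once, then let
# "make -j 8" supply the parallelism and the dependency checking.
cat > Makefile.parallel <<'EOF'
.RECIPEPREFIX = >
SRCS := $(wildcard *.cob)
OUTS := $(SRCS:.cob=.out)

all: $(OUTS)

%.out: %.cob
> cob2 $<
EOF
# run with: make -f Makefile.parallel -j 8
```

With copybook dependencies declared as extra prerequisites (e.g. PROG.out: copylib/MEMBER.cpy), make would also rebuild only what actually changed.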
Thank you, Simon, for your quick response! I cannot use "make -j" at this point because this phase of my process analyzes the COPY and CALL statements in order to build the Makefile.
The message you questioned comes from this Korn shell code snippet in my "cob2" script:
~~~
cobc ${MY_COBCFLAGS} -o ${IN1}.out ${IN1}.0
if [ ! -e ${IN1}.out ]; then
    echo "Error: cobc failed to give a parse tree, stopping"
    exit 17
fi
~~~
The code that submits the parallel runs looks like this:
~~~
ls *.cob *.cbl 2>/dev/null | while read f ; do
    PARTITION=${WAYS_PARALLEL}
    while [ ${PARTITION} -gt 0 ]; do
        (
            j=$(basename ${f} | cut -f1 -d'.')
            if [ ! -s ${j}.deps ]; then
                print "Preparing ${f} in partition: $((${WAYS_PARALLEL}-${PARTITION}))"
                if [ "${f}" != "$(basename ${f} .cob)" ]; then
                    ( eval "$(cat Makefile.opts ${f}.opts)"; cob2 ${f} ${charset} )
                else
                    ( eval "$(cat Makefile.cbl.opts ${f}.opts)"; cob2 ${f} ${charset} )
                fi
                if [ ! -s ${j}.deps ]; then
                    export HALT=$((${HALT}+1))
                    print "  Failed to produce ${j}.deps "
                fi
            fi
        ) &
        PARTITION=$((${PARTITION}-1))
        if [ ${PARTITION} -gt 0 ]; then
            # Get next and handle end of the input list
            read f
            if [ -z "${f}" ]; then PARTITION=0; fi
        fi
    done
    wait
done
~~~
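One thing worth noting about the snippet above (an editor's observation, not something reported in the thread): HALT is incremented inside the backgrounded subshell, so the parent shell never sees the update, and any final check of HALT in the parent will read its old value. A minimal demonstration:

```shell
# A variable changed (even exported) inside a backgrounded subshell does
# not propagate back to the parent shell; the increment below is lost.
HALT=0
( export HALT=$((HALT+1)) ) &
wait
echo "HALT=${HALT}"    # prints HALT=0, not HALT=1
```

If the script relies on HALT to stop the run, one workaround is for each failing job to touch a sentinel file that the parent tests after `wait`.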
I've seen no problems with this code before when running 2 and 4 ways parallel, and I had it reviewed by colleagues, but now I'm experiencing random compile failures when not running serially, and these failures all occurred during cobc execution. I was hoping there was a knob for parallel runs, or at least for debugging them.
Finally, please ignore the confidentiality message my company will attach at the bottom: I made sure to put nothing confidential in this message.
Thanks!
Jonathan
Last edit: Simon Sobisch 2021-12-16
Digging further: for each program, the first error reported by cobc was on a line just prior to a COPY .. REPLACING directive. I'll dig into the code, but does anyone already know if COPY REPLACING processing creates an intermediate file with a fixed name?
The suggested strace was a good idea.
But no, COPY REPLACING is only applied during preprocessing, which is done in the files you already know of.
Just guessing here - maybe you want to try with TMPDIR exported to a different place (doesn't make more sense than the error, but also not much less)?
Actually, from glancing over your parallel build code, and with a serial build working fine, the following two options may really help - please recheck and report:
- add --save-temps to your cobc command (and then manually delete the additional files you don't need) - this way TMPDIR is not used by cobc itself
- change your script to do export TMPDIR=/tmp/$$-$PARTITION; mkdir $TMPDIR to hard-separate the temporary files between the builds
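The second option might look like this inside the partition loop (a sketch: PARTITION and the directory scheme come from Simon's suggestion, everything else is illustrative):

```shell
# Give each parallel compile its own private TMPDIR so that cobc scratch
# files from different partitions can never collide.
PARTITION=3                       # loop counter from the build script
export TMPDIR=/tmp/$$-${PARTITION}
mkdir -p "${TMPDIR}"
echo "compiling with TMPDIR=${TMPDIR}"
# ... run cob2 / cobc here with the private TMPDIR in effect ...
# rm -rf "${TMPDIR}"              # clean up when the compile finishes
```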
Also, an strace of the failing parallel builds would be useful; it is likely to show the failing build using the same file as another one.
The strange thing here: the temporary file names are created based on the PID of the running cobc process, and there can be only one process with a given PID at a time... the "_0" part is also incremented for each of the files a single cobc process handles.
Simon
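The strace being suggested could be invoked roughly like this (an illustrative sketch: the flags are standard strace options, but the traced command is a placeholder for the real cobc invocation):

```shell
# -f follows child processes; -e trace=open,openat logs only file-open
# syscalls, so two parallel cobc runs opening the same temporary file
# become easy to spot; -o writes the trace to a file.
if command -v strace >/dev/null 2>&1; then
    if strace -f -e trace=open,openat -o cobc-trace.$$ \
        true; then   # placeholder for: cobc ${MY_COBCFLAGS} -o ${IN1}.out ${IN1}.0
        echo "trace written to cobc-trace.$$"
    else
        echo "strace could not attach (ptrace may be restricted here)"
    fi
else
    echo "strace not installed, skipping"
fi
```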
Simon, what happens when 10-100 programs try to access the same copybook pseudo-concurrently? Does cobc block the copybook processing until it is complete?
I presume the copybook is opened read-only.
Is the copybook file closed after having been copied by the preprocessor?
Does it remain open until the end of the compile process?
Seems as if he has already identified copybook processing as a suspect.
Ralph
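On the read-only question Ralph raises: concurrent read-only opens of one file do not conflict at the filesystem level, so many compilers reading the same copybook at once is safe in itself. A quick self-contained illustration (the copybook content and file names are made up):

```shell
# Several "compilers" reading the same copybook-like file concurrently:
# read-only access needs no exclusion, so every reader sees identical data.
printf '01  WS-FIELD PIC X(8).\n' > demo.cpy
for i in 1 2 3 4; do
    cat demo.cpy > demo-read.$i &
done
wait
cmp -s demo-read.1 demo-read.4 && echo "all readers agree"
```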
The failure is sporadic / intermittent... the worst kind :-(
I ran the translation multiple times on a small sample of programs, using varying degrees of parallelism (2 to 10), and got these results, meaning that in this last run the failures occurred for 5, 7, 8, 9, and 10 ways parallel:
~~~
$ ls -1 Makefile.good.*
Makefile.good.2
Makefile.good.3
Makefile.good.4
Makefile.good.6
~~~
Because the COBOL code belongs to a customer, not me, and I am bound by an NDA, I cannot provide you with the intermediate files, but here is what I can show to confirm / narrow down the problem:
~~~
$ for f in our_tmp2/* ; do diff -q ${f} our_tmp3/ ; done
$ for f in our_tmp2/* ; do diff -q ${f} our_tmp4 ; done
$ for f in our_tmp2/* ; do diff -q ${f} our_tmp5 ; done
Files our_tmp2/PV0810.i and our_tmp5/PV0810.i differ
Files our_tmp2/PV081023.i and our_tmp5/PV081023.i differ
Files our_tmp2/PV081031.i and our_tmp5/PV081031.i differ
~~~
So it seems the failure, whatever it is, occurs in the output of the intermediate files:
Thank you, a zoom session tomorrow (when I'm not in the fog of Flu+Zoster vaccines) would be lovely. I'm in Boston (US Eastern Time), and you are welcome to call me at +1-617-828-4591 to coordinate it. In the meantime, here is the command I was using:
I think Ralph hit the nail on the head... but haven't figured out the fix.
Got strace to work, and I have its output for a failure in 4-way parallel processing:
1. Can you guide me as to what I should be looking for in these traces?
2. Would it still be useful if I mask text values in calls to "write"? They contain a mixture of customer source snippets, and our proprietary syntax representation.
All the best,
Jonathan
Just for grins:
Can you have your preprocessor directly insert the copy / replace member in the source being preprocessed? Perhaps create a copybook with the replace parameters already "replaced".
Is there a suite of programs that you could compile that do not have the replace option of the copy statement?
I speculate that the replace option creates a work file that is not being managed correctly.
Ralph
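Ralph's pre-expansion idea could be tried with a one-line filter ahead of the compile. This is a sketch: the ==()== marker comes from this thread, but the member name, the PFX- prefix, and the single sed substitution are invented; a real script would loop over all the REPLACING pairs.

```shell
# Materialise a copy of the copybook with the REPLACING already applied,
# then COPY the pre-expanded member without any REPLACING clause.
printf '01  ()FIELD PIC X.\n' > MEMBER.cpy       # stand-in copybook
sed 's/()/PFX-/g' MEMBER.cpy > MEMBER_PFX.cpy    # () replaced by prefix
cat MEMBER_PFX.cpy                               # prints: 01  PFX-FIELD PIC X.
```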
Attached is a copy of an strace that captured a failure. I had to replace most customer data with the word "censored" on many lines, as required by NDA -- I hope it will not detract from the usefulness of the trace.
I began to examine the trace with much appreciated help from pottmi. Below are my notes from our examination. BTW, in case it matters, the response to ulimit was "unlimited".
The nearest thing Michael and I found to suspicious activity was around the open("/proc/meminfo" line, which occurs between several read(3 lines.
There are two open("copylib/YV23___BY__pp" lines, because the source program has two COPY REPLACING lines for that copybook, to replace ==()== with valid prefixes in UTF-8 (Japanese) characters. The problem occurred on that copybook, after the second open.
BTW, we were surprised by mmap / munmap use for memory allocation, although this may or may not have anything to do with the problem.
Here is my $0.02 based on looking at the strace output. This is pure speculation, as I am not looking at the source of cobc.
I speculate that the system is hitting some artificial limit on memory usage and fgets (or one of its friends) is returning NULL and setting errno. cobc is treating that as EOF when it should be treated as an error. As it happens, fgets returns NULL both for EOF and for errors; the difference is that the end-of-file indicator is set for a successful EOF, while errno (https://man7.org/linux/man-pages/man3/errno.3.html) is set to indicate an error.
It will take me a while to look into that, but if someone more familiar with the code could take a peek first it might be more efficient.
Try this: run the compile under strace.
That will output all the files that the compiler opened. Then you can look for duplicates. Make a small sample so you are not overwhelmed with output.
Last edit: Simon Sobisch 2021-12-16
Gentlemen, thank you both for your help!
Something is happening in the middle of writing the intermediate files... should I try calling "flush"? Any other suggestions?
On top of it, strace didn't generate output... maybe a typo I fail to see, so I'll have to run this again. Oy!
Again, many thanks!
I will do a zoom with you and try to figure out why strace is not outputting anything. There are flags that need to be set to "Follow Children".
(FWIW, the "-J" option is one I added to trigger our alternate code generator)
You should be able to see the files that were opened by looking at the strace output.
My email is pottmi@gmail.com if you want to zoom and look at the trace output together.