GnuCOBOL: Failures when cobc launched in parallel processes, but not in serial execution

Jonathan Beit-Aharon - 2021-12-15

Hi!
I was trying to speed up the compilation of some 1000 programs by launching up to 8 cobc sessions in parallel, only to encounter failures that do not occur when I compile the same programs serially.

The stdout output looks like:

cobc (GnuCOBOL) 3.1.2.0
Built Dec 14 2021 16:58:39
Packaged Dec 23 2020 12:04:58 UTC
C version "4.8.5 20150623 (Red Hat 4.8.5-36)"
Error: cobc failed to give a parse tree, stopping

The stderr output looks like:

command line: cobc -fsection-segments=warning --verbose -free -ext cpy -I copylib -I copylib2 -std=ibm ABCD10.dir/ABCD10.0
preprocessing: ABCD10.dir/ABCD10.0 -> /tmp/cob25192_0.cob
return status: 0
parsing: /tmp/cob25192_0.cob (ABCD10.dir/ABCD10.0)
ABCD10.dir/ABCD10.0:46: error: PICTURE clause required for 'ABCDE-FGHIJ-KLMN'
ABCD10.dir/ABCD10.0: in section '300-DTL':
ABCD10.dir/ABCD10.0: in paragraph '300-10':
ABCD10.dir/ABCD10.0:217: error: 'ABCDE-FGHIJ-VWXYZ' is not defined
return status: 1

I checked: each program gets a unique /tmp/cob*_0.cob file, and the word "parallel" does not appear in the output of "cobc --help". If it matters, my environment is "3.10.0-957.el7.x86_64 #1 SMP Thu Oct 4 20:48:51 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux", and I used a Korn shell script for the parallel launches.
~~~
$ ksh --version
version sh (AT&T Research) 93u+ 2012-08-01
~~~
Suggestions? Thanks!
Jonathan
- Simon Sobisch - 2021-12-15
  
  When running the testsuite in parallel I commonly use up to 14 cores, and when compiling with GnuCOBOL in production environments I've also seen 32 parallel compiles.
  
  The temporary files could conflict but that is very unlikely.
  
  Where does the "cobc failed to give a parse tree" message come from?
  How did you start the compiles in parallel?
  
  Just a note: I personally would suggest using make for that, as it is quite reliable. You only define "sequentially" what needs to be done, and make can then run it in parallel as you like while checking the dependencies, if any are defined.
  
  - Jonathan Beit-Aharon - 2021-12-15
    
    Thank you, Simon, for your quick response!
    
    I cannot use "make -j" at this point because this phase of my process
    analyzes the COPY and CALL statements in order to build the Makefile.
    
    The message you questioned comes from this Korn shell code snippet in my
    "cob2" script:
    
    cobc ${MY_COBCFLAGS} -o ${IN1}.out ${IN1}.0
    if [ ! -e ${IN1}.out ] ; then
      echo "Error: cobc failed to give a parse tree, stopping"
      exit 17
    fi
    
    The code that submits the parallel runs looks like this:
    
    ls *.cob *.cbl 2>/dev/null | while read f ; do
      PARTITION=${WAYS_PARALLEL}
      while [ ${PARTITION} -gt 0 ]; do
        (j=$(basename ${f} | cut -f1 -d'.');
         if [ ! -s ${j}.deps ]; then
           print "Preparing ${f} in partition: $((${WAYS_PARALLEL}-${PARTITION}))";
           if [ "${f}" != "$(basename ${f} .cob)" ]; then
             (eval "$(cat Makefile.opts ${f}.opts)" ; cob2 ${f} ${charset})
           else
             (eval "$(cat Makefile.cbl.opts ${f}.opts)" ; cob2 ${f} ${charset})
           fi
           if [ ! -s ${j}.deps ]; then
             export HALT=$((${HALT}+1));
             print " Failed to produce ${j}.deps ";
           fi
         fi) &
        PARTITION=$((${PARTITION}-1))
        if [ ${PARTITION} -gt 0 ]; then
          # Get next and handle end of the input list
          read f ;
          if [ -z "${f}" ]; then PARTITION=0; fi
        fi
      done
      wait
    done
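    
    One detail of the loop above: HALT is incremented inside a backgrounded subshell, so the parent shell never sees the new value. A minimal sketch of counting failed jobs in the parent instead, using the exit status that wait returns for each job. The helper name run_batch is hypothetical and not part of the original script.
    
    ```shell
    # run_batch CMD ARG...: start "CMD ARG" in the background once per ARG,
    # then print how many of those jobs exited with a non-zero status.
    # (run_batch is an illustrative name, not from the original script.)
    run_batch() {
      cmd=$1; shift
      pids=""
      for arg in "$@"; do
        "$cmd" "$arg" &
        pids="$pids $!"          # remember each job's PID
      done
      failed=0
      for pid in $pids; do
        # wait PID returns that job's exit status to the parent shell
        wait "$pid" || failed=$((failed + 1))
      done
      echo "$failed"
    }
    ```
    
    The count is computed in the parent shell, so it can be tested after each batch to decide whether to stop.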
    
    I've seen no problems with this code before when running 2 and 4 ways parallel, and I had it reviewed by colleagues, but I am now experiencing random compile failures when not running serially, and these failures all occurred during cobc execution. I was hoping there was a knob for parallel runs, or at least for debugging them.
    
    Finally, please ignore the confidentiality message my company will attach at the bottom: I made sure to put nothing confidential in this message.
    
    Thanks!
    Jonathan
    
    Last edit: Simon Sobisch 2021-12-16
    

Jonathan Beit-Aharon - 2021-12-16

Digging further, for each program the first error reported by cobc was on a line just prior to a COPY .. REPLACING directive. I'll dig in the code, but does anyone already know if COPY REPLACING processing creates an intermediate file with a fixed name?

- pottmi - 2021-12-16
  
  Try this:
  
  strace cobc x.cbl
  
  That will output all the files that the compiler opened. Then you can look for duplicates. Make a small sample so you are not overwhelmed with output.
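  
  The duplicate hunt described above can be scripted. A minimal sketch, assuming one strace log per process (as written by strace -ff -o PREFIX) and considering only files opened for writing. The helper name shared_written_paths is hypothetical.
  
  ```shell
  # shared_written_paths LOG...: print every path that more than one of the
  # given per-process strace logs opened with O_WRONLY or O_RDWR.
  # (shared_written_paths is an illustrative name.)
  # Usage, with PREFIX being the value passed to strace -o:
  #   shared_written_paths PREFIX.*
  shared_written_paths() {
    for log in "$@"; do
      # keep open()/openat() calls that request write access,
      # reduce each line to the quoted path, and list each path once per log
      grep -E '^(open|openat)\(.*O_(WRONLY|RDWR)' "$log" |
        sed -E 's/^[a-z]+\([^"]*"([^"]*)".*/\1/' |
        sort -u
    done | sort | uniq -d      # paths that appear in two or more logs
  }
  ```
  
  Shared devices such as /dev/null may show up in the result; a temporary or intermediate file in the list would be the interesting finding.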
  
  Last edit: Simon Sobisch 2021-12-16
  
- Simon Sobisch - 2021-12-16
  
  The suggested strace was a good idea.
  But no, COPY REPLACING is applied only during preprocessing, which is done in the files you already know of.
  
  Just guessing here - maybe you want to try with export TMPDIR set to a different place (doesn't make more sense than the error, but also not much less)?
  
  - Simon Sobisch - 2021-12-16
    
    Actually from glancing over your parallel build code, and with a serial build working fine, the following two options may really help - please recheck and report:
    
    1. Add --save-temps to your cobc command (and then manually delete the additional files you don't need); this way TMPDIR is not used by cobc itself.
    2. Change your script to do export TMPDIR=/tmp/$$-$PARTITION; mkdir $TMPDIR to hard-separate the temporary files between the builds.
    
    An strace of the failing parallel builds would also help; it is likely to show the failing build using the same file as another one.
    
    The strange thing here: the temporary file names are created based on the PID of the running cobc process, and there can be only one process with a given PID at a time. The "_0" part is also incremented for each of the files a single cobc process handles.
    
    Simon
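    
    The TMPDIR separation suggested above can be sketched as follows. This is a minimal sketch, not the script from this thread: run_isolated is a hypothetical helper, and it uses mktemp -d instead of /tmp/$$-$PARTITION so that two jobs cannot end up with the same directory name.
    
    ```shell
    # run_isolated CMD ARG...: run CMD with TMPDIR pointing at a freshly
    # created private directory, remove the directory afterwards, and
    # return CMD's exit status. (run_isolated is an illustrative name.)
    run_isolated() {
      job_tmp=$(mktemp -d "${TMPDIR:-/tmp}/cobjob.XXXXXX") || return 1
      status=0
      # the subshell keeps the TMPDIR change local to this one command
      ( TMPDIR=$job_tmp; export TMPDIR; "$@" ) || status=$?
      rm -rf "$job_tmp"
      return $status
    }
    ```
    
    In the parallel loop, each backgrounded job would then call something like run_isolated cob2 ${f} ${charset}.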
    
  - Ralph Linkletter - 2021-12-16
    
    Simon, what happens when 10 - 100 programs try to access the same copybook pseudo-concurrently? Does the thread block the copybook file processing until it is complete?
    I presume the copybook is opened read only.
    Is the copybook file closed after having been copied by the preprocessor?
    Does it remain open until the end of the compile process?
    Seems as if he has already identified copybook processing as a suspect.
    Ralph
    

Jonathan Beit-Aharon - 2021-12-16

Gentlemen, thank you both for your help!

The failure is sporadic / intermittent... the worst kind :-(

I ran the translation multiple times on a small sample of programs, using varying degrees of parallelism (2 to 10). The results below show that in the last run the failures occurred for 5, 7, 8, 9, and 10 ways parallel:

$ ls -1 Makefile.good.*
Makefile.good.2
Makefile.good.3
Makefile.good.4
Makefile.good.6

Because the COBOL code belongs to a customer, not me, and I am bound by an NDA, I cannot provide you with the intermediate files, but here is what I can show to confirm / narrow down the problem:

$ for f in our_tmp2/*; do diff -q ${f} our_tmp3/ ; done
$ for f in our_tmp2/*; do diff -q ${f} our_tmp4 ; done
$ for f in our_tmp2/*; do diff -q ${f} our_tmp5 ; done
Files our_tmp2/PV0810.i and our_tmp5/PV0810.i differ
Files our_tmp2/PV081023.i and our_tmp5/PV081023.i differ
Files our_tmp2/PV081031.i and our_tmp5/PV081031.i differ

So it seems the failure, whatever it is, occurs in the output of the intermediate files:

$ wc -l our_tmp2/PV0810.i our_tmp5/PV0810.i
15487 our_tmp2/PV0810.i
15476 our_tmp5/PV0810.i
30963 total
$ wc -l our_tmp2/PV081023.i our_tmp5/PV081023.i
30048 our_tmp2/PV081023.i
23146 our_tmp5/PV081023.i
53194 total
$ wc -l our_tmp2/PV081031.i our_tmp5/PV081031.i
23659 our_tmp2/PV081031.i
13519 our_tmp5/PV081031.i
37178 total
$ diff our_tmp2/PV0810.i our_tmp5/PV0810.i |grep ^[1-9]
11626,11637c11626
ec-cbldev1 ~/yard/2021/jbtests/parallel/cobol
$ diff our_tmp2/PV081023.i our_tmp5/PV081023.i |grep ^[1-9]
9451,15811c9451,9457
15813c9459
15815c9461
15817c9463
15819c9465
15821,15828c9467
15830,15890c9469
15892,16191c9471
16193c9473
16195c9475
16197,16269c9477
16271,16289c9479
16291,16335c9481
16337,16361c9483
16363,16384c9485
16386,16400c9487
16401a9489,9499
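
The comparison above can be scripted for any pair of runs. A minimal sketch, assuming the layout shown above (one directory of intermediate files per degree of parallelism); compare_runs and line_count are hypothetical helper names. For each file that differs, it prints the file name and its line count in both runs.

```shell
# line_count FILE: number of lines in FILE, or "missing" if it does not exist
line_count() {
  if [ -f "$1" ]; then wc -l < "$1" | tr -d ' '; else echo missing; fi
}

# compare_runs GOOD_DIR SUSPECT_DIR: for every file in GOOD_DIR whose
# counterpart in SUSPECT_DIR differs or is absent, print
# "NAME GOOD_LINES SUSPECT_LINES". (Both names are illustrative.)
compare_runs() {
  good=$1; suspect=$2
  for f in "$good"/*; do
    name=$(basename "$f")
    if ! cmp -s "$f" "$suspect/$name"; then
      printf '%s %s %s\n' "$name" "$(line_count "$f")" "$(line_count "$suspect/$name")"
    fi
  done
}
```

For the directories in this post, the call would be compare_runs our_tmp2 our_tmp5.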

Something is happening in the middle of their output... should I try calling "flush"? Any other suggestions?

On top of it, strace didn't generate output... maybe a typo I fail to see, so I'll have to run this again. Oy!

Again, many thanks!
- pottmi - 2021-12-16
  
  I will do a zoom with you and try to figure out why strace is not
  outputting anything. There are flags that need to be set to "Follow
  Children".
  

Jonathan Beit-Aharon - 2021-12-16

Thank you, a zoom session tomorrow (when I'm not in the fog of Flu+Zoster vaccines) would be lovely. I'm in Boston (US Eastern Time), and you are welcome to call me at +1-617-828-4591 to coordinate it. In the meantime, here is the command I was using:

strace -ff -o /u/jbeit-aharon/yard/2021/jbtests/parallel/cobol/our_tmp10/strace_10_ cobc -J

(FWIW, the "-J" option is one I added to trigger our alternate code generator)

Jonathan Beit-Aharon - 2021-12-17

I think Ralph hit the nail on the head... but haven't figured out the fix.

Got strace to work, and have its output for a failure in 4 way parallel processing:
1. Can you guide me as to what I should be looking for in these traces?
2. Would it still be useful if I mask text values in calls to "write"? They contain a mixture of customer source snippets, and our proprietary syntax representation.

All the best,
Jonathan

- Ralph Linkletter - 2021-12-17
  
  Just for grins:
  Can you have your preprocessor directly insert the copy / replace member in the source being preprocessed? Perhaps create a copybook with the replace parameters already "replaced".
  Is there a suite of programs that you could compile that do not have the replace option of the copy statement?
  I speculate that the replace option creates a work file that is not being managed correctly.
  Ralph
  
  - pottmi - 2021-12-17
    
    You should be able to see the files that were opened by looking at the
    strace output.
    
    my email is pottmi@gmail.com if you want to zoom and look at the trace
    output together.
    

Jonathan Beit-Aharon - 2021-12-21

Attached is a copy of an strace that captured a failure. I had to replace most customer data with the word "censored" on many lines, as required by NDA -- I hope it will not detract from the usefulness of the trace.

I began to examine the trace with much appreciated help from pottmi. Below are my notes from our examination. BTW, in case it matters, the response to ulimit was "unlimited".

The nearest thing Michael and I found to suspicious activity was around the open("/proc/meminfo" line, which occurs between several read(3 lines.

There are two open("copylib/YV23___BY__pp" lines, because the source program has two COPY REPLACING lines for that copybook, to replace ==()== with valid prefixes in UTF-8 (Japanese) characters. The problem occurred on that copybook, after the second open.

BTW, we were surprised by mmap / munmap use for memory allocation, although this may or may not have anything to do with the problem.

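Attachment: strace.censored https://sourceforge.net/p/gnucobol/discussion/cobol/thread/c78ce37697/1563/attachment/strace.censored (6.0 MB; application/octet-stream)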

- pottmi - 2021-12-21
  
  Here is my $0.02 based on looking at the strace output.
  
  This is pure speculation as I am not looking at the source of cobc.
  
  I speculate that the system is hitting some artificial limit on memory
  usage and fgets (or one of its friends) is returning null and setting
  errno. cobc is treating that as EOF when it should be treated as an
  error. As it happens, fgets returns null both for EOF and for errors. The
  difference is that the end-of-file indicator is set for a genuine EOF,
  while the error indicator and errno are set for an error.
  
  It will take me a while to look into that, but if someone more
  familiar with the code could take a peek first it might be more efficient.
  
  Upon successful completion, *fgets*() shall return *s*. If the stream is at end-of-file, the end-of-file indicator for the stream shall be set and *fgets*() shall return a null pointer. If a read error occurs, the error indicator for the stream shall be set, *fgets*() shall return a null pointer, and shall set errno (https://man7.org/linux/man-pages/man3/errno.3.html) to indicate the error.
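  
  One way to check this hypothesis against a trace: strace marks a failed system call with a return value of -1 followed by the errno name, so failures (for example an ENOMEM near the read(3 calls) can be tallied with standard tools. A minimal sketch, assuming logs written by strace -o or strace -ff -o; failed_syscalls is a hypothetical helper name.
  
  ```shell
  # failed_syscalls LOG...: print "COUNT SYSCALL ERRNO" for every kind of
  # failed system call found in the given strace logs.
  # (failed_syscalls is an illustrative name.)
  failed_syscalls() {
    # keep lines whose result is "-1 ENAME", reduce them to "syscall ENAME"
    # (an optional leading PID column is dropped), then count duplicates
    grep -hE ' = -1 E[A-Z0-9]+' "$@" |
      sed -E 's/^([0-9]+ +)?([a-z_0-9]+)\(.* = -1 (E[A-Z0-9]+).*/\2 \3/' |
      sort | uniq -c | awk '{print $1, $2, $3}'
  }
  ```
  
  Many ENOENT results from open are normal while search paths are probed; memory-related or I/O errors would be the ones worth a closer look.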
  
