Failures when cobc launched in parallel processes, but not in serial execution

GnuCOBOL
2021-12-15
2021-12-21
  • Jonathan Beit-Aharon

    Hi!
    I was trying to speed up the compilation of some 1000 programs by launching up to 8 cobc sessions in parallel, only to encounter failures that do not occur when I compile the same programs serially.

    The stdout output looks like:

    cobc (GnuCOBOL) 3.1.2.0
    Built     Dec 14 2021 16:58:39  Packaged  Dec 23 2020 12:04:58 UTC
    C version "4.8.5 20150623 (Red Hat 4.8.5-36)"
    Error: cobc failed to give a parse tree, stopping
    

    The stderr output looks like:

    command line:   cobc -fsection-segments=warning --verbose -free -ext cpy -I copylib -I copylib2 -std=ibm ABCD10.dir/ABCD10.0
    preprocessing:  ABCD10.dir/ABCD10.0 -> /tmp/cob25192_0.cob
    return status:  0
    parsing:        /tmp/cob25192_0.cob (ABCD10.dir/ABCD10.0)
    ABCD10.dir/ABCD10.0:46: error: PICTURE clause required for 'ABCDE-FGHIJ-KLMN'
    ABCD10.dir/ABCD10.0: in section '300-DTL':
    ABCD10.dir/ABCD10.0: in paragraph '300-10':
    ABCD10.dir/ABCD10.0:217: error: 'ABCDE-FGHIJ-VWXYZ' is not defined
    return status:  1
    

    I checked: each program gets a unique /tmp/cob*_0.cob file, and the word "parallel" does not appear in the output of "cobc --help". If it matters, my environment is "3.10.0-957.el7.x86_64 #1 SMP Thu Oct 4 20:48:51 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux", and I used a Korn shell script for the parallel launches.
    ~~~
    $ ksh --version
    version sh (AT&T Research) 93u+ 2012-08-01
    ~~~
    Suggestions? Thanks!
    Jonathan

     
    • Simon Sobisch

      Simon Sobisch - 2021-12-15

      When running the testsuite in parallel I commonly use up to 14 cores, and when compiling with GnuCOBOL in production environments I've also seen 32 parallel compiles.

      The temporary files could conflict but that is very unlikely.

      Where does the "cobc failed to give a parse tree" message come from?
      How did you start the compiles in parallel?

      Just a note: I personally would suggest using make for that as it is quite reliable and while you only define "sequentially" what needs to be done you can run in parallel as you like with make checking the dependencies, if any are defined.

       
      • Jonathan Beit-Aharon

        Thank you, Simon, for your quick response!

        I cannot use "make -j" at this point because this phase of my process
        analyzes the COPY and CALL statements in order to build the Makefile.

        The message you questioned comes from this Korn shell code snippet in my
        "cob2" script:

        cobc ${MY_COBCFLAGS} -o ${IN1}.out ${IN1}.0
        if [ ! -e ${IN1}.out ] ; then
          echo "Error: cobc failed to give a parse tree, stopping"
          exit 17
        fi
        

        The code that submits the parallel runs looks like this:

        ls *.cob *.cbl 2>/dev/null | while read f ; do
          PARTITION=${WAYS_PARALLEL}
          while [ ${PARTITION} -gt 0 ]; do
            (j=$(basename ${f} | cut -f1 -d'.');
             if [ ! -s ${j}.deps ]; then
               print "Preparing ${f} in partition: $((${WAYS_PARALLEL}-${PARTITION}))";
               if [ "${f}" != "$(basename ${f} .cob)" ]; then
                 (eval "$(cat Makefile.opts ${f}.opts)" ; cob2 ${f} ${charset})
               else
                 (eval "$(cat Makefile.cbl.opts ${f}.opts)" ; cob2 ${f} ${charset})
               fi
               if [ ! -s ${j}.deps ]; then
                 export HALT=$((${HALT}+1));
                 print " Failed to produce ${j}.deps ";
               fi
             fi) &
            PARTITION=$((${PARTITION}-1))
            if  [ ${PARTITION} -gt 0 ]; then    # Get next and handle end of the input list
               read f ;
               if [ -z "${f}" ]; then PARTITION=0; fi
            fi
          done
          wait
        done
        

        I've seen no problems with this code before when running 2 and 4 ways parallel, and colleagues have reviewed it, but I am now experiencing random compile failures when not running serially, and these failures all occurred during cobc execution. I was hoping there was a knob for parallel runs, or at least for debugging them.
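
A side note on the launcher above, not a diagnosis of the cobc failures: the `export HALT=...` line runs inside `( ... ) &`, which is a subshell, so the parent shell never sees the incremented value. Below is a minimal sketch, assuming plain POSIX sh, of one way to count failed jobs in the parent by collecting exit statuses with `wait`. The names `run_batch`, `fake_compile` and `BATCH_FAILED` are hypothetical and are not part of cob2 or cobc.

```shell
# Hypothetical helper (not part of cob2 or cobc): run "$cmd FILE" for each
# FILE, at most max_jobs at a time, and count failures in the parent shell.
run_batch() {
  max_jobs=$1; cmd=$2; shift 2
  failed=0; running=0; pids=""
  for f in "$@"; do
    "$cmd" "$f" &                 # each job runs in its own background subshell
    pids="$pids $!"
    running=$((running + 1))
    if [ "$running" -ge "$max_jobs" ]; then
      # "wait PID" hands that job's exit status back to the parent
      for p in $pids; do wait "$p" || failed=$((failed + 1)); done
      pids=""; running=0
    fi
  done
  # collect whatever is left from the last, possibly partial, group
  for p in $pids; do wait "$p" || failed=$((failed + 1)); done
  BATCH_FAILED=$failed            # assigned in the parent, so it survives
}

# Stand-in for the real compile step, for demonstration only.
fake_compile() { [ "$1" != "bad.cbl" ]; }

run_batch 2 fake_compile a.cbl bad.cbl c.cbl
echo "failed jobs: $BATCH_FAILED"
```

In a real script the stand-in would be replaced by whatever wraps the `cob2` call; the point is only that the failure count is updated by the parent after `wait`, not by the background job.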

        Finally, please ignore the confidentiality message my company will attach at the bottom: I made sure to put nothing confidential in this message.

        Thanks!
        Jonathan

         

        Last edit: Simon Sobisch 2021-12-16
  • Jonathan Beit-Aharon

    Digging further, for each program the first error reported by cobc was on a line just prior to a COPY .. REPLACING directive. I'll dig in the code, but does anyone already know if COPY REPLACING processing creates an intermediate file with a fixed name?

     
    • pottmi

      pottmi - 2021-12-16

      Try this:

      strace cobc x.cbl
      

      That will output all the files that the compiler opened. Then you can look for duplicates. Make a small sample so you are not overwhelmed with output.
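
Looking for duplicates by eye gets tedious once several processes are involved, so here is a minimal sketch of how the search could be automated. It assumes the traces were written one file per process (for example with `strace -ff -o PREFIX ...`, which produces `PREFIX.PID` files) and that open calls appear in the usual `open("path", FLAGS...)` or `openat(AT_FDCWD, "path", FLAGS...)` form. The function name `shared_writes` is hypothetical.

```shell
# Hypothetical helper: print every path that more than one trace file
# opened with write access (O_WRONLY or O_RDWR).
shared_writes() {
  for t in "$@"; do
    # keep open/openat calls that ask for write access,
    # extract the first quoted string (the path), once per trace file
    grep -E '(open|openat)\(' "$t" |
      grep -E 'O_WRONLY|O_RDWR' |
      sed -E 's/^[^"]*"([^"]*)".*/\1/' |
      sort -u
  done | sort | uniq -d           # paths seen in two or more trace files
}

# Usage (paths are placeholders):
#   shared_writes traces/job1.* traces/job2.*
```

Some hits are expected and harmless, for instance `/dev/null` or files handed from cobc to the C compiler it spawns; the interesting ones are temporary or intermediate files shared between two different compiles.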

       

      Last edit: Simon Sobisch 2021-12-16
    • Simon Sobisch

      Simon Sobisch - 2021-12-16

      The suggested strace was a good idea.
      But no, COPY REPLACING is only applied to the preprocessing which is done in the files you already know of.

      Just guessing here - maybe you want to try with export TMPDIR set to a different place (doesn't make more sense than the error, but also not much less)?

       
      • Simon Sobisch

        Simon Sobisch - 2021-12-16

        Actually from glancing over your parallel build code, and with a serial build working fine, the following two options may really help - please recheck and report:

        • add --save-temps to your cobc command (and then manually delete the additional files you don't need); this way TMPDIR is not used by cobc itself
        • change your script to do export TMPDIR=/tmp/$$-$PARTITION; mkdir $TMPDIR to hard-separate the temporary files between the builds
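
The second option could be sketched roughly as follows. This is only a sketch under assumptions: `with_private_tmp` is a hypothetical wrapper name, `mktemp -d` stands in for the `/tmp/$$-$PARTITION` naming (inside a `( ... ) &` subshell `$$` is still the parent shell's PID, which is why `$PARTITION` is needed there), and removing the directory afterwards is a guess at the desired cleanup policy.

```shell
# Hypothetical wrapper: run a command with its own freshly created TMPDIR,
# then remove that directory and pass the command's exit status through.
with_private_tmp() {
  job_tmp=$(mktemp -d "${TMPDIR:-/tmp}/cobjob.XXXXXX") || return 1
  # the subshell keeps the TMPDIR change local to this one command
  ( TMPDIR=$job_tmp; export TMPDIR; "$@" )
  status=$?
  rm -rf "$job_tmp"
  return "$status"
}

# Usage inside the launcher (placeholder arguments):
#   with_private_tmp cob2 "${f}" "${charset}" &
```

Because each background job runs the wrapper in its own subshell, the `job_tmp` and `status` variables of one job cannot be seen by another.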

        Also an strace in the failing parallel builds would be fine, it is likely that they will show the failing build to use the same file as another one.

        The strange thing here: the temporary file names are created based on the PID of the running cobc process, and there can be only one with the same PID... the "_0" part is also incremented for each of the files a single cobc process handles.

        Simon

         
      • Ralph Linkletter

        Simon, what happens when 10 to 100 programs try to access the same copybook pseudo-concurrently? Does the thread block the copybook file processing until it is complete?
        I presume the copybook is opened read-only.
        Is the copybook file closed after having been copied by the preprocessor?
        Does it remain open until the end of the compile process?
        It seems as if he has already identified copybook processing as a suspect.
        Ralph

         
  • Jonathan Beit-Aharon

    Gentlemen, thank you both for your help!

    The failure is sporadic / intermittent... the worst kind :-(

    I ran the translation multiple times on a small sample of programs, using varying degrees of parallelism (2 to 10), and got the results below. In this last run the failures occurred for 5, 7, 8, 9, and 10 ways parallel:

    $ ls -1 Makefile.good.*
    Makefile.good.2
    Makefile.good.3
    Makefile.good.4
    Makefile.good.6
    

    Because the COBOL code belongs to a customer, not me, and I am bound by an NDA, I cannot provide you with the intermediate files, but here is what I can show to confirm / narrow down the problem:

    $ for f in our_tmp2/*; do diff -q ${f} our_tmp3/ ; done
    $ for f in our_tmp2/*; do diff -q ${f} our_tmp4 ; done
    $ for f in our_tmp2/*; do diff -q ${f} our_tmp5 ; done
    Files our_tmp2/PV0810.i and our_tmp5/PV0810.i differ
    Files our_tmp2/PV081023.i and our_tmp5/PV081023.i differ
    Files our_tmp2/PV081031.i and our_tmp5/PV081031.i differ
    

    So it seems the failure, whatever it is, occurs in the output of the intermediate files:

    $ wc -l our_tmp2/PV0810.i our_tmp5/PV0810.i
     15487 our_tmp2/PV0810.i
     15476 our_tmp5/PV0810.i
     30963 total
    $ wc -l our_tmp2/PV081023.i our_tmp5/PV081023.i
      30048 our_tmp2/PV081023.i
      23146 our_tmp5/PV081023.i
      53194 total
    $ wc -l our_tmp2/PV081031.i our_tmp5/PV081031.i
      23659 our_tmp2/PV081031.i
      13519 our_tmp5/PV081031.i
      37178 total
    $ diff our_tmp2/PV0810.i our_tmp5/PV0810.i |grep ^[1-9]
    11626,11637c11626
    ec-cbldev1 ~/yard/2021/jbtests/parallel/cobol
    $ diff our_tmp2/PV081023.i our_tmp5/PV081023.i  |grep ^[1-9]
    9451,15811c9451,9457
    15813c9459
    15815c9461
    15817c9463
    15819c9465
    15821,15828c9467
    15830,15890c9469
    15892,16191c9471
    16193c9473
    16195c9475
    16197,16269c9477
    16271,16289c9479
    16291,16335c9481
    16337,16361c9483
    16363,16384c9485
    16386,16400c9487
    16401a9489,9499
    

    Something is happening in the middle of their output... should I try calling "flush"? Any other suggestions?

    On top of it, strace didn't generate output... maybe a typo I fail to see, so I'll have to run this again. Oy!

    Again, many thanks!

     
    • pottmi

      pottmi - 2021-12-16

      I will do a zoom with you and try to figure out why strace is not
      outputting anything. There are flags that need to be set to "Follow
      Children".

       
  • Jonathan Beit-Aharon

    Thank you, a zoom session tomorrow (when I'm not in the fog of Flu+Zoster vaccines) would be lovely. I'm in Boston (US Eastern Time), and you are welcome to call me at +1-617-828-4591 to coordinate it. In the meantime, here is the command I was using:

    strace -ff -o /u/jbeit-aharon/yard/2021/jbtests/parallel/cobol/our_tmp10/strace_10_ cobc -J 
    

    (FWIW, the "-J" option is one I added to trigger our alternate code generator)

     
  • Jonathan Beit-Aharon

    I think Ralph hit the nail on the head... but haven't figured out the fix.

    Got strace to work, and have its output for a failure in 4 way parallel processing:
    1. Can you guide me as to what I should be looking for in these traces?
    2. Would it still be useful if I mask text values in calls to "write"? They contain a mixture of customer source snippets, and our proprietary syntax representation.

    All the best,
    Jonathan

     
    • Ralph Linkletter

      Just for grins:
      Can you have your preprocessor directly insert the copy / replace member in the source being preprocessed? Perhaps create a copybook with the replace parameters already "replaced".
      Is there a suite of programs that you could compile that do not use the replace option of the copy statement?
      I speculate that the replace option creates a work file that is not being managed correctly.
      Ralph

       
      • pottmi

        pottmi - 2021-12-17

        You should be able to see the files that were opened by looking at the
        strace output.

        my email is pottmi@gmail.com if you want to zoom and look at the trace
        output together.

         
  • Jonathan Beit-Aharon

    Attached is a copy of an strace that captured a failure. I had to replace most customer data with the word "censored" on many lines, as required by NDA -- I hope it will not detract from the usefulness of the trace.

    I began to examine the trace with much appreciated help from pottmi. Below are my notes from our examination. BTW, in case it matters, the response to ulimit was "unlimited".

    The nearest thing Michael and I found to suspicious activity was around the open("/proc/meminfo" line, which occurs between several read(3 lines.

    There are two open("copylib/YV23___BY__pp" lines, because the source program has two COPY REPLACING lines for that copybook, to replace ==()== with valid prefixes in UTF8 (Japanese) characters. The problem occurred on that copybook, after the second open.

    BTW, we were surprised by mmap / munmap use for memory allocation, although this may or may not have anything to do with the problem.

     
    • pottmi

      pottmi - 2021-12-21

      Here is my $0.02 based on looking at the strace output.

      This is pure speculation as I am not looking at the source of cobc.

      I speculate that the system is hitting some artificial limit on memory
      usage and fgets (or one of its friends) is returning null and setting
      errno. cobc is treating that as EOF when it should be treated as an
      error. As it happens, fgets returns null both for EOF and for errors. The
      difference is that the end-of-file indicator is set for a genuine EOF, while
      the error indicator and errno are set for an error.

      It will take me a while to look into that, but if someone more
      familiar with the code could take a peek first it might be more efficient.

         Upon successful completion, fgets() shall return s. If the
         stream is at end-of-file, the end-of-file indicator for the
         stream shall be set and fgets() shall return a null pointer. If
         a read error occurs, the error indicator for the stream shall be
         set, fgets() shall return a null pointer, and shall set
         errno (https://man7.org/linux/man-pages/man3/errno.3.html) to
         indicate the error.

       
