
#597 Add COBOL to function list

Milestone: Next_major_release
Status: open
Owner: nobody
Labels: None
Priority: 5
Updated: 2014-12-19
Created: 2014-09-02
Private: No

Hi, these are the necessary changes to function.xml to support COBOL (sections and paragraphs are included). There are some regressions (included as xml comments below), but it's much better than the current "nothing" implementation.

...
            <association langID="50" id="cobol_section"/>
...

            <parser id="cobol_section" displayName="COBOL">
                <function
                    mainExpr="^.{6}[\sD]\s{0,3}[A-Za-z0-9_-]{1,}(\s{1,}section\s*||\s*)\."
                    displayMode="$functionName">
                    <!-- Variant for COBOL free-form reference format
                         (it's only able to parse sections but not
                         paragraphs, because of missing areas)
                    mainExpr="[A-Za-z0-9_-]*\s*section\s*\."
                    -->
                    <functionName>
                        <nameExpr expr="[A-Za-z0-9_-]*(\s*(section)){0,1}\."/>
                    </functionName>
                </function>
            </parser>

With this sample:

       TEST1 SECTION.
       TEST2 SECTION PAR1.
       PAR2.
      *comment.
       TEST3
          SECTION
          PAR1.
          exit section.
       exit-prog section.

sections and paragraphs are shown as you can see in the attachment.

This works and can be added as it is. Optional points for improving the function list:

  • collapse multiple spaces in the function name, for example show "TEST3 SECTION PAR1." instead of the current name, which keeps every line break and indentation space between the words (is this change possible via <functionName>, and if yes: how?)
  • stop "exit section." from being shown as a section (it is a statement that leaves the section, not a new section name). I've added "(?!(exit))" to the mainExpr attribute without any effect, and to the expr attribute of nameExpr, where it only causes the leading "e" of exit to be dropped (if anybody can explain this to me then I'd be smarter than before)
  • inline comments "*>" are not skipped yet. One can add the attribute commentExpr="(^.{0,5}*>.*?$)", which skips commented lines, but the line BEFORE is skipped, too (place a comment before TEST2 and both TEST1 and TEST2 will be missing). An explanation of this would be nice, too :-)
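
One possible explanation for the missing leading "e" in the second point, offered as an assumption about how regex searches generally retry and not verified against the Boost engine used by Notepad++: a negative lookahead only rejects the single position where "exit" starts, so an unanchored search retries one character later and matches "xit". The sketch below reproduces this with Python's `re` module. The `guarded` pattern is a hypothetical fix for illustration, not part of the proposed patch.

```python
import re

# Unanchored: the lookahead blocks position 0 ("exit"), so the engine
# retries at position 1 and matches "xit".
naive = re.compile(r"(?!exit)[\w-]+(?=\s+section)", re.IGNORECASE)

# Hypothetical fix: a name may only start where no name character precedes
# it, so the search cannot slide into the middle of "exit".
guarded = re.compile(r"(?<![\w-])(?!exit\s)[\w-]+(?=\s+section)", re.IGNORECASE)

def first_name(pattern, line):
    """Return the first match of pattern in line, or None."""
    m = pattern.search(line)
    return m.group(0) if m else None

print(first_name(naive, "exit section."))         # xit
print(first_name(guarded, "exit section."))       # None
print(first_name(guarded, "exit-prog section."))  # exit-prog
print(first_name(guarded, "TEST1 SECTION."))      # TEST1
```

Using `(?!exit\s)` instead of `(?!exit)` keeps names such as "exit-prog" visible while still rejecting the "exit section" statement.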

Simon

1 Attachments

Related

Discussion: Function List for COBOL
Patches: #548

Discussion

  • Simon Sobisch

    Simon Sobisch - 2014-09-02

    Test prog attached, too

     
  • Menno Vogels

    Menno Vogels - 2014-09-02
    1. Not possible with current FunctionList implementation.
    2. Try this:

      <parser 
          id="cobol_section" displayName="COBOL"
          commentExpr="(?m-s)(?:^[\d\t ]{6}\*|\*&gt;).*$">
        <function
            mainExpr="(?s-m)[\d\t ]{6}[D ][\t ]{0,3}(?!exit)[\w\-]+\s+section(?:\s+\w+)?\s*?\."
            displayMode="$functionName">
          <functionName>
            <nameExpr expr="(?!exit)[\w\-]+(?=\s+section)"/>
         </functionName>
        </function>
      </parser>
      
    3. See 2. However, current FunctionList implementation has a problem with comment boundaries. Single-line comments should be preceded by an empty line and inline comments should be preceded by at least 2 spaces.
      e.g.

       PROCEDURE DIVISION.
      
      *-----------------------------------------------------------------
       TEST1 SECTION.  *> inline comment
      
     

    Last edit: Menno Vogels 2014-09-02
  • Simon Sobisch

    Simon Sobisch - 2014-09-03

    Hi Menno,

    big thanks for your post.

    To 3. commentExpr="(?m-s)(?:^[\d\t ]{6}\*|\*>).*$" works fine with the restriction you've mentioned [which is why it shouldn't be used yet, especially as COBOLers don't use many empty lines or add a lot of spaces].

    But I do not understand why the working part works - shouldn't [\d\t ] only match digits, tabs and spaces? a-zA-Z are matched too - as it must be in this case.

    If you're sure that the current FunctionList implementation has a problem with comment boundaries please open a bug ticket for that (I didn't find any).

    To 1. Does it make sense to open a feature request ticket for that (it is related to all parsers of Function List)? Something like search and replace in displayed function names would take care of everything I can think of; the currently unused displayMode could be used for this.

    To 2. I see the idea. The version you've posted removes paragraphs from the result and doesn't support newlines before/after section, I've changed this and will post a proposed patch tomorrow.

    Here is the already working part for free-form reference format

                <!-- Variant for COBOL free-form reference format -->
                <parser id="cobol_section_free" displayName="COBOL free-form reference format">
                    <!-- working comment Expression:
                             commentExpr="(?m-s)(?:\*&gt;).*$"
                         cannot be used because problems with comment boundaries
                         in current FunctionList implementation, for details see
                         https://sourceforge.net/p/notepad-plus/patches/597/
                    -->
                    <!-- Variant with paragraphs (don't work with comment lines before section/paragraph header, see above) -->
                    <!--
                    <function
                        mainExpr="(?<=\.)\s*(?!exit\s)[\w_-]+(\s+section(\s*|(\s+[\w_-]+)?))?(?=\.)"
                        displayMode="$functionName">
                        <functionName>
                            <nameExpr expr="(?<=[\s\.])[\w_-]+(\s*section\s*([\w_-]+)?)?"/>
                        </functionName>
                    </function>
                    -->
                    <!-- Variant without paragraphs (works with comment lines before section header) -->
                    <function
                        mainExpr="[\s\.](?!exit\s)[\w_-]+\s+section(\s*|(\s+[\w_-]+)?)(?=\.)"
                        displayMode="$functionName">
                        <functionName>
                            <nameExpr expr="[\w_-]+\s*section"/>
                        </functionName>
                    </function>
                </parser>
    

    Simon

     
    • Menno Vogels

      Menno Vogels - 2014-09-04

      Hi Simon,

      I'm not a COBOLer myself so please excuse me if I didn't get the syntax right.

      To 3. What do you mean by 'working part'?
      Which text should [A-Za-z] match as well?
      I did open a patch ticket (#548) which includes a solution for the comment boundaries problem.

      To 1. A feature request ticket makes sense. I don't know what Don et al. intended with the 'displayMode' attribute, but it would be nice to somehow be able to 'clean up' the function name. My current implementation removes comment zones and changes white-space characters to a single space.

       
      • Simon Sobisch

        Simon Sobisch - 2014-09-04

        No problem :-) I've only used simple regex before (at least compared to the now suggested COBOL parser). We all have our strengths and we all can learn something from time to time.
        This leads me to 3: you're right, 'working part' was confusing. I don't understand why your regex "(?m-s)(?:^[\d\t ]{6}\*|\*>).*$" works: it matches

        aBzT  
        

        too, but [\d\t ] should only match 0-9 (plus tab and space), not a-zA-Z, and the modifiers don't change this.

        What piece am I missing here?

        To 1: As posted below this wouldn't be necessary any more once multiple ' ' are replaced by one and the comments are cut.
        Therefore it's more of a nice-to-have, and I want to post the feature request after testing the results of the comment boundary fix.

        Simon

         

        Last edit: Simon Sobisch 2014-09-04
        • Menno Vogels

          Menno Vogels - 2014-09-05

          The expression

          "(?m-s)(?:^[\d\t ]{6}\*|\*&gt;).*$"
          

          filters out comments and thus should not match 'aBzT'. It should not be visible in the FunctionList tree view at least.
          It should match:

          000000* aBzT
          

          or

            *> aBzT
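
A quick way to check the three cases above outside Notepad++ is a small Python sketch. It assumes that Boost's `(?m-s)` corresponds to Python's `re.MULTILINE` without `re.DOTALL` (Python has no global `(?m-s)` switch), and `&gt;` is written as a plain `>` because the expression is no longer inside an XML attribute.

```python
import re

# "^" and "$" work per line, "." stops at newlines (assumed equivalent of (?m-s)).
COMMENT = re.compile(r"(?:^[\d\t ]{6}\*|\*>).*$", re.MULTILINE)

def comment_part(line):
    """Return the part of the line treated as a comment, or None."""
    m = COMMENT.search(line)
    return m.group(0) if m else None

print(comment_part("000000* aBzT"))  # 000000* aBzT
print(comment_part("  *> aBzT"))     # *> aBzT
print(comment_part("aBzT  * test"))  # None
```

Under these assumptions the expression on its own does not accept letters in the first six columns, so a line like "aBzT  * test" is not treated as a comment by it.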
          

          What FunctionList does:

          1. It uses "commentExpr" to find all the comment zones in the document;
          2. It uses "mainExpr" to find all the function definitions in the document while skipping the comment zones;
          3. It uses "expr" (one or more) to filter out the function names for every match of the "mainExpr" search. That is, the output of the "mainExpr" search is the input for the first "expr" search. For some languages it's easier to define more than one "expr" search to find the function name or to exclude the function parameters.
          4. Additional function name clean up (Patch #548 only):
            * replace each comment zone with one space char i.e. prevent 'intArg1' in case of 'int/*comment*/Arg1';
            * change two or more white-space characters to one space char, i.e. prevent a run of spaces in case of '/*comment*/ /*comment*/' or ' /*comment*/ ';
            * remove leading and trailing spaces;
            * remove the white-space character preceding any parenthesis or comma;
            * remove the white-space character succeeding an opening parenthesis;
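
Steps 1 to 3 can be sketched in Python as follows. This is only an illustration of the described flow, not the FunctionList source: the helper names are hypothetical, case-insensitive matching is an assumption, and the expressions are the simple free-form ones from this thread, since Python's `re` lacks the `(?'name'...)` and `(?&name)` syntax used by the fixed-form variants.

```python
import re

FLAGS = re.MULTILINE | re.IGNORECASE  # case-insensitivity is an assumption

comment_expr = re.compile(r"\*>.*$", FLAGS)
main_expr = re.compile(r"[\s.](?!exit\s)[\w-]+\s+section(\s*|(\s+[\w-]+)?)(?=\.)", FLAGS)
name_expr = re.compile(r"[\w-]+\s*section", FLAGS)

def non_comment_zones(text):
    """Step 1: find the comment zones and yield the gaps between them."""
    pos = 0
    for m in comment_expr.finditer(text):
        if m.start() > pos:
            yield pos, m.start()
        pos = m.end()
    if pos < len(text):
        yield pos, len(text)

def function_names(text):
    names = []
    for start, end in non_comment_zones(text):
        # Step 2: search mainExpr inside a single non-comment zone only.
        for definition in main_expr.finditer(text, start, end):
            # Step 3: nameExpr runs on the text matched by mainExpr.
            name = name_expr.search(definition.group(0))
            if name:
                names.append(name.group(0))
    return names

SAMPLE = (
    "  TEST1 SECTION.\n"
    "*> TEST2 SECTION.\n"
    "  TEST3 SECTION.  *> inline comment\n"
    "    exit section.\n"
)
print(function_names(SAMPLE))  # ['TEST1 SECTION', 'TEST3 SECTION']
```

Searching each gap separately mirrors the rule that a definition must start and end in the same non-comment zone. The commented-out TEST2 line and the "exit section" statement are both left out of the result.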

          Hmm ... I think it's nicer to extend the function name search with a 'replace' attribute for the clean up e.g.

          <nameExpr expr="..." replace="...">
          

          to be able to customize.

          To prevent function declarations in literal strings from being listed one could handle string literals as comment e.g. add

          |(?:(?s-m)&quot;[^&quot;\\]*(?:\\.[^&quot;\\]*)*&quot;)
          

          to the 'commentExpr' of the C++ parser.
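
The string-literal alternative can be tried in isolation with Python's `re`. In this sketch `&quot;` is written as a plain double quote and `(?s-m)` is approximated by `re.DOTALL`; the sample source line is made up for illustration.

```python
import re

# Opening quote, then runs of ordinary characters or backslash escapes,
# then the closing quote.
STRING_LITERAL = re.compile(r'"[^"\\]*(?:\\.[^"\\]*)*"', re.DOTALL)

source = 'puts("int fake(void) { \\"quoted\\" }"); int real(void) { return 0; }'
literal = STRING_LITERAL.search(source).group(0)
print(literal)  # "int fake(void) { \"quoted\" }"
```

Because the whole literal, including its escaped quotes, becomes one comment zone, the `fake` declaration inside it would be skipped while `real` stays visible.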

           

          Last edit: Menno Vogels 2014-10-13
          • Simon Sobisch

            Simon Sobisch - 2014-09-05

            I thought it matched

            aBzT  * test
            

            too (this must be matched, which is why I had .{6}\* which doesn't match newlines because of the modifier).

            Did you already test the COBOL parser with the new function list and the combination of simplified parser + commentExpr?

            Simon

             
  • Simon Sobisch

    Simon Sobisch - 2014-09-04

    Here is the full proposed patch:

                <association langID="50" id="cobol_section_fixed"/>
    
    [...]
    
                <!-- Variant for COBOL fixed-form reference format -->
                <parser id="cobol_section_fixed" displayName="COBOL fixed-form reference format">
                    <!-- working comment Expression - NOT(!) needed with mainExpr current used:
                            commentExpr="(?m-s)(?:^[\d\t ]{6}\*|\*&gt;).*$"
                         cannot be used because problems with comment boundaries
                         in current FunctionList implementation, for details see
                         https://sourceforge.net/p/notepad-plus/patches/597/
                         As soon as the comment boundaries are fixed the mainExpr and nameExpr
                         can be simplified as below
                        mainExpr="(?m-s)^.{6}[ D][\t ]{0,3}(?!exit\s)[\w_-]+(\.|((?'seps'([\t ]|([\n\r]+(.{6}[ D]|.{0,6}$)))+)section(\.|(?&seps)(\.|[\w_-]+\.))))"
                        expr="[\w_-]+((?=\.)|((?'seps'([\t ]|([\n\r]+(.{6}[ D]|.{0,6}$)))+)section((?=\.)|(?&seps)((?=\.)|[\w_-]+(?=\.)))))"
                    -->
                    <function
                        mainExpr="(?m-s)^.{6}[ D][\t ]{0,3}(?!exit\s)[\w_-]+(\.|((?'seps'([\t ]|\*&gt;.*|([\n\r]+(.{6}([ D]|\*.*)|.{0,6}$)))+)section(\.|((?&amp;seps)(\.|[\w_-]+\.)))))"
                        displayMode="$functionName">
                        <functionName>
                            <nameExpr expr="[\w_-]+((?=\.)|((?'seps'([\t ]|\*&gt;.*|([\n\r]+(.{6}([ D]|\*.*)|.{0,6}$)))+)section((?=\.)|(?&amp;seps)((?=\.)|[\w_-]+(?=\.)))))"/>
                        </functionName>
                    </function>
                </parser>
    
                <!-- Variant for COBOL free-form reference format -->
                <parser id="cobol_section_free" displayName="COBOL free-form reference format">
                    <!-- working comment Expression:
                             commentExpr="(?m-s)(?:\*&gt;).*$"
                         cannot be used because problems with comment boundaries
                         in current FunctionList implementation, for details see
                         https://sourceforge.net/p/notepad-plus/patches/597/
                    -->
                    <!-- Variant with paragraphs (don't work with comment lines
                         before section/paragraph header, can be activated when
                         comment boundaries work and the commentExpr is used) -->
                    <!--
                    <function
                        mainExpr="(?m-s)(?<=\.)\s*(?!exit\s)[\w_-]+(\s+section(\s*|(\s+[\w_-]+)?))(?=\.)"
                        displayMode="$functionName">
                        <functionName>
                            <nameExpr expr="(?m-s)(?<=[\s\.])[\w_-]+(\s*section\s*([\w_-]+)?)?"/>
                        </functionName>
                    </function>
                    -->
                    <!-- Variant without paragraphs (works with comment lines before section header) -->
                    <function
                        mainExpr="[\s\.](?!exit\s)[\w_-]+\s+section(\s*|(\s+[\w_-]+)?)(?=\.)"
                        displayMode="$functionName">
                        <functionName>
                            <nameExpr expr="[\w_-]+\s*section"/>
                        </functionName>
                    </function>
                </parser>
    

    I've added both reference formats, the user can change fixed/free in the association tag as long as the COBOL syntax highlighter isn't split in two highlighters.

    I've added everything that doesn't work because of the comment boundary issue as a comment. When this bug is fixed we can uncomment these parts (and remove the others).

    The only current "bug" is that comments in the source code within "function" declarations (very uncommon) are shown as part of the "function names" (likely solved when the comment boundary issue is fixed, too), along with multiple spaces.
    I suggest adding an optional regex for cleaning up the displayed function names; the not-yet-used displayMode attribute could be used for this.

    Simon

     
  • Simon Sobisch

    Simon Sobisch - 2014-09-04

    And here are the test sources along with the results of the parsers:

     

    Last edit: Simon Sobisch 2014-09-04
    • Menno Vogels

      Menno Vogels - 2014-09-04

      Hi Simon,

      Adding the test sources along with the results is great. However, it's not clear to me whether or not it's the result you expected.

       
      • Simon Sobisch

        Simon Sobisch - 2014-09-04

        They are expected. For better results we need the reduction of multiple spaces in function names and working comment filtering (I expect comments then to be excluded from the string we have to filter in mainExpr, so they won't show up in the function names). As I've seen in patch #548 you have a working version. Please give it a try with TESTFIXED.cbl and TESTFREE.cbl by using the commented-out parts instead of the active ones and post the results here (the fixed-form variant with paragraphs should work, too).

        The only "glitch" left afterwards is the FIXED sample with "NPAR" showing up although columns 1-6 should be ignored. This can be fixed either by filtering via displayMode (currently not possible) or by defining columns 1-6 as a comment, too (which likely needs changes in the fixed-form parser). I think the latter would be the better solution in any case, as it leaves fewer things for mainExpr and nameExpr to match, but we have to wait for the comment boundary fix first.
        Edit: You've included it in the commentExpr already, I just haven't had the chance to test/adjust the parser.

        Simon

         

        Related

        Patches: #548


        Last edit: Simon Sobisch 2014-09-04
        • Menno Vogels

          Menno Vogels - 2014-09-07

          My bad, I should not have used "expected" but "wanted" or better yet "intended".
          You want better results so the screen grabs don't show the intended result.

          :)

           
  • Menno Vogels

    Menno Vogels - 2014-09-06

    Notepad++ Rev.1275 + Patch #548 + Scintilla v3.50 + Boost 1.56

    TestFixed.Menno.1.png :

    commentExpr="(?'SLC'(?m-s)(?:^[\d\t ]{6}\*|\*&gt;).*$)"
    mainExpr="(?m-s)^.{6}[ D][\t ]{0,3}(?!exit\s)[\w\-]+(\.|((?'seps'([\t ]|\*&gt;.*|([\n\r]+(.{6}([ D]|\*.*)|.{0,6}$)))+)section(\.|((?&amp;seps)(\.|[\w\-]+\.)))))"
    expr="[\w\-]+((?=\.)|((?'seps'([\t ]|\*&gt;.*|([\n\r]+(.{6}([ D]|\*.*)|.{0,6}$)))+)section((?=\.)|(?&amp;seps)((?=\.)|[\w\-]+(?=\.)))))"
    

    TestFixed.Menno.2.png :

    commentExpr="(?'SLC'(?m-s)(?:^[\d\t ]{6}\*|\*&gt;).*$)"
    mainExpr="(?m-s)^.{6}[ D][\t ]{0,3}(?!exit\s)[\w\-]+(\.|((?'seps'([\t ]|([\n\r]+(.{6}[ D]|.{0,6}$)))+)section(\.|(?&amp;seps)(\.|[\w\-]+\.))))"
    expr="[\w\-]+((?=\.)|((?'seps'([\t ]|([\n\r]+(.{6}[ D]|.{0,6}$)))+)section((?=\.)|(?&amp;seps)((?=\.)|[\w\-]+(?=\.)))))"
    

    TestFree.Menno.1.png :

    commentExpr="(?'SLC'(?m-s)\*&gt;.*$)"
    mainExpr="[\s\.](?!exit\s)[\w\-]+\s+section(\s*|(\s+[\w\-]+)?)(?=\.)"
    expr="[\w\-]+\s*section"
    

    TestFree.Menno.2.png :

    commentExpr="(?'SLC'(?m-s)\*&gt;.*$)"
    mainExpr="(?m-s)(?<=\.)\s*(?!exit\s)[\w\-]+(\s+section(\s*|(\s+[\w\-]+)?))(?=\.)"
    expr="(?<=[\s\.])[\w\-]+(\s*section\s*([\w\-]+)?)?"
    

    FYI: \w == [A-Za-z0-9_]

     

    Last edit: Menno Vogels 2014-09-06
  • Simon Sobisch

    Simon Sobisch - 2014-09-06

    Nice to see the comment boundaries working.

    Just "academic": The "proper" version (according to the language definition) of the "word" part would be

    [A-Za-z][\w\-]*[A-Za-z0-9]
    

    instead of

    [\w\-]+
    

    But this would only filter out incorrectly coded names, and all COBOL compilers would complain about those anyway. The more complex expression brings no real gain for an editor (though it may well make a difference on the performance side).
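
To see what the two "word" expressions accept, here is a small Python comparison. It is a sketch only: `re.ASCII` is used so that `\w` means `[A-Za-z0-9_]` as noted earlier in the thread, and the sample words are made up.

```python
import re

STRICT = re.compile(r"[A-Za-z][\w-]*[A-Za-z0-9]", re.ASCII)
LOOSE = re.compile(r"[\w-]+", re.ASCII)

def accepts(pattern, word):
    """True if the whole word matches the pattern."""
    return pattern.fullmatch(word) is not None

for word in ("exit-prog", "TEST1", "-abc", "abc-", "A"):
    print(f"{word!r}: strict={accepts(STRICT, word)} loose={accepts(LOOSE, word)}")
```

The stricter form rejects names with a leading or trailing hyphen, but as written it also needs at least two characters, so a one-letter name like "A" is rejected. Whether that matters for real COBOL sources is not checked here.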

    TestFixed.Menno.1.png is the better of your fixed versions, in any case SLC should be

    (?m-s)(?:^.{6}\*|\*&gt;).*$
    

    as it doesn't matter at all what is placed in the first 6 columns (unless it is [\n\r]+, but that is already excluded by the modifier).

    TestFree.Menno.1.png is the better of the free versions.
    Both free versions show a possible tweak for your patch #548: replace occurrences of the "box char" (likely [\n\r]) with spaces in the function name (before removing duplicate, leading and trailing spaces).

    I'd like to find the best version for the COBOL parsers. Can you review patch 548 with the idea above and upload the necessary delta binary (maybe only notepad++.exe ?) for testing purposes?

    And one thing I'm not sure about - an assumption how the three-step filtering is working:

           PROCEDURE DIVISION.
          *-----------------------------------------------------------------
           TEST1 *> nice section
    
           SECTION.
           TEST2 SECTION
      NPAR .
          DPAR2.
           PAR3.
              exit section.
          *comment.
          *no section.
           TEST3
          * comment line
              SECTION
              PAR1.
           TEST4  SECTION  PAR1.
    012345 exit-prog section.
           prog-exit section.
                 exit program.
    

    converted with

    commentExpr="(?m-s)(?'SingleLineComments'(?:^.{6}\*|\*&gt;).*$)|(?'DebugLineMarker'^.{6}D)|(?'SequenceNumberArea'^.{6})"
    

    to all comments replaced by spaces

           PROCEDURE DIVISION.
    
           TEST1                
    
           SECTION.
           TEST2 SECTION
           .
           PAR2.
           PAR3.
              exit section.
    
           TEST3
    
              SECTION
              PAR1.
           TEST4  SECTION  PAR1.
           exit-prog section.
           prog-exit section.
                 exit program.
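
The conversion assumed in this step can be imitated in Python. This models only the mental picture described here (Menno's reply further down says the real implementation does not modify the text). The named groups use Python's `(?P<name>...)` spelling, `(?m-s)` is approximated by `re.MULTILINE`, and `blank_comments` is a hypothetical helper.

```python
import re

COMMENT_ZONES = re.compile(
    r"(?P<SingleLineComments>(?:^.{6}\*|\*>).*$)"
    r"|(?P<DebugLineMarker>^.{6}D)"
    r"|(?P<SequenceNumberArea>^.{6})",
    re.MULTILINE,
)

def blank_comments(text):
    """Overwrite every comment zone with spaces of the same length."""
    return COMMENT_ZONES.sub(lambda m: " " * len(m.group(0)), text)

print(repr(blank_comments("012345 exit-prog section.")))
print(repr(blank_comments("      DPAR2.")))
print(repr(blank_comments("  NPAR .")))
```

Replacing each zone with the same number of spaces keeps all column positions intact, which is what the `^.{6}` part of the mainExpr in the next step relies on.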
    

    matched with

    mainExpr="(?m-s)^.{6} [\t ]{0,3}(?!exit\s)[\w\-]+(\.|((?'seps'([\t ]|\*&gt;.*|([\n\r]+(.{6}( |\*.*)|.{0,6}$)))+)section(\.|((?&amp;seps)(\.|[\w\-]+\.)))))"
    

    to all unmatched entries removed

           TEST1                
    
           SECTION.
           TEST2 SECTION
           .
           PAR2.
           PAR3.
           TEST3
    
              SECTION
              PAR1.
           TEST4  SECTION  PAR1.
           exit-prog section.
           prog-exit section.
    

    matched with

    expr="[\w\-]+((?=\.)|((?'seps'([\t ]|\*&gt;.*|([\n\r]+(.{6}( |\*.*)|.{0,6}$)))+)section((?=\.)|(?&amp;seps)((?=\.)|[\w\-]+(?=\.)))))"
    

    to the following entries in function list (using [EOE] for marking end of entry)

    TEST1                
    
           SECTION[EOE]
           TEST2 SECTION
           [EOE]
    PAR2[EOE]
    PAR3[EOE]
    TEST3
    
              SECTION
              PAR1[EOE]
    TEST4  SECTION  PAR1[EOE]
    exit-prog section[EOE]
    prog-exit section[EOE]
    

    Is this assumption correct? If not, where and how does the implementation differ?

    I guess with 548 applied there is an additional last step as described above before adding the name to the list, leading to the following names

    • TEST1 SECTION
    • TEST2 SECTION
    • PAR2
    • PAR3
    • TEST3 SECTION PAR1
    • TEST4 SECTION PAR1
    • exit-prog section
    • prog-exit section

    Simon

     

    Related

    Patches: #548

  • Menno Vogels

    Menno Vogels - 2014-09-07

    I guess 'word'-part is what I call identifier.
    What is a "box char"?
    Is there a (A|E)?BNF document for COBOL? That's what I usually start with for the regular expressions.

    The steps don't quite work like that and yes ... with patch #548 there is a 4th step (see updated list above). Furthermore, step 2 has been altered ...

    Step 2 w/o patch #548: the function definition has to start and end in the same non-comment zone (even if 'mainExpr' takes into account function-definition-embedded comments).

    Step 2 w/ patch #548: the function definition has to start in a non-comment zone but can end in any succeeding non-comment zone i.e. making it possible to have comment zones within the definition (as long as 'mainExpr' takes these embedded comments into account). Step 4 filters out these embedded comments.

    e.g. C/C++ function definition (i.e. a step 3 result)

        FunctionName                                           /* comment */ 
        (                                                      /* comment */ 
          /* comment */Argument1Type/* comment */Argument1Name /* comment */
                                                               /* comment */
        , Argument2Type Argument2Name                          /* comment */
        , ...                                                  /* comment */
        )                                                      /* comment */
    

    can become (i.e. a step 4 result)

        FunctionName(Argument1Type Argument1Name, Argument2Type Argument2Name, ...)
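
The step 4 clean-up rules listed earlier can be approximated with a few substitutions. This Python sketch is not the code from patch #548: `clean_name` and the C comment expression are hypothetical stand-ins chosen for illustration.

```python
import re

C_COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)

def clean_name(raw, comment_expr=C_COMMENT):
    """Apply the clean-up rules to one step 3 result."""
    name = comment_expr.sub(" ", raw)          # each comment zone -> one space
    name = re.sub(r"\s+", " ", name).strip()   # collapse runs, trim both ends
    name = re.sub(r"\s+(?=[(),])", "", name)   # no white space before ( ) ,
    name = re.sub(r"(?<=\()\s+", "", name)     # no white space after (
    return name

raw = (
    "FunctionName /* c */\n"
    "( /* c */Arg1Type/* c */Arg1Name /* c */\n"
    ", Arg2Type Arg2Name /* c */\n"
    ")"
)
print(clean_name(raw))  # FunctionName(Arg1Type Arg1Name, Arg2Type Arg2Name)
```

The same function also collapses the line breaks and indentation in a COBOL name such as "TEST3 / SECTION / PAR1" spread over three lines into "TEST3 SECTION PAR1".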
    
     

    Last edit: Menno Vogels 2014-09-07
  • Menno Vogels

    Menno Vogels - 2014-09-07

    Removing the comments in step 2 makes more sense.
    However, for steps 1 to 3 the Scintilla API is called to apply the regular expressions. And I guess it's more efficient this way than applying the 3 steps to a copy of the text, especially for larger source files.
    Keep in mind that you don't actually want to remove any text!

    I'd have to dig deeper in Scintilla documentation to find out if it has an internal/shadow text buffer that can be manipulated.

     