Notepad++ / Patches / #597 Add COBOL to function list

Simon Sobisch - 2014-09-02

Test prog attached, too

TEST.cbl


Menno Vogels - 2014-09-02

1. Not possible with current FunctionList implementation.

2. Try this:

        <parser id="cobol_section" displayName="COBOL" commentExpr="(?m-s)(?:^[\d\t ]{6}\*|\*>).*$">
            <function mainExpr="(?s-m)[\d\t ]{6}[D ][\t ]{0,3}(?!exit)[\w\-]+\s+section(?:\s+\w+)?\s*?\." displayMode="$functionName">
                <functionName>
                    <nameExpr expr="(?!exit)[\w\-]+(?=\s+section)"/>
                </functionName>
            </function>
        </parser>

3. See 2. However, the current FunctionList implementation has a problem with comment boundaries: single-line comments should be preceded by an empty line, and inline comments should be preceded by at least 2 spaces, e.g.

           PROCEDURE DIVISION.

          *-----------------------------------------------------------------
           TEST1 SECTION.  *> inline comment

Last edit: Menno Vogels 2014-09-02

Simon Sobisch - 2014-09-03

Hi Menno,

big thanks for your post.

To 3. commentExpr="(?m-s)(?:^[\d\t ]{6}\*|\*>).*$" works fine with the restriction you've mentioned [which is why it shouldn't be used yet, especially as COBOLers don't use many empty lines or add a lot of spaces].

But I do not understand why the working part works - shouldn't [\d\t ] only match digits, tabs and spaces? a-zA-Z are matched too - as it must be in this case.

If you're sure that the current FunctionList implementation has a problem with comment boundaries, please open a bug ticket for that (I didn't find any).

To 1. Does it make sense to open a feature request ticket for that (it is related to all parsers of Function List)? Something like search and replace in displayed function names would take care of everything I can think of; the currently unused displayMode attribute could be used for this.

To 2. I see the idea. The version you've posted removes paragraphs from the result and doesn't support newlines before/after section, I've changed this and will post a proposed patch tomorrow.

Here is the already working part for free-form reference format

::xml
    <parser id="cobol_section_free" displayName="COBOL free-form reference format">
        <function
            mainExpr="[\s\.](?!exit\s)[\w_-]+\s+section(\s*|(\s+[\w_-]+)?)(?=\.)"
            displayMode="$functionName">
            <functionName>
                <nameExpr expr="[\w_-]+\s*section"/>
            </functionName>
        </function>
    </parser>

Simon
- Menno Vogels - 2014-09-04
  
  Hi Simon,
  
  I'm not a COBOLer myself so please excuse me if I didn't get the syntax right.
  
  To 3. What do you mean by 'working part'?
  What should [A-Za-z] match as well?
  I did open a patch ticket (#548) which includes a solution for the comment boundaries problem.
  
  To 1. A feature request ticket makes sense. I don't know what Don et al. intended with the 'displayMode' attribute, but it would be nice to somehow be able to 'clean up' the function name. My current implementation removes comment zones and changes white-space characters to a single space.
  
  - Simon Sobisch - 2014-09-04
    
    No problem :-) I've only used simple regex before (at least compared to the now suggested COBOL parser). We all have our strengths and we all can learn something from time to time.
    This leads me to 3: you're right, 'working part' was confusing. I don't understand why your regex "(?m-s)(?:^[\d\t ]{6}\*|\*>).*$" works: it matches
    
    aBzT
    
    too, but [\d\t ] should only match 0-9, tabs and spaces, not a-zA-Z, and the modifiers don't change this.
    
    What piece am I missing here?
    
    To 1: As posted below, this wouldn't be necessary any more if runs of multiple spaces are replaced by a single one and the comments are cut.
    Therefore it's more of a nice-to-have, and I want to post the feature request after testing the results of the comment boundary fix.
    
    Simon
    
    Last edit: Simon Sobisch 2014-09-04
    
    - Menno Vogels - 2014-09-05
      
      The expression
      
      "(?m-s)(?:^[\d\t ]{6}\*|\*>).*$"
      
      filters out comments and thus should not match 'aBzT'. At the very least, it should not be visible in the FunctionList tree view.
      It should match:
      
      000000* aBzT
      
      or
      
      *> aBzT
      
      What FunctionList does:
      
      1. It uses "commentExpr" to find all the comment zones in the document;
      
      2. It uses "mainExpr" to find all the function definitions in the document while skipping the comment zones;
      
      3. It uses "expr" (one or more) to filter out the function names for every match of the "mainExpr" search. That is, the output of the "mainExpr" search is the input for the first "expr" search. For some languages it's easier to define more than one "expr" search to find the function name or to exclude the function parameters.
      
      4. Additional function name clean-up (Patch #548 only):
         * replace each comment zone with one space char, i.e. prevent 'intArg1' in case of 'int/*comment*/Arg1';
         * change two or more white-space characters to one space char, i.e. prevent runs of spaces in case of '/*comment*/ /*comment*/' or ' /*comment*/ ';
         * remove leading and trailing spaces;
         * remove the white-space character preceding any parenthesis or comma;
         * remove the white-space character succeeding an opening parenthesis;
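The steps listed above can be sketched roughly in Python. This is only an illustration, not the Notepad++ code: the function name `function_names`, the sample text and the case-insensitive matching are assumptions made for the example. It reuses the free-form COBOL expressions from this thread together with a plain `*>` comment expression, and reduces the clean-up to whitespace handling.

```python
import re

# Assumptions: matching is case-insensitive, and "(?m-s)" is approximated
# by MULTILINE without DOTALL.
FLAGS = re.IGNORECASE | re.MULTILINE

COMMENT_EXPR = re.compile(r"\*>.*$", FLAGS)
MAIN_EXPR = re.compile(
    r"[\s\.](?!exit\s)[\w\-]+\s+section(\s*|(\s+[\w\-]+)?)(?=\.)", FLAGS
)
NAME_EXPR = re.compile(r"[\w\-]+\s*section", FLAGS)


def function_names(text):
    """Hypothetical helper mirroring the described search steps."""
    # Find all comment zones as (start, end) offsets.
    zones = [m.span() for m in COMMENT_EXPR.finditer(text)]

    def in_comment(pos):
        return any(start <= pos < end for start, end in zones)

    names = []
    # Find definitions, skipping matches that start inside a comment zone.
    for definition in MAIN_EXPR.finditer(text):
        if in_comment(definition.start()):
            continue
        # Extract the name from the text matched by the main expression.
        name = NAME_EXPR.search(definition.group(0))
        if name:
            # Clean-up, reduced here to collapsing and trimming whitespace.
            names.append(" ".join(name.group(0).split()))
    return names


SAMPLE = (
    "procedure division.\n"
    "*> old-init section.\n"
    "init   section.\n"
    '    display "hi".\n'
    "    exit section.\n"
    "MAIN-PART SECTION.\n"
    "    stop run.\n"
)

print(function_names(SAMPLE))
```

The commented-out `old-init section` is dropped because its match starts inside a comment zone, and `exit section` is rejected by the `(?!exit\s)` lookahead.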
      
      Hmm ... I think it's nicer to extend the function name search with a 'replace' attribute for the clean up e.g.
      
      <nameExpr expr="..." replace="...">
      
      to be able to customize.
      
      To prevent function declarations in literal strings from being listed one could handle string literals as comment e.g. add
      
      |(?:(?s-m)"[^"\\]*(?:\\.[^"\\]*)*")
      
      to the 'commentExpr' of the C++ parser.
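To see what the string-literal alternative above would cover, here is a small Python check. It is a sketch only: the flags are passed as `re.DOTALL` instead of the inline `(?s-m)` prefix, and the helper name `string_literals` and the sample source line are made up for the example.

```python
import re

# The string-literal expression from the post, with the "(?s-m)" modifiers
# expressed as re.DOTALL (no MULTILINE).
STRING_LITERAL = re.compile(r'"[^"\\]*(?:\\.[^"\\]*)*"', re.DOTALL)


def string_literals(source):
    """Return every double-quoted literal, escaped quotes included."""
    return [m.group(0) for m in STRING_LITERAL.finditer(source)]


# A C-like line whose literal contains something that looks like a function.
SOURCE = 'puts("void fake(int a) { \\"x\\" }"); void real(int b) {}'

print(string_literals(SOURCE))
```

If such literals are treated as comment zones, the `fake` declaration inside the string would be skipped while `real` stays visible to the main expression.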
      
      Last edit: Menno Vogels 2014-10-13
      
      - Simon Sobisch - 2014-09-05
        
        I thought it matched
        
        aBzT * test
        
        too (this must be matched, which is why I had .{6}\*, which doesn't match newlines because of the modifier).
        
        Have you already tested the COBOL parser with the new function list and the combination of simplified parser + commentExpr?
        
        Simon
        

Here is the full proposed patch:

::xml
            <association langID="50" id="cobol_section_fixed"/>

[...]

            <!-- Variant for COBOL fixed-form reference format -->
            <parser id="cobol_section_fixed" displayName="COBOL fixed-form reference format">
                <!-- Working comment expression - NOT(!) needed with the mainExpr currently used:
                        commentExpr="(?m-s)(?:^[\d\t ]{6}\*|\*&gt;).*$"
                     It cannot be used because of problems with comment boundaries
                     in the current FunctionList implementation, for details see
                     https://sourceforge.net/p/notepad-plus/patches/597/
                     As soon as the comment boundaries are fixed the mainExpr and nameExpr
                     can be simplified as below
                    mainExpr="(?m-s)^.{6}[ D][\t ]{0,3}(?!exit\s)[\w_-]+(\.|((?'seps'([\t ]|([\n\r]+(.{6}[ D]|.{0,6}$)))+)section(\.|(?&amp;seps)(\.|[\w_-]+\.))))"
                    expr="[\w_-]+((?=\.)|((?'seps'([\t ]|([\n\r]+(.{6}[ D]|.{0,6}$)))+)section((?=\.)|(?&amp;seps)((?=\.)|[\w_-]+(?=\.)))))"
                -->
                <function
                    mainExpr="(?m-s)^.{6}[ D][\t ]{0,3}(?!exit\s)[\w_-]+(\.|((?'seps'([\t ]|\*&gt;.*|([\n\r]+(.{6}([ D]|\*.*)|.{0,6}$)))+)section(\.|((?&amp;seps)(\.|[\w_-]+\.)))))"
                    displayMode="$functionName">
                    <functionName>
                        <nameExpr expr="[\w_-]+((?=\.)|((?'seps'([\t ]|\*&gt;.*|([\n\r]+(.{6}([ D]|\*.*)|.{0,6}$)))+)section((?=\.)|(?&amp;seps)((?=\.)|[\w_-]+(?=\.)))))"/>
                    </functionName>
                </function>
            </parser>

            <!-- Variant for COBOL free-form reference format -->
            <parser id="cobol_section_free" displayName="COBOL free-form reference format">
                <!-- working comment Expression:
                         commentExpr="(?m-s)(?:\*&gt;).*$"
                     cannot be used because problems with comment boundaries
                     in current FunctionList implementation, for details see
                     https://sourceforge.net/p/notepad-plus/patches/597/
                -->
                <!-- Variant with paragraphs (doesn't work with comment lines
                     before a section/paragraph header; can be activated when
                     comment boundaries work and the commentExpr is used) -->
                <!--
                <function
                    mainExpr="(?m-s)(?&lt;=\.)\s*(?!exit\s)[\w_-]+(\s+section(\s*|(\s+[\w_-]+)?))(?=\.)"
                    displayMode="$functionName">
                    <functionName>
                        <nameExpr expr="(?m-s)(?&lt;=[\s\.])[\w_-]+(\s*section\s*([\w_-]+)?)?"/>
                    </functionName>
                </function>
                -->
                <!-- Variant without paragraphs (works with comment lines before section header) -->
                <function
                    mainExpr="[\s\.](?!exit\s)[\w_-]+\s+section(\s*|(\s+[\w_-]+)?)(?=\.)"
                    displayMode="$functionName">
                    <functionName>
                        <nameExpr expr="[\w_-]+\s*section"/>
                    </functionName>
                </function>
            </parser>

I've added both reference formats; the user can switch between fixed and free in the association tag as long as the COBOL syntax highlighter isn't split into two highlighters.

I've added everything that doesn't work because of the comment boundaries issue as a comment. When this bug is fixed we can uncomment these parts (and remove the others).

The only current "bug" is that comments in the source code within "function" declarations (very uncommon) are shown as part of the "function names" (likely solved when the comment boundaries issue is fixed, too), along with multiple spaces.
I suggest adding an optional regex for the function names in the display; the not-yet-used displayMode attribute could be used for this.

Simon

Simon Sobisch - 2014-09-04

And here are the test sources along with the results of the parsers:

Last edit: Simon Sobisch 2014-09-04

TESTFIXED.cbl

TESTFIXED.png

TESTFREE.cbl

TESTFREE_WITHOUT_PAR.png

TESTFREE_WITH_PAR.png

- Menno Vogels - 2014-09-04
  
  Hi Simon,
  
  Adding the test sources along with the results is great. However, it's not clear to me whether or not it's the result you expected.
  
  - Simon Sobisch - 2014-09-04
    
    They are expected. For better results we need the reduction of multiple spaces in function names and working comment filtering (I expect comments not to be included in the string we have to match in mainExpr, so they won't show up in the function names). As I've seen in [patches:#548], you have a working version. Please give it a try with TESTFIXED.cbl and TESTFREE.cbl by using the commented parts instead of the active ones and post the results here (the fixed-form variant with paragraphs should work, too).
    
    The only "glitch" left afterwards is the FIXED sample with "NPAR" showing up although columns 1-6 are ignored. This can be fixed by either filtering via displayMode (currently not possible) or by defining columns 1-6 as comment, too (which likely needs changes in the fixed-form parser). I think this would be the better solution in any case, as it leaves fewer things to match for mainExpr and nameExpr, but we have to wait for the comment boundary fix first.
    Edit: You've included it in the commentExpr already; I just haven't had the chance to test/adjust the parser.
    
    Simon
    
    
    Last edit: Simon Sobisch 2014-09-04
    
    - Menno Vogels - 2014-09-07
      
      My bad, I should not have used "expected" but "wanted" or better yet "intended".
      You want better results so the screen grabs don't show the intended result.
      
      :)
      

Notepad++ Rev.1275 + Patch #548 + Scintilla v3.50 + Boost 1.56

TestFixed.Menno.1.png :

commentExpr="(?'SLC'(?m-s)(?:^[\d\t ]{6}\*|\*&gt;).*$)"
mainExpr="(?m-s)^.{6}[ D][\t ]{0,3}(?!exit\s)[\w\-]+(\.|((?'seps'([\t ]|\*&gt;.*|([\n\r]+(.{6}([ D]|\*.*)|.{0,6}$)))+)section(\.|((?&amp;seps)(\.|[\w\-]+\.)))))"
expr="[\w\-]+((?=\.)|((?'seps'([\t ]|\*&gt;.*|([\n\r]+(.{6}([ D]|\*.*)|.{0,6}$)))+)section((?=\.)|(?&amp;seps)((?=\.)|[\w\-]+(?=\.)))))"

TestFixed.Menno.2.png :

commentExpr="(?'SLC'(?m-s)(?:^[\d\t ]{6}\*|\*&gt;).*$)"
mainExpr="(?m-s)^.{6}[ D][\t ]{0,3}(?!exit\s)[\w\-]+(\.|((?'seps'([\t ]|([\n\r]+(.{6}[ D]|.{0,6}$)))+)section(\.|(?&amp;seps)(\.|[\w\-]+\.))))"
expr="[\w\-]+((?=\.)|((?'seps'([\t ]|([\n\r]+(.{6}[ D]|.{0,6}$)))+)section((?=\.)|(?&amp;seps)((?=\.)|[\w\-]+(?=\.)))))"

TestFree.Menno.1.png :

commentExpr="(?'SLC'(?m-s)\*&gt;.*$)"
mainExpr="[\s\.](?!exit\s)[\w\-]+\s+section(\s*|(\s+[\w\-]+)?)(?=\.)"
expr="[\w\-]+\s*section"

TestFree.Menno.2.png :

commentExpr="(?'SLC'(?m-s)\*&gt;.*$)"
mainExpr="(?m-s)(?&lt;=\.)\s*(?!exit\s)[\w\-]+(\s+section(\s*|(\s+[\w\-]+)?))(?=\.)"
expr="(?&lt;=[\s\.])[\w\-]+(\s*section\s*([\w\-]+)?)?"

FYI: \w == [A-Za-z0-9_]

Last edit: Menno Vogels 2014-09-06

TestFixed.Menno.1.png

TestFixed.Menno.2.png

TestFree.Menno.1.png

TestFree.Menno.2.png

Nice to see the comment boundaries working.

Just "academic": The "proper" version (concerning the language definition) of the "word" part would be

[A-Za-z][\w\-]*[A-Za-z0-9]

instead of

[\w\-]+

But this would only filter out code that is wrong anyway, and all COBOL compilers would complain about that. For an editor, the more complex syntax brings no real gain (whereas its effect on performance might well be real).
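A quick way to compare the two "word" expressions is the Python sketch below. The helper name `classify` and the sample words are assumptions for the example, and `re.ASCII` is used so that `\w` means `[A-Za-z0-9_]` as noted earlier in the thread.

```python
import re

# The stricter "word" form and the loose form used in the parsers.
STRICT_WORD = re.compile(r"[A-Za-z][\w\-]*[A-Za-z0-9]", re.ASCII)
LOOSE_WORD = re.compile(r"[\w\-]+", re.ASCII)


def classify(word):
    """Return (accepted by strict form, accepted by loose form)."""
    return (
        bool(STRICT_WORD.fullmatch(word)),
        bool(LOOSE_WORD.fullmatch(word)),
    )


for word in ("prog-exit", "TEST1", "-TEST", "TEST-", "A"):
    print(word, classify(word))
```

Only words with a leading or trailing hyphen are treated differently, plus one side effect worth noting: as written, the strict form needs at least two characters, so a one-letter name such as `A` is rejected by it.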

TestFixed.Menno.1.png is the better of your fixed versions, in any case SLC should be

(?m-s)(?:^.{6}\*|\*&gt;).*$

as it doesn't matter at all what is placed in the first 6 columns (as long as it isn't [\n\r], which the modifier already excludes).
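The difference between the two single-line comment expressions can be checked with a short Python sketch. The sample lines and the helper name `comment_zones` are made up for the example; `(?m-s)` is expressed as `re.MULTILINE` without `re.DOTALL`, and `&gt;` is written as a plain `>` because this is not XML.

```python
import re

# Sequence area restricted to digits, tabs and spaces ...
NARROW = re.compile(r"(?:^[\d\t ]{6}\*|\*>).*$", re.MULTILINE)
# ... versus any six characters in columns 1-6.
WIDE = re.compile(r"(?:^.{6}\*|\*>).*$", re.MULTILINE)

SAMPLE = "\n".join(
    [
        "000000* numbered comment line",
        "aBzT  * letters in the sequence area",
        "       TEST1 SECTION.  *> inline comment",
        "       PAR1.",
    ]
)


def comment_zones(pattern, text):
    """Return the text of every comment zone found by the pattern."""
    return [m.group(0) for m in pattern.finditer(text)]


print(comment_zones(NARROW, SAMPLE))
print(comment_zones(WIDE, SAMPLE))
```

With these sample lines, only the `.{6}` variant treats the line starting with `aBzT` as a comment; both find the numbered comment line and the inline `*>` comment.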

TestFree.Menno.1.png is the better of the free versions.
Both free versions show a possible tweak for your [patches:#548]: replace occurrences of the "box char" (likely [\n\r]) with spaces in the function name (before removing duplicate, leading, and trailing spaces).

I'd like to find the best version for the COBOL parsers. Can you review patch 548 with the idea above and upload the necessary delta binary (maybe only notepad++.exe ?) for testing purposes?

And one thing I'm not sure about - my assumption of how the three-step filtering works:

::cobol
       PROCEDURE DIVISION.
      *-----------------------------------------------------------------
       TEST1 *> nice section

       SECTION.
       TEST2 SECTION
  NPAR .
      DPAR2.
       PAR3.
          exit section.
      *comment.
      *no section.
       TEST3
      * comment line
          SECTION
          PAR1.
       TEST4  SECTION  PAR1.
012345 exit-prog section.
       prog-exit section.
             exit program.

converted with

commentExpr="(?m-s)(?'SingleLineComments'(?:^.{6}\*|\*&gt;).*$)|(?'DebugLineMarker'^.{6}D)|(?'SequenceNumberArea'^.{6})"

to all comments replaced by spaces

::cobol
       PROCEDURE DIVISION.

       TEST1                

       SECTION.
       TEST2 SECTION
       .
       PAR2.
       PAR3.
          exit section.


       TEST3

          SECTION
          PAR1.
       TEST4  SECTION  PAR1.
       exit-prog section.
       prog-exit section.
             exit program.

matched with

mainExpr="(?m-s)^.{6} [\t ]{0,3}(?!exit\s)[\w\-]+(\.|((?'seps'([\t ]|\*&gt;.*|([\n\r]+(.{6}( |\*.*)|.{0,6}$)))+)section(\.|((?&amp;seps)(\.|[\w\-]+\.)))))"

to all unmatched entries removed

::cobol
       TEST1                

       SECTION.
       TEST2 SECTION
       .
       PAR2.
       PAR3.
       TEST3

          SECTION
          PAR1.
       TEST4  SECTION  PAR1.
       exit-prog section.
       prog-exit section.

matched with

expr="[\w\-]+((?=\.)|((?'seps'([\t ]|\*&gt;.*|([\n\r]+(.{6}( |\*.*)|.{0,6}$)))+)section((?=\.)|(?&amp;seps)((?=\.)|[\w\-]+(?=\.)))))"

to the following entries in function list (using [EOE] for marking end of entry)

TEST1                

       SECTION[EOE]
       TEST2 SECTION
       [EOE]
PAR2[EOE]
PAR3[EOE]
TEST3

          SECTION
          PAR1[EOE]
TEST4  SECTION  PAR1[EOE]
exit-prog section[EOE]
prog-exit section[EOE]

Is this assumption correct? If not, where and how does the implementation differ?

I guess with 548 applied there is an additional last step as described above before adding the name to the list, leading to the following names

TEST1 SECTION
TEST2 SECTION
PAR2
PAR3
TEST3 SECTION PAR1
TEST4 SECTION PAR1
exit-prog section
prog-exit section
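The first stage of the assumed walk-through above (replacing every comment zone with spaces) can be reproduced with a small Python sketch. It only models the assumption being asked about, not the actual FunctionList behaviour. The named groups `(?'Name'...)` are rewritten in Python's `(?P<Name>...)` syntax, `&gt;` becomes a plain `>`, and the function name `blank_comment_zones` is made up for the example.

```python
import re

# commentExpr from the walk-through: single-line comments, the debug line
# marker in column 7, and the sequence number area in columns 1-6.
COMMENT_EXPR = re.compile(
    r"(?P<SingleLineComments>(?:^.{6}\*|\*>).*$)"
    r"|(?P<DebugLineMarker>^.{6}D)"
    r"|(?P<SequenceNumberArea>^.{6})",
    re.MULTILINE,
)


def blank_comment_zones(text):
    """Replace each comment zone with the same number of spaces."""
    return COMMENT_EXPR.sub(lambda m: " " * len(m.group(0)), text)


SAMPLE = "\n".join(
    [
        "       TEST1 *> nice section",
        "  NPAR .",
        "      DPAR2.",
        "      *comment.",
        "012345 exit-prog section.",
    ]
)

for line in blank_comment_zones(SAMPLE).split("\n"):
    print(repr(line))
```

Keeping the length of each line unchanged means all offsets still line up with the original text, which matters if the later expressions are applied to the blanked copy.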

Simon


Menno Vogels - 2014-09-07

I guess 'word'-part is what I call identifier.
What is a "box char"?
Is there a (A|E)?BNF document for COBOL? That's what I usually start with for the regular expressions.

The steps don't quite work like that and yes ... with patch #548 there is a 4th step (see updated list above). Furthermore, step 2 has been altered ...

Step 2 w/o patch #548: the function definition has to start and end in the same non-comment zone (even if 'mainExpr' takes into account function-definition-embedded comments).

Step 2 w/ patch #548: the function definition has to start in a non-comment zone but can end in any succeeding non-comment zone i.e. making it possible to have comment zones within the definition (as long as 'mainExpr' takes these embedded comments into account). Step 4 filters out these embedded comments.
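One possible reading of the two step 2 variants, written as a Python sketch. This is an interpretation of the description above, not code from the patch; all function names and the sample text are hypothetical, and comment zones are half-open `(start, end)` offsets.

```python
import re

COMMENT_EXPR = re.compile(r"\*>.*$", re.MULTILINE)


def comment_zones(text):
    """Offsets of all comment zones found by the comment expression."""
    return [m.span() for m in COMMENT_EXPR.finditer(text)]


def in_zone(pos, zones):
    return any(start <= pos < end for start, end in zones)


def accepted_without_patch(span, zones):
    """Assumed rule: start and end lie in the same non-comment zone,
    so no comment zone may overlap the definition."""
    start, end = span
    if in_zone(start, zones):
        return False
    return not any(z_start < end and z_end > start for z_start, z_end in zones)


def accepted_with_patch(span, zones):
    """Assumed rule: the definition starts outside a comment zone and
    ends outside one; comment zones in between are tolerated."""
    start, end = span
    return not in_zone(start, zones) and not in_zone(end - 1, zones)


TEXT = "TEST1 *> nice section\nSECTION."
ZONES = comment_zones(TEXT)
# Span of a definition that embeds the inline comment.
DEFINITION = (0, TEXT.index("SECTION.") + len("SECTION"))

print(accepted_without_patch(DEFINITION, ZONES))
print(accepted_with_patch(DEFINITION, ZONES))
```

Under this reading, a section header split by an inline comment is rejected without the patch and accepted with it, provided 'mainExpr' itself can match across the comment.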

e.g. C/C++ function definition (i.e. a step 3 result)

FunctionName /* comment */ ( /* comment */ /* comment */Argument1Type/* comment */Argument1Name /* comment */ /* comment */ , Argument2Type Argument2Name /* comment */ , ... /* comment */ ) /* comment */

can become (i.e. a step 4 result)

FunctionName(Argument1Type Argument1Name, Argument2Type Argument2Name, ...)
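The clean-up rules from the list further up can be applied to this example with a few regular expressions. The Python sketch below is an illustration of those rules, not the patch itself; the function name `clean_function_name` is made up, and comments are assumed to be C-style block comments only.

```python
import re

C_COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)


def clean_function_name(raw):
    """Apply the described clean-up rules in order."""
    # Replace each comment zone with one space.
    name = C_COMMENT.sub(" ", raw)
    # Change runs of white space to a single space.
    name = re.sub(r"\s+", " ", name)
    # Remove leading and trailing spaces.
    name = name.strip()
    # Remove the space preceding any parenthesis or comma.
    name = re.sub(r" (?=[(),])", "", name)
    # Remove the space succeeding an opening parenthesis.
    name = re.sub(r"\( ", "(", name)
    return name


RAW = (
    "FunctionName /* comment */ ( /* comment */ /* comment */Argument1Type"
    "/* comment */Argument1Name /* comment */ /* comment */ , "
    "Argument2Type Argument2Name /* comment */ , ... /* comment */ ) "
    "/* comment */"
)

print(clean_function_name(RAW))
```

Replacing comments with a space rather than deleting them is what keeps `int/*comment*/Arg1` from collapsing into `intArg1`.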

Last edit: Menno Vogels 2014-09-07

Simon Sobisch - 2014-09-07

By "box char" I was referring to the images you've posted (it's the substitute for "not printable / not in the font"). I think they should be replaced by one space, at least if they are NULL or \n or \r.

This looks like a COBOL BNF, but I haven't checked the quality:
http://tomcopeland.blogs.com/cobol.html

There is a good GNU-free grammar at https://sourceforge.net/p/open-cobol/code/HEAD/tree/branches/gnu-cobol-2.0/cobc/parser.y with the comment parts in the lexer https://sourceforge.net/p/open-cobol/code/HEAD/tree/branches/gnu-cobol-2.0/cobc/pplex.l but these are not "clean" (full of extensions and dialects).

Thank you for the clarification of the steps. But wouldn't substituting comments with spaces in step 2 make even more sense than the current behaviour with [patches:#548]?

Simon



Menno Vogels - 2014-09-07

Removing the comments in step 2 makes more sense.
However, for steps 1 to 3 the Scintilla API is called to apply the regular expressions. And I guess it's more efficient this way than to apply the 3 steps on a copy of the text, especially for the larger source files.
Keep in mind that you don't actually want to remove any text!

I'd have to dig deeper in Scintilla documentation to find out if it has an internal/shadow text buffer that can be manipulated.
