Hi, these are the necessary changes to function.xml to support COBOL (sections and paragraphs are included). There are some regressions (included as xml comments below), but it's much better than the current "nothing" implementation.
...
<association langID="50" id="cobol_section"/>
...
<parser id="cobol_section" displayName="COBOL">
    <function
        mainExpr="^.{6}[\sD]\s{0,3}[A-Za-z0-9_-]{1,}(\s{1,}section\s*||\s*)\."
        displayMode="$functionName">
        <!-- Variant for COBOL free-form reference format
             (it's only able to parse sections but not paragraphs, because of missing areas)
        mainExpr="[A-Za-z0-9_-]*\s*section\s*\."
        -->
        <functionName>
            <nameExpr expr="[A-Za-z0-9_-]*(\s*(section)){0,1}\."/>
        </functionName>
    </function>
</parser>
With this sample:
TEST1 SECTION. TEST2 SECTION PAR1. PAR2. *comment. TEST3 SECTION PAR1. exit section. exit-prog section.
sections and paragraphs are shown as you can see in the attachment.
This works and can be added as it is; optional points for improving the function list:
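For anyone who wants to try the fixed-form expressions outside Notepad++, here is a minimal Python sketch. The sample lines and their column layout are assumptions (the indentation of the posted sample is not preserved here), and Python's re module only stands in for the regex engine used by Notepad++, so details may differ. Case-insensitive matching is also an assumption, made because the sample writes SECTION in upper case while the expression uses lower case.

```python
import re

# mainExpr and nameExpr copied from the parser definition above.
MAIN_EXPR = re.compile(
    r"^.{6}[\sD]\s{0,3}[A-Za-z0-9_-]{1,}(\s{1,}section\s*||\s*)\.",
    re.MULTILINE | re.IGNORECASE,
)
NAME_EXPR = re.compile(r"[A-Za-z0-9_-]*(\s*(section)){0,1}\.", re.IGNORECASE)

# Hypothetical fixed-form lines: columns 1-6 sequence area, column 7 indicator.
SAMPLE = "\n".join([
    "       TEST1 SECTION.",      # section header starting in area A
    "       PAR1.",               # paragraph header starting in area A
    "      *comment.",            # '*' in column 7 marks a comment line
    "           MOVE A TO B.",    # statement in area B, should be ignored
])

def function_names(source):
    """Apply mainExpr to the source, then nameExpr to each match."""
    names = []
    for match in MAIN_EXPR.finditer(source):
        name = NAME_EXPR.search(match.group(0))
        if name:
            names.append(name.group(0))
    return names

print(function_names(SAMPLE))  # ['TEST1 SECTION.', 'PAR1.']
```

Note that nameExpr keeps the trailing period, so any clean-up of the displayed name would have to happen in a later step.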
Simon
Discussion: Function List for COBOL
Patches: #548
Test prog attached, too
Try this:
See 2. However, current FunctionList implementation has a problem with comment boundaries. Single-line comments should be preceded by an empty line and inline comments should be preceded by at least 2 spaces.
e.g.
Last edit: Menno Vogels 2014-09-02
Hi Menno,
big thanks for your post.
To 3. commentExpr="(?m-s)(?:^[\d\t ]{6}\*|\*>).*$" works fine with the restriction you've mentioned [which is why it shouldn't be used yet, especially as COBOLers don't use many empty lines or add a lot of spaces].
But I do not understand why the working part works - shouldn't [\d\t ] only match digits, tabs and spaces? a-zA-Z are matched too - as it must be in this case.
If you're sure that the current FunctionList implementation has a problem with comment boundaries please open a bug ticket for that (I didn't find any).
To 1. Does it make sense to open a feature request ticket for that (it is related to all parsers of Function List)? Something like search and replace in displayed function names would take care of everything I can think of; the currently unused displayMode could be used for this.
To 2. I see the idea. The version you've posted removes paragraphs from the result and doesn't support newlines before/after section, I've changed this and will post a proposed patch tomorrow.
Here is the already working part for free-form reference format
Simon
Hi Simon,
I'm not a COBOLer myself so please excuse me if I didn't get the syntax right.
To 3. What do you mean with 'working part'?
What should [A-Za-z] match to too?
I did open a patch-ticket (#548) which includes a solution for the comment boundaries problem.
To 1. A feature request ticket makes sense. I don't know what Don et al. intended with the 'displayMode' attribute, but it would be nice to somehow be able to 'clean up' the function name. My current implementation removes comment zones and changes white space characters to a single space.
No problem :-) I've only used simple regex before (at least compared to the now suggested COBOL parser). We all have our strengths and we all can learn something from time to time.
This leads me to 3: you're right, 'working part' was confusing. I don't understand why your regex "(?m-s)(?:^[\d\t ]{6}\*|\*>).*$" works: it matches
too, but [\d\t ] should only match 0-9, not a-zA-Z, and the modifier doesn't change this.
What piece am I missing here?
To 1: As posted below, this wouldn't be necessary any more if multiple spaces are replaced by one and the comments are cut.
Therefore it's more a nice-to-have and I want to post the feature request after testing the results of the comment boundary fix.
Simon
Last edit: Simon Sobisch 2014-09-04
The expression
filters out comments and thus should not match 'aBzT'. At least it should not be visible in the FunctionList tree view.
It should match:
or
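As a separate illustration (not the examples originally posted here), the comment expression from this thread can be checked against a few lines in Python. This is a sketch under assumptions: the test lines are hypothetical, and because Python's re module does not accept the global "(?m-s)" prefix, MULTILINE is passed as a flag and DOTALL is simply left off, which should have the same effect. Behaviour inside Notepad++ may still differ.

```python
import re

# Comment expression from the thread, without the "(?m-s)" prefix (see above).
COMMENT_EXPR = re.compile(r"(?:^[\d\t ]{6}\*|\*>).*$", re.MULTILINE)

def comment_part(line):
    """Return the part of the line matched as a comment, or None."""
    match = COMMENT_EXPR.search(line)
    return match.group(0) if match else None

# Hypothetical test lines.
lines = [
    "000100* fixed-form comment",      # digits in columns 1-6, '*' in column 7
    "      * fixed-form comment",      # blanks in columns 1-6, '*' in column 7
    "aBzT  * letters in columns 1-6",  # letters are not in [\d\t ], no match
    "       MOVE A TO B. *> inline",   # free-form style inline comment
    "       PAR1.",                    # no comment at all
]
for line in lines:
    print(repr(line), "->", repr(comment_part(line)))
```

In this sketch the line starting with 'aBzT' is not treated as a comment, because the first alternative requires six digits, tabs or spaces before the '*'.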
What FunctionList does:
* replace each comment zone with one space char, i.e. prevent 'intArg1' in case of 'int/*comment*/Arg1';
* change two or more white-spaces to one space char, i.e. prevent '   ' in case of '/*comment*/ /*comment*/' or ' /*comment*/ ';
* remove leading and trailing spaces;
* remove the white-space character preceding any parenthesis or comma;
* remove the white-space character succeeding an opening parenthesis;
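The clean-up rules listed above can be sketched in Python as follows. This is only an illustration, not the code from the patch: the function name clean_function_name and the C-style comment pattern are assumptions chosen for the example.

```python
import re

# Hypothetical comment pattern (C/C++ block and line comments), for illustration.
C_COMMENT = re.compile(r"/\*.*?\*/|//[^\n]*", re.DOTALL)

def clean_function_name(raw, comment_re=C_COMMENT):
    """Apply the five clean-up rules to a raw function name."""
    name = comment_re.sub(" ", raw)            # each comment zone -> one space
    name = re.sub(r"\s{2,}", " ", name)        # two or more white-spaces -> one space
    name = name.strip()                        # remove leading and trailing spaces
    name = re.sub(r"\s([(),])", r"\1", name)   # no white-space before parenthesis or comma
    name = re.sub(r"\(\s", "(", name)          # no white-space after an opening parenthesis
    return name

print(clean_function_name("int/*comment*/Arg1"))          # int Arg1
print(clean_function_name("void f( /* x */ int a , int b )"))  # void f(int a, int b)
```

Passing a different comment_re (for example one built from a parser's commentExpr) would reuse the same steps for another language.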
Hmm ... I think it's nicer to extend the function name search with a 'replace' attribute for the clean up e.g.
to be able to customize.
To prevent function declarations in literal strings from being listed one could handle string literals as comment e.g. add
to the 'commentExpr' of the C++ parser.
Last edit: Menno Vogels 2014-10-13
I thought it matched
too (this must be matched, which is why I had .{6}\* which doesn't match newlines because of the modifier).
Did you test the COBOL parser with the new function list and the combination of simplified parser + commentExpr already?
Simon
Here is the full proposed patch:
I've added both reference formats, the user can change fixed/free in the association tag as long as the COBOL syntax highlighter isn't split in two highlighters.
I've added everything that doesn't work because of the comment boundaries issue as a comment. When this bug is fixed we can uncomment these parts (and remove the others).
The only current "bug" is that comments in the source code within "function" declarations (very uncommon) are shown as "function names" (likely solved when the comment boundaries issue is fixed, too), along with multiple spaces.
I suggest allowing a regex to be applied to the function names in the display; the not-yet-used displayMode attribute could be used for this.
Simon
And here are the test sources along with the results of the parsers:
Last edit: Simon Sobisch 2014-09-04
Hi Simon,
Adding the test sources along with the results is great. However, it's not clear to me whether or not it's the result you expected.
They are expected. For better results we need the reduction of multiple spaces in function names and working comment filtering (I expect comments not to be included in the string we have to filter with mainExpr, so they won't show up in the function names). As I've seen in [patches:#548] you have a working version. Please give it a try with TESTFIXED.cbl and TESTFREE.cbl by using the commented parts instead of the used ones and post the results here (the fixed-form variant with paragraphs should work, too).
The only "glitch" left afterwards is the FIXED sample with "NPAR" showing up while col 1-6 are ignored. This can be fixed by either filtering via displayMode (currently not possible) or by defining col 1-6 as comment, too (which likely needs changes in the fixed-form parser). I think this would be the better solution in any case, as it leaves fewer things for mainExpr and nameExpr to match, but we have to wait for the comment boundary fix first.
Edit: You've included it in the commentExpr already, I just haven't had the chance to test/adjust the parser.
Simon
Last edit: Simon Sobisch 2014-09-04
My bad, I should not have used "expected" but "wanted" or better yet "intended".
You want better results so the screen grabs don't show the intended result.
:)
Notepad++ Rev.1275 + Patch #548 + Scintilla v3.50 + Boost 1.56
TestFixed.Menno.1.png :
TestFixed.Menno.2.png :
TestFree.Menno.1.png :
TestFree.Menno.2.png :
FYI: \w == [A-Za-z0-9_]
Last edit: Menno Vogels 2014-09-06
Nice to see the comment boundaries working.
Just "academic": The "proper" version (concerning the language definition) of the "word"-part would be
instead of
But this would only filter out code that is wrong anyway, and all COBOL compilers would complain about that. For an editor, the more complex syntax brings no real gain (though I wouldn't say the same about its performance cost).
TestFixed.Menno.1.png is the better of your fixed versions, in any case SLC should be
as it doesn't matter at all what is placed in the first 6 columns (if it isn't [\n\r]+ but this is filtered via modifier already).
TestFree.Menno.1.png is the better of the free version.
Both free versions show a possible tweak for your [patches:#548]: replace occurrences of "box char" (likely [\n\r]) with spaces in the function name (before removing duplicate, leading and trailing spaces).
I'd like to find the best version for the COBOL parsers. Can you review patch 548 with the idea above and upload the necessary delta binary (maybe only notepad++.exe ?) for testing purposes?
And one thing I'm not sure about - an assumption about how the three-step filtering works:
converted with
to all comments replaced by spaces
matched with
to all unmatched entries removed
matched with
to the following entries in function list (using [EOE] for marking end of entry)
Is this assumption correct? If not, where and how does the implementation differ?
I guess with 548 applied there is an additional last step as described above before adding the name to the list, leading to the following names
Simon
I guess the 'word'-part is what I call identifier.
What is a "box char"?
Is there a (A|E)?BNF document for COBOL? That's what I usually start with for the regular expressions.
The steps don't quite work like that and yes ... with patch #548 there is a 4th step (see updated list above). Furthermore, step 2 has been altered ...
Step 2 w/o patch #548: the function definition has to start and end in the same non-comment zone (even if 'mainExpr' takes into account function-definition-embedded comments).
Step 2 w/ patch #548: the function definition has to start in a non-comment zone but can end in any succeeding non-comment zone, i.e. making it possible to have comment zones within the definition (as long as 'mainExpr' takes these embedded comments into account). Step 4 filters out these embedded comments.
e.g. C/C++ function definition (i.e. a step 3 result)
can become (i.e. a step 4 result)
Last edit: Menno Vogels 2014-09-07
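The two variants of step 2 described in this post can be sketched in Python. This is only an illustration of the rule as worded here, not the Notepad++ implementation: the helper names, the C-style comment pattern and the simplified mainExpr are all assumptions.

```python
import re

# Hypothetical stand-ins for a parser's commentExpr and mainExpr.
COMMENT_EXPR = re.compile(r"/\*.*?\*/|//[^\n]*", re.DOTALL)
MAIN_EXPR = re.compile(r"\w+\s+\w+\s*\([^)]*\)")

def comment_zones(text):
    """Step 1: (start, end) offsets of every comment zone; the text is not modified."""
    return [m.span() for m in COMMENT_EXPR.finditer(text)]

def in_comment(pos, zones):
    return any(start <= pos < end for start, end in zones)

def find_definitions(text, allow_embedded_comments):
    """Step 2: keep mainExpr matches according to the zone rule."""
    zones = comment_zones(text)
    kept = []
    for m in MAIN_EXPR.finditer(text):
        start, end = m.span()
        if in_comment(start, zones) or in_comment(end - 1, zones):
            continue  # must start and end in a non-comment zone
        embedded = any(start < z_start < end for z_start, _ in zones)
        if embedded and not allow_embedded_comments:
            continue  # w/o the patch: start and end in the same non-comment zone
        kept.append(m.group(0))
    return kept

SOURCE = "int f(/* a */ int x) {}\n// int g(int y) {}\nint h(int z) {}\n"
print(find_definitions(SOURCE, allow_embedded_comments=False))
print(find_definitions(SOURCE, allow_embedded_comments=True))
```

With the stricter rule only 'int h(int z)' survives; with embedded comments allowed, 'int f(/* a */ int x)' is kept as well and would then be cleaned up by step 4.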
With "box char" I referred to the images you've posted (it's the substitute for "not printable / not in the font"). I think they should be replaced by one space, at least if they are NULL or \n or \r.
Looks like COBOL BNF, but I haven't checked the quality:
http://tomcopeland.blogs.com/cobol.html
There is a good GNU-free grammar at https://sourceforge.net/p/open-cobol/code/HEAD/tree/branches/gnu-cobol-2.0/cobc/parser.y with the comment parts in the lexer https://sourceforge.net/p/open-cobol/code/HEAD/tree/branches/gnu-cobol-2.0/cobc/pplex.l but these are not "clean" (full of extensions and dialects).
Thank you for the clarification of the steps. But wouldn't substituting comments with spaces in step 2 make even more sense than the current behaviour with [patches:#548]?
Simon
Removing the comments in step 2 makes more sense.
However, for steps 1 to 3 the Scintilla API is called to apply the regular expressions. And I guess it's more efficient this way than to apply the 3 steps on a copy of the text, especially for the larger source files.
Keep in mind that you don't actually want to remove any text!
I'd have to dig deeper in Scintilla documentation to find out if it has an internal/shadow text buffer that can be manipulated.
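To make the trade-off concrete, here is a minimal Python sketch of the "copy of the text" idea: comments are blanked with the same number of spaces in a shadow copy, so offsets found in the copy still point at the right place in the untouched original. This says nothing about what Scintilla offers; the function name, the comment pattern and the simplified mainExpr are assumptions for illustration.

```python
import re

# Hypothetical stand-ins for a parser's commentExpr and mainExpr.
COMMENT_EXPR = re.compile(r"/\*.*?\*/|//[^\n]*", re.DOTALL)
MAIN_EXPR = re.compile(r"\w+\s+\w+\s*\([^)]*\)")

def blank_comments(text):
    """Return a copy in which every comment character except line breaks is a space."""
    def blank(match):
        return re.sub(r"[^\r\n]", " ", match.group(0))
    return COMMENT_EXPR.sub(blank, text)

original = "int f(/* a */ int x) {}\n// int g(int y) {}\nint h(int z) {}\n"
shadow = blank_comments(original)

# Same length, so match offsets in the shadow copy are valid in the original.
for m in MAIN_EXPR.finditer(shadow):
    print(m.span(), repr(original[m.start():m.end()]))
```

The cost is one extra copy of the buffer per parse, which is the efficiency concern raised above for larger source files.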