Menu

#251 TCoolEdit: Syntax highlighting for INI files

8
pending
5
2024-10-22
2024-10-05
No

Adding an INI parser will allow to detect read more easily not only owlmaker.ini but also other users ini files.

Related

Discussion: CoolEdit in C++Builder 10.x
Discussion: CoolEdit in C++Builder 10.x
Feature Requests: #252
Feature Requests: #268
Feature Requests: #269
Wiki: OWLNext_Roadmap_and_Prereleases

Discussion

  • Sebastian Ledesma

    Commit [r7137] on Code it's the first implementation.
    To use into OWLMaker / CoolPrj
    1 - Add to the build system/makefile
    2- Add $(OWLROOT)\source\coolprj\lang.cpp

    /*extern*/ TSyntaxParser* IniParserCreator(TCoolTextWnd* parent);
    
    TParserSelector::TParserInfo TParserSelector::ParserInfo [] = {
    ...
      { _T("*.ini"), IniParserCreator, 0 },  // <- New
      { _T("*.*"),  plainTextParserCreator, 0  }, // last any file
    };
    

    to

     
    👍
    1

    Related

    Commit: [r7137]

  • Sebastian Ledesma

    Some clean-ups are pending as I want to decide the detail level of the parser vs speed of parsing.

    Also the idea it's not to use any of the original code, just being compatible at API level with original source code, so it's clear that this is our work.

     
    👍
    1
  • Vidar Hasfjord

    Vidar Hasfjord - 2024-10-08

    Cool, nice to have INI parsing!

    int nLength = Parent->GetLineLength(nLine);
    LPCTSTR pszChars = GetLineText(nLine);
    

    Warning: Note that GetLineText does not return a zero-terminated string. So naming the variable "pszChars" is misleading, as the Hungarian notation prefix "psz" falsely claims that the string pointed to is zero-terminated. As a general rule, do not use Hungarian notation (see our Coding Standards).

    I have now added TCoolTextWnd::GetLine [r7139], which returns a tstring_view encapsulating the line text pointer (retrieved by GetLineText) and the line length (retrieved by GetLineLength). A tstring_view acts much like a tstring, and can be used and passed around safely without any string copying.

    const auto line = Parent->GetLine(lineIndex);
    

    Here is how you can use a tstring_view for your IsIniNumber utility function:

    auto IsIniNumber_(tstring_view s) -> bool
    {
      const auto f = s.empty() ? _T('\0') : s[0];
      const auto b = (f == _T('-') || f == _T('+')) ? ++begin(s) : begin(s); // Skip sign.
      const auto e = end(s);
      return b != e && all_of(b, e, [](tchar c) { return c >= _T('0') && c <= _T('9'); });
    }
    
    // ...
    
    if (IsIniNumber_(line.substr(i + 1)))
    {
      // ...
    }
    
     

    Related

    Commit: [r7139]
    Wiki: Coding_Standards


    Last edit: Vidar Hasfjord 2024-10-08
    • Sebastian Ledesma

      Warning: Note that GetLineText does not return a zero-terminated string. So naming the variable "pszChars" is misleading, as the Hungarian notation prefix "psz" falsely claims that the string pointed to is zero-terminated. As a general rule, do not use Hungarian notation (see our Coding Standards).

      You are right. I was thinking to rename it to

      const _TCHAR *pText = GetLineText(nLine);
      

      or something like that.
      Also in other review - two or more years ago- I've worked in a version that ensured the '\0',
      perhaps we should modify the API to force to pass int *len to ensure that the user will know that the line/buffer it's not null-ended.

       
  • Vidar Hasfjord

    Vidar Hasfjord - 2024-10-08

    @sebas_ledesma wrote:

    perhaps we should modify the API to force to pass [the string length]

    I agree it would be nice to improve the API and deprecate the unsafe ways. I would think tstring_view is ideal for this, and TCoolTextWnd::GetLine should be a great improvement for handling CoolEdit text lines safely, since tstring_view can be used easily and safely much like a tstring (for iteration, substrings, etc.). Wider use of tstring_view throughout OWLNext is something I would like to see in version 8.

    See "Support for string_view" [feature-requests:#127].

    In the same vein, I would also like to see use of std::span to package any buffer pointer and size throughout OWLNext. Alas, the support for std::span in the Embarcadero toolsets isn't there yet, as it is a C++20 feature.

     
    👍
    1

    Related

    Feature Requests: #127

  • Vidar Hasfjord

    Vidar Hasfjord - 2024-10-09

    Hi Sebastian,

    I have now had a closer look at your parser code. It works pretty well, and I couldn't find any big issues, but as you note in the code, it doesn't categorise all syntax elements, and the categorisation it does is not fully accurate. In particular, a lot of whitespace is not categorised, so whitespace is incorrectly included in other elements. To explore this further, I got Claude 3.5 Sonnet to help me write an alternative implementation using regular expressions to categorise the elements more precisely.

    In the following screenshot, your original implementation is on the left, and my alternative implementation is on the right:

    INI parser screenshot

    In your original implementation on the left, note that leading whitespace (indentation) gets the same highlighting as the key names. Also, the INI format allows insignificant whitespace around the section name (within the braces), as well as after the key name, but this is not categorised.

    My alternative implementation on the right fixes these issues.

    I have updated your code to deal with the leading whitespace, among other improvements and refactoring [r7142], and I've included my alternative implementation in the code as well. Note that my implementation uses regular expressions and may hence be slower, so may not be preferable for production release. However, you may find it useful during development, for testing and comparison with the main implementation. My implementation is deactivated by default (you may activate it in the source code).

    PS. I've now deprecated TSyntaxParser::GetLineText, replacing it by GetLine returning tstring_view [r7140]. This change now produces warnings in all the old parsers. If you want to help clean them up, see my cleanup of TISSyntaxParser [r7141] as an example of the changes needed.

     
    👍
    1

    Related

    Commit: [r7140]
    Commit: [r7141]
    Commit: [r7142]


    Last edit: Vidar Hasfjord 2025-08-29
  • Sebastian Ledesma

    Hi Vidar:

    As always your code excels.

    As I said before, for years I've tried to make work CoolPrj in Borland/Embarcadero/Codegear and now finally it's starting to work, so it allowed me to integrate the functionallity into a project.

    Regarding the TIniSyntaxParser it's always a trade off between parsing precision and speed. At first time I didnt think to have enought experience evaluating that, so I've opted for a simple solution ( I remember the time when both MSVS-Intellisense and Codegear/Embarcadero taked some milliseconds to open the context menu when editing ).

    Some minor observations in the example:
    In [Section]
    key3 = 42
    key 4 = 42_

    both should be evaluated as integer (both parsers fail at this point).

     
    👍
    1
  • Vidar Hasfjord

    Vidar Hasfjord - 2024-10-10

    Thanks Sebastian,

    On the INI parser, go with the implementation you prefer and is most comfortable with.

    both [values in "key3= 42" and "key 4=42 "] should be evaluated as integer

    Well, it depends. If the value is really just representing an integer, sure. But what if these values are passwords? Then the space is significant, and whether the digits form an integer is insignificant.

    Formally, everything after the '=' operator is part of the value. Hence, in no circumstance should whitespace within the value be categorised and highlighted as insignificant whitespace.

    The question is whether you want to categorise and highlight the value in parts or as a whole.

    If you want to highlight the value in parts, then "42 " could be highlighted as a number "42", followed by a string " ". Likewise for " 42" and "42m". But then "3.14" would be highlighted as a number "3", followed by a string ".", followed by a number "14", unless you also parse real numbers, not just integers.

    If you want to highlight the value as a whole, and you want to allow whitespace around numbers, then "42 " would be highlighted as a number, including the trailing whitespace. Likewise for " 42" and " 42 ". But "42m" would be highlighted as a string.

    Edit: In the attached patch, I have now added this functionality to the parsers: The main implementation (state machine) now highlights all integers within the value, while the alternative implementation (regular expression) now allows an integer to have surrounding whitespace. The latter still does not highlight the value in parts, but just highlights it as a whole, for the sake of simplicity. If you prefer in-parts highlighting, then the regular expressions can be extended to implement this as well.

    In the following screenshot, the main implementation is on the left, and the simpler alternative implementation is on the right:

    INI parser screenshot

    As can be seen in the screenshot, the main parser now also correctly highlights insignificant whitespace after the key name, which means the highlighting of the key name is now fully accurate in terms of what is seen by the Windows API. However, the section header is still parsed and highlighted as a whole. More parser states will need to be added to make this fully accurate, thereby increasing the parser complexity somewhat.

    The parsers now also correctly detect some syntax errors they didn't before.

    Note that I will not commit this patch, but I'll leave it here for you to use as you see fit.

    For your convenience, I have also attached the INI file and the highlight settings I've used for testing.

     

    Last edit: Vidar Hasfjord 2025-08-29
  • Vidar Hasfjord

    Vidar Hasfjord - 2024-10-14

    Hi Sebastian,

    As part of my review of the code for the old parsers, I found this:

                      goto out;
                    }
                }
              bRedefineBlock = FALSE;
              bDecIndex = FALSE;
            }
    out:
    

    https://sourceforge.net/p/owlnext/code/7152/tree/trunk/source/coolprj/html.cpp#l413

    The old CoolPrj code for the parsers is horrible, horrible, horrible. It is utterly unreadable, and I've found several bugs and issues already. I noticed that it is also the origin of the misleading "pszText" naming issue I chided you for earlier. Please don't copy-paste from this old legacy code. Studying it for insight is fine. But don't reuse. Refactor and rewrite in modern C++, while adhering to our Coding Standards.

    On a positive note, since the code for the old parsers is obviously copy-pasted from each other, cleaning it up and rewriting it could be feasible, as refactoring work would likely be applicable across the board (as I've discussed in the discussion forum).

    If you plan on more parsers, I recommend using regular expressions as a starting point, then optimise later if necessary, e.g. using a clean finite-state machine algorithm, like in the current main implementation of your INI parser.

     

    Related

    Wiki: Coding_Standards

    • Sebastian Ledesma

      I particulary prefer a finite-state machine. I know they all start small and then grow when you increase the detail level.
      I think that they are faster than using regular expression, but I didnt did any performance comparison.

       
      👍
      1
  • Vidar Hasfjord

    Vidar Hasfjord - 2024-10-19
    • summary: Add INI parser for syntax highlighting --> TCoolEdit: Syntax highlighting for INI files
     
  • Vidar Hasfjord

    Vidar Hasfjord - 2024-10-21

    Hi Sebastian, I have refactored the code (yet again) [r7181]. I hope you are fine with the new code organisation and style. On the functionality of the parser, as discussed earlier, I'd leave it up to you whether you want to do more work on that (more accurate parsing of the section header, in particular). If not, set the status of this ticket to "pending".

    And to all: Please review and test.

     
    👍
    1

    Related

    Commit: [r7181]

  • Sebastian Ledesma

    Regarding name section, there is no specific rule that the name must be alphanumeric, or limited the ASCII characters.
    So it should be valid to have a section called:
    [😊]
    So, I dont see any reason to increase the strictness at this moment.
    Moreover, I've opened the TIniSyntaxParser-test.ini in notepad++ and I'm delighted to see that our (mostly yours) implementation is superior. Congrats!!

     
    👍
    1
  • Sebastian Ledesma

    • status: open --> pending
     
  • Vidar Hasfjord

    Vidar Hasfjord - 2024-10-22

    @sebas_ledesma wrote:

    I particularly prefer a finite-state machine. [...] I think that they are faster than using regular expression

    Yeah, though convenient and succinct, regular expressions (RE) can be notoriously slow.

    Interestingly, the implementation of RE is often done by constructing a state machine from the expression. Then this machine is executed when you try to match the RE with a text. However, parsing the expression and constructing the state machine can be slow if not optimised. Very clever implementations do this at compile-time, and I think this has been proposed for the C++ standard library. Anyway, for the current slow implementations, a good optimisation is to make the RE a static variable, so that it is constructed only once, and then reused thereafter.

    Regarding name section, there is no specific rule that the name must be alphanumeric, or limited the ASCII characters. So it should be valid to have a section called: [😊]

    By more accurate, I didn't mean more restrictive. If you look closely, my alternative implementation using RE did allow pretty much anything for the name.

    However, it was more accurate in the sense that it individually recognised the constituent parts of the section element, i.e. brackets, whitespace and name. For example, if you put spaces around the smiley, i.e. [ 😊 ], they are not part of the name. The name is just the smiley (according to how the Windows API sees it).

    You can add states to the current implementation to deal with this. However, you may prefer highlighting the whole element instead of its constituent parts.

     
  • Vidar Hasfjord

    Vidar Hasfjord - 2024-10-22

    @sebas_ledesma wrote:

    I've opened the TIniSyntaxParser-test.ini in notepad++ and I'm delighted to see that our (mostly yours) implementation is superior. Congrats!!

    Cool!

     

Anonymous
Anonymous

Add attachments
Cancel





MongoDB Logo MongoDB