
#54 Unicode support in CoolPrj

8
pending
1
2026-03-09
2012-08-31
No

Currently, CoolPrj isn't fully UNICODE aware.

Related

Bugs: #285
Bugs: #536
Discussion: CoolEdit issues
Discussion: CoolEdit and C++Builder
Feature Requests: #266
Feature Requests: #267
Feature Requests: #54
News: 2022/07/owlmaker-build-6160-update
Wiki: Frequently_Asked_Questions
Wiki: OWLNext_Roadmap_and_Prereleases
Wiki: OWLNext_modules_description

Discussion

1 2 > >> (Page 1 of 2)
  • Sebastian Ledesma

    • assigned_to: Sebastian Ledesma
     
  • Sebastian Ledesma

    We need to add a member variable 'encodingType' to TCoolFileBuffer.
    When opening a file we must detect the encoding type; when reading the text file, the appropriate conversion will be done:

    • ANSI, UTF8 -> UTF16 in UNICODE builds
    • UTF8, UTF16 -> ANSI in ANSI builds.
      When saving the file, the encoding must be preserved, so before the first line is saved the BOM will be written (for non-ANSI files), and the text is then converted from the native charset back to the encoding of the original file.
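    The detection step described above could look roughly like the following minimal sketch. It is an illustration only, not the CoolPrj implementation: the names TFileEncoding, TBomResult and DetectBom are hypothetical, only the UTF-8 and UTF-16 byte order marks are recognised, and a file without a BOM is assumed to be ANSI.

```cpp
#include <cstddef>
#include <string>

// Hypothetical encoding identifiers; Ansi is the "no BOM" default.
enum class TFileEncoding { Ansi, Utf8, Utf16be, Utf16le };

struct TBomResult
{
  TFileEncoding Encoding; // Encoding implied by the BOM (Ansi if none found).
  std::size_t BomSize;    // Number of leading bytes the caller should skip.
};

// Inspects the first bytes of a file (passed as a byte string) for a BOM.
TBomResult DetectBom(const std::string& head)
{
  const auto at = [&](std::size_t i) { return static_cast<unsigned char>(head[i]); };

  if (head.size() >= 3 && at(0) == 0xEF && at(1) == 0xBB && at(2) == 0xBF)
    return {TFileEncoding::Utf8, 3};
  if (head.size() >= 2 && at(0) == 0xFE && at(1) == 0xFF)
    return {TFileEncoding::Utf16be, 2};
  if (head.size() >= 2 && at(0) == 0xFF && at(1) == 0xFE)
    return {TFileEncoding::Utf16le, 2};
  return {TFileEncoding::Ansi, 0}; // No BOM: assume ANSI.
}
```

    The caller would skip BomSize bytes before converting the rest of the buffer. If UTF-32 were added, its little-endian BOM (FF FE 00 00) would have to be tested before the UTF-16 one, since both start with FF FE.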
     
    👍
    1
  • Vidar Hasfjord

    Vidar Hasfjord - 2022-01-07

    Hi Sebastian,

    If your work on this is specific to 6.44, and it would be cumbersome for you to work on the trunk (7.1), then note that you can create a feature branch for this work, e.g. "branches/coolprj-unicode", by copying the current release branch for 6.44. Then when done, we can merge your changes into the trunk, and if appropriate, into the release branches. The feature branch can then be removed, if no longer needed.

     
  • Sebastian Ledesma

    My initial intention is to support C++Builder in both 6.44.x and 7.x (as my long-term goal is to migrate to OWLNext 7).
    Once I have coolprj running, I will be able to work on UNICODE support.

     
  • Vidar Hasfjord

    Vidar Hasfjord - 2022-01-07

    @sebas_ledesma wrote:

    We need to add a member variable 'encodingType'

    Makes sense. I suggest "Encoding" as the member name. Member names should have a leading capital letter (see our Coding Standards).

    When saving the file, the encoding must be preserved

    Makes sense. Perhaps there should be an option to select encoding in the save dialog, as in Notepad? If not, what would be the encoding for new files?

    My initial intention [is] to support C++Builder both in 6.44.x and 7.x

    Good. Then I suggest you work on the trunk, and when done, we'll merge your work to the release branches in preparation for the next releases. However, if you feel it would be helpful to have a feature branch for working on 6.44, let me know if you need help to set it up (if you have TortoiseSVN installed, it is simply a copy in the Repository Browser).

     

    Related

    Wiki: Coding_Standards

  • Sebastian Ledesma

    I migrated my code to UNICODE more than 5 years ago.
    One of our applications is a script editor (for ads and music scheduling) that uses text files.
    So, I've developed a couple of functions to detect/write the BOM and get/put lines with the corresponding encoding.
    For the UNICODE version I've silently switched to UTF-16 when saving. But for OWLMaker or other user applications, the GUI should have an option to select the encoding.

     
    👍
    1
  • Vidar Hasfjord

    Vidar Hasfjord - 2022-01-08

    In discussion:

    https://sourceforge.net/p/owlnext/discussion/97175/thread/8cbc674381/?limit=100#7630

    @sebas_ledesma wrote:

    This is a sample code to detect BOM and load a text file with the corresponding translation.

    I've done a quick code review. First, the important stuff:

    • You need error checking.
    • Your myFgets function may truncate the line read due to the use of fixed-size buffers. Tip: Allocate buffers (e.g. std::vector) of size 4 times maxLen instead. That should cover even extreme cases, in which every character code unit in the target buffer is equivalent to 4 code units in the source input encoding.

    Second, some suggestions on coding style:

    • Use our naming conventions (e.g. "ReadLine_" for a file-local function).
    • Use an anonymous namespace for file-local functions.
    • Use an enum for the encoding.
    • Use de-facto standard parameter names "buf" and "bufSize".
    • Break up long functions into sub-functions.
    • Don't use the preprocessor and conditional code if you can help it. Here you can instead just define overloads. The correct functions will be selected depending on the build mode, and the others will be unused and optimised away. As a bonus, you get any compilation errors in either implementation, regardless of the build mode.

    namespace {
    
    LPSTR ReadLine_Utf8_(LPSTR buf, int bufSize, FILE* file) {/*...*/}
    LPSTR ReadLine_Utf16be_(LPSTR buf, int bufSize, FILE* file) {/*...*/}
    LPSTR ReadLine_Utf16le_(LPSTR buf, int bufSize, FILE* file) {/*...*/}
    LPSTR ReadLine_Ansi_(LPSTR buf, int bufSize, FILE* file) {/*...*/}
    
    LPWSTR ReadLine_Utf8_(LPWSTR buf, int bufSize, FILE* file) {/*...*/}
    LPWSTR ReadLine_Utf16be_(LPWSTR buf, int bufSize, FILE* file) {/*...*/}
    LPWSTR ReadLine_Utf16le_(LPWSTR buf, int bufSize, FILE* file) {/*...*/}
    LPWSTR ReadLine_Ansi_(LPWSTR buf, int bufSize, FILE* file) {/*...*/}
    
    enum TEncoding_ { eUtf8, eUtf16be, eUtf16le, eAnsi };
    
    LPTSTR ReadLine_(LPTSTR buf, int bufSize, FILE* file, TEncoding_ encoding)
    {
      switch (encoding)
      {
        case eUtf8: return ReadLine_Utf8_(buf, bufSize, file);
        case eUtf16be: return ReadLine_Utf16be_(buf, bufSize, file);
        case eUtf16le: return ReadLine_Utf16le_(buf, bufSize, file);
        case eAnsi: return ReadLine_Ansi_(buf, bufSize, file);
      }
      CHECKX(false, _T("ReadLine_: Unknown encoding!"));
      return NULL;
    }
    
    } // namespace
    

    Finally, a few suggestions and some nitpicking:

    • You can use std::wstring_convert and std::codecvt to convert between encodings. (May not work for 6.44, though, which has support for older compilers.)
    • You can use std::transform to swap the byte order of a whole string.
    • You can use std::swap to swap two values.
    • Declare variables where they are used.
    • Don't repeat yourself (e.g. define constants).
    • Don't use magic numbers (define constants).
    • Use English for code comments. :-)
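    As an illustration of the std::transform tip above, here is a minimal sketch that swaps the byte order of every UTF-16 code unit in a string, e.g. to turn UTF-16BE input into the native little-endian form. The function names SwapBytes and SwapByteOrder are hypothetical, and std::u16string is used here as an assumption to keep the example portable.

```cpp
#include <algorithm>
#include <string>

// Swaps the high and low byte of a single UTF-16 code unit.
inline char16_t SwapBytes(char16_t c)
{
  return static_cast<char16_t>((c << 8) | (c >> 8));
}

// Swaps the byte order of every code unit in the string, in place.
void SwapByteOrder(std::u16string& s)
{
  std::transform(s.begin(), s.end(), s.begin(), SwapBytes);
}
```

    Applying SwapByteOrder twice restores the original string, so the same helper can serve both reading and writing.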
     

    Last edit: Vidar Hasfjord 2026-04-14
  • Sebastian Ledesma

    TEncoding should belong to the owl namespace, or at least to coolproj namespace.
    Explanation: the developer will need to access that value and implement the user interface that lets the user select a different encoding when saving a file.

    This will also allow the same mechanism to be implemented in the TEditFile class.

    Last but not least, eAnsi should have value 0, as it's our default value.
    I'm not sure whether to assign 1, 2, 3 to the other "eOptions" or to use the BOM itself as the value.

     
  • Vidar Hasfjord

    Vidar Hasfjord - 2022-01-08

    @sebas_ledesma wrote:

    TEncoding should belong to the owl namespace, or at least to coolproj namespace.
    Explanation: the developer will need to access that value and implement the user interface that lets the user select a different encoding when saving a file.

    Makes sense. My code example was just for exposition. I propose nesting TEncoding within TCoolTextBuffer (similar to TCoolTextBuffer::TCrLfStyle). I guess it is within TCoolTextBuffer that you need to extend the API with encoding support.

    eAnsi should [have] value 0, as it's our default value.

    Makes much sense.
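    A minimal sketch of what this proposal might look like follows. The class TTextBufferSketch is a hypothetical stand-in for TCoolTextBuffer, and the accessor names are assumptions; only the nested enum, the "Encoding" member name and eAnsi being 0 come from the discussion. The default member initialiser requires C++11, so older compilers supported by 6.44 would need a constructor instead.

```cpp
// Hypothetical stand-in for TCoolTextBuffer, showing only the encoding API.
class TTextBufferSketch
{
public:
  // Nested next to other buffer options, as suggested for TCrLfStyle.
  enum TEncoding
  {
    eAnsi = 0, // Default: no BOM, system code page.
    eUtf8,
    eUtf16be,
    eUtf16le
  };

  TEncoding GetEncoding() const { return Encoding; }
  void SetEncoding(TEncoding encoding) { Encoding = encoding; }

private:
  TEncoding Encoding = eAnsi; // New buffers start out as ANSI.
};
```

    A save dialog could then call SetEncoding with the user's choice before the buffer is written.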

    PS: By the way, I've attached full code in my previous post. It has error handling and further refactoring for code reuse (conversion functions now only used at a single call site within helper functions, which are fully reused). Though, beware, it is not tested in any way.

     
    👍
    1
  • Sebastian Ledesma

    Here is some sample code for writeBOM.

    void myWriteBOM(FILE *fp, int encodingType) {
        const char bom_xEF=0xef, bom_xBB=0xbb, bom_xBF=0xbf; //UTF8 BOM
        const char bom_xFF=0xff, bom_xFE=0xfe;
        switch (encodingType) {
            case 1: //UTF8
                fputc(bom_xEF, fp);
                fputc(bom_xBB, fp);
                fputc(bom_xBF, fp);
                break;
    
            case 2: //BOM UTF16 Big Endian
                fputc(bom_xFE, fp);
                fputc(bom_xFF, fp);
                break;
    
            case 3://BOM UTF16 Little Endian (Windows default).
                fputc(bom_xFF, fp);
                fputc(bom_xFE, fp);
                break;
    
            case 4://UTF32 BIG ENDIAN with BOM
                fputc(0, fp);
                fputc(0, fp);
                fputc(0xFF, fp);
                fputc(0xFF, fp);
                break;
    
            case 5: //UTF32 little endian with BOM
                fputc(0xff, fp);
                fputc(0xff, fp);
                fputc(0x00, fp);
                fputc(0x00, fp);
                break;
    
            default: //0=ANSI, nothing to do.
                ;
        }
    }
    

    Traditionally, text files were 'headerless'. Now, in the UNICODE world, a few bytes at the beginning of the file, the Byte Order Mark (BOM), indicate the encoding.
    For those living in the DOS / Windows world, its omission means ANSI, but for those coming from Mac / Android, its omission normally means UTF-8.
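    One way to deal with this ambiguity, not proposed anywhere in this thread and shown purely as an assumption, is to check whether a BOM-less file is well-formed UTF-8 and fall back to ANSI otherwise. The sketch below uses the hypothetical name LooksLikeUtf8. It is a simplified check: it validates lead and continuation bytes but does not reject every overlong form or encoded surrogate. Pure ASCII text passes, which is harmless since ASCII reads the same either way.

```cpp
#include <cstddef>
#include <string>

// Returns true if the byte string is structurally valid UTF-8.
bool LooksLikeUtf8(const std::string& bytes)
{
  const std::size_t n = bytes.size();
  std::size_t i = 0;
  while (i < n)
  {
    const unsigned char lead = static_cast<unsigned char>(bytes[i]);
    std::size_t extra = 0; // Number of continuation bytes expected.
    if (lead < 0x80) extra = 0;
    else if (lead >= 0xC2 && lead <= 0xDF) extra = 1;
    else if (lead >= 0xE0 && lead <= 0xEF) extra = 2;
    else if (lead >= 0xF0 && lead <= 0xF4) extra = 3;
    else return false; // Stray continuation byte or invalid lead byte.

    if (extra > n - i - 1) return false; // Sequence is cut off at the end.
    for (std::size_t k = 1; k <= extra; ++k)
      if ((static_cast<unsigned char>(bytes[i + k]) & 0xC0) != 0x80)
        return false; // Not a continuation byte (10xxxxxx).
    i += extra + 1;
  }
  return true;
}
```

    Text in a Windows code page that contains accented characters almost never passes this check, which is why the heuristic tends to work well in practice.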

     

    Last edit: Sebastian Ledesma 2022-01-10
  • Sebastian Ledesma

    Here is my code to write text with Unicode support.
    Note: TCoolFileBuffer::Load uses file handles instead of "FILE *".
    Also, it loads the file into a memory buffer and segments the text into lines.
    So, the updated code will be similar.

    int myFputs(const _TCHAR *s, FILE * fp, int encodingType) {
    
    char utf8String[8192];
    wchar_t wString[8192];
    int nByte=0;
    BYTE *pB;
    
    #if defined UNICODE
        switch (encodingType) {
            case 1: //convert to UTF8:
                WideCharToMultiByte(CP_UTF8, 0, s, -1, utf8String, 8192, NULL, NULL);
                utf8String[8191]='\0';
                nByte=strlen(utf8String);
                nByte=fwrite(utf8String, 1, nByte, fp);
                break;
            case 2: //UTF16 big endian (not supported)
                wcscpy(wString, s);
                nByte=wcslen(s);
                return -1;
            case 3: //UTF-16LE, the Windows default
                nByte=wcslen(s);
                fwrite(s, nByte, 2, fp);
                //fputws(s, fp);
                break;
            case 0: //convert to ANSI.
                WideCharToMultiByte(CP_ACP, 0, s, -1, utf8String, 8192, NULL, NULL);
                utf8String[8191]='\0';
                nByte=strlen(utf8String);
                nByte=fwrite(utf8String, 1, nByte, fp);
                break;
        }
    #else
        switch (encodingType) {
            case 1: //convert to UTF8
                MultiByteToWideChar(CP_ACP, 0, s, -1, wString, 8192);
                WideCharToMultiByte(CP_UTF8, 0, wString, -1, utf8String, 8192, NULL, NULL);
                utf8String[8191]='\0';
                nByte=strlen(utf8String);
                nByte=fwrite(utf8String, 1, nByte, fp);
                break;
            case 2: //UTF16 big endian
                MultiByteToWideChar(CP_ACP, 0, s, -1, wString, 8192);
                nByte=wcslen(wString);
    //          nByte=fwrite(wString, 2, nByte, fp);
                return -1;
            case 3: //UTF16 little endian
                MultiByteToWideChar(CP_ACP, 0, s, -1, wString, 8192);
                nByte=wcslen(wString);
                nByte=fwrite(wString, 2, nByte, fp);
                break;
            case 0: //ANSI needs no conversion.
                nByte=strlen(s);
                nByte=fwrite(s, 1, nByte, fp);
                break;
        }
    #endif
    
    return nByte;
    }
    
     
  • Vidar Hasfjord

    Vidar Hasfjord - 2022-01-10

    In discussion:

    https://sourceforge.net/p/owlnext/discussion/97175/thread/8cbc674381/#786e

    @sebas_ledesma wrote:

    I did a couple of commits regarding [feature-requests:#54] but I still cannot make it work in C++Builder I'm using a couple a 'dummy' variables in TCoolFileBuffer and TCoolEditFile to avoid memory overwriting in c++.

    Great work, Sebastian, and thanks for the comprehensive log messages! A good log makes it easier to review development work. Regarding the delete-bug [r5719], I've created a dedicated ticket [bugs:#511], and updated your log message, adding a reference. Note that another instance of this issue remains for SyntaxArray in TCoolTextBuffer::SetSyntaxArray, where SyntaxArray is still deallocated using plain delete. I've added a comment in the log message about this.

    I notice that you have (accidentally) committed your changes to the 7.0 release branch.

    Note: Please do not change our release branches! That should be done by release managers only.

    Remember, development work should ideally be performed on the trunk. Here is the correct checkout address for the trunk (you can use the SVN command "svn switch" to flip your working copy to the trunk):

    https://svn.code.sf.net/p/owlnext/code/trunk
    

    Accordingly, I have moved your work to the trunk and restored the 7.0 release branch.

    PS. If you cannot work on the trunk for some reason, or it is more practical for you to base your work on one of our existing releases, then you can create a temporary development branch, as mentioned earlier (let me know if you need help to set it up, but if you have TortoiseSVN installed, it is simply a copy in the Repository Browser).

     
    👍
    1

    Related

    Bugs: #511
    Commit: [r5719]
    Feature Requests: #54

  • Sebastian Ledesma

    I usually work on my copy of OWL6.44. I've created a copy of 'Trunk' to update it but it seems that I've used the OWLNext 7 link instead.
    Thanks.

     
    👍
    1

    Last edit: Sebastian Ledesma 2022-01-10
  • Vidar Hasfjord

    Vidar Hasfjord - 2022-01-10

    I usually work in my copy of OWL6.44. I've created a copy of 'Trunk' to update it but it seems that I've used the OWLNext 7 link instead.

    No problem! If it would be easier for you, we can set up a development branch based on 6.44. Just let me know if you need help to do so.

     
  • Vidar Hasfjord

    Vidar Hasfjord - 2022-01-10

    Here it's a sample code of writeBOM [...and...] myFputs.

    Quick code review:

    • The byte order marks for UTF-32 are wrong.
    • You need error checking.
    • Your myFputs function may truncate lines due to the use of fixed-size buffers. Tip: Use dynamic allocation instead. MultiByteToWideChar and WideCharToMultiByte can be called with a nullptr to calculate the size of the buffer needed.
    • For feedback on coding style etc, see my earlier code review for myFgets.
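    The truncation point can be illustrated with a portable sketch. It deliberately does not call WideCharToMultiByte; instead a hand-written UTF-16 to UTF-8 encoder (hypothetical names AppendUtf8 and Utf16ToUtf8) appends to a std::string that grows as needed, so no line length limit applies. With the Windows API, the same effect is achieved by first calling the conversion function with a null output buffer to obtain the required size, then allocating exactly that much.

```cpp
#include <cstddef>
#include <string>

// Appends the UTF-8 encoding of one code point to `out`.
void AppendUtf8(std::string& out, char32_t cp)
{
  if (cp < 0x80)
    out += static_cast<char>(cp);
  else if (cp < 0x800)
  {
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
  else if (cp < 0x10000)
  {
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
  else
  {
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
}

// Converts a whole UTF-16 line to UTF-8 without any fixed-size buffer.
std::string Utf16ToUtf8(const std::u16string& s)
{
  std::string out;
  out.reserve(s.size() * 3); // At most 3 UTF-8 bytes per UTF-16 code unit.
  for (std::size_t i = 0; i < s.size(); ++i)
  {
    char32_t cp = s[i];
    const bool isHigh = cp >= 0xD800 && cp <= 0xDBFF;
    if (isHigh && i + 1 < s.size() && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF)
    {
      // Combine a surrogate pair into one code point.
      const char32_t low = s[i + 1];
      cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
      ++i;
    }
    else if (cp >= 0xD800 && cp <= 0xDFFF)
      cp = 0xFFFD; // Unpaired surrogate: substitute U+FFFD.
    AppendUtf8(out, cp);
  }
  return out;
}
```

    The returned string can be passed to fwrite with its size(), and the result of fwrite compared against that size to cover the error-checking point as well.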

    Here is some (untested) modern C++ for writing the BOM:

    namespace {
    
    enum TEncoding_ { eAnsi, eUtf8, eUtf16be, eUtf16le, eUtf32be, eUtf32le };
    using TBom_ = string_view;
    
    auto GetBom_(TEncoding_ encoding) -> TBom_
    {
      static const auto utf8 = array{'\xEF', '\xBB', '\xBF'};
      static const auto utf16be = array{'\xFE', '\xFF'};
      static const auto utf16le = array{'\xFF', '\xFE'};
      static const auto utf32be = array{'\x00', '\x00', '\xFE', '\xFF'};
      static const auto utf32le = array{'\xFF', '\xFE', '\x00', '\x00'};
    
      const auto bom = [](auto& a) { return TBom_{a.data(), a.size()}; };
    
      switch (encoding)
      {
      case eAnsi: return {};
      case eUtf8: return bom(utf8);
      case eUtf16be: return bom(utf16be);
      case eUtf16le: return bom(utf16le);
      case eUtf32be: return bom(utf32be);
      case eUtf32le: return bom(utf32le);
      }
      CHECKX(false, _T("GetBom_: Unknown encoding!"));
      return {};
    }
    
    void WriteBom_(FILE* file, TEncoding_ encoding)
    {
      const auto bom = GetBom_(encoding);
      if (bom.empty()) return;
      const auto n = fwrite(bom.data(), sizeof(bom[0]), bom.size(), file);
      if (n != bom.size()) throw runtime_error("WriteBom_: fwrite failed");
    }
    
    } // namespace
    

    PS: Cool to see that the switch-statement leads to better optimisation than hand-coded lookup. The compiler is really good at switch-statements! See: https://godbolt.org/z/7GaGrv19z

     

    Last edit: Vidar Hasfjord 2026-04-14
  • Sebastian Ledesma

    Please note that [r5725] respects the internal usage of 'text' as an array of characters without a terminating zero. However, the documentation for CF_TEXT indicates that a terminating zero is used. I kept the CoolEdit usage to avoid breaking some internal mechanism, and I will review it later when I have more time (and more knowledge of CoolEdit internals).
    https://docs.microsoft.com/en-us/windows/win32/dataxchg/standard-clipboard-formats

     

    Related

    Commit: [r5725]

  • Vidar Hasfjord

    Vidar Hasfjord - 2022-01-12

    Hi Sebastian, good work on fixing Unicode issues and more in CoolPrj!

    Regarding your log messages, note that the tag "BUG" is short for "bugfix" (so no need to say "fix for bug"), and our current coding style is to just use the bug ticket title as the primary log message, with the ticket reference at the end. See Coding Standards.

    Not a big deal, but I've cleaned up your log messages, ticket titles and descriptions a little. Hope you don't mind!

     
    👍
    1
  • Sebastian Ledesma

    Hi Vidar:

    Can you delete 'Feature request #189'?
    I accidentally entered it as a feature request instead of a bug.

    BTW: I've reviewed the report of PVS (https://www.fly-server.ru/pvs-studio/owlnext/) and I've found some interesting points that complement my reviewing.

     
    👍
    1
  • Vidar Hasfjord

    Vidar Hasfjord - 2022-01-12

    Can you delete 'Feature request #189'?

    Done.

     
  • Vidar Hasfjord

    Vidar Hasfjord - 2022-01-12

    I've reviewed the report of PVS [...] and I've found some interesting points that complement my reviewing.

    Super! If you fix issues mentioned in that report, it would be helpful if you update the associated ticket [bugs:#504], so that it is clear which issues have been addressed.

     

    Related

    Bugs: #504

  • Sebastian Ledesma

    One point to keep in mind.
    In OWLNext 6.44 the font used in CoolEdit is 'Courier New'; in OWLNext 7.x it is "Consolas". In both cases, to 'fully' support UNICODE, the font family must include the glyphs for other alphabets, such as Arabic, Cyrillic, Hebrew, Thai, etc.
    Otherwise, the font subsystem will draw a square block for any character it cannot render.

     
  • Sebastian Ledesma

    So far, as of [r5740], all the changes and bugfixes are API and ABI compatible and can be ported to 6.44.
    But I'm thinking of extending the coolprj API to allow changing/saving the file with another encoding, and that would mean creating a 6.45 branch.

     

    Related

    Commit: [r5740]

  • Vidar Hasfjord

    Vidar Hasfjord - 2022-01-17

    Hi Sebastian,

    Setting up a branch for 6.45 and extending the API is not a big deal. If you want to do that, I can help. However, the important issue you should consider is whether you are prepared to support your extensions publicly, including repairing any regressions. Notably, the 6.40 series has support for a lot of old compilers, and you would then ideally have to make sure your code works with all of them.

    That said, this issue applies to any compatible changes to 6.44 as well.

    I would prefer to keep the 6.40 series stable now and do all new development on the 7 series. Personally, I am only prepared to maintain code with the latest compilers (although I still have Borland C++ 5.02 and 5.5 installed for the occasional regression testing), and I want to do minimal maintenance and administrative work on old OWLNext versions.

    What is the big obstacle stopping you from migrating to 7? If there are showstoppers in your personal circumstance that prevent you from migrating your code, remember that you can always keep your own custom branch of 6.44 with the extensions you want and need. You don't have to publish it, especially if there is no demand for it among the broader user base.

    Personally, I have my own experimental branch Owlet, where I implement extensions I want and need for my software. Many of those have been merged into the OWLNext trunk, but many have not. That is my preferred approach.

     
  • Vidar Hasfjord

    Vidar Hasfjord - 2022-01-17

    There is another issue with adding new features in older versions — it blocks the upgrade path. If we publish 6.45 with new features, it blocks the upgrade path to 7.0 for you and any other users of the new features. This is an odd situation, where a lower version number has more features than a higher version. Of course, your extensions will be available in 7.1, but the latter is not yet ready for release, and may not be for some time.

    For all mentioned reasons, I recommend you use a private branch to extend 6.44, if you really need to, and set the milestone for this feature ticket to 7.1.

     
  • Vidar Hasfjord

    Vidar Hasfjord - 2022-01-20

    Hi Sebastian,

    Since we now have a regression in OWLMaker (missing filename in editor window title, as noted in discussion:8cbc674381), I suggest you add OWLMaker as one of your test cases as you work on this feature ticket and any related bug tickets.

    Hopefully, the source revision causing the regression should be easy to identify. As I mentioned, the regression must have happened since revision 5699, which is the build number of the most recent release of OWLMaker.

     