Making_an_application_Unicode-ready

Unicode and OWLNext

While OWL only supported narrow text encoded according to Windows code pages (also known as "ANSI" code pages), OWLNext also supports wide text encoded as Unicode UTF-16, as implemented by Windows 2000/XP and later. This page describes how you can enable your application to support Unicode, either using the narrow UTF-8 code page in ANSI build mode, or the traditional wide UTF-16 character set in UNICODE build mode.



Introduction

Unicode is a computing industry standard for the consistent encoding, representation and handling of text expressed in most of the world's writing systems. Windows 2000/XP and later use Unicode, encoded as UTF-16, throughout the operating system.

For text-related functionality, Windows provides two variants of the API, using the suffix 'A' for functions and types dealing with traditional narrow (8-bit) Windows code page text (somewhat incorrectly referred to as "ANSI" text within Windows, hence the use of the letter 'A'), and the suffix 'W' for functions and types dealing with wide (16-bit) Unicode UTF-16 text ('W' for "wide"). An agnostic variant of the API, without these suffixes, is provided using macros. Each macro expands to the correct function or type according to the build mode, controlled by the UNICODE preprocessor symbol. For example:

  • SetWindowTextA(HWND, LPCSTR); Narrow ANSI version.
  • SetWindowTextW(HWND, LPCWSTR); Wide UTF-16 version.
  • SetWindowText(HWND, LPCTSTR); Agnostic version.

The agnostic version is implemented as a macro that expands to SetWindowTextW if UNICODE is defined, otherwise it expands to SetWindowTextA. Likewise, LPCTSTR expands to LPCWSTR if UNICODE is defined, otherwise LPCSTR.

Note that functions dealing with traditional narrow Windows code page text will translate the text to Unicode UTF-16 behind the scenes and then forward the call to the Unicode counterpart of the function. Using the narrow ANSI variant of the API is hence no more efficient; in fact, it is less so.

Wide string and character literals, i.e. literals prefixed by 'L', are encoded as UTF-16 (after translation from the encoding used by the source file) by C/C++ compilers targeting the Windows platform. The TEXT macro, and its shorter synonym _T, add the 'L' prefix to the literal if UNICODE is defined, otherwise not.

For example:

void foo(HWND w)
{
  SetWindowTextA(w, "Færøyene"); // ANSI version
  SetWindowTextW(w, L"Færøyene"); // UTF-16 version
  SetWindowText(w, TEXT("Færøyene")); // Agnostic version
  SetWindowText(w, _T("Færøyene")); // Agnostic version; less verbose alternative
}
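
The macro dispatch described above can be imitated in portable C++. The following is a minimal sketch, not the actual Windows headers: DEMO_UNICODE, DemoSetTextA, DemoSetTextW, DemoSetText, DemoTChar and DEMO_TEXT are hypothetical names invented here to mirror UNICODE, SetWindowTextA, SetWindowTextW, SetWindowText, TCHAR and TEXT.

```cpp
#include <cassert>
#include <string>

// Hypothetical narrow/wide function pair (stand-ins for e.g. SetWindowTextA/W).
// Each returns a tag so that the caller can see which variant was selected.
inline std::string DemoSetTextA(const char*) { return "narrow"; }
inline std::string DemoSetTextW(const wchar_t*) { return "wide"; }

// Hypothetical build-mode switch, playing the role of UNICODE.
#ifdef DEMO_UNICODE
  #define DemoSetText DemoSetTextW
  #define DEMO_TEXT(s) L##s
  using DemoTChar = wchar_t;
#else
  #define DemoSetText DemoSetTextA
  #define DEMO_TEXT(s) s
  using DemoTChar = char;
#endif

// Agnostic client code: compiles unchanged in both build modes.
inline std::string CallAgnostic()
{
  const DemoTChar* text = DEMO_TEXT("Hello");
  return DemoSetText(text);
}

// Checks that the variant matching the build mode was called.
inline void CheckDispatch()
{
#ifdef DEMO_UNICODE
  assert(CallAgnostic() == "wide");
#else
  assert(CallAgnostic() == "narrow");
#endif
}
```

Compiling the sketch with DEMO_UNICODE defined switches CallAgnostic to the wide variant without any change to its source, which is the property the real macro API provides.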

Since Windows 10 version 1903 (May 2019 Update), the operating system also supports an ANSI code page for UTF-8, allowing you to use narrow Unicode strings encoded in UTF-8.

void foo(HWND w)
{
  // Note: Since C++20, u8 literals have type const char8_t[] and need a cast to LPCSTR.
  SetWindowTextA(w, u8"Færøyene"); // ANSI UTF-8 code page
  SetWindowTextA(w, "Færøyene"); // The "u8" prefix can be dropped if the compiler's execution character set is UTF-8.
}

For more information about Unicode in Windows, see Working with Strings and Unicode in the Windows API in the Windows Development Reference. The MSDN article Globalization Step-by-Step is also a useful introduction to the topic.


Using UTF-8 in OWLNext applications

Setting the active ANSI code page to UTF-8 allows you to use the good old ANSI build mode and narrow strings for your OWLNext applications and still be able to use Unicode. However, if you rely on this code page, you can then only deploy your applications on Windows 10 version 1903 or later. Also, note that OWLNext will not work out of the box with UTF-8. You will have to make further changes to your OWLNext application to make it work with UTF-8. In particular, you have to explicitly set fonts with UTF-8 support enabled, and you may have to modify or replace any component that does string processing, so that it works properly with the UTF-8 encoding.


Activate the UTF-8 code page

For information on how to activate the UTF-8 code page, see "Use the UTF-8 code page" in the Windows documentation.


Use fonts with UTF-8 support

A big issue when using the UTF-8 code page is that GDI text functions will not use it by default. The code page used by GDI text functions is determined by the currently selected font. In particular, the LOGFONT::lfCharSet member of the font must be set to the undocumented value 254 for strings to be interpreted as UTF-8 (see Ted's Blog). Additionally, the selected font must of course contain glyphs for all the Unicode characters you need (some fonts only cover a limited range of characters).

enum { UNICODE_CHARSET = 254 }; // Undocumented

// Create a new font with support for UTF-8 enabled.
//
auto f = TFont{"Segoe UI", -10, 0, 0, 0, 400, 0, 0, 0, 0, UNICODE_CHARSET};

// Enable support for UTF-8 for the current window font.
//
auto lf = TFont{GetWindowFont()}.GetObject();
lf.lfCharSet = UNICODE_CHARSET;
SetWindowFont(TFont{lf});

Currently, OWLNext does not know anything about this, so all GDI text drawing done by OWLNext will not support UTF-8 by default. You will have to override the fonts wherever possible, and where not possible, you will have to find alternative solutions (e.g. use owner-drawing, or use a fully custom component, or modify OWLNext).


Using UTF-16 in OWLNext applications

OWLNext applications can be made to support both UTF-16 and ANSI text. When an OWLNext application is compiled with UNICODE defined, OWLNext will use UTF-16 types and functions; otherwise, it will use the ANSI variants instead. The application will automatically be linked with the respective variant of the OWLNext library. You can either code explicitly for wide UTF-16 text, or you can use the macro API to make your application agnostic to the UNICODE build mode, i.e. so that it uses UTF-16 text in UNICODE build mode, and ANSI text otherwise. For example:

void foo(TWindow& w)
{
  w.SetWindowText(L"Færøyene"); // Unicode version
  w.SetWindowText(_T("Færøyene")); // Agnostic version
}

The following sections provide some guidelines for moving your OWLNext application into the Unicode world. For complete code, demonstrating the use of the agnostic macro API to support both UNICODE and ANSI build modes, see [Examples].


Define the UNICODE preprocessor symbol

Add UNICODE to the preprocessor definitions for your application. This is necessary even if you do not intend to use the Windows macro API for Unicode. OWLNext uses the macro API in its own API and implementation, and automatically decides which variant of the OWLNext library to link, depending on the definition of the UNICODE symbol.

If you define UNICODE, then OWLNext will automatically define _UNICODE also (note the leading underscore). The _UNICODE symbol is used to enable a similar generic-text macro API for the string functions in the runtime library. For example, the macro _tcscpy will expand to wcscpy if _UNICODE is defined, otherwise it will expand to _mbscpy (if _MBCS is defined) or strcpy (if neither is defined). For more information, see Generic-Text Mappings in the Microsoft documentation.


Update the signature of OwlMain

The signature of the OwlMain entry function must be updated for the UNICODE build mode, using one of the following alternatives:

int OwlMain(int argc, LPWSTR argv[]); // Unicode version
int OwlMain(int argc, wchar_t* argv[]); // Equivalent alternative
int OwlMain(int argc, LPTSTR argv[]); // Agnostic version
int OwlMain(int argc, TCHAR* argv[]); // Equivalent alternative
int OwlMain(int argc, owl::tchar* argv[]); // Another agnostic alternative since OWLNext 6.32

Note that all of these alternatives boil down to the same signature in the UNICODE build mode. Use one of the agnostic alternatives if your application needs to support ANSI build mode as well.


Update the signature of virtual overrides

For all classes inherited from OWLNext classes, you must update the signature of any virtual override that has a string parameter or return value. For example, the correct UNICODE compliant signature of TView::SetDocTitle is now:

virtual bool SetDocTitle(LPCTSTR docname, int index); // UNICODE compliant signature

Tip: If you use a C++11 compliant compiler, then use the keyword override on all your overriding virtual functions. When override is used, the compiler will tell you if the signature does not match the base virtual function. Without it, the mismatch goes undetected, and your program will not work as intended.

bool SetDocTitle(LPCTSTR docname, int index) override; // UNICODE compliant signature
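
The effect of a stale signature can be reproduced without OWLNext. The following is a minimal sketch in which DemoView, StaleView, PortedView and CallThroughBase are hypothetical names; DemoView merely stands in for a library base class whose virtual function takes a wide string in UNICODE build mode.

```cpp
#include <cassert>

// Hypothetical stand-in for a library base class; not the real OWLNext TView.
struct DemoView
{
  virtual ~DemoView() = default;
  virtual bool SetDocTitle(const wchar_t* /*docname*/, int /*index*/) { return false; }
};

// Not ported: the parameter is still a narrow string. Without "override",
// this compiles, but it declares a new function that hides the base one.
// Adding "override" here would turn the mismatch into a compile error.
struct StaleView : DemoView
{
  bool SetDocTitle(const char* /*docname*/, int /*index*/) { return true; }
};

// Ported: the signature matches, and "override" makes the compiler verify it.
struct PortedView : DemoView
{
  bool SetDocTitle(const wchar_t* /*docname*/, int /*index*/) override { return true; }
};

// Calls the function the way a framework would: through the base class.
inline bool CallThroughBase(DemoView& v) { return v.SetDocTitle(L"Title", 0); }

inline void DemoOverride()
{
  StaleView stale;
  PortedView ported;
  assert(!CallThroughBase(stale)); // The base version ran; the stale function was never called.
  assert(CallThroughBase(ported)); // The derived version ran.
}
```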


Use owl::tstring for strings

The recommended way to handle strings in OWLNext is to use the string classes in the standard C++ library. On the Windows platform, wchar_t is defined as 16-bit, making std::wstring a perfect fit for UTF-16.

OWLNext 6.32 introduced widespread support for the standard string classes throughout the library. Version 6.32 also introduced the owl::tstring type definition as an agnostic type, mapping to std::wstring in UNICODE build mode, and to std::string otherwise. Note that the owl::tstring type definition was named owl_string in versions prior to OWLNext 6.32. See Strings in OWLNext for more information.

If you need to refer to the string character type in a generic way, you can use owl::tstring::value_type. Outside the string classes, you can use the type definition owl::tchar. It will map to wchar_t in UNICODE mode, and to char otherwise. Note that owl::tchar was introduced in OWLNext 6.32. In versions prior to 6.32, you can use the Windows macro TCHAR.

To simplify and increase the robustness of your application, you should avoid using plain C-style string functions, but if you have to, then use the wide or agnostic variants, e.g. use wcscpy or _tcscpy to replace strcpy and _mbscpy. See Generic-Text Mappings in the Microsoft documentation.


Use wide formatted streams

To support UTF-16, you must replace narrow streams with their wide counterparts. OWLNext 6.32 introduced agnostic type definitions, such as owl::tostringstream, that map to wide streams (e.g. std::wostringstream) in UNICODE mode, and to narrow streams (e.g. std::ostringstream) otherwise. See "include/owl/private/strmdefs.h" for all the available type definitions.
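
To illustrate how agnostic string and stream types work together, here is a minimal portable sketch. It does not use the OWLNext headers: DEMO_UNICODE, DEMO_T, demo_tchar, demo_tstring and demo_tostringstream are hypothetical names modelled on UNICODE, _T, owl::tchar, owl::tstring and owl::tostringstream, and the real definitions may differ in detail.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical build-mode switch and character type.
#ifdef DEMO_UNICODE
  using demo_tchar = wchar_t;
  #define DEMO_T(s) L##s
#else
  using demo_tchar = char;
  #define DEMO_T(s) s
#endif

// Hypothetical agnostic aliases built on the standard class templates.
using demo_tstring = std::basic_string<demo_tchar>;
using demo_tostringstream = std::basic_ostringstream<demo_tchar>;

// Formats a label and a number using only agnostic types, so the same
// source works with narrow and wide text.
inline demo_tstring FormatCount(const demo_tstring& label, int count)
{
  demo_tostringstream os;
  os << label << DEMO_T(": ") << count;
  return os.str();
}

inline void DemoFormat()
{
  assert(FormatCount(DEMO_T("Items"), 42) == DEMO_T("Items: 42"));
}
```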


Prefix or encase string literals

All string and character literals prefixed by 'L' will be encoded as UTF-16 by compilers targeting the Windows platform. To create a UNICODE agnostic application, encase all character strings and single characters with the TEXT macro (or the shorter _T synonym). For example:

void foo(TStatic& c)
{
  c.SetText(L"Færøyene"); // UTF-16 string literal
  c.SetText(_T("Færøyene")); // Agnostic alternative
}



Converting between narrow and wide strings

See "Strings in OWLNext" for guidance on conversion between narrow ANSI strings and wide UTF-16 strings.


Pitfalls

String and buffer lengths

When dealing with Unicode string lengths, it is important to understand the difference between character count (Unicode code points) and encoded element count (Unicode code units). These counts are not synonymous for variable-width encodings such as UTF-8 and UTF-16. A common misconception is that UTF-16 is a fixed-width encoding (i.e. it is confused with the older fixed-width UCS-2 encoding), but it is not; code points outside the Basic Multilingual Plane require two UTF-16 (wchar_t) code units, a surrogate pair, to represent.

Note that std::wstring::size and wcslen (and hence owl::tstring::size and _tcslen) return the code unit count.
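
The difference between the two counts can be demonstrated with a short sketch. The helper functions CountCodePointsUtf8 and CountCodePointsUtf16 are hypothetical and assume well-formed input. The sketch uses char16_t and std::u16string rather than wchar_t so that it behaves identically on every platform; on Windows, wchar_t strings hold the same UTF-16 code units.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts code points in a well-formed UTF-8 string by skipping
// continuation bytes (bit pattern 10xxxxxx).
inline std::size_t CountCodePointsUtf8(const std::string& s)
{
  std::size_t n = 0;
  for (unsigned char c : s)
    if ((c & 0xC0) != 0x80) ++n;
  return n;
}

// Counts code points in a well-formed UTF-16 string by skipping
// low (trailing) surrogates, 0xDC00 to 0xDFFF.
inline std::size_t CountCodePointsUtf16(const std::u16string& s)
{
  std::size_t n = 0;
  for (char16_t c : s)
    if (c < 0xDC00 || c > 0xDFFF) ++n;
  return n;
}

inline void DemoCounts()
{
  // "Færøyene" in UTF-8, written with explicit bytes for æ and ø.
  const std::string utf8 = "F\xC3\xA6r\xC3\xB8yene";
  assert(utf8.size() == 10);              // Code units (bytes).
  assert(CountCodePointsUtf8(utf8) == 8); // Code points.

  // U+1F600 lies outside the Basic Multilingual Plane.
  const std::u16string emoji = u"\U0001F600";
  assert(emoji.size() == 2);                // Code units (a surrogate pair).
  assert(CountCodePointsUtf16(emoji) == 1); // Code points.
}
```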


Buffer length calculation

Beware that sizeof, commonly used to calculate the element count of a char array, will not return the element count of a wchar_t array; it returns the size of the array in bytes. If your compiler supports C++17, you can use std::size to calculate the element count of an array. For older compilers, you can use COUNTOF, a macro provided by OWLNext.

For example, a common usage of TInputDialog in a traditional OWL application is:

char buf[32] = "Færøyene";
TInputDialog dlg(this, "Title", "Prompt:", buf, sizeof(buf));
...

To support the UNICODE mode, the code above could be ported as follows:

tchar buf[32] = _T("Færøyene");
TInputDialog dlg(this, _T("Title"), _T("Prompt:"), buf, std::size(buf));
...

Note that std::size and COUNTOF can only be used on arrays. They will not work for string pointers, such as LPTSTR. When you have a pointer to a string, you need to use _tcslen or wcslen to calculate its length in code units. In general, it is recommended that you rewrite your code to use standard string classes where possible to avoid the use of error-prone buffers.
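
The following minimal sketch contrasts the byte size, the element count and the string length of a wide buffer. DemoCountOf is a hypothetical helper written for this example; it only shows how an array-counting facility can be built for older compilers and is not the OWLNext COUNTOF macro.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>
#include <iterator> // std::size (C++17)

// Hypothetical helper for compilers without std::size: deduces the element
// count from the array type. Like std::size, it does not accept pointers.
template <class T, std::size_t N>
constexpr std::size_t DemoCountOf(const T (&)[N]) noexcept { return N; }

inline void DemoBufferLengths()
{
  wchar_t buf[32] = L"Title";

  // sizeof yields the size in bytes, not the number of elements.
  assert(sizeof(buf) == 32 * sizeof(wchar_t));

  // Element count of the array.
  assert(DemoCountOf(buf) == 32);
#ifdef __cpp_lib_nonmember_container_access
  assert(std::size(buf) == 32);
#endif

  // For a pointer, measure the string itself (in code units).
  const wchar_t* p = buf;
  assert(std::wcslen(p) == 5);
}
```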


Related

News: 2021/02/utf-8-support-in-owlnext
Wiki: Examples
Wiki: Knowledge_Base
Wiki: OWLMaker
Wiki: Replacing_the_Borland_C++_Class_Libraries
Wiki: Strings_in_OWLNext
Wiki: Upgrading_from_OWL
