StrCvt

Ruediger Helsch

StrCvt - Character type converter for C++11

Introduction

C++11 has the new character types char16_t and char32_t, which use
the UTF-16 and UTF-32 Unicode encoding. The question is how to convert
between the four character types of C++.

This library converts between the different character types of C++.

  • Conversion between the types char, wchar_t, char16_t and
    char32_t in all combinations.

  • Implements the API of the C++ codecvt member functions in() and
    out() for stream conversion, and convenience functions for string
    conversion.

  • Locale-dependent conversion and UTF conversion.

Installation

Unpack the distribution to a directory on your local machine. You can
include the headers in subdirectory include/strcvt from your
program. To make inclusion of the headers easier, it is recommended to
add the subdirectory include of the distribution to the include file
search path of the compiler. This is commonly achieved with the
option -I/path/to/strcvt/include (assuming that the strcvt
distribution has been unpacked to directory /path/to/strcvt). Then
you can include the code converter headers through their standard names
like "strcvt/strcvt.h".

Usage

In order to use the code converter, its headers have to be included.
They are:

    // The main header defining the strcvt() string conversion functions
    #include "/path/to/strcvt/include/strcvt/strcvt.h"
    // The main header defining the u8string type
    #include "/path/to/strcvt/include/strcvt/u8string.h"
    // The header defining the strcvt_utf() UTF string conversion functions
    #include "/path/to/strcvt/include/strcvt/strcvt_utf.h"
    // The header defining the charcvt() stream conversion functions
    #include "/path/to/strcvt/include/strcvt/charcvt.h"
    // The header defining the charcvt_utf() UTF stream conversion functions
    #include "/path/to/strcvt/include/strcvt/charcvt_utf.h"
    // The header defining the strcvt_iterator
    #include "/path/to/strcvt/include/strcvt/strcvt_iterator.h"
    // The header defining codecvt_utf
    #include "/path/to/strcvt/include/strcvt/codecvt_utf.h"

    // Optional header defining output operators
    #include "/path/to/strcvt/include/strcvt/strcvt_operators.h"

    // Import symbols into user's namespace
    using namespace StrCvt;

If the strcvt include directory /path/to/strcvt/include has been
added to the include file search path of the compiler, e.g. using the
compiler option -I/path/to/strcvt/include, this reduces to:

    // The main header defining the strcvt() string conversion functions
    #include "strcvt/strcvt.h"
    // The main header defining the u8string type
    #include "strcvt/u8string.h"
    // The header defining the strcvt_utf() UTF string conversion functions
    #include "strcvt/strcvt_utf.h"
    // The header defining the charcvt() stream conversion functions
    #include "strcvt/charcvt.h"
    // The header defining the charcvt_utf() UTF stream conversion functions
    #include "strcvt/charcvt_utf.h"
    // The header defining the strcvt_iterator
    #include "strcvt/strcvt_iterator.h"
    // The header defining codecvt_utf
    #include "strcvt/codecvt_utf.h"

    // Optional header defining output operators
    #include "strcvt/strcvt_operators.h"

    // Import symbols into user's namespace
    using namespace StrCvt;

The strcvt() functions are exported through namespace StrCvt. It
is recommended to make them available via a using namespace
directive as in the example above. Alternatively, it is possible to
import the functions strcvt() and strcvt_utf() etc. separately
through using declarations:

    using StrCvt::strcvt;
    using StrCvt::strcvt_utf;
    using StrCvt::strcvt_utf_strict;
    using StrCvt::make_strcvt_iterator;

The character type converter is pre-configured for header-only use.
This means: Just include the header and you are done. In order to
reduce space overhead and compilation time, a precompiled library can
be used. See section Creating a Library.

String conversion

Function strcvt<dest_string_type>(source) converts strings from the
source character type to the destination string type. The
dest_string_type can be std::string, std::wstring,
std::u16string and std::u32string. An additional string type
StrCvt::u8string with UTF-8 encoding is also provided; see section
UTF-8 string type u8string. Example usage:

    #include "strcvt/strcvt.h"

    // ...

    using namespace StrCvt;

    // Convert const char* C string to wide wchar_t string
    std::wstring w = strcvt<std::wstring>("Hello, world!");
    // Convert wide string to string of different character type
    std::string s = strcvt<std::string>(w);
    // Convert string to char16_t string
    std::u16string s16 = strcvt<std::u16string>(s);
    // The source buffer can also be specified as (pointer, size):
    std::cout << strcvt<std::string>(&s16[0], 5) << std::endl;

The source string can be specified as:
- a character pointer of any character type (C-style null-terminated string),
- a C++ string of arbitrary character type, or
- a character pointer and a size.

The destination string type is specified as a template argument to
strcvt(). The function is specialized on all combinations of source
and destination character types. So the full signatures are:

    template<class charT, class source_charT>
    std::basic_string<charT>
    strcvt<std::basic_string<charT>>(const std::basic_string<source_charT>& source);

    template<class charT, class source_charT>
    std::basic_string<charT>
    strcvt<std::basic_string<charT>>(const source_charT* source);

    template<class charT, class source_charT>
    std::basic_string<charT>
    strcvt<std::basic_string<charT>>(const source_charT* source, std::size_t size);

The function with string argument
dest_string strcvt<dest_string>(source_string source) is overloaded
for the case dest_string == source_string: it moves its argument to
the result (for rvalue arguments) or returns a reference to its
argument (for lvalue arguments). This is useful in generic code where
the concrete types are not known.
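
The same-type case can be pictured with a pair of overloads. This is only
a sketch of the idea using the standard library; the name same_type_cvt
is hypothetical and not part of StrCvt, and the library's real overloads
may be declared differently.

```cpp
#include <string>
#include <utility>

namespace sketch {

// Lvalue argument: no copy is made, the caller gets a reference
// to its own string back.
template<class charT>
const std::basic_string<charT>&
same_type_cvt(const std::basic_string<charT>& source) {
    return source;
}

// Rvalue argument: the buffer is moved into the result.
// (basic_string<charT>&& is a plain rvalue reference here,
// not a forwarding reference.)
template<class charT>
std::basic_string<charT>
same_type_cvt(std::basic_string<charT>&& source) {
    return std::move(source);
}

} // namespace sketch
```

Generic code can then call the conversion unconditionally and pays nothing
when source and destination types happen to coincide.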

It is also possible to omit the template argument determining the
destination string type. In this case, strcvt() does no conversion at
all, but returns an object which can be converted to all string types.

    // strcvt() returns an object which can be assigned to any string type
    std::wstring w = strcvt("Hello, world!");
    std::u16string s16 = strcvt(w);

For string arguments, the return value of strcvt() is a subclass of
the class of its argument. The subclass adds conversion operators to
all string types. For character pointer arguments, the return value
of strcvt() is an object which can be converted to a character
pointer (of the character type of the argument) or to any string type.
In each case, the return value of strcvt() can e.g. be assigned to
any string type, or can be passed as argument to a function expecting a
string.
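
How such a "convert later" return value can work is sketched below. The
class deferred_cvt and the function cvt are hypothetical names, and the
conversion itself is a deliberately naive placeholder that assumes 7-bit
ASCII input; the real strcvt() return type is not shown in this document
and is certainly more elaborate.

```cpp
#include <string>

namespace sketch {

// Hypothetical proxy: remembers the source pointer and converts only
// when the result is assigned to a concrete string type.
class deferred_cvt {
public:
    explicit deferred_cvt(const char* source) : source_(source) {}

    // One templated conversion operator covers all basic_string types.
    // Assumption: the input is 7-bit ASCII, so widening each char is
    // a valid conversion.
    template<class charT>
    operator std::basic_string<charT>() const {
        std::basic_string<charT> result;
        for (const char* p = source_; *p != '\0'; ++p)
            result.push_back(static_cast<charT>(static_cast<unsigned char>(*p)));
        return result;
    }

private:
    const char* source_;
};

inline deferred_cvt cvt(const char* source) {
    return deferred_cvt(source);
}

} // namespace sketch
```

With this in place, std::wstring w = sketch::cvt("Hi"); and
std::u32string u = sketch::cvt("Hi"); both compile, and the destination
type is chosen by the variable being initialized.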

UTF-8 string type u8string

In addition to the four standard string types std::string,
std::wstring, std::u16string and std::u32string provided by the C++
standard library, header strcvt/u8string.h defines an additional
string type StrCvt::u8string. Like the locale-dependent string type
std::string it has character type char, but it always uses a
locale-independent UTF-8 encoding. Example usage:

    #include "strcvt/u8string.h"
    #include "strcvt/strcvt.h"

    // ...

    using namespace StrCvt;

    u8string us(u8"Hello, World");
    std::cout << strcvt<std::string>(us) << std::endl;

In the example above, the u8string is initialized using the C++11
UTF-8 string literal u8"string". For output to the
locale-dependent standard output stream, it is converted from UTF-8 to
the locale-dependent char encoding.

UTF string conversion

In the examples above, strcvt() can be replaced with strcvt_utf(),
which ignores the locale and always performs a UTF transformation.
The header is "strcvt/strcvt_utf.h".

It is also possible to configure strcvt() to only use UTF conversions
instead of the locale dependent character conversions. Open the header
strcvt/strcvt_config.h with an editor and change the preprocessor
symbol STRCVT_IMPL_UTF8_ONLY from 0 to 1.

Locale-dependent stream conversion

The character type stream converter function charcvt(state, from,
from_end, from_next, to, to_end, to_next) is defined in header
strcvt/charcvt.h and uses the API of the member functions in()
and out() of the standard code conversion interface std::codecvt:

    #include "strcvt/charcvt.h"

    // ...

    using namespace StrCvt;

    std::mbstate_t state = std::mbstate_t(); // Zero-initialize
    // Convert between character types:
    std::codecvt_base::result r = charcvt(state, from, from_end, from_next, to, to_end, to_next);

    // ... More calls of charcvt()

    // Return output to initial shift state
    r = charcvt_unshift<source_charT>(state, to, to_end, to_next);

The state must be of type std::mbstate_t. It holds the conversion
state between successive calls of charcvt() and must be explicitly
zero-initialized before the first use, as shown above. The
arguments from and from_end delimit the source character buffer,
to and to_end delimit the destination character buffer, and on
exit from the function from_next and to_next point past the last
converted character. The function returns std::codecvt_base::ok on
success and std::codecvt_base::partial if the output buffer was too
small to convert the entire input buffer, or if the input buffer ended
in a part of a multibyte sequence. After the input has been
completely converted, possibly by multiple calls to charcvt(),
charcvt_unshift() must be called to move the output to the initial
shift state. The source character type must be specified as a
template argument to charcvt_unshift(), since different converters are
used depending on the source character type. Charcvt_unshift() should
be used even if the character encoding is known not to be state
dependent, like UTF-8. Charcvt_unshift() checks whether trailing
incomplete Unicode character input sequences are pending, and appends
a replacement character to the output buffer if necessary to signal
the presence of trailing garbage.
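
Since charcvt() follows the calling convention of std::codecvt::in() and
out(), the typical driver loop can be illustrated with the standard facet
itself. The sketch below uses only the standard library, with the classic
"C" locale and a deliberately tiny output buffer to force several calls;
the function name widen_in_chunks is hypothetical. A loop around
charcvt() would presumably have the same shape, followed by a final call
of charcvt_unshift().

```cpp
#include <cstddef>
#include <cwchar>
#include <locale>
#include <string>

namespace sketch {

// Convert a char buffer to wchar_t in small chunks, carrying the
// conversion state from one call to the next.
inline std::wstring widen_in_chunks(const std::string& input) {
    typedef std::codecvt<wchar_t, char, std::mbstate_t> facet_type;
    const facet_type& facet =
        std::use_facet<facet_type>(std::locale::classic());

    std::mbstate_t state = std::mbstate_t();  // zero-initialize before first use
    const char* from = input.data();
    const char* const from_end = from + input.size();
    std::wstring result;

    while (from != from_end) {
        wchar_t buffer[4];                    // small on purpose
        const char* from_next = from;
        wchar_t* to_next = buffer;
        const std::codecvt_base::result r =
            facet.in(state, from, from_end, from_next,
                     buffer, buffer + 4, to_next);

        // Keep whatever was produced, even on a partial conversion
        result.append(buffer, static_cast<std::size_t>(to_next - buffer));

        // Stop on errors and when no input was consumed
        if (r == std::codecvt_base::error || from_next == from)
            break;
        from = from_next;                     // resume behind the consumed input
    }
    return result;
}

} // namespace sketch
```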

Charcvt() is specialized for all combinations of source and
destination character types char, wchar_t, char16_t and
char32_t and converts between the locale-dependent character type
char, the Unicode UTF-16 and UTF-32 character types char16_t and
char32_t, and the implementation defined wide character type
wchar_t which is assumed to be equivalent to either char16_t or
char32_t. It is able to consume or deliver individual code units of
Unicode characters that span several code units.

The full signature of charcvt() is:

    template<class src_charT, class dst_charT>
    std::codecvt_base::result
    charcvt(std::mbstate_t& state,
            const src_charT* from, const src_charT* from_end,
            const src_charT*& from_next,
            dst_charT* to, dst_charT* to_end, dst_charT*& to_next,
            int flags = 0);

The flags can be omitted, or they can be set to the constant
strcvt_flags_no_partial_conversions to prevent partial conversions.
If the flag is omitted or zero, charcvt() is eager and consumes even
an incomplete part of a multibyte character.

UTF stream conversion

The function charcvt_utf(state, from, from_end, from_next, to, to_end,
to_next) is defined in header strcvt/charcvt_utf.h and works like
charcvt() but treats the type char as having UTF-8 encoding instead
of a locale dependent implementation defined encoding:

    #include "strcvt/charcvt_utf.h"

    // ...

    using namespace StrCvt;

    std::mbstate_t state = std::mbstate_t(); // Zero-initialize
    // Convert between character types:
    charcvt_utf(state, from, from_end, from_next, to, to_end, to_next);

    // ... More calls of charcvt_utf()

    // Return output to initial shift state
    charcvt_utf_unshift(state, to, to_end, to_next);

The state must be of type std::mbstate_t. It holds the conversion
state between successive calls of charcvt_utf() and must be explicitly
zero-initialized before the first use, as shown above. The
arguments from and from_end delimit the source character buffer,
to and to_end delimit the destination character buffer, and on
exit from the function from_next and to_next point past the last
converted character. The function returns std::codecvt_base::ok on
success and std::codecvt_base::partial if the output buffer was too
small to convert the entire input buffer, or if the input buffer ended
in a part of a multibyte sequence. After the input has been
completely converted, possibly by multiple calls to charcvt_utf(),
charcvt_utf_unshift() checks whether trailing incomplete Unicode
character input sequences are pending, and appends a replacement
character to the output buffer if necessary to signal the presence of
trailing garbage.

Charcvt_utf() is specialized for all combinations of source and
destination character types char, wchar_t, char16_t and
char32_t and converts between UTF-coded char, the Unicode UTF-16
and UTF-32 character types char16_t and char32_t, and the
implementation defined wide character type wchar_t which is assumed
to be equivalent to either char16_t or char32_t. Contrary to
member functions in() and out() of codecvt, charcvt_utf() always
converts (even if the input character type is the same as the output
character type, in which case it checks the validity of the encoding).
It always generates valid UTF-8, UTF-16 or UTF-32 output sequences.

For each invalid encoding in the input buffer, charcvt_utf() inserts
the replacement character U+FFFD into the output buffer. If this is
not wanted, function charcvt_utf_strict() can be used, which has the
same interface as charcvt_utf() but returns std::codecvt_base::error
on encoding errors, with from_next pointing to the first element of
the invalid sequence.
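
The difference between the lenient and the strict behaviour can be shown
with a small stand-alone UTF-8 decoder. This is a sketch, not the code of
charcvt_utf(): the names decode_one, decode_lenient and decode_strict are
hypothetical, and skipping exactly one byte after an invalid sequence is
an assumption made here; the library may resynchronize differently.

```cpp
#include <cstddef>
#include <string>

namespace sketch {

const char32_t replacement_char = 0xFFFD;

// Decode one UTF-8 sequence starting at s[i]. Returns the sequence
// length and stores the code point in cp, or returns 0 if invalid.
inline std::size_t decode_one(const std::string& s, std::size_t i, char32_t& cp) {
    const unsigned char lead = static_cast<unsigned char>(s[i]);
    std::size_t len = 0;
    char32_t lowest = 0;                      // smallest value allowed for len
    if (lead < 0x80) { cp = lead; return 1; }
    else if (lead >= 0xC0 && lead < 0xE0) { len = 2; cp = lead & 0x1F; lowest = 0x80; }
    else if (lead >= 0xE0 && lead < 0xF0) { len = 3; cp = lead & 0x0F; lowest = 0x800; }
    else if (lead >= 0xF0 && lead < 0xF8) { len = 4; cp = lead & 0x07; lowest = 0x10000; }
    else return 0;                            // stray continuation or bad lead byte
    if (i + len > s.size()) return 0;         // truncated sequence
    for (std::size_t k = 1; k < len; ++k) {
        const unsigned char c = static_cast<unsigned char>(s[i + k]);
        if ((c & 0xC0) != 0x80) return 0;     // not a continuation byte
        cp = (cp << 6) | (c & 0x3F);
    }
    if (cp < lowest || cp > 0x10FFFF) return 0;     // overlong or out of range
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;     // surrogate code point
    return len;
}

// Lenient: every invalid byte becomes U+FFFD, decoding continues.
inline std::u32string decode_lenient(const std::string& s) {
    std::u32string out;
    for (std::size_t i = 0; i < s.size(); ) {
        char32_t cp = 0;
        const std::size_t len = decode_one(s, i, cp);
        if (len == 0) { out.push_back(replacement_char); ++i; }
        else { out.push_back(cp); i += len; }
    }
    return out;
}

// Strict: stop at the first invalid sequence and report its position.
inline bool decode_strict(const std::string& s, std::u32string& out,
                          std::size_t& error_pos) {
    out.clear();
    for (std::size_t i = 0; i < s.size(); ) {
        char32_t cp = 0;
        const std::size_t len = decode_one(s, i, cp);
        if (len == 0) { error_pos = i; return false; }
        out.push_back(cp);
        i += len;
    }
    return true;
}

} // namespace sketch
```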

There is also a UTF code conversion facet defined in header
strcvt/codecvt_utf.h:

    template<class intern_charT, class extern_charT> class codecvt_utf;

This code conversion facet uses charcvt_utf() to convert between UTF
coded characters intern_charT and extern_charT, and is specialized
for all character type combinations.

Character type converting character iterator

Instead of converting the entire buffer, it can be accessed through a
converting character iterator. The iterator is created by function
make_strcvt_iterator():

    #include "strcvt/strcvt_iterator.h"

    // ...

    using namespace StrCvt;

    // Create char32_t iterator for access to source C string
    auto it = make_strcvt_iterator<char32_t>("Hello, world!");

The end iterator is returned by make_strcvt_iterator() without
arguments, or it can be obtained from member function end() of the
iterator. Since member function begin() is also implemented, and
returns the iterator itself, the iterator can be used just like a
container, for example with the range-based for statement:

    std::u32string w;
    // Use range-based for statement: process all Unicode characters
    for (char32_t c: it)
        w.push_back(c);

    // Alternatively, use simple iterator interface
    auto e = it.end();
    for (auto j = it; j != e; ++j)
        w.push_back(*j);
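
The trick of an iterator that doubles as its own range can be shown in
isolation. The class below is a hypothetical sketch, not the library's
strcvt_iterator: it decodes UTF-16 to char32_t only, keeps a pointer to
the source string (which must therefore outlive the iterator), and maps
unpaired surrogates to U+FFFD.

```cpp
#include <cstddef>
#include <string>

namespace sketch {

class utf16_to_utf32_iterator {
public:
    // Default-constructed iterator is the end sentinel
    utf16_to_utf32_iterator() : source_(nullptr), pos_(0), next_(0), current_(0) {}

    explicit utf16_to_utf32_iterator(const std::u16string& source)
        : source_(&source), pos_(0), next_(0), current_(0) { decode(); }

    char32_t operator*() const { return current_; }

    utf16_to_utf32_iterator& operator++() {
        pos_ = next_;
        decode();
        return *this;
    }

    bool operator==(const utf16_to_utf32_iterator& other) const {
        if (at_end() || other.at_end())
            return at_end() == other.at_end();
        return source_ == other.source_ && pos_ == other.pos_;
    }
    bool operator!=(const utf16_to_utf32_iterator& other) const {
        return !(*this == other);
    }

    // begin() returns the iterator itself, end() the sentinel, so the
    // object can be used directly in a range-based for statement.
    utf16_to_utf32_iterator begin() const { return *this; }
    utf16_to_utf32_iterator end() const { return utf16_to_utf32_iterator(); }

private:
    bool at_end() const { return source_ == nullptr || pos_ >= source_->size(); }

    // Decode the code point starting at pos_ and remember where the next begins
    void decode() {
        if (at_end()) return;
        const char16_t unit = (*source_)[pos_];
        next_ = pos_ + 1;
        current_ = unit;
        if (unit >= 0xD800 && unit <= 0xDBFF && next_ < source_->size()) {
            const char16_t low = (*source_)[next_];
            if (low >= 0xDC00 && low <= 0xDFFF) {
                current_ = 0x10000 + ((char32_t(unit) - 0xD800) << 10)
                                   + (char32_t(low) - 0xDC00);
                ++next_;
                return;
            }
        }
        if (unit >= 0xD800 && unit <= 0xDFFF)
            current_ = 0xFFFD;                // unpaired surrogate
    }

    const std::u16string* source_;
    std::size_t pos_;
    std::size_t next_;
    char32_t current_;
};

} // namespace sketch
```

Given a named std::u16string src, the loop
for (char32_t c : sketch::utf16_to_utf32_iterator(src)) visits one code
point per iteration, with surrogate pairs already combined.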

The converting iterator is (indirectly) derived from base class
strcvt_iterator_base<charT> which does not depend on the source
iterator type or the source character type. It has a virtual
destructor and can be used polymorphically. The inheritance hierarchy
looks like this:

    template<class charT>
    class strcvt_iterator_base {
        // Operators *(), ++(), ==()
        // Member functions begin(), end()
    };

    template<class charT, class source_charT>
    class strcvt_iterator_impl : public strcvt_iterator_base<charT> {
        // Implements operator ++() which does the conversion using
        // virtual member function get_next_source_char()
    };

    template<class charT, class SourceIterator>
    class strcvt_iterator
        : public strcvt_iterator_impl<charT, typename std::iterator_traits<SourceIterator>::value_type> {
        // Implements get_next_source_char()
    };

Class template strcvt_iterator_base does not know the source
character type and can be used to handle a strcvt_iterator
polymorphically. Class template strcvt_iterator_impl knows the
source character type but not the type of the iterator. If the
precompiled library is used, it is precompiled for all combinations of
source and destination character type. Class template
strcvt_iterator is the return value of function
make_strcvt_iterator(). It implements the source iterator handling.
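
The benefit of a base class that hides the source type can be seen in a
reduced, two-level version of the hierarchy. All names below are
hypothetical, and the "conversion" is a plain static_cast that is only
correct for ASCII input; the point is solely that the consumer sees
nothing but the destination character type.

```cpp
#include <string>
#include <vector>

namespace sketch {

// Base class: knows only the destination character type.
template<class charT>
class char_source_base {
public:
    virtual ~char_source_base() {}
    // Fetches the next character; returns false when exhausted.
    virtual bool next(charT& c) = 0;
};

// Derived class: knows the source iterator type.
// Assumption: ASCII-only input, so a static_cast is a valid conversion.
template<class charT, class SourceIterator>
class char_source : public char_source_base<charT> {
public:
    char_source(SourceIterator first, SourceIterator last)
        : cur_(first), end_(last) {}

    bool next(charT& c) override {
        if (cur_ == end_) return false;
        c = static_cast<charT>(*cur_);
        ++cur_;
        return true;
    }

private:
    SourceIterator cur_;
    SourceIterator end_;
};

// Compiled once, works for every source type through the base class.
inline std::u32string drain(char_source_base<char32_t>& source) {
    std::u32string result;
    char32_t c = 0;
    while (source.next(c))
        result.push_back(c);
    return result;
}

} // namespace sketch
```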

Operators

Some operators for strings are defined in header
"strcvt/strcvt_operators.h". These are the output operators << for
strings and character pointers, and the appending += operators for
strings. To use them, include header "strcvt/strcvt_operators.h".
The operators reside in namespace StrCvt and should be imported into
the user's namespace through a using namespace directive:

    #include "strcvt/strcvt_operators.h"
    using namespace StrCvt;

    std::cout << U"Hello, world!\n";

It would be nice if we could define conversion operators to enable
assignment between different string types. But in C++, conversion and
assignment operators can only be defined as member functions.

Creating a Library

The character type converter is preconfigured for header-only use.
This means: Just include the header and you are done. In order to
reduce space overhead and compilation time, a precompiled library can
be used.

The main advantage in using a library is that each time the strcvt
headers are included, the compiler does not need to look at the
implementation details. This can speed up compilation significantly.

To create the library, the C++ source files libstrcvt.cpp,
libstrcvt_iterator.cpp, libstrcvt_utf.cpp and libcodecvt_utf.cpp in
directory lib of the distribution must be compiled. Under Linux,
just run make. Before compiling, you may want to select the
compiler to use: Uncomment the proper CXX= line in the top-level
Makefile.template. Running make should create a library
lib/libxprintf.a, which has to be linked to the programs.

In Visual C++, instead of building a library, you may just add the
library source files to your project.

In order to make the headers use the library, you must open the header
strcvt/strcvt_config.h with an editor and change the preprocessor
symbol STRCVT_IMPL_USE_LIBRARY from 0 to 1. The next time the
header is included, the library will be used. You can check that the
library is used as intended by omitting the library when linking.
Linking should fail with missing externals.

In order to run the tests, the headers for the boost test framework
are required.

Compatibility

The code converter has been tested with:

  • GCC 4.7, 4.8 and 4.9 on Linux
  • Visual Studio Express 2013 for Windows Desktop with November 2013 CTP
  • Intel C++ 14.0.2 on Linux
  • Clang 3.5.0 on Linux

Rationale

C++11 has the new character types char16_t and char32_t, which use
the UTF-16 and UTF-32 Unicode encoding. The question is how to convert
between the four character types of C++.

The codecvt part of the C++11 library looks like some ruins left
over at the front line between warring factions.

The codecvt class template is part of the header <locale> and
described in section 22.4.1.4 of the standard. The first thing to
note is that according to this specification codecvt transforms
between an internal and an external character encoding, so it is
not intended to transform between internal character encodings like
char16_t and wchar_t.

According to the standard, each locale shall have specializations of
codecvt for transformation between the internal character types
char, wchar_t, char16_t and char32_t and the external
character type char. The standard first says that these
specializations "convert the implementation-defined native character
set", only to continue specifying that

  • the specialization with external and internal character type both
    char must not convert at all (effectively saying that the internal
    char must have the same encoding as the external char),

  • the specializations with internal character types char16_t and
    char32_t must treat the external character code as UTF-8, so they
    are explicitly not allowed to treat it as an "implementation-defined
    native character set".

This leaves the transformation between internal character type
wchar_t and external character type char as the only
locale-dependent transformation, and it is specified to convert
"between the native character sets for narrow and wide characters".
So no luck using codecvt to transform between native character sets
and char16_t or char32_t.

Section 22.5 of the standard specifies the "standard code conversion
facets". It contains:

  • Facet codecvt_utf8, which converts between a UTF-8 coded char
    buffer and UCS2 or UCS4, so it can be used to transform between
    UTF-8 and char32_t. It is not usable for transformation between
    UTF-8 and char16_t because char16_t strings are coded as
    UTF-16, not UCS2.

  • Facet codecvt_utf8_utf16 converts between UTF-8 and UTF-16, so it
    can be used to transform between UTF-8-coded char and char16_t.

  • Facet codecvt_utf16 looks like a hack from the 20th century to
    adapt UTF-16 wide characters to character byte streams. The
    UTF-16-coded buffer is addressed through a char*.

So we can use these standard code conversion facets to convert between
UTF-8-coded char and char16_t or char32_t. We could even
transform between char16_t and char32_t by going through an
intermediate UTF-8-coded char buffer. But again, no
transformation between native character sets and char16_t or
char32_t.

The standard does provide an interface for transformation between
native char and char16_t or char32_t. It is hidden at the very
end of section 21.7 in the string library and consists of the four
functions mbrtoc16(), c16rtomb(), mbrtoc32() and c32rtomb().
Unfortunately, a locale cannot be specified for these functions.
They always work with the currently active global locale.

For the following things I know of no standard-conforming procedure:

  • Conversion between wchar_t and char16_t or char32_t. Clearly,
    going through native char is not an option, unless it happens to
    use UTF-8 encoding. I worked around this by assuming that the native
    wchar_t type uses Unicode coding and is equivalent to either
    char16_t or char32_t, depending on sizeof(wchar_t). This
    assumption holds for the systems I have access to, but there may be
    other wchar_t encodings in use. Older Windows systems used UCS2,
    where this assumption would not hold, but modern versions of Windows
    use UTF-16.

  • Locale-dependent conversion: especially for output streams where a
    locale is known that may not be the global locale, it would be
    useful if the locale could be passed as an argument to the character
    transformation. Unfortunately the standard library does not provide
    a portable way to do this, so I left it out.
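
The wchar_t assumption from the first item can be written down as a
compile-time type selection. The names wchar_equivalent and to_equivalent
are hypothetical; this only sketches the idea of choosing by
sizeof(wchar_t) and is not taken from the library source.

```cpp
#include <type_traits>

namespace sketch {

// Pick the Unicode character type that has the same size as wchar_t.
typedef std::conditional<sizeof(wchar_t) == sizeof(char16_t),
                         char16_t, char32_t>::type wchar_equivalent;

// Fails to compile on a platform where wchar_t matches neither type.
static_assert(sizeof(wchar_equivalent) == sizeof(wchar_t),
              "wchar_t is assumed to be a UTF-16 or UTF-32 code unit");

// Under that assumption a wide character is reinterpreted, not converted.
inline wchar_equivalent to_equivalent(wchar_t c) {
    return static_cast<wchar_equivalent>(c);
}

} // namespace sketch
```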

At first I thought I would reuse the standard-specified UTF coders, but
they are incomplete (going from UTF-16 to UTF-32 via an
intermediate representation as UTF-8 is not very attractive), they
are distributed over several different interfaces, and GCC
does not have them. So as a first step I implemented a transformation
between UTF-coded char, char16_t, char32_t and wchar_t. The
function charcvt_utf(state, from, from_end, from_next, to, to_end,
to_next) uses the API of the member functions in() and out() of
codecvt. The state holds the conversion state, from and
from_end delimit the source character buffer, to and to_end
delimit the destination character buffer, and on exit from the
function from_next and to_next point past the last converted
character. Like in() and out(), charcvt_utf() returns
std::codecvt_base::ok on success and std::codecvt_base::partial if
the output buffer was too small to convert the entire input buffer, or
if the input buffer ended in a part of a multibyte sequence. For each
invalid encoding in the input buffer, the replacement character
U+FFFD is inserted into the output stream. If this is not wanted,
function charcvt_utf_strict() can be used, which has the same
interface as charcvt_utf() but returns std::codecvt_base::error on
encoding errors, with from_next pointing to the first element of the
invalid sequence. Both functions are specialized on all combinations
of char, wchar_t, char16_t and char32_t. They always convert
(even if input and output type is the same, in which case they check
the validity of the encoding). They are able to consume and produce
single characters forming part of multibyte Unicode characters.

Based on the UTF coder, a locale-dependent converter has been
implemented. The function charcvt(state, from, from_end, from_next,
to, to_end, to_next) also uses the API of the member functions in()
and out() of codecvt: The state holds the conversion state,
from and from_end delimit the source character buffer, to and
to_end delimit the destination character buffer, and on exit from
the function from_next and to_next point past the last converted
character. This implementation uses the assumption that wchar_t is
encoded as either UTF-16 or UTF-32. Transformations between wide
characters wchar_t, char16_t and char32_t use UTF
transformations, transformations between char and wide characters
treat char as a native locale-defined character code.

Convenience functions for the code transformation are:

  • std::basic_string<charT> strcvt<std::basic_string<charT>>(const C*
    from, std::size_t size) converts the size characters at from
    into a string of character type charT, which is specified as a
    template parameter.

  • std::basic_string<charT> strcvt<std::basic_string<charT>>(const C*
    from) converts the characters of the null-terminated string from
    into a string of character type charT, which is specified as a
    template parameter.

  • std::basic_string<charT> strcvt<std::basic_string<charT>>(const
    std::basic_string<C>& from) converts the characters of the string
    from into a string of character type charT, which is specified
    as a template parameter. Specializations for charT == C move the
    string argument to the return value (if the argument is an rvalue)
    or return the reference to the argument (if the argument is an
    lvalue). This is useful in generic code where the concrete types
    are not known.

  • strcvt(const C* from) (strcvt() without template parameter
    specifying the destination string type) creates an object which
    can be converted back to a const C* or to any string type. The
    result can e.g. be assigned to any string type, or passed to a
    function expecting any string.

  • strcvt(const std::basic_string<C>& from) (strcvt() without template
    parameter specifying the destination string type) casts the type of
    the argument to a reference to a class derived from
    std::basic_string<C>, which adds conversion operators to all string
    types. The result can e.g. be assigned to any string type, or
    passed to a function expecting any string.

License

Copyright (c) 2014 Ruediger Helsch; All rights reserved

Permission to use, copy, modify, and distribute this software for any
purpose and without fee is hereby granted. The author disclaims all
warranties with regard to this software.

