StrCvt - Character type converter for C++11

Introduction

C++11 has the new character types char16_t and char32_t, which use the UTF-16 and UTF-32 Unicode encoding. The question is how to convert between the four character types of C++.

This library converts between the different character types of C++.

  • Conversion between the types char, wchar_t, char16_t and char32_t in all combinations.

  • Implements the API of the C++ codecvt member functions in() and out() for stream conversion, and convenience functions for string conversion.

  • Locale-dependent conversion and UTF conversion.

Installation

Unpack the distribution to a directory on your local machine. You can include the headers in subdirectory include/strcvt from your program. To make inclusion of the headers easier, it is recommended to add the subdirectory include of the distribution to the include file search path of the compiler. This is commonly achieved with the option -I/path/to/strcvt/include (assuming that the strcvt distribution has been unpacked to directory /path/to/strcvt). Then you can include the code converter headers through their standard names like "strcvt/strcvt.h".

Usage

In order to use the code converter, its headers have to be included. They are:


    // The main header defining the strcvt() string conversion functions
    #include "/path/to/strcvt/include/strcvt/strcvt.h"
    // The main header defining the u8string type
    #include "/path/to/strcvt/include/strcvt/u8string.h"
    // The header defining the strcvt_utf() UTF string conversion functions
    #include "/path/to/strcvt/include/strcvt/strcvt_utf.h"
    // The header defining the charcvt() stream conversion functions
    #include "/path/to/strcvt/include/strcvt/charcvt.h"
    // The header defining the charcvt_utf() UTF stream conversion functions
    #include "/path/to/strcvt/include/strcvt/charcvt_utf.h"
    // The header defining the strcvt_iterator
    #include "/path/to/strcvt/include/strcvt/strcvt_iterator.h"
    // The header defining codecvt_utf
    #include "/path/to/strcvt/include/strcvt/codecvt_utf.h"

    // Optional header defining output operators
    #include "/path/to/strcvt/include/strcvt/strcvt_operators.h"

    // Import symbols into the user's namespace
    using namespace StrCvt;

If the strcvt include directory /path/to/strcvt/include has been added to the include file search path of the compiler, e.g. using the compiler option -I/path/to/strcvt/include, this reduces to:


    // The main header defining the strcvt() string conversion functions
    #include "strcvt/strcvt.h"
    // The main header defining the u8string type
    #include "strcvt/u8string.h"
    // The header defining the strcvt_utf() UTF string conversion functions
    #include "strcvt/strcvt_utf.h"
    // The header defining the charcvt() stream conversion functions
    #include "strcvt/charcvt.h"
    // The header defining the charcvt_utf() UTF stream conversion functions
    #include "strcvt/charcvt_utf.h"
    // The header defining the strcvt_iterator
    #include "strcvt/strcvt_iterator.h"
    // The header defining codecvt_utf
    #include "strcvt/codecvt_utf.h"

    // Optional header defining output operators
    #include "strcvt/strcvt_operators.h"

    // Import symbols into the user's namespace
    using namespace StrCvt;

The strcvt() functions are exported through namespace StrCvt. It is recommended to make them available via a using namespace directive as in the example above. Alternatively, the functions strcvt(), strcvt_utf() etc. can be imported separately through using declarations:


    using StrCvt::strcvt;
    using StrCvt::strcvt_utf;
    using StrCvt::strcvt_utf_strict;
    using StrCvt::make_strcvt_iterator;

The character type converter is pre-configured for header-only use. This means: Just include the header and you are done. In order to reduce space overhead and compilation time, a precompiled library can be used. See section Creating a Library.

String conversion

Function strcvt<dest_string_type>(source) converts strings from the source character type to the destination string type. The dest_string_type can be std::string, std::wstring, std::u16string or std::u32string. An additional string type StrCvt::u8string with UTF-8 encoding is also provided; see section UTF-8 string type u8string. Example usage:


    #include "strcvt/strcvt.h"

    // ...

    using namespace StrCvt;

    // Convert const char* C string to wide wchar_t string
    std::wstring w = strcvt<std::wstring>("Hello, world!");
    // Convert wide string to string of different character type
    std::string s = strcvt<std::string>(w);
    // Convert string to char16_t string
    std::u16string s16 = strcvt<std::u16string>(s);
    // The source buffer can also be specified as (pointer, size):
    std::cout << strcvt<std::string>(&s16[0], 5) << std::endl;

The source string can be specified as:

  • a character pointer of any character type (C-style null-terminated string),

  • a C++ string of arbitrary character type, or

  • a character pointer and a size.

The destination string type is specified as a template argument to strcvt(). The function is specialized on all combinations of source and destination character types. So the full signatures are:


    template<class charT, class source_charT>
    std::basic_string<charT>
    strcvt<std::basic_string<charT>>(const std::basic_string<source_charT>& source);

    template<class charT, class source_charT>
    std::basic_string<charT>
    strcvt<std::basic_string<charT>>(const source_charT* source);

    template<class charT, class source_charT>
    std::basic_string<charT>
    strcvt<std::basic_string<charT>>(const source_charT* source, std::size_t size);

The function with string argument dest_string strcvt<dest_string>(source_string source) is overloaded for the case dest_string == source_string: it moves its argument to the result (for rvalue arguments) or returns a reference to its argument (for lvalue arguments). This is useful in generic code where the concrete types are not known.
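The same-type behaviour described above can be pictured with a pair of overloads. This is only a sketch of the pattern, assuming a hypothetical function name same_type_cvt; it is not taken from the StrCvt headers.

```cpp
#include <string>
#include <utility>

// Hypothetical stand-in for the same-type case; not the StrCvt source.

// Lvalue argument: nothing to convert, so hand back a reference.
template<class charT>
const std::basic_string<charT>&
same_type_cvt(const std::basic_string<charT>& source)
{
    return source;
}

// Rvalue argument: move the buffer into the result instead of copying it.
template<class charT>
std::basic_string<charT>
same_type_cvt(std::basic_string<charT>&& source)
{
    return std::move(source);
}
```

With these overloads, same_type_cvt(s) for a named string s yields a reference to s itself, while same_type_cvt(std::string("xyz")) takes over the temporary's buffer.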

It is also possible to omit the template argument determining the destination string type. In this case, strcvt() does no conversion at all, but returns an object which can be converted to all string types.


    // strcvt() returns an object which can be assigned to any string type
    std::wstring w = strcvt("Hello, world!");
    std::u16string s16 = strcvt(w);

For string arguments, the return value of strcvt() is a subclass of the class of its argument. The subclass adds conversion operators to all string types. For character pointer arguments, the return value of strcvt() is an object which can be converted to a character pointer (of the character type of the argument) or to any string type. In each case, the return value of strcvt() can e.g. be assigned to any string type, or can be passed as argument to a function expecting a string.
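One way to picture an object that "can be converted to all string types" is a class with a conversion operator template. The following is a simplified sketch under stated assumptions: the class name ascii_any_string is hypothetical, and the "conversion" merely widens each char, which is only correct for 7-bit ASCII input. The real strcvt() result performs a proper character conversion.

```cpp
#include <string>

// Hypothetical proxy, not the StrCvt class. It stores the source pointer
// and converts only when a destination string type is requested.
class ascii_any_string {
public:
    explicit ascii_any_string(const char* s) : text_(s) {}

    // Conversion operator template: the compiler deduces the destination
    // character type from the context of the assignment or call.
    template<class charT>
    operator std::basic_string<charT>() const
    {
        std::basic_string<charT> result;
        for (const char* p = text_; *p != '\0'; ++p)
            result.push_back(
                static_cast<charT>(static_cast<unsigned char>(*p)));
        return result;
    }

private:
    const char* text_;
};
```

With this proxy, std::u16string a = ascii_any_string("Hi"); and std::wstring w = ascii_any_string("Hi"); both compile, and each selects a different instantiation of the operator.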

UTF-8 string type u8string

In addition to the four standard string types std::string, std::wstring, std::u16string and std::u32string provided by the C++ standard library, header strcvt/u8string.h defines an additional string type StrCvt::u8string. Like the locale-dependent string type std::string it has character type char, but it always uses a locale-independent UTF-8 encoding. One important use case is a string initialized with C++11's u8"string" UTF-8 string literal. If such a literal is used to initialize a std::string, strcvt() will wrongly assume it is a string of locale-dependent character type. Example usage:


    #include "strcvt/u8string.h"
    #include "strcvt/strcvt.h"

    // ...

    using namespace StrCvt;

    u8string us(u8"Hello, World");
    std::cout << strcvt<std::string>(us) << std::endl;

In the example above, the u8string is initialized using the C++11 UTF-8 string literal u8"string". For output to the locale-dependent standard output stream, it is converted from UTF-8 to the locale-dependent char encoding.

UTF string conversion

In the examples above, strcvt() can be replaced with strcvt_utf(), which ignores the locale and always performs a UTF transformation. The header is "strcvt/strcvt_utf.h".

It is also possible to configure strcvt() to only use UTF conversions instead of the locale dependent character conversions. Open the header strcvt/strcvt_config.h with an editor and change the preprocessor symbol STRCVT_IMPL_UTF8_ONLY from 0 to 1.

Locale-dependent stream conversion

The character type stream converter function charcvt(state, from, from_end, from_next, to, to_end, to_next) is defined in header strcvt/charcvt.h and uses the API of the member functions in() and out() of the standard code conversion interface std::codecvt:


    #include "strcvt/charcvt.h"

    // ...

    using namespace StrCvt;

    std::mbstate_t state = std::mbstate_t(); // Zero-initialize
    // Convert between character types:
    result r = charcvt(state, from, from_end, from_next, to, to_end, to_next);

    // ... More calls of charcvt()

    // Return output to initial shift state
    r = charcvt_unshift<source_charT>(state, to, to_end, to_next);

The state must be of type std::mbstate_t. It holds the conversion state between successive calls of charcvt() and must be explicitly zero-initialized before the first use, as shown above. The arguments from and from_end delimit the source character buffer, to and to_end delimit the destination character buffer, and on exit from the function from_next and to_next point past the last converted character. The function returns std::codecvt_base::ok on success and std::codecvt_base::partial if the output buffer was too small to convert the entire input buffer, or if the input buffer ended in the middle of a multibyte sequence.

After the input has been completely converted, possibly by multiple calls to charcvt(), charcvt_unshift() must be called to move the output to the initial shift state. The source character type must be specified as a template argument to charcvt_unshift(), since different converters are used depending on the source character type. The function charcvt_unshift() should be called even if the character encoding is known not to be state-dependent, like UTF-8: it checks whether trailing incomplete Unicode character input sequences are pending, and appends a replacement character to the output buffer if necessary to signal the presence of trailing garbage.

The function charcvt() is specialized for all combinations of source and destination character types char, wchar_t, char16_t and char32_t. It converts between the locale-dependent character type char, the Unicode UTF-16 and UTF-32 character types char16_t and char32_t, and the implementation-defined wide character type wchar_t, which is assumed to be equivalent to either char16_t or char32_t. It is able to consume or deliver single code units of Unicode characters that span several code units.

The full signature of charcvt() is:


    template<class src_charT, class dst_charT>
    std::codecvt_base::result
    charcvt(std::mbstate_t& state,
            const src_charT* from, const src_charT* from_end,
            const src_charT*& from_next,
            dst_charT* to, dst_charT* to_end, dst_charT*& to_next,
            int flags = 0);

The flags can be omitted, or they can be set to the constant strcvt_flags_no_partial_conversions to prevent partial conversions. If the flag is omitted or zero, charcvt() is eager and consumes even single partial multibyte characters.
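Since charcvt() borrows the calling convention of in() and out(), the buffer handling can be illustrated with the standard facet itself. The sketch below does not call StrCvt at all: it drives std::codecvt<wchar_t, char, std::mbstate_t>::in() from the classic "C" locale through a deliberately small output buffer, and assumes plain ASCII input. The function name widen_in_chunks is hypothetical.

```cpp
#include <cwchar>
#include <locale>
#include <string>

// Sketch of the from/from_next/to/to_next protocol using the standard
// facet; ASCII input is assumed because the classic locale is used.
std::wstring widen_in_chunks(const std::string& source)
{
    typedef std::codecvt<wchar_t, char, std::mbstate_t> facet_type;
    const facet_type& cvt =
        std::use_facet<facet_type>(std::locale::classic());

    std::mbstate_t state = std::mbstate_t();   // Zero-initialize
    std::wstring result;
    const char* from = source.data();
    const char* from_end = from + source.size();

    while (from != from_end) {
        wchar_t buffer[4];                     // Deliberately small
        const char* from_next = from;
        wchar_t* to_next = buffer;
        std::codecvt_base::result r =
            cvt.in(state, from, from_end, from_next,
                   buffer, buffer + 4, to_next);
        result.append(buffer, to_next);        // Keep what was produced
        if (r == std::codecvt_base::error || from_next == from)
            break;                             // Invalid input or no progress
        from = from_next;                      // Resume after consumed input
    }
    return result;
}
```

Each pass through the loop picks up where from_next left off, so a result of partial caused by the small output buffer simply leads to another call. A loop around charcvt() would presumably follow the same shape, ending with the charcvt_unshift() call described above.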

UTF stream conversion

The function charcvt_utf(state, from, from_end, from_next, to, to_end, to_next) is defined in header strcvt/charcvt_utf.h and works like charcvt() but treats the type char as having UTF-8 encoding instead of a locale dependent implementation defined encoding:


    #include "strcvt/charcvt_utf.h"

    // ...

    using namespace StrCvt;

    std::mbstate_t state = std::mbstate_t(); // Zero-initialize
    // Convert between character types:
    charcvt_utf(state, from, from_end, from_next, to, to_end, to_next);

    // ... More calls of charcvt_utf()

    // Return output to initial shift state
    charcvt_utf_unshift(state, to, to_end, to_next);

The state must be of type std::mbstate_t. It holds the conversion state between successive calls of charcvt_utf() and must be explicitly zero-initialized before the first use, as shown above. The arguments from and from_end delimit the source character buffer, to and to_end delimit the destination character buffer, and on exit from the function from_next and to_next point past the last converted character. The function returns std::codecvt_base::ok on success and std::codecvt_base::partial if the output buffer was too small to convert the entire input buffer, or if the input buffer ended in the middle of a multibyte sequence. After the input has been completely converted, possibly by multiple calls to charcvt_utf(), call charcvt_utf_unshift(), which checks whether trailing incomplete Unicode character input sequences are pending, and appends a replacement character to the output buffer if necessary to signal the presence of trailing garbage.

The function charcvt_utf() is specialized for all combinations of source and destination character types char, wchar_t, char16_t and char32_t. It converts between UTF-8-coded char, the Unicode UTF-16 and UTF-32 character types char16_t and char32_t, and the implementation-defined wide character type wchar_t, which is assumed to be equivalent to either char16_t or char32_t. Contrary to member functions in() and out() of codecvt, charcvt_utf() always converts (even if the input character type is the same as the output character type, in which case it checks the validity of the encoding). It always generates valid UTF-8, UTF-16 or UTF-32 output sequences.

For each invalid encoding in the input buffer, charcvt_utf() inserts the replacement character U+FFFD into the output buffer. If this is not wanted, function charcvt_utf_strict() can be used, which has the same interface as charcvt_utf() but returns std::codecvt_base::error on encoding errors, with from_next pointing to the first element of the invalid sequence.
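To make the replacement behaviour concrete, here is a small stand-alone UTF-8 to UTF-32 decoder that substitutes U+FFFD for invalid input. It is a sketch, not the StrCvt implementation: the name decode_utf8_lenient is hypothetical, the whole input is processed in one call, and the sketch emits one replacement character per rejected byte. How charcvt_utf() groups invalid bytes is not specified in this document.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not part of StrCvt: decode UTF-8 to UTF-32 and
// substitute U+FFFD for every byte that cannot start a valid sequence.
std::u32string decode_utf8_lenient(const std::string& in)
{
    const char32_t replacement = 0xFFFD;
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        std::size_t len;
        char32_t cp;
        char32_t min;   // Smallest code point allowed for this length
        if (b < 0x80)                { len = 1; cp = b;        min = 0; }
        else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; min = 0x10000; }
        else {
            // Stray continuation byte or invalid lead byte
            out.push_back(replacement);
            ++i;
            continue;
        }
        bool ok = (i + len <= in.size());   // Truncated sequence?
        for (std::size_t k = 1; ok && k < len; ++k) {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                ok = false;                 // Not a continuation byte
            else
                cp = (cp << 6) | (c & 0x3F);
        }
        // Reject overlong forms, surrogates and out-of-range values
        if (ok && (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)))
            ok = false;
        if (ok) {
            out.push_back(cp);
            i += len;
        } else {
            out.push_back(replacement);
            ++i;                            // Resynchronize at the next byte
        }
    }
    return out;
}
```

A strict variant in the spirit of charcvt_utf_strict() would stop and report an error at the point where this sketch pushes the replacement character.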

There is also a UTF code conversion facet defined in header strcvt/codecvt_utf.h


    class codecvt_utf<intern_charT, extern_charT>;

This code conversion facet uses charcvt_utf() to convert between UTF coded characters intern_charT and extern_charT, and is specialized for all character type combinations.

Character type converting character iterator

Instead of converting the entire buffer, it can be accessed through a converting character iterator. The iterator is created by function make_strcvt_iterator():


    #include "strcvt/strcvt_iterator.h"

    // ...

    using namespace StrCvt;

    // Create char32_t iterator for access to source C string
    auto it = make_strcvt_iterator<char32_t>("Hello, world!");

The end iterator is returned by make_strcvt_iterator() without arguments, or it can be obtained from member function end() of the iterator. Since member function begin() is also implemented, and returns the iterator itself, the iterator can be used just like a container, for example with the range-based for statement:


    std::u32string w;
    // Use range-based for statement: process all Unicode characters
    for (char32_t c: it)
        w.push_back(c);

    // Alternatively, use simple iterator interface
    auto e = it.end();
    for (auto j = it; j != e; ++j)
        w.push_back(*j);
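The "iterator that is its own range" idea can be sketched independently of the library. The class below is a hypothetical illustration, not strcvt_iterator: it decodes UTF-16 code units from a std::u16string into char32_t code points on demand, treats a default-constructed object as the end iterator, and offers begin() and end() so it works in a range-based for statement.

```cpp
#include <string>

// Hypothetical illustration, not the StrCvt class. The source string
// must outlive the iterator.
class utf16_decoding_iterator {
public:
    // A default-constructed iterator serves as the end iterator.
    utf16_decoding_iterator()
        : pos_(nullptr), end_(nullptr), current_(0), done_(true) {}

    explicit utf16_decoding_iterator(const std::u16string& source)
        : pos_(source.data()), end_(source.data() + source.size()),
          current_(0), done_(false)
    {
        advance();   // Decode the first code point
    }

    char32_t operator*() const { return current_; }
    utf16_decoding_iterator& operator++() { advance(); return *this; }

    bool operator==(const utf16_decoding_iterator& other) const
    {
        if (done_ || other.done_)
            return done_ == other.done_;
        return pos_ == other.pos_;
    }
    bool operator!=(const utf16_decoding_iterator& other) const
    {
        return !(*this == other);
    }

    // begin() returns a copy of the iterator itself.
    utf16_decoding_iterator begin() const { return *this; }
    utf16_decoding_iterator end() const { return utf16_decoding_iterator(); }

private:
    void advance()
    {
        if (pos_ == end_) { done_ = true; return; }
        char32_t unit = *pos_++;
        if (unit >= 0xD800 && unit <= 0xDBFF && pos_ != end_
            && *pos_ >= 0xDC00 && *pos_ <= 0xDFFF) {
            // Combine a surrogate pair into one code point
            char32_t low = *pos_++;
            unit = 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00);
        } else if (unit >= 0xD800 && unit <= 0xDFFF) {
            unit = 0xFFFD;   // Unpaired surrogate
        }
        current_ = unit;
    }

    const char16_t* pos_;
    const char16_t* end_;
    char32_t current_;
    bool done_;
};
```

Given a std::u16string src, the loop for (char32_t c : utf16_decoding_iterator(src)) visits one code point per iteration, including characters outside the Basic Multilingual Plane.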

The converting iterator is (indirectly) derived from base class strcvt_iterator_base<charT> which does not depend on the source iterator type or the source character type. It has a virtual destructor and can be used polymorphically. The inheritance hierarchy looks like this:


    template<class charT>
    class strcvt_iterator_base {
        // Operators *(), ++(), ==()
        // Member functions begin(), end()
    };

    template<class charT, class source_charT>
    class strcvt_iterator_impl : public strcvt_iterator_base<charT> {
        // Implements operator ++() which does the conversion using
        // virtual member function get_next_source_char()
    };

    template<class charT, class SourceIterator>
    class strcvt_iterator
        : public strcvt_iterator_impl<charT, typename std::iterator_traits<SourceIterator>::value_type> {
        // Implements get_next_source_char()
    };

Class template strcvt_iterator_base does not know the source character type and can be used to handle a strcvt_iterator polymorphically. Class template strcvt_iterator_impl knows the source character type but not the type of the iterator. If the precompiled library is used, it is precompiled for all combinations of source and destination character type. Class template strcvt_iterator is the return value of function make_strcvt_iterator(). It implements the source iterator handling.

Operators

Header "strcvt/strcvt_operators.h" defines some operators for strings: the output operators << for strings and character pointers, and the appending += operators for strings. The operators reside in namespace StrCvt and should be imported into the user's namespace through a using namespace directive:


    #include "strcvt/strcvt_operators.h"
    using namespace StrCvt;

    std::cout << U"Hello, world!\n";

It would be nice if we could define conversion operators to enable assignment between different string types. But in C++, conversion and assignment operators can only be defined as member functions.

Creating a Library

The character type converter is preconfigured for header-only use. This means: Just include the header and you are done. In order to reduce space overhead and compilation time, a precompiled library can be used.

The main advantage in using a library is that each time the strcvt headers are included, the compiler does not need to look at the implementation details. This can speed up compilation significantly.

To create the library, the C++ source files libstrcvt.cpp, libstrcvt_iterator.cpp, libstrcvt_utf.cpp and libcodecvt_utf.cpp in directory lib of the distribution must be compiled. Under Linux, just run make. Before compiling, you may want to select the compiler to use: uncomment the proper CXX= line in the toplevel Makefile.template. Running make should create a library lib/libxprintf.a, which has to be linked to the programs.

In Visual C++, instead of building a library, you may just add the library source files to your project.

In order to make the headers use the library, you must open the header strcvt/strcvt_config.h with an editor and change the preprocessor symbol STRCVT_IMPL_USE_LIBRARY from 0 to 1. The next time the header is included, the library will be used. You can check that the library is used as intended by omitting the library when linking. Linking should fail with missing externals.

In order to run the tests, the headers for the boost test framework are required.

Compatibility

The code converter has been tested with:

  • GCC 4.7, 4.8 and 4.9 on Linux
  • Visual Studio Express 2013 for Windows Desktop with November 2013 CTP
  • Intel C++ 14.0.2 on Linux
  • Clang 3.5.0 on Linux

Rationale

C++11 has the new character types char16_t and char32_t, which use the UTF-16 and UTF-32 Unicode encoding. The question is how to convert between the four character types of C++.

The codecvt part of the C++11 library looks like some ruins left over at the front line between warring factions.

The codecvt class template is part of the header <locale> and described in section 22.4.1.4 of the standard. The first thing to note is that according to this specification codecvt transforms between an internal and an external character encoding, so it is not intended to transform between internal character encodings like char16_t and wchar_t.

According to the standard, each locale shall have specializations of codecvt for transformation between the internal character types char, wchar_t, char16_t and char32_t and the external character type char. The standard first says that these specializations "convert the implementation-defined native character set", only to continue specifying that

  • the specialization with external and internal character type both char must not convert at all (effectively saying that the internal char must have the same encoding as the external char),

  • the specializations with internal character types char16_t and char32_t must treat the external character code as UTF-8, so they are explicitly not allowed to treat it as an "implementation-defined native character set".

This leaves the transformation between internal character type wchar_t and external character type char as the only locale-dependent transformation, and it is specified to convert "between the native character sets for narrow and wide characters". So no luck using codecvt to transform between native character sets and char16_t or char32_t.

Section 22.5 of the standard specifies the "standard code conversion facets". It contains:

  • Facet codecvt_utf8, which converts between a UTF-8 coded char buffer and UCS2 or UCS4, so it can be used to transform between UTF-8 and char32_t. It is not usable for transformation between UTF-8 and char16_t because char16_t strings are coded as UTF-16, not UCS2.

  • Facet codecvt_utf8_utf16 converts between UTF-8 and UTF-16, so it can be used to transform between UTF-8-coded char and char16_t.

  • Facet codecvt_utf16 looks like a hack from the 20th century to adapt UTF-16 wide characters to character byte streams. The UTF-16-coded buffer is addressed through a char*.

So we can use these standard code conversion facets to convert between UTF-8-coded char and char16_t or char32_t. We could even transform between char16_t and char32_t by going through an intermediate UTF-8-coded char buffer. But again, no transformation between native character sets and char16_t or char32_t.

The standard does provide an interface for transformation between native char and char16_t or char32_t. It is hidden at the very end of section 21.7 in the string library and consists of the four functions mbrtoc16(), c16rtomb(), mbrtoc32() and c32rtomb(). Unfortunately, a locale cannot be specified for these functions. They always work with the currently active global locale.
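For reference, this is roughly how the standard functions are used. The sketch converts a string in the multibyte encoding of the current global C locale to UTF-32 with std::mbrtoc32() from <cuchar>; the wrapper name native_to_utf32 is hypothetical, and the sketch simply stops at the first invalid or truncated sequence.

```cpp
#include <cstddef>
#include <cuchar>
#include <cwchar>
#include <string>

// Convert from the current global C locale's multibyte encoding to
// UTF-32. There is no parameter for choosing a different locale.
std::u32string native_to_utf32(const std::string& source)
{
    std::mbstate_t state = std::mbstate_t();   // Zero-initialize
    std::u32string result;
    const char* p = source.data();
    const char* end = p + source.size();
    while (p != end) {
        char32_t c;
        std::size_t n = std::mbrtoc32(
            &c, p, static_cast<std::size_t>(end - p), &state);
        if (n == static_cast<std::size_t>(-1)
            || n == static_cast<std::size_t>(-2))
            break;                 // Invalid or truncated input
        if (n == static_cast<std::size_t>(-3)) {
            result.push_back(c);   // Output without consuming input
            continue;
        }
        if (n == 0)
            n = 1;                 // A null character was converted
        result.push_back(c);
        p += n;
    }
    return result;
}
```

The result depends on whatever locale std::setlocale() last selected, which is exactly the limitation noted above.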

For the following things I know of no standard-conforming procedure:

  • Conversion between wchar_t and char16_t or char32_t. Clearly, going through native char is not an option, unless it happens to use UTF-8 encoding. I worked around this by assuming that the native wchar_t type uses Unicode coding and is equivalent to either char16_t or char32_t, depending on sizeof(wchar_t). This assumption holds for the systems I have access to, but there may be other wchar_t encodings in use. Older Windows systems used UCS2, where this assumption would not hold, but modern versions of Windows use UTF-16.

  • Locale-dependent conversion: especially for output streams where a locale is known that may not be the global locale, it would be useful if the locale could be passed as an argument to the character transformation. Unfortunately the standard library does not provide a portable way to do this, so I left it out.
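The sizeof(wchar_t) workaround from the first item can be sketched as a compile-time type selection. The names wchar_equivalent and as_unicode_units are hypothetical and not taken from the StrCvt headers; the code is only meaningful under the stated assumption that wchar_t holds UTF-16 or UTF-32 code units.

```cpp
#include <string>
#include <type_traits>

// Fail early on platforms where the size assumption does not hold.
static_assert(sizeof(wchar_t) == sizeof(char16_t)
              || sizeof(wchar_t) == sizeof(char32_t),
              "wchar_t is assumed to be UTF-16 or UTF-32 sized");

// Hypothetical alias: the Unicode character type with the size of wchar_t.
typedef std::conditional<sizeof(wchar_t) == sizeof(char16_t),
                         char16_t, char32_t>::type wchar_equivalent;

// Copy the code units of a wide string into the equivalent Unicode
// string type without changing their values.
std::basic_string<wchar_equivalent> as_unicode_units(const std::wstring& source)
{
    std::basic_string<wchar_equivalent> result;
    for (wchar_t c : source)
        result.push_back(static_cast<wchar_equivalent>(c));
    return result;
}
```

On a typical Linux system wchar_equivalent resolves to char32_t, on Windows to char16_t.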

At first I thought I would reuse the standard-specified UTF coders. But firstly they are incomplete (going from UTF-16 to UTF-32 via an intermediate representation as UTF-8 is not very attractive), secondly they are distributed over several different interfaces, and thirdly GCC does not have them. So as a first step I implemented a transformation between UTF-coded char, char16_t, char32_t and wchar_t.

The function charcvt_utf(state, from, from_end, from_next, to, to_end, to_next) uses the API of the member functions in() and out() of codecvt. The state holds the conversion state, from and from_end delimit the source character buffer, to and to_end delimit the destination character buffer, and on exit from the function from_next and to_next point past the last converted character. Like in() and out(), charcvt_utf() returns std::codecvt_base::ok on success and std::codecvt_base::partial if the output buffer was too small to convert the entire input buffer, or if the input buffer ended in the middle of a multibyte sequence. For each invalid encoding in the input buffer, the replacement character U+FFFD is inserted into the output stream. If this is not wanted, function charcvt_utf_strict() can be used, which has the same interface as charcvt_utf() but returns std::codecvt_base::error on encoding errors, with from_next pointing to the first element of the invalid sequence. Both functions are specialized on all combinations of char, wchar_t, char16_t and char32_t. They always convert (even if the input and output types are the same, in which case they check the validity of the encoding). They are able to consume and produce single code units forming part of multi-unit Unicode characters.

Based on the UTF coder, a locale-dependent converter has been implemented. The function charcvt(state, from, from_end, from_next, to, to_end, to_next) also uses the API of the member functions in() and out() of codecvt: The state holds the conversion state, from and from_end delimit the source character buffer, to and to_end delimit the destination character buffer, and on exit from the function from_next and to_next point past the last converted character. This implementation uses the assumption that wchar_t is encoded as either UTF-16 or UTF-32. Transformations between wide characters wchar_t, char16_t and char32_t use UTF transformations, transformations between char and wide characters treat char as a native locale-defined character code.

Convenience functions for the code transformation are:

  • std::basic_string<charT> strcvt<std::basic_string<charT>>(const C* from, std::size_t size) converts the size characters at from into a string of character type charT, which is specified as a template parameter.

  • std::basic_string<charT> strcvt<std::basic_string<charT>>(const C* from) converts the characters of the null-terminated string from into a string of character type charT, which is specified as a template parameter.

  • std::basic_string<charT> strcvt<std::basic_string<charT>>(const std::basic_string<C>& from) converts the characters of the string from into a string of character type charT, which is specified as a template parameter. Specializations for charT == C move the string argument to the return value (if the argument is an rvalue) or return the reference to the argument (if the argument is an lvalue). This is useful in generic code where the concrete types are not known.

  • strcvt(const C* from) (strcvt() without template parameter specifying the destination string type) creates an object which can be converted back to a const C* or to any string type. The result can e.g. be assigned to any string type, or passed to a function expecting any string.

  • strcvt(const std::basic_string<C>& from) (strcvt() without template parameter specifying the destination string type) casts the type of the argument to a reference to a class derived from std::basic_string<C>, which adds conversion operators to all string types. The result can e.g. be assigned to any string type, or passed to a function expecting any string.

License

Copyright (c) 2014 Ruediger Helsch; All rights reserved

Permission to use, copy, modify, and distribute this software for any purpose and without fee is hereby granted. The author disclaims all warranties with regard to this software.

Source: README.md, updated 2014-07-02