Name | Modified | Size | Downloads / Week |
---|---|---|---|
README.md | 2014-07-02 | 29.6 kB | |
strcvt-0.3.1.zip | 2014-07-02 | 78.4 kB | |
strcvt-0.3.1.tar.gz | 2014-07-02 | 60.6 kB | |
Totals: 3 Items | | 168.7 kB | 0 |
StrCvt - Character type converter for C++11
Contents
- Introduction
- Installation
- Usage
- String conversion
- UTF-8 string type u8string
- UTF string conversion
- Locale dependent stream conversion
- UTF stream conversion
- Character type converting iterator
- Operators
- Creating a Library
- Compatibility
- Rationale
- License
Introduction
C++11 has the new character types char16_t and char32_t, which use
the UTF-16 and UTF-32 Unicode encodings. The question is how to convert
between the four character types of C++.
This library converts between the different character types of C++.
- Conversion between the types char, wchar_t, char16_t and char32_t in all combinations.
- Implements the API of the C++ codecvt member functions in() and out() for stream conversion, and convenience functions for string conversion.
- Locale-dependent conversion and UTF conversion.
Installation
Unpack the distribution to a directory on your local machine. You can
include the headers in subdirectory include/strcvt from your
program. To make inclusion of the headers easier, it is recommended to
add the subdirectory include of the distribution to the include file
search path of the compiler. This is commonly achieved with the
option -I/path/to/strcvt/include (assuming that the strcvt
distribution has been unpacked to directory /path/to/strcvt). Then
you can include the code converter headers through their standard names
like "strcvt/strcvt.h".
Usage
In order to use the code converter, its headers have to be included. They are:
// The main header defining the strcvt() string conversion functions
#include "/path/to/strcvt/include/strcvt/strcvt.h"
// The main header defining the u8string type
#include "/path/to/strcvt/include/strcvt/u8string.h"
// The header defining the strcvt_utf() UTF string conversion functions
#include "/path/to/strcvt/include/strcvt/strcvt_utf.h"
// The header defining the charcvt() stream conversion functions
#include "/path/to/strcvt/include/strcvt/charcvt.h"
// The header defining the charcvt_utf() UTF stream conversion functions
#include "/path/to/strcvt/include/strcvt/charcvt_utf.h"
// The header defining the strcvt_iterator
#include "/path/to/strcvt/include/strcvt/strcvt_iterator.h"
// The header defining codecvt_utf
#include "/path/to/strcvt/include/strcvt/codecvt_utf.h"
// Optional header defining output operators
#include "/path/to/strcvt/include/strcvt/strcvt_operators.h"
// Import symbols into the user's namespace
using namespace StrCvt;
If the strcvt include directory /path/to/strcvt/include has been
added to the include file search path of the compiler, e.g. using the
compiler option -I/path/to/strcvt/include, this reduces to:
// The main header defining the strcvt() string conversion functions
#include "strcvt/strcvt.h"
// The main header defining the u8string type
#include "strcvt/u8string.h"
// The header defining the strcvt_utf() UTF string conversion functions
#include "strcvt/strcvt_utf.h"
// The header defining the charcvt() stream conversion functions
#include "strcvt/charcvt.h"
// The header defining the charcvt_utf() UTF stream conversion functions
#include "strcvt/charcvt_utf.h"
// The header defining the strcvt_iterator
#include "strcvt/strcvt_iterator.h"
// The header defining codecvt_utf
#include "strcvt/codecvt_utf.h"
// Optional header defining output operators
#include "strcvt/strcvt_operators.h"
// Import symbols into the user's namespace
using namespace StrCvt;
The strcvt() functions are exported through namespace StrCvt. It
is recommended to make them available via a using namespace
directive as in the example above. Alternatively, it is possible to
import the functions strcvt(), strcvt_utf() etc. separately
through using declarations:
using StrCvt::strcvt;
using StrCvt::strcvt_utf;
using StrCvt::strcvt_utf_strict;
using StrCvt::make_strcvt_iterator;
The character type converter is pre-configured for header-only use. This means: Just include the header and you are done. In order to reduce space overhead and compilation time, a precompiled library can be used. See section Creating a Library.
String conversion
Function strcvt<dest_string_type>(source) converts strings from the
source character type to the destination string type. The
dest_string_type can be std::string, std::wstring,
std::u16string or std::u32string. An additional string type
StrCvt::u8string with UTF-8 encoding is also provided; see section
UTF-8 string type u8string. Example usage:
#include "strcvt/strcvt.h"
// ...
using namespace StrCvt;
// Convert const char* C string to wide wchar_t string
std::wstring w = strcvt<std::wstring>("Hello, world!");
// Convert wide string to string of different character type
std::string s = strcvt<std::string>(w);
// Convert string to char16_t string
std::u16string s16 = strcvt<std::u16string>(s);
// The source buffer can also be specified as (pointer, size):
std::cout << strcvt<std::string>(&s16[0], 5) << std::endl;
The source string can be specified as:
- a character pointer of any character type (C-style null-terminated string),
- a C++ string of arbitrary character type, or
- a character pointer and a size.
The destination string type is specified as a template argument to strcvt(). The function is specialized on all combinations of source and destination character types. So the full signatures are:
template<class charT, class source_charT>
std::basic_string<charT>
strcvt<std::basic_string<charT>>(const std::basic_string<source_charT>& source);
template<class charT, class source_charT>
std::basic_string<charT>
strcvt<std::basic_string<charT>>(const source_charT* source);
template<class charT, class source_charT>
std::basic_string<charT>
strcvt<std::basic_string<charT>>(const source_charT* source, std::size_t size);
The function with string argument,
dest_string strcvt<dest_string>(source_string source), is overloaded
for the case dest_string == source_string: it moves its argument to
the result (for rvalue arguments) or returns a reference to its
argument (for lvalue arguments). This is useful in generic code where
the concrete types are not known.
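The same-type case described above can be pictured as a pair of overloads. This is only a sketch of the idea: the name same_type_cvt is hypothetical and is not part of the library, whose real overloads are selected through the strcvt template argument.

```cpp
#include <string>
#include <utility>

// Hypothetical name, used for illustration only.
// Lvalue argument: nothing is copied, the caller gets a reference
// to its own string back.
template<class charT>
const std::basic_string<charT>&
same_type_cvt(const std::basic_string<charT>& source) {
    return source;
}

// Rvalue argument: the buffer is moved into the result.
// Note that basic_string<charT>&& is a plain rvalue reference here,
// not a forwarding reference, so lvalues never select this overload.
template<class charT>
std::basic_string<charT>
same_type_cvt(std::basic_string<charT>&& source) {
    return std::move(source);
}
```

Generic code can then call the conversion unconditionally and pays nothing when source and destination types happen to coincide.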
It is also possible to omit the template argument determining the destination string type. In this case, strcvt() does no conversion at all, but returns an object which can be converted to all string types.
// strcvt() returns an object which can be assigned to any string type
std::wstring w = strcvt("Hello, world!");
std::u16string s16 = strcvt(w);
For string arguments, the return value of strcvt() is a subclass of the class of its argument. The subclass adds conversion operators to all string types. For character pointer arguments, the return value of strcvt() is an object which can be converted to a character pointer (of the character type of the argument) or to any string type. In each case, the return value of strcvt() can e.g. be assigned to any string type, or can be passed as argument to a function expecting a string.
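The technique behind an object "which can be converted to all string types" is a conversion operator template. The sketch below shows the mechanism for the character pointer case only. The names ascii_proxy and deferred_cvt are hypothetical, and the conversion is a plain ASCII widening rather than the library's real locale-aware conversion.

```cpp
#include <string>

// Hypothetical proxy type; the library's actual return type is not
// shown in this README.
class ascii_proxy {
public:
    explicit ascii_proxy(const char* s) : src_(s) {}

    // Conversion operator template: charT is deduced from the string
    // type that the proxy is used to initialize.
    template<class charT>
    operator std::basic_string<charT>() const {
        std::basic_string<charT> out;
        for (const char* p = src_; *p != '\0'; ++p)
            out.push_back(static_cast<charT>(static_cast<unsigned char>(*p)));
        return out;
    }

private:
    const char* src_;   // not owned; must outlive the proxy
};

// Does no conversion itself; the work happens when the result is
// converted to a concrete string type.
inline ascii_proxy deferred_cvt(const char* s) {
    return ascii_proxy(s);
}
```

With this in place, std::wstring w = deferred_cvt("Hi"); and std::u16string u = deferred_cvt("Hi"); both compile, each picking its own instantiation of the conversion operator.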
UTF-8 string type u8string
In addition to the four standard string types std::string,
std::wstring, std::u16string and std::u32string provided by the
C++ standard library, header strcvt/u8string.h defines an additional
string type StrCvt::u8string. Like the locale-dependent string type
std::string, it has character type char, but it always uses a
locale-independent UTF-8 encoding. One important use case is a string
initialized with C++11's u8"string" UTF-8 string literal. If this is
used to initialize a std::string, strcvt() will wrongly assume it is
a string of locale-dependent character type. Example usage:
#include "strcvt/u8string.h"
#include "strcvt/strcvt.h"
// ...
using namespace StrCvt;
u8string us(u8"Hello, World");
std::cout << strcvt<std::string>(us) << std::endl;
In the example above, the u8string is initialized using the C++11
UTF-8 string literal u8"string". For output to the
locale-dependent standard output stream, it is converted from UTF-8 to
the locale-dependent char encoding.
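Why a separate type is needed even though the character type is still char can be shown with overload resolution: a distinct type lets the converter choose the decoder at compile time. The sketch below assumes a UTF-8 string type derived from std::string; the names utf8_string, encoding and source_encoding are hypothetical, and the library's own u8string may be defined differently.

```cpp
#include <string>
#include <type_traits>
#include <utility>

// Hypothetical stand-in for a UTF-8 tagged string type.
class utf8_string : public std::string {
public:
    utf8_string() {}
    explicit utf8_string(const char* s) : std::string(s) {}
    explicit utf8_string(std::string s) : std::string(std::move(s)) {}
};

enum class encoding { locale_dependent, utf8 };

// The exact match on utf8_string wins over the derived-to-base
// conversion, so the two string types are told apart statically.
inline encoding source_encoding(const utf8_string&) {
    return encoding::utf8;
}
inline encoding source_encoding(const std::string&) {
    return encoding::locale_dependent;
}

static_assert(!std::is_same<utf8_string, std::string>::value,
              "the UTF-8 string must be a distinct type");
```

The bytes stored in both types are identical; only the static type carries the information about how to interpret them.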
UTF string conversion
In the examples above, strcvt() can be replaced with strcvt_utf(), which ignores the locale and always performs a UTF transformation. The header is "strcvt/strcvt_utf.h".
It is also possible to configure strcvt() to only use UTF conversions
instead of the locale-dependent character conversions. Open the header
strcvt/strcvt_config.h with an editor and change the preprocessor
symbol STRCVT_IMPL_UTF8_ONLY from 0 to 1.
Locale-dependent stream conversion
The character type stream converter function charcvt(state, from,
from_end, from_next, to, to_end, to_next) is defined in header
strcvt/charcvt.h and uses the API of the member functions in()
and out() of the standard code conversion interface std::codecvt:
#include "strcvt/charcvt.h"
// ...
using namespace StrCvt;
std::mbstate_t state = std::mbstate_t(); // Zero-initialize
// Convert between character types:
result r = charcvt(state, from, from_end, from_next, to, to_end, to_next);
// ... More calls of charcvt()
// Return output to initial shift state
r = charcvt_unshift<source_charT>(state, to, to_end, to_next);
The state must be of type std::mbstate_t. It holds the conversion
state between successive calls of charcvt() and must be explicitly
zero-initialized before the first use, as shown above. The
arguments from and from_end delimit the source character buffer,
to and to_end delimit the destination character buffer, and on
exit from the function from_next and to_next point past the last
converted character. The function returns std::codecvt_base::ok on
success and std::codecvt_base::partial if the output buffer was too
small to convert the entire input buffer, or if the input buffer ended
in the middle of a multibyte sequence. After the input has been
completely converted, possibly by multiple calls to charcvt(),
charcvt_unshift() must be called to move the output to the initial
shift state. The source character type must be specified as a
template argument to charcvt_unshift(), since different converters are
used depending on the source character type. charcvt_unshift() should
be used even if the character encoding is known not to be
state-dependent, like UTF-8: it checks whether trailing
incomplete Unicode character input sequences are pending, and appends
a replacement character to the output buffer if necessary to signal
the presence of trailing garbage.
charcvt() is specialized for all combinations of source and
destination character types char, wchar_t, char16_t and
char32_t and converts between the locale-dependent character type
char, the Unicode UTF-16 and UTF-32 character types char16_t and
char32_t, and the implementation-defined wide character type
wchar_t, which is assumed to be equivalent to either char16_t or
char32_t. It is able to consume or deliver single code units of
Unicode characters that span several code units.
The full signature of charcvt() is:
template<class src_charT, class dst_charT>
std::codecvt_base::result
charcvt(std::mbstate_t& state,
        const src_charT* from, const src_charT* from_end,
        const src_charT*& from_next,
        dst_charT* to, dst_charT* to_end, dst_charT*& to_next,
        int flags = 0);
The flags can be omitted, or they can be set to the constant
strcvt_flags_no_partial_conversions to prevent partial conversions.
If the flag is omitted or zero, charcvt() is eager and consumes even
single partial multibyte characters.
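The calling convention shared with in() and out() is easiest to see with a small output buffer that forces several calls. The sketch below does not use the library: utf32_to_utf16 and convert_in_chunks are hypothetical stand-ins written with standard C++ only. Unlike the default eager mode of charcvt(), this stand-in never splits a surrogate pair across calls and therefore needs no state argument.

```cpp
#include <cstddef>
#include <locale>   // std::codecvt_base
#include <string>

// Hypothetical converter following the in()/out() calling convention.
inline std::codecvt_base::result
utf32_to_utf16(const char32_t* from, const char32_t* from_end,
               const char32_t*& from_next,
               char16_t* to, char16_t* to_end, char16_t*& to_next)
{
    from_next = from;
    to_next = to;
    while (from_next != from_end) {
        char32_t c = *from_next;
        if (c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
            return std::codecvt_base::error;     // not a Unicode scalar value
        const std::size_t need = (c >= 0x10000) ? 2 : 1;
        if (static_cast<std::size_t>(to_end - to_next) < need)
            return std::codecvt_base::partial;   // output buffer exhausted
        if (need == 1) {
            *to_next++ = static_cast<char16_t>(c);
        } else {
            c -= 0x10000;
            *to_next++ = static_cast<char16_t>(0xD800 + (c >> 10));
            *to_next++ = static_cast<char16_t>(0xDC00 + (c & 0x3FF));
        }
        ++from_next;
    }
    return std::codecvt_base::ok;
}

// Repeats the call until the input is consumed, draining a
// deliberately tiny buffer after each step.
inline std::u16string convert_in_chunks(const std::u32string& in) {
    std::u16string out;
    char16_t buf[2];
    const char32_t* from = in.data();
    const char32_t* const from_end = from + in.size();
    std::codecvt_base::result r;
    do {
        const char32_t* from_next = from;
        char16_t* to_next = buf;
        r = utf32_to_utf16(from, from_end, from_next, buf, buf + 2, to_next);
        out.append(buf, to_next);   // keep what was produced so far
        from = from_next;           // resume after the consumed input
    } while (r == std::codecvt_base::partial);
    return out;
}
```

The loop terminates because a two-unit buffer is always large enough for one character, so every partial result still makes progress.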
UTF stream conversion
The function charcvt_utf(state, from, from_end, from_next, to, to_end,
to_next) is defined in header strcvt/charcvt_utf.h and works like
charcvt() but treats the type char as having UTF-8 encoding instead
of a locale-dependent, implementation-defined encoding:
#include "strcvt/charcvt_utf.h"
// ...
using namespace StrCvt;
std::mbstate_t state = std::mbstate_t(); // Zero-initialize
// Convert between character types:
charcvt_utf(state, from, from_end, from_next, to, to_end, to_next);
// ... More calls of charcvt_utf()
// Return output to initial shift state
charcvt_utf_unshift(state, to, to_end, to_next);
The state must be of type std::mbstate_t. It holds the conversion
state between successive calls of charcvt_utf() and must be explicitly
zero-initialized before the first use, as shown above. The
arguments from and from_end delimit the source character buffer,
to and to_end delimit the destination character buffer, and on
exit from the function from_next and to_next point past the last
converted character. The function returns std::codecvt_base::ok on
success and std::codecvt_base::partial if the output buffer was too
small to convert the entire input buffer, or if the input buffer ended
in the middle of a multibyte sequence. After the input has been
completely converted, possibly by multiple calls to charcvt_utf(),
charcvt_utf_unshift() checks whether trailing incomplete Unicode
character input sequences are pending, and appends a replacement
character to the output buffer if necessary to signal the presence of
trailing garbage.
charcvt_utf() is specialized for all combinations of source and
destination character types char, wchar_t, char16_t and
char32_t and converts between UTF-8-coded char, the Unicode UTF-16
and UTF-32 character types char16_t and char32_t, and the
implementation-defined wide character type wchar_t, which is assumed
to be equivalent to either char16_t or char32_t. Contrary to
member functions in() and out() of codecvt, charcvt_utf() always
converts (even if the input character type is the same as the output
character type, in which case it checks the validity of the encoding).
It always generates valid UTF-8, UTF-16 or UTF-32 output sequences.
For each invalid encoding in the input buffer, charcvt_utf() inserts
the replacement character U+FFFD into the output buffer. If this is
not wanted, function charcvt_utf_strict() can be used, which has the
same interface as charcvt_utf() but returns std::codecvt_base::error
on encoding errors, with from_next pointing to the first element of
the invalid sequence.
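The difference between the lenient and the strict policy can be sketched with a small UTF-8 decoder. The functions decode_one, decode_lenient and decode_strict are hypothetical and work on whole strings rather than on the from/to buffer interface. One assumption to note: this sketch emits one U+FFFD per offending byte, while the library may group invalid bytes differently.

```cpp
#include <cstddef>
#include <locale>   // std::codecvt_base
#include <string>

// Decodes one UTF-8 sequence in [p, end), p != end. Returns the number
// of bytes consumed, or 0 for an invalid or truncated sequence.
inline std::size_t decode_one(const unsigned char* p,
                              const unsigned char* end, char32_t& cp) {
    const unsigned char b0 = *p;
    std::size_t len = 0;
    char32_t min = 0;
    if (b0 < 0x80) { cp = b0; return 1; }
    if ((b0 & 0xE0) == 0xC0)      { len = 2; cp = b0 & 0x1F; min = 0x80; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min = 0x800; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min = 0x10000; }
    else return 0;                 // stray continuation or invalid lead byte
    if (static_cast<std::size_t>(end - p) < len) return 0;    // truncated
    for (std::size_t i = 1; i < len; ++i) {
        if ((p[i] & 0xC0) != 0x80) return 0;    // not a continuation byte
        cp = (cp << 6) | (p[i] & 0x3F);
    }
    // Reject overlong forms, surrogates and values beyond U+10FFFF.
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return 0;
    return len;
}

// Lenient policy: every offending byte becomes U+FFFD.
inline std::u32string decode_lenient(const std::string& in) {
    std::u32string out;
    const unsigned char* p = reinterpret_cast<const unsigned char*>(in.data());
    const unsigned char* const end = p + in.size();
    while (p != end) {
        char32_t cp = 0;
        const std::size_t n = decode_one(p, end, cp);
        if (n == 0) { out.push_back(0xFFFD); ++p; }
        else        { out.push_back(cp); p += n; }
    }
    return out;
}

// Strict policy: stop at the first invalid sequence and report its offset.
inline std::codecvt_base::result
decode_strict(const std::string& in, std::u32string& out, std::size_t& error_pos) {
    const unsigned char* const begin =
        reinterpret_cast<const unsigned char*>(in.data());
    const unsigned char* const end = begin + in.size();
    for (const unsigned char* p = begin; p != end; ) {
        char32_t cp = 0;
        const std::size_t n = decode_one(p, end, cp);
        if (n == 0) {
            error_pos = static_cast<std::size_t>(p - begin);
            return std::codecvt_base::error;
        }
        out.push_back(cp);
        p += n;
    }
    return std::codecvt_base::ok;
}
```

The lenient variant always yields valid output, which suits display purposes; the strict variant is the better fit when invalid input must be rejected.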
There is also a UTF code conversion facet defined in header strcvt/codecvt_utf.h:
class codecvt_utf<intern_charT, extern_charT>;
This code conversion facet uses charcvt_utf() to convert between
UTF-coded characters intern_charT and extern_charT, and is specialized
for all character type combinations.
Character type converting character iterator
Instead of converting the entire buffer, it can be accessed through a converting character iterator. The iterator is created by function make_strcvt_iterator():
#include "strcvt/strcvt_iterator.h"
// ...
using namespace StrCvt;
// Create char32_t iterator for access to source C string
auto it = make_strcvt_iterator<char32_t>("Hello, world!");
The end iterator is returned by make_strcvt_iterator() without arguments, or it can be obtained from member function end() of the iterator. Since member function begin() is also implemented, and returns the iterator itself, the iterator can be used just like a container, for example with the range-based for statement:
std::u32string w;
// Use range-based for statement: process all Unicode characters
for (char32_t c: it)
    w.push_back(c);
// Alternatively, use simple iterator interface
auto e = it.end();
for (auto j = it; j != e; ++j)
    w.push_back(*j);
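How an iterator can double as its own range is sketched below with a self-contained UTF-16 to UTF-32 decoding iterator. The class utf16_to_utf32_iterator and the helper collect are hypothetical; they only illustrate the begin()/end() convention and are not the library's strcvt_iterator.

```cpp
#include <string>

// Hypothetical converting iterator over a null-terminated UTF-16 string.
class utf16_to_utf32_iterator {
public:
    // Default-constructed iterator is the end sentinel.
    utf16_to_utf32_iterator() : p_(nullptr), cur_(0) {}
    explicit utf16_to_utf32_iterator(const char16_t* s) : p_(s), cur_(0) {
        advance();   // decode the first character up front
    }

    char32_t operator*() const { return cur_; }
    utf16_to_utf32_iterator& operator++() { advance(); return *this; }
    bool operator==(const utf16_to_utf32_iterator& o) const { return p_ == o.p_; }
    bool operator!=(const utf16_to_utf32_iterator& o) const { return !(*this == o); }

    // begin() returns the iterator itself, so the range-based for
    // statement accepts the iterator as if it were a container.
    utf16_to_utf32_iterator begin() const { return *this; }
    utf16_to_utf32_iterator end() const { return utf16_to_utf32_iterator(); }

private:
    void advance() {
        if (p_ == nullptr || *p_ == 0) { p_ = nullptr; return; }  // at end
        char32_t c = *p_++;
        if (c >= 0xD800 && c <= 0xDBFF && *p_ >= 0xDC00 && *p_ <= 0xDFFF)
            c = 0x10000 + ((c - 0xD800) << 10) + (*p_++ - 0xDC00);
        else if (c >= 0xD800 && c <= 0xDFFF)
            c = 0xFFFD;                     // unpaired surrogate
        cur_ = c;
    }

    const char16_t* p_;   // next unread code unit; nullptr once exhausted
    char32_t cur_;        // most recently decoded character
};

// Collects all decoded characters using the range-based for statement.
inline std::u32string collect(const char16_t* s) {
    std::u32string out;
    for (char32_t c : utf16_to_utf32_iterator(s))
        out.push_back(c);
    return out;
}
```

The conversion happens lazily in operator++, so only the characters that are actually visited are decoded.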
The converting iterator is (indirectly) derived from base class
strcvt_iterator_base<charT>, which does not depend on the source
iterator type or the source character type. It has a virtual
destructor and can be used polymorphically. The inheritance hierarchy
looks like this:
template<class charT>
class strcvt_iterator_base {
    // Operators *(), ++(), ==()
    // Member functions begin(), end()
};
template<class charT, class source_charT>
class strcvt_iterator_impl : public strcvt_iterator_base<charT> {
    // Implements operator ++() which does the conversion using
    // virtual member function get_next_source_char()
};
template<class charT, class SourceIterator>
class strcvt_iterator
    : public strcvt_iterator_impl<charT,
          typename std::iterator_traits<SourceIterator>::value_type> {
    // Implements get_next_source_char()
};
Class template strcvt_iterator_base does not know the source
character type and can be used to handle a strcvt_iterator
polymorphically. Class template strcvt_iterator_impl knows the
source character type but not the type of the iterator. If the
precompiled library is used, it is precompiled for all combinations of
source and destination character type. Class template
strcvt_iterator is the return type of function
make_strcvt_iterator(). It implements the source iterator handling.
Operators
Some operators for strings are defined in header
"strcvt/strcvt_operators.h". These are the output operators << for
strings and character pointers, and the appending += operators for
strings. To use them, include header "strcvt/strcvt_operators.h".
The operators reside in namespace StrCvt and should be imported into
the user's namespace through a using namespace directive:
#include "strcvt/strcvt_operators.h"
using namespace StrCvt;
std::cout << U"Hello, world!\n";
It would be nice if we could define conversion operators to enable assignment between different string types. But in C++, conversion and assignment operators can only be defined as member functions.
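Output and append operators, unlike conversion and assignment operators, can be written as free functions, which is why they can be supplied for the standard string types. A minimal sketch follows, assuming UTF-8 output instead of the library's locale-dependent conversion; the namespace demo and the function to_utf8 are hypothetical.

```cpp
#include <ostream>
#include <sstream>
#include <string>

namespace demo {   // hypothetical namespace standing in for StrCvt

// Encodes UTF-32 as UTF-8; invalid code points become U+FFFD.
inline std::string to_utf8(const std::u32string& in) {
    std::string out;
    for (char32_t c : in) {
        if (c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF)) c = 0xFFFD;
        if (c < 0x80) {
            out.push_back(static_cast<char>(c));
        } else if (c < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        } else if (c < 0x10000) {
            out.push_back(static_cast<char>(0xE0 | (c >> 12)));
            out.push_back(static_cast<char>(0x80 | ((c >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        } else {
            out.push_back(static_cast<char>(0xF0 | (c >> 18)));
            out.push_back(static_cast<char>(0x80 | ((c >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((c >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
    }
    return out;
}

// Free output operator for a char32_t string on a narrow stream.
inline std::ostream& operator<<(std::ostream& os, const std::u32string& s) {
    return os << to_utf8(s);
}

// Free appending operator between strings of different character types.
inline std::string& operator+=(std::string& lhs, const std::u32string& rhs) {
    return lhs += to_utf8(rhs);
}

} // namespace demo
```

Because both operand types live in namespace std, argument-dependent lookup does not search demo. This is the reason a using namespace directive is needed before expressions such as os << s can find these operators.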
Creating a Library
The character type converter is preconfigured for header-only use. This means: Just include the header and you are done. In order to reduce space overhead and compilation time, a precompiled library can be used.
The main advantage in using a library is that each time the strcvt headers are included, the compiler does not need to look at the implementation details. This can speed up compilation significantly.
To create the library, the C++ source files libstrcvt.cpp,
libstrcvt_iterator.cpp, libstrcvt_utf.cpp and libcodecvt_utf.cpp in
directory lib of the distribution must be compiled. Under Linux,
just run make. Before compiling, you may want to select the
compiler to use: uncomment the proper CXX= line in the toplevel
Makefile.template. Running make should create a library
lib/libxprintf.a, which has to be linked to the programs.
In Visual C++, instead of building a library, you may just add the library source files to your project.
In order to make the headers use the library, you must open the header
strcvt/strcvt_config.h with an editor and change the preprocessor
symbol STRCVT_IMPL_USE_LIBRARY from 0 to 1. The next time the
header is included, the library will be used. You can check that the
library is used as intended by omitting the library when linking.
Linking should fail with missing externals.
In order to run the tests, the headers for the boost test framework are required.
Compatibility
The code converter has been tested with:
- GCC 4.7, 4.8 and 4.9 on Linux
- Visual Studio Express 2013 for Windows Desktop with November 2013 CTP
- Intel C++ 14.0.2 on Linux
- Clang 3.5.0 on Linux
Rationale
C++11 has the new character types char16_t and char32_t, which use
the UTF-16 and UTF-32 Unicode encodings. The question is how to convert
between the four character types of C++.
The codecvt part of the C++11 library looks like some ruins left
over at the front line between warring factions.
The codecvt class template is part of the header <locale> and
described in section 22.4.1.4 of the standard. The first thing to
note is that according to this specification codecvt transforms
between an internal and an external character encoding, so it is
not intended to transform between internal character encodings like
char16_t and wchar_t.
According to the standard, each locale shall have specializations of
codecvt for transformation between the internal character types
char, wchar_t, char16_t and char32_t and the external
character type char. The standard first says that these
specializations "convert the implementation-defined native character
set", only to continue specifying that
- the specialization with external and internal character type both char must not convert at all (effectively saying that the internal char must have the same encoding as the external char),
- the specializations with internal character types char16_t and char32_t must treat the external character code as UTF-8, so they are explicitly not allowed to treat it as an "implementation-defined native character set".
This leaves the transformation between internal character type
wchar_t and external character type char as the only
locale-dependent transformation, and it is specified to convert
"between the native character sets for narrow and wide characters".
So no luck using codecvt to transform between native character sets
and char16_t or char32_t.
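For reference, the one locale-dependent transformation that codecvt does offer can be used as sketched below. The helper name widen_with_locale is hypothetical; std::use_facet, std::codecvt<wchar_t, char, std::mbstate_t> and its member in() are standard. The result depends on the encoding of the locale that is passed in.

```cpp
#include <cwchar>      // std::mbstate_t
#include <locale>
#include <stdexcept>
#include <string>

// Converts a narrow string to a wide string using the codecvt facet of
// an explicitly given locale (hypothetical helper for illustration).
inline std::wstring widen_with_locale(const std::string& in,
                                      const std::locale& loc) {
    typedef std::codecvt<wchar_t, char, std::mbstate_t> facet_type;
    const facet_type& cvt = std::use_facet<facet_type>(loc);

    std::wstring out(in.size(), L'\0');   // at most one wide char per byte
    if (in.empty())
        return out;

    std::mbstate_t state = std::mbstate_t();   // zero-initialize
    const char* from_next = nullptr;
    wchar_t* to_next = nullptr;
    const std::codecvt_base::result r =
        cvt.in(state, in.data(), in.data() + in.size(), from_next,
               &out[0], &out[0] + out.size(), to_next);
    if (r == std::codecvt_base::error)
        throw std::runtime_error("invalid multibyte sequence");

    // Keep only what was converted (a trailing partial sequence is dropped).
    out.resize(static_cast<std::size_t>(to_next - &out[0]));
    return out;
}
```

No comparable facet exists for char16_t or char32_t with a native narrow encoding, which is the gap the paragraph above points out.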
Section 22.5 of the standard specifies the "standard code conversion facets". It contains:
- Facet codecvt_utf8, which converts between a UTF-8 coded char buffer and UCS2 or UCS4, so it can be used to transform between UTF-8 and char32_t. It is not usable for transformation between UTF-8 and char16_t because char16_t strings are coded as UTF-16, not UCS2.
- Facet codecvt_utf8_utf16 converts between UTF-8 and UTF-16, so it can be used to transform between UTF-8-coded char and char16_t.
- Facet codecvt_utf16 looks like a hack from the 20th century to adapt UTF-16 wide characters to character byte streams. The UTF-16-coded buffer is addressed through a char*.
So we can use these standard code conversion facets to convert between
UTF-8-coded char and char16_t or char32_t. We could even
transform between char16_t and char32_t by going through an
intermediate UTF-8-coded char buffer. But again, no
transformation between native character sets and char16_t or
char32_t.
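On a standard library that ships the header <codecvt>, these facets are usually driven through std::wstring_convert, as in the sketch below. The wrapper function names are hypothetical. Note that <codecvt> and std::wstring_convert were deprecated in C++17, so compilers may emit deprecation warnings for this code.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// UTF-8 to UTF-16 via codecvt_utf8_utf16.
// from_bytes() throws std::range_error on invalid input.
inline std::u16string utf8_to_utf16(const std::string& in) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.from_bytes(in);
}

// UTF-16 to UTF-8 via the same facet.
inline std::string utf16_to_utf8(const std::u16string& in) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.to_bytes(in);
}

// UTF-8 to UTF-32 via codecvt_utf8 (UCS4 for a char32_t element type).
inline std::u32string utf8_to_utf32(const std::string& in) {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.from_bytes(in);
}
```

A conversion from char16_t to char32_t with these tools needs two steps, utf8_to_utf32(utf16_to_utf8(s)), which is the detour through an intermediate UTF-8 buffer mentioned above.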
The standard does provide an interface for transformation between
native char and char16_t or char32_t. It is hidden at the very
end of section 21.7 in the string library and consists of the four
functions mbrtoc16(), c16rtomb(), mbrtoc32() and c32rtomb().
Unfortunately, a locale cannot be specified for these functions.
They always work with the currently active global locale.
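A sketch of how mbrtoc32() is typically called in a loop is shown below. The wrapper native_to_utf32 is hypothetical, and its policy of appending U+FFFD and stopping at the first bad sequence is an assumption made for this example. The conversion uses whatever locale is globally active at the time of the call, and the header <cuchar> must be available on the platform.

```cpp
#include <cstddef>
#include <cuchar>    // std::mbrtoc32
#include <cwchar>    // std::mbstate_t
#include <string>

// Converts a string in the native narrow encoding of the current
// global C locale to UTF-32 (hypothetical helper).
inline std::u32string native_to_utf32(const std::string& in) {
    std::u32string out;
    std::mbstate_t state = std::mbstate_t();   // zero-initialize
    const char* p = in.data();
    const char* const end = p + in.size();
    while (p != end) {
        char32_t c = 0;
        const std::size_t n =
            std::mbrtoc32(&c, p, static_cast<std::size_t>(end - p), &state);
        if (n == static_cast<std::size_t>(-1) ||
            n == static_cast<std::size_t>(-2)) {
            out.push_back(0xFFFD);   // invalid or truncated sequence
            break;
        }
        out.push_back(c);
        if (n == static_cast<std::size_t>(-3))
            continue;                // character produced, no input consumed
        p += (n == 0) ? 1 : n;       // n == 0 means a null character was read
    }
    return out;
}
```

Since no locale argument exists, a caller that needs a specific encoding has to change the global locale with std::setlocale() first, which affects the whole program.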
For the following things I know of no standard-conforming procedure:
- Conversion between wchar_t and char16_t or char32_t. Clearly, going through native char is not an option, unless it happens to use UTF-8 encoding. I worked around this by assuming that the native wchar_t type uses Unicode coding and is equivalent to either char16_t or char32_t, depending on sizeof(wchar_t). This assumption holds for the systems I have access to, but there may be other wchar_t encodings in use. Older Windows systems used UCS2, where this assumption would not hold, but modern versions of Windows use UTF-16.
- Locale-dependent conversion: especially for output streams where a locale is known that may not be the global locale, it would be useful if the locale could be passed as an argument to the character transformation. Unfortunately the standard library does not provide a portable way to do this, so I left it out.
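The wchar_t assumption from the first item can be written down as a compile-time selection. This is a sketch only: the alias wchar_equivalent and the function to_unicode_unit are hypothetical names, not taken from the library.

```cpp
#include <type_traits>

// Pick the Unicode character type that has the same size as wchar_t.
typedef std::conditional<sizeof(wchar_t) == sizeof(char16_t),
                         char16_t, char32_t>::type wchar_equivalent;

// Fail the build on platforms where wchar_t matches neither type.
static_assert(sizeof(wchar_equivalent) == sizeof(wchar_t),
              "wchar_t is neither 16 nor 32 bits wide on this platform");

// Reinterprets a wchar_t code unit as the assumed Unicode code unit.
// This is only meaningful if wchar_t really holds UTF-16 or UTF-32.
inline wchar_equivalent to_unicode_unit(wchar_t c) {
    return static_cast<wchar_equivalent>(c);
}
```

The size check cannot verify the encoding itself, so a platform with a non-Unicode wchar_t of matching width would still compile and then produce wrong results.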
At first I thought I would reuse the standard-specified UTF coders, but
first they are incomplete (going from UTF-16 to UTF-32 via an
intermediate representation as UTF-8 is not very attractive), second
they are distributed over several different interfaces, and third GCC
does not have them. So as a first step I implemented a transformation
between UTF-coded char, char16_t, char32_t and wchar_t. The
function charcvt_utf(state, from, from_end, from_next, to, to_end,
to_next) uses the API of the member functions in() and out() of
codecvt. The state holds the conversion state, from and
from_end delimit the source character buffer, to and to_end
delimit the destination character buffer, and on exit from the
function from_next and to_next point past the last converted
character. Like in() and out(), charcvt_utf() returns
std::codecvt_base::ok on success and std::codecvt_base::partial if
the output buffer was too small to convert the entire input buffer, or
if the input buffer ended in the middle of a multibyte sequence. For each
invalid encoding in the input buffer, the replacement character
U+FFFD is inserted into the output stream. If this is not wanted,
function charcvt_utf_strict() can be used, which has the same
interface as charcvt_utf() but returns std::codecvt_base::error on
encoding errors, with from_next pointing to the first element of the
invalid sequence. Both functions are specialized on all combinations
of char, wchar_t, char16_t and char32_t. They always convert
(even if input and output type are the same, in which case they check
the validity of the encoding). They are able to consume and produce
single code units forming part of multi-unit Unicode characters.
Based on the UTF coder, a locale-dependent converter has been
implemented. The function charcvt(state, from, from_end, from_next,
to, to_end, to_next) also uses the API of the member functions in()
and out() of codecvt: the state holds the conversion state,
from and from_end delimit the source character buffer, to and
to_end delimit the destination character buffer, and on exit from
the function from_next and to_next point past the last converted
character. This implementation uses the assumption that wchar_t is
encoded as either UTF-16 or UTF-32. Transformations between the wide
character types wchar_t, char16_t and char32_t use UTF
transformations; transformations between char and wide characters
treat char as a native locale-defined character code.
Convenience functions for the code transformation are:
- std::basic_string<charT> strcvt<std::basic_string<charT>>(const C* from, std::size_t size) converts the size characters at from into a string of character type charT, which is specified as a template parameter.
- std::basic_string<charT> strcvt<std::basic_string<charT>>(const C* from) converts the characters of the null-terminated string from into a string of character type charT, which is specified as a template parameter.
- std::basic_string<charT> strcvt<std::basic_string<charT>>(const std::basic_string<C>& from) converts the characters of the string from into a string of character type charT, which is specified as a template parameter. Specializations for charT == C move the string argument to the return value (if the argument is an rvalue) or return the reference to the argument (if the argument is an lvalue). This is useful in generic code where the concrete types are not known.
- strcvt(const C* from) (strcvt() without template parameter specifying the destination string type) creates an object which can be converted back to a const C* or to any string type. The result can e.g. be assigned to any string type, or passed to a function expecting any string.
- strcvt(const std::basic_string<C>& from) (strcvt() without template parameter specifying the destination string type) casts the type of the argument to a reference to a class derived from std::basic_string<C>, which adds conversion operators to all string types. The result can e.g. be assigned to any string type, or passed to a function expecting any string.
License
Copyright (c) 2014 Ruediger Helsch; All rights reserved
Permission to use, copy, modify, and distribute this software for any purpose and without fee is hereby granted. The author disclaims all warranties with regard to this software.