Thread: [Cppunit-devel] Toward True Unicode Code... Help requested

[Cppunit-devel] Toward True Unicode Code... Help requested

From: Baptiste L. <gai...@fr...> - 2002-04-14 10:06:07

Attachments: UnicodeAnsiCodeFragment.txt

    I intend to give CppUnit true Unicode support in assertion messages (the
description of the failure, returned by Exception::what()).

    CppUnit should be able to deal with *both* ANSI and UNICODE strings. It's
likely that not every unit test in a project will use just one of those
exclusively.

    The current solution I found to this problem is to introduce a wrapper
string class that can be constructed from either an ANSI or a UNICODE string
(kind of like the MFC CString). That class would also provide accessors to
retrieve the string as either ANSI or UNICODE. Conversion from one format to
the other would be done automatically.
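
A rough sketch of what such a wrapper could look like. The class name
`Message` and its members are hypothetical and not part of CppUnit. As a
stand-in for a real code-page conversion, this sketch assumes narrow strings
are ISO Latin-1, so each byte maps directly to one code point; anything that
does not fit in one byte becomes '?' when narrowing.

```cpp
#include <cstddef>
#include <string>

// Hypothetical wrapper: stores text as wide characters and converts on demand.
// Assumption: narrow strings are ISO Latin-1 (one byte == one code point).
class Message
{
public:
    Message(const std::string& narrow) : m_wide(widen(narrow)) {}
    Message(const std::wstring& wide) : m_wide(wide) {}

    // Wide accessor: no conversion needed.
    const std::wstring& wide() const { return m_wide; }

    // Narrow accessor: code points above 0xFF cannot be represented
    // in Latin-1 and are replaced by '?'.
    std::string narrow() const
    {
        std::string out;
        out.reserve(m_wide.size());
        for (std::size_t i = 0; i < m_wide.size(); ++i) {
            const unsigned long code = static_cast<unsigned long>(m_wide[i]);
            out += (code <= 0xFFul)
                       ? static_cast<char>(static_cast<unsigned char>(code))
                       : '?';
        }
        return out;
    }

private:
    // Latin-1 widening: go through unsigned char so bytes >= 0x80
    // are not sign-extended.
    static std::wstring widen(const std::string& s)
    {
        std::wstring out;
        out.reserve(s.size());
        for (std::size_t i = 0; i < s.size(); ++i)
            out += static_cast<wchar_t>(static_cast<unsigned char>(s[i]));
        return out;
    }

    std::wstring m_wide;
};
```

An exception class could then hold a `Message` and let callers pick
`narrow()` or `wide()` depending on the output stream in use.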

    Exception and NotEqualException would be modified to use the string
class described above, as well as Asserter functions (functions that 'do'
the assertion by creating the Exception and throwing it).

    That solution has the advantage of having little impact on the existing
code (both CppUnit and user). Most of the impact will likely be on Outputter
(I have some really twisted ideas for this, but that's for later). Does
anybody see another solution? Suggestions?

    On a more technical side, I'm not very familiar with Unicode on Windows
(I have never made an application that truly uses Unicode), and I know just
about nothing about it on Unix. So here are some questions:

    1) Is std::wstring available on all platforms?
    2) Are std::wcout and std::wcerr available on all platforms?
    3) Is the attached code fragment the correct way to do the conversion
between std::string and std::wstring on Windows?
    4) How do we do the conversion on Unix? Would a dummy conversion (to
and from ISO Latin-1) do?

    Thanks in advance,
    Baptiste.
---
Baptiste Lepilleur <gai...@fr...>
http://gaiacrtn.free.fr/

Re: [Cppunit-devel] Toward True Unicode Code... Help requested

From: Duane M. <dua...@ma...> - 2002-04-14 17:35:45

My two cents, for what it's worth: do nothing.

There is no "True" Unicode. There are several Unicode formats: UTF-8,
UTF-16, and the upcoming UTF-32, among others. I have been told by some
associates that keep up with such things that UTF-16, while currently
being used, is on its way out in favor of UTF-32. It is most often
recommended to use the simple UTF-8 for most applications.

UTF-8 will likely satisfy most of us and require absolutely no changes.
UTF-8 is completely compatible with ASCII (for characters < 128). UTF-8
fits nicely in a standard string. Anyone who is concerned about such
things has already worked around any problems involved in using UTF-8
with std::string. This mostly involves parsing and locating character
boundaries, which is of little concern to CppUnit.

Another reason to do nothing is that I would hope that the C++ standards
committee at least makes some statement about Unicode or
internationalization. They have done lots of work to put in
infrastructure that very few people really understand. I believe that
they need to make some statement or show some examples of how to truly
deal with Unicode.

My recommendation is to do nothing.

Is there some other driving factor behind this decision?

 ...Duane

-- 
"If tyranny and oppression come to this land, it will be in the
guise of fighting a foreign enemy."              - James Madison

Re: [Cppunit-devel] Toward True Unicode Code... Help requested

From: Baptiste L. <gai...@fr...> - 2002-04-15 23:26:59

----- Original Message -----
From: "Duane Murphy" <dua...@ma...>
To: "CppUnit Developers" <cpp...@li...>
Sent: Sunday, April 14, 2002 7:35 PM
Subject: Re: [Cppunit-devel] Toward True Unicode Code... Help requested

> My two cents, for what it's worth: do nothing.
>
> There is no "True" Unicode. There are several Unicode formats: UTF-8,
> UTF-16, and the upcoming UTF-32, among others. I have been told by some
> associates that keep up with such things that UTF-16, while currently
> being used, is on its way out in favor of UTF-32. It is most often
> recommended to use the simple UTF-8 for most applications.
>
> UTF-8 will likely satisfy most of us and require absolutely no changes.
> UTF-8 is completely compatible with ASCII (for characters < 128). UTF-8
> fits nicely in a standard string. Anyone who is concerned about such
> things has already worked around any problems involved in using UTF-8
> with std::string. This mostly involves parsing and locating character
> boundaries, which is of little concern to CppUnit.

Just a question on the side: does that mean that if you split a string into
many lines using the '\n' character, you can use the same algorithm in ANSI
and UTF-8? (=> even two- or three-byte character encodings never use '\n')

>
> Another reason to do nothing is that I would hope that the C++ standards
> committee at least makes some statement about Unicode or
> internationalization. They have done lots of work to put in
> infrastructure that very few people really understand. I believe that
> they need to make some statement or show some examples of how to truly
> deal with Unicode.
>
> My recommendation is to do nothing.
>
> Is there some other driving factor behind this decision?

My original thought was that it would make things easier for the outputter:
AFAIK you cannot set a code page saying that you're working in UTF-8 (let me
know if it is possible).

Since you have APIs such as fwprintf, wcerr... it wouldn't be a problem to
display the output in Unicode. So I did some testing: trying to display a
few hiragana in the VC++ output window. I tried two different ways:
- running the test application as a post-build step, and printing with
fwprintf
- from a VC++ add-in, using IApplication::PrintToOutputWindow, which takes a
Unicode string as argument.

Same result for both: a few '?' characters, meaning that a conversion
occurred from Unicode to multi-byte characters, and failed to find a match
for the Unicode characters (even though the font used for the output window
supports those Unicode characters).

Basically, that means using Unicode doesn't make anything easier: even if
you have Unicode, you need to write a special application to display the
result. The same applies to UTF-8, but...

For UTF-8, we already have the XmlOutputter (thanks to Fumiki's suggestions,
we can now specify the encoding).

So I agree, let's not change CppUnit. It already supports UTF-8 and that's
enough. If anything needs to be changed, it would be the GUI TestRunner, to
support UTF-8 and font selection.

Thanks for your feedback, Duane,
Baptiste.

Re: [Cppunit-devel] Toward True Unicode Code... Help requested

From: Duane M. <dua...@ma...> - 2002-04-15 23:37:18

--- At Mon, 15 Apr 2002 23:32:29 +0200, Baptiste Lepilleur wrote:

>Just a question on the side: does that mean that if you split a string into
>many lines using the '\n' character, you can use the same algorithm in ANSI
>and UTF-8? (=> even two- or three-byte character encodings never use '\n')

I hope I understand the question. If I output a string that includes a
'\n' in a stream, and some other process is parsing that stream, will
'\n' be unique?

The answer is yes! I was equally stunned to hear this. Once a lead byte
is seen that marks the following bytes as part of a multi-byte Unicode
character, none of the bytes that make up that character (the lead byte
included) will be less than 128! This is what makes UTF-8 work. All bytes
< 128 are always ASCII!
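
To illustrate the point, here is a minimal sketch, assuming the text is
valid UTF-8 held in a std::string. The helper name `splitLines` is made up
for this example. Because every byte of a multi-byte UTF-8 sequence has its
high bit set, a plain byte search for '\n' (0x0A) can never land inside a
character, so the same loop works for ASCII and UTF-8 alike.

```cpp
#include <string>
#include <vector>

// Splits text on '\n' by plain byte search.
// Safe for UTF-8: bytes of multi-byte sequences are all >= 0x80,
// so 0x0A only ever appears as a real newline.
std::vector<std::string> splitLines(const std::string& text)
{
    std::vector<std::string> lines;
    std::string::size_type start = 0;
    for (;;) {
        const std::string::size_type end = text.find('\n', start);
        if (end == std::string::npos) {
            lines.push_back(text.substr(start));   // last (or only) line
            break;
        }
        lines.push_back(text.substr(start, end - start));
        start = end + 1;                           // skip the newline itself
    }
    return lines;
}
```

Note that this only covers searching and splitting; it says nothing about
where it is safe to insert bytes.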

>My original thought was that it would make things easier for the outputter:
>AFAIK you cannot set a code page saying that you're working in UTF-8 (let me
>know if it is possible).

I'm not sure where you want to specify a code page, and I'm not always
clear as to what a code page means in some contexts. I think (and this is
a very old memory) that the encoding of an XML file can be UTF-8. Beyond
that, I don't know.

 ...Duane

Re: [Cppunit-devel] Toward True Unicode Code... Help requested

From: Baptiste L. <gai...@fr...> - 2002-04-16 12:39:52

----- Original Message -----
From: "Duane Murphy" <dua...@ma...>
To: "Baptiste Lepilleur" <gai...@fr...>; "CppUnit Developers"
<cpp...@li...>
Sent: Tuesday, April 16, 2002 1:37 AM
Subject: Re: [Cppunit-devel] Toward True Unicode Code... Help requested


> I hope I understand the question. If I output a string that includes a
> '\n' in a stream, and some other process is parsing that stream, will
> '\n' be unique?
>
> The answer is yes! I was equally stunned to hear this. Once a lead byte
> is seen that marks the following bytes as part of a multi-byte Unicode
> character, none of the bytes that make up that character (the lead byte
> included) will be less than 128! This is what makes UTF-8 work. All bytes
> < 128 are always ASCII!

Great, that means even outputters relying on that are compatible with UTF-8
(such as CompilerOutputter, which has some line wrapping code).

Baptiste.

Re: [Cppunit-devel] Toward True Unicode Code... Help requested

From: Duane M. <dua...@ma...> - 2002-04-16 15:29:01

--- At Tue, 16 Apr 2002 13:51:54 +0200, Baptiste Lepilleur wrote:
>Great, that means even outputters relying on that are compatible with UTF-8
>(such as CompilerOutputter, which has some line wrapping code).

I want to make sure that this question is properly understood. If you are
just searching for '\n' in a stream or string then that will work fine.
If you are looking to insert '\n' (or any other characters) at some
position then things get complicated.

 ...Duane

Re: [Cppunit-devel] Toward True Unicode Code... Help requested

From: Baptiste L. <gai...@fr...> - 2002-04-16 18:21:55

----- Original Message -----
From: "Duane Murphy" <dua...@ma...>
To: "Baptiste Lepilleur" <gai...@fr...>; "CppUnit Developers"
<cpp...@li...>
Sent: Tuesday, April 16, 2002 5:28 PM
Subject: Re: [Cppunit-devel] Toward True Unicode Code... Help requested


> I want to make sure that this question is properly understood. If you are
> just searching for '\n' in a stream or string then that will work fine.
> If you are looking to insert '\n' (or any other characters) at some
> position then things get complicated.

You understood the question well. I was the one who was not thinking clearly
when I answered. Indeed, I insert '\n', which makes the code incompatible
with UTF-8. This is an issue that will need to be addressed in the future.

Thanks,
Baptiste.

Re: [Cppunit-devel] Toward True Unicode Code... Help requested

From: Duane M. <dua...@ma...> - 2002-04-16 21:11:07

--- At Tue, 16 Apr 2002 20:28:56 +0200, Baptiste Lepilleur wrote:

>You understood the question well. I was the one who was not thinking clearly
>when I answered. Indeed, I insert '\n', which makes the code incompatible
>with UTF-8. This is an issue that will need to be addressed in the future.

This is where I have hopes of the standards committee adding something to
the standard to address Unicode support. Presently there is no standard
API for identifying character boundaries. I suspect it's not that hard to
do by hand, but it's also something that most OSes provide an interface for.

Maybe some kind of abstraction. All you need to do is find a place that's
safe to insert characters; that is, between characters, not inside one.
I think most OSes provide that capability, so this would be an
OS-dependent abstraction.
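
Doing it by hand might look roughly like the sketch below. It assumes the
input is valid UTF-8, and the function names are invented for illustration.
In UTF-8, continuation bytes always have the bit pattern 10xxxxxx, so
backing up past them reaches the start of the character. This only finds
code point boundaries; it does not account for combining characters, which
is where an OS-level service would still be useful.

```cpp
#include <cstddef>
#include <string>

// True for UTF-8 continuation bytes (bit pattern 10xxxxxx).
inline bool isContinuationByte(char c)
{
    return (static_cast<unsigned char>(c) & 0xC0) == 0x80;
}

// Returns the nearest position at or before 'pos' that lies between two
// characters, so inserting there does not split a multi-byte sequence.
// Assumption: 'utf8' holds valid UTF-8.
std::size_t safeInsertPosition(const std::string& utf8, std::size_t pos)
{
    if (pos >= utf8.size())
        return utf8.size();          // end of string is always safe
    while (pos > 0 && isContinuationByte(utf8[pos]))
        --pos;                       // back up to the start of the character
    return pos;
}
```

A line-wrapping routine could then call
`text.insert(safeInsertPosition(text, column), "\n")` instead of inserting
at the raw byte offset.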

 ...Duane

Re: [Cppunit-devel] Specifying code page (was: Toward True Unicode Code... Help requested)

From: Baptiste L. <gai...@fr...> - 2002-04-16 12:39:54

----- Original Message -----
From: "Duane Murphy" <dua...@ma...>
To: "Baptiste Lepilleur" <gai...@fr...>; "CppUnit Developers"
<cpp...@li...>
Sent: Tuesday, April 16, 2002 1:37 AM
Subject: Re: [Cppunit-devel] Toward True Unicode Code... Help requested


> --- At Mon, 15 Apr 2002 23:32:29 +0200, Baptiste Lepilleur wrote:
>
[...]
> I'm not sure where you want to specify a code page, and I'm not always
> clear as to what a code page means in some contexts. I think (and this is
> a very old memory) that the encoding of an XML file can be UTF-8. Beyond
> that, I don't know.

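The only way I know to specify a code page is VC++ specific:

#include <locale.h>   // setlocale
#include <mbctype.h>  // _setmbcp, _MB_CP_LOCALE (VC++ only)

// Set the locale
setlocale( LC_ALL, "jpn" );

// Set the multibyte code page to the one associated with the current locale
_setmbcp( _MB_CP_LOCALE );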

When you do that, VC++ is supposed to use the specified locale when doing
MBCS <=> UNICODE conversion. So, if you could specify a UTF-8 code page, you
would have a cheap way of doing UTF-8/UNICODE conversion. (Read UNICODE
as wchar_t ;-) ).
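
For comparison, a portable sketch of the same idea using only the standard
C library: std::mbstowcs converts according to whatever locale is currently
set. The function name `widenWithCurrentLocale` is hypothetical. Whether a
UTF-8 locale is available, and what it is called, depends on the system, so
that part is an assumption left to the caller.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>

// Converts a multi-byte string to wide characters using the current
// C locale (LC_CTYPE). Returns false if the input contains a sequence
// that is invalid in that locale. Stops at the first embedded NUL.
bool widenWithCurrentLocale(const std::string& narrow, std::wstring& wide)
{
    wide.clear();
    if (narrow.empty())
        return true;

    // A multi-byte string never yields more wide characters than bytes.
    std::wstring buffer(narrow.size(), L'\0');
    const std::size_t written =
        std::mbstowcs(&buffer[0], narrow.c_str(), buffer.size());
    if (written == static_cast<std::size_t>(-1))
        return false;                // invalid sequence for this locale

    buffer.resize(written);
    wide.swap(buffer);
    return true;
}
```

A caller would first select a locale, for example with
`std::setlocale(LC_ALL, "")` to pick up the user's environment, and the
conversion then follows that locale's encoding.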

Baptiste.