From: Manolo <man...@gm...> - 2012-08-30 16:35:34
|
Hi

wint_t ch = 0x20ac; //euro sign
iswalpha(ch) and iswprint(ch) both return 0.

The irony is that on XP, compiled with g++ 4.6,
MessageBoxW(NULL, buf, L"Result of \u20AC", MB_OK);
shows the right symbol.

wchar_t eu[] = L"\u20AC";
has two elements: 20ac and 0000, like every NUL-terminated string. And if I do:
wint_t ch = eu[0]; //conversion works
iswprint(ch) still returns 0.

Either I don't understand the meaning of "print" or something goes wrong here.

Can anyone help?

TIA
Manolo |
From: Earnie B. <ea...@us...> - 2012-08-30 18:15:06
|
On Thu, Aug 30, 2012 at 12:35 PM, Manolo wrote:
> Hi
>
> wint_t ch = 0x20ac; //euro sign
> iswalpha(ch) and iswprint(ch) returns 0.
>
> The irony is that on XP, compiled with g++ 4.6
> MessageBoxW(NULL, buf, L"Result of \u20AC", MB_OK);
> shows the right symbol.
>
> wchar_t eu[] = L"\u20AC";
> has two elements: 20ac and 0000, as every NULL terminated. And I do:
> wint_t ch = eu[0]; //conversion works
> but still iswprint(ch) returns 0.
>
> Either I don't understand the meaning of "print" or somethings goes wrong here.
>
> Anyone could help?

It would be helpful for you to attach a small test case and give build
commands. I'll take a stab that you don't define UNICODE and
_UNICODE. MSVC now sets both these by default. See
http://msdn.microsoft.com/en-us/library/windows/desktop/ff381407(v=vs.85).aspx
for more on strings.

--
Earnie
-- https://sites.google.com/site/earnieboyd |
From: Manolo <man...@gm...> - 2012-08-30 21:35:56
|
> It would be helpful for you to attach a small test case and give build
> commands. I'll take a stab that you don't define UNICODE and
> _UNICODE. MSVC now sets both these by default. See
> http://msdn.microsoft.com/en-us/library/windows/desktop/ff381407(v=vs.85).aspx
> for more on strings.
>

Small test: euro.cpp
===========
#include <wctype.h>
#include <stdio.h>
#include <windows.h>

//useless defines
#define UNICODE
#define _UNICODE

int main()
{
    int ich, resa, resp;
    wchar_t wch;
    wchar_t buf[70];

    //Try 1
    ich = 0x20ac; // euro
    wch = (wchar_t) ich;
    resa = iswalpha(ich);
    resp = iswprint(ich);
    swprintf(buf, L"hex: 0x%X '%C' iswalpha: %d iswprint: %d \n", ich, wch, resa, resp);
    MessageBoxW(NULL, buf, L"Result 1 for \u20AC", MB_OK);

    //Try 2
    wchar_t eu[] = L"\u20AC";
    wchar_t euH = eu[0];
    wint_t wi = euH;
    resa = iswalpha(wi);
    resp = iswprint(wi);
    swprintf(buf, L"hex: 0x%X '%C' iswalpha: %d iswprint: %d \n", wi, euH, resa, resp);
    MessageBoxW(NULL, buf, L"Result 2 for \u20AC", MB_OK);

    return 0;
}

Build commands
===============
g++.exe -Wall -O2 -IC:\MinGW\include -c C:\PROGS\pruebas\euro\euro.cpp -o CB_obj\Release\euro.o
mingw32-g++.exe -LC:\MinGW\lib -o CB_bin\Release\euro.exe CB_obj\Release\euro.o -s

Or just in a shell:
g++ euro.cpp
and execute the created a.exe

TIA
Manolo |
From: Greg C. <gch...@sb...> - 2012-08-30 21:55:23
|
On 2012-08-30 21:35Z, Manolo wrote:
[...]
> #include <wctype.h>
> #include <stdio.h>
> #include <windows.h>
>
> //useless defines
> #define UNICODE
> #define _UNICODE
>
> int main()

Does it work as expected if you define those macros first,
before including any header?

#define UNICODE
#define _UNICODE

#include <wctype.h>
#include <stdio.h>
#include <windows.h> |
From: Manolo <man...@gm...> - 2012-08-30 22:01:55
|
On 30/08/2012 23:55, Greg Chicares wrote:
> On 2012-08-30 21:35Z, Manolo wrote:
> [...]
>> #include <wctype.h>
>> #include <stdio.h>
>> #include <windows.h>
>>
>> //useless defines
>> #define UNICODE
>> #define _UNICODE
>>
>> int main()
> Does it work as expected if you define those macros first,
> before including any header?
>
> #define UNICODE
> #define _UNICODE
>
> #include <wctype.h>
> #include <stdio.h>
> #include <windows.h>
>

Whether I #define them or not, and wherever I put the #defines, I get the same [wrong] result.
That's why I wrote "useless defines" |
From: KHMan <kei...@gm...> - 2012-08-31 01:16:24
|
On 8/31/2012 6:01 AM, Manolo wrote:
> On 30/08/2012 23:55, Greg Chicares wrote:
>> On 2012-08-30 21:35Z, Manolo wrote:
>> [...]
>>> #include<wctype.h>
>>> #include<stdio.h>
>>> #include<windows.h>
>>>
>>> //useless defines
>>> #define UNICODE
>>> #define _UNICODE
>>>
>>> int main()
>> Does it work as expected if you define those macros first,
>> before including any header?
>>
>> #define UNICODE
>> #define _UNICODE
>>
>> #include<wctype.h>
>> #include<stdio.h>
>> #include<windows.h>
>>
> Wether I #define or not, neither where I #define, same [wrong] result.
> That's why I wrote "useless defines"

True, I guess UNICODE didn't do anything because of the explicit
Unicode calls used. Still, there are a lot of tests you could have
performed to learn about the behaviour of those calls. I did some
testing, and here is what I found on XP:

%c worked for me, Euro shown. %C didn't work. Depends on C runtime
libs, I suppose.

Try dumping a large section of iswalpha() and iswprint() data. Both
returned what appeared to be valid data. However, I did not find
printable flags in the data files provided by the Unicode Consortium.

Try dumping 0x20A0 to 0x20B5, these are the currency symbols
currently defined by Unicode. iswprint() returned all zeros. I
guess you should find some other way to check for printability...

--
Cheers,
Kein-Hong Man (esq.)
Kuala Lumpur, Malaysia |
From: Earnie B. <ea...@us...> - 2012-08-31 12:42:44
|
On Thu, Aug 30, 2012 at 9:16 PM, KHMan wrote:
>
> %c worked for me, Euro shown. %C didn't work. Depends on C runtime
> libs, I suppose.
>

See http://msdn.microsoft.com/en-US/library/hf4y5e3w(v=vs.80)

%c for swprintf will be a wide character
%C for swprintf will be a single byte character
%c for sprintf will be a single byte character
%C for sprintf will be a wide character

G++ will give an error when using wide character data for sprintf though.

--
Earnie
-- https://sites.google.com/site/earnieboyd |
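One way to sidestep the %c / %C flip described above is the explicit %lc conversion, which means "wide character" in both the printf and the wprintf families. The following is only a minimal sketch: format_wide_char is a hypothetical helper name, and it assumes the ISO C form of swprintf that takes the buffer size as its second argument (the test program in this thread uses the older two-argument msvcrt form instead).

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: writes "[<wc>]" into buf using %lc, which is
   always a wide character, so it does not depend on whether the
   surrounding call is a narrow or a wide printf variant.
   Assumes the ISO C swprintf(buf, count, format, ...) signature.
   Returns the number of wide characters written, or a negative
   value on failure. */
int format_wide_char(wchar_t *buf, size_t count, wchar_t wc)
{
    return swprintf(buf, count, L"[%lc]", (wint_t)wc);
}
```

Calling format_wide_char(buf, 8, L'A') should leave the wide string "[A]" in buf. Whether a euro sign formatted this way is then displayed correctly still depends on the output path, as it does in the original test.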
From: George K. <xke...@ne...> - 2012-08-31 02:27:09
|
On 8/30/2012 4:35 PM, Manolo wrote:
> wint_t ch = 0x20ac; //euro sign
> iswalpha(ch) and iswprint(ch) returns 0.

This is wrong. U+20AC EURO SIGN is punctuation, so iswalpha(ch) must
return 0, but iswpunct(ch) and iswprint(ch) must return nonzero. I
confirmed this with OpenBSD in a Unicode locale.

You should not use isw routines to test Unicode characters. If
Microsoft ever changes iswprint(0x20ac) to return nonzero, there will
remain another problem. Windows always uses UTF-16 for wchar_t or wint_t.

"The isw routines produce meaningful results for any integer value
from -1 (WEOF) to 0xFFFF, inclusive."
-- http://msdn.microsoft.com/en-us/library/4yc6feha.aspx

Can one check if U+1D11E MUSICAL SYMBOL G CLEF is a printable
character? U+1D11E in UTF-16 becomes a surrogate pair, a wchar_t[2].
Yet iswprint() can take only one wchar_t. I conclude that iswprint()
can never check if U+1D11E is printable.

You might want to find a library that knows that U+20AC is printable
and U+1D11E is printable. The most famous Unicode library might be
ICU4C (from http://site.icu-project.org/), but I have never used this
library, and I do not know whether it provides this feature.

--George Koehler |
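To make the surrogate-pair point concrete, here is a small sketch of the standard UTF-16 encoding arithmetic. The function name utf16_encode is hypothetical and not part of any library mentioned in this thread; it only shows why a code point above U+FFFF cannot fit in the single 16-bit wchar_t that iswprint() accepts on Windows.

```c
#include <stdint.h>

/* Hypothetical helper: encodes one Unicode code point as UTF-16.
   Returns the number of 16-bit units written to out (1 or 2),
   or 0 if cp is not a valid scalar value (a lone surrogate or
   anything above U+10FFFF). */
int utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;      /* fits in one unit, e.g. U+20AC */
        return 1;
    }
    cp -= 0x10000;                  /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate */
    return 2;
}
```

For U+1D11E this yields the two units 0xD834 and 0xDD1E, while U+20AC stays a single unit 0x20AC. Neither half of the pair is a character on its own, so passing either one to iswprint() cannot answer the question about U+1D11E.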
From: Earnie B. <ea...@us...> - 2012-08-31 12:29:54
|
On Thu, Aug 30, 2012 at 10:26 PM, George Koehler <xke...@ne...> wrote:
> On 8/30/2012 4:35 PM, Manolo wrote:
>> wint_t ch = 0x20ac; //euro sign
>> iswalpha(ch) and iswprint(ch) returns 0.
>
> This is wrong. U+20AC EURO SIGN is punctuation, so iswalpha(ch) must
> return 0, but iswpunct(ch) and iswprint(ch) must return 1. I confirmed
> this with OpenBSD in a Unicode locale.
>

Yes, is(w)alpha will return 0 for any character that is not used to
represent words of a language. The is(w)print will return 0 for any
non-printable character where non-printable is determined by the
locale setting. See
http://msdn.microsoft.com/en-us/library/ewx8s4kw.aspx.

--
Earnie
-- https://sites.google.com/site/earnieboyd |
From: Earnie B. <ea...@us...> - 2012-08-31 11:48:31
|
On Thu, Aug 30, 2012 at 9:16 PM, KHMan wrote:
>
> True, I guess UNICODE didn't do anything because of the explicit
> Unicode calls used. Still, there are a lot of tests you could have
> performed to learn about the behaviour of those calls.

You must define both UNICODE and _UNICODE before including the header
files because of the preprocessor directives in the headers.
Microsoft states[1] that Visual C++ sets both by default when you
create a new project. So my point is that it does do something; but
it may not have resolved the issue.

[1] http://msdn.microsoft.com/en-us/library/windows/desktop/ff381407(v=vs.85).aspx

--
Earnie
-- https://sites.google.com/site/earnieboyd |
From: KHMan <kei...@gm...> - 2012-08-31 14:45:51
|
On 8/31/2012 7:48 PM, Earnie Boyd wrote:
> On Thu, Aug 30, 2012 at 9:16 PM, KHMan wrote:
>>
>> True, I guess UNICODE didn't do anything because of the explicit
>> Unicode calls used. Still, there are a lot of tests you could have
>> performed to learn about the behaviour of those calls.
>
> You must define both UNICODE and _UNICODE before including the header
> files because of the preprocessor directives in the headers.
> Microsoft states[1] that Visual C++ sets both by default when you
> create a new project. So my point is that it does do something; but
> it may not have resolved the issue.
>
> [1] http://msdn.microsoft.com/en-us/library/windows/desktop/ff381407(v=vs.85).aspx

My bad. Sorry for the noise.

--
Cheers,
Kein-Hong Man (esq.)
Kuala Lumpur, Malaysia |
From: Manolo <man...@gm...> - 2012-08-31 18:25:49
|
Hi

Thanks to Earnie, I learned the %c / %C differences ;) But this was not the
issue. In these GUI days, I only use xxprintf() for the occasional test.

Thanks, George, for reminding me about surrogate pairs. But they are
so little used that, for now, I don't worry about them.

I did more tests. I tested all values < 0xFFFF with 12 different functions.
I counted how many returned != 0, and the minimum and maximum values
for which the result was != 0. Here is what I got:

total iswalnum=  46011  min= 0x0030  max= 0xffdc
total iswalpha=  45810  min= 0x0041  max= 0xffdc
total iswcntrl=  89     min= 0x0000  max= 0xfffb
total iswdigit=  201    min= 0x0030  max= 0xff19
total iswgraph=  46342  min= 0x0021  max= 0xffdc
total iswlower=  832    min= 0x0061  max= 0xff5a
total iswprint=  46347  min= 0x0009  max= 0xffdc
total iswpunct=  334    min= 0x0021  max= 0xff65
total iswspace=  24     min= 0x0009  max= 0x3000
total iswupper=  717    min= 0x0041  max= 0xff3a
total iswascii=  128    min= 0x0000  max= 0x007f
total iswxdigit= 44     min= 0x0030  max= 0xff46

Funny things you can see:
- 46347 values are classified as "printable"
- 24 values are "space"
- 44 xdigit and 201 digit
- among the 89 control values, 0 is one of them.

I also tested the 0x20ac euro value with all of these iswxxx().
All of them return 0.

I suppose MS is to blame, because MinGW delegates these functions
to msvcrt.dll, right?

Thanks,
Manolo |
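The survey above can be reproduced with a loop along the following lines. This is a sketch, not the program Manolo actually ran (which was not posted): the struct class_stats and the function count_class are hypothetical names, and the loop covers values below 0xFFFF to match the range stated in the message.

```c
#include <wctype.h>

/* Hypothetical record for one classification function: how many
   values tested nonzero, and the smallest and largest such value.
   min and max are meaningful only when total is greater than 0. */
struct class_stats {
    unsigned long total;
    unsigned int min;
    unsigned int max;
};

/* Runs fn over every value in [0, 0xFFFF) and collects the stats. */
struct class_stats count_class(int (*fn)(wint_t))
{
    struct class_stats s = {0, 0, 0};
    unsigned int c;

    for (c = 0; c < 0xFFFF; ++c) {
        if (fn((wint_t)c)) {
            if (s.total == 0)
                s.min = c;   /* first hit is the minimum */
            s.max = c;       /* last hit so far is the maximum */
            ++s.total;
        }
    }
    return s;
}
```

A call such as count_class(iswprint) returns the three numbers for one row of the table. The results depend on the C runtime and on the current LC_CTYPE locale, so figures obtained elsewhere will not necessarily match the msvcrt.dll numbers reported here.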
From: Earnie B. <ea...@us...> - 2012-08-31 19:43:23
|
On Fri, Aug 31, 2012 at 2:25 PM, Manolo wrote:
>
> I also tested the 0x20ac euro value with all of these iswxxx().
> All of them return 0.
>

I think you'll need to set your code page to one that supports the
Euro character but I'm guessing.

> I suppose it's MS to blame, because Mingw delegates these functions
> to msvcrt.dll right?

Yes, unless we've enhanced the function. We haven't for these though.

--
Earnie
-- https://sites.google.com/site/earnieboyd |
From: Keith M. <kei...@us...> - 2012-09-01 09:22:40
|
On 31/08/12 20:43, Earnie Boyd wrote:
> On Fri, Aug 31, 2012 at 2:25 PM, Manolo wrote:
>> I also tested the 0x20ac euro value with all of these iswxxx().
>> All of them return 0.
>
> I think you'll need to set your code page to one that supports the
> Euro character but I'm guessing.

And perhaps also a call to setlocale()? In the default C (aka POSIX)
locale, only those characters designated within the POSIX portable set
would be classified as printable, and U+20AC isn't within that set.

--
Regards,
Keith. |
From: George K. <xke...@ne...> - 2012-09-01 16:28:34
|
On 9/1/2012 9:22 AM, Keith Marshall wrote:
> And perhaps also a call to setlocale()? In the default C (aka POSIX)
> locale, only those characters designated within the POSIX portable set
> would be classified as printable, and U+20AC isn't within that set.

Except this is Windows, not POSIX.

"For the isw routines, the result of the test for the specified
condition is independent of locale."
-- MSDN: http://msdn.microsoft.com/en-us/library/4yc6feha.aspx

Using Windows, my test program shows that iswprint(0x20ac) is 0, with or
without setlocale(LC_CTYPE, ""). I do need setlocale to see the euro
sign in Windows-1252.

--George Koehler

#include <locale.h>
#include <stdio.h>
#include <wchar.h>
#include <wctype.h>

int main()
{
    wint_t ch = 0x20ac; /* euro sign */

    setlocale(LC_CTYPE, "");
    printf("'%lc' is printable: %s\n", ch, iswprint(ch) ? "yes" : "no");
    return 0;
} |
From: Earnie B. <ea...@us...> - 2012-09-01 16:47:50
|
On Sat, Sep 1, 2012 at 12:28 PM, George Koehler wrote:
> On 9/1/2012 9:22 AM, Keith Marshall wrote:
>> And perhaps also a call to setlocale()? In the default C (aka POSIX)
>> locale, only those characters designated within the POSIX portable set
>> would be classified as printable, and U+20AC isn't within that set.
>
> Except this is Windows, not POSIX.
>
> "For the isw routines, the result of the test for the specified
> condition is independent of locale."
> -- MSDN: http://msdn.microsoft.com/en-us/library/4yc6feha.aspx
>

Wrong.

"The result of the test condition for these functions depends on the
LC_CTYPE category setting of the locale; see setlocale for more
information. The versions of these functions without the _l suffix use
the current locale for any locale-dependent behavior; the versions
with the _l suffix are identical except that they use the locale
passed in instead. For more information, see Locale."
http://msdn.microsoft.com/en-us/library/ewx8s4kw.aspx

Caution: the _l versions are MSVC 2012 dependent and have yet to be
added to the MinGW runtime.

> Using Windows, my test program shows that iswprint(0x20ac) is 0, with or
> without setlocale(LC_CTYPE, ""). I do need setlocale to see the euro
> sign in Windows-1252.

Which also states that your code page must support it as well.

--
Earnie
-- https://sites.google.com/site/earnieboyd |
From: Manolo <man...@gm...> - 2012-09-02 00:53:50
|
>>
>> "For the isw routines, the result of the test for the specified
>> condition is independent of locale."
>> -- MSDN: http://msdn.microsoft.com/en-us/library/4yc6feha.aspx
>>
>
> Wrong.
>
> "The result of the test condition for these functions depends on the
> LC_CTYPE category setting of the locale; see setlocale for more
> information. The versions of these functions without the _l suffix use
> the current locale for any locale-dependent behavior; the versions
> with the _l suffix are identical except that they use the locale
> passed in instead. For more information, see Locale."
> http://msdn.microsoft.com/en-us/library/ewx8s4kw.aspx
>
> Caution: the _l versions are MSVC 2012 dependent and have yet to be
> added to the MinGW runtime.
>

Well, MSDN contradicts itself, which doesn't help.

I have tested both with setlocale(LC_ALL, "esp") and without calling
setlocale() at all. The result of iswprint() is *the same*, at least on XP.
Can anyone test it on Vista or W8?

Setting the locale _does_ change other charset-dependent functions, such as wctomb().
I've used wctomb() to see how \u20ac translates into multibyte.
If I set the locale ("esp" uses CP 1252) it translates to 0x80, which is the
code for the euro sign in this locale.
If I don't set any locale, it uses the "C" locale and translates to 0xb0,
which is used for "not translated".
If I set the locale and try with \u01a1 (latin small letter o with horn)
it translates to a simple 'o'. Close, but inexact.

So, wctomb() seems to work as expected. But iswprint() does not.

From http://pubs.opengroup.org/onlinepubs/009695399/functions/iswprint.html
"The iswprint() function shall test whether wc is a wide-character code
representing a character of class print in the program's current locale;
see the Base Definitions volume of IEEE Std 1003.1-2001, Chapter 7, Locale."

I also did some tests on Unix.
Without setting any locale, iswxxxx() returns != 0 only for values <= 0xF7.
No locale means the "C" locale, so this result is good.
With my locale set (Spanish utf8), iswxxxx() works like a charm. And for 0x20ac,
iswgraph(), iswpunct() and iswprint() return != 0, as expected. Also good.

The only thing I can conclude from all my tests is that MS's iswxxx()
functions are broken for the euro sign (and maybe for some other characters).

Regards,
Manolo |
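The with/without-locale comparison described above can be written as one small helper, so the same call can be repeated for several locale names. This is a sketch under assumptions: iswprint_in_locale is a hypothetical name, valid locale names differ per platform ("esp" on Windows, something like "es_ES.UTF-8" on many Unix systems), and some C libraries accept names they do not really support.

```c
#include <locale.h>
#include <string.h>
#include <wctype.h>

/* Hypothetical helper: evaluates iswprint(wc) with LC_CTYPE switched
   to the named locale, then restores the previous setting.
   Returns 1 if printable, 0 if not, and -1 if the locale could not
   be selected (or the current locale name could not be saved). */
int iswprint_in_locale(const char *name, wint_t wc)
{
    char saved[256];
    const char *current = setlocale(LC_CTYPE, NULL);
    int result;

    if (current == NULL || strlen(current) >= sizeof saved)
        return -1;
    /* Copy the name: a later setlocale call may reuse its buffer. */
    strcpy(saved, current);

    if (setlocale(LC_CTYPE, name) == NULL)
        return -1;
    result = iswprint(wc) != 0;
    setlocale(LC_CTYPE, saved);
    return result;
}
```

Comparing iswprint_in_locale("C", 0x20ac) with iswprint_in_locale("", 0x20ac) on each system gives the two data points discussed in this thread. The helper only reports what the runtime returns; it does not say which answer is correct.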