From: Andrei Z. <an...@ya...> - 2005-08-22 22:32:47
|
I'm trying to use u_strCaseCompare to perform case-insensitive comparisons of strings. One of the test cases I have does not seem to work though:

  u_strCaseCompare("ß", 1, "ss", 2, U_COMPARE_CODE_POINT_ORDER, &status);

This keeps returning -1, although I was expecting that sharp S and 'ss' would fold to the same string. Am I doing something wrong?

- Andrei |
From: <dav...@us...> - 2005-08-23 04:03:28
|
> I'm trying to use u_strCaseCompare to perform case-insensitive
> comparisons of strings. One of the test cases I have does not seem to
> work though:
>
>   u_strCaseCompare("ÃŸ", 1, "ss", 2, U_COMPARE_CODE_POINT_ORDER, &status);

Interesting. What version of the ICU are you using? I'm using 3.2, and I can't find any version of this function that accepts strings encoded in UTF-8.

The following snippet of code returns 0, as expected:

  // 0xDF is Latin Small Letter Sharp S (Eszett)
  const UChar   str1[] = { 0xDF };

  // 0x73 is Latin Small Letter S
  const UChar   str2[] = { 0x73, 0x73 };

  UErrorCode  status = U_ZERO_ERROR;

  int const   result =
                u_strCaseCompare(
                    str1,
                    1,
                    str2,
                    2,
                    U_COMPARE_CODE_POINT_ORDER,
                    &status);

Dave |
From: Andrei Z. <an...@ya...> - 2005-08-23 16:32:04
|
On Mon, 22 Aug 2005, dav...@us... wrote:

> Interesting. What version of the ICU are you using? I'm using 3.2, and I
> can't find any version of this function that accepts strings encoded in
> UTF-8.
>
> The following snippet of code returns 0, as expected:
>
> // 0xDF is Latin Small Letter Sharp S (Eszett)
> const UChar str1[] = { 0xDF };
>
> // 0x73 is Latin Small Letter S
> const UChar str2[] = { 0x73, 0x73 };
>
> UErrorCode status = U_ZERO_ERROR;
>
> int const result =
>     u_strCaseCompare(
>         str1,
>         1,
>         str2,
>         2,
>         U_COMPARE_CODE_POINT_ORDER,
>         &status);

You're right, it does work. I had an embarrassingly silly mistake in my code. :)

- Andrei |
From: Wenlin I. <we...@we...> - 2005-08-23 06:13:38
|
On Aug 22, 2005, at 15:31, Andrei Zmievski wrote:

> I'm trying to use u_strCaseCompare to perform case-insensitive
> comparisons of strings. One of the test cases I have does not seem to
> work though:
>
>   u_strCaseCompare("ß", 1, "ss", 2, U_COMPARE_CODE_POINT_ORDER, &status);
>
> This keeps returning -1, although I was expecting that sharp S and 'ss'
> would fold to the same string.. Am I doing something wrong?

I don't know the answer, but I'm also having trouble using UTF-8 literals with ICU, as mentioned in another thread ("collation ... abc > ábc"). Apparently char strings are assumed by ICU to be non-Unicode, at least sometimes.

On Aug 22, 2005, at 21:00, dav...@us... wrote:
...
> // 0xDF is Latin Small Letter Sharp S (Eszett)
> const UChar str1[] = { 0xDF };

So you're using 16-bit UChar instead of 8-bit char (UTF-8), and you're forced to use hexadecimal instead of a string literal. Yikes! I guess for a non-BMP character like U+20000 you'd have to specify the UTF-16 surrogates in hexadecimal? Or could you use UChar32 instead of UChar?

I hope it's not too off-topic to note that the email system had some trouble with a non-ASCII character in the message about u_strCaseCompare(). The character in question is U+00DF = ß = LATIN SMALL LETTER SHARP S.

The first message (from Andrei) appears to include these headers:

  User-Agent: Mutt/1.4.1i
  Content-Type: text/plain; charset=unknown-8bit

Evidently it was UTF-8 encoded. I had to tell my email program (Apple Mail) explicitly to use UTF-8 so I could read the message; otherwise (with "Automatic" encoding), the ß appeared as U+00C3 = Ã = LATIN CAPITAL LETTER A WITH TILDE.

The second message (from Dave) appears to include these headers:

  X-Mailer: Lotus Notes Release 6.5.2 June 01, 2004
  Content-Type: text/plain; charset="UTF-8"
  Content-Transfer-Encoding: base64

But, in the second message, with encoding set to UTF-8, instead of ß I see two characters: U+00C3 = Ã = LATIN CAPITAL LETTER A WITH TILDE, followed by U+0178 = Ÿ = LATIN CAPITAL LETTER Y WITH DIAERESIS. I can't make the second message display correctly. This snafu might be the cumulative result of two causes: the first email program failed to specify charset="UTF-8", even though it did correctly transmit UTF-8 text; the second email program failed to recognize the first message as UTF-8, and when it composed a reply, it assumed the first message was non-UTF-8, though the reply was UTF-8.

This becomes slightly clearer if you notice that in UTF-8, U+00DF = ß = LATIN SMALL LETTER SHARP S is encoded as two bytes:

  c3 9f

So, it's clear that the U+00C3 = Ã = LATIN CAPITAL LETTER A WITH TILDE could have resulted from misinterpreting the UTF-8 as LATIN1. But how the second byte 9f turned into U+0178 = Ÿ = LATIN CAPITAL LETTER Y WITH DIAERESIS, who knows, maybe a "codepage"? (U+009F is a control character.)

Here's a tip that others on this list might find helpful. At the bottom of your email signature, you can include this character:

  U+262F = ☯ = YIN YANG

Since this character is not in most other character sets, its inclusion appears to cause some email applications (in this case, Apple Mail) to use Unicode -- with "charset=UTF-8" in the header. I wish I knew a more straightforward way to force UTF-8; Apple Mail defaults to "Automatic" for every message; it seems you can only change the encoding for the current message, not specify a default encoding for new messages.
Some other dingbats besides ☯ might work just as well: ☮☻☵䷸...

Tom Bishop

> - Andrei
>
> -------------------------------------------------------
> SF.Net email is Sponsored by the Better Software Conference & EXPO
> September 19-22, 2005 * San Francisco, CA * Development Lifecycle Practices
> Agile & Plan-Driven Development * Managing Projects & Teams * Testing & QA
> Security * Process Improvement & Measurement * http://www.sqe.com/bsce5sf
> _______________________________________________
> icu-support mailing list - icu...@li...
> To Un/Subscribe: https://lists.sourceforge.net/lists/listinfo/icu-support

文林 Wenlin Institute, Inc. Software for Learning Chinese
E-mail: we...@we... Web: http://www.wenlin.com
Telephone: 1-877-4-WENLIN (1-877-493-6546)
☯ |
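The "codepage" guess in the message above is consistent with windows-1252, in which byte 0x9F maps to U+0178, whereas Latin-1 maps the same byte to the control character U+009F. The following is only a minimal C++ sketch of the two misreadings, not a reconstruction of what either mail program actually did. The function names are made up for illustration, and the windows-1252 table is deliberately partial.

```cpp
#include <cstdint>
#include <string>

// Latin-1 (ISO-8859-1): every byte maps to the code point of the same value.
char32_t fromLatin1(std::uint8_t b) { return b; }

// windows-1252 differs from Latin-1 only in the range 0x80..0x9F.
// Assumption for this sketch: only two entries of that range are listed;
// every other byte falls through to the Latin-1 value.
char32_t fromWindows1252(std::uint8_t b) {
    switch (b) {
        case 0x80: return 0x20AC;  // EURO SIGN
        case 0x9F: return 0x0178;  // LATIN CAPITAL LETTER Y WITH DIAERESIS
        default:   return b;
    }
}

// Misread a UTF-8 byte string one byte at a time with a single-byte decoder.
std::u32string misread(const std::string& utf8Bytes,
                       char32_t (*decode)(std::uint8_t)) {
    std::u32string out;
    for (unsigned char b : utf8Bytes) {
        out.push_back(decode(b));
    }
    return out;
}
```

Calling misread("\xC3\x9F", fromLatin1) yields U+00C3 followed by U+009F, while misread("\xC3\x9F", fromWindows1252) yields U+00C3 followed by U+0178, which matches the two characters described above.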
From: George R. <gr...@us...> - 2005-08-23 15:36:32
|
Remember ICU uses UTF-16 throughout the vast majority of its interfaces. If you're really passing in a char * string, that's your problem. You should have seen a compiler warning about this issue. Please convert your string from UTF-8 to UTF-16, and that should fix your problem.

You can create string literals with a UnicodeString, and specify the charset to be UTF-8. That may be your easiest solution. See the UnicodeString constructors for details.

George Rhoten
IBM Globalization Center of Competency/ICU San José, CA, USA
http://www.icu-project.org/
http://icu.sourceforge.net/

Wenlin Institute <we...@we...>
Sent by: icu...@li...
08/22/2005 11:13 PM
Please respond to icu-support
To: icu...@li...
Subject: Re: [icu-support] u_strCaseCompare() issue |
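As a rough illustration of the UTF-8 to UTF-16 conversion suggested above, here is a hand-written C++ sketch with no ICU dependency. The name utf8ToUtf16 is hypothetical; in real ICU code one would more likely call u_strFromUTF8() from ustring.h or use a converter. The sketch is an assumption-laden simplification: it does not reject overlong forms or encoded surrogates. It also shows how a non-BMP code point such as U+20000 turns into a surrogate pair.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Convert UTF-8 bytes to UTF-16 code units (hypothetical helper, not an ICU API).
std::u16string utf8ToUtf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + extra >= in.size()) {
            throw std::invalid_argument("truncated UTF-8 sequence");
        }
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80) {
                throw std::invalid_argument("bad UTF-8 continuation byte");
            }
            cp = (cp << 6) | (b & 0x3F);
        }
        i += extra + 1;

        if (cp > 0x10FFFF) {
            throw std::invalid_argument("code point out of range");
        }
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Split a supplementary code point into a surrogate pair.
            const char32_t v = cp - 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));
        }
    }
    return out;
}
```

With this helper, the two bytes c3 9f become the single unit 0x00DF, which is the value Dave's UChar array spelled out by hand, and the four bytes f0 a0 80 80 (U+20000) become the pair 0xD840 0xDC00.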
From: Wenlin I. <we...@we...> - 2005-08-23 18:44:17
|
On Aug 23, 2005, at 08:36, George Rhoten wrote:

> Remember ICU uses UTF-16 throughout the vast majority of its interfaces.
> If you're really passing in a char * string, that's your problem.

It's not just my problem. Someone else, who started this thread, seems to have made the same mistake, and I'll bet we're not the first ones. Some of us have been using UTF-8 string literals in C programs since the late twentieth century, and we might naturally (naively?) assume that ICU, being "for Unicode", by default would treat a char * string as UTF-8, rather than Latin1, MacRoman, GB-2312, or any other non-Unicode encoding.

> You
> should have seen a compiler warning about this issue.

In my own case I was using the sample collation program icu/samples/coll/coll.cpp. That sample program declares two variables as follows:

  char * opt_source = "abc";
  char * opt_target = "abd";

These are passed to a routine u_unescape(), which evidently converts the strings to UTF-16 based on the assumption that they are in some non-Unicode encoding. This assumption seems counter-intuitive. I still don't know whether opt_source and opt_target are assumed to be Latin1, or what.

So, I'm not the one who introduced "char *" into the sample code, and there is no compiler warning (nor should there be one).

> Please convert your
> string from UTF-8 to UTF-16, and that should fix your problem.

Thank you, I'll try to fix coll.cpp so it supports UTF-8.

> You can create string literals with a UnicodeString, and specify the
> charset to be UTF-8. That may be your easiest solution. See the
> UnicodeString constructors for details.

OK, but it sure would be easier if you could just use UTF-8 char * strings directly. C compilers generally allow C source code to be UTF-8, but they don't allow C source code to be UTF-16. The GNU C compiler, for example, chokes on UTF-16 source code, but it handles UTF-8 just fine.
Isn't it time we were able to use Unicode text directly in our source code, without hexadecimal escaping or any other bothersome contortions? Consider the Perl language. If you just put this magic formula at the top of a perl script, then all text is assumed to be UTF-8 unless explicitly designated otherwise:

  use utf8;
  use open ':utf8';
  use open ':std';

It would be great if ICU had an analogous magic formula.

Best wishes,

Tom Bishop |
From: Steven R. L. <sr...@ic...> - 2005-08-23 19:00:10
|
ICU assumes that char*s are in the default character set for the operating system, that is, that which would be used with fopen(opt_source, "r"); to name a file, or with puts().

The API reference for u_unescape() refers to src as "a zero-terminated string of invariant characters"; utypes.h defines it in more detail, but it is basically a subset of ASCII or EBCDIC, depending.

We do have some APIs that provide UTF-8 access directly. Hope this helps,

Steven

Steven R. Loomis * sr...@ic...
http://ibm.com/software/globalization/icu

On 23 Awi 2005, at 11:44, Wenlin Institute wrote:

> In my own case I was using the sample collation program icu/samples/
> coll/coll.cpp. That sample program declares two variable as follows:
>
>   char * opt_source = "abc";
>   char * opt_target = "abd";
>
> These are passed to a routine u_unescape(), which evidently
> converts the strings to UTF-16 based on the assumption that they
> are in some non-Unicode encoding. This assumption seems counter-
> intuitive. I still don't know whether opt_source and opt_target are
> assumed to be Latin1, or what.
>
> So, I'm not the one who introduced "char *" into the sample code,
> and there is no compiler warning (nor should there be one). |
From: Wenlin I. <we...@we...> - 2005-08-23 21:21:26
|
On Aug 23, 2005, at 11:59, Steven R. Loomis wrote:

> ICU assumes that char*s are in the default character set for the
> operating system, that is, that which would be used with
> fopen(opt_source, "r"); to name a file, or with puts().

Interesting! For OS X, that must be UTF-8, since fopen() and fputs() work fine with UTF-8 on OS X.

> the API reference for u_unescape() refers to src as "a zero-
> terminated string of invariant characters", utypes.h defines it in
> more detail, but it is basically a subset of ASCII or EBCDIC,
> depending.

http://icu.sourceforge.net/apiref/icu4c/ustring_8h.html#a47 refers to "the compiler's codepage" (whatever that means)... but you're saying u_unescape() is limited to a subset of ASCII or EBCDIC, so it's unusable for UTF-8, regardless of the codepage, compiler, or operating system. Correct?

> We do have some APIs that provide UTF-8 access directly. Hope this
> helps,

Yes, it does help, thank you very much. Really, converting UTF-8 to UTF-16 isn't a problem, it just wasn't obvious that u_unescape() and coll.cpp were limited to ASCII, EBCDIC, and hexadecimal input. (A very strange limitation for Unicode software, but never mind.)

Best wishes,

Tom |
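Since u_unescape() is described above as expecting only "invariant characters", a caller could screen input before relying on it. The following C++ sketch is a guess at what such a check might look like, under the assumption that "all bytes below 0x80" is a good enough first filter; the name isAllAscii is hypothetical. Note that ASCII is a superset of ICU's invariant set, so this check is necessary but not sufficient on every platform.

```cpp
#include <string>

// Returns true when every byte is in the 7-bit ASCII range (hypothetical helper).
// Escapes such as "\\u00DF" pass, because the escape itself is spelled in ASCII;
// raw UTF-8 such as the bytes c3 9f does not.
bool isAllAscii(const std::string& s) {
    for (unsigned char b : s) {
        if (b >= 0x80) {
            return false;
        }
    }
    return true;
}
```

A program could route strings that pass this test to u_unescape() and treat everything else as UTF-8 that needs a real conversion to UTF-16.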
From: Wenlin I. <we...@we...> - 2005-08-25 17:17:26
|
> On Aug 23, 2005, at 11:59, Steven R. Loomis wrote:
>> the API reference for u_unescape() refers to src as "a zero-
>> terminated string of invariant characters", utypes.h defines it
>> in more detail, but it is basically a subset of ASCII or EBCDIC,
>> depending.

Evidently, while u_unescape() is only intended for a subset of ASCII or EBCDIC, if it is passed any non-ASCII characters (on a non-EBCDIC machine), it treats them as Latin1, regardless of the operating system, codepage, etc.

The key piece of code appears to be in the function u_charsToUChars(), in uinvchar.c:

***********
#if U_CHARSET_FAMILY==U_ASCII_FAMILY
        u=(UChar)c;
#elif U_CHARSET_FAMILY==U_EBCDIC_FAMILY
        u=(UChar)asciiFromEbcdic[c];
***********

Each byte is simply zero-extended from 8 to 16 bits.

In the sample code coll.cpp, there is no validity checking to make sure the input contains only "invariant characters", so let the user beware.

Best wishes,

Tom Bishop |
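To make the zero-extension concrete, here is a small C++ sketch of the behavior described above for ASCII-family platforms. It is not ICU's code; zeroExtend is a made-up name, and the cast through unsigned char is my own assumption about how to avoid sign extension on platforms where char is signed.

```cpp
#include <string>

// Widen each 8-bit byte to one 16-bit unit without any decoding
// (hypothetical helper that mimics the effect described in the thread).
std::u16string zeroExtend(const std::string& bytes) {
    std::u16string out;
    for (char c : bytes) {
        // Go through unsigned char first so 0xC3 becomes 0x00C3, not 0xFFC3.
        out.push_back(static_cast<char16_t>(static_cast<unsigned char>(c)));
    }
    return out;
}
```

Fed the UTF-8 bytes c3 9f, this produces the two units 0x00C3 and 0x009F rather than the single unit 0x00DF, which is why a UTF-8 sharp S passed through this path would not compare equal to "ss".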
From: Wenlin I. <we...@we...> - 2005-08-25 18:03:21
|
Sorry, I meant u_unescape(), not u_unencode().

-Tom |
From: Wenlin I. <we...@we...> - 2005-08-25 19:08:47
|
Below is a version of samples/coll/coll.cpp, revised to support UTF-8 input rather than only ASCII/EBCDIC/hexadecimal. It might not support UTF-8 on EBCDIC machines, but I've tried to make it work the same as the original version if only "invariant characters" are specified as input. The original version treated non-ASCII/EBCDIC input as Latin1, due to possibly unintentional behavior of u_unescape(). This revised version still supports Latin1 instead of UTF-8 if the -u_unescape option is specified, but this support is not guaranteed for future versions of ICU.

The new "-verbose" option might be useful to anyone studying ICU sort keys, regardless of whether UTF-8 is used.

Best wishes,

Tom Bishop

/********************************************************************
 * COPYRIGHT:
 * Copyright (C) 2002-2003 IBM, Inc. All Rights Reserved.
 *
 ********************************************************************/

/**
 * This program demos ICU string collation.
 */

/* Changes 2005.08.25 by tb...@we...:
 *   Added support for UTF-8 source and target, instead of only ASCII/EBCDIC/hexadecimal.
 *   Added the "-verbose" option to display UTF-16 and sort keys.
 *   Added the "-u_unescape" option to force u_unescape() instead of UTF-8 conversion.
 *   Added the "-utf8" option for UTF-8 conversion; needed only on EBCDIC machines.
 *   Changed name of a local function from strcmp() to ICU_CompareStrings().
 *   Removed warning U_USING_FALLBACK_ERROR for en_US.
 *   Fixed compiler warning about pOpt.
 *   Corrected a few typos and added a few comments.
 */

const char gHelpString[] =
    "usage: coll [options*] -source source_string -target target_string\n"
    "-help            Display this message.\n"
    "-locale name     ICU locale to use. Default is en_US\n"
#if 1 // 2005.08.25 deleted "file"
    "-rules rule      Collation rules (overrides locale)\n"
#else
    "-rules rule      Collation rules file (overrides locale)\n"
#endif
    "-french          French accent ordering\n"
    "-norm            Normalizing mode on\n"
    "-shifted         Shifted mode\n"
    "-lower           Lower case first\n"
    "-upper           Upper case first\n"
    "-case            Enable separate case level\n"
    "-level n         Sort level, 1 to 5, for Primary, Secondary, Tertiary, Quaternary, Identical\n"
    "-source string   Source string for comparison\n"
    "-target string   Target string for comparison\n"
    "-verbose         Display UTF-16 and sort keys\n" // 2005.08.25 added
    "-u_unescape      Force u_unescape() instead of UTF-8 conversion\n" // 2005.08.25 added
    "-utf8            Treat input as UTF-8 (needed only on EBCDIC machines?)\n" // 2005.08.25 added
    "Example coll -rules \\u0026b\\u003ca -source a -target b\n"
    /* Note: "\\u0026b\\u003ca" means "&b<a". */
    "The format \\uXXXX is supported for the rules and comparison strings.\n"
    "UTF-8 is also supported for comparison strings (but not mixed UTF-8 and \\uXXXX).\n" // 2005.08.25
    ;

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

#include <unicode/utypes.h>
#include <unicode/ucol.h>
#include <unicode/ustring.h>
#include <unicode/ucnv.h> /* C Converter API -- 2005.08.25 added for UConverter */

/**
 * Command line option variables
 * These global variables are set according to the options specified
 * on the command line by the user.
 */
char * opt_locale     = "en_US";
char * opt_rules      = 0;
UBool  opt_help       = FALSE;
UBool  opt_norm       = FALSE;
UBool  opt_french     = FALSE;
UBool  opt_shifted    = FALSE;
UBool  opt_lower      = FALSE;
UBool  opt_upper      = FALSE;
UBool  opt_case       = FALSE;
UBool  opt_verbose    = FALSE; // 2005.08.25
UBool  opt_u_unescape = FALSE; // 2005.08.25
UBool  opt_utf8       = FALSE; // 2005.08.25
int    opt_level      = 0;
char * opt_source     = "abc";
char * opt_target     = "abd";
UCollator * collator  = 0;

/**
 * Definitions for the command line options
 */
struct OptSpec {
    const char *name;
    enum {FLAG, NUM, STRING} type;
    void *pVar;
};

OptSpec opts[] = {
    {"-locale",     OptSpec::STRING, &opt_locale},
    {"-rules",      OptSpec::STRING, &opt_rules},
    {"-source",     OptSpec::STRING, &opt_source},
    {"-target",     OptSpec::STRING, &opt_target},
    {"-norm",       OptSpec::FLAG,   &opt_norm},
    {"-french",     OptSpec::FLAG,   &opt_french},
    {"-shifted",    OptSpec::FLAG,   &opt_shifted},
    {"-lower",      OptSpec::FLAG,   &opt_lower},
    {"-upper",      OptSpec::FLAG,   &opt_upper},
    {"-case",       OptSpec::FLAG,   &opt_case},
    {"-level",      OptSpec::NUM,    &opt_level},
    {"-help",       OptSpec::FLAG,   &opt_help},
    {"-?",          OptSpec::FLAG,   &opt_help},
    {"-verbose",    OptSpec::FLAG,   &opt_verbose},    // 2005.08.25
    {"-u_unescape", OptSpec::FLAG,   &opt_u_unescape}, // 2005.08.25
    {"-utf8",       OptSpec::FLAG,   &opt_utf8},       // 2005.08.25
    {0,             OptSpec::FLAG,   0}
};

/* Function prototypes: */
static UBool processOptions(int argc, const char **argv, OptSpec opts[]);
static UBool processCollator(void);
static int ICU_CompareStrings(void);
static uint32_t UTF8_To_UTF16(char *utf8, UChar *utf16, uint32_t utf16size);
static int AllASCII(char *s);
static void HexDumpUCharString(UChar *s);
static void HexDumpByteArray(uint8_t *key, size_t len);

/**
 * processOptions() Function to read the command line options.
 */
static UBool processOptions(int argc, const char **argv, OptSpec opts[])
{
    for (int argNum = 1; argNum < argc; argNum ++) {
        const char *pArgName = argv[argNum];
#if 1 // avoid warning
        OptSpec *pOpt;
        for (pOpt = opts; pOpt->name != 0; pOpt++) {
#else
        for (OptSpec *pOpt = opts; pOpt->name != 0; pOpt ++) {
#endif
            if (strcmp(pOpt->name, pArgName) == 0) {
                switch (pOpt->type) {
                case OptSpec::FLAG:
                    *(UBool *)(pOpt->pVar) = TRUE;
                    break;
                case OptSpec::STRING:
                    argNum ++;
                    if (argNum >= argc) {
                        fprintf(stderr, "value expected for \"%s\" option.\n", pOpt->name);
                        return FALSE;
                    }
                    *(const char **)(pOpt->pVar) = argv[argNum];
                    break;
                case OptSpec::NUM:
                    argNum ++;
                    if (argNum >= argc) {
                        fprintf(stderr, "value expected for \"%s\" option.\n", pOpt->name);
                        return FALSE;
                    }
                    char *endp;
                    int i = strtol(argv[argNum], &endp, 0);
                    if (endp == argv[argNum]) {
                        fprintf(stderr, "integer value expected for \"%s\" option.\n", pOpt->name);
                        return FALSE;
                    }
                    *(int *)(pOpt->pVar) = i;
                }
                break;
            }
        }
        if (pOpt->name == 0) {
            fprintf(stderr, "Unrecognized option \"%s\"\n", pArgName);
            return FALSE;
        }
    }
    return TRUE;
}

/**
 * ICU string comparison
 */
static int ICU_CompareStrings(void) // 2005.08.25 renamed, was strcmp
{
#if 1 // 2005.08.25
#define ICU_BUFFER_SIZE 1000
    UChar source[ICU_BUFFER_SIZE];
    UChar target[ICU_BUFFER_SIZE];

    /* If both strings are ASCII, or -u_unescape option is used, use u_unescape();
       otherwise assume UTF-8. But that doesn't work for EBCDIC, so don't assume
       UTF-8 on an EBCDIC machine unless -utf8 option is used. */
#if U_CHARSET_FAMILY==U_EBCDIC_FAMILY
    if (opt_u_unescape || ! opt_utf8) {
#else
    if (opt_u_unescape || (opt_utf8 == FALSE && AllASCII(opt_source) && AllASCII(opt_target))) {
#endif
        if (opt_verbose) {
            printf("Using u_unescape().\n");
        }
        u_unescape(opt_source, source, ICU_BUFFER_SIZE);
        u_unescape(opt_target, target, ICU_BUFFER_SIZE);
    }
    else {
        if (opt_verbose) {
            printf("Using UTF8_To_UTF16().\n");
        }
        UTF8_To_UTF16(opt_source, source, ICU_BUFFER_SIZE);
        UTF8_To_UTF16(opt_target, target, ICU_BUFFER_SIZE);
    }
    if (opt_verbose) {
        printf("source bytes before conversion: ");
        HexDumpByteArray((uint8_t*) opt_source, strlen(opt_source));
        printf("target bytes before conversion: ");
        HexDumpByteArray((uint8_t*) opt_target, strlen(opt_target));
        printf("UTF-16 source after conversion: ");
        HexDumpUCharString(source);
        printf("UTF-16 target after conversion: ");
        HexDumpUCharString(target);

        uint8_t sourceKey[1024], targetKey[1024];
        size_t sourceKeyLen, targetKeyLen;
        sourceKeyLen = ucol_getSortKey(collator, source, u_strlen(source), sourceKey, sizeof(sourceKey));
        targetKeyLen = ucol_getSortKey(collator, target, u_strlen(target), targetKey, sizeof(targetKey));
        printf("source key: ");
        HexDumpByteArray(sourceKey, sourceKeyLen);
        printf("target key: ");
        HexDumpByteArray(targetKey, targetKeyLen);
    }
#else
    UChar source[100];
    UChar target[100];
    u_unescape(opt_source, source, 100);
    u_unescape(opt_target, target, 100);
#endif
    UCollationResult result = ucol_strcoll(collator, source, -1, target, -1);
    if (result == UCOL_LESS) {
        return -1;
    }
    else if (result == UCOL_GREATER) {
        return 1;
    }
    return 0;
} // ICU_CompareStrings

/**
 * Creates a collator
 */
static UBool processCollator()
{
    // Set up an ICU collator
    UErrorCode status = U_ZERO_ERROR;
    UChar rules[100];

    if (opt_rules != 0) {
        u_unescape(opt_rules, rules, 100);
        collator = ucol_openRules(rules, -1, UCOL_OFF, UCOL_TERTIARY, NULL, &status);
    }
    else {
        collator = ucol_open(opt_locale, &status);
    }
    if (U_FAILURE(status)) {
        fprintf(stderr, "Collator creation failed.: %d\n", status);
        return FALSE;
    }
    if (status ==
U_USING_DEFAULT_WARNING) { fprintf(stderr, "Warning, U_USING_DEFAULT_WARNING for %s\n", opt_locale); } // 2005.08.25 don't issue pointless warning if U_USING_FALLBACK_ERROR for en_US. // (Maybe shouldn't issue it for other locales either?) if (status == U_USING_FALLBACK_WARNING && strcmp(opt_locale, "en_US") != 0) { fprintf(stderr, "Warning, U_USING_FALLBACK_ERROR for %s\n", opt_locale); } if (opt_norm) { ucol_setAttribute(collator, UCOL_NORMALIZATION_MODE, UCOL_ON, &status); } if (opt_french) { ucol_setAttribute(collator, UCOL_FRENCH_COLLATION, UCOL_ON, &status); } if (opt_lower) { ucol_setAttribute(collator, UCOL_CASE_FIRST, UCOL_LOWER_FIRST, &status); } if (opt_upper) { ucol_setAttribute(collator, UCOL_CASE_FIRST, UCOL_UPPER_FIRST, &status); } if (opt_case) { ucol_setAttribute(collator, UCOL_CASE_LEVEL, UCOL_ON, &status); } if (opt_shifted) { ucol_setAttribute(collator, UCOL_ALTERNATE_HANDLING, UCOL_SHIFTED, &status); } if (opt_level != 0) { switch (opt_level) { case 1: ucol_setAttribute(collator, UCOL_STRENGTH, UCOL_PRIMARY, &status); break; case 2: ucol_setAttribute(collator, UCOL_STRENGTH, UCOL_SECONDARY, &status); break; case 3: ucol_setAttribute(collator, UCOL_STRENGTH, UCOL_TERTIARY, &status); break; case 4: ucol_setAttribute(collator, UCOL_STRENGTH, UCOL_QUATERNARY, &status); break; case 5: ucol_setAttribute(collator, UCOL_STRENGTH, UCOL_IDENTICAL, &status); break; default: fprintf(stderr, "-level param must be between 1 and 5\n"); return FALSE; } } if (U_FAILURE(status)) { fprintf(stderr, "Collator attribute setting failed.: %d\n", status); return FALSE; } return TRUE; } // processCollator static uint32_t UTF8_To_UTF16(char *utf8, UChar *utf16, uint32_t utf16size) /* Convert a UTF-8 string to UTF-16. 
*/ { UConverter *conv = NULL; UErrorCode status = U_ZERO_ERROR; uint32_t len; conv = ucnv_open("utf-8", &status); if (U_FAILURE(status)) { fprintf(stderr, "Error, status = %d for ucnv_open\n", (int) status); exit(1); } len = ucnv_toUChars(conv, utf16, utf16size, utf8, strlen(utf8), &status); if (U_FAILURE(status)) { fprintf(stderr, "Error, status = %d for ucnv_toUChars\n", (int) status); exit(1); } ucnv_close(conv); return len; } // UTF8_To_UTF16 static int AllASCII(char *s) /* Return TRUE if all the characters in s are ASCII. */ { while (*s) { if (*s++ & 0x80) { return FALSE; } } return TRUE; } // AllASCII static void HexDumpUCharString(UChar *s) { while (*s) { printf("%04x ", (int) *s++); } printf("\n"); } // HexDumpUCharString static void HexDumpByteArray(uint8_t *key, size_t len) { size_t i; for (i = 0; i < len; i++) { printf("%02x ", (int) key[i]); } printf("\n"); } // HexDumpByteArray /** * Main -- process command line, read in and pre-process the input, * call other functions to do the actual tests. */ int main(int argc, const char** argv) { if (processOptions(argc, argv, opts) != TRUE || opt_help) { printf(gHelpString); return -1; } if (processCollator() != TRUE) { fprintf(stderr, "Error creating collator for comparison\n"); return -1; } fprintf(stdout, "Comparing source=%s and target=%s\n", opt_source, opt_target); int result = ICU_CompareStrings(); if (result == 0) { fprintf(stdout, "source is equal to target\n"); } else if (result < 0) { fprintf(stdout, "source is less than target\n"); } else { fprintf(stdout, "source is greater than target\n"); } ucol_close(collator); return 0; } // main |
From: Andrei Z. <an...@ya...> - 2005-08-23 19:41:55
|
On Tue, 23 Aug 2005, Wenlin Institute wrote: > It's not just my problem. Someone else, who started this thread, > seems to have made the same mistake, and I'll bet we're not the first > ones. No, my mistake was in my example, not in the real code, as far as UTF-8 vs. UTF-16 is concerned. I was simply trying to show visually what characters were being compared. In my app, I always pass in UTF-16 arguments. - Andrei |
From: Andy H. <and...@gm...> - 2005-08-25 04:04:23
|
Expanding a bit on the previous responses in this thread

On 8/23/05, Wenlin Institute <we...@we...> wrote:
> [...] it sure would be easier if you could just use UTF-8 char *
> strings directly. C compilers generally allow C source code to be
> UTF-8, but they don't allow C source code to be UTF-16.

Well, no, C compilers don't, in any portable way, generate UTF-8 encoded char * string literals. It works in some environments, and not in others. If you really care about the portability of your code, the only thing that you can safely put in C string literals are the annoyingly restricted "invariant characters", a set even smaller than 7 bit ASCII.

For better or worse, ICU is a UTF-16 based library. The ICU APIs taking in Unicode data almost all want UTF-16, and computations are done directly in UTF-16.

You are not alone in having UTF-8 encoded data and wanting to operate on it directly. Others, especially from the Linux world, have UTF-32 format wchar_t string data, and would really like to be able to work directly with that. We are looking at ways to make both of these easier than they are now, but ICU is and will remain a UTF-16 based library. See http://icu.sourceforge.net/userguide/utext.html if you're curious about what's being planned for support of non-ICU-native data formats.

And then there is (char *). My personal opinion is that ICU would have been better off with fewer (char *) function parameters than it has. Many of these are for things like locale names or charset names - items that can be coded safely with C string literals. But then, when you have a 100% UTF-16 Unicode app, the values need to be converted back to (char *) before calling some functions. Annoying. But not about to change.

Bottom Line - there are too many possible data storage formats for ICU to provide complete families of functions that directly operate on all plausible and useful formats. That's the breaks. Live with it.

-- Andy Heninger |
From: Wenlin I. <we...@we...> - 2005-08-25 17:55:06
|
On Aug 24, 2005, at 21:04, Andy Heninger wrote:
> Expanding a bit on the previous responses in this thread
>
> On 8/23/05, Wenlin Institute <we...@we...> wrote:
> [...] it sure would be easier if you could just use UTF-8 char *
> strings directly. C compilers generally allow C source code to be
> UTF-8, but they don't allow C source code to be UTF-16.
>
> Well, no, C compilers don't, in any portable way, generate UTF-8
> encoded char * string literals.

Right, well, we don't generally need the compiler to *generate* UTF-8, only to leave it unmolested, and that appears to be the case with all the C compilers I've used on MS-Windows, Macintosh, and Linux.

> It works in some environments, and not in others.

Would the main exception be EBCDIC machines?

> If you really care about the portability of your code, the only
> thing that you can safely put in C string literals are the
> annoyingly restricted "invariant characters", a set even smaller
> than 7 bit ASCII.

That may be the unfortunate truth today, but maybe everyone on this list would agree that UTF-8 should eventually be supported in C source code on all platforms. For example, a programmer involved with French should be able to write this --

char fr[] = "Français";

-- rather than this --

char fr[] = "Fran\\u00e7ais";

-- or this --

UChar fr[] = {'F', 'r', 'a', 'n', 0xe7, 'a', 'i', 's'};

> For better or worse, ICU is a UTF-16 based library. The ICU APIs
> taking in Unicode data almost all want UTF-16, and computations are
> done directly in UTF-16.

OK, that's only superficially inconvenient, since converting UTF-8 to UTF-16 isn't difficult. Thank you very much for your insightful comments. I'll also look into UText.

Best wishes,
Tom

文林 Wenlin Institute, Inc. Software for Learning Chinese
E-mail: we...@we... Web: http://www.wenlin.com
Telephone: 1-877-4-WENLIN (1-877-493-6546)
☯ |
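To make the three spellings of the French example concrete, here is a minimal sketch in plain C++ (no ICU) that writes the UTF-8 bytes with hex escapes, so the source file itself stays 7-bit, and then converts them to UTF-16 code units. The helper `utf8ToUtf16` and the constant `kFrancaisUtf8` are hypothetical names made up for this illustration; they are not ICU functions. The helper assumes well-formed input and does no validation, so in real code ICU's converter API is the safer route.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper, for illustration only: decodes well-formed UTF-8 into
// UTF-16 code units. Malformed input is NOT detected or rejected.
std::vector<std::uint16_t> utf8ToUtf16(const std::string& utf8) {
    std::vector<std::uint16_t> out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char b0 = static_cast<unsigned char>(utf8[i]);
        std::uint32_t cp;
        std::size_t len;
        if (b0 < 0x80)      { cp = b0;        len = 1; }  // ASCII
        else if (b0 < 0xE0) { cp = b0 & 0x1F; len = 2; }  // 2-byte sequence
        else if (b0 < 0xF0) { cp = b0 & 0x0F; len = 3; }  // 3-byte sequence
        else                { cp = b0 & 0x07; len = 4; }  // 4-byte sequence
        // Fold in the 6 payload bits of each continuation byte.
        for (std::size_t k = 1; k < len && i + k < utf8.size(); ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i + k]) & 0x3F);
        }
        i += len;
        if (cp >= 0x10000) {
            // Outside the BMP: emit a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<std::uint16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<std::uint16_t>(0xDC00 + (cp & 0x3FF)));
        } else {
            out.push_back(static_cast<std::uint16_t>(cp));
        }
    }
    return out;
}

// The French word with c-cedilla (U+00E7) spelled as UTF-8 bytes C3 A7.
// The literal is split in two so that the following 'a' is not absorbed
// into the \xA7 hex escape.
const char kFrancaisUtf8[] = "Fran\xC3\xA7" "ais";
```

Calling `utf8ToUtf16(kFrancaisUtf8)` yields eight code units with 0x00E7 in the fifth position, which matches the hand-written `UChar` array above. The split literal is the price of hex escapes: they are portable, but not pleasant to read.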
From: Eike R. <er...@su...> - 2005-08-26 13:22:45
|
Hi Wenlin, On Thu, Aug 25, 2005 at 10:55:02 -0700, Wenlin Institute wrote: > Right, well, we don't generally need the compiler to *generate* > UTF-8, only to leave it unmolested, and that appears to be the case > with all the C compilers I've used on MS-Windows, Macintosh, and Linux. It isn't if, for example, you use a Japanese MS compiler and have no #pragma setlocale("C"). It bails out if an 8-bit character sequence doesn't form a 2 byte Kanji character it expects, e.g. at the string literal's end if the last character before the double quote is 8-bit it combines the 8-bit plus the quote character and then sees no string end anymore.. the same with an 8-bit character at a comment line's end, it expects a second byte, which is missing. I further assume that even if it compiled it would simply render your utf-8 string useless because it converted the 8-bit sequences to double byte characters. This is just an example. Unescaped non-7-bit in C/C++ sources is a nono if thinking international. And regarding string literals the already mentioned invariant characters are even more a restriction. Yes, it would be nice if utf-8 was accepted everywhere, but it isn't, we've to live with that. Eike -- OOo/SO Calc core developer. Number formatter bedevilled I18N transpositionizer. GnuPG key 0x293C05FD: 997A 4C60 CE41 0149 0DB3 9E96 2F1A D073 293C 05FD |
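The failure mode described here can be modelled in a few lines. The sketch below is a deliberately simplified model, not the actual logic of any Microsoft compiler: it assumes the compiler scans a string literal as Shift_JIS (code page 932), that lead bytes fall in the ranges 0x81-0x9F and 0xE0-0xFC, and that every lead byte consumes the next byte whatever it is. Both function names are hypothetical.

```cpp
#include <cstddef>
#include <string>

// Assumed lead-byte ranges for Shift_JIS / code page 932.
bool isShiftJisLeadByte(unsigned char b) {
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Simplified model: walk the bytes between the quotes of a string literal
// the way a double-byte-aware scanner would. If the last byte is taken as a
// lead byte, its "trail byte" is the closing quote, so the scanner never
// sees the end of the string.
bool closingQuoteSwallowed(const std::string& literalBytes) {
    std::size_t i = 0;
    while (i < literalBytes.size()) {
        const unsigned char b = static_cast<unsigned char>(literalBytes[i]);
        i += isShiftJisLeadByte(b) ? 2 : 1;
    }
    // Stepping past the end means the final lead byte paired with the quote.
    return i > literalBytes.size();
}
```

Under these assumptions the UTF-8 bytes of sharp S (C3 9F) swallow the closing quote, because 0x9F is read as a lead byte, while the UTF-8 bytes of "Français" happen to survive. Whether a given literal breaks therefore depends on its exact byte sequence, which is what makes the problem hard to spot.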
From: Wenlin I. <we...@we...> - 2005-08-26 16:17:29
|
Hi Eike,

People on this list keep saying we have to "live with" the absence of Unicode compatibility. The question is, how much longer do we have to live with it? One year? Ten years? There's no need to be fatalistic. Many of the people writing tomorrow's operating systems and compilers are reading this list today, and the situation is really not so bad even with today's software, if used carefully.

You pointed out the possibility of adding #pragma setlocale("C") to a C program. That's not a high price to pay if it enables you to use UTF-8 instead of JIS in your source code, is it?

Look at what Larry Wall and his team have already accomplished with Perl. By adding one line to the top of a perl script, you can make it 100% Unicode-based. C is no more complex than Perl, but maybe there is less solidarity between the implementers of C. I wonder if Brian Kernighan and Dennis Ritchie would be willing to rally all the C people around the Unicode flag (or a Unicode #pragma). How come when I do a Google search for "Unicode pragma", most of the results are about Perl or Python?

Cheers,
Tom |
From: Wenlin I. <we...@we...> - 2005-08-26 18:07:59
|
Of course, if #pragma setlocale("C") has unwanted side-effects, unrelated to character encoding, then using it might well be too high a price to pay. What's needed is a pragma that simply identifies a C source file as UTF-8. Has anybody tried to standardize something like

#pragma utf8

?

-Tom |
From: Andy H. <and...@gm...> - 2005-08-26 20:34:36
|
On 8/26/05, Wenlin Institute <we...@we...> wrote:
> Has anybody tried to standardize something like
>
> #pragma utf8

You can see what has been standardized for C already here
http://www.open-std.org/jtc1/sc22/wg14/www/docs/n843.htm

One of the comp.lang.c* newsgroups would probably be a better place to find out what's really going on in this area.

Changes to the C and C++ standards happen very slowly, and, even after something is approved, you're still looking at a decade or so before the installed base of compilers has largely moved over. I don't like it either - there's stuff that's been in the language standards for years that I would love to use, but can't.

-- Andy |
From: <dav...@us...> - 2005-08-23 18:52:51
|
> I don't know the answer, but I'm also having trouble using UTF-8
> literals with ICU, as mentioned in another thread ("collation ... abc
> > ábc"). Apparently char strings are assumed by ICU to be non-
> Unicode, at least sometimes.

Well, it's hard to tell in an email client, whether your strings are actually encoded in UTF-8. If you're using the class UnicodeString, you need to read the documentation for the constructors that accept const char* and allow you to specify a string that indicates the encoding of the data.

> So you're using 16-bit UChar instead of 8-bit char (UTF-8), and
> you're forced to use hexadecimal instead of a string literal. Yikes!
> I guess for a non-BMP character like U+20000 you'd have to specify
> the UTF-16 surrogates in hexadecimal? Or could you use UChar32
> instead of UChar?

I'm "forced" in the sense that there is no portable way in C/C++ to specify string literals encoded in UTF-8, short of using octal, hexadecimal, or \uxxxx escape sequences. Since the ICU operates internally in UTF-16, it's far better to work with string literals encoded in UTF-16. Encoding a Unicode character outside of the BMP does require using surrogate pairs.

> But, in the second message, with encoding set to UTF-8, instead of ß
> I see two characters: U+00C3 = Ã = LATIN CAPITAL LETTER A WITH
> TILDE, followed by U+0178 = Ÿ = LATIN CAPITAL LETTER Y WITH
> DIAERESIS. I can't make the second message display correctly. This
> snafu might be the cumulative result of two causes: the first email
> program failed to specify charset="UTF-8", even though it did
> correctly transmit UTF-8 text; the second email program failed to
> recognize the first message as UTF-8, and when it composed a reply,
> it assumed the first message was non-UTF-8, though the reply was UTF-8.

Yes, this was the result of serial mangling of the email message. Given the sad state of various email clients and gateways with regard to non-ASCII characters, I prefer to avoid using them in cases like this. That's why I gave the Unicode code point for the character along with its description -- no need to worry about it getting mangled.

> So, it's clear that the U+00C3 = Ã = LATIN CAPITAL LETTER A WITH
> TILDE could have resulted from misinterpreting the UTF-8 as LATIN1.
> But how the second byte 9f turned into U+0178 = Ÿ = LATIN CAPITAL
> LETTER Y WITH DIAERESIS, who knows, maybe a "codepage"? (U+009F is a
> control character.)

One possibility is Windows-1252, which has U+0178 at code point 0x9F.

Dave |
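Two points from this exchange are easy to check by hand: the surrogate pair needed for a non-BMP character such as U+20000, and the way the UTF-8 bytes of sharp S (C3 9F) turn into U+00C3 followed by U+0178 when read as Windows-1252. The sketch below is plain C++ with no ICU calls; `toSurrogatePair` and `misreadAsCp1252` are hypothetical helpers written for this illustration, and the code page table is reduced to the single 0x9F entry discussed here, so it is not a general Windows-1252 decoder.

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Split a supplementary code point (U+10000..U+10FFFF) into its UTF-16
// surrogate pair: first = lead (high) surrogate, second = trail (low).
std::pair<std::uint16_t, std::uint16_t> toSurrogatePair(std::uint32_t cp) {
    const std::uint32_t v = cp - 0x10000;
    return std::make_pair(static_cast<std::uint16_t>(0xD800 + (v >> 10)),
                          static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF)));
}

// Interpret each byte as one Windows-1252 character. Only the 0x9F -> U+0178
// mapping is filled in; every other byte is passed through unchanged, which
// is right for 0x00..0x7F and 0xA0..0xFF but NOT for the rest of 0x80..0x9E.
std::vector<std::uint32_t> misreadAsCp1252(const std::string& bytes) {
    std::vector<std::uint32_t> out;
    for (std::string::size_type i = 0; i < bytes.size(); ++i) {
        const unsigned char b = static_cast<unsigned char>(bytes[i]);
        out.push_back(b == 0x9F ? 0x0178u : b);
    }
    return out;
}
```

With these helpers, `toSurrogatePair(0x20000)` gives D840 DC00, so a `UChar` array holding that character needs two elements, and `misreadAsCp1252("\xC3\x9F")` gives the two code points U+00C3 and U+0178 reported in the mangled message.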