From: Markus S. <mar...@gm...> - 2024-12-17 22:31:14
On Tue, Dec 17, 2024 at 7:32 AM Gregorio Litenstein <g.l...@gm...>
wrote:
> While debugging some stuff related to text conversion, I have noticed that
> converting from UTF16 to UTF8 (via an intermediary `UnicodeString` and
> `toUTF8String`) results in a UTF8 string that starts with \xEF\xBB\xBF. Why
> is this BOM being appended to my string, and why does it only seem to
> happen when converting from UTF16?

UnicodeString::toUTF8String() does not prepend the BOM.

I just tried this:

    UnicodeString s16(u"abcçカ🚴");
    std::string s8;
    s16.toUTF8String(s8);
    printf("s8.length=%d [%2x %2x %2x %2x ...] \"%s\"\n",
           (int)s8.length(),
           (uint8_t)s8[0], (uint8_t)s8[1], (uint8_t)s8[2], (uint8_t)s8[3],
           s8.c_str());

As expected, this outputs 12 bytes, starting with 0x61 for 'a':

    s8.length=12 [61 62 63 c3 ...] "abcçカ🚴"

If you get a BOM (which in UTF-8 is really merely a "signature byte
sequence") in the output from UnicodeString::toUTF8String(), then the
UnicodeString starts with U+FEFF.
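
One way to confirm this is to inspect the UTF-16 input before converting it. The following is only a minimal sketch using the standard library rather than ICU; the helper names (startsWithBom, withoutLeadingBom, hasUtf8Signature, checkExample) are hypothetical and chosen for illustration. With ICU itself, the equivalent check would presumably be comparing s16.charAt(0) against 0xFEFF and, if it matches, calling s16.remove(0, 1) before toUTF8String().

```cpp
#include <cassert>
#include <string>

// U+FEFF encoded as a single UTF-16 code unit.
constexpr char16_t kBom = u'\uFEFF';

// Hypothetical helper: true if the UTF-16 text begins with U+FEFF.
bool startsWithBom(const std::u16string& s) {
    return !s.empty() && s.front() == kBom;
}

// Hypothetical helper: returns a copy of the text without a leading U+FEFF.
std::u16string withoutLeadingBom(std::u16string s) {
    if (startsWithBom(s)) {
        s.erase(0, 1);
    }
    return s;
}

// Hypothetical helper: true if the UTF-8 bytes begin with EF BB BF,
// which is what U+FEFF becomes when encoded as UTF-8.
bool hasUtf8Signature(const std::string& s) {
    return s.size() >= 3 &&
           static_cast<unsigned char>(s[0]) == 0xEF &&
           static_cast<unsigned char>(s[1]) == 0xBB &&
           static_cast<unsigned char>(s[2]) == 0xBF;
}

// Small check of the helpers on input that carries a leading U+FEFF.
void checkExample() {
    const std::u16string input = std::u16string(u"\uFEFF") + u"abc";
    assert(startsWithBom(input));
    assert(withoutLeadingBom(input) == u"abc");
}
```

If startsWithBom() reports true for the data going into the UnicodeString, the U+FEFF comes from the input (for example, a file that was read together with its BOM) and not from the conversion.
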
Best regards,
markus