Re: [icu-support] UnicodeString::toUTF8String, why BOM?


OK, I think the fundamental issue is that I didn't properly understand how the BOM actually works. I was assuming it behaved like magic numbers, FourCC codes, or similar (and so I was under the impression that BOMs were agreed-upon but arbitrary sequences).

If I now understand it correctly, it's always a single sequence that is not a visible character in either endianness, but trying to interpret it one way or the other fails in a specific and expected way, and catching that failure gives a hint as to which endianness to use, yes?
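
To make that concrete, here is a minimal sketch of the check, assuming the byte stream is already known to be UTF-16 (the byte pair FF FE is also how a UTF-32LE signature begins, which this sketch ignores). The names `ByteOrder` and `detect_utf16_byte_order` are hypothetical and are not ICU API.

```cpp
#include <cstddef>
#include <cstdint>

enum class ByteOrder { Unknown, BigEndian, LittleEndian };

// Inspect the first two bytes of a UTF-16 byte stream.
// U+FEFF serialized big-endian is FE FF; serialized little-endian it is FF FE.
// Decoded with the wrong byte order, the same two bytes would give U+FFFE,
// which Unicode reserves as a noncharacter, so the two cases cannot be confused.
ByteOrder detect_utf16_byte_order(const std::uint8_t* data, std::size_t size) {
    if (size < 2) return ByteOrder::Unknown;
    if (data[0] == 0xFE && data[1] == 0xFF) return ByteOrder::BigEndian;
    if (data[0] == 0xFF && data[1] == 0xFE) return ByteOrder::LittleEndian;
    return ByteOrder::Unknown;  // no BOM: the caller has to decide some other way
}
```

So the BOM is not an arbitrary magic number: it is the ordinary code point U+FEFF, and the detection works because its byte-swapped form is never valid text.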

Is 0xEFBBBF the direct translation of 0xFEFF?
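
One way to check this by hand is to run U+FEFF through the standard UTF-8 bit layout. The sketch below does that with a hypothetical helper, `encode_utf8`, written only for illustration; it does not reject surrogates or values above U+10FFFF, and it is not how ICU implements the conversion.

```cpp
#include <string>

// Encode a single code point as UTF-8 using the standard 1- to 4-byte layout.
// Illustration only: no validation of surrogates or out-of-range values.
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // U+FEFF lands here: 1111 111011 111111 -> EF BB BF
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Calling `encode_utf8(0xFEFF)` yields the three bytes EF BB BF, so the UTF-8 "BOM" is simply U+FEFF encoded like any other code point.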

I just got the answer to my question from Wikipedia, it seems:

https://en.wikipedia.org/wiki/Byte_order_mark#UTF-8 says, "...or that it was converted to UTF-8 from a stream that contained an optional BOM. The standard also does not recommend removing a BOM when it is there..."
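
For debugging, it can help to test whether a converted string carries the signature before deciding what to do with it. The following is a small sketch on plain `std::string`; the function names are hypothetical, and whether stripping is appropriate depends on who consumes the data, so the strip is kept as a separate, explicit step.

```cpp
#include <string>

// True if the string begins with the UTF-8 signature bytes EF BB BF.
bool has_utf8_signature(const std::string& s) {
    return s.size() >= 3 && s.compare(0, 3, "\xEF\xBB\xBF") == 0;
}

// Return a copy without the leading signature, if one is present.
// Only call this when the consumer is known not to want the signature.
std::string without_utf8_signature(const std::string& s) {
    return has_utf8_signature(s) ? s.substr(3) : s;
}
```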

Thanks again for helping me clear this up!

Gregorio Litenstein Goldzweig
Médico Cirujano

• Fono: +56 9 96343643
• E-Mail: g.l...@gm...

On 17 Dec 2024 at 19:49 -0300, Steven R. Loomis <srl...@un...>, wrote:
> Thanks, Markus,
>
> I actually wonder if the issue is the other way around… Gregorio, perhaps you are choosing a converter type that does not recognize the BOM, but your input data has one. In that case, ICU won’t automatically detect and strip it.
>
> Steven
>
> > On Tue, Dec 17, 2024 at 4:31 PM, Markus Scherer <mar...@gm...> wrote:
> > > On Tue, Dec 17, 2024 at 7:32 AM Gregorio Litenstein <g.l...@gm...> wrote:
> > > > > While debugging some stuff related to text conversion, I have noticed that converting from UTF-16 to UTF-8 (via an intermediary `UnicodeString` and `toUTF8String`) results in a UTF-8 string that starts with \xEF\xBB\xBF. Why is this BOM being prepended to my string, and why does it only seem to happen when converting from UTF-16?
> > > >
> > > > UnicodeString::toUTF8String() does not prepend the BOM.
> > > >
> > > > I just tried this:
> > > >     UnicodeString s16(u"abcçカ🚴");
> > > >     std::string s8;
> > > >     s16.toUTF8String(s8);
> > > >     printf("s8.length=%d [%2x %2x %2x %2x ...] \"%s\"\n",
> > > >            (int)s8.length(),
> > > >            (uint8_t)s8[0], (uint8_t)s8[1], (uint8_t)s8[2], (uint8_t)s8[3],
> > > >            s8.c_str());
> > > >
> > > > As expected, this outputs 12 bytes, starting with 0x61 for 'a':
> > > > s8.length=12 [61 62 63 c3 ...] "abcçカ🚴"
> > > >
> > > > If you get a BOM (which in UTF-8 is really merely a "signature byte sequence") in the output from UnicodeString::toUTF8String(), then the UnicodeString starts with U+FEFF.
> > > >
> > > > Best regards,
> > > > markus
> > > --
> > > You received this message because you are subscribed to the Google Groups "icu-support" group.
> > > To unsubscribe from this group and stop receiving emails from it, send an email to icu...@un....
> > > To view this discussion visit https://groups.google.com/a/unicode.org/d/msgid/icu-support/CAN49p6pEGnnjWcZE1NHGCucAjG5zYh%3DAVtC2m9UWMzBGTtf9Jw%40mail.gmail.com.

