Re: [icu-support] UnicodeString::toUTF8String, why BOM?


Gregorio Litenstein Goldzweig
Médico Cirujano

• Fono: +56 9 96343643
• E-Mail: g.l...@gm...

On 19 Dec 2024 at 23:04 -0300, Markus Scherer <mar...@gm...>, wrote:
> Minor code feedback --
>
> On Thu, Dec 19, 2024 at 5:35 PM Gregorio Litenstein <g.l...@gm...> wrote:
> > icu::UnicodeString Converter::convertToUTF8(std::string_view sv) {
> > 	std::scoped_lock l(m_lock);
> > 	icu::UnicodeString ret(sv.data(), static_cast<int>(sv.length()), m_converter.get(), m_error);
> > 	if (m_error.isFailure()) throw std::runtime_error("Couldn't convert string: " + std::string(sv) + " to UTF-8. Error: " + std::to_string(m_error.get()) + ": " + m_error.errorName());
> > 	return ret;
> > }
>
> Misnomer: This converts to UTF-16, not to UTF-8.

You're right of course. I guess I didn't think about it when I wrote that, but I'm loath to change it now after several years. I'll probably add a comment though.

>
> > std::string UnicodeUtil::convertToUTF8 (std::string_view str, std::string _filename, CaseMapping toCase, bool assumeUTF8) {
> > 	icu::UnicodeString ustring;
> > 	std::string charset;
> > 	if (assumeUTF8) charset = "UTF-8";
> > 	else charset = UnicodeUtil::getCharset(str);
> > 	if (charset != "UTF-8") {
> > 		if (!_filename.empty()) {
> > 			SpdLogger::info(LogSystem::I18N, "Filename={} does not seem to be UTF-8. Detected encoding={}", _filename, charset);
> > 		}
> > 		ustring = UnicodeUtil::getConverter(charset).convertToUTF8(str);
> > 	}
> > 	else { ustring = icu::UnicodeString::fromUTF8(str.data()); }
>
> This line does not work if str contains NUL bytes. Just remove .data() -->
> else { ustring = icu::UnicodeString::fromUTF8(str); }
>
That was a deliberate (if possibly misguided) decision, made at a time when some of the platforms we were working with had versions of ICU before 65. Thanks for pointing it out, though.
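
For readers following the NUL-byte point: a bare `const char*` carries no length, so whatever receives it has to scan for the first NUL terminator, while a `std::string_view` carries its length explicitly. The sketch below shows only that difference, using `std::string` as a stand-in for the ICU types; the helper names `copyViaPointer` and `copyViaLength` are hypothetical and are not part of ICU or of the code quoted above.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical helper: copies the way a pointer-only API would.
// The length is found by scanning for the first '\0', so anything after an
// embedded NUL is lost. This assumes the underlying buffer is NUL-terminated,
// which a string_view does not guarantee in general.
std::string copyViaPointer(std::string_view sv) {
    return std::string(sv.data());
}

// Hypothetical helper: copies using the explicit length carried by the view,
// so embedded NUL bytes survive.
std::string copyViaLength(std::string_view sv) {
    return std::string(sv);
}
```

With a five-byte input such as `std::string("ab\0cd", 5)`, the pointer-based copy holds 2 bytes and the length-based copy holds all 5. Passing the view itself to `fromUTF8`, as suggested above, avoids the truncation in the same way.
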
> > 	switch(toCase) {
> > 		case CaseMapping::UPPER:
> > 			ustring.toUpper();
> > 			break;
> ...
>
> Note that case mappings are language-sensitive. Calling these functions without a Locale parameter / locale ID string will use the machine's default locale. You will get different results if your machine is set to Turkish, Dutch, Greek, ...
>
This is very unlikely to actually come up in usage of our app, but thanks for pointing it out; it should be easy to correct.
> > It seems that the BOM only gets appended for a UTF16 source (I tried converting from ANSI as well as Shift-JIS).
>
> As discussed, UnicodeString::toUTF8String() does not add a BOM. It will convert one if there is one.
>
> > Considering for UTF8 the BOM is not encouraged, I would expect ICU to just remove the UTF16 BOM and not add a new one.
>
> On input, the "UTF-16" converter will detect and remove the BOM. The "UTF-16LE" and "UTF-16BE" converters will not remove the BOM. All according to the standard.
>
> Best regards,
> markus
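
To illustrate the BOM point above: a BOM is just the code point U+FEFF, so if it is still present in the UTF-16 text, a plain UTF-16 to UTF-8 conversion turns it into the bytes EF BB BF without adding anything. The sketch below is a hand-rolled encoder for illustration only, not ICU's implementation; `appendUtf8`, `utf16ToUtf8` and `stripLeadingBom` are hypothetical names, and well-formed UTF-16 input is assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: append one code point to 'out' as UTF-8.
void appendUtf8(char32_t cp, std::string& out) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: convert UTF-16 to UTF-8 (well-formed input assumed).
// Nothing is added or removed; a U+FEFF in the input is encoded like any
// other code point.
std::string utf16ToUtf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        // Combine a surrogate pair into one supplementary code point.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()) {
            char32_t low = in[i + 1];
            if (low >= 0xDC00 && low <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                ++i;
            }
        }
        appendUtf8(cp, out);
    }
    return out;
}

// Hypothetical helper: drop a leading U+FEFF if one is present.
std::u16string stripLeadingBom(std::u16string s) {
    if (!s.empty() && s.front() == 0xFEFF) s.erase(0, 1);
    return s;
}
```

Converting `u"\uFEFFhi"` with `utf16ToUtf8` yields the bytes EF BB BF followed by `hi`; running the input through `stripLeadingBom` first yields just `hi`. That matches the behaviour described above: the BOM shows up in the UTF-8 output only when the input step left it in the string.
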

