Re: [icu-support] UnicodeString::toUTF8String, why BOM?

Actually… a couple follow-up questions:

1) Are instances of icu::Locale expensive to construct? That is, does it make sense to keep one around, built from the app’s current language setting?
2) Do I need to worry about cleaning them up? If so, how?

Thanks again!

Gregorio Litenstein Goldzweig
Médico Cirujano

• Fono: +56 9 96343643
• E-Mail: g.l...@gm...

On 19 Dec 2024 at 23:04 -0300, Markus Scherer <mar...@gm...>, wrote:
> Minor code feedback --
>
> On Thu, Dec 19, 2024 at 5:35 PM Gregorio Litenstein <g.l...@gm...> wrote:
> > icu::UnicodeString Converter::convertToUTF8(std::string_view sv) {
> > 	std::scoped_lock l(m_lock);
> > 	icu::UnicodeString ret(sv.data(), static_cast<int>(sv.length()), m_converter.get(), m_error);
> > 	if (m_error.isFailure()) throw std::runtime_error("Couldn't convert string: " + std::string(sv) + " to UTF-8. Error: " + std::to_string(m_error.get()) + ": " + m_error.errorName());
> > 	return ret;
> > }
>
> Misnomer: This converts to UTF-16, not to UTF-8.
>
> > std::string UnicodeUtil::convertToUTF8 (std::string_view str, std::string _filename, CaseMapping toCase, bool assumeUTF8) {
> > 	icu::UnicodeString ustring;
> > 	std::string charset;
> > 	if (assumeUTF8) charset = "UTF-8";
> > 	else charset = UnicodeUtil::getCharset(str);
> > 		if (charset != "UTF-8") {
> > 			if (!_filename.empty()) {
> > 				SpdLogger::info(LogSystem::I18N, "Filename={} does not seem to be UTF-8. Detected encoding={}", _filename, charset);
> > 			}
> > 			ustring = UnicodeUtil::getConverter(charset).convertToUTF8(str);
> > 		}
> > 	else { ustring = icu::UnicodeString::fromUTF8(str.data()); }
>
> This line does not work if str contains NUL bytes. Just remove .data() -->
> else { ustring = icu::UnicodeString::fromUTF8(str); }
>
> > 	switch(toCase) {
> > 		case CaseMapping::UPPER:
> > 			ustring.toUpper();
> > 			break;
> ...
>
> Note that case mappings are language-sensitive. Calling these functions without a Locale parameter / locale ID string will use the machine's default locale. You will get different results if your machine is set to Turkish, Dutch, Greek, ...
>
> > It seems that the BOM only gets appended for a UTF16 source (I tried converting from ANSI as well as Shift-JIS).
>
> As discussed, UnicodeString::toUTF8String() does not add a BOM. It will convert one if there is one.
>
> > Considering for UTF8 the BOM is not encouraged, I would expect ICU to just remove the UTF16 BOM and not add a new one.
>
> On input, the "UTF-16" converter will detect and remove the BOM. The "UTF-16LE" and "UTF-16BE" converters will not remove the BOM. All according to the standard.
>
> Best regards,
> markus
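
On the NUL-byte remark in the quoted reply: passing str.data() hands over a bare const char*, so the callee has to find the end by scanning for the first NUL, while passing the std::string_view itself keeps the explicit length. Below is a minimal standard-library sketch of that difference. The helper names are made up for illustration, and std::string stands in for the ICU call, on the assumption that the pointer path measures length strlen-style.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: the length a callee sees when it only gets a pointer.
// It must scan for the first NUL byte, so embedded NULs cut the text short.
// (Only safe when the underlying buffer is NUL-terminated somewhere.)
std::size_t lengthViaPointer(std::string_view sv) {
    return std::string(sv.data()).size();
}

// Hypothetical helper: the length a callee sees when it gets the view itself.
// The explicit length travels with the data, so embedded NULs survive.
std::size_t lengthViaView(std::string_view sv) {
    return std::string(sv).size();
}
```

For a five-byte view over "ab\0cd", the pointer path sees two bytes and the view path sees all five, which is the reason for dropping .data() in the fromUTF8 call.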

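The two BOM points in the quoted reply can be illustrated without ICU. This is a hand-rolled sketch, not ICU's implementation: bytesToUtf16 and utf16ToUtf8 are hypothetical names, the "no BOM means big-endian" fallback is an assumption of the sketch, and the input is assumed to be well-formed. It shows that a detecting decoder consumes the BOM while a fixed-endian decoder keeps U+FEFF as an ordinary code unit, and that a plain UTF-16 to UTF-8 conversion then turns a surviving U+FEFF into the bytes EF BB BF rather than adding anything of its own.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

enum class Utf16Mode { Detect, LittleEndian, BigEndian };

// Decode raw bytes into UTF-16 code units. Only Detect mode strips a BOM.
std::u16string bytesToUtf16(std::string_view bytes, Utf16Mode mode) {
    std::size_t pos = 0;
    bool littleEndian = (mode == Utf16Mode::LittleEndian);
    if (mode == Utf16Mode::Detect && bytes.size() >= 2) {
        const auto b0 = static_cast<unsigned char>(bytes[0]);
        const auto b1 = static_cast<unsigned char>(bytes[1]);
        if (b0 == 0xFF && b1 == 0xFE) { littleEndian = true; pos = 2; }
        else if (b0 == 0xFE && b1 == 0xFF) { littleEndian = false; pos = 2; }
        // No BOM found: fall back to big-endian (assumption of this sketch).
    }
    std::u16string out;
    // A trailing odd byte, if any, is ignored.
    for (; pos + 1 < bytes.size(); pos += 2) {
        const unsigned a = static_cast<unsigned char>(bytes[pos]);
        const unsigned b = static_cast<unsigned char>(bytes[pos + 1]);
        out += static_cast<char16_t>(littleEndian ? ((b << 8) | a) : ((a << 8) | b));
    }
    return out;
}

// Encode UTF-16 as UTF-8. Nothing is added or removed: a leading U+FEFF
// is encoded like any other character. Unpaired surrogates are not
// treated specially.
std::string utf16ToUtf8(std::u16string_view in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        // Combine a surrogate pair into one supplementary code point.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()) {
            const char32_t lo = in[i + 1];
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                ++i;
            }
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

Feeding the bytes FF FE 68 00 69 00 through Detect mode gives "hi" with no BOM, while LittleEndian mode keeps U+FEFF in front, and that is what later shows up as EF BB BF in the UTF-8 output.
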