From: George R. <gr...@us...> - 2008-07-23 08:44:29
After looking at the ucnvsel.h header, I have a few comments.

1) This public header incorrectly includes private headers. The udataswp.h header is private, and it unnecessarily exports API that doesn't need to be public (e.g. ucnvsel_swap).

2) Are the serialize and unserialize functions really needed? Are they meant to be a way to freeze into an optimal data structure for performance reasons (as the API documentation states)? If so, why not implement a freeze function instead of exposing the serialization functions? Most ICU4C objects do not have public API for object serialization. While I'm not against such a serialization process, it reminds me of the safeClone operations, which I've never really been a fan of maintaining. I'm sure that Andy has an opinion on ubrk_safeClone, and I'm sure that Markus has an opinion on ucnv_safeClone ;-) Coming at this functionality with a blank slate, the API documentation doesn't provide enough information on the scenarios in which the serialize/unserialize functions should be used.

3) ucnvsel_open uses a UBool to state whether fallbacks should be considered. Why not use UConverterUnicodeSet directly instead? There is no need to remap this enum to a boolean. The enum is there in case we ever want to add other kinds of Unicode set values to consider. Maybe we want to consider only reverse fallbacks, in order to decide which Unicode characters may be unsafe during a conversion process. That exact enum value isn't in UConverterUnicodeSet right now, but my point is that other values may be added in the future.

4) The ucnvsel_open documentation doesn't say who owns the list of converter aliases. You might want to make it explicit that the list is copied. BTW, the strings from ICU's ucnv_getAvailableName API don't need to be copied, but I'm sure the implementation is simpler if it always copies. Also, I explicitly aligned the start of those char * strings on 2-byte boundaries so that string referencing could happen faster.
If you look at what happens with the Microsoft compiler optimization options /O1 (optimize for smaller code) and /O2 (optimize for faster code), those options actually affect how the char * strings are aligned. It seems that on Intel CPUs, memory references to these strings are faster when the strings are aligned on 2-byte boundaries. I only saw a small performance difference in my testing a few years ago. The improvement is likely due to how a memory reference reads several bytes from RAM into the cache even though you're only reading a single byte. I only mention this in case performance is a big concern.

5) You have a ucnvsel_selectForString and a ucnvsel_selectForUTF8. Have you considered also adding an API for USet? Such an API could be helpful for implementing ticket 49 (http://bugs.icu-project.org/trac/ticket/49). By using a USet as input for the API, you could use the USet from ulocdata_getExemplarSet with this API to more generically ask ICU, "Which charsets support these languages?" While you can technically already get that information, I presume the whole purpose of this API is to simplify "common" operations built on ucnv_getUnicodeSet. BTW, someone should cross-link tickets 49 and 6453 with each other.

Quickly looking at the implementation, I have the following comments:

6) You shouldn't reinvent the wheel when trying to find the endianness in ICU. I recommend using U_IS_BIG_ENDIAN to get the endianness of the platform. This is what the rest of the code uses. It's reliable and fast.

7) I find it interesting that a count-the-one-bits-in-a-mask algorithm is being used. Other people on the ICU team might remember that I use that question for interviews. I guess it proves that my standard interview question does have practical uses :-) That's probably only the second time I've seen that happen, and it's the best implementation I've seen.
8) The code states the following:

// 66000 as suggested by Markus [I suggest something like 66000 which
// exceeds the number of BMP code points. There will be fewer ranges of
// combinations of encodings. (I believe there are no encodings that have
// interesting mappings for supplementary code points. All encodings either
// support all of them or none of them.)]

When did that happen? If you want interesting supplementary ranges, try the ibm-1390, ibm-1399, ibm-16684, Big5-HKSCS and ISO-2022-CN-EXT charsets. They have several non-contiguous supplementary ranges. There may be others in ICU, but these are a minimal set. The first three contain JIS X 0213, which also means that they contain several multi-codepoint mappings. You should take a look at Converter Explorer for the Unicode set values for these charsets. These multi-codepoint values were an issue for generating the list of supported languages per charset in Converter Explorer. You should make sure you test these specific charsets, since it's technically possible to have a multi-codepoint value whose individual codepoints are not supported by a given codepage. For example, ibm-1390 supports the character sequence {\u304B\u309A}, but \u309A does NOT have a valid separate mapping. I believe uset_span says it can do this type of matching operation, but I don't think your string iteration code allows for this scenario of multi-codepoint support.

George Rhoten
Tivoli
San José, CA, USA