From: George R. <gr...@us...> - 2008-07-23 08:44:29
After looking at the ucnvsel.h header, I have a few comments.

1) This public header incorrectly includes private headers. The udataswp.h header is private, and it unnecessarily exports API that doesn't need to be public (e.g. ucnvsel_swap).

2) Are the serialize and unserialize functions really needed? Are they meant to be a way to freeze into an optimal data structure for performance reasons (as the API documentation states)? If so, why not implement a freeze function instead of exposing the serialization functions? Most ICU4C objects do not have public API for object serialization. While I'm not against such a serialization process, it reminds me of the safeClone operations, which I've never really been a fan of maintaining. I'm sure that Andy has an opinion on ubrk_safeClone, and I'm sure that Markus has an opinion on ucnv_safeClone ;-) Coming at this functionality with a blank slate, the API documentation doesn't provide enough information on the scenarios in which the serialize/unserialize functions should be used.

3) ucnvsel_open uses a UBool to state whether fallbacks should be considered. Why not use UConverterUnicodeSet directly instead? There is no need to remap this enum to a boolean. The enum is there in case we ever want to add other kinds of Unicode set values to consider. Maybe we want to consider only reverse fallbacks, in order to decide which Unicode characters may be unsafe during a conversion process. That exact enum value isn't in UConverterUnicodeSet right now, but my point is that other values may be added in the future.

4) The ucnvsel_open documentation doesn't say who owns the list of converter aliases. You might want to make it explicit that the list is copied. BTW, the strings from ICU's ucnv_getAvailableName API don't need to be copied, but I'm sure the implementation is simpler if it always copies. Also, I explicitly aligned the start of those char * strings on 2-byte boundaries so that string referencing could happen faster.
If you look at what happens with the Microsoft compiler optimization options /O1 (optimize for smaller code) and /O2 (optimize for faster code), those options actually affect how the char * strings are aligned. It seems that on Intel CPUs, memory references to these strings are faster when the strings are aligned on 2-byte boundaries. I only saw a small performance difference in my testing a few years ago. The improvement is likely due to how a memory reference reads several bytes from RAM into the cache even though you're only reading a single byte. I only mention this in case performance is a big concern.

5) You have a ucnvsel_selectForString and a ucnvsel_selectForUTF8. Have you considered also adding an API for USet? Such an API could be helpful for implementing ticket 49 (http://bugs.icu-project.org/trac/ticket/49). By using a USet as input for the API, you could use the USet from ulocdata_getExemplarSet with this API to more generically ask ICU, "Which charsets support these languages?" While you can technically already get that information, I presume the whole purpose of this API is to simplify "common" operations built on ucnv_getUnicodeSet. BTW, someone should cross-link tickets 49 and 6453 with each other.

Quickly looking at the implementation, I have the following comments:

6) You shouldn't reinvent the wheel when trying to find the endianness in ICU. I recommend using U_IS_BIG_ENDIAN to get the endianness of the platform. This is what the rest of the code uses. It's reliable and fast.

7) I find it interesting that a count-the-one-bits-in-a-mask algorithm is being used. Other people on the ICU team might remember that I use that question for interviews. I guess it proves that my standard interview question does have practical uses :-) That's probably only the second time I've seen that happen, and it's the best implementation I've seen.
8) The code states the following:

// 66000 as suggested by Markus [I suggest something like 66000 which
// exceeds the number of BMP code points. There will be fewer ranges of
// combinations of encodings. (I believe there are no encodings that have
// interesting mappings for supplementary code points. All encodings either
// support all of them or none of them.)]

When did that happen? If you want interesting supplementary ranges, try the ibm-1390, ibm-1399, ibm-16684, Big5-HKSCS and ISO-2022-CN-EXT charsets. They have several non-contiguous supplementary ranges. There may be others in ICU, but these are a minimal set. The first three contain JIS X 0213, which also means that they contain several multi-codepoint mappings. You should take a look at Converter Explorer for the Unicode set values for these charsets. These multi-codepoint values were an issue for generating the list of supported languages per charset in Converter Explorer. You should make sure you test these specific charsets, since it's technically possible to have a multi-codepoint value whose individual codepoints are not supported by a given codepage. For example, ibm-1390 supports the character sequence {\u304B\u309A}, but \u309A does NOT have a valid separate mapping. I believe uset_span says it can do this type of matching operation, but I don't think your string iteration code allows for this scenario of multi-codepoint support.

George Rhoten
Tivoli
San José, CA, USA