From: Mohamed E. <el...@go...> - 2008-07-25 22:27:56
Cool. Thanks for the comments. I made some changes, and they will be checked in soon.

On Wed, Jul 23, 2008 at 1:41 AM, George Rhoten <gr...@us...> wrote:

> After looking at the ucnvsel.h header, I have a few comments.
>
> 1) This public header incorrectly includes private headers. The
> udataswp.h header is private, and it unnecessarily exports API that
> doesn't need to be public (e.g. ucnvsel_swap).

Yes, this was a mistake. I have corrected it.

> 2) Are the serialize and unserialize functions really needed? Are the
> functions meant to be a way to freeze into an optimal data structure for
> performance reasons (as the API states)? If so, why not implement a
> freeze function instead of exposing the serialization functions? Most
> ICU4C objects do not have public API for object serialization. While I'm
> not against such a serialization process, it reminds me of the safeClone
> operations, which I've never really been a fan of maintaining in the
> past. I'm sure that Andy has an opinion on ubrk_safeClone, and I'm sure
> that Markus has an opinion on ucnv_safeClone ;-) Coming at this
> functionality with a blank slate, the API documentation doesn't provide
> enough information on the scenarios in which the serialize/unserialize
> functions should be used.

That was just my bad English at work: I didn't describe the purpose of
serialization correctly. The idea is that unserializing is much cheaper
than constructing a selector. So, once someone knows which converters they
are interested in, the selector can be saved to a file, and unserialize
can load it instead of constructing it from scratch every time the program
runs. I have updated the comments to describe this more clearly.

> 3) ucnvsel_open uses a UBool to state whether fallbacks should be
> considered. Why not use UConverterUnicodeSet directly instead? There is
> no need to remap this enum to a boolean.
> The enum is there in case we ever want to add other states to consider
> Unicode set values. Maybe we want to consider only reverse fallbacks to
> decide which Unicode characters may be unsafe during a conversion
> process. That exact value isn't in UConverterUnicodeSet right now, but my
> point is that other values may be added in the future.

This is a very good point. I made the change here as well.

> 4) The ucnvsel_open function doesn't say who owns the list of converter
> aliases. You might want to make it explicit that the list is copied. BTW
> the strings from ICU's ucnv_getAvailableName API don't need to be copied,
> but I'm sure the implementation is simpler if it always copies. Also, I
> explicitly aligned the start of those char * strings on 2-byte boundaries
> so that string references could happen faster. If you look at what
> happens with the Microsoft compiler optimization options /O1 (optimize
> for smaller code) and /O2 (optimize for faster code), those options
> actually affect how the char * strings are aligned. It seems that memory
> references to these strings happen faster on Intel CPUs when they're
> aligned on 2-byte boundaries. I only saw a small performance difference
> in my testing a few years ago. The improvement is likely due to how a
> memory reference reads several bytes from RAM into the cache even though
> you're only reading a single byte. I only mention this in case
> performance is a big concern.

The comments are now updated to describe this.

> 5) You have a ucnvsel_selectForString and a ucnvsel_selectForUTF8. Have
> you considered just adding an API for USet? Such an API could also be
> helpful for implementing ticket 49
> (http://bugs.icu-project.org/trac/ticket/49).
>
> By using a USet as input for the API, you could use the USet from
> ulocdata_getExemplarSet to more generically ask ICU, "Which charsets
> support these languages?"
> While you can technically already get that information, I presume the
> whole purpose of this API is to simplify "common" operations with
> ucnv_getUnicodeSet. BTW someone should cross-link tickets 49 and 6453
> with each other.

This looks interesting. I will try to get it implemented given enough
time. If there is enough demand for it, I can give it higher priority.

> Quickly looking at the implementation, I have the following comments:
>
> 6) You shouldn't be reinventing the wheel when trying to find the
> endianness in ICU. I recommend using U_IS_BIG_ENDIAN to get the
> endianness of the platform. This is what the rest of the code uses. It's
> reliable and fast.

This is another case of bad documentation. I was trying to find out
whether the data file's endianness differs from the current machine's, so
there was no need to detect the endianness of the machine itself (and
U_IS_BIG_ENDIAN would have caused more work than necessary). I determine
whether the endianness is reversed or not, and "cheat" the
udata_openSwapper() arguments to make it do the conversion or not :)

> 7) I find it interesting that a count-the-ones-in-the-bit-mask algorithm
> is being used. Other people on the ICU team might remember that I use
> that question for interviews. I guess it proves that my standard
> interview question does have practical purposes :-) That's probably only
> the second time I've seen that happen, and it's the best implementation
> I've seen.

:-) It was the first time I had seen this algorithm as well. It is very
neat :-)

> 8) The code states the following:
>
> // 66000 as suggested by Markus [I suggest something like 66000, which
> // exceeds the number of BMP code points. There will be fewer ranges of
> // combinations of encodings. (I believe there are no encodings that have
> // interesting mappings for supplementary code points. All encodings
> // either support all of them or none of them.)]
>
> When did that happen?
> If you want interesting supplementary ranges, try the ibm-1390,
> ibm-1399, ibm-16684, Big5-HKSCS and ISO-2022-CN-EXT charsets. They have
> several non-contiguous supplementary ranges. There may be others in ICU,
> but these are a minimal set. The first 3 contain JIS X 0213, which also
> means that they contain several multi-code-point mappings. You should
> take a look at Converter Explorer for the Unicode set values of these
> charsets. These multi-code-point values were an issue when generating
> the list of supported languages per charset in Converter Explorer. You
> should make sure you test these specific charsets, since it's
> technically possible to have a multi-code-point value that doesn't have
> its individual code points supported by a given codepage. For example,
> ibm-1390 supports the sequence {\u304B\u309A}, but \u309A does NOT have
> a valid separate mapping. I believe uset_span says it can do this type
> of matching operation, but I don't think your string iteration code
> allows for this scenario of multi-code-point support.

I am not entirely sure about this one, but from experiments I found that
the number of unique classes of code points (and hence roughly the size of
the property vector?) tended to be in the hundreds (around 350 at most).
The system doesn't currently handle the multi-code-point cases :-(
However, it doesn't fail on them either: it will simply look at one code
point at a time. It might return fewer converters than are really usable
(which could make people fall back to UTF-8 unnecessarily in a few cases,
for example). Does this sound too bad?

Thanks,
Mohamed