[icu-design] Changes to the C Charset Detection API


These changes to Charset Detection came out of a review of the API with
Markus.

===  ucsdet_getUChars()  ===

This fills a caller-supplied UChar buffer with the input text data
after conversion to UChars.  The change is to the behavior when the
buffer is too small to hold the full UChar string.

As originally described, the function would put as many characters as
would fit into the output buffer, and return the number of chars
actually written.  The total size needed to hold the entire string
was not returned.

The new behavior is the same as that of ucnv_toUChars: when the
buffer is too small, the buffer contents are undefined and the return
value is the total number of UChars that would be in the output
string, not including the terminating NUL.

The new behavior follows the usual convention for ICU functions that
fill an output buffer with UChars.
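
To illustrate the calling pattern this convention allows, here is a
minimal sketch.  fill_units() and convert_all() are hypothetical
stand-ins written for this example, not ICU functions; the stand-in
only widens bytes to 16-bit units so the sketch compiles without ICU.
In ICU itself the too-small case is reported through the UErrorCode
as U_BUFFER_OVERFLOW_ERROR rather than through a separate flag.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint16_t Unit;   /* stand-in for ICU's UChar (assumption) */

/* Hypothetical function following the convention described above:
 * always returns the full length needed, excluding the terminating NUL.
 * If the buffer is too small, nothing useful is written and *overflow
 * is set to 1. */
int32_t fill_units(const char *src, int32_t srcLen,
                   Unit *dest, int32_t capacity, int *overflow)
{
    assert(srcLen >= 0 && capacity >= 0);
    *overflow = (srcLen > capacity);
    if (!*overflow) {
        for (int32_t i = 0; i < srcLen; i++) {
            dest[i] = (Unit)(unsigned char)src[i];
        }
        if (srcLen < capacity) {
            dest[srcLen] = 0;   /* NUL-terminate only if there is room */
        }
    }
    return srcLen;
}

/* Caller side: preflight with a zero-capacity buffer to learn the size,
 * allocate exactly that much (plus the NUL), then call again. */
Unit *convert_all(const char *src, int32_t srcLen, int32_t *outLen)
{
    int overflow = 0;
    int32_t needed = fill_units(src, srcLen, NULL, 0, &overflow);
    Unit *buf = malloc(((size_t)needed + 1) * sizeof(Unit));
    if (buf == NULL) {
        return NULL;
    }
    *outLen = fill_units(src, srcLen, buf, needed + 1, &overflow);
    if (overflow) {
        free(buf);
        return NULL;
    }
    return buf;
}
```

Because the first call reports the full length, the caller needs at
most two calls, which is not possible when only the number of units
actually written is returned.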

The original behavior was intended to make it easier to work with
files where the total size was not known in advance, and could be
extremely large.  The file APIs have since been removed from charset
detection, which eliminates the reason for the non-standard behavior.

File APIs that work with charset detection will be proposed later for
the ICU IO package.

===    ucsdet_getDetectableCharsetName   ===
===    ucsdet_DetectableCharsetsCount   ===

Replace these two functions with a single one that provides a
UEnumeration over the detectable charsets.  The new function name can
be taken from Java.

UEnumeration *
ucsdet_getAllDetectableCharsets(const UCharsetDetector *csd,
   UErrorCode *status);

This is more in keeping with the preferred conventions for new ICU
APIs, and can better deal with the chance that there may be some way
in the future to register or add detectors to the charset detector
service on the fly.  Functions on UEnumeration provide for enumerating
over the set of detectable charsets.

-- Andy Heninger
