Re: [icu-support] Support for packed 7-bit character encodings


On Sat, Oct 20, 2012 at 12:41 PM, Tim Kimber <KI...@uk...> wrote:

> I've recently encountered a character encoding that doesn't use octets,
> and is not byte-aligned. It is GSM-338 (
> http://en.wikipedia.org/wiki/GSM_03.38#GSM_7_bit_default_alphabet_and_extension_table_of_3GPP_TS_23.038_.2F_GSM_03.38
> ).
>
> a) is there any way to configure ICU to encode/decode GSM-338?

We have a conversion table available here:
http://source.icu-project.org/repos/icu/data/trunk/charset/contrib/data/ucm/
http://site.icu-project.org/charts/charset

The table assumes that GSM 03.38 text is stored in whole bytes, that is,
one 7-bit character per byte rather than packed.

> b) if answer to a) is 'no', is there any prospect of support being added in
> a future release of ICU?
> c) how hard would it be for an ICU user to write an ICU encoder/decoder
> that operates on a bit stream instead of an octet stream, and supports
> GSM-338? Is there any documentation on 'rolling your own' ICU encoding? (
> Apart from the documents that explain how to write data files to control
> the existing algorithms )
>

I think you would pack and unpack as a separate step, and use the ICU
converter for the byte-oriented version of your text.
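
To illustrate the separate pack/unpack step, here is a minimal sketch in
Python. It assumes the usual SMS packing order from 3GPP TS 23.038, where
septets are filled starting at the least significant bit of each octet. The
function names pack_septets and unpack_septets are hypothetical and are not
part of ICU. The unpacked output (one septet per byte) is what you would
then hand to a byte-oriented converter built from the GSM 03.38 table.

```python
def pack_septets(septets: bytes) -> bytes:
    """Pack 7-bit values (one per byte) into octets, LSB first."""
    acc = 0      # bit accumulator
    nbits = 0    # number of valid bits currently in acc
    out = bytearray()
    for s in septets:
        if s > 0x7F:
            raise ValueError("value does not fit in 7 bits: %#x" % s)
        acc |= s << nbits
        nbits += 7
        while nbits >= 8:
            out.append(acc & 0xFF)
            acc >>= 8
            nbits -= 8
    if nbits:
        # Flush the remaining bits, zero-padded at the top.
        out.append(acc & 0xFF)
    return bytes(out)


def unpack_septets(packed: bytes, count: int) -> bytes:
    """Unpack `count` 7-bit values from octets into one value per byte."""
    acc = 0
    nbits = 0
    out = bytearray()
    for b in packed:
        acc |= b << nbits
        nbits += 8
        while nbits >= 7 and len(out) < count:
            out.append(acc & 0x7F)
            acc >>= 7
            nbits -= 7
    if len(out) != count:
        raise ValueError("not enough packed data for %d septets" % count)
    return bytes(out)


# Lowercase ASCII letters have the same code values in the GSM 7-bit
# default alphabet, so b"hello" can serve as unpacked GSM input here.
packed = pack_septets(b"hello")
unpacked = unpack_septets(packed, 5)
```

The septet count is passed explicitly because the packed length alone is
ambiguous: 7 octets can hold either 7 septets plus padding or 8 septets
whose last value is 0x00.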

Best regards,
markus
-- 
Google Internationalization Engineering
