Re: [Afpfs-ng-devel] precompose and decompose

HAT wrote:
> hi.

Hi HAT,

thanks for looking into this!

I did the implementation of UTF-8 support for afpfs-ng.

I'm currently on a business trip but will be back this weekend, so we
can take a closer look then.

A few remarks:

> I'm reading the source of afpfs-ng 0.8.1 for the first time now.
> I think that there are problems in precomposing and decomposing.
> 
> 1) precomposing exactly two characters: supported
> 2) precomposing more than two characters: unsupported

I've checked with decompositions of up to four characters and, AFAIR, it
works OK. The precomposition is not done in a single call to
UCS2precompose() but in repeated calls: each call combines the "next"
accent with the base character (which, in the later steps, is already an
accented one).
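
To illustrate the repeated pairwise approach, here is a minimal sketch in C.
It is not the afpfs-ng code: pair_precompose() is a hypothetical stand-in for
UCS2precompose(), precompose_run() is a made-up helper name, and the three
table entries are only a small sample chosen for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for UCS2precompose(): returns the precomposed
 * character for (base, mark), or 0 if the pair cannot be combined. */
static uint16_t pair_precompose(uint16_t base, uint16_t mark)
{
    static const struct { uint16_t base, mark, composed; } pairs[] = {
        { 0x0041, 0x0300, 0x00C0 },  /* A + grave         -> U+00C0 */
        { 0x0061, 0x0302, 0x00E2 },  /* a + circumflex    -> U+00E2 */
        { 0x00E2, 0x0301, 0x1EA5 },  /* U+00E2 + acute    -> U+1EA5 */
    };
    for (size_t i = 0; i < sizeof pairs / sizeof pairs[0]; i++) {
        if (pairs[i].base == base && pairs[i].mark == mark)
            return pairs[i].composed;
    }
    return 0;
}

/* Precompose a UCS-2 buffer in place and return the new length.
 * Each input character is first tried against the last output character,
 * so a base that was already combined can absorb the next accent too. */
static size_t precompose_run(uint16_t *s, size_t len)
{
    size_t out = 0;
    for (size_t i = 0; i < len; i++) {
        uint16_t c = 0;
        if (out > 0)
            c = pair_precompose(s[out - 1], s[i]);
        if (c != 0)
            s[out - 1] = c;      /* fold the accent into the base */
        else
            s[out++] = s[i];     /* no combination: keep as is */
    }
    return out;
}
```

With the sample entries, the three-character sequence 0x0061 0x0302 0x0301
is folded in two steps, first to 0x00E2 and then to 0x1EA5.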

> 3) decompose: sample only

There is no decomposition, since the Mac filesystem performs the
appropriate conversions before writing. So far I am not aware of any
problems caused by the missing functionality ...

> 4) hangul: unsupported
> 5) Unicode U+010000 and above: unsupported
> 6) maccodepage for AFP2: unsupported
> 
> The 2), 3) and 4) can be implemented comparatively easily
> because I did them for netatalk.
> 
> There are two methods to support U+010000 and above.
> 
> a) Using surrogate pair
> b) Using UCS4 instead of UCS2
> 
> Surrogate pairs are messy and complex.
> Because netatalk 2.1dev uses surrogate pairs, it is difficult
> to support U+010000 and above there.
> If we use UCS4 instead of UCS2, the implementation will be easy.
> 
> Replace.
> 
> from
> 	char16 *UTF8toUCS2(str)
> to
> 	u_int32_t *UTF8toUCS4(str)
> 
> from
> 	int UCS2precompose(first, second)
> to
> 	u_int64_t UCS4precompose(first, second)
> 
> from
> 	// worst case: 3 bytes of UTF8 per UCS2 char + terminal 0
> to
> 	// worst case: 4 bytes of UTF8 per UCS4 char + terminal 0
> 
> 
> Each entry of table[] becomes twice as large.
> 
> static struct {
>   u_int32_t precomposed;  /* one UCS4 code point */
>   u_int64_t pattern;      /* (first << 32) | second */
> } table[] = {
> { 0x00000000, 0x0000000000000000},    // Dummy entry table[0]
> { 0x000000C0, 0x0000004100000300},
> { 0x000000C1, 0x0000004100000301},
> { 0x000000C2, 0x0000004100000302},
> (snip)
> { 0x0001D1BF, 0x0001D1BB0001D16F},
> { 0x0001D1BE, 0x0001D1BC0001D16E},
> { 0x0001D1C0, 0x0001D1BC0001D16F},
> };
> 
> 
> PS.
> Don't trust Apple's documents.
> 
> http://developer.apple.com/technotes/tn/tn1150table.html
> This table is based on Unicode 2.x.

This table is the basis for the precompositions performed by
UCS2precompose(). As written above, not only two-character
decompositions but also three- and four-character ones should be
handled correctly.

> http://developer.apple.com/documentation/Networking/Conceptual/AFP/AFP3_1.pdf
> This document is based on Unicode 3.2.
> 
> Mac OS X 10.5.2 Leopard uses a newer Unicode version.
> For example, it decomposes 0x1B06 to 0x1B05 0x1B35,
> which is not in Unicode 3.2.

If there are additions to the old tn1150 decomposition table, we should
add them, increasing the element sizes where necessary to handle code
points at U+010000 and above.
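
As a rough sketch of what larger elements could look like, the C fragment
below stores each pair as a 64-bit key, following the table layout quoted
above. The names compose_entry, wide_table and wide_precompose() are
hypothetical, the linear search is only for brevity, and the six entries are
the ones shown in the quoted table, not a complete data set.

```c
#include <stddef.h>
#include <stdint.h>

/* One table element: a UCS4 result and the two source code points
 * packed as (first << 32) | second. */
struct compose_entry {
    uint32_t precomposed;
    uint64_t pattern;
};

/* Sample entries only, copied from the quoted table. */
static const struct compose_entry wide_table[] = {
    { 0x000000C0, 0x0000004100000300ULL },
    { 0x000000C1, 0x0000004100000301ULL },
    { 0x000000C2, 0x0000004100000302ULL },
    { 0x0001D1BF, 0x0001D1BB0001D16FULL },
    { 0x0001D1BE, 0x0001D1BC0001D16EULL },
    { 0x0001D1C0, 0x0001D1BC0001D16FULL },
};

/* Returns the precomposed code point for (first, second),
 * or 0 if the pair is not in the table. */
static uint32_t wide_precompose(uint32_t first, uint32_t second)
{
    uint64_t key = ((uint64_t)first << 32) | second;
    for (size_t i = 0; i < sizeof wide_table / sizeof wide_table[0]; i++) {
        if (wide_table[i].pattern == key)
            return wide_table[i].precomposed;
    }
    return 0;
}
```

With fixed-width fields the element size no longer depends on the platform's
int, and code points above U+FFFF fit without surrogate pairs.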

Thanks again + Best regards ... Michael