Re: [Afpfs-ng-devel] precompose and decompose
From: Michael U. <mu...@re...> - 2008-03-28 11:34:38
HAT wrote:
> hi.

Hi HAT,

thanks for looking into this! I did the implementation of UTF-8 support for afpfs-ng. Currently I'm on a business trip but will be back this weekend, so we might look a little closer then. A few remarks:

> I'm reading the source of afpfs-ng 0.8.1 for the first time now.
> I think that there are problems in precomposing and decomposing.
>
> 1) precompose two characters only: supported
> 2) precompose two characters over: unsupported

I've checked with decompositions of up to four characters and AFAIR it works ok. It's not done in a single call to UCS2precompose() but repeatedly, so that with each call the "next" accent will be combined with the base character (which in the later steps will already be an accented one).

> 3) decompose: sample only

There is no decomposition, since the MAC filesystem does the appropriate conversions before writing. I was not aware of any problems related to the missing functionality so far ...

> 4) hangul: unsupported
> 5) Unicode U+010000 over: unsupported
> 6) maccodepage for AFP2: unsupported
>
> The 2), 3) and 4) can be implemented comparatively easily
> because I did them for netatalk.
>
> There are two methods to support the U+010000 over.
>
> a) Using surrogate pair
> b) Using UCS4 instead of UCS2
>
> The surrogate pair is dirty and complex.
> Because netatalk 2.1dev use the surrogate pair, It is difficult
> to support the U+010000 over.
> If we use UCS4 instead of UCS2, the implementation will be easy.
>
> Replace.
>
> from
> char16 *UTF8toUCS2(str)
> to
> u_int32_t *UTF8toUCS4(str)
>
> from
> int UCS2precompose(first, second)
> to
> u_int64_t UCS4precompose(first, second)
>
> from
> // worst case: 3 bytes of UTF8 per UCS2 char + terminal 0
> to
> // worst case: 4 bytes of UTF8 per UCS4 char + terminal 0
>
>
> The size of table[] is two times.
>
> static struct {
>     int precomposed;
>     unsigned int pattern;
> } table[] = {
>     { 0x00000000, 0x0000000000000000}, // Dummy entry table[0]
>     { 0x000000C0, 0x0000004100000300},
>     { 0x000000C1, 0x0000004100000301},
>     { 0x000000C2, 0x0000004100000302},
>     (snip)
>     { 0x0001D1BF, 0x0001D1BB0001D16F},
>     { 0x0001D1BE, 0x0001D1BC0001D16E},
>     { 0x0001D1C0, 0x0001D1BC0001D16F},
> };
>
>
> PS.
> Don't trust the Apple's documents.
>
> http://developer.apple.com/technotes/tn/tn1150table.html
> This table is based on Unicode 2.x.

This table is the basis for the precompositions performed by UCS2precompose(). As written above, not only two character decompositions but also three and four character decomps should be handled correctly.

> http://developer.apple.com/documentation/Networking/Conceptual/AFP/AFP3_1.pdf
> This document is based on Unicode 3.2.
>
> Mac OS X 10.5.2 Leopard use newer Unicode.
> 0x1B06 to 0x1B05 0x1B35
> This is not in Unicode 3.2.

If there are additions to the old tn1150 decomposition table we should add them or increase the element sizes appropriately to handle Unicode U+010000.

Thanks again + Best regards ...
Michael
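
For illustration, the repeated pairwise precomposition described in the reply (each call combines the "next" accent with the already-composed base character) might look roughly like the sketch below. This is not the afpfs-ng code: the three-entry demo_table and the names demo_precompose() and demo_precompose_string() are hypothetical, and 32-bit code points are used so that the same loop would also cover values above U+FFFF.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical miniature table: (first, second) -> precomposed.
 * A real table would be generated from a full decomposition list. */
struct compose_entry {
    uint32_t first;
    uint32_t second;
    uint32_t precomposed;
};

static const struct compose_entry demo_table[] = {
    { 0x0041, 0x0300, 0x00C0 }, /* A + combining grave      -> U+00C0 */
    { 0x0041, 0x0302, 0x00C2 }, /* A + combining circumflex -> U+00C2 */
    { 0x00C2, 0x0301, 0x1EA4 }, /* U+00C2 + combining acute -> U+1EA4 */
};

/* Look up one pair; returns 0 when the pair has no precomposed form. */
static uint32_t demo_precompose(uint32_t first, uint32_t second)
{
    size_t n = sizeof(demo_table) / sizeof(demo_table[0]);
    for (size_t i = 0; i < n; i++) {
        if (demo_table[i].first == first && demo_table[i].second == second)
            return demo_table[i].precomposed;
    }
    return 0;
}

/* Compose a string in place and return the new length.
 * Each incoming code point is first tried against the last character
 * already written, which may itself be the result of an earlier
 * composition. That is how sequences of three or more code points
 * collapse without a dedicated multi-character table. */
static size_t demo_precompose_string(uint32_t *s, size_t len)
{
    size_t out = 0;
    for (size_t i = 0; i < len; i++) {
        if (out > 0) {
            uint32_t c = demo_precompose(s[out - 1], s[i]);
            if (c != 0) {
                s[out - 1] = c; /* fold the mark into the base */
                continue;
            }
        }
        s[out++] = s[i];
    }
    return out;
}
```

With this table, the four code points U+0041 U+0302 U+0301 U+0062 collapse to U+1EA4 U+0062: the circumflex is folded in first, and the acute is then combined with the intermediate U+00C2. The sketch ignores combining-class ordering and Hangul, both of which a complete implementation would need to handle.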