From: Deborah G. <gol...@ap...> - 2004-07-31 00:12:05
Can you give an example of a process that would make use of
multi-character strings from an exemplar set?

Deborah

On Jul 30, 2004, at 1:59 PM, George Rhoten wrote:

> For your purposes, these grapheme clusters or contractions aren't very
> useful for you. For other things, like collation or anything that deals
> with alphabets, they are very important. Unless any of these strings
> contain combining characters, they should not get any special treatment
> from a font. For example, don't turn the AE grapheme cluster
> (\u0041\u0045) into the AE ligature (\u00C6).
>
> Here is another example, in traditional Spanish, the letters ch and ll are
> each considered a single character (grapheme cluster), which are different
> from c, h and l. These multi-codepoint characters can get title cased or
> collated differently. Modern Spanish no longer uses these grapheme
> clusters any more, at least that is what my old and new Spanish
> dictionaries tell me. Both of my Spanish dictionaries sort the words
> differently because of this difference.
>
> The LDML specification also briefly goes over this topic too:
> http://www.unicode.org/reports/tr35/
>
> George Rhoten
> IBM Globalization Center of Competency/ICU San José, CA, USA
> ICU main website: http://oss.software.ibm.com/icu/index.html
>
> "Elisha Berns" <e....@co...>
> Sent by: icu...@ww...
> 07/30/2004 12:02 PM
> Please respond to
> e.berns
>
> To
> <an...@jt...>
> cc
> "'ICU Support'" <icu...@ww...>
> Subject
> RE: FW: Question About Constructing Pattern Strings From API Results
>
> Thanks for the reply Andy,
>
> I'm starting to feel really stupid asking so many questions about this
> thing, please forgive me; I really am trying to wind this up!
>
> You wrote:
>
>> I need to look into this. I thought that scripts just populated a set
>> with the code points with the matching script property, no strings.
>
> I think you are correct about this when the exemplar set pattern string
> is a script name; however some of the exemplar set pattern strings do
> contain multicharacter strings. For example, Hungarian:
>
> [a-z\u00E1\u00E9\u00ED\u00F3\u00F6\u00FA\u00FC\u0151\u0171
> {ccs}{cs}{ddz}{ddzs}{dz}{dzs}{ggy}{gy}{lly}{ly}{nny}{ny}{ssz}
> {sz}{tty}{ty}{zs}{zzs}]
>
> So all those groups of characters enclosed in curly braces, what is
> their meaning since they were contained in the range [a-z] at the
> beginning of the pattern string? Do they get normalized to some kind of
> diacritical/letter combination? Is this their normalized
> representation?
>
> My question is how do you transform (??) what is inside the curly braces
> to one or more code points that can be displayed by a font? Or do I
> just have a major misunderstanding about this: when any one of these
> combinations of code points, the "multicharacter string" is fed to a
> TrueType/OpenType layout engine, the layout engine will convert this
> string to a special glyph? And the only test that is *required* is for
> unique code points, not all these duplicates?
>
> Thanks,
>
> Elisha
>
> _______________________________________________
> icu...@os... - icu4c-support mailing list
> To Un/Subscribe:
> http://oss.software.ibm.com/developerworks/oss/mailman/listinfo/icu4c-support
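
As a rough illustration of both questions in this thread, here is a minimal Python sketch. It is not ICU code: parse_exemplars and split_letters are hypothetical helpers, and the parser only understands the small subset of set-pattern syntax used in the Hungarian example above (a-z style ranges, \uXXXX escapes, and {string} entries with no escapes inside the braces). It shows that the brace-enclosed strings add no new code points for a font coverage test, and that a process working with "letters of the alphabet" (an index or a sort, for instance) is where the strings matter. The greedy longest-match split is an assumption made for the example; a real collator uses tailored contraction rules.

```python
import re

# The Hungarian exemplar pattern quoted in the thread. Raw strings keep the
# \uXXXX escapes literal until the parser below decodes them.
HUNGARIAN = (
    r"[a-z\u00E1\u00E9\u00ED\u00F3\u00F6\u00FA\u00FC\u0151\u0171"
    r"{ccs}{cs}{ddz}{ddzs}{dz}{dzs}{ggy}{gy}{lly}{ly}{nny}{ny}{ssz}"
    r"{sz}{tty}{ty}{zs}{zzs}]"
)

# Tokens: {string} | \uXXXX | '-' | whitespace | any other single character.
_TOKEN = re.compile(r"\{([^}]*)\}|\\u([0-9A-Fa-f]{4})|(-)|(\s+)|(.)", re.S)


def parse_exemplars(pattern):
    """Split a simple bracketed pattern into (single code points, strings).

    Hypothetical helper: handles only ranges, \\uXXXX escapes and {strings}.
    """
    if not (pattern.startswith("[") and pattern.endswith("]")):
        raise ValueError("expected a bracketed set pattern")
    singles, strings = set(), set()
    prev, pending_range = None, False
    for m in _TOKEN.finditer(pattern[1:-1]):
        string, esc, dash, space, lit = m.groups()
        if space:
            continue
        if string is not None:
            strings.add(string)
            prev = None
            continue
        if dash and prev is not None:
            pending_range = True  # the next character closes the range
            continue
        ch = chr(int(esc, 16)) if esc else (lit or dash)
        if pending_range:
            singles.update(chr(c) for c in range(ord(prev), ord(ch) + 1))
            prev, pending_range = None, False
        else:
            singles.add(ch)
            prev = ch
    if pending_range:
        singles.add("-")  # a trailing dash is a literal dash
    return singles, strings


def split_letters(word, strings):
    """Greedy longest-match split of a lowercase word into exemplar 'letters'."""
    by_length = sorted((s for s in strings if s), key=len, reverse=True)
    out, i = [], 0
    while i < len(word):
        match = next((s for s in by_length if word.startswith(s, i)), word[i])
        out.append(match)
        i += len(match)
    return out


singles, strings = parse_exemplars(HUNGARIAN)
# Code points a font must cover: the singles plus every code point used
# inside a string. For this pattern the strings contribute nothing new.
needed_for_font = singles | {ch for s in strings for ch in s}
print(len(singles), len(strings), sorted(needed_for_font - singles))  # 35 18 []
print(split_letters("csak", strings))  # ['cs', 'a', 'k']
```

Under these assumptions the font test only needs the 35 unique code points, while something like an alphabetical index would file "csak" under "cs" rather than "c".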