From: Mark D. <mar...@jt...> - 2004-07-31 04:31:05
|
A few items.

First, a UnicodeSet really is a set of Unicode strings, not just code points. However, its implementation is designed to be particularly compact and efficient in the storage and retrieval of individual code points. That is the reason for the way that the iterator works. I'll use the Java API for examples.

The simplest way to iterate is:

    for (UnicodeSetIterator i = new UnicodeSetIterator(unicodeSet); i.next(); ) {
        String s = i.getString();
        // do something with s
    }

However, that forces the creation of a string object, so the more efficient way is:

    for (UnicodeSetIterator i = new UnicodeSetIterator(unicodeSet); i.next(); ) {
        if (i.codepoint == i.IS_STRING) {
            // do something with i.string
        } else {
            // do something with i.codepoint
        }
    }

or, if the calling code can deal efficiently with ranges of code points:

    for (UnicodeSetIterator i = new UnicodeSetIterator(unicodeSet); i.nextRange(); ) {
        if (i.codepoint == i.IS_STRING) {
            // do something with i.string
        } else {
            // do something with all the code points from i.codepoint to i.codepointEnd
        }
    }

The only point you really have to be careful of is that unicodeSet.complement() is documented as *not* equivalent to a set complement. A set complement would include every string that is not in the set, clearly a memory hog ;-). Instead, unicodeSet.complement() is defined to be the equivalent of subtracting unicodeSet from the set composed of U+0000..U+10FFFF, e.g.

    unicodeSet = new UnicodeSet(0,0x10FFFF).removeAll(unicodeSet);

Second, the exemplar set lists the characters or sequences of characters that are required for use with the language, plus those sequences typically viewed as being separate characters in the locale. Typically if a sequence is included, then it is a contraction in collation, but the reverse may not be true. For example, アー may be treated as a contraction in Japanese collation, but not as an exemplar character.
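To make the point about sequences concrete, here is a rough JDK-only sketch (no ICU calls) that uses the Hungarian exemplar pattern quoted further down in this thread. The class name ExemplarSequences and the helpers braceSequences and allInRange are made up for illustration, and the regex is only a naive stand-in for real UnicodeSet pattern parsing (it ignores escaping and nested sets). It simply checks whether each {...} element is a string built from code points that [a-z] already covers:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: plain JDK, not the ICU UnicodeSet API.
final class ExemplarSequences {

    // Hypothetical helper: collects the text of every {...} element in a
    // set pattern. Naive on purpose; it ignores escapes and nested sets.
    static List<String> braceSequences(String pattern) {
        List<String> sequences = new ArrayList<>();
        Matcher m = Pattern.compile("\\{([^}]*)\\}").matcher(pattern);
        while (m.find()) {
            sequences.add(m.group(1));
        }
        return sequences;
    }

    // Hypothetical helper: true if every code point of s is in [first, last].
    static boolean allInRange(String s, int first, int last) {
        return s.codePoints().allMatch(cp -> cp >= first && cp <= last);
    }

    public static void main(String[] args) {
        // Hungarian exemplar pattern as quoted in this thread.
        String hungarian =
            "[a-z\\u00E1\\u00E9\\u00ED\\u00F3\\u00F6\\u00FA\\u00FC\\u0151\\u0171"
            + "{ccs}{cs}{ddz}{ddzs}{dz}{dzs}{ggy}{gy}{lly}{ly}{nny}{ny}{ssz}"
            + "{sz}{tty}{ty}{zs}{zzs}]";
        for (String seq : braceSequences(hungarian)) {
            System.out.println(seq + " covered by a-z: " + allInRange(seq, 'a', 'z'));
        }
    }
}
```

Every sequence in that pattern is spelled with plain a-z letters, so the sequences add no code points beyond what the ranges at the start of the pattern already list.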
Being an exemplar character does not at all require that the sequences be handled as a ligature in display. For example, 'ch' is an exemplar character for Slovak, but the rendering of 'ch' doesn't differ from just a 'c' followed by an 'h'.

For determining whether a font contains the glyphs necessary for a given UnicodeSet, it is sufficient to determine that it can handle all of the individual code points listed, and can handle the sequences either as a whole or as individual code points. It is pretty unlikely that a font would handle a sequence and not be able to handle the individual code points, so if I were testing, I would use the following:

    import java.awt.Font;
    import com.ibm.icu.text.UTF16;
    import com.ibm.icu.text.UnicodeSet;
    import com.ibm.icu.text.UnicodeSetIterator;

    boolean fontHandlesCharacters(Font f, UnicodeSet unicodeSet) {
        for (UnicodeSetIterator i = new UnicodeSetIterator(unicodeSet); i.next(); ) {
            if (i.codepoint != i.IS_STRING) {
                // a single code point
                if (i.codepoint > 0xFFFF) return false; // JDK can't do supplementaries yet
                if (!f.canDisplay((char)i.codepoint)) return false;
            } else {
                // a string: check each of its code points in turn
                int cp;
                for (int j = 0; j < i.string.length(); j += UTF16.getCharCount(cp)) {
                    cp = UTF16.charAt(i.string, j);
                    if (cp > 0xFFFF) return false; // JDK can't do supplementaries yet
                    if (!f.canDisplay((char)cp)) return false;
                }
            }
        }
        return true;
    }

    // disclaimer, I haven't compiled or tested any of these examples!

Another way to do this is to "flatten" the UnicodeSet (actually, this might be a useful utility for us to add).

    UnicodeSet flatten(UnicodeSet unicodeSet) {
        UnicodeSet result = new UnicodeSet();
        for (UnicodeSetIterator i = new UnicodeSetIterator(unicodeSet); i.nextRange(); ) {
            if (i.codepoint == i.IS_STRING) result.addAll(i.string); // adds each code point
            else result.add(i.codepoint, i.codepointEnd);
        }
        return result;
    }

And then just use a simple loop:

    boolean fontHandlesCharacters2(Font f, UnicodeSet unicodeSet) {
        for (UnicodeSetIterator i = new UnicodeSetIterator(flatten(unicodeSet)); i.next(); ) {
            if (i.codepoint > 0xFFFF) return false; // JDK can't do supplementaries yet
            if (!f.canDisplay((char)i.codepoint)) return false;
        }
        return true;
    }

Mark

----- Original Message -----
From: "Deborah Goldsmith" <gol...@ap...>
To: "George Rhoten" <gr...@us...>
Cc: "'ICU Support'" <icu...@os...>
Sent: Friday, July 30, 2004 14:11
Subject: Re: Question About Constructing Pattern Strings From API Results

> Can you give an example of a process that would make use of
> multi-character strings from an exemplar set?
>
> Deborah
>
> On Jul 30, 2004, at 1:59 PM, George Rhoten wrote:
>
> > For your purposes, these grapheme clusters or contractions aren't very
> > useful for you. For other things, like collation or anything that deals
> > with alphabets, they are very important. Unless any of these strings
> > contain combining characters, they should not get any special treatment
> > from a font. For example, don't turn the AE grapheme cluster
> > (\u0041\u0045) into the AE ligature (\u00C6).
> >
> > Here is another example, in traditional Spanish, the letters ch and ll
> > are each considered a single character (grapheme cluster), which are
> > different from c, h and l. These multi-codepoint characters can get
> > title cased or collated differently. Modern Spanish no longer uses these
> > grapheme clusters any more, at least that is what my old and new Spanish
> > dictionaries tell me. Both of my Spanish dictionaries sort the words
> > differently because of this difference.
> >
> > The LDML specification also briefly goes over this topic too:
> > http://www.unicode.org/reports/tr35/
> >
> > George Rhoten
> > IBM Globalization Center of Competency/ICU San José, CA, USA
> > ICU main website: http://oss.software.ibm.com/icu/index.html
> >
> >
> > "Elisha Berns" <e....@co...>
> > Sent by: icu...@ww...
> > 07/30/2004 12:02 PM
> > Please respond to e.berns
> >
> > To: <an...@jt...>
> > cc: "'ICU Support'" <icu...@ww...>
> > Subject: RE: FW: Question About Constructing Pattern Strings From API Results
> >
> >
> > Thanks for the reply Andy,
> >
> > I'm starting to feel really stupid asking so many questions about this
> > thing, please forgive me; I really am trying to wind this up!
> >
> > You wrote:
> >
> >> I need to look into this. I thought that scripts just populated a set
> >> with the code points with the matching script property, no strings.
> >
> > I think you are correct about this when the exemplar set pattern string
> > is a script name; however some of the exemplar set pattern strings do
> > contain multicharacter strings. For example, Hungarian:
> >
> > [a-z\u00E1\u00E9\u00ED\u00F3\u00F6\u00FA\u00FC\u0151\u0171
> > {ccs}{cs}{ddz}{ddzs}{dz}{dzs}{ggy}{gy}{lly}{ly}{nny}{ny}{ssz}
> > {sz}{tty}{ty}{zs}{zzs}]
> >
> > So all those groups of characters enclosed in curly braces, what is
> > their meaning since they were contained in the range [a-z] at the
> > beginning of the pattern string? Do they get normalized to some kind of
> > diacritical/letter combination? Is this their normalized
> > representation?
> >
> > My question is how do you transform (??) what is inside the curly braces
> > to one or more code points that can be displayed by a font?
> > Or do I just have a major misunderstanding about this: when any one of
> > these combinations of code points, the "multicharacter string" is fed to
> > a TrueType/OpenType layout engine, the layout engine will convert this
> > string to a special glyph? And the only test that is *required* is for
> > unique code points, not all these duplicates?
> >
> > Thanks,
> >
> > Elisha
> >
> > _______________________________________________
> > icu...@os... - icu4c-support mailing list
> > To Un/Subscribe:
> > http://oss.software.ibm.com/developerworks/oss/mailman/listinfo/icu4c-support
|