Re: Question About Constructing Pattern Strings From API Results


A few items.

First, a UnicodeSet really is a set of Unicode strings, not just code
points. However, its implementation is designed to be particularly compact
and efficient in the storage and retrieval of individual code points. That
is the reason for the way that the iterator works.

I'll use the Java API for examples.

The simplest way to iterate is:

for (UnicodeSetIterator i = new UnicodeSetIterator(unicodeSet); i.next(); )
{
    String s = i.getString();
    // do something with s
}

However, that forces the creation of a string object, so the more efficient
way is:

for (UnicodeSetIterator i = new UnicodeSetIterator(unicodeSet); i.next(); )
{
    if (i.codepoint == i.IS_STRING) {
        // do something with i.string
    } else {
        // do something with i.codepoint
    }
}

or, if the calling code can deal efficiently with ranges of code points:

for (UnicodeSetIterator i = new UnicodeSetIterator(unicodeSet); i.nextRange(); ) {
    if (i.codepoint == i.IS_STRING) {
        // do something with i.string
    } else {
        // do something with all the code points from i.codepoint to i.codepointEnd
    }
}

The only point you really have to be careful of is that
unicodeSet.complement() is documented as *not* equivalent to a true set
complement. A true set complement would include every string that is not in
the set, clearly a memory hog ;-). Instead, unicodeSet.complement() is
defined to be the equivalent of subtracting unicodeSet from the set composed
of U+0000..U+10FFFF, i.e.

unicodeSet = new UnicodeSet(0,0x10FFFF).removeAll(unicodeSet);
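
To make that definition concrete, here is a rough JDK-only sketch. It is not ICU code: a java.util.BitSet stands in for the code point part of a UnicodeSet, and ComplementSketch and complement are names made up for illustration.

```java
import java.util.BitSet;

class ComplementSketch {
    static final int MAX_CODE_POINT = 0x10FFFF;

    // Code point complement: everything in U+0000..U+10FFFF that is not in
    // the given set. Strings are not considered at all, which is what keeps
    // the result finite.
    static BitSet complement(BitSet codePoints) {
        BitSet result = new BitSet(MAX_CODE_POINT + 1);
        result.set(0, MAX_CODE_POINT + 1); // the end index is exclusive
        result.andNot(codePoints);         // subtract the original set
        return result;
    }
}
```

The result always has a bounded size, because the universe is the fixed range of code points rather than the unbounded set of all strings.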

Second, the exemplar set lists the characters or sequences of characters
that are required for use with the language, plus those sequences typically
viewed as being separate characters in the locale. Typically if a sequence
is included, then it is a contraction in collation, but the converse may not
be true. For example, アー may be treated as a contraction in Japanese
collation, but not as an exemplar character. Being an exemplar character
does not at all require that the sequence be handled as a ligature in
display. For example, 'ch' is an exemplar character for Slovak, but the
rendering of 'ch' doesn't differ from just a 'c' followed by an 'h'.

For determining whether a font contains the glyphs necessary for a given
UnicodeSet, it is sufficient to determine that it can handle all of the
individual code points listed, and can handle the sequences either as a
whole or as individual code points. It is pretty unlikely that a font would
handle a sequence and not be able to handle the individual code points, so
if I were testing, I would use the following:

    boolean fontHandlesCharacters(Font f, UnicodeSet unicodeSet) {
        for (UnicodeSetIterator i = new UnicodeSetIterator(unicodeSet); i.next(); ) {
            if (i.codepoint != i.IS_STRING) {
                // single code point
                // JDK can't do supplementaries yet
                if (i.codepoint > 0xFFFF) return false;
                if (!f.canDisplay((char)i.codepoint)) return false;
            } else {
                // multi-character string: check each of its code points
                int cp;
                for (int j = 0; j < i.string.length(); j += UTF16.getCharCount(cp)) {
                    cp = UTF16.charAt(i.string, j);
                    // JDK can't do supplementaries yet
                    if (cp > 0xFFFF) return false;
                    if (!f.canDisplay((char)cp)) return false;
                }
            }
        }
        return true;
    }

// disclaimer, I haven't compiled or tested any of these examples!
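
For what it's worth, the same check can be sketched without any ICU classes. This is only a rough illustration under some assumptions: the exemplar items are passed in as plain strings, the display test is an IntPredicate so that the sketch runs without a real Font, and FontCoverageSketch and handlesAll are names made up here.

```java
import java.util.List;
import java.util.function.IntPredicate;

class FontCoverageSketch {
    // Returns true only if every code point of every item passes canDisplay.
    // A single-character item and a multi-character sequence are handled
    // the same way: the sequence is simply checked code point by code point.
    static boolean handlesAll(IntPredicate canDisplay, List<String> items) {
        for (String item : items) {
            for (int j = 0; j < item.length(); ) {
                int cp = item.codePointAt(j);      // full code point, not a char
                if (!canDisplay.test(cp)) return false;
                j += Character.charCount(cp);      // 1 or 2 UTF-16 units
            }
        }
        return true;
    }
}
```

With a real font, and on a JDK whose Font class has the canDisplay(int codePoint) overload (J2SE 5.0 and later), the predicate could be cp -> f.canDisplay(cp), which would also cover supplementary code points.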

Another way to do this is to "flatten" the UnicodeSet (actually, this might
be a useful utility for us to add).

    UnicodeSet flatten(UnicodeSet unicodeSet) {
        UnicodeSet result = new UnicodeSet();
        for (UnicodeSetIterator i = new UnicodeSetIterator(unicodeSet); i.nextRange(); ) {
            if (i.codepoint == i.IS_STRING) {
                result.addAll(i.string); // adds each code point
            } else {
                result.add(i.codepoint, i.codepointEnd);
            }
        }
        return result;
    }
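
To show what flattening does to a set like the Hungarian one quoted below, here is a rough JDK-only sketch. It is not the ICU utility: the items are plain strings, the result is a sorted set of code points, and FlattenSketch is a name made up for illustration.

```java
import java.util.Collection;
import java.util.SortedSet;
import java.util.TreeSet;

class FlattenSketch {
    // Breaks every item into its code points and collects them in one set.
    // Duplicates disappear, so a sequence such as "dzs" contributes nothing
    // new when d, z and s are already present as single characters.
    static SortedSet<Integer> flatten(Collection<String> items) {
        SortedSet<Integer> result = new TreeSet<>();
        for (String item : items) {
            for (int j = 0; j < item.length(); ) {
                int cp = item.codePointAt(j);
                result.add(cp);
                j += Character.charCount(cp);
            }
        }
        return result;
    }
}
```

For an input of "a", "cs", "ccs" and "dzs", the flattened set holds just the five code points for a, c, d, s and z, so the font test only has to look at each of them once.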

And then just use a simple loop:

    boolean fontHandlesCharacters2(Font f, UnicodeSet unicodeSet) {
        for (UnicodeSetIterator i = new UnicodeSetIterator(flatten(unicodeSet)); i.next(); ) {
            // JDK can't do supplementaries yet
            if (i.codepoint > 0xFFFF) return false;
            if (!f.canDisplay((char)i.codepoint)) return false;
        }
        return true;
    }

Mark

----- Original Message ----- 
From: "Deborah Goldsmith" <gol...@ap...>
To: "George Rhoten" <gr...@us...>
Cc: "'ICU Support'" <icu...@os...>
Sent: Friday, July 30, 2004 14:11
Subject: Re: Question About Constructing Pattern Strings From API Results

> Can you give an example of a process that would make use of
> multi-character strings from an exemplar set?
>
> Deborah
>
> On Jul 30, 2004, at 1:59 PM, George Rhoten wrote:
>
> > For your purposes, these grapheme clusters or contractions aren't very
> > useful for you.  For other things, like collation or anything that deals
> > with alphabets, they are very important.  Unless any of these strings
> > contain combining characters, they should not get any special treatment
> > from a font.  For example, don't turn the AE grapheme cluster
> > (\u0041\u0045) into the AE ligature (\u00C6).
> >
> > Here is another example, in traditional Spanish, the letters ch and ll are
> > each considered a single character (grapheme cluster), which are different
> > from c, h and l.  These multi-codepoint characters can get title cased or
> > collated differently.  Modern Spanish no longer uses these grapheme
> > clusters any more, at least that is what my old and new Spanish
> > dictionaries tell me.  Both of my Spanish dictionaries sort the words
> > differently because of this difference.
> >
> > The LDML specification also briefly goes over this topic too:
> > http://www.unicode.org/reports/tr35/
> >
> > George Rhoten
> > IBM Globalization Center of Competency/ICU  San José, CA, USA
> > ICU main website: http://oss.software.ibm.com/icu/index.html
> >
> >
> >
> > "Elisha Berns" <e....@co...>
> > Sent by: icu...@ww...
> > 07/30/2004 12:02 PM
> > Please respond to
> > e.berns
> >
> >
> > To
> > <an...@jt...>
> > cc
> > "'ICU Support'" <icu...@ww...>
> > Subject
> > RE: FW: Question About Constructing Pattern Strings From API Results
> >
> >
> >
> >
> >
> >
> > Thanks for the reply Andy,
> >
> > I'm starting to feel really stupid asking so many questions about this
> > thing, please forgive me; I really am trying to wind this up!
> >
> > You wrote:
> >
> >> I need to look into this.  I thought that scripts just populated a set
> >> with the code points with the matching script property, no strings.
> >
> > I think you are correct about this when the exemplar set pattern string
> > is a script name; however some of the exemplar set pattern strings do
> > contain multicharacter strings.  For example, Hungarian:
> >
> > [a-z\u00E1\u00E9\u00ED\u00F3\u00F6\u00FA\u00FC\u0151\u0171
> > {ccs}{cs}{ddz}{ddzs}{dz}{dzs}{ggy}{gy}{lly}{ly}{nny}{ny}{ssz}
> > {sz}{tty}{ty}{zs}{zzs}]
> >
> > So all those groups of characters enclosed in curly braces, what is
> > their meaning since they were contained in the range [a-z] at the
> > beginning of the pattern string?  Do they get normalized to some kind of
> > diacritical/letter combination?  Is this their normalized representation?
> >
> > My question is how do you transform (??) what is inside the curly braces
> > to one or more code points that can be displayed by a font?  Or do I
> > just have a major misunderstanding about this:  when any one of these
> > combinations of code points, the "multicharacter string" is fed to a
> > TrueType/OpenType layout engine, the layout engine will convert this
> > string to a special glyph?  And the only test that is *required* is for
> > unique code points, not all these duplicates?
> >
> > Thanks,
> >
> > Elisha
> >
> >
> >
> > _______________________________________________
> > icu...@os... - icu4c-support mailing list
> > To Un/Subscribe:
> > http://oss.software.ibm.com/developerworks/oss/mailman/listinfo/icu4c-support
