From: Mark D. <mar...@jt...> - 2004-08-13 21:38:22
Elisha, we have been trying to clarify the use of exemplar characters, which
are for more than just collation; as a matter of fact, they are not really
co-extensive with the tailored collation characters, although the two will
generally overlap a good deal. In addition, we have just adopted a change
that adds a new element for auxiliary exemplar characters.

Please look over the working draft, especially the section at:
http://oss.software.ibm.com/cvs/icu/~checkout~/locale/docs/tr35.html#<characters>

The cases where a script is used are in some cases simply data bugs; they
should be replaced by a more explicit list of just what is needed for a given
language. I think Indic is probably ok, but Hebrew etc. should be fixed;
however, someone needs to propose the exact list.

Mark

----- Original Message -----
From: "Mark Davis" <mar...@jt...>
To: <cl...@un...>
Sent: Thursday, August 12, 2004 17:47
Subject: Fw: [locale-bugs] incoming/200

> This bug is based on a misunderstanding of the exemplar characters, which
> are for more than collation (see the latest LDML draft). The cases where a
> script is used are in some cases simply data bugs; they should be replaced
> by a more explicit list of just what is needed for a given language.
>
> Mark
>
> ----- Original Message -----
> From: <loc...@jt...>
> To: <cld...@un...>
> Sent: Thursday, August 12, 2004 17:31
> Subject: [locale-bugs] incoming/200
>
> > new message incoming/200
> > URL: http://www.jtcsv.com/cgibin/locale-bugs?findid=200
> >
> > ====> ORIGINAL MESSAGE FOLLOWS <====
> >
> > From: e....@co...
> > Date: Thu Aug 12 20:31:21 2004
> > Subject: Exemplar Sets
> >
> > Full_Name: Elisha Berns
> > Version: 1.1
> > Submission from: (NULL) (64.164.82.122)
> >
> > FEATURE REQUEST:
> >
> > Background:
> >
> > The Exemplar Sets may be the correct format for determining collation
> > rules for a locale's language, but they are neither well formed nor well
> > conceptualized for determining font coverage for the locale's language.
> > If an exemplar set is used to generate the set of code points needed for
> > standard, common text layout for a language, the resulting set typically
> > is either too large or too small to be accurate.
> >
> > Some exemplar sets are formed using the locale's language *script* name,
> > which includes many more code points than are needed for standard writing
> > in that language. Other exemplar sets contain only the code points for
> > the lower case letters and collation sequences used in that language. If
> > you generate upper case variants for these code points you can get many
> > code points never used by the language.
> >
> > If one attempts to modify the exemplar set to include only commonly used
> > characters, the modifications often become complicated and unwieldy, and
> > may never work correctly. For example, the exemplar set for Hebrew (he)
> > uses the complete script name [:Hebr:]. To eliminate unnecessary code
> > points from this set you can *attempt* to modify it with the following
> > set operations:
> > [[[:dt=none:][:dt=canonical:]]&[:hebr:]] or perhaps this:
> > [[[:dt=none:][:dt=canonical:]]&[:hebr:]&[:letter:]]. However, this is
> > only one of many examples where such gyrations are needed to limit the
> > set membership to commonly needed code points. To make matters worse,
> > effectively applying character properties to modify these sets depends
> > too much on having intimate knowledge of these languages.
> >
> > Solution:
> >
> > It would be far simpler and much more accurate to create a new type of
> > exemplar set, the Standard Writing Exemplar Set, whose express, stated
> > purpose is to provide the set of code points needed for standard, common
> > writing (text layout) in each locale's language. This, by design,
> > includes lower and upper case characters and standard punctuation.
> >
> > Upper case characters are needed if one uses upper case letters when
> > commonly writing in the language. Punctuation is needed if punctuation
> > characters are used when commonly writing in the language. For example,
> > the Standard Writing Exemplar Set for English is [a-zA-Z.,;:!?()'"].
> > This type of proposed set would directly supply the data for a test of
> > font coverage for the locale's language. If other types of code points
> > are commonly used for mandatory ligatures or presentation forms, they
> > should be considered also. The idea is to explicitly include those code
> > points needed for common writing in the locale's language and not leave
> > set membership dependent upon set operations or some other type of
> > implicit mappings.
> >
> > Summary:
> >
> > To create a Standard Writing Exemplar Set which is differentiated from
> > the current Exemplar Set both in its explicit purpose and in its actual
> > set membership. The purpose is to provide the explicit data for
> > performing font coverage tests for locales. The membership rule for
> > including or excluding a code point is whether that code point is
> > commonly needed for common writing in the language. Writing includes
> > spelling, syntax and punctuation.
> >
> > Elisha Berns
> > 8/12/04
> >
> > To remove yourself from this mail list, send an e-mail to
> > ec...@un... and write "unsubscribe cldr-bugrfe" in the
> > subject line.
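
For readers who want to see what the font coverage test described in the feature request might look like, here is a minimal sketch in Python. It is an illustration only, with several assumptions: `expand_simple_set` is a hypothetical helper that understands only literal characters and simple `X-Y` ranges (it is not the ICU UnicodeSet parser and does not handle properties such as [:Hebr:]), and the font's coverage is modelled as a plain set of characters rather than read from a real font file.

```python
def expand_simple_set(pattern):
    """Expand a bracketed pattern made of literals and X-Y ranges.

    Hypothetical helper for illustration; it does not support Unicode
    properties, escapes, or nested set operations.
    """
    if not (pattern.startswith("[") and pattern.endswith("]")):
        raise ValueError("pattern must be enclosed in [ ]")
    body = pattern[1:-1]
    chars = set()
    i = 0
    while i < len(body):
        # A range is a character, a hyphen, and another character.
        if i + 2 < len(body) and body[i + 1] == "-":
            start, end = ord(body[i]), ord(body[i + 2])
            if start > end:
                raise ValueError("range start is greater than range end")
            chars.update(chr(cp) for cp in range(start, end + 1))
            i += 3
        else:
            chars.add(body[i])
            i += 1
    return chars


def missing_from_font(required, font_coverage):
    """Return the required characters the font lacks, in code point order."""
    return sorted(required - font_coverage)


# The English example set quoted in the feature request.
english = expand_simple_set("[a-zA-Z.,;:!?()'\"]")

# Assumed coverage of an imaginary font: letters plus period and comma.
font_coverage = expand_simple_set("[a-zA-Z.,]")

missing = missing_from_font(english, font_coverage)
print(missing)
```

With an explicit set like this, the coverage test is a single set difference, which is the point of the proposal: no property-based set operations are needed to decide what a font must contain.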