From: Beni C. <cb...@te...> - 2003-08-06 16:59:17
|
However docutils doesn't interpret unicode characters as markup, but in
many cases it would make a lot of sense. Actually, we've recently got one
real unicode char handled as markup: a real em-dash in attributions. Here
are the missing things I can think of:

* Many unicode chars should be accepted for bullet lists (including
  BULLET (U+2022), of course).

* Many unicode chars should be accepted for section adornments.
  Example: OVERLINE (U+203E).

* When using OVERLINE below the section title, it would make sense to
  use underline above it in a double adornment style. Should we open
  the spec to different characters for overline and underline?

* The same characters should also be allowed in transitions.

* Many punctuation characters should get the same status for inline
  markup recognition as ASCII punctuation. This is tricky because the
  currently allowed punctuation was hand-picked, with end-of-sentence
  punctuation allowed only after end-strings. But consider e.g.
  Spanish, where questions and exclamations also have inverted
  question/exclamation marks at the *beginning* of the sentence.

* Should we allow superscript digits for footnote references? I think
  not; superscript digits are a hack...

* Should we allow line drawing characters in tables? They certainly
  look neat if one has the nerve to draw them ;-). I've seen some
  editors that help with them, but only on DOS (in IBM PC encoding).

Obviously most if not all of the above cases apply to long lists of
Unicode characters, maintaining which by hand in docutils would be a
bad idea. If possible we should define the behavior in terms of Unicode
character properties / block names. In cases where that's not possible,
perhaps it's not worth it.

So, is there some Unicode expert on this list who could spare us the
research for choosing appropriate character properties for these roles?

-- Beni Cherniavsky <cb...@tx...> |
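[Editorial note: the property-based approach Beni proposes can be sketched with Python's standard `unicodedata` module. The function name `is_adornment_char` is purely illustrative, not part of Docutils.]

```python
import unicodedata

def is_adornment_char(ch):
    """True if ch falls in a punctuation (P*) or symbol (S*)
    general category -- the printable non-alphabetic classes."""
    return unicodedata.category(ch)[0] in ('P', 'S')

# Every ASCII adornment character currently allowed qualifies:
ascii_adornments = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
assert all(is_adornment_char(c) for c in ascii_adornments)

# So do OVERLINE (U+203E) and BULLET (U+2022):
assert is_adornment_char('\u203e')
assert is_adornment_char('\u2022')

# Letters, digits, and whitespace do not:
assert not any(is_adornment_char(c) for c in 'aZ9 \t')
```

A rule expressed this way needs no hand-maintained character list; it tracks whatever Unicode version Python's database ships with.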
From: David G. <go...@py...> - 2003-08-06 18:20:54
|
Beni Cherniavsky wrote:
> However docutils doesn't interpret unicode characters as markup but
> in many cases it would make a lot of sense.

I agree. Patches welcome!

> * When using OVERLINE below the section title, it would make sense
>   to use underline above it in a double adornment style. Should we
>   open the spec to different characters for overline and underline?

That would be a significant change. I'm not sure it'd be worthwhile.
There are plenty of characters that can be used (especially if Unicode
characters are added!) without complicating the spec or parser.

> * Should we allow line drawing characters in tables? They certainly
>   look neat if one has the nerve to draw them ;-). I've seen some
>   editors that help with them but only on DOS (in IBM PC encoding).

If the demand is there -- along with the patch contributions -- I don't
see why not. Especially if reStructuredText ever takes off in Japan,
such a feature would be very useful. I don't know how it is now, but
when I was teaching English in Japan most teachers had a "wapuro" (word
processor) for editing text, and these operated in a grid-editing
fashion (like Emacs' Picture mode). Tables were built with line-drawing
characters (via arrow keys). It helps that Japanese typically uses
monospaced type.

-- David Goodger    http://starship.python.net/~goodger
For hire: http://starship.python.net/~goodger/cv
Docutils: http://docutils.sourceforge.net/
(includes reStructuredText: http://docutils.sf.net/rst.html) |
From: Beni C. <cb...@te...> - 2003-08-07 07:26:02
|
David Goodger wrote on 2003-08-06:
> Beni Cherniavsky wrote:
> > However docutils doesn't interpret unicode characters as markup but
> > in many cases it would make a lot of sense.
>
> I agree. Patches welcome!

OK. But I want to figure out what characters to add where first.
What about my other points?

> > * When using OVERLINE below the section title, it would make sense
> >   to use underline above it in a double adornment style. Should we
> >   open the spec to different characters for overline and underline?
>
> That would be a significant change. I'm not sure it'd be worthwhile.

I leave it to you to decide ;-).

> There are plenty of characters that can be used (especially if Unicode
> characters are added!) without complicating the spec or parser.

True. I'd say all non-alphabetic and non-control/whitespace characters
should be allowed.

> > * Should we allow line drawing characters in tables? They certainly
> >   look neat if one has the nerve to draw them ;-). I've seen some
> >   editors that help with them but only on DOS (in IBM PC encoding).
>
> If the demand is there -- along with the patch contributions -- I
> don't see why not. Especially if reStructuredText ever takes off in
> Japan, such a feature would be very useful. I don't know how it is
> now, but when I was teaching English in Japan most teachers had a
> "wapuro" (word processor) for editing text, and these operated in a
> grid-editing fashion (like Emacs' Picture mode). Tables were built
> with line-drawing characters (via arrow keys). It helps that Japanese
> typically uses monospaced type.

This is the most complex of the proposed extensions to add. It would
have to recognize all the different line drawing chars at the proper
places. No demand from me, and I won't do it ;-).

-- Beni Cherniavsky <cb...@tx...> |
From: David G. <go...@py...> - 2003-08-07 13:16:04
|
[David Goodger]
>> I agree. Patches welcome!

[Beni Cherniavsky]
> OK. But I want to figure out what characters to add where first.
> What about my other points?

I agree, in principle, with all your points, except for those I dealt
with specifically. Details and implementation can wait for sufficient
demand and motivation. No demand or motivation from me at present
either.

-- David Goodger |
From: Beni C. <cb...@te...> - 2003-08-07 13:53:56
|
Beni Cherniavsky wrote on 2003-08-06:

I've read a bit on `Unicode character classes`__. I used `Zvon's
character search`__ and good old grep over UnicodeData.txt a lot.

__ http://www.unicode.org/Public/UNIDATA/UCD.html#General_Category_Values
__ http://zvon.org/other/charSearch/PHP/search.php

Here are my findings:

> * Many unicode chars should be accepted for section adornments.
>   Example: OVERLINE (U+203E).
>
> * The same characters should also be allowed in transitions.

Anything with a major class of `P` (punctuation) or `S` (symbol) should
be allowed. These are the printable non-alphabetic categories. In the
ASCII range this gives precisely the currently allowed set.

> * Many punctuation characters should get the same status for inline
>   markup recognition as ASCII punctuation. This is tricky because the
>   currently allowed punctuation was hand-picked, with end-of-sentence
>   punctuation allowed only after end-strings. But consider e.g.
>   Spanish, where questions and exclamations also have inverted
>   question/exclamation marks at the *beginning* of the sentence.

This is mostly solved by the Punctuation major class.

- `Pc` (connector minor class) should probably be excluded; it's for
  underscore and similar-in-spirit characters (however, it contains the
  KATAKANA MIDDLE DOT, which is said to function like a dot; Unicode
  has a special `Hyphen` property that includes all dashes + it, how
  unicodish).

- `Pd` (dashes) and `Po` (other) should be allowed both before and
  after. `Po` is very big (see below) but I don't think it can cause
  problems here.

- `Ps` and `Pe` are not wide: ``([{`` and ``)]}`` respectively. Do
  what we do now (also for ``<``, ``>``).

- `Pi` and `Pf` include various quotation marks. Because they are all
  ambiguous, these classes are described by Unicode as "sometimes
  opening, sometimes closing". So allow them in both positions, as we
  do for ``'`` and ``"``.

It seems that Unicode doesn't make any separation for
start/end-of-sentence punctuation, like question marks. I propose to
retain the special-casing of ``.,;!?\`` and allow all characters of the
Punctuation/other class in both positions; the alternative would be
hand-picking, which is a bad idea for so many characters.

> * Many unicode chars should be accepted for bullet lists (including
>   BULLET (U+2022), of course).

This is tricky. I don't want too many false positives. For example,
there are 13 characters with "BULLET" in their name. 5 have
Punctuation/other class, 5 have Symbol/other class, but 3 have
Symbol/math class. Now, math symbols should probably be excluded
because many can appear at the beginning of a paragraph that is a
formula or just a sentence with math shorthands (think about the
"exists" and "for all" quantifiers). OTOH, the Symbol/math category
includes some things we do want for bullets: ASCII ``-`` and ``+``,
many arrows and triangles (see about blocks_ below) and perhaps others.

The problem is by what criterion to include the others. Character
classes don't help here. There are 195 Punctuation/other characters
(18 of them in ASCII!), of which by manual inspection, only these look
appropriate:

- 0040;COMMERCIAL AT (``@``, not sure about it)
- 00B7;MIDDLE DOT
- 2020;DAGGER
- 2021;DOUBLE DAGGER
- 2022;BULLET
- 2023;TRIANGULAR BULLET
- 2042;ASTERISM
- 2043;HYPHEN BULLET
- 204C;BLACK LEFTWARDS BULLET
- 204D;BLACK RIGHTWARDS BULLET
- 2051;TWO ASTERISKS ALIGNED VERTICALLY

There are above 2069 Symbol/other characters. It'd be futile to
hand-pick. However, there is also a concept of _`blocks` - codepoint
ranges, with a much more fine-grained division of character groups.
I'd say the following blocks can be taken as a whole:

- `Arrows 2190-21FF`__.

  __ http://www.unicode.org/charts/PDF/U2190.pdf

- `Geometric Shapes 25A0-25FF`__.

  __ http://www.unicode.org/charts/PDF/U25A0.pdf

- `Miscellaneous Symbols 2600-26FF`__. Some are borderline, e.g. chess
  pieces could be the first things in lines of a chess game recording;
  what saves us is that things only act as bullets if followed by a
  space.

  __ http://www.unicode.org/charts/PDF/U2600.pdf

- `Dingbats 2700-27BF`__ except perhaps for the numbers in white/black
  circles (2776-2793) (see below).

  __ http://www.unicode.org/charts/PDF/U2700.pdf

I don't see other significant groups that qualify. Phew ;-).

-----

More points I missed:

* Enumerated lists only allow ASCII characters now. This should be
  extended. The problem is that there are so many digits in Unicode.
  There are about 42 characters that express the number ONE! Luckily,
  Unicode defines `Numeric_Type` and `Numeric_Value` properties for
  all such characters. Even more luckily, Python's `int()` correctly
  handles all decimal numbers! It does not handle characters that are
  "digits" but not "decimal":

  - 2460 CIRCLED DIGIT ONE
  - 2474 PARENTHESIZED DIGIT ONE
  - 2488 DIGIT ONE FULL STOP
  - 24F5 DOUBLE CIRCLED DIGIT ONE
  - 2776 DINGBAT NEGATIVE CIRCLED DIGIT ONE
  - 2780 DINGBAT CIRCLED SANS-SERIF DIGIT ONE
  - 278A DINGBAT NEGATIVE CIRCLED SANS-SERIF DIGIT ONE

  All these seem to be intended to be directly usable as list items
  without separators; should we allow them? Many of these go up to 20.

  It would be nifty to handle roman numeral characters (e.g. there is
  a single character for VII); I leave this to the author of the
  `roman` module ;-).

  Finally, to really i18n things, we should support letter-enumerated
  lists in languages other than English. This is tricky, because some
  languages using the same alphabet assign different orders to it. I
  propose to allow each language to define its order of enumeration,
  as part of the i18n modules, and allow only Latin and the document
  language's own.

* Canonization of Reference Names. Do we handle Unicode whitespace
  here? Should we pass the names through canonical (or compatibility)
  decomposition? Anything else to fix here?

* Wide characters - do we handle those correctly w.r.t. tables?

> * Should we allow superscript digits for footnote references? I think
>   not, superscript digits are a hack...

I'm dropping this idea.

-- Beni Cherniavsky <cb...@tx...> |
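[Editorial note: the distinction above between "decimal" digits (which Python's `int()` accepts) and "digit" characters like CIRCLED DIGIT ONE (which it rejects) can be verified with the standard `unicodedata` module. A quick verification sketch, not Docutils code:]

```python
import unicodedata

# Decimal digits from any script are understood by int():
# ARABIC-INDIC DIGIT ONE, TWO, THREE (U+0661..U+0663)
assert int('\u0661\u0662\u0663') == 123

# CIRCLED DIGIT ONE (U+2460) is a "digit" but not "decimal":
assert unicodedata.digit('\u2460') == 1      # has a digit value...
try:
    unicodedata.decimal('\u2460')            # ...but no decimal value
except ValueError:
    circled_is_decimal = False
assert not circled_is_decimal

try:
    int('\u2460')                            # so int() rejects it too
except ValueError:
    int_accepts_circled = False
assert not int_accepts_circled
```

So an enumerator parser could lean on `int()` for all true decimal digits and treat the circled/parenthesized variants as a separate, explicitly listed case.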
From: David G. <go...@py...> - 2003-08-12 17:06:13
|
Beni Cherniavsky wrote:
> I've read a bit on `Unicode character classes`__.

Interesting results. Is there any way we can use Unicode character
classes directly in the parser? I don't see any way to specify a
character class in a regular expression, or get a set of characters
belonging to a specific class cheaply. Perhaps the best way would be
to write a helper program which analyzes UnicodeData.txt or the
built-in Unicode database (via the unicodedata module), and produces
data which can be copied into the parser source. It may be "by hand",
but I can live with it.

Again, I agree, in principle, with all your points. Exceptions are
dealt with specifically.

> - `Pc` (connector minor class) should probably be excluded; it's for
>   underscore and similar-in-spirit characters (however, it contains
>   the KATAKANA MIDDLE DOT, which is said to function like a dot;
>   Unicode has a special `Hyphen` property that includes all dashes +
>   it, how unicodish).

When I write my name transliterated in Japanese (Katakana), I use the
KATAKANA MIDDLE DOT (U+30FB) to separate my given name from my family
name.

>> * Many unicode chars should be accepted for bullet lists (including
>>   BULLET (U+2022), of course).
...
> The problem is by what criterion to include the others.

Probably subjective criteria only.

> Character classes don't help here. There are 195 Punctuation/other
> characters (18 of them in ASCII!), of which by manual inspection,
> only these look appropriate:
>
> - 0040;COMMERCIAL AT (``@``, not sure about it)

No; its meaning is too specific to be useful for a bullet.

> - 00B7;MIDDLE DOT
...
> - 2022;BULLET
> - 2023;TRIANGULAR BULLET
...
> - 2043;HYPHEN BULLET
> - 204C;BLACK LEFTWARDS BULLET
> - 204D;BLACK RIGHTWARDS BULLET

Sure.

> - 2020;DAGGER
> - 2021;DOUBLE DAGGER

Inappropriate, IMO.

> - 2042;ASTERISM

I looked up "asterism": "Three asterisks placed ... to direct
attention to a particular passage." Inappropriate as a bullet, IMO.

> - 2051;TWO ASTERISKS ALIGNED VERTICALLY

Don't know about this one. Probably inappropriate.

> There are above 2069 Symbol/other characters.

Do you mean "more than 2069 characters" or "characters above code
point U+2069"?

> It'd be futile to hand-pick.

But if we don't hand-pick, we're liable to include inappropriate
symbols, as can be seen in the list above. I'd say either hand-pick,
or leave out the entire class, since probably most of them have
meanings beyond "bullet".

> However, there is also a concept of blocks - codepoint ranges, with a
> much more fine-grained division of character groups. I'd say the
> following blocks can be taken as a whole:
>
> - `Arrows 2190-21FF`__.

I wouldn't include these as bullets. The arrows in "Dingbats" may be
appropriate, but these ones don't seem so.

> - `Geometric Shapes 25A0-25FF`__.

Many/most of these would be fine as bullets. Nothing obviously
inappropriate.

> - `Miscellaneous Symbols 2600-26FF`__. Some are borderline, e.g.
>   chess pieces could be the first things in lines of a chess game
>   recording; what saves us is that things only act as bullets if
>   followed by a space.

Almost all of these symbols would be inappropriate IMO. Only the
stars (U+2605, U+2606) seem like good bullets to me.

> - `Dingbats 2700-27BF`__ except perhaps for the numbers in
>   white/black circles (2776-2793) (see below).

I'd exclude a lot more than the numbers. Some look appropriate, some
not. Many could be controversial (would you want to see the Star of
David used as a list bullet?).

> More points I missed:
>
> * Enumerated lists only allow ASCII characters now. This should be
>   extended.

These are tricky. I can't see people using the graphical variations
of Arabic numerals much. I could see extending the spec & parser for
languages which use other number systems, like Japanese & Chinese, and
many others; that would be an i18n issue.

> All these seem to be intended to be directly usable as list items
> without separators; should we allow them? Many of these go up to 20.

I'd rather not. Let's at least wait until someone files a bug
report. :-)

> It would be nifty to handle roman numeral characters (e.g. there is
> a single character for VII); I leave this to the author of the
> `roman` module ;-).

Nifty, perhaps, but would anybody ever use it? Again, let's wait for
demand.

> Finally, to really i18n things, we should support letter-enumerated
> lists in languages other than English. This is tricky, because some
> languages using the same alphabet assign different orders to it. I
> propose to allow each language to define its order of enumeration,
> as part of the i18n modules,

Seems reasonable.

> and allow only Latin and the document language's own.

Allow both simultaneously? Wouldn't this lead to the described
ambiguities? I'd say that the English order should be the default,
and other languages can override it; either replace it completely, or
add to it.

> * Canonization of Reference Names. Do we handle Unicode whitespace
>   here?

No, we don't. We should though.

> Should we pass the names through canonical (or compatibility)
> decomposition?

What exactly does this mean? Can you provide examples?

> Anything else to fix here?

(Possibly part of the above.) Non-ASCII alphabetic characters
(accented characters) are inadequately normalized. A reference name
like in the target "_`Montreal, Quebec`" ought to be normalized to
"montreal-quebec", not to "montr-al-qu-bec" as is done now. Further
afield, names in non-alphabetic languages (like Japanese or Arabic)
ought to be transliterated. Or, perhaps they ought to be left alone.
The "unicodedata.decomposition" function and "isalnum" Unicode string
method look like they may be useful here.

> * Wide characters - do we handle those correctly w.r.t. tables?

No, Docutils doesn't know anything about wide characters. See the
second item under <http://docutils.sf.net/spec/notes.html#bugs>.

-- David Goodger    http://starship.python.net/~goodger
For hire: http://starship.python.net/~goodger/cv
Docutils: http://docutils.sourceforge.net/
(includes reStructuredText: http://docutils.sf.net/rst.html) |
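[Editorial note: the helper program David suggests could scan the built-in Unicode database and emit compact codepoint ranges for pasting into the parser source. A sketch under those assumptions; `category_ranges` is a hypothetical name, not a Docutils function:]

```python
import unicodedata

def category_ranges(prefixes, limit=0x10000):
    """Collect contiguous codepoint ranges (inclusive) whose general
    category starts with one of the given prefixes, e.g. {'P', 'S'}."""
    ranges = []
    start = None
    for cp in range(limit):
        if unicodedata.category(chr(cp))[0] in prefixes:
            if start is None:
                start = cp
        elif start is not None:
            ranges.append((start, cp - 1))
            start = None
    if start is not None:
        ranges.append((start, limit - 1))
    return ranges

# Emit a regex character class usable in the parser source:
ranges = category_ranges({'P', 'S'})
char_class = '[%s]' % ''.join('\\u%04x-\\u%04x' % r for r in ranges)
print(char_class)
```

Running this once per Unicode update and pasting the output into the parser avoids both hand-maintained lists and per-character database lookups at parse time.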
From: Beni C. <cb...@te...> - 2003-08-12 18:36:48
Attachments:
composed.txt
|
David Goodger wrote on 2003-08-12:
> Beni Cherniavsky wrote:
> > I've read a bit on `Unicode character classes`__.
>
> Interesting results. Is there any way we can use Unicode character
> classes directly in the parser? I don't see any way to specify a
> character class in a regular expression, or get a set of characters
> belonging to a specific class cheaply. Perhaps the best way would be
> to write a helper program which analyzes UnicodeData.txt or the
> built-in Unicode database (via the unicodedata module), and produces
> data which can be copied into the parser source. It may be "by
> hand", but I can live with it.

The classes thing helped in only some of the cases; the others need
hardcoding anyway. Can we use codepoint ranges in the parser? I
believe all the groups below can be compiled to rather short sets of
ranges.

> Again, I agree, in principle, with all your points. Exceptions are
> dealt with specifically.
>
> > - `Pc` (connector minor class) should probably be excluded; it's
> >   for underscore and similar-in-spirit characters (however, it
> >   contains the KATAKANA MIDDLE DOT, which is said to function like
> >   a dot; Unicode has a special `Hyphen` property that includes all
> >   dashes + it, how unicodish).
>
> When I write my name transliterated in Japanese (Katakana), I use
> the KATAKANA MIDDLE DOT (U+30FB) to separate my given name from my
> family name.

I know zero Japanese, I just quoted what I saw ;-). So we include it.

> >> * Many unicode chars should be accepted for bullet lists
> >>   (including BULLET (U+2022), of course).
> ...
> > The problem is by what criterion to include the others.
>
> Probably subjective criteria only.
>
> > - 0040;COMMERCIAL AT (``@``, not sure about it)
>
> No; its meaning is too specific to be useful for a bullet.

OK. It's not pretty anyway.

> > - 00B7;MIDDLE DOT
> ...
> > - 2022;BULLET
> > - 2023;TRIANGULAR BULLET
> ...
> > - 2043;HYPHEN BULLET
> > - 204C;BLACK LEFTWARDS BULLET
> > - 204D;BLACK RIGHTWARDS BULLET
>
> Sure.
>
> > - 2020;DAGGER
> > - 2021;DOUBLE DAGGER
>
> Inappropriate, IMO.
>
> > - 2042;ASTERISM
>
> I looked up "asterism": "Three asterisks placed ... to direct
> attention to a particular passage." Inappropriate as a bullet, IMO.
>
> > - 2051;TWO ASTERISKS ALIGNED VERTICALLY
>
> Don't know about this one. Probably inappropriate.

I was picking by the look. Going by meaning is a good idea, as it
allows fewer characters.

> > There are above 2069 Symbol/other characters.
>
> Do you mean "more than 2069 characters" or "characters above code
> point U+2069"?

More than 2069; don't ask me where I took the number from (I see
2496).

> > It'd be futile to hand-pick.
>
> But if we don't hand-pick, we're liable to include inappropriate
> symbols, as can be seen in the list above. I'd say either hand-pick,
> or leave out the entire class, since probably most of them have
> meanings beyond "bullet".

OK. Some characters, notably dingbats, have no useful meaning at all.

> > However, there is also a concept of blocks - codepoint ranges,
> > with a much more fine-grained division of character groups. I'd
> > say the following blocks can be taken as a whole:
> >
> > - `Arrows 2190-21FF`__.
>
> I wouldn't include these as bullets. The arrows in "Dingbats" may be
> appropriate, but these ones don't seem so.

Given the duplication with the arrows in dingbats, you have a point.
These have a meaning of "real" arrows (what's that?).

> > - `Geometric Shapes 25A0-25FF`__.
>
> Many/most of these would be fine as bullets. Nothing obviously
> inappropriate.
>
> > - `Miscellaneous Symbols 2600-26FF`__. Some are borderline, e.g.
> >   chess pieces could be the first things in lines of a chess game
> >   recording; what saves us is that things only act as bullets if
> >   followed by a space.
>
> Almost all of these symbols would be inappropriate IMO. Only the
> stars (U+2605, U+2606) seem like good bullets to me.

OK, only the stars go.

> > - `Dingbats 2700-27BF`__ except perhaps for the numbers in
> >   white/black circles (2776-2793) (see below).
>
> I'd exclude a lot more than the numbers. Some look appropriate, some
> not. Many could be controversial (would you want to see the Star of
> David used as a list bullet?).

Yes, in a Zionistic presentation, why not ;-). I don't know. Looking
at presentations people make, they use just about anything as bullets,
certainly all imaginable dingbats. But why should such practices have
a place in reST? The only real benefit, besides empty claims of
"industry-quality unicode support", would be the ability to select
appropriate bullets in some writers. Since currently no writers pay
any attention to the bullet kind in the source (does any?), it can be
left until someone really wants it... Perhaps the best approach is to
allow [-+*], the real BULLET characters, and that's it. All the rest
would be dumped into the notes/alternatives files...

> > More points I missed:
> >
> > * Enumerated lists only allow ASCII characters now. This should
> >   be extended.
>
> These are tricky. I can't see people using the graphical variations
> of Arabic numerals much. I could see extending the spec & parser for
> languages which use other number systems, like Japanese & Chinese,
> and many others; that would be an i18n issue.

OK.

> > All these seem to be intended to be directly usable as list items
> > without separators; should we allow them? Many of these go up to
> > 20.
>
> I'd rather not. Let's at least wait until someone files a bug
> report. :-)
>
> > It would be nifty to handle roman numeral characters (e.g. there
> > is a single character for VII); I leave this to the author of the
> > `roman` module ;-).
>
> Nifty, perhaps, but would anybody ever use it? Again, let's wait
> for demand.
>
> > Finally, to really i18n things, we should support
> > letter-enumerated lists in languages other than English. This is
> > tricky, because some languages using the same alphabet assign
> > different orders to it. I propose to allow each language to define
> > its order of enumeration, as part of the i18n modules,
>
> Seems reasonable.
>
> > and allow only Latin and the document language's own.
>
> Allow both simultaneously? Wouldn't this lead to the described
> ambiguities? I'd say that the English order should be the default,
> and other languages can override it; either replace it completely,
> or add to it.

Latin-based languages should override; in other scripts there are no
conflicts (e.g. sometimes Latin enumerations are seen in Hebrew
documents). Another reason: not allowing them to coexist would render
existing documents in other languages backwards-incompatible if they
contain any Latin enumerations. In languages where there are
conflicts, I guess people stayed away from Latin enumerations so far.

> > * Canonization of Reference Names. Do we handle Unicode whitespace
> >   here?
>
> No, we don't. We should though.
>
> > Should we pass the names through canonical (or compatibility)
> > decomposition?
>
> What exactly does this mean? Can you provide examples?

- 00F1;LATIN SMALL LETTER N WITH TILDE

is canonically equivalent to the two characters:

- 006E;LATIN SMALL LETTER N
- 0303;COMBINING TILDE

Compatibility decomposition is a more aggressive, non-reversible
process, converting "compatibility characters" (ones that are only
there for round-trip compatibility with obscure encodings) to more
sane characters (e.g. many variants on decimal digits would be
converted into simple ASCII digits). I tend to think this is none of
our business. Canonical equivalence is nicer because it's really
equivalent, in all known encodings. So if you write a reference
target with one and the reference with another, will it work?

I'm attaching a micro-test file that fails. [Just now discovered that
quicktest.py doesn't check links; now I can't trust any test I
previously did ;]. Moreover, I noticed that I can't write the n-tilde
with a trailing underscore as a simple reference; I must use backticks
- with both forms. This should be fixed: any sequence of letters and
combining characters should be considered a single word.

> > Anything else to fix here?
>
> (Possibly part of the above.) Non-ASCII alphabetic characters
> (accented characters) are inadequately normalized. A reference name
> like in the target "_`Montreal, Quebec`" ought to be normalized to
> "montreal-quebec", not to "montr-al-qu-bec" as is done now.

Yes, this is part of the need to redefine "word". Perhaps we should
take a look at Nameprep (RFC 3491, encodings.idna module in Python
2.3).

> Further afield, names in non-alphabetic languages (like Japanese or
> Arabic) ought to be transliterated. Or, perhaps they ought to be
> left alone.

Transliteration seems a bad idea. Only a few languages (if any) have
simple standard transliteration algorithms. It's better to know that
Ierushalaim written in Hebrew can't ever match Jerusalem in English
than to wonder whether it will match or not...

> The "unicodedata.decomposition" function and "isalnum" Unicode
> string method look like they may be useful here.

Yes. I see no reason not to decompose the whole document on input.
Currently what you write is what you get (checked with the attached
file on html.py) but there is no legal reason for somebody to
fine-control the output - it's really equivalent. At most, we might
want to control the normalization form used on output (e.g. HTML
should be NFC (Norm. Form C = precomposed) per W3C's
recommendations).

> > * Wide characters - do we handle those correctly w.r.t. tables?
>
> No, Docutils doesn't know anything about wide characters. See the
> second item under <http://docutils.sf.net/spec/notes.html#bugs>.

I'll not touch these issues because I don't speak wide languages...

-- Beni Cherniavsky <cb...@tx...> |
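[Editorial note: the canonical-equivalence example above (U+00F1 versus n + COMBINING TILDE) can be demonstrated with `unicodedata.normalize`. The `canonical` helper below is only an illustrative sketch of a normalization-aware reference-name comparison, not the Docutils implementation:]

```python
import unicodedata

precomposed = '\u00f1'    # LATIN SMALL LETTER N WITH TILDE
decomposed = 'n\u0303'    # 'n' + COMBINING TILDE

# The two spellings differ as strings but are canonically equivalent:
assert precomposed != decomposed
assert unicodedata.normalize('NFD', precomposed) == decomposed
assert unicodedata.normalize('NFC', decomposed) == precomposed

def canonical(name):
    """Sketch: compare reference names in a single normalization
    form, so either spelling resolves to the same target."""
    return unicodedata.normalize('NFC', name).lower()

# A target written with U+00F1 matches a reference typed as n + tilde:
assert canonical('Espa\u00f1a') == canonical('Espan\u0303a')
```

Normalizing the whole document on input (and picking the output form per writer, e.g. NFC for HTML) would make such mismatches impossible by construction.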
From: Mark N. <no...@so...> - 2003-08-12 18:48:58
|
David Goodger wrote:
> > Finally, to really i18n things, we should support letter-enumerated
> > lists in languages other than English. This is tricky, because some
> > languages using the same alphabet assign different orders to it. I
> > propose to allow each language to define its order of enumeration,
> > as part of the i18n modules,
>
> Seems reasonable.

Another complication is that some languages consider digraphs (two
characters) to comprise a single letter. For example, the Welsh
alphabet is

    A B C CH D DD E F FF G NG H I J L LL M N O P PH R RH S T TH U W Y

where CH is considered to be a different letter from C, etc. (Notice
that NG comes right after G, which can make looking words up in a
Welsh dictionary tricky.)

--Mark |
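[Editorial note: one way the i18n modules could accommodate digraph letters like Mark describes is to define each language's enumeration as an ordered sequence rather than a character range. A hypothetical sketch; `ALPHABETS` and `enumerator` are not Docutils names:]

```python
# Per-language enumeration sequences. Digraph letters (Welsh CH, DD,
# NG, ...) are multi-character entries, so enumerators must be looked
# up in an ordered sequence, not computed from a codepoint range.
ALPHABETS = {
    'en': list('abcdefghijklmnopqrstuvwxyz'),
    'cy': ['a', 'b', 'c', 'ch', 'd', 'dd', 'e', 'f', 'ff', 'g', 'ng',
           'h', 'i', 'j', 'l', 'll', 'm', 'n', 'o', 'p', 'ph', 'r',
           'rh', 's', 't', 'th', 'u', 'w', 'y'],
}

def enumerator(lang, index):
    """Return the letter enumerator for a 1-based list-item index."""
    return ALPHABETS[lang][index - 1]

assert enumerator('cy', 4) == 'ch'   # CH is the fourth Welsh letter
assert enumerator('cy', 11) == 'ng'  # NG comes right after G
```

The parser would then match enumerators longest-first against the active language's sequence, so `ch.` is one Welsh letter rather than `c` followed by stray text.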