From: Peter S. <pe...@ja...> - 2003-02-21 07:18:10
|
Okay. I think this is the last pre-processor. It looks like a tool used
when cross-compiling to OS2 to convert from latin-1 to whatever
character set OS2 uses. Other than ascii 15, 20, and 21 (decimal) being
illegal, the target charset seems to be the same as latin-1 for 7-bit
ascii. It further seems that the only (or almost the only) places that
high ascii characters are used in the CLISP source are the German
comments.

Here's a breakdown of the characters that would be converted by
cv_lt_pc that actually occur in the .d files by count, decimal
character value, and description. These characters have legal
translations according to cv_lt_pc:

count  char  desc
 1201   188  VULGAR FRACTION ONE QUARTER
  781   164  CURRENCY SIGN
  334   182  PILCROW SIGN
   19   191  INVERTED QUESTION MARK
    2   197  LATIN CAPITAL LETTER A WITH RING ABOVE
    2   187  RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
    2   168  DIAERESIS
    2   181  MICRO SIGN
    1   184  CEDILLA
    1   171  LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
    1   160  NO-BREAK SPACE

While these don't and, as far as I can tell, would cause the build to
fail because cv_lt_pc would exit with a non-zero value:

count  char  desc
  375   195  LATIN CAPITAL LETTER A WITH TILDE
    6   169  COPYRIGHT SIGN
    1   148  not defined in latin-1
    1   143  not defined in latin-1
    1   149  not defined in latin-1

So, I'm wondering if maybe it'd be okay, while we're waiting for the
German comments to be translated to at least transliterate them into
7-bit ascii. This would allow me to get rid of the infrastructure that
sometimes runs the .d files through cv_lt_pc.

If that's acceptable, all I need is for some German speaker to tell me
how to transliterate these characters. (I'm guessing that in some cases
two or more characters get transliterated to a single character. For
instance the "CURRENCY SIGN" always seems to be preceded by "LATIN
CAPITAL LETTER A WITH TILDE"--I assume those two chars together
transliterate into a single German character, which we could then
re-represent as the nearest 7-bit ascii.)

There's another program in the os2 directory, cv_pc_lt.c. I don't see anywhere it's used. Am I missing it? -Peter -- Peter Seibel pe...@ja... |
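
[Editorial sketch, not part of the original message.] The kind of per-byte breakdown described above can be reproduced with a few lines of Python. This is only a minimal sketch: the function names and the sample comment are made up for illustration, and naming each byte as a Latin-1 character is an assumption about the encoding of the files, not something the counts themselves establish.

```python
from collections import Counter
import unicodedata


def high_byte_histogram(data: bytes) -> Counter:
    """Count every byte with the high bit set (value >= 0x80)."""
    return Counter(b for b in data if b >= 0x80)


def describe_as_latin1(byte: int) -> str:
    """Name a byte as if it were a Latin-1 character.

    Bytes 0x80-0x9F are control codes without a Unicode name, so they
    are reported as undefined.
    """
    return unicodedata.name(chr(byte), "not defined in latin-1")


# Hypothetical sample comment, encoded as UTF-8 purely for illustration.
sample = "# Schließt eine Datei für später".encode("utf-8")

for byte, count in high_byte_histogram(sample).most_common():
    print(count, byte, describe_as_latin1(byte))
```

Run against real files, the same two helpers would be applied to the bytes read with `open(path, "rb").read()`.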
From: Peter S. <pe...@ja...> - 2003-02-23 21:41:07
|
Peter Seibel <pe...@ja...> writes: > So, I'm wondering if maybe it'd be okay, while we're waiting for the > German comments to be translated to at least transliterate them into > 7-bit ascii. No one has replied on this question? Sam? Bruno? > If that's acceptable, all I need is for some German speaker to tell > me how to transliterate these characters. (I'm guessing that in some > cases two or more characters get transliterated to a single > character. For instance the "CURRENCY SIGN" always seems to be > preceded by "LATIN CAPITAL LETTER A WITH TILDE"--I assume those two > chars together transliterate into a single German character (which > we could then rerepresent as the nearest 7-bit ascii.) And I still need some help with this. (Sorry to be so needy. I really want to get through this conversion so I can get into actually looking at what all this source code *does*.) -Peter -- Peter Seibel pe...@ja... |
From: Dan K. <da...@ch...> - 2003-02-23 23:42:19
|
> > So, I'm wondering if maybe it'd be okay, while we're waiting for the
> > German comments to be translated to at least transliterate them into
> > 7-bit ascii.

If Sam and Bruno haven't replied, they aren't paying attention or have
no opinion. So though I am quite junior here, allow me to express one!

I definitely believe that would be all right, a very worthy goal. Go
ahead and do it, submit it as a finished patch. That'll force people
to speak up if they have serious objections.

As for how to transliterate, I'm not sure. You say that the
"currency sign" is always followed by the "latin capital letter a with
tilde", which makes me strongly suspect that you're not viewing the
file properly: why would a punctuation mark be used as part of a
ligature?

I did some googling for you and didn't come up with anything
informative enough to use. I know that there are ISO standards for
transliteration (which is a joke - to read fluidly, a transliteration
must take into account the history of the word, i.e. which languages it
has passed through "recently", rather than just its most recent form).
Using those, however awkward the resulting text might appear to a
bilingual reader, would certainly be indisputably correct and would not
lose any information. Now to see whether by any chance they are freely
available...

... I've now checked the ISO catalog and found that there is no
standard for transliterating German into English because it is already
written with Latin characters, albeit with lots of accents over them.

Here's something, proving that it pays to go past the first page of
search results:

http://www.ex.ac.uk/german/dict/output-start.html

This shows how umlauts are equivalent to ligatures with E; that is,
for any vowel with an umlaut (two dots) over it, you may replace it
with that vowel followed by the letter e. It also shows that a weird
character which looks like a capital Beta can be transliterated as
"ss". A little research tells me that that character is
#\LATIN_SMALL_LETTER_SHARP_S, oddly enough. So far, all these
characters are in ISO-Latin-1, so you should in all likelihood be
using that encoding to view German text.

Ironically enough, my X installation is all set up to view Japanese
text, and whatever I do I cannot coax it into accurately displaying
ISO-Latin-1 characters (except for Opera which uses its own system).
So I may not be able to provide much advice, here...

... Now that I look at it, #\currency_sign is in Latin-1 also.
It sure does look bizarre. Maybe it so happens that the word it
begins is always a written-out number? You should probably replace
it with whatever the name of the German currency is, or even with
the word "currency".

Also have a look at
http://www.tm.informatik.uni-frankfurt.de/~razi/steak/steak.html in
case it's any use. You may have to pass the homepage through babelfish.

| Dan Knapp, Knight of the Random Seed | http://brain.mics.net/~dankna/ | ONES WHO DOES NOT HAVE TRIFORCE CAN'T GO IN. |
From: Peter S. <pe...@ja...> - 2003-02-24 00:58:56
|
Dan Knapp <da...@ch...> writes:

> > > So, I'm wondering if maybe it'd be okay, while we're waiting for the
> > > German comments to be translated to at least transliterate them into
> > > 7-bit ascii.
>
> If Sam and Bruno haven't replied, they aren't paying attention or have
> no opinion. So though I am quite junior here, allow me to express one!
>
> I definitely believe that would be all right, a very worthy goal. Go
> ahead and do it, submit it as a finished patch. That'll force people
> to speak up if they have serious objections.
>
> As for how to transliterate, I'm not sure. You say that the
> "currency sign" is always followed by the "latin capital letter a with
> tilde", which makes me strongly suspect that you're not viewing the
> file properly: why would a punctuation mark be used as part of a
> ligature?

No idea. FWIW, I analyzed the files with a perl script that matched
all the special characters (i.e. those characters that are translated
specially by cv_lt_pc); Emacs shows me glyphs that seem to match the
names (i.e. latin capital letter a with tilde looks like an A with a
~).

Now I'm totally confused. I found a page with a chart of the
DOSLatinUS codepage (CP437)[1] and all cv_lt_pc seems to be doing is
translating some of the non-7-bit ascii chars to their corresponding
elements in that code page, if they exist. So in both latin-1 and the
PC encoding, the comments are going to have weird characters like the
"currency sign" and "vulgar fraction one quarter". Perhaps these
characters are junk that got inserted somewhere along the way as the
file was edited on some system that used some other encoding. Because
it seems that all the special German characters (i.e. the vowels with
umlauts and the "latin small letter sharp s") can be naturally encoded
in latin-1 (but not 7-bit ascii) but those aren't the characters that
are occurring in the CLISP source.

> This shows how umlauts are equivalent to ligatures with E; that is,
> for any vowel with an umlaut (two dots) over it, you may replace it
> with that vowel followed by the letter e. It also shows that a weird
> character which looks like a capital Beta can be transliterated as
> "ss".

So it seems that maybe we have a two-step problem. Figure out how to
translate these weirdo characters into the proper latin-1 encoding of
the desired German characters. After this phase the code should be
only 7-bit ascii plus these latin-1 characters:

octal  dec  hex  char  name
  304  196   C4   Ä    LATIN CAPITAL LETTER A WITH DIAERESIS
  313  203   CB   Ë    LATIN CAPITAL LETTER E WITH DIAERESIS
  317  207   CF   Ï    LATIN CAPITAL LETTER I WITH DIAERESIS
  326  214   D6   Ö    LATIN CAPITAL LETTER O WITH DIAERESIS
  334  220   DC   Ü    LATIN CAPITAL LETTER U WITH DIAERESIS
  344  228   E4   ä    LATIN SMALL LETTER A WITH DIAERESIS
  353  235   EB   ë    LATIN SMALL LETTER E WITH DIAERESIS
  357  239   EF   ï    LATIN SMALL LETTER I WITH DIAERESIS
  366  246   F6   ö    LATIN SMALL LETTER O WITH DIAERESIS
  374  252   FC   ü    LATIN SMALL LETTER U WITH DIAERESIS
  337  223   DF   ß    LATIN SMALL LETTER SHARP S

Then we can transliterate those into 7-bit ascii. (According to
Chicago's Manual of Style (14th ed) we can use the A-with-umlaut -> AE
scheme for upper-case vowels, and ss for the capital Beta. But you're
not supposed to translate lower-case a-with-umlaut to ae. They don't
say what you *are* supposed to do.)

Or--quite likely--I'm still totally confused about something; I
certainly have only a tenuous grasp of all this codepage stuff.

-Peter

[1] <http://czyborra.com/charsets/codepages.html>

-- 
Peter Seibel pe...@ja... |
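
[Editorial sketch, not part of the original message.] The "translate to the corresponding CP437 element, if it exists" behaviour described above can be imitated with Python's built-in codecs. This is a sketch under a loud assumption: Python's `cp437` codec stands in for cv_lt_pc's own table, which is not shown in the thread and may map some characters differently. The helper name `latin1_to_cp437` is hypothetical.

```python
def latin1_to_cp437(data: bytes) -> bytes:
    """Reinterpret Latin-1 bytes as text and re-encode them in CP437.

    Raises UnicodeEncodeError for characters that Python's cp437 codec
    cannot represent (a stand-in for a converter exiting non-zero).
    """
    return data.decode("latin-1").encode("cp437")


# a-with-diaeresis (0xE4 in Latin-1) has a slot in CP437.
print(latin1_to_cp437(b"\xe4"))

# A-with-tilde (0xC3 in Latin-1) has none, so the conversion fails.
try:
    latin1_to_cp437(b"\xc3")
except UnicodeEncodeError as err:
    print("no CP437 equivalent:", err.reason)
```

The failure on byte 195 mirrors the earlier observation that LATIN CAPITAL LETTER A WITH TILDE has no legal translation.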
From: Henrik M. <hen...@we...> - 2003-02-24 01:23:33
|
Dan Knapp <da...@ch...> writes:

> This shows how umlauts are equivalent to ligatures with E; that
> is, for any vowel with an umlaut (two dots) over it, you may replace
> it with that vowel followed by the letter e. It also shows that a
> weird character which looks like a capital Beta can be transliterated
> as "ss". A little research tells me that that character is
> #\LATIN_SMALL_LETTER_SHARP_S, oddly enough. So far, all these
> characters are in ISO-Latin-1, so you should in all likelihood be
> using that encoding to view German text.

This is correct; basically the non-ascii characters in German are

  ä, Ä => ae, Ae
  ö, Ö => oe, Oe
  ü, Ü => ue, Ue
  ß    => ss (lowercase only)

(I hope you can read this)

The sharp s, btw, originally was a ligature of an `s' and a `z'. Of
course, it doesn't look anything like that. That is because in older
German typefaces, the `s' in some cases looked like an `f' without the
horizontal bar, and the `z' like a mixture of a `3' and a `g'. That is
also the reason why there is no capital sz. There is only one word
that actually starts with sz anyway, AFAIK, that is Szene (scene).

> ... Now that I look at it, #\currency_sign is in Latin-1 also.
> It sure does look bizarre. Maybe it so happens that the word it
> begins is always a written-out number? You should probably replace
> it with whatever the name of the German currency is, or even with
> the word "currency".

Neither is the international currency sign likely to occur in
comments, German or not, nor are tildes over letters used in German, so
this has to be an artifact of a wrong encoding.

If it helps, here are some transliterated comments, one for each
Umlaut I found (the \303 stuff is what Emacs displays here, it may be
useful for automating this):

  affi.d, line 2:
  # Joerg Hoehle 11.8.1996
  (ö, \303\266 => oe)

  amiga.d, line 135
  # Oeffnet eine Datei.
  (Ö, \303\226 => Oe)

  amiga.d, line 1
  # Include-File fuer AMIGA-Version von CLISP
  (ü, \303\274 => ue)

  arisparc.d, line 5
  # Parameter-Uebergabe: in Registern %o0-%o5.
  (Ü, \303\234 => Ue)

  amiga.d, line 129
  # Setzt die naechste Fehlernummer.
  (ä, \303\244 => ae)

  amiga.d, line 144
  # Schliesst eine Datei.
  (ß, \303\237 => ss)

I didn't find an occurrence of capital A Umlaut (Ä) right now, but
since it's the only one missing, it should be easy to identify.

hth
Henrik |
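
[Editorial sketch, not part of the original message.] The octal escapes that Emacs displays can be turned back into characters mechanically. The sketch below parses a string such as `\303\244` into raw bytes and decodes them; the function name `decode_octal_escapes` is hypothetical, and treating the bytes as UTF-8 is an assumption that this message had not yet established.

```python
import re


def decode_octal_escapes(escaped: str, encoding: str = "utf-8") -> str:
    """Turn a string like r'\303\244' into the text those bytes encode."""
    octets = bytes(int(digits, 8) for digits in re.findall(r"\\([0-7]{3})", escaped))
    return octets.decode(encoding)


for escaped in (r"\303\244", r"\303\274", r"\303\237"):
    # Decoded as UTF-8, each pair of bytes yields one German character;
    # decoded as Latin-1, the same pair yields two unrelated characters.
    print(escaped,
          "utf-8:", decode_octal_escapes(escaped),
          "latin-1:", decode_octal_escapes(escaped, "latin-1"))
```

Passing `"latin-1"` as the encoding reproduces the "A with tilde followed by currency sign" reading discussed earlier in the thread.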
From: Peter S. <pe...@ja...> - 2003-02-24 05:25:53
|
Henrik Motakef <hen...@we...> writes:

> Dan Knapp <da...@ch...> writes:
>
> > This shows how umlauts are equivalent to ligatures with E; that
> > is, for any vowel with an umlaut (two dots) over it, you may
> > replace it with that vowel followed by the letter e. It also shows
> > that a weird character which looks like a capital Beta can be
> > transliterated as "ss". A little research tells me that that
> > character is #\LATIN_SMALL_LETTER_SHARP_S, oddly enough. So far,
> > all these characters are in ISO-Latin-1, so you should in all
> > likelihood be using that encoding to view German text.
>
> This is correct, basically the non-ascii characters in german are
>
> ä, Ä => ae, Ae
> ö, Ö => oe, Oe
> ü, Ü => ue, Ue
> ß => ss (lowercase only)
>
> (I hope you can read this)
>
> The sharp s, btw, originally was a ligature of an `s' and a `z'. Of
> course, it doesn't look anything like that. That is because in older
> german typefaces, the `s' in some cases looked like an `f' without
> the horizontal bar, and the `z' like a mixture of a `3' and a `g'.
> That is also the reason why there is no capital sz. There only is
> one word that actually starts with sz anyway, AFAIK, that is Szene
> (scene).
>
> > ... Now that I look at it, #\currency_sign is in Latin-1 also.
> > It sure does look bizarre. Maybe it so happens that the word it
> > begins is always a written-out number? You should probably replace
> > it with whatever the name of the German currency is, or even with
> > the word "currency".
>
> Neither is the international currency sign likely to occur in
> comments, german or not, nor are tildes over letters used in german,
> so this has to be an artifact of a wrong encoding.

Interesting. As far as I can tell--given that Unix just treats files
as arrays of octets--what's actually in the source is two bytes for
each special German character. For instance, below, where you have:
(ä, \303\244 => ae), that \303 is octal for the latin-1 character
"LATIN CAPITAL LETTER A WITH TILDE" and \244 is the latin-1 character
"CURRENCY SIGN".

Ah, ha! Duh. I'm a dope. Okay, so the source code is UTF-8 encoded
unicode. And, IIRC, unicode 00-ff is the same as latin-1. So the
character "latin small o with diaeresis" (a.k.a. o-with-an-umlaut) is
F6 (hex) in unicode. But in UTF-8 that gets turned into:

  F6 => 11110110 => 11000011 10110110 => C3 B6 => \303 \266

Which is what we see in the files. It all makes sense now.

So I'll decode the unicode back into latin-1 and then transliterate
all the vowels with umlauts to vowel+e. And the sz ligature to 'ss'.
There are also some (!) French comments in amigaaux.d. I assume it
won't do too much damage to translate "Je préfère" (those e's have
accents over them, in case your mail reader doesn't render latin-1) to
"Je prefere" (no accents). I'm sure the Academy would disapprove but
what are you going to do?

-Peter

-- 
Peter Seibel pe...@ja... |
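
[Editorial sketch, not part of the original message.] The bit shuffling in the F6 => C3 B6 derivation can be written out directly. The sketch below handles only the two-byte UTF-8 range (U+0080 to U+07FF), which covers every Latin-1 character above 7-bit ascii; the function name `utf8_two_byte` is made up for illustration, and Python's own encoder is used as a cross-check.

```python
def utf8_two_byte(code_point: int) -> bytes:
    """Encode a code point in U+0080..U+07FF as two UTF-8 bytes."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("only the two-byte range is handled in this sketch")
    lead = 0b11000000 | (code_point >> 6)            # 110xxxxx: upper five bits
    trail = 0b10000000 | (code_point & 0b00111111)   # 10xxxxxx: lower six bits
    return bytes([lead, trail])


encoded = utf8_two_byte(0xF6)                 # o-with-diaeresis
print(encoded.hex(), [oct(b) for b in encoded])
print(encoded == "ö".encode("utf-8"))         # cross-check with the built-in encoder
print(encoded.decode("latin-1"))              # the two-character misreading
```

Decoding the result as Latin-1 in the last line shows why each German character appeared as a pair starting with A-with-tilde.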
From: Peter S. <pe...@ja...> - 2003-02-24 06:29:13
|
Peter Seibel <pe...@ja...> writes:

> So I'll decode the unicode back into latin-1 and then transliterate
> all the vowels with umlauts to vowel+e. And the sz ligature to 'ss'.
> There are also some (!) French comments in amigaaux.d. I assume it
> won't do too much damage to translate "Je préfère" (those e's have
> accents over them, in case your mail reader doesn't render latin-1)
> to "Je prefere" (no accents). I'm sure the Academy would disapprove
> but what are you going to do?

Okay, so with my new understanding of things, I reanalyzed the source
code. After I deleted some dead code from charstrg.d that happened to
have comments in it with characters that can't be converted to latin-1
we're left with only thirteen non-7-bit-ASCII characters in the .d
files. They are shown below along with their HEX value (in case they
don't show up properly for some folks). To the right of the '->' in
the table below is the string I propose to replace each character
with. If there's anyone who strongly objects to this plan, please
speak up now.

  ü (FC) (1078 instances) -> ue
  ä (E4) (766 instances)  -> ae
  ö (F6) (348 instances)  -> oe
  Ü (DC) (224 instances)  -> Ue
  ß (DF) (92 instances)   -> ss
  ¿ (BF) (19 instances)   -> ?
  é (E9) (5 instances)    -> e
  µ (B5) (2 instances)    -> micro
  Ö (D6) (2 instances)    -> Oe
  è (E8) (2 instances)    -> e
  « (AB) (1 instance)     -> "
  » (BB) (1 instance)     -> "
  û (FB) (1 instance)     -> u

-Peter

-- 
Peter Seibel pe...@ja... |
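
[Editorial sketch, not part of the original message.] The thirteen-entry replacement table proposed above maps naturally onto a translation table. This is a sketch only, not the utils/d2c implementation mentioned in the thread; the names `TRANSLITERATION` and `to_ascii` are hypothetical, and the input is assumed to be UTF-8.

```python
# Replacement strings taken from the table proposed in the message above.
TRANSLITERATION = {
    "ü": "ue", "ä": "ae", "ö": "oe",
    "Ü": "Ue", "Ö": "Oe", "ß": "ss",
    "¿": "?", "é": "e", "è": "e", "û": "u",
    "µ": "micro", "«": '"', "»": '"',
}
TABLE = str.maketrans(TRANSLITERATION)


def to_ascii(utf8_source: bytes) -> bytes:
    """Decode UTF-8, apply the table, and insist the result is 7-bit ascii.

    The final encode raises UnicodeEncodeError if any character outside
    the table survives, so unexpected input is not silently passed on.
    """
    return utf8_source.decode("utf-8").translate(TABLE).encode("ascii")


print(to_ascii("# Schließt eine Datei.".encode("utf-8")))
print(to_ascii("Je préfère".encode("utf-8")))
```

Failing loudly on unmapped characters is a design choice of this sketch; a converter could instead report them and continue.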
From: Bruno H. <ha...@il...> - 2003-02-24 21:55:52
|
Peter Seibel writes:
> It looks like a tool used when cross-compiling to OS2 to convert
> from latin-1 to whatever character set OS2 uses.

Right. And no one has cared about the OS/2 support for ca. 4 years.
You can just remove it (but please in a separate cvs commit from the .d
to .c conversion).

The source code is in UTF-8 now, not in ISO-8859-1. If you use emacs
or xemacs-21.5 (or vim 6 :-)) you've no problem with it.

> This would allow me to get rid of the infrastructure that
> sometimes runs the .d files through cv_lt_pc.

No developer these days uses the PC character set (CP437). So you can
just remove this preprocessor and accept the fact that OS/2 users will
see mojibake on their screen.

> There's another program in the os2 directory, cv_pc_lt.c. I don't see
> anywhere it's used. Am I missing it?

It's the converse. I used it to port back changes I had made in DOS
mode to the official sources. Move it to /dev/null.

Bruno |
From: Peter S. <pe...@ja...> - 2003-02-25 03:03:23
|
Bruno Haible <ha...@il...> writes:

> Peter Seibel writes:
> > It looks like a tool used when cross-compiling to OS2 to convert
> > from latin-1 to whatever character set OS2 uses.
>
> Right. And no one has cared about the OS/2 support for ca. 4 years.
> You can just remove it (but please in a separate cvs commit from the
> .d to .c conversion).

Do you mean, delete that whole os2/ dir and associated foo from
makemake.in?

> The source code is in UTF-8 now, not in ISO-8859-1. If you use emacs
> or xemacs-21.5 (or vim 6 :-)) you've no problem with it.

Yeah, I finally figured that out. Duh. So I've got the conversion
mechanism in place in utils/d2c to convert to ISO-8859-1 and from
there to 7-bit ascii (via transliterating). I can take that out of the
conversion process if it's considered a good thing to leave the source
in UTF-8. FWIW, xemacs 21.1--which until recently was the stable
xemacs release--shows the German comments as mojibake (nice word ;-)).
Dunno about 21.4.

Since we can now just as easily have it either way (actually, any of
three ways) would we rather have the source in:

  a) UTF-8
  b) ISO-8859-1
  c) 7-bit ascii.

-Peter

-- 
Peter Seibel pe...@ja... |
From: Dan K. <da...@ch...> - 2003-02-25 06:42:52
|
I definitely vote for 7-bit ascii. At worst an editor encountering an encoding it doesn't know will display garbage characters, but that's distracting and could be confusing. | Dan Knapp, Knight of the Random Seed | http://brain.mics.net/~dankna/ | ONES WHO DOES NOT HAVE TRIFORCE CAN'T GO IN. |
From: Bruno H. <ha...@il...> - 2003-02-25 13:07:32
|
Peter Seibel writes:
> Do you mean, delete that whole os2/ dir and associated foo from
> makemake.in?

Yes. And ask Jörg for the date when we can do the same excision for
the AmigaOS support.

> So I've got the conversion
> mechanism in place in utils/d2c to convert to ISO-8859-1 and from
> there to 7-bit ascii (via transliterating). I can take that out of the
> conversion process if it's considered a good thing to leave the source
> in UTF-8.

Yes please leave the source in UTF-8. It would be a shame if you
started to misspell Jörg's name on each occurrence, after the months
of work that I and the vim and xemacs people have put into proper
support of UTF-8. Also, maybe sometime Sam will want to put a Russian
joke into the sources :-)

> FWIW, xemacs 21.1--which until recently was the stable
> xemacs release--shows the German comments as mojibake (nice word ;-)).

xemacs 21.1 and 21.4 need the Mule-UCS add-on for the same capability,
like Emacs 20.

> Since we can now just as easily have it either way (actually, any of
> three ways) would we rather have the source in:
>
> a) UTF-8
> b) ISO-8859-1
> c) 7-bit ascii.

a) !!!

Bruno |
From: Bruno H. <ha...@il...> - 2003-02-25 13:14:31
|
Dan Knapp writes:
> I definitely vote for 7-bit ascii. At worst an editor encountering
> an encoding it doesn't know will display garbage characters, but
> that's distracting and could be confusing.

That's a poor attitude. Following this recommendation we would still
be in the 7-bit character world of 1983.

We are switching to C99 and having sources in UTF-8 because we _like_
these and want them to become more standard. Yes, be prepared to get
some editor to display garbage for them - but then go out and report a
bug for this editor so it gets fixed.

Testing the limits of technology means stretching the limits of
technology.

Bruno |
From: Randall R S. <rrs...@cr...> - 2003-02-25 15:58:37
|
Bruno,

At 05:13 2003-02-25, Bruno Haible wrote:
> Dan Knapp writes:
> > I definitely vote for 7-bit ascii. At worst an editor encountering
> > an encoding it doesn't know will display garbage characters, but
> > that's distracting and could be confusing.
>
> That's a poor attitude. Following this recommendation we would still
> be in the 7-bit character world of 1983.

Absolutely! Don't get stuck in the past.

> We are switching to C99 and having sources in UTF-8 because we _like_
> these and want them to become more standard. Yes, be prepared to get
> some editor to display garbage for them - but then go out and report a
> bug for this editor so it gets fixed.
>
> Testing the limits of technology means stretching the limits of
> technology.

I agree wholeheartedly.

> Bruno

Randall "ransom note" Schulz |
From: Peter S. <pe...@ja...> - 2003-02-25 18:45:52
|
Bruno Haible <ha...@il...> writes:

> Dan Knapp writes:
> > I definitely vote for 7-bit ascii. At worst an editor encountering
> > an encoding it doesn't know will display garbage characters, but
> > that's distracting and could be confusing.
>
> That's a poor attitude. Following this recommendation we would still
> be in the 7-bit character world of 1983.
>
> We are switching to C99 and having sources in UTF-8 because we
> _like_ these and want them to become more standard. Yes, be prepared
> to get some editor to display garbage for them - but then go out and
> report a bug for this editor so it gets fixed.
>
> Testing the limits of technology means stretching the limits of
> technology.

Fair enough. As long as I can axe the os2 directory, my problems are
solved. My next checkin (sometime later today probably) will take out
the charset conversion foo. And I guess I'll go download xemacs 21.5.
;-)

-Peter

-- 
Peter Seibel pe...@ja... |
From: Bruno H. <ha...@il...> - 2003-02-25 21:12:37
|
Peter Seibel writes: > I guess I'll go download xemacs 21.5. ;-) Quite unstable for me (i.e. dumps core nearly immediately). I'd try xemacs 21.4 + Mule-UCS or GNU emacs 21 instead. Bruno |