From: Alex R. <sh...@al...> - 2004-08-11 03:52:39
Greg,

On Tue, Aug 10, 2004 at 07:02:39PM -0700, Greg Kuperberg wrote:
> On Tue, Aug 10, 2004 at 08:12:39PM -0500, Alex Roitman wrote:
> > Please do. I'd be curious to see how this is done. Can it also take
> > care of cyrillic, chinese, and other non-latin charsets?
>
> I will send it in the next message. You could easily extend the filter
> to Cyrillic, although it is intended more strictly for de-accenting
> than for transliteration. Transliterating Hebrew is pretty much
> hopeless because the vowels are missing. Japanese is even worse, and I
> suspect that Chinese and Arabic would also be hard.
>
> [attachment]

Thanks for the code. I guess I was not careful in reading your previous
message :-) I can see that there are dictionaries mapping the unicode
characters to either de-accented characters or the TeX commands for
those characters.

Now, trying to think where exactly we can use these tools (please point
me in the right direction if I'm missing the obvious):

1. We don't really want to de-accent letters anywhere, do we? It seems
   innocuous for a mostly latin text to have an occasional "apres"
   instead of "après", but it would be totally wrong for French text.
   Even worse, in some languages the presence/absence of an umlaut can
   completely change pronunciation and/or meaning. It seems that if the
   user entered the non-ascii data then it should be preserved as such,
   in both screen output and reports.

   This IMHO goes for all report formats, including plain text and PDF.
   Now, there's a problem with the reportlab-generated PDFs (the PDF
   format option) when the text is not in iso-8859-1. We even
   contemplated removing this format in favor of the gnomeprint-based
   one, as it also has some other drawbacks. But the iso-8859-1 users
   (who make heavy use of accents, cedillas, and umlauts, btw) wanted to
   keep an option for lean PDFs. These PDFs use standard PS fonts, so no
   font information has to be embedded in the file, but this only
   supports the iso-8859-1 charset.
2. As for the TeX commands, they would be suitable for the LaTeX output
   format -- except that we are using the utf8 package shipped with
   teTeX, which can do that for us :-)

3. Back to the first point. The gnome-print plugin is really the proper
   way for the majority of the users. The more advanced people can live
   just fine with LaTeX. The persistent ones can live with OOo and
   export into PDF from within OOo. But gnome-print is integrated with
   the rest of the desktop, can generate a nice preview, and supports
   unicode without any effort on our part. All we need to do is to use
   fonts which cover the characters found in the text. The freefont
   package covers most of the UCS, is freely available, and is easy to
   install. I don't really see a problem in telling the users "install
   package X to have feature Y" if X is available. On Debian,
   ttf-freefont is in the Recommends field (or should be, anyway :-).
   We might provide a better message -- e.g. a dialog instead of
   console output.

Is there any good use case for de-accenting non-ascii letters? I have
to confess that I myself do not use non-ascii and am likely unaware of
some subtleties. I'd be happy to learn :-)

> I don't know if you feel loyal to conventional graph theory terms;
> if you do, "connected components" is a better term than "partition".
> Yes it is a partition, but that is a general term that denotes an
> arbitrary grouping.

Connected components does sound better to me.

Alex

--
Alexander Roitman   http://ebner.neuroscience.umn.edu/people/alex.html
Dept. of Neuroscience, Lions Research Building
2001 6th Street SE, Minneapolis, MN 55455
Tel (612) 625-7566   FAX (612) 626-9201
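[Editor's note: for readers without the attachment, here is a minimal sketch of the kind of de-accenting filter discussed above. It is an assumption on my part that a decomposition-based approach is comparable to Greg's filter, which per the message uses explicit dictionaries mapping unicode characters to de-accented ones. The sketch relies only on Python's standard unicodedata module.]

```python
# De-accenting via NFD decomposition: accented letters decompose
# into a base letter plus combining marks, which we then drop.
# Limitation: characters that are not "base + combining mark"
# (e.g. the ligature "æ" or German "ß") are left unchanged and
# would still need an explicit mapping table.
import unicodedata

def deaccent(text):
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed
                   if not unicodedata.combining(ch))

print(deaccent("après"))  # -> apres
print(deaccent("Noël"))   # -> Noel
```

As the message argues, such a filter would be wrong to apply to user-entered text in reports; it is only appropriate as a fallback for output targets that cannot represent the characters at all.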