From: Gilles D. <gr...@sc...> - 2001-11-29 18:11:18
According to Iosif Fettich:
> Just an addition to make whoever tackles the translation pitfalls
> aware of the problem:
>
> It's been rather long since I worked out a patch to make htdig fit
> our needs. I never got the time to really get involved and put some
> serious work into this; however, the problem is still there, and if
> anyone does get involved, maybe it would be worth being aware of it.
>
> I'm speaking for Romanian as the language used in indexed documents.
>
> Since there are always problems with ISO-8859-2 chars, many users
> actually choose not to use them at all. As a consequence, spellings
> with accented chars and with their unaccented counterparts are used
> in a mixed fashion.
>
> The approach we took was to simply transform _all_ accented chars
> into their unaccented counterparts before indexing; the same, of
> course, before searching.
>
> Without the ability to do that, I'm afraid that our indexing wouldn't
> have been as successful as it proved to be.
>
> It's true that, in this way, we cannot search, for example, only for
> the accented words, not showing the others - but users proved to be
> much more tolerant of getting a few extra (slightly off) hits than of
> not getting the relevant ones...
>
> Even if kept only as an option, the possibility to work like that
> definitely should be present in future versions of htdig.

The problem is that transforming accented characters to unaccented
ones is an encoding-specific task, so to do even this much, htdig
would need to be aware of which encoding is used, and transform
characters appropriately. This takes us back to the idea of an
accent_map attribute that would allow users to specify the
transformations they need, but that's a big job that wouldn't
integrate that neatly into the current code. (Sigh!)

It would also be desirable to have a configurable list of entities to
decode to specific characters, rather than the current hard-coded set
of iso-8859-1 entities. That would allow you to set htdig up to
properly use entities for iso-8859-2 or other encodings, converting
the entities to the 8-bit encoding of your choice. At a bare minimum,
we need an attribute for enabling or disabling the decoding of SGML
entities for iso-8859-1 characters, in both 3.1.6 and 3.2.0b4.

-- 
Gilles R. Detillieux              E-mail: <gr...@sc...>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
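
To illustrate why accent folding is encoding-specific, here is a minimal
sketch in Python. It is not htdig code, and strip_accents is a
hypothetical helper invented for this example. It decodes 8-bit input
using a stated encoding, then drops combining marks. The same bytes fold
to different text depending on which encoding is assumed.

```python
import unicodedata


def strip_accents(raw: bytes, encoding: str) -> str:
    """Decode 8-bit input with the given encoding and drop accents.

    Hypothetical helper for illustration only; not part of htdig.
    """
    text = raw.decode(encoding)
    # NFD splits an accented letter into base letter + combining mark(s).
    decomposed = unicodedata.normalize("NFD", text)
    # Keep only characters that are not combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# The Romanian word "tara" with t-cedilla and a-breve, as ISO-8859-2 bytes.
raw = "\u0163ar\u0103".encode("iso-8859-2")

folded_latin2 = strip_accents(raw, "iso-8859-2")  # correct encoding
folded_latin1 = strip_accents(raw, "iso-8859-1")  # wrong guess

print(folded_latin2)
print(folded_latin1)
```

With the correct encoding, the word folds to plain ASCII. If the bytes are
assumed to be iso-8859-1, the first byte is read as the letter thorn, which
has no unaccented counterpart, so the indexed and searched forms would no
longer match.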