Re: [htdig-dev] HTML Translations

According to Iosif Fettich:
> Just a note to make whoever tackles the translation pitfalls aware of a
> problem:
> 
> It has been a long time since I worked out a patch to make htdig fit our
> needs. I never got the time to really get involved and put some serious
> work into this; however, the problem is still there, and if anyone does get
> involved, it may be worth being aware of it.
> 
> I'm speaking of Romanian as the language used in the indexed documents.
> 
> Since there are always problems with ISO-8859-2 chars, many users actually
> choose not to use them at all. As a consequence, spellings with accented
> chars and with their unaccented counterparts are used in mixed fashion.
> 
> The approach we took was to simply transform _all_ accented chars into their
> unaccented counterparts before indexing, and of course to do the same
> before searching.
> 
> Without the ability to do that, I'm afraid that our indexing wouldn't
> have been as successful as it proved to be.
> 
> It's true that, in this way, we cannot search for, say, only the
> accented words without showing the others, but users proved to be much
> more tolerant of getting a few extra (slightly off) hits than of not
> getting the relevant ones...
> 
> Even if kept only as an option, the possibility to work like that 
> definitely should be present in future versions of htdig.

The problem is that transforming accented characters to unaccented is an
encoding-specific task, so to do even this much, htdig would need to be
aware of which encoding is used, and transform characters appropriately.
This takes us back to the idea of an accent_map attribute that would
allow users to specify the transformations they need, but that's a big
job that wouldn't integrate that neatly into the current code.  (Sigh!)
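
To illustrate why the mapping is encoding-specific, here is a rough sketch
in Python (not htdig's C++, and not an existing htdig feature). The names
ACCENT_MAP, build_table and strip_accents are made up for this example, and
the map format is only an assumption about what an accent_map attribute
might describe. The same map yields a different byte table for each 8-bit
encoding: byte 0xBA is s-cedilla in iso-8859-2 but the masculine ordinal
sign in iso-8859-1, so it must only be rewritten in the first case.

```python
# Hypothetical accent map for Romanian; the names and the format are
# illustrative assumptions, not anything htdig actually provides.
ACCENT_MAP = {
    "\u0103": "a", "\u0102": "A",  # a with breve
    "\u00e2": "a", "\u00c2": "A",  # a with circumflex
    "\u00ee": "i", "\u00ce": "I",  # i with circumflex
    "\u015f": "s", "\u015e": "S",  # s with cedilla
    "\u0163": "t", "\u0162": "T",  # t with cedilla
}


def build_table(accent_map, encoding):
    """Build a 256-entry byte translation table for one 8-bit encoding."""
    table = bytearray(range(256))  # start as the identity mapping
    for accented, plain in accent_map.items():
        try:
            source = accented.encode(encoding)
        except UnicodeEncodeError:
            continue  # this character does not exist in the encoding
        table[source[0]] = ord(plain)
    return bytes(table)


def strip_accents(data, table):
    """Replace accented bytes with their unaccented counterparts."""
    return data.translate(table)


latin2 = build_table(ACCENT_MAP, "iso8859_2")
word = "\u015ftiin\u0163\u0103".encode("iso8859_2")
print(strip_accents(word, latin2))  # b'stiinta'
```

The same table would have to be applied both when indexing and when
parsing the search words, so that both sides see the unaccented form.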

It would also be desirable to have a configurable list of entities to
decode to specific characters, rather than the current hard-coded set of
iso-8859-1 entities.  That would allow you to set htdig up to properly
use entities for iso-8859-2 or other encodings, converting the entities
to the 8-bit encoding of your choice.  At a bare minimum, we need an
attribute for enabling or disabling the decoding of SGML entities for
iso-8859-1 characters, in both 3.1.6 and 3.2.0b4.
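
As a rough illustration of the configurable entity list, here is a sketch
in Python (again, not htdig code). ENTITY_MAP, decode_entities and the
"enabled" switch are hypothetical names standing in for whatever attributes
might eventually be added. An entity is decoded only if it is in the
user-supplied list and its character exists in the chosen 8-bit encoding;
anything else is left untouched.

```python
import re

# Matches named entities such as "&eacute;" (numeric entities are ignored here).
ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")

# Hypothetical user-supplied entity list; a real one would come from the
# configuration file rather than being hard-coded.
ENTITY_MAP = {
    "eacute": "\u00e9",  # e with acute, present in iso-8859-1 and iso-8859-2
    "scaron": "\u0161",  # s with caron, present in iso-8859-2 only
}


def decode_entities(text, entity_map, encoding, enabled=True):
    """Decode listed entities that the target 8-bit encoding can represent."""
    if not enabled:
        return text  # decoding switched off: leave the text as it is

    def replace(match):
        char = entity_map.get(match.group(1))
        if char is None:
            return match.group(0)  # entity not in the configured list
        try:
            char.encode(encoding)
        except UnicodeEncodeError:
            return match.group(0)  # no such character in this encoding
        return char

    return ENTITY_RE.sub(replace, text)


sample = "caf&eacute; &scaron;koda &unknown;"
print(decode_entities(sample, ENTITY_MAP, "iso8859_2").encode("iso8859_2"))
```

With "iso8859_1" as the encoding, the same call would decode &eacute; but
leave &scaron; alone, since that character has no iso-8859-1 byte.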

-- 
Gilles R. Detillieux              E-mail: <gr...@sc...>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930