Re: [htdig-dev] HTML Translations

Just an addition, to make whoever tackles the translation pitfalls 
aware of a problem:

It's been rather long since I worked out a patch to make htdig fit our 
needs. I never found the time to really get involved and put some serious 
work into this; however, the problem is still there, and if anyone does 
get involved, it may be worth being aware of it.

I'm speaking of Romanian as the language used in the indexed documents.

Since there are always problems with ISO-8859-2 chars, many users actually 
choose not to use them at all. As a consequence, spellings with accented 
chars and with their unaccented counterparts are used in mixed fashion.

The approach we took was to simply transform _all_ accented chars into 
their unaccented counterparts before indexing; the same, of course, is 
done to the query before searching.
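
As a rough illustration of that approach (this is not the actual patch 
and not htdig code, just a minimal Python sketch; the names fold_accents, 
build_index and search are made up for the example, and it assumes the 
text has already been decoded from ISO-8859-2 to Unicode):

```python
import unicodedata


def fold_accents(text):
    """Replace accented chars with their unaccented counterparts."""
    # NFKD splits e.g. "a with breve" into "a" plus a combining mark;
    # dropping the combining marks leaves only the base letters.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_index(docs):
    """Toy inverted index: folded, lowercased word -> set of document ids."""
    index = {}
    for doc_id, text in docs.items():
        for word in fold_accents(text).lower().split():
            index.setdefault(word, set()).add(doc_id)
    return index


def search(index, query):
    """Fold the query word the same way the documents were folded."""
    return index.get(fold_accents(query).lower(), set())
```

The important point is only that the very same folding is applied on both 
sides, so that an accented and an unaccented spelling of a word end up 
under the same index key.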

Without the ability to do that, I'm afraid that our indexing wouldn't 
have been as successful as it proved to be.

It's true that, this way, we cannot search for, say, only the accented 
spelling of a word while leaving out the others - but users proved to be 
much more tolerant of getting a few extra (slightly off) hits than of 
not getting the relevant ones...

Even if kept only as an option, the possibility to work like that 
should definitely be present in future versions of htdig.

Thank you.

Iosif Fettich