Re: [htdig-dev] HTML Translations

According to Ionut Nistor:
> On Sat, 2001-11-24 at 06:44, Geoff Hutchison wrote:
> > The point should be made here that the attributes are no longer as
> > significant (and indeed obsolete in 3.2.0bX and later) because htsearch is
> > now doing The Right Thing (TM) and decoding/encoding *all* SGML entities
> > as appropriate.
> Ah, great ! So from 3.2 no more translations in htdig, right ? Only
> escapings in htsearch.

Sorry, you're both wrong about this.  Ionut, what Geoff said is that
htsearch decodes and encodes all SGML entities as appropriate.  That's
something quite different from saying "no more translations in htdig".  The fact
is htdig 3.2 decodes all the same SGML entities as 3.1.x does, with the
exception of &trade;.  That doesn't solve the problem you had originally
reported.

Geoff, what 3.2 does is not "The Right Thing" either, as there are
a few remaining problems:

1) htdig only handles the 4 basic character entity references (ASCII
characters) &lt;, &gt;, &amp;, and &quot;, as well as the ISO-8859-1
characters.  Other entities, such as Greek, math and other symbols,
as well as other accented characters (e.g. &scaron;) are not converted,
but the "&" in them is converted to "&amp;" by htsearch.  This is a
problem in 3.1 and 3.2.

2) Because other accented characters aren't converted, dealing with
non-ISO-8859-1 accents is a problem for word matches.  Even if the
indexing system has working locales, and the source documents use the
appropriate encoding, only encoded accented characters will be matched
in the word search.  SGML-encoded characters in the source documents
won't be treated as equivalent to their single character encodings.
Again, this is a problem with both 3.1 and 3.2.

3) When using non-Latin-1 encodings, e.g. ISO-8859-2, htdig still
translates entities like &eacute; to the ISO-8859-1 8-bit character,
and it goes in the database that way.  So, if displayed using a different
encoding in htsearch (3.1.x) it won't display as an e with acute accent,
but as whatever character has that same encoding in Latin 2.  In 3.2,
htsearch maps the upper-half of the character set back to SGML entities,
so this problem won't occur, but a much worse problem does occur - all
properly encoded Latin 2 characters are mapped to Latin 1 SGML entities.
This is still a big, unresolved bug in 3.2.
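
To make problems 1 and 3 concrete, here is a rough Python model of the
behaviour described above.  It is not ht://Dig code: the function names
are made up for illustration, and Python's html.entities tables stand in
for whatever tables htdig and htsearch actually use.

```python
import html
from html.entities import codepoint2name, name2codepoint


def index_entity_as_latin1(name: str) -> bytes:
    """Model of the indexer: decode a named entity to one ISO-8859-1 byte.

    Hypothetical helper. Entities outside Latin 1 cannot be stored this way.
    """
    code = name2codepoint[name]
    if code > 0xFF:
        raise ValueError(f"&{name}; does not fit in a single Latin-1 byte")
    return bytes([code])


def upper_half_to_latin1_entities(data: bytes) -> str:
    """Model of the 3.2 output step: map bytes >= 0xA0 to Latin-1 entities.

    Hypothetical helper. It assumes every stored byte is ISO-8859-1,
    which is exactly the assumption that breaks Latin-2 documents.
    """
    parts = []
    for byte in data:
        name = codepoint2name.get(byte) if byte >= 0xA0 else None
        parts.append(f"&{name};" if name else chr(byte))
    return "".join(parts)


# Problem 1: an entity that was never decoded gets its "&" escaped on output.
print(html.escape("&scaron;"))

# Problem 3, 3.1.x flavour: &otilde; is stored as byte 0xF5, and a page
# served as ISO-8859-2 renders that byte as a different letter.
print(index_entity_as_latin1("otilde").decode("iso-8859-2"))

# Problem 3, 3.2 flavour: a correctly encoded Latin-2 s-caron (byte 0xB9)
# is turned into the Latin-1 entity for that byte value.
print(upper_half_to_latin1_entities("\u0161".encode("iso-8859-2")))
```

In this model the first line prints &amp;scaron;, the second prints
o with double acute (Latin 2's 0xF5) rather than o with tilde, and the
third prints &sup1; in place of the s-caron.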

The "right thing" to do would be to either not decode SGML entities
at all, but somehow compensate for that in the word matching, or to
decode all standard or proposed entities UNAMBIGUOUSLY so that you can
map them back correctly in htsearch.  This means not being limited to
256 characters in a single byte.  htsearch would then have to be aware
of the encoding used on output, and map the characters to the correct
single character or SGML encoding as appropriate.
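
The second option can be sketched as follows, again in Python and only
as an illustration of the idea, not as a patch for htsearch.  It assumes
the index holds full Unicode text (html.unescape stands in for an
unambiguous decoder) and that the output charset is known; the name
encode_for_output is made up for this example.

```python
import html
from html.entities import codepoint2name

# Characters that must always be escaped in HTML output.
MARKUP = {"<": "&lt;", ">": "&gt;", "&": "&amp;", '"': "&quot;"}


def encode_for_output(text: str, charset: str) -> bytes:
    """Encode decoded text for a page served in `charset`.

    Characters the charset can represent become single bytes; anything
    else falls back to a named entity, or a numeric reference if the
    character has no name in Python's HTML 4 entity table.
    """
    out = bytearray()
    for ch in text:
        if ch in MARKUP:
            out += MARKUP[ch].encode("ascii")
            continue
        try:
            out += ch.encode(charset)
        except UnicodeEncodeError:
            name = codepoint2name.get(ord(ch))
            ref = f"&{name};" if name else f"&#{ord(ch)};"
            out += ref.encode("ascii")
    return bytes(out)


# Decode every entity to an unambiguous character first ...
text = html.unescape("&scaron;koda caf&eacute; &otilde;")

# ... then choose bytes or entities per output encoding.
print(encode_for_output(text, "iso-8859-2"))
print(encode_for_output(text, "iso-8859-1"))
```

With Latin 2 output the s-caron and e-acute come out as single bytes and
only the o-tilde stays an entity; with Latin 1 output it is the s-caron
that falls back to &scaron;.  Either way nothing is lost or remapped to
the wrong character.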

-- 
Gilles R. Detillieux              E-mail: <gr...@sc...>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930