From: Gilles D. <gr...@sc...> - 2001-11-27 22:47:37
According to Ionut Nistor:
> On Sat, 2001-11-24 at 06:44, Geoff Hutchison wrote:
> > The point should be made here that the attributes are no longer as
> > significant (and indeed obsolete in 3.2.0bX and later) because htsearch is
> > now doing The Right Thing (TM) and decoding/encoding *all* SGML entities
> > as appropriate.
>
> Ah, great ! So from 3.2 no more translations in htdig, right ? Only
> escapings in htsearch.

Sorry, you're both wrong about this.

Ionut, what Geoff said is htdig decodes and encodes all SGML entities as
appropriate. That's something quite different from saying "no more
translations in htdig". The fact is htdig 3.2 decodes all the same SGML
entities as 3.1.x does, with the exception of &trade;. That doesn't solve
the problem you had originally reported.

Geoff, what 3.2 does is not "The Right Thing" either, as there are a few
remaining problems:

1) htdig only handles the 4 basic character entity references (ASCII
   characters) &lt;, &gt;, &amp;, and &quot;, as well as the ISO-8859-1
   characters. Other entities, such as Greek, math and other symbols, as
   well as other accented characters (e.g. &scaron;) are not converted,
   but the "&" in them is converted to "&amp;" by htsearch. This is a
   problem in 3.1 and 3.2.

2) Because other accented characters aren't converted, dealing with
   non-ISO-8859-1 accents is a problem for word matches. Even if the
   indexing system has working locales, and the source documents use the
   appropriate encoding, only accented characters encoded as single 8-bit
   characters will be matched in the word search. SGML-encoded characters
   in the source documents won't be treated as equivalent to their single
   character encodings. Again, this is a problem with both 3.1 and 3.2.

3) When using non-Latin-1 encodings, e.g. ISO-8859-2, htdig still
   translates entities like &eacute; to the ISO-8859-1 8-bit character,
   and it goes in the database that way. So, if displayed using a
   different encoding in htsearch (3.1.x) it won't display as an e with
   acute accent, but as whatever character has that same encoding in
   Latin 2. In 3.2, htsearch maps the upper half of the character set
   back to SGML entities, so this problem won't occur, but a much worse
   problem does occur - all properly encoded Latin 2 characters are
   mapped to Latin 1 SGML entities. This is still a big, unresolved bug
   in 3.2.

The "right thing" to do would be either to not decode SGML entities at
all, but somehow compensate for that in the word matching, or to decode
all standard or proposed entities UNAMBIGUOUSLY so that you can map them
back correctly in htsearch. This means not being limited to 256
characters in a single byte. htsearch would then have to be aware of the
encoding used on output, and map the characters to the correct single
character or SGML encoding as appropriate.

-- 
Gilles R. Detillieux              E-mail: <gr...@sc...>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
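
As a rough illustration of the second option described above (decode every
entity unambiguously, then let the output side choose between a single
character and an SGML entity depending on the output encoding), here is a
minimal sketch in Python. This is not ht://Dig code. The function names
decode_entities and encode_for_output are made up for this example, and it
assumes Python's standard html.entities tables as the entity list rather
than whatever table htdig itself uses.

```python
import re
from html.entities import codepoint2name, name2codepoint

# Matches &name;, &#123; and &#x7B; style references.
_ENTITY = re.compile(r"&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")


def decode_entities(text: str) -> str:
    """Decode entities to Unicode code points (hypothetical indexer side).

    Nothing is squeezed into one 8-bit charset, so &eacute; and &scaron;
    both keep a distinct, unambiguous value.
    """
    def repl(match):
        body = match.group(1)
        if body[0] == "#":
            cp = int(body[2:], 16) if body[1] in "xX" else int(body[1:])
        else:
            cp = name2codepoint.get(body)
        if cp is None or cp > 0x10FFFF:
            return match.group(0)  # unknown reference: leave it untouched
        return chr(cp)

    return _ENTITY.sub(repl, text)


def encode_for_output(text: str, charset: str) -> bytes:
    """Encode for a given output charset (hypothetical search side).

    A character the charset can represent is emitted as a single
    character; anything else falls back to an entity reference.
    """
    out = []
    for ch in text:
        if ch in '<>&"':
            # Markup-significant characters are always escaped.
            out.append(("&%s;" % codepoint2name[ord(ch)]).encode("ascii"))
            continue
        try:
            out.append(ch.encode(charset))
        except UnicodeEncodeError:
            name = codepoint2name.get(ord(ch))
            ref = "&%s;" % name if name else "&#%d;" % ord(ch)
            out.append(ref.encode("ascii"))
    return b"".join(out)


if __name__ == "__main__":
    stored = decode_entities("caf&eacute; &scaron;koda")
    print(encode_for_output(stored, "iso-8859-2"))  # both fit in Latin 2
    print(encode_for_output(stored, "iso-8859-1"))  # s-caron needs an entity
```

With this split, a Latin 2 page gets its accented characters back as Latin 2
bytes, and a Latin 1 page gets an entity for anything Latin 1 cannot
represent, instead of every upper-half character being mapped to a Latin 1
entity.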