From: Gilles D. <gr...@sc...> - 2001-10-09 21:29:14
According to Thomas Netousek:
> I am running htdig-3.2.0-0.b3.4 from the RedHat linux 7.1 distribution and
> I am indexing documents
> which have all types of funny characters like e.g. single quotes spelled
> as ’
>
> I have seen other reports about the parser failing for &amp, so I am
> wondering if this could
> be sort of a similar problem ?
>
> Btw, I am also running htdig-3.1.5 on another machine with translate_...
> set to true and it works
> like a charm there.

I believe 3.2.0b3 will not translate numeric entities where the number
is larger than 255. 3.1.5 does, but that's a bug: it only uses 8-bit
characters internally, so it keeps only the bottom 8 bits of the number.

Because in 3.2.0b3 the numeric entity isn't converted, the "&" goes into
the excerpt literally. On output it's turned into an &amp; entity so that
it displays literally as "&", which is why you see the numeric entity
itself in the results.

Given the 8-bit character set limitations in both 3.1 and 3.2, I think
that 3.2's behaviour is the lesser of two evils when it comes to handling
numeric entities above 255.

-- 
Gilles R. Detillieux              E-mail: <gr...@sc...>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
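
To make the difference between the two behaviours concrete, here is a rough Python sketch. It is not ht://Dig code: the function names are made up for illustration, and the entity &#8217; (a right single quote) is only an assumed example of a numeric entity above 255. It models the behaviour described above, nothing more.

```python
import re

# Matches decimal numeric character references such as &#233; or &#8217;
NUMERIC_ENTITY = re.compile(r"&#(\d+);")


def translate_keep_low_byte(text):
    """Hypothetical model of the 3.1.5 behaviour: every numeric entity is
    translated, but only the bottom 8 bits of the number survive."""
    return NUMERIC_ENTITY.sub(lambda m: chr(int(m.group(1)) & 0xFF), text)


def translate_skip_above_255(text):
    """Hypothetical model of the 3.2.0b3 behaviour: entities above 255
    are left untouched in the text."""
    def replace(match):
        number = int(match.group(1))
        return chr(number) if number <= 255 else match.group(0)
    return NUMERIC_ENTITY.sub(replace, text)


def escape_for_output(text):
    """Simplified output step: a literal '&' is written as '&amp;'."""
    return text.replace("&", "&amp;")


sample = "it&#8217;s"

# 8217 is 0x2019, so keeping the low byte gives 0x19, a control character.
print(repr(translate_keep_low_byte(sample)))

# The entity is kept as-is, then its '&' is escaped on output, so a
# browser shows the entity text itself rather than a quote.
print(escape_for_output(translate_skip_above_255(sample)))
```

In the first case the text silently ends up with the wrong character; in the second the entity stays visible in the excerpt, which is ugly but at least not misleading.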