From: Gilles D. <gr...@sc...> - 2003-01-14 17:50:15
According to Schallehn Volker:
> We are running ht://Dig 3.1.6. The Website uses unicode characters for
> displaying german umlauts etc. What htdig does is to transform a unicode
> character for example "&#x00E4;" into "&amp;#x00E4;" We tried translate_amp
> with both options "true" and "false", but without success. Is there any way
> to prevent the "&"-character to be translated to "&amp;"?

Right now, the only way is to change the code in encodeSGML() so it
doesn't change the "&" to "&amp;", around lines 980-981 in
htsearch/Display.cc. That will solve this problem, but it may introduce
a new one, as there may be contexts in your documents in which the "&"
needs to be retranslated into "&amp;".

The fundamental problem here is that when htdig indexes a document, it
doesn't clearly distinguish between entities it converts and those it
doesn't. E.g., an HTML guide may say something like:

You can encode the &lt; character as &amp;lt;, and encode &#x00E4; as &amp;#x00E4;.

When that is indexed by htdig, the excerpt stored in the database will be:

You can encode the < character as &lt;, and encode &#x00E4; as &#x00E4;.

Now, htsearch changes this to...

You can encode the &lt; character as &amp;lt;, and encode &amp;#x00E4; as &amp;#x00E4;.

... but with the modification above to encodeSGML it would output this
(as HTML)...

You can encode the &lt; character as &lt;, and encode &#x00E4; as &#x00E4;.

Ultimately, the proper fix might be to change htdig such that any &
character it encounters that's not part of an entity it changes would
go into the database as some other unique character, which htsearch
could then convert back into the & character without forcing the
conversion to &amp;. That's a bit more involved a change, though.

-- 
Gilles R. Detillieux              E-mail: <gr...@sc...>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)
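
The round trip described in the message can be reproduced with a small Python sketch. It is only a toy model, not ht://Dig code: `index_excerpt`, `encode_sgml`, the `KNOWN` entity table and the `PASSTHROUGH_AMP` placeholder are hypothetical names chosen for illustration, and the real logic lives in C++ (encodeSGML() in htsearch/Display.cc). The sketch assumes the indexer decodes only a handful of named entities and leaves anything else, such as `&#x00E4;`, untouched. The `mark=True` path is one possible reading of the fix suggested at the end of the message: an "&" that does not belong to a converted entity is stored as a placeholder, so the search side can restore it as a bare "&" while still encoding every other "&" as "&amp;".

```python
import re

# Hypothetical placeholder for an "&" that the indexer left unconverted.
# U+0001 is an assumption; any character that never occurs in indexed text works.
PASSTHROUGH_AMP = "\x01"

# Entities this toy indexer decodes (assumed list, not ht://Dig's actual table).
KNOWN = {"&amp;": "&", "&lt;": "<", "&gt;": ">", "&quot;": '"'}


def index_excerpt(html_text, mark=False):
    """Decode known entities in one pass, as the indexer would.

    With mark=True, an "&" that is not part of a decoded entity is stored
    as PASSTHROUGH_AMP instead of "&".
    """
    def repl(match):
        ent = match.group(0)
        if ent in KNOWN:
            return KNOWN[ent]
        # Unknown entity or bare "&": keep it, optionally marking the "&".
        return (PASSTHROUGH_AMP + ent[1:]) if mark else ent

    return re.sub(r"&(?:#?\w+;)?", repl, html_text)


def encode_sgml(stored, encode_amp=True):
    """Re-encode a stored excerpt for HTML output, as the search side would.

    encode_amp=False models the patch discussed in the message.
    Quote handling is omitted to keep the sketch short.
    """
    out = stored
    if encode_amp:
        out = out.replace("&", "&amp;")  # must run before "<" and ">"
    out = out.replace("<", "&lt;").replace(">", "&gt;")
    # Placeholders, if any, go back to a bare "&" without re-encoding.
    return out.replace(PASSTHROUGH_AMP, "&")


SOURCE = ("You can encode the &lt; character as &amp;lt;, "
          "and encode &#x00E4; as &amp;#x00E4;.")

stored = index_excerpt(SOURCE)
print(stored)                                   # excerpt in the database
print(encode_sgml(stored))                      # stock output
print(encode_sgml(stored, encode_amp=False))    # patched output
print(encode_sgml(index_excerpt(SOURCE, mark=True)) == SOURCE)
```

In this model the stock path turns the umlaut entity into visible text, the patched path collapses the literal "&amp;lt;" into "&lt;", and only the placeholder path reproduces the original HTML. Whether a single reserved character is safe depends on what the real database format allows, which this sketch does not address.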