From: Andreas J. <and...@ru...> - 2006-06-21 21:53:37
On Mon, Jun 19, 2006 at 03:14:54PM +0300, Kintzel Levente wrote:
> Hi!

Hi Levi.

> I know that htdig doesn't support UTF8 characters (only 8 bits characters).
> My question is that "doesn't support" what does it means exactly?
>
> That means that the search doesn't work well for characters with accents or
> special characters?

Yes. In other words: if you search for a word with accents or umlauts, you will not get hits for pages where these words are UTF-8 encoded.

> Or htdig cannot return the indexed pages with correct
> content if it contains UTF chars?

No. You get the hits, but the UTF-8 characters are interpreted as ISO-8859-x, so they look ugly.

> More exactly, my web pages contain UTF8 characters, and I want to user htdig
> for search. Let's suppose that it is OK if it doesn't search for accented
> characters, only for simple characters, but the returned pages contains bad
> characters. Where an UTF character was before, now there are two characters.
> Is it a consequence of the fact that htdig cannot handle UTF characters, or
> is it a configuration problem made by me?

I've written a patch for this problem. The patch checks whether the page is UTF-8 encoded by looking at the content-type meta tag. If so, all double-byte characters are converted to their 8-bit counterparts. All other characters are replaced by a question mark ("?"). The search form and the htdig templates (header, match, nomatch, etc.) must be 8-bit encoded. I've tested this with German umlauts, but it should also work for other 8-bit locales.

You can find the patch in the htdig patch archive:
ftp://ftp.ccsf.org/htdig-patches/3.2.0b6/UTF8.patch.0

Regards,
Andreas

-- 
! Andreas Jobs Network Operating Center !
! Ruhr-Universitaet Bochum !
! The only way to clean a compromised system is to flatten and rebuild. !
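
For readers who want to see the idea behind the conversion described in the message, here is a minimal Python sketch. It is not the code from UTF8.patch.0 (htdig itself is written in C++), and the names `META_CHARSET`, `is_utf8_page` and `to_8bit`, as well as the default target charset iso-8859-1, are assumptions made purely for illustration. It checks the content-type meta tag for a UTF-8 charset, converts characters that have an 8-bit counterpart, and replaces everything else with "?".

```python
import re

# Hypothetical names for illustration; this is NOT the code from UTF8.patch.0.
# Matches a <meta ... content="...; charset=utf-8"> declaration.
META_CHARSET = re.compile(
    rb'<meta[^>]+content\s*=\s*["\'][^"\']*charset\s*=\s*utf-?8',
    re.IGNORECASE,
)


def is_utf8_page(html: bytes) -> bool:
    """Return True if the content-type meta tag declares UTF-8."""
    return META_CHARSET.search(html) is not None


def to_8bit(html: bytes, target: str = "iso-8859-1") -> bytes:
    """Convert a UTF-8 page to an 8-bit charset (assumed: iso-8859-1).

    Characters with no counterpart in the target charset, and invalid
    UTF-8 sequences, end up as "?". Pages that do not declare UTF-8
    are returned unchanged.
    """
    if not is_utf8_page(html):
        return html
    text = html.decode("utf-8", errors="replace")
    return text.encode(target, errors="replace")


if __name__ == "__main__":
    page = (
        b'<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
        + "Grüße €".encode("utf-8")
    )
    print(to_8bit(page))
```

In this sketch the umlaut and the sharp s survive as single bytes, while the euro sign, which has no iso-8859-1 counterpart, becomes "?". A different 8-bit locale could be tried by passing another `target`, for example "iso-8859-2".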