From: Andreas J. <and...@ru...> - 2005-04-27 21:50:55
Hi,

I recently had the following problem: due to the use of a CMS, some of our pages are now UTF-8 encoded. Since we are a German university, our pages may contain German umlauts ;-)

I use ht://Dig to index all servers on the campus. The problem is/was that we cannot find words with umlauts on those UTF-8 pages.

First workaround: add accept-charset="ISO-8859-1" to the ht://Dig search form. Now we can find words with umlauts on the old (non-UTF-8) pages, but not on the new (UTF-8) pages.

Attached you'll find a patch that does a simple UTF-8 to 8-bit conversion. All non-convertible characters are mapped to a question mark (?).

ReadBody may not be the best place to add this code (and it should be added to ReadChunkedBody as well), but it was the easiest way to achieve my goal. Maybe someone can give me a hint for a better place :-)

Comments welcome ....

Andreas

--
! Andreas Jobs                     Network Operating Center !
! Ruhr-Universitaet Bochum                                  !
! The only way to clean a compromised system is to flatten and rebuild. !
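
For readers without access to the attachment, the kind of conversion described in the message above might look roughly like the following sketch. It is not the attached patch. It assumes the 8-bit target is ISO-8859-1 (the charset named for the search form), and the function name utf8_to_latin1 is hypothetical and not part of ht://Dig.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, NOT taken from the attached patch.
// Converts UTF-8 input to ISO-8859-1 (assumed target charset).
// Anything that cannot be represented, or is malformed, becomes '?'.
std::string utf8_to_latin1(const std::string &in)
{
    std::string out;
    out.reserve(in.size());

    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char c = static_cast<unsigned char>(in[i]);

        if (c < 0x80) {                    // plain ASCII: copy through
            out += static_cast<char>(c);
            ++i;
            continue;
        }

        // Work out the sequence length from the lead byte.
        std::size_t len = 0;
        unsigned long cp = 0;
        if ((c & 0xE0) == 0xC0)      { len = 2; cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; }
        else {                             // stray continuation or invalid lead byte
            out += '?';
            ++i;
            continue;
        }

        // Fold in the continuation bytes (each must look like 10xxxxxx).
        std::size_t j = 1;
        for (; j < len && i + j < in.size(); ++j) {
            unsigned char cc = static_cast<unsigned char>(in[i + j]);
            if ((cc & 0xC0) != 0x80)
                break;
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (j < len) {                     // truncated or malformed sequence
            out += '?';
            i += j;
            continue;
        }
        i += len;

        // Only U+0080..U+00FF map directly onto ISO-8859-1 bytes;
        // overlong forms and everything above U+00FF become '?'.
        if (cp >= 0x80 && cp <= 0xFF)
            out += static_cast<char>(static_cast<unsigned char>(cp));
        else
            out += '?';
    }
    return out;
}
```

With this sketch, the UTF-8 bytes C3 BC (u-umlaut) come out as the single byte FC, while a character outside ISO-8859-1, such as the euro sign, comes out as a question mark. Where such a helper would best be called inside ht://Dig is left open here, just as in the message.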