[htdig-dev] Simple UTF-8 support patch

Hi,

I recently ran into the following problem: because we now use a CMS, some of
our pages are UTF-8 encoded. Since we are a German university, our pages may
contain German umlauts ;-) I use ht://Dig to index all servers on the campus.
The problem is/was that we cannot find words with umlauts on those UTF-8 pages.

First workaround: add accept-charset="ISO-8859-1" to the ht://Dig search form.
Now we can find words with umlauts on the old (non-UTF-8) pages, but not on the
new (UTF-8) pages.

Attached you'll find a patch that does a simple UTF-8 to 8-bit (ISO-8859-1)
conversion. All non-convertible characters are mapped to a question mark (?).
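
To illustrate the idea (this is not the code from the attached patch), here is
a minimal C++ sketch of such a conversion. The function name utf8_to_latin1 is
hypothetical, and the sketch assumes the target charset is ISO-8859-1: ASCII
bytes pass through, two-byte sequences for U+0080..U+00FF become one byte, and
everything else (including malformed input) becomes a question mark.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper for illustration only; not part of ht://Dig.
// Converts UTF-8 text to ISO-8859-1. Characters outside Latin-1 and
// malformed byte sequences are replaced by '?'.
std::string utf8_to_latin1(const std::string &in)
{
    std::string out;
    out.reserve(in.size());

    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char c = static_cast<unsigned char>(in[i]);

        if (c < 0x80) {                 // plain ASCII, copy unchanged
            out += static_cast<char>(c);
            ++i;
            continue;
        }

        // Expected sequence length, taken from the lead byte.
        std::size_t len = 0;
        if ((c & 0xE0) == 0xC0)      len = 2;
        else if ((c & 0xF0) == 0xE0) len = 3;
        else if ((c & 0xF8) == 0xF0) len = 4;

        if (len == 0) {                 // stray continuation or invalid byte
            out += '?';
            ++i;
            continue;
        }

        // Count how many bytes of the sequence are actually present.
        std::size_t n = 1;
        while (n < len && i + n < in.size() &&
               (static_cast<unsigned char>(in[i + n]) & 0xC0) == 0x80)
            ++n;

        if (n == len && len == 2 && (c == 0xC2 || c == 0xC3)) {
            // U+0080..U+00FF fits into a single Latin-1 byte.
            unsigned char c2 = static_cast<unsigned char>(in[i + 1]);
            out += static_cast<char>(((c & 0x03) << 6) | (c2 & 0x3F));
        } else {
            out += '?';                 // not representable or truncated
        }
        i += n;
    }
    return out;
}
```

With this sketch, the UTF-8 bytes C3 BC ("ü") become the single byte FC, while
a character such as the euro sign (E2 82 AC) collapses to one '?'.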

ReadBody may not be the best place to add this code (and it should be added to
ReadChunkedBody as well), but it was the easiest way to achieve my goal. Maybe
someone can give me a hint for a better place :-)

Comments welcome ....

Andreas

-- 
! Andreas Jobs                                 Network Operating Center !
!                                              Ruhr-Universitaet Bochum !
! The only way to clean a compromised system is to flatten and rebuild. !