From: Gilles D. <gr...@sc...> - 2001-11-22 16:45:38
According to Bernier, Melanie:
> > I have installed htdig and I have a little problem with German Umlaut. I
> > can search for words with Umlaut without any problem. When I search for
> > say 'C34644' (a file containing Umlaut), the results from htdig comes back
> > with strange characters instead of Umlaut (for example, I get a circle (Ø)
> > instead of ü, or I get a bit Ä instead of a small ä), and it seems to
> > return that kind of results only for word documents. What could be the
> > problem?

The problem is MS Word doesn't use ISO-8859-1 (Latin 1) encoding for
characters with accents. The doc2html.pl script uses catdoc to decode
the Word documents into plain text, which works fine for ASCII text,
but when accents are involved it doesn't automatically map to the
encoding you want. With catdoc, you have -s and -d options to specify
the source and destination character sets. I've found that by using

    catdoc -scp1250 -d8859-1 file.doc

I can get accents to come out correctly on one of the few Word documents
I have that contain accents. This document happens to use cp1250 as its
internal character set. You may need to experiment to find the right
options for your documents. When you figure out the right options, you
can put them into the command line for catdoc in doc2html.pl.

-- 
Gilles R. Detillieux              E-mail: <gr...@sc...>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
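
The "experiment to find the right options" step from the reply can be scripted. What follows is only a rough sketch: the function name try_charsets and the CATDOC override variable are made up for illustration and are not part of catdoc or ht://Dig. It runs catdoc once per candidate source charset, always with -d8859-1 as in the command quoted in the reply, and prints the first few lines of each result so you can see which one renders the umlauts correctly.

```shell
# Hypothetical helper: try_charsets FILE CHARSET...
# Decodes FILE once per candidate source charset using the -s and -d
# options described in the reply, and prints the first five lines of each
# attempt for visual comparison.
# CATDOC is an assumed override for the catdoc binary (defaults to "catdoc").
try_charsets() {
    doc=$1
    shift
    for cs in "$@"; do
        printf '== source charset: %s ==\n' "$cs"
        "${CATDOC:-catdoc}" -s"$cs" -d8859-1 "$doc" | head -n 5
    done
}
```

A call might look like "try_charsets file.doc cp1250 cp1252". Only cp1250 comes from the reply; any other charset name is an assumption and depends on which charset tables your catdoc installation provides. Once one candidate looks right, that -s value is the one to put on the catdoc command line in doc2html.pl.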