From: Gilles D. <gr...@sc...> - 2001-12-20 20:25:15
According to Roman Maeder:
> Matthias.Emmert2@START.de said:
> > After some code-walking -thanks to open source!- I recognized that a
> > "locale: de_DE.ISO8859-1" in the htdig.conf file will help. I was
> > just fondling around with the LC_* environment vars before.

You shouldn't have to resort to walking through the code for this.
There is documentation:

    http://www.htdig.org/attrs.html#locale
    http://www.htdig.org/FAQ.html#q5.8
    http://www.htdig.org/FAQ.html#q4.10

> > I suggest for upcoming htdig versions to introduce a
> > "setlocale(LC_ALL, "");" in the beginning of the htdig/htsearch mains,
> > since this would set the program locale according the env vars ;-)
>
> hmm, this would make it hard to ensure that the environment used
> for digging (probably under cron) and for searching (under your http server)
> are the same. The config file seems to me better place to specify
> collation and character class info.

I think what Matthias was suggesting was to use the environment
variables initially, to set the default locale, which could then be
overridden with the config attribute. That's how I interpreted it, as
he never said anything about removing support for the locale attribute.
This way, you can use either technique, which I think is a good idea.
In htdig, though, the code would have to set LC_TIME back to "C" after
setting LC_ALL, so that If-Modified-Since headers come out in the
standard format and not a locale-dependent one.

> Also, for efficiency, you probably always want to use the "C" locale
> for collation. It doesn't really matter what collation sequence you
> use as long as the one used for building the index is the same as the
> one used for searching it.

For htmerge, it's indeed very important to set LC_COLLATE to C if you're
in a different locale, provided your system "sort" program is
locale-aware. Otherwise, htmerge in version 3.1.5 and older has a
problem handling accented characters.
In 3.1.6, htmerge is fixed so that it no longer loses words in the word
database when the wordlist is sorted according to a different locale,
but it will run slower and produce a bigger database than if you sort
using the C locale. The rundig script in 3.1.6 sets LC_COLLATE
correctly, but if you run htmerge from other scripts, you should take
care to do likewise.

Actually, the same goes for a lot of shell scripts that use the sort
program, either directly or indirectly. I'm in the process of migrating
some stuff from Red Hat 4.2 to 7.2, and many of my shell scripts are
breaking because Red Hat's sort program has been locale-aware since 6.x.
I still contend that making sort locale-aware by default was a really
bad design decision for this very reason. This feature should have been
enabled by a command-line option, somewhat akin to the -f option,
because conceptually the two are quite similar (accented characters are
"folded" into non-accented counterparts).

The 3.2 versions of htdig will be immune to LC_COLLATE changes, as they
don't use an external sort program. The DB package doesn't use
LC_COLLATE.

-- 
Gilles R. Detillieux              E-mail: <gr...@sc...>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
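
For reference, the startup sequence discussed in this message (take the default locale from the environment, let the config attribute override it, and set LC_TIME back to "C" each time) might be sketched in C roughly as follows. This is only an illustration and not code from the htdig source: the function names init_locale_from_env and apply_config_locale are hypothetical, and it assumes the config attribute would be applied through LC_ALL.

```c
#include <locale.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper (not from the ht://Dig source): take the default
 * locale from the environment, then pin LC_TIME back to "C".
 * Returns 0 on success, -1 if the environment names an unsupported locale.
 */
int init_locale_from_env(void)
{
    int status = 0;

    /* "" means: consult LC_ALL, the individual LC_* variables, and LANG. */
    if (setlocale(LC_ALL, "") == NULL) {
        /* The request failed, so the locale in effect is unchanged. */
        fprintf(stderr, "warning: locale from environment not supported\n");
        status = -1;
    }

    /* Keep date strings (e.g. from strftime) in the standard English form. */
    setlocale(LC_TIME, "C");
    return status;
}

/*
 * Hypothetical override step: apply a locale name read from the config
 * file, if any. Passing NULL or "" keeps the environment default.
 * Assumption: the attribute is applied to LC_ALL.
 */
int apply_config_locale(const char *name)
{
    if (name == NULL || name[0] == '\0')
        return 0;
    if (setlocale(LC_ALL, name) == NULL)
        return -1;               /* unknown locale name; nothing changed */
    setlocale(LC_TIME, "C");     /* LC_ALL reset LC_TIME too, so pin it again */
    return 0;
}
```

A main() would call init_locale_from_env() first and apply_config_locale() once the config file has been read, so that the attribute wins whenever both are set. Note that LC_TIME has to be reset after every LC_ALL change, since setting LC_ALL replaces all categories at once.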