From: <be...@ne...> - 2002-07-26 08:54:21
Hi all,

I was searching for a way to index OpenOffice Writer documents. I didn't find an answer to this question, so I put together a simple solution. I hope it is usable for others too, and I think some improvements are possible.

Do the following:

Add this line to your mime.types (/opt/www/conf):

    application/vnd.sun.xml.writer sxw

I had to do this so that htdig points to the right parser.

In your htdig.conf (/opt/www/conf) add the following:

    mime_types: /opt/www/conf/mime.types

Of course, you can also use the mime.types of your web server.

And in your external_parsers section:

    application/vnd.sun.xml.writer->text/plain /opt/www/bin/oounzip

I chose the name oounzip for my script; it could be anything else.

Now create the oounzip script and make it executable. The script has the following content:

    #!/bin/sh
    # Remove the result of the previous run; -f avoids an error if it is missing.
    rm -f /tmp/htdig/ooffice/content.xml
    # -q keeps unzip's status messages out of the indexed output.
    unzip -q "$1" content.xml -d /tmp/htdig/ooffice
    cat /tmp/htdig/ooffice/content.xml

It extracts only the file content.xml from the zipped sxw file, so only the body is indexed, not the header and so on. For me that is enough.

Please note that you have to use unzip; gunzip does not extract sxw files correctly. Also note that the XML is parsed as plain text, so unfortunately the XML tags are indexed as well, but I'm sure there is a solution for this. I have not looked at the xmlsearch package in the contrib section.

The last step is to create the directory ooffice in your /tmp/htdig directory (/tmp/htdig/ooffice).

I don't take any responsibility for any damage that occurs from using this howto. There should be no damage; we still have our data here :).

I hope it works for others too. Feedback is very welcome.

Greetings to everybody,

David Berger
NetMon GmbH
http://www.netmon.ch
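
One possible way around the XML tags ending up in the index is sketched below. It is only a sketch under a few assumptions: the function name strip_tags is made up for illustration, the tag removal is a plain sed substitution rather than a real XML parser (entities such as &amp; are not decoded, and a tag split across lines is not handled), and unzip -p is used to stream content.xml to standard output, so the temporary directory is not needed in this variant.

```shell
#!/bin/sh
# Hypothetical variant of oounzip that removes XML tags before htdig
# sees the text. Not part of the original howto.

# strip_tags: read XML on stdin, write the text without tags on stdout.
strip_tags() {
    # Replace every <...> tag with a space so neighbouring words stay
    # separated, then squeeze runs of spaces into one.
    sed -e 's/<[^>]*>/ /g' | tr -s ' '
}

# When called with a file name (htdig passes it as the first argument),
# stream content.xml out of the archive and strip the tags.
if [ "$#" -ge 1 ]; then
    unzip -p "$1" content.xml | strip_tags
fi
```

To try it, the script would be saved in place of oounzip and called by hand with an sxw file as its only argument; the output should be the document text without markup.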