[htdig] Indexing OpenOffice-Documents

Hi all

I was looking for a way to index OpenOffice Writer documents. I didn't find an answer to this question, so I put together a simple solution. I hope it is useful for others too, and I think there is room for improvement.

Do the following:

Add the following line to your mime.types (in /opt/www/conf):

application/vnd.sun.xml.writer sxw

I had to do this so that htdig picks the right parser.

In your htdig.conf (/opt/www/conf), add the following:

mime_types: /opt/www/conf/mime.types

Of course, you can also use your web server's mime.types.

Then, in your external_parsers section:

application/vnd.sun.xml.writer->text/plain /opt/www/bin/oounzip

I chose the name oounzip for my script; it could be anything else.

Now create the oounzip script and make it executable. The script has the following content:

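#!/bin/sh
# htdig passes the path of the fetched document as the first argument.

# Remove the result of the previous run; -f keeps rm quiet if there is none.
rm -f /tmp/htdig/ooffice/content.xml
# Extract only content.xml; -q keeps unzip's messages out of the output.
unzip -q "$1" content.xml -d /tmp/htdig/ooffice
cat /tmp/htdig/ooffice/content.xml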

It extracts only content.xml from the zipped sxw file, so only the body is indexed, not the header and so on. For me that's enough. Please note that you have to use unzip; gunzip does not extract sxw files correctly. Also note that the XML is parsed as plain text, so unfortunately the XML tags are indexed as well, but I'm sure there is a solution for this. I have not looked at the xmlsearch package in the contrib section.
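
One possible way around the tag problem, as a rough sketch and not something tested against htdig: let unzip write content.xml straight to stdout with -p (which also avoids the temporary file) and pass the result through sed to drop the markup. The function names oo_content and strip_tags are made up for this example, and the sed expression assumes that no tag is split across lines.

```shell
#!/bin/sh
# Sketch of an alternative oounzip: no temporary file, markup stripped.

# Print content.xml from the .sxw archive given as $1 to stdout.
# unzip -p writes only the file data, no status messages.
oo_content() {
    [ -n "$1" ] || { echo "usage: oo_content file.sxw" >&2; return 2; }
    unzip -p "$1" content.xml
}

# Replace every <...> tag with a space and decode the basic XML entities.
# &amp; is decoded last so that "&amp;lt;" ends up as "&lt;", not "<".
strip_tags() {
    sed -e 's/<[^>]*>/ /g' \
        -e 's/&lt;/</g' \
        -e 's/&gt;/>/g' \
        -e 's/&quot;/"/g' \
        -e "s/&apos;/'/g" \
        -e 's/&amp;/\&/g'
}
```

Used as the parser script, the last line would then be oo_content "$1" | strip_tags. Replacing tags with a space rather than nothing keeps words from adjacent paragraphs from running together.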

The last step is to create the directory ooffice in your /tmp/htdig directory (/tmp/htdig/ooffice).
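
The directory and the executable bit come down to two commands. A minimal sketch using the paths from this post; the function name prepare_oounzip is only for illustration. On many systems /tmp is emptied at reboot, so it may be worth running this again before each htdig run.

```shell
#!/bin/sh
# prepare_oounzip [WORKDIR] [SCRIPT]
# Create the extraction directory and make the parser script executable.
# The defaults are the paths used in this post.
prepare_oounzip() {
    workdir="${1:-/tmp/htdig/ooffice}"
    script="${2:-/opt/www/bin/oounzip}"
    mkdir -p "$workdir" && chmod 755 "$script"
}
```

Calling prepare_oounzip without arguments applies it to /tmp/htdig/ooffice and /opt/www/bin/oounzip.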

I don't take any responsibility for any damage caused by using this howto. There should be no damage; we still have our data here :).

Hope it works for others too; feedback is very welcome.

Greetings to everybody,
David Berger

NetMon GmbH 
http://www.netmon.ch