I was looking for a way to index OpenOffice Writer documents. I didn't find an answer to this question, so I put together a simple solution. I hope it is usable for others too, and I think there is room for improvement.
Do the following:
Add a line to your mime.types (/opt/www/conf) that maps the sxw extension to the MIME type application/vnd.sun.xml.writer.
I had to do this so that htdig picks the right parser.
In your htdig.conf (/opt/www/conf), point htdig at this mime.types file.
Of course, you can also use the mime.types of your web server instead.
And in your external_parsers section:
application/vnd.sun.xml.writer->text/plain /opt/www/bin/oounzip
I chose the name oounzip for my script, but it could be anything else.
Now create the oounzip script and make it executable. The script has the following content:
# -o overwrites the content.xml left by a previous run without prompting
unzip -o "$1" content.xml -d /tmp/htdig/ooffice
It extracts only the file content.xml from the zipped sxw file, so only the body will be indexed, not the header and so on. For me that is enough. Please note that you have to use unzip; gunzip does not extract sxw files correctly. Also note that the XML is parsed as plain text, so unfortunately the XML tags are parsed as well, but I'm sure there is a solution for this. I have not yet looked at the xmlsearch package in the contrib section.
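One possible way around the tag problem, as a rough sketch only and not what I am running: let the script print content.xml to stdout with unzip -p and blank out the tags with sed. The function names strip_xml_tags and extract_sxw_text are made up for this example, and I am assuming that htdig hands the document path to the parser as the first argument and reads the parser's stdout. The sed expression works line by line, so a tag that is broken across two lines would slip through.

```shell
#!/bin/sh
# Hypothetical variant of oounzip: print the text of content.xml on stdout.

# Replace every XML tag with a space so only the text content remains.
# Assumption: no tag spans more than one line.
strip_xml_tags() {
    sed -e 's/<[^>]*>/ /g'
}

# $1 is assumed to be the path of the sxw file passed in by htdig.
# unzip -p writes the named archive member to stdout instead of to disk.
extract_sxw_text() {
    unzip -p "$1" content.xml | strip_xml_tags
}

# Only run the extraction when a file argument was given.
if [ "$#" -ge 1 ]; then
    extract_sxw_text "$1"
fi
```

With this variant nothing is written to /tmp/htdig/ooffice, so the directory would only be needed for the original script.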
The last step is to create the directory ooffice inside your /tmp/htdig directory (/tmp/htdig/ooffice).
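For example, something like the following should do it. The variable name OO_DIR is only used here for illustration; mkdir -p also creates /tmp/htdig if it is missing and does not complain if the directory already exists.

```shell
# Directory the oounzip script extracts content.xml into.
OO_DIR=/tmp/htdig/ooffice
mkdir -p "$OO_DIR"
```

The user that runs htdig needs write permission on this directory.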
I don't take any responsibility for any damage that occurs from using this howto. There should be no damage, we still got our data here :).
I hope it works for others too, feedback is very welcome.
Greetings to everybody.
According to berger@...:
> and in your external_parsers - section:
> application/vnd.sun.xml.writer->text/plain /opt/www/bin/oounzip
> Also note that the xml is parsed as plain-text, so unfortunately the
> xml-tags are also parsed, but I'm sure there is a solution for this.
If you define the script output as text/html instead of text/plain,
then htdig's HTML parser should treat the XML tags as unknown HTML tags
and just ignore them.
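As a sketch, and assuming the rest of the quoted entry stays the same, the external_parsers entry in htdig.conf would then look like this:

```
external_parsers: application/vnd.sun.xml.writer->text/html /opt/www/bin/oounzip
```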
Gilles R. Detillieux E-mail: <grdetil@...>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba Winnipeg, MB R3E 3J7 (Canada)