Not a question, but an answer.
I found a way to index .doc, .pdf, .xls, or really any format that you
can convert to text.
I explain it on my web site, in French, but hey, that's my native
language. Translations are welcome.
The short version is this:
- use mod_rewrite (Apache) to rewrite the URL when the user-agent is htdig AND
the URI ends with .doc, .xls, .pdf, etc.
- send it to a PHP page (but could be perl, or whatever)
- open the file SCRIPT_URI (original URL) and convert it to text via
pdftotext or catdoc etc...
This way htdig sees the documents as text and indexes them as such.
The original URL (http://somehost.com/foo.doc) is preserved.
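
To make the steps above concrete, here is a minimal sketch in Python (my
own page is PHP, so this is an illustration, not the script I actually run).
It assumes pdftotext, catdoc and xls2csv are installed and on the PATH; the
names CONVERTERS, should_convert, converter_command and convert_to_text are
made up for the example. The first function mirrors the mod_rewrite
condition, the rest do the conversion.

```python
import os
import subprocess

# Assumed mapping from file extension to a converter that writes text to stdout.
# pdftotext needs "-" as the output file to print to stdout; catdoc and
# xls2csv print to stdout by default.
CONVERTERS = {
    ".pdf": lambda path: ["pdftotext", path, "-"],
    ".doc": lambda path: ["catdoc", path],
    ".xls": lambda path: ["xls2csv", path],
}


def should_convert(user_agent, uri):
    """Same test as the rewrite condition: the robot is htdig AND the
    URI ends with an extension we know how to convert."""
    ext = os.path.splitext(uri)[1].lower()
    return "htdig" in user_agent.lower() and ext in CONVERTERS


def converter_command(path):
    """Build the command line for the converter matching the file extension."""
    ext = os.path.splitext(path)[1].lower()
    return CONVERTERS[ext](path)


def convert_to_text(path):
    """Run the converter and return its output as text for the indexer."""
    result = subprocess.run(converter_command(path), capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")
```

For example, should_convert("htdig/3.1.6", "/foo.doc") is true, so the page
would answer with convert_to_text() on the file behind /foo.doc, while a
normal browser still gets the original document.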
The good news is that you can do this with any search engine, and
possibly an external robot can index your .doc files too.
Hope it helps some of you.
I looked into how to use external_parsers, but not only does it not work
for me, I don't understand why.