Re: [mmapps-users] Lucene 1.8.0 Word and Excel

Hi Ted,

The lucenemodule can extract text from these file formats automatically,
out of the box. It relies on the following external libraries for the
extraction:

- pdfbox 0.72, for PDF files, see http://www.pdfbox.org/
- poi 2.5.1, for Word/Excel/PowerPoint, see http://poi.apache.org/
- tm_extractors 0.4, for Word documents, see http://www.textmining.org/
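
To make the list above a bit more concrete, here is a rough sketch of which
library would be responsible for which kind of file. It is not the
lucenemodule's actual code: the class name ExtractorLookup, the method
libraryFor and the file extensions (pdf, doc, xls, ppt) are my own
assumptions for illustration, and only the library names and versions come
from the list.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class ExtractorLookup {
    // Hypothetical mapping built from the list above.
    // The extensions are assumed; the lucenemodule may select extractors differently.
    private static final Map<String, String> LIBRARY_BY_EXTENSION = new LinkedHashMap<>();
    static {
        LIBRARY_BY_EXTENSION.put("pdf", "pdfbox 0.72");
        LIBRARY_BY_EXTENSION.put("doc", "tm_extractors 0.4 / poi 2.5.1");
        LIBRARY_BY_EXTENSION.put("xls", "poi 2.5.1");
        LIBRARY_BY_EXTENSION.put("ppt", "poi 2.5.1");
    }

    // Returns the library listed for the file's extension, or null if none is listed.
    static String libraryFor(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return null; // no extension at all
        }
        String extension = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return LIBRARY_BY_EXTENSION.get(extension);
    }

    public static void main(String[] args) {
        String[] names = {"report.pdf", "letter.DOC", "budget.xls", "slides.ppt", "notes.txt"};
        for (String name : names) {
            String library = libraryFor(name);
            System.out.println(name + " -> "
                    + (library != null ? library : "not in the list, a custom parser would be needed"));
        }
    }
}
```

Anything that yields null here falls outside the three libraries, so that is
where you would have to supply a parser of your own.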

Wouter

Ted Vinke wrote:
> Hi!
>
> I've downloaded lucenemodule-1.8.0 and have a question.
>
> The Lucene FAQ entry at 
> http://wiki.apache.org/jakarta-lucene/LuceneFAQ#head-37523379241b88fd90bcd1de81b74e7ec8843f72 
> says we need to parse things ourselves, but I read "Now PDF and Word are 
> supported" on the Lucene MMBase homepage at 
> http://mmapps.sourceforge.net/lucenemodule/samples.html.
>
> Can anybody tell me how I can figure out which versions of Word, Excel, 
> PowerPoint and PDF are supported out of the box, and for which of the 
> aforementioned formats I still need to write an additional parser?
>
> Kind regards,
> Ted
>