From: Wouter H. <wh...@xs...> - 2007-10-15 10:57:56
Hi Ted,

The lucenemodule can extract these file formats automatically, out of the box. It uses external libraries for extraction:

- pdfbox 0.72, for PDF files; see http://www.pdfbox.org/
- poi 2.5.1, for Word/Excel/PowerPoint; see http://poi.apache.org/
- tm_extractors 0.4, for Word documents; see http://www.textmining.org/

Wouter

Ted Vinke wrote:
> Hi!
>
> I've downloaded lucenemodule-1.8.0 and have a question.
>
> The Lucene FAQ entry at
> http://wiki.apache.org/jakarta-lucene/LuceneFAQ#head-37523379241b88fd90bcd1de81b74e7ec8843f72
> says we need to parse things ourselves, but I read "Now PDF and Word are
> supported" on the Lucene MMBase homepage at
> http://mmapps.sourceforge.net/lucenemodule/samples.html.
>
> Can anybody tell me how I can figure out which versions of Word, Excel,
> PowerPoint and PDF are supported out of the box, and for which of the
> aforementioned formats I still need to write an additional parser?
>
> Kind regards,
> Ted