From: Christiaan F. <chr...@ad...> - 2009-09-01 09:36:39
Antoni Mylka wrote:
> Aperturians,
>
> I've just fixed the feature request 2836084. Right now the word
> extractor uses classes from textmining.org to support files from Word
> 2.0, 6.0, 95 and 97. I worked with example files from nutch and

Great! I have tested it on several Word 6 docs that I have and they all
work fine now. In the past there were problems with using textmining.jar
and the POI jar files side by side due to classpath conflicts
(textmining.jar included a partial set of POI classes that did not
correspond to any public POI release, as far as I could see), but
apparently that has been solved now.

> textmining.org and I haven't committed them. If no-one has anything
> against I would:
>
> - commit those files to our codebase (nutch is Apache, textmining.org is
> LGPL) dunno if it's a good idea to commit LGPL files to our repo, what
> do you think? If not, then it would be nice to procure some other
> examples from old Word versions.

I think putting non-BSD files in our code base is not a good idea. Does
anyone have an old copy of Word 6 available? :) Or perhaps you can still
save documents in this older format using recent Word/OpenOffice versions?

> - update to POI 3.5-beta6 which is said to drastically improve support
> for the newest office 2007 files.

Just to make this clear: we currently do not use POI for handling Office
2007 files (the so-called OpenXML format). Instead, I wrote some code from
scratch a few years ago that opens up the ZIP file and processes some of
the XML files in it, because no OSS libraries were available at that time
for reading this format. I would like to switch to POI though, provided
that its quality is good enough, as this will most likely result in more
accurate text extraction.

The POI update breaks compilation due to some class and method name
changes. I have attached a patch that I made a while ago, when I played
with the latest POI beta. Line numbers are probably not correct anymore,
but you get the idea just by reading the file.

Regards,

Chris
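For readers unfamiliar with the ZIP-plus-XML approach mentioned above for Office 2007 files, here is a minimal sketch of the general idea. It is not the actual Aperture code: the class name OoxmlTextSketch and its methods are made up for illustration, and it assumes the body text lives in the word/document.xml part and that text runs are w:t elements in the WordprocessingML namespace. It uses only the JDK's java.util.zip and SAX classes.

```java
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration only; not the extractor shipped with Aperture.
public class OoxmlTextSketch {

    // Namespace used by WordprocessingML elements such as w:p and w:t.
    private static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns the body text of a .docx stream, or "" if the main part is missing. */
    public static String extractText(InputStream docx) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                // Assumption: the main document part is stored under this name.
                if ("word/document.xml".equals(entry.getName())) {
                    return parseBody(zip);
                }
            }
        }
        return "";
    }

    /** Collects the character data of every w:t element, one line per w:p paragraph. */
    private static String parseBody(InputStream xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        final StringBuilder out = new StringBuilder();
        factory.newSAXParser().parse(xml, new DefaultHandler() {
            private boolean inText;

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if (W_NS.equals(uri) && "t".equals(local)) {
                    inText = true;
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if (!W_NS.equals(uri)) {
                    return;
                }
                if ("t".equals(local)) {
                    inText = false;
                } else if ("p".equals(local)) {
                    out.append('\n'); // end of paragraph
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inText) {
                    out.append(ch, start, length);
                }
            }
        });
        return out.toString();
    }
}
```

A sketch like this ignores headers, footers, footnotes, tabs and line breaks inside runs, which is one reason a maintained library such as POI could give more accurate extraction.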