Re: [Aperture-devel] Extending the Word extractor.

Antoni Mylka wrote:
> Aperturians,
> 
> I've just fixed the feature request 2836084. Right now the word
> extractor uses classes from textmining.org to support files from Word
> 2.0, 6.0, 95 and 97. I worked with example files from nutch and

Great! I have tested it on several Word 6 docs that I have and they all 
work fine now.

In the past there were problems with using textmining.jar and the POI 
jar files side by side due to classpath conflicts (textmining.jar 
included a partial set of POI classes that did not correspond to any 
public POI release, as far as I could see), but apparently that has been 
solved now.
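
If anyone wants to double-check this on their own setup, one way to see 
which jar actually supplies a given class is to ask the JVM for its code 
source. The following is only a sketch: the class name ClassOrigin and 
its locate method are made up for illustration and are not part of 
Aperture.

```java
import java.security.CodeSource;

public class ClassOrigin {

    /**
     * Returns the location (typically a jar URL) that the named class was
     * loaded from. ClassOrigin and locate are hypothetical names used only
     * for this illustration.
     */
    public static String locate(String className) throws ClassNotFoundException {
        Class<?> cls = Class.forName(className);
        CodeSource source = cls.getProtectionDomain().getCodeSource();
        // Classes loaded by the bootstrap loader usually have no code source.
        if (source == null || source.getLocation() == null) {
            return "(bootstrap or unknown)";
        }
        return source.getLocation().toString();
    }

    public static void main(String[] args) throws Exception {
        // Pass fully qualified class names on the command line.
        for (String name : args) {
            System.out.println(name + " -> " + locate(name));
        }
    }
}
```

Running it with a POI class name such as 
org.apache.poi.poifs.filesystem.POIFSFileSystem on the same classpath as 
the extractor should show whether the class comes from the POI jar or 
from textmining.jar.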

> textmining.org and I haven't committed them. If no-one has anything
> against I would:
> 
> - commit those files to our codebase (nutch is Apache, textmining.org is
> LGPL) dunno if it's a good idea to commit LGPL files to our repo, what
> do you think? If not, then it would be nice to procure some other
> examples from old Word versions.

I think putting non-BSD files in our code base is not a good idea. 
Anyone have an old copy of Word 6 available? :) Or perhaps you can still 
save documents in this older format using recent Word/OpenOffice versions?

> - update to POI 3.5-beta6 which is said to drastically improve support
> for the newest office 2007 files.

Just to make this clear: we currently do not use POI for handling Office 
2007 files (the so-called OpenXML format). Instead I wrote some code 
from scratch myself a few years ago that opens up the ZIP file and 
processes some XML files in it, because no OSS libraries were available 
at that time for reading this format. I would like to switch to POI 
though - provided that its quality is good enough - as this will most 
likely result in more accurate text extraction.
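
For those unfamiliar with the format, the general idea of reading it 
without POI can be sketched roughly as follows. This is not the actual 
Aperture code, just a minimal illustration using only JDK classes. The 
class name DocxTextSketch is made up, and the sketch assumes that the 
main text lives in the word/document.xml entry and that the conventional 
"w" prefix is used for the text (w:t) and paragraph (w:p) elements.

```java
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class DocxTextSketch {

    /** Returns the text of word/document.xml, or "" if that entry is missing. */
    public static String extractText(InputStream docx) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if ("word/document.xml".equals(entry.getName())) {
                    return parseBody(zip);
                }
            }
        }
        return "";
    }

    private static String parseBody(InputStream xml) throws Exception {
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            private boolean inText = false;

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                // Assumption: the document uses the conventional "w" prefix.
                if ("w:t".equals(qName)) {
                    inText = true;
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if ("w:t".equals(qName)) {
                    inText = false;
                } else if ("w:p".equals(qName)) {
                    text.append('\n'); // one line per paragraph
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inText) {
                    text.append(ch, start, length);
                }
            }
        };
        // The default factory is not namespace-aware, so qName is the raw tag name.
        SAXParserFactory.newInstance().newSAXParser().parse(xml, handler);
        return text.toString();
    }
}
```

A sketch like this ignores headers, footers, footnotes, tables and 
metadata, which is exactly where a dedicated library should do better.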

The POI update breaks compilation due to some class and method name 
changes. I have attached a patch that I made a while ago, when I played 
with the latest POI beta. Line numbers are probably not correct anymore, 
but you get the idea just by reading the file.

Regards,

Chris
--