rhooley - 2008-05-06

Hi Everyone,

I'm currently using Aperture to do a plain text extraction on documents for storage in a Lucene index.  I'm using sample code from the example as follows:

      MimeTypeIdentifier identifier = new MagicMimeTypeIdentifier();
      ExtractorRegistry extractorRegistry = new DefaultExtractorRegistry();

      // read as many bytes of the file as desired by the MIME type identifier
      FileInputStream stream = new FileInputStream(source);
      BufferedInputStream buffer = new BufferedInputStream(stream);
      byte[] bytes = IOUtil.readBytes(buffer, identifier.getMinArrayLength());
      // close this stream; a fresh one is opened below for the extractor
      buffer.close();

      // let the MimeTypeIdentifier determine the MIME type of this file
      String mimeType = identifier.identify(bytes, source.getPath(), null);

      // skip when the MIME type could not be determined
      if (mimeType == null)
        throw new Exception(UNSUPTYP);

      // create the RDFContainer that will hold the RDF model (i.e. the full text)
      URI uri = new URIImpl(source.toURI().toString());
      Model model = RDF2Go.getModelFactory().createModel();
      RDFContainer container = new RDFContainerImpl(model, uri);
      // determine and apply an Extractor that can handle this MIME type
      Set factories = extractorRegistry.get(mimeType);
      if (factories != null && !factories.isEmpty()) {
        // just fetch the first available Extractor
        ExtractorFactory factory = (ExtractorFactory) factories.iterator().next();
        Extractor extractor = factory.get();

        // apply the extractor on the specified file
        // (just open a new stream rather than buffer the previous stream)
        stream = new FileInputStream(source);
        buffer = new BufferedInputStream(stream, 8192);
        try {
          extractor.extract(uri, buffer, null, mimeType, container);
        } finally {
          // release the file handle even if extraction fails
          buffer.close();
        }
      }

      // add the MIME type as an additional statement to the RDF model
      container.add(NIE.mimeType, mimeType);
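      // return the extracted plain text (empty if the extractor stored none)
      StringWriter contentWriter = new StringWriter();
      String text = container.getString(NIE.plainTextContent);
      if (text != null)
        contentWriter.write(text);
      return contentWriter.toString();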

It works great 99% of the time. However, whenever I hit a password-protected Microsoft Word for Mac document, the heap space blows up and I get an OutOfMemoryError. The files being parsed are quite small, just a few hundred KB. Normal (non-password-protected) Mac Word documents are fine, as are password-protected documents generated natively on a PC. It just seems to be the Mac-generated password-protected documents that cause the problem.

Has anyone got any suggestions?