Text Extraction best practice

2010-09-24
2013-05-13
  • Chris Bamford
    2010-09-24

    Let me add a bit more detail….
    I have some code based on the wiki example which was stable under 1.4.0.  When moving to 1.5.0, simple MS Office documents started causing my unit tests to fail.  For example, on closer inspection of one of them I find that this line:

         extractor.extract(container.getDescribedUri(), bis, Charset.forName("UTF-8"), mimeType, container);

    is throwing a Throwable as follows:

        java.lang.NoClassDefFoundError: org/apache/poi/EncryptedDocumentException

    … only the file ISN'T encrypted  :(

    Have I missed something?
    Thanks!

     
  • Chris Bamford
    2010-09-24

    I realise this is a missing class…  however, when I add an extra Aperture JAR (one of the runtime ones I think), the problem just moves to a missing method.  Clearly I am not using the correct pom entries.  Could you please advise on which ones I need?  I am doing MIME detection with identify() then text extraction with extract() on many different file types.

     
  • Chris Bamford
    2010-09-24

    Looks like I have solved it.  I needed to explicitly add the latest versions of poi and poi-scratchpad to my pom.  I hope this is right.
    If anyone wants to see the pom entries I used to do this, please ask.
    - Chris

     
  • Antoni Mylka
    2010-09-28

    That's weird. Could you post the pom entries from before and after your change?

    The latest POI should be included by the 1.5.0 pom. If it isn't, then we have a bug.

     
  • Chris Bamford
    2010-09-30

    I lost track of the "before" entries as I was chopping and changing between 1.4 and 1.5, but here are my working 1.5 entries:

            <!-- Aperture Dependencies -->
            <dependency>
              <groupId>org.semanticdesktop.aperture</groupId>
              <artifactId>aperture</artifactId>
              <version>1.5.0</version>
              <type>pom</type>
            </dependency>

            <dependency>
                <groupId>org.apache.poi</groupId>
                <artifactId>poi-scratchpad</artifactId>
                <version>3.7-20100617171931</version>
            </dependency>

            <dependency>
                <groupId>org.apache.poi</groupId>
                <artifactId>poi</artifactId>
                <version>3.7-20100617171931</version>
            </dependency>

            <dependency>
              <groupId>org.semanticdesktop.aperture</groupId>
              <artifactId>aperture-core</artifactId>
              <version>1.5.0</version>
            </dependency>

            <dependency>
              <groupId>org.semanticdesktop.aperture</groupId>
              <artifactId>aperture-runtime-default</artifactId>
              <version>1.4.0</version>
              <type>pom</type>
            </dependency>

            <dependency>
              <groupId>org.semanticdesktop.aperture</groupId>
              <artifactId>aperture-runtime-optional</artifactId>
              <version>1.4.0</version>
              <type>pom</type>
            </dependency>

    If I comment out the poi and poi-scratchpad entries I get the following Error when extracting text from "old style" Word docs (i.e. .doc rather than .docx):

    java.lang.NoClassDefFoundError: org/apache/poi/EncryptedDocumentException
    at org.semanticdesktop.aperture.extractor.word.WordExtractorFactory.get(WordExtractorFactory.java:36)
    at com.mimecast.extractor.text.TextExtractor.extract(TextExtractor.java:153)
    at com.mimecast.extractor.text.TextExtractor.extract(TextExtractor.java:73)
    at textExtract.ExtractorUnitTests.EncryptedDocFileTest(ExtractorUnitTests.java:182)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
            ….

    I have tried just about every combination of including / excluding the Aperture deps, but so far haven't found a way around this.
    I hope you can point out my error.

    Thanks

    - Chris

     
  • Antoni Mylka
    2010-09-30

    Instead of

    <dependency> 
      <groupId>org.semanticdesktop.aperture</groupId> 
      <artifactId>aperture</artifactId> 
      <version>1.5.0</version> 
      <type>pom</type> 
    </dependency>

    try

    <dependency> 
      <groupId>org.semanticdesktop.aperture</groupId> 
      <artifactId>aperture-core</artifactId> 
      <version>1.5.0</version> 
    </dependency>

     
  • Chris Bamford
    2010-09-30

    Unfortunately, this makes no difference to my situation  :(
    For the avoidance of any doubt this is what I just tried:

            <dependency>
              <groupId>org.semanticdesktop.aperture</groupId>
              <artifactId>aperture-core</artifactId>
              <version>1.5.0</version>
            </dependency>

            <dependency>
              <groupId>org.semanticdesktop.aperture</groupId>
              <artifactId>aperture-runtime-default</artifactId>
              <version>1.4.0</version>
              <type>pom</type>
            </dependency>

            <dependency>
              <groupId>org.semanticdesktop.aperture</groupId>
              <artifactId>aperture-runtime-optional</artifactId>
              <version>1.4.0</version>
              <type>pom</type>
            </dependency>

    Is this what you meant?

     
  • Antoni Mylka
    2010-09-30

    I meant a dependency on aperture-core in the 1.5.0 version, just like on the wiki page

    http://sourceforge.net/apps/trac/aperture/wiki/MavenDependency

    You could add a dependency on Aperture to your pom the way it's described on the wiki page. And then issue

    mvn dependency:tree -Dverbose=true > dependency-tree.txt

    Then either paste the file on this forum or send it to me. We'll see what's wrong with your dependencies.

     
  • Chris Bamford
    2010-10-01

    Hiya.  OK, my POM now has the same lines as the MavenDependency page (aperture-core and aperture-runtime-optional, both 1.5.0) and my Word .doc tests still fail with java.lang.NoClassDefFoundError: org/apache/poi/EncryptedDocumentException.
    The contents of 'dependency-tree.txt' are quite large (~270 lines)!  If you'd rather I sent it to you, please provide an address - otherwise I'll post it here.
    Thanks

    - Chris

     
  • Antoni Mylka
    2010-10-01

    Please post it here unless it contains some sensitive information (like names of components of your proprietary application).

    If so, send it to mylka  sourceforge  net

     
  • Antoni Mylka
    2010-10-02

    These lines are most troubling:

    org.apache.poi:poi:jar:3.0.2-FINAL-20080204:compile (version managed from 3.7-20100617171931)
    org.apache.poi:poi-scratchpad:jar:3.0.2-FINAL-20080204:compile (version managed from 3.7-2010061717

    They mean that the Aperture pom references the correct, new version of POI, yet those version numbers are overridden somewhere by references to version 3.0.2. Please double-check that your project does not contain a direct dependency on POI 3.0.2. It can appear in the <dependencies> or <dependencyManagement> section of your pom or, if your pom has a parent pom, in the parent.

    The pom file is here:

    https://aperture.svn.sourceforge.net/svnroot/aperture/aperture/tags/aperture-1.5.0/core/pom.xml

    It contains a named property:

    <properties>
      <poi.version>3.7-20100617171931</poi.version>
    </properties>

    This property is referred to in the POI dependency sections:

    <!-- Apache POI -->
    <dependency>
      <groupId>org.apache.poi</groupId>
      <artifactId>poi</artifactId>
      <version>${poi.version}</version>
      <exclusions>
        <exclusion>
          <groupId>commons-logging</groupId>
          <artifactId>commons-logging</artifactId>
        </exclusion>
        <exclusion>
          <groupId>log4j</groupId>
          <artifactId>log4j</artifactId>
        </exclusion>
      </exclusions>
    </dependency>

    You override poi.version somewhere in your pom or in the parent pom of your pom.
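
    As a hypothetical sketch of the kind of entry to look for (the actual parent pom was not posted in this thread, so the fragment below is an assumption for illustration only): a <dependencyManagement> section that pins POI to the old release would typically produce the "version managed from" lines shown above, because managed versions take precedence over the versions requested by transitive dependencies such as aperture-core.

    ```xml
    <!-- Hypothetical parent pom fragment, for illustration only -->
    <dependencyManagement>
      <dependencies>
        <!-- Pins every transitive POI dependency to the old 3.0.2 release -->
        <dependency>
          <groupId>org.apache.poi</groupId>
          <artifactId>poi</artifactId>
          <version>3.0.2-FINAL-20080204</version>
        </dependency>
        <dependency>
          <groupId>org.apache.poi</groupId>
          <artifactId>poi-scratchpad</artifactId>
          <version>3.0.2-FINAL-20080204</version>
        </dependency>
      </dependencies>
    </dependencyManagement>
    ```

    Removing entries like these, or declaring poi and poi-scratchpad with an explicit 3.7 version in the project's own pom, should let the newer version through.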

     
  • Chris Bamford
    2010-10-02

    Thanks Mylka. I'll investigate our parent Pom.

     
  • Chris Bamford
    2010-10-07

    This is now fixed - I have removed the offending dependency from our parent POM.
    Thanks so much!

     
  • Antoni Mylka
    2010-10-07

    Great.
    If you have any other questions, please create new threads.