
#149 Less content extracted with

Milestone: Unassigned
Status: closed
Owner: nobody
Labels: None
Priority: 1
Updated: 2018-12-17
Created: 2018-12-14
Creator: Tim Allison
Private: No

We recently upgraded Jackcess to 2.2.0 in Apache Tika, and we noticed that we're getting less text from a handful of files. One is attached.

commoncrawl2/77/77HHLTAGLEP7S3FDW4KQTCEW3NFEXYZH

1 Attachments

Discussion

  • Tim Allison - 2018-12-14

    Clearly this file triggers an exception, but it looks like we were able to extract more text before hitting the exception with the earlier version of Jackcess.

     
  • Tim Allison - 2018-12-17

    Yes, sorry. This is (I think) a duplicate of bug 150. The difference is that with the file in bug 149, we used to get a similar-looking exception in version 2.1.12 of Jackcess:

    "org.apache.tika.parser.CompositeParser","org.apache.tika.parser.DefaultParser","org.apache.tika.parser.microsoft.JackcessParser"],"X-TIKA:EXCEPTION:runtime":"java.lang.IllegalStateException: invalid page number 2003\n\tat com.healthmarketscience.jackcess.impl.PageChannel.validatePageNumber(PageChannel.java:203)\n\tat com.healthmarketscience.jackcess.impl.PageChannel.readPage(PageChannel.java:214)\n\tat
    ...

    With 2.2.0, we get a similar stacktrace, but it is happening earlier...
    java.lang.IllegalStateException: invalid page number 1777
        at com.healthmarketscience.jackcess.impl.PageChannel.validatePageNumber(PageChannel.java:203)
        at com.healthmarketscience.jackcess.impl.PageChannel.readPage(PageChannel.java:214)
        at com.healthmarketscience.jackcess.impl.LongValueColumnImpl.readLongValue(LongValueColumnImpl.java:204)
        at com.healthmarketscience.jackcess.impl.LongValueColumnImpl.read(LongValueColumnImpl.java:96)
        at com.healthmarketscience.jackcess.impl.ColumnImpl.read(ColumnImpl.java:689)
        at com.healthmarketscience.jackcess.impl.TableImpl.getRowColumn(TableImpl.java:847)
        at com.healthmarketscience.jackcess.impl.TableImpl.getRow(TableImpl.java:753)
        at com.healthmarketscience.jackcess.impl.TableImpl.getRow(TableImpl.java:733)
        at com.healthmarketscience.jackcess.impl.CursorImpl.getCurrentRow(CursorImpl.java:699)
        at ...

    In bug 150, we didn't get an exception at all in 2.1.12, but we do now. But, yes, looking at the stack traces, they both fail in the same place: PageChannel.validatePageNumber. Sorry!
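
    For anyone triaging similar reports, here is a minimal sketch (not part of the original report) of how two such traces might be compared once they have been copied out of Tika's X-TIKA:EXCEPTION:runtime value with their newlines still escaped as \n\t. The helper name parse_escaped_trace and the shortened sample strings are assumptions made for illustration; only the exception message and the first two frames are taken from the traces quoted above.

```python
import re


def parse_escaped_trace(raw):
    """Split a stack trace whose line breaks are literal '\\n\\t' escapes.

    Returns (header, page_number, frames). A dangling, truncated 'at'
    at the end of the trace is ignored.
    """
    text = raw.replace("\\n", "\n").replace("\\t", "\t")
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    header = lines[0]
    frames = [ln[3:] for ln in lines[1:] if ln.startswith("at ")]
    match = re.search(r"invalid page number (\d+)", header)
    page = int(match.group(1)) if match else None
    return header, page, frames


# Shortened copies of the two traces quoted in this ticket (raw strings,
# so the backslashes stay literal, as they do in the pasted JSON).
trace_2_1_12 = (
    r"java.lang.IllegalStateException: invalid page number 2003"
    r"\n\tat com.healthmarketscience.jackcess.impl.PageChannel.validatePageNumber(PageChannel.java:203)"
    r"\n\tat com.healthmarketscience.jackcess.impl.PageChannel.readPage(PageChannel.java:214)"
    r"\n\tat"
)
trace_2_2_0 = (
    r"java.lang.IllegalStateException: invalid page number 1777"
    r"\n\tat com.healthmarketscience.jackcess.impl.PageChannel.validatePageNumber(PageChannel.java:203)"
    r"\n\tat com.healthmarketscience.jackcess.impl.PageChannel.readPage(PageChannel.java:214)"
    r"\n\tat"
)

_, old_page, old_frames = parse_escaped_trace(trace_2_1_12)
_, new_page, new_frames = parse_escaped_trace(trace_2_2_0)

# Same failing frame in both versions; only the offending page number differs.
same_top_frame = old_frames[0] == new_frames[0]
print(same_top_frame, old_page, new_page)
```

    The sketch only confirms that both versions fail in the same frame with different page numbers. It says nothing about why less text is extracted before the failure in 2.2.0.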

     

    Last edit: Tim Allison 2018-12-17
  • James Ahlborn - 2018-12-17
    • status: open --> closed
     
