#55 java.lang.IllegalArgumentException: Comparison method violates its general contract!

v1.0_(example)
closed-fixed
None
5
2015-02-09
2014-02-27
EMS
No

I've got a PDF that PDF Clown can't extract text from. Here is the exception:

java.lang.IllegalArgumentException: Comparison method violates its general contract!
    at java.util.TimSort.mergeLo(TimSort.java:747) ~[na:1.7.0_51]
    at java.util.TimSort.mergeAt(TimSort.java:483) ~[na:1.7.0_51]
    at java.util.TimSort.mergeCollapse(TimSort.java:408) ~[na:1.7.0_51]
    at java.util.TimSort.sort(TimSort.java:214) ~[na:1.7.0_51]
    at java.util.TimSort.sort(TimSort.java:173) ~[na:1.7.0_51]
    at java.util.Arrays.sort(Arrays.java:659) ~[na:1.7.0_51]
    at java.util.Collections.sort(Collections.java:217) ~[na:1.7.0_51]
    at org.pdfclown.tools.TextExtractor.sort(TextExtractor.java:671) ~[pdfclown-0.1.2.jar:0.1.2]
    at org.pdfclown.tools.TextExtractor.extract(TextExtractor.java:303) ~[pdfclown-0.1.2.jar:0.1.2]

Apparently that exception typically means that a comparison method isn't transitive. I looked into it, and that certainly seems to be the case with the TextStringPositionComparator used by TextExtractor.

Consider the following made-up example (I don't know enough about PDFs to know if the numbers make any sense, but the concept applies):

Box A:
height = 4
x=0
y=0

Box B:
height = 10
x=0
y=2

Box C:
height = 4
x=0
y=10

Because of the threshold used in the isOnTheSameLine method:
A == B, B == C, but A != C

Presumably my PDF has a situation somewhat like this, and Collections.sort is not happy about it.
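
The made-up example above can be reproduced in a few lines of Java. This is only a sketch: the Box class and the overlap rule in isOnTheSameLine are assumptions chosen so that the numbers above behave as described, and they are not PDF Clown's actual implementation of TextStringPositionComparator.

```java
public class TransitivityDemo {
    // Hypothetical stand-in for a text box; not a PDF Clown type.
    static final class Box {
        final double x, y, height;

        Box(double x, double y, double height) {
            this.x = x;
            this.y = y;
            this.height = height;
        }
    }

    // Assumed rule, for illustration only: two boxes share a line when
    // their vertical extents [y, y + height] overlap.
    static boolean isOnTheSameLine(Box a, Box b) {
        return a.y <= b.y + b.height && b.y <= a.y + a.height;
    }

    // Same shape as the comparison described in the report: order by x
    // within a line, otherwise order by y.
    static int compare(Box a, Box b) {
        if (isOnTheSameLine(a, b)) {
            return Double.compare(a.x, b.x);
        }
        return Double.compare(a.y, b.y);
    }

    public static void main(String[] args) {
        Box a = new Box(0, 0, 4);
        Box b = new Box(0, 2, 10);
        Box c = new Box(0, 10, 4);

        System.out.println("A vs B: " + compare(a, b)); // 0
        System.out.println("B vs C: " + compare(b, c)); // 0
        System.out.println("A vs C: " + compare(a, c)); // -1
    }
}
```

A and B compare as equal, B and C compare as equal, yet A sorts before C. That is the kind of inconsistency TimSort can detect while merging, although whether it actually throws depends on the input size and ordering.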

It seems like this would only occur with 3 overlapping lines of text, which is probably pretty unusual. In my case, it's the result of OCR software running on an old report. Still, the Comparator ought to be transitive. What circumstances require two text blocks to be considered equal if their x values are the same but their y values are not?
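
One way to keep the "same line" idea without a non-transitive comparator is to assign each box to a line first and only then sort on fixed keys. The following is a rough sketch of that approach, not the fix that was applied in PDF Clown: the Box class, the line field, and the rule for starting a new line are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class LineGroupingSort {
    // Hypothetical stand-in for a text box; not a PDF Clown type.
    static final class Box {
        final String text;
        final double x, y, height;
        int line; // line index assigned by sort() below

        Box(String text, double x, double y, double height) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.height = height;
        }
    }

    // Returns a new list ordered by (line, x). Both sorts use fixed
    // numeric keys, so each comparator is a consistent total order.
    static List<Box> sort(List<Box> boxes) {
        List<Box> sorted = new ArrayList<>(boxes);

        // Pass 1: order by y (then x) so lines can be found in one sweep.
        sorted.sort(Comparator.comparingDouble((Box b) -> b.y)
                .thenComparingDouble(b -> b.x));

        // Pass 2: assumed rule: a box starts a new line when its top lies
        // below the bottom of the box that opened the current line.
        int line = -1;
        double lineBottom = Double.NEGATIVE_INFINITY;
        for (Box b : sorted) {
            if (b.y > lineBottom) {
                line++;
                lineBottom = b.y + b.height;
            }
            b.line = line;
        }

        // Pass 3: final order is line index first, then x within the line.
        sorted.sort(Comparator.comparingInt((Box b) -> b.line)
                .thenComparingDouble(b -> b.x));
        return sorted;
    }
}
```

With this rule the three boxes from the example end up as A and B on one line and C on the next. A different grouping rule may suit real documents better; the point is only that the line decision is made once per box rather than once per pair.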

Discussion

  • We have also seen this problem. Any update on when this could be fixed?

  • status: open --> closed-fixed

  • Thank you very much!