I've got a PDF that PDF Clown can't extract text from. Here is the exception:
java.lang.IllegalArgumentException: Comparison method violates its general contract!
    at java.util.TimSort.mergeLo(TimSort.java:747) ~[na:1.7.0_51]
    at java.util.TimSort.mergeAt(TimSort.java:483) ~[na:1.7.0_51]
    at java.util.TimSort.mergeCollapse(TimSort.java:408) ~[na:1.7.0_51]
    at java.util.TimSort.sort(TimSort.java:214) ~[na:1.7.0_51]
    at java.util.TimSort.sort(TimSort.java:173) ~[na:1.7.0_51]
    at java.util.Arrays.sort(Arrays.java:659) ~[na:1.7.0_51]
    at java.util.Collections.sort(Collections.java:217) ~[na:1.7.0_51]
    at org.pdfclown.tools.TextExtractor.sort(TextExtractor.java:671) ~[pdfclown-0.1.2.jar:0.1.2]
    at org.pdfclown.tools.TextExtractor.extract(TextExtractor.java:303) ~[pdfclown-0.1.2.jar:0.1.2]
Apparently that exception typically means that a comparison method isn't transitive. I looked into it, and that certainly seems to be the case with the TextStringPositionComparator used by TextExtractor.
Consider the following made-up example (I don't know enough about PDFs to know if the numbers make any sense, but the concept applies):
A: height = 4
B: height = 10
C: height = 4
Because of the threshold used in the isOnTheSameLine method:
A == B, B == C, but A != C
Presumably my PDF has a situation somewhat like this, and Collections.sort is not happy about it.
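To make the concern concrete, here is a minimal sketch of the kind of threshold-based comparison I mean. It is not PDF Clown's actual code: the Box class, the Y_TOLERANCE value, and the sameLine rule are all made up for illustration, and the real isOnTheSameLine logic may differ.

```java
import java.util.Comparator;

// Hypothetical stand-in for a text block; only the fields needed here.
final class Box {
    final double x;
    final double y;

    Box(double x, double y) {
        this.x = x;
        this.y = y;
    }
}

class ThresholdDemo {
    // Assumed tolerance, chosen for the example (not PDF Clown's value).
    static final double Y_TOLERANCE = 3.0;

    // Two boxes count as "on the same line" when their y values are close.
    static boolean sameLine(Box a, Box b) {
        return Math.abs(a.y - b.y) < Y_TOLERANCE;
    }

    // Same line: order by x. Otherwise: order by y.
    static final Comparator<Box> BY_POSITION = new Comparator<Box>() {
        @Override
        public int compare(Box a, Box b) {
            if (sameLine(a, b)) {
                return Double.compare(a.x, b.x);
            }
            return Double.compare(a.y, b.y);
        }
    };

    public static void main(String[] args) {
        // Same x everywhere; y values step by 2, tolerance is 3.
        Box a = new Box(0, 0);
        Box b = new Box(0, 2);
        Box c = new Box(0, 4);
        System.out.println(Integer.signum(BY_POSITION.compare(a, b))); // 0: A == B
        System.out.println(Integer.signum(BY_POSITION.compare(b, c))); // 0: B == C
        System.out.println(Integer.signum(BY_POSITION.compare(a, c))); // -1: A != C
    }
}
```

With these made-up numbers, A and B compare as equal, B and C compare as equal, yet A sorts before C. That breaks the Comparator contract, which requires that compare(x, y) == 0 implies compare(x, z) and compare(y, z) have the same sign for every z.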
It seems like this would only occur with 3 overlapping lines of text, which is probably pretty unusual. In my case, it's the result of OCR software running on an old report. Still, the Comparator ought to be transitive. What circumstances require two text blocks to be considered equal if their x values are the same but their y values are not?
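For comparison, here is one ordering I can think of that stays transitive. It is only a sketch under my own assumptions, not a proposed patch: TextBlock, LINE_HEIGHT, and lineIndex are hypothetical names. Instead of testing pairs against a threshold, each block is assigned to a fixed line bucket first, and the comparator orders by bucket, then by x.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical text block with just enough data for the example.
final class TextBlock {
    final String text;
    final double x;
    final double y;

    TextBlock(String text, double x, double y) {
        this.text = text;
        this.x = x;
        this.y = y;
    }
}

class BucketedOrder {
    // Assumed bucket size; a real value would depend on the page's text.
    static final double LINE_HEIGHT = 3.0;

    // Each block maps to exactly one line index, independent of any other block.
    static long lineIndex(TextBlock b) {
        return (long) Math.floor(b.y / LINE_HEIGHT);
    }

    // Lexicographic order on (lineIndex, x): a total order, so transitive.
    static final Comparator<TextBlock> BY_LINE_THEN_X = new Comparator<TextBlock>() {
        @Override
        public int compare(TextBlock a, TextBlock b) {
            int byLine = Long.compare(lineIndex(a), lineIndex(b));
            return byLine != 0 ? byLine : Double.compare(a.x, b.x);
        }
    };

    // Sorts a copy of the input and returns the texts in reading order.
    static List<String> sortedTexts(List<TextBlock> blocks) {
        List<TextBlock> copy = new ArrayList<TextBlock>(blocks);
        Collections.sort(copy, BY_LINE_THEN_X);
        List<String> out = new ArrayList<String>();
        for (TextBlock b : copy) {
            out.add(b.text);
        }
        return out;
    }
}
```

The trade-off is that two blocks with nearly identical y values can still land in different buckets when they straddle a boundary, so this gives up some of the tolerance the threshold was presumably meant to provide.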