Hi all,
Daniel asked me for my patch for the rotation-issue described in https://sourceforge.net/forum/message.php?msg_id=4992032
Attention: I didn't apply the newest patches to the classes PDFStreamEngine and PageDrawer.
There are 4 more classes that call the page.findRotation method and are probably affected. I didn't change them because I haven't had to use them (so far):
org.pdfbox.util.operator.pagedrawer.Invoke
org.pdfbox.util.TextPositionComparator
org.pdfbox.examples.pdmodel.PrintURLs
org.pdfbox.examples.util.PrintImageLocations
I've attached a PDF in DIN A4 landscape. The text is misplaced whenever I try to print or display it with pdfbox (using the pdfbox PDFReader and convertToImage within my application). Acrobat Reader has no problems with my documents.
After my patch everything works fine. Perhaps it is a point for discussion whether the convertToImage method has to rotate the image or whether the user has to do it. The PDFPagePanel doesn't do it (yet).
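To make the rotation question concrete, here is a minimal sketch of what "the user has to do it" could look like: rotating an already rendered image clockwise by the page's rotation value. The class and method names (RotatedPageImage, rotate) are made up for illustration and are not part of pdfbox; the sketch only assumes the rotation is a multiple of 90 degrees and is applied clockwise, as the PDF /Rotate entry specifies.

```java
import java.awt.image.BufferedImage;

// Hypothetical helper, not part of pdfbox: rotates a rendered page image.
public class RotatedPageImage {

    /**
     * Returns a copy of src rotated clockwise by the given angle.
     * The angle must be a multiple of 90, as a page's /Rotate value is.
     */
    public static BufferedImage rotate(BufferedImage src, int rotation) {
        // Normalize to 0, 90, 180 or 270 (negative values are allowed).
        int r = ((rotation % 360) + 360) % 360;
        if (r % 90 != 0) {
            throw new IllegalArgumentException("rotation must be a multiple of 90: " + rotation);
        }
        int w = src.getWidth();
        int h = src.getHeight();
        // For 90 and 270 degrees, width and height swap.
        boolean swap = (r == 90 || r == 270);
        BufferedImage dst = new BufferedImage(swap ? h : w, swap ? w : h,
                BufferedImage.TYPE_INT_ARGB);
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int dx;
                int dy;
                switch (r) {
                    case 90:  dx = h - 1 - y; dy = x;         break; // top-left goes to top-right
                    case 180: dx = w - 1 - x; dy = h - 1 - y; break; // top-left goes to bottom-right
                    case 270: dx = y;         dy = w - 1 - x; break; // top-left goes to bottom-left
                    default:  dx = x;         dy = y;         break; // no rotation, plain copy
                }
                dst.setRGB(dx, dy, src.getRGB(x, y));
            }
        }
        return dst;
    }
}
```

A caller would pass the image returned by convertToImage together with the value from page.findRotation. Copying pixel by pixel is slow for large pages but keeps the mapping easy to check; a Graphics2D transform would be the usual faster route.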
Andreas
rotation_patch incl. testpdf
Logged In: YES
user_id=1737686
Originator: NO
I've just tried your sample PDF w/ the latest code -- prior to application of your patch. It doesn't work.
I'll work on incorporating your change for a full regression test in the next hour or so.
Logged In: YES
user_id=2069622
Originator: YES
Hi Daniel,
I've just added my patch to the newest sources you sent me earlier today. I guess it works. During testing I found another problem concerning graphics within landscape docs. I solved it by patching the class org.pdfbox.util.operator.pagedrawer.Invoke in the same way I patched the others. And, to be consistent, I've also patched the new methods in org.pdfbox.pdfviewer.PageDrawer.
For me everything works fine, incl. the 4PP-pdf.
I've attached the patched files and another testpdf with an embedded graphic.
Andreas
File Added: pdfbox_rotation_patch_2.zip
rotation-patch 2 incl. new testpdf
Logged In: YES
user_id=1737686
Originator: NO
Your code works w/ the 4PP test ... and with the other rendering stuff I've tried so far.
However ... the text extraction test fails with it. I can't figure that one out ... ideas?
Logged In: YES
user_id=2069622
Originator: YES
Can you give me some more details? I never do any text extraction with pdfbox. Perhaps you could provide me with the code for the test program, or is it part of pdfbox, so that I can find it in the cvs?
However, it has to wait until tomorrow.
Logged In: YES
user_id=1737686
Originator: NO
If you've got the whole project set up, try
ant testextract
I'll see if I can narrow it down some.
Logged In: YES
user_id=1737686
Originator: NO
The extraction problem seems to have to do w/ the changes to PDFStreamEngine.
If I revert that file, extraction succeeds. Unfortunately ... with that reverted but your other changes in place, image rendering hangs.
Will work on it more ... probably tomorrow.
Logged In: YES
user_id=1737686
Originator: NO
Correction ... it doesn't hang ... it's just slow on the first PDF to render ... maybe just due to the first one I'm sending it.
Will look more tomorrow.
Logged In: YES
user_id=2069622
Originator: YES
I've found one bug. While deleting the if-clauses for the rotation, I also deleted line 394, which is still needed.
I've attached the corrected file.
File Added: PDFStreamEngine.java
Corrected PDFStreamEngine
Logged In: YES
user_id=2069622
Originator: YES
I forgot to mention that I can't run the test suite. When I tried to get the whole project, I realized that I'm behind a firewall here in my office. Consequently my cvs-client doesn't work. I have to do it from home. :-(
I've only tested one file: 601501018.pdf
There are additional blanks and they disappear after adding the missing line. But starting at page 21, when the document orientation changes from portrait to landscape, there are additional CR or LF. Hmmmm ??
Logged In: YES
user_id=2069622
Originator: YES
I've continued testing and I guess the problem starts somewhere in org.pdfbox.util.PDFTextStripper.showCharacter(..). Obviously it handles the coordinates for rotated pages differently from the implementation of showCharacter() in org.pdfbox.pdfviewer.PageDrawer.
But for the moment I don't understand what's happening in the TextStripper, perhaps I'll find out later.
I hope this hint helps ...
Logged In: YES
user_id=1737686
Originator: NO
I've put a couple more hours into this, and I don't know the answer.
I do know the text extraction is the more mature side of this library.
For the moment, I'll be skipping over your changes to PDFStreamEngine.
Thanks for the other changes!
Logged In: YES
user_id=2069622
Originator: YES
Hi Daniel,
I guess I've solved the problem. The text-position handling has to be adjusted within the method PDFTextStripper.flushText(). Of course my former changes to the class PDFStreamEngine are needed. During debugging I found a bug in the class TextPositionComparator (line 82). I solved it by removing the rotation if-clauses. Whenever you compare two TextPositions, there is no need to look at the rotation: they are on the same page, so the comparison is independent of the rotation.
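The point about the comparator can be sketched as follows. This is not the code from the patch: Pos is a made-up stand-in for TextPosition, and the sketch assumes that all coordinates on a page have already been brought into one common space in which y grows downward. Under that assumption the ordering needs no rotation branches at all.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch, not the pdfbox TextPositionComparator.
public class ReadingOrderSketch {

    /** Made-up stand-in for a text position: a piece of text at (x, y). */
    public static final class Pos {
        final String text;
        final double x;
        final double y;

        public Pos(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Orders positions top to bottom, then left to right.
     * Assumes y grows downward and that both positions are in the same
     * coordinate space, so the page rotation never has to be consulted.
     */
    public static final Comparator<Pos> READING_ORDER =
            Comparator.<Pos>comparingDouble(p -> p.y).thenComparingDouble(p -> p.x);

    /** Sorts a copy of the positions and concatenates their text. */
    public static String join(List<Pos> positions) {
        List<Pos> sorted = new ArrayList<>(positions);
        sorted.sort(READING_ORDER);
        StringBuilder sb = new StringBuilder();
        for (Pos p : sorted) {
            sb.append(p.text);
        }
        return sb.toString();
    }
}
```

A real comparator would likely also need a tolerance for glyphs that sit on the same line with slightly different y values; that is left out here to keep the ordering strict and simple.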
Furthermore my PDFTextStripper-patch seems to correct some minor problems, which are described in https://sourceforge.net/forum/message.php?msg_id=4976730.
I've tested the following cases:
Garcia2003b__Correlative_exploration_of_EEG_Signals.pdf works 100%
test_rotate_270.txt doesn't work 100%, but my patch corrected a bug in lines 251-257, 278/279, 502/503, 574/575 and the other differences are some kind of special-character-issues. I guess you have to correct the input at first.
I've attached my changes based on the newest versions of both classes.
Bugfix for PDFTextStripper
Logged In: YES
user_id=2069622
Originator: YES
File Added: pdfbox_rotation_patch_3.zip