[PDFBox-user] FW: Extracting text from pdf
From: Richard B. <rb...@br...> - 2006-02-27 09:47:26
The developer below must have been following my posts about text extraction and figured I was the Braman of Text Extraction, which I am not yet :P

I wanted to post this code because I think it's a good way to open a dialog about PDF text extraction and strategies for how it might be done. This developer is using PDFBox, but the concepts are the same no matter what library you use. Ben has just been gracious enough to put a text stripper (PDFTextStripper) and a text stripper by area (PDFTextStripperByArea) in PDFBox. Hopefully someone can explain what is going on here, what is going wrong, and what can be done to improve this code.

The developer's problem is that when he runs his code to extract the text from the PDF file, which consists of a table with a number of columns, some of the columns in the table are joined together. Attached are the input file and the output files. If you look at page 2 of the PDF and the output in simple1.txt, you can see what he is talking about. The EntryCode and Entry Description columns get concatenated together, as do ValueDate and Entry Amount.

What's different about these two pairs of columns? It appears that they are not spaced very far apart (EntryCode is the only right-justified column), and I think the creator put each pair in as one fragment, so the stripper treats them as one as well. I think that is why Christian and Tamir came up with that whitespace algorithm for handling these columns, so that subtleties like this can be caught and dealt with.

Here are the most relevant lines of code in PDFBox to start parsing text.
I put in some comments and links to the API for easy reference:

//We get the outline of the PDF and the first child of the outline
PDDocumentOutline root = document.getDocumentCatalog().getDocumentOutline();
PDOutlineItem item = root.getFirstChild();
//Then its sibling
PDOutlineItem item1 = item.getNextSibling();

//We then start a loop to iterate through the outline items
//One file gets written for each loop iteration
//For each outline item we are going to get a stripper
//This will extract text from a specified region in the PDF.
//http://www.pdfbox.org/javadoc/org/pdfbox/util/PDFTextStripperByArea.html
//http://www.pdfbox.org/javadoc/org/pdfbox/util/PDFTextStripper.html
stripper = new PDFTextStripperByArea();
//Get the list of regions that have been set up.
//Returns a list of regions
List reg = stripper.getRegions();
//This will get the word separator. Returns a string with the word separator
wordsep = stripper.getWordSeparator();
//Set the bookmark where text extraction should start, inclusive.
//Passes in the base outline item in the document
stripper.setStartBookmark(item);
//Set the desired line separator for output text.
stripper.setLineSeparator("\n");
//Set the desired word separator for output text.
stripper.setWordSeparator(" ");
//Set the desired page separator for output text.
stripper.setPageSeparator("\n\n\n\n");
//Why is this being set again?
stripper.setWordSeparator(" ");
//Set the bookmark where the text extraction should stop.
stripper.setEndBookmark(item1);
//Write the text out to the output file
stripper.writeText( document, output );
//Changes the output file
i++;
//Move to the next sibling
item = item.getNextSibling();
item1 = item1.getNextSibling();

//The rest of the code walks through the first child of the remaining outline item
//and that child's siblings, even though one could get confused by the statement
child = child.getNextSibling();
//because it appears to return a sibling called a child.
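Stepping back to the joined columns for a moment, here is a rough sketch of the gap idea, for discussion only. It is not PDFBox code and not the algorithm Christian and Tamir wrote. It assumes each piece of text on a line comes with its left and right x position, which may not hold if the creator really wrote two columns as a single fragment. The Fragment class, the GapColumnSketch name, the two thresholds, and the sample coordinates are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: none of these names come from PDFBox.
final class GapColumnSketch {

    // One piece of text on a line, with its left and right edges in page units.
    static final class Fragment {
        final double startX;
        final double endX;
        final String text;

        Fragment(double startX, double endX, String text) {
            this.startX = startX;
            this.endX = endX;
            this.text = text;
        }
    }

    // Joins the fragments of one visual line, left to right.
    // A gap wider than columnGap starts a new column (tab),
    // a gap wider than wordGap starts a new word (space),
    // and anything smaller is glued to the previous fragment.
    static String joinLine(List<Fragment> line, double wordGap, double columnGap) {
        // Fragments may arrive in any order, so sort a copy by left edge first.
        List<Fragment> sorted = new ArrayList<>(line);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.startX));

        StringBuilder out = new StringBuilder();
        Fragment previous = null;
        for (Fragment current : sorted) {
            if (previous != null) {
                double gap = current.startX - previous.endX;
                if (gap > columnGap) {
                    out.append('\t');
                } else if (gap > wordGap) {
                    out.append(' ');
                }
            }
            out.append(current.text);
            previous = current;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Made-up coordinates: a right-justified code sitting close to the next column.
        List<Fragment> line = new ArrayList<>();
        line.add(new Fragment(120.0, 160.0, "Transfer"));
        line.add(new Fragment(100.0, 115.0, "TRF"));
        line.add(new Fragment(300.0, 340.0, "27/02/06"));
        System.out.println(joinLine(line, 1.0, 20.0).replace("\t", "<TAB>"));
    }
}
```

With a wordGap of 1.0 the sample prints TRF Transfer<TAB>27/02/06. Raise wordGap above the 5-unit gap between the first two fragments and they come out glued together as TRFTransfer, which looks a lot like what is happening to EntryCode and Entry Description.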
//I was also slightly confused by this line:
stripper.setShouldSeparateByBeads(stripper.shouldSeparateByBeads());
//as it seems that true or false should be passed in, not its current value

//To quote Tamir:
//Text in a PDF is held as a series of text fragments. These fragments may be written to
//the PDF file in any order. Each text fragment usually
//contains one full line of text, although changes in formatting and the inclusion of
//certain symbols require the line to be separated into separate fragments. Some PDF file
//creators place each word or character as a separate fragment.

//I noticed the developer never used the list of regions he created; maybe this should be used?
//I don't think it would help solve his problem.

//I noticed that there were some other method calls in the stripper that the developer didn't use, such as:
//The order of the text tokens in a PDF file may not be the same as they appear visually on the screen.
setSortByPosition(boolean newSortByPosition);
setSuppressDuplicateOverlappingText(boolean suppressDuplicateOverlappingTextValue)
//By default the text stripper will attempt to remove text that overlaps each other.
//This was not changed

//Complete code***********************************************

import java.util.List;
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import org.pdfbox.exceptions.InvalidPasswordException;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.pdmodel.interactive.documentnavigation.outline.PDDocumentOutline;
import org.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem;
import org.pdfbox.util.PDFText2HTML;
import org.pdfbox.util.PDFTextStripper;
import org.pdfbox.util.PDFTextStripperByArea;

/**
 * This is the main program that simply parses the pdf document and transforms it
 * into text.
 */
public class ExtractText
{
    /**
     * private constructor.
     */
    private ExtractText()
    {
        //static class
    }

    public static void main( String[] args ) throws Exception
    {
        try
        {
            int i = 1;
            String wordsep = null;
            String str = null;
            boolean flag = false;
            Writer output = null;
            PDDocument document = null;
            document = PDDocument.load( "53 Nostro Ofc Cofc Daily Position_AUS.pdf" );
            PDDocumentOutline root = document.getDocumentCatalog().getDocumentOutline();
            PDOutlineItem item = root.getFirstChild();
            PDOutlineItem item1 = item.getNextSibling();
            //One output file per pair of adjacent top-level bookmarks
            while( item1 != null )
            {
                System.out.println( "Item:" + item.getTitle() );
                System.out.println( "Item1:" + item1.getTitle() );
                output = new OutputStreamWriter( new FileOutputStream( "simple" + i + ".txt" ) );
                PDFTextStripperByArea stripper = null;
                stripper = new PDFTextStripperByArea();
                List reg = stripper.getRegions();
                System.out.println( reg.size() );
                wordsep = stripper.getWordSeparator();
                stripper.setStartBookmark( item );
                stripper.setLineSeparator( "\n" );
                stripper.setWordSeparator( " " );
                stripper.setPageSeparator( "\n\n\n\n" );
                stripper.setWordSeparator( " " );
                stripper.setEndBookmark( item1 );
                //str = stripper.getText(document);
                //output.write( str, 0, str.length());
                stripper.writeText( document, output );
                //Close each per-bookmark file so buffered text is flushed (missing in the original)
                output.close();
                i++;
                item = item.getNextSibling();
                item1 = item1.getNextSibling();
            }
            //item is now the last top-level bookmark; find its last child
            PDOutlineItem child = item.getFirstChild();
            PDOutlineItem child1 = new PDOutlineItem();
            while( child != null )
            {
                child1 = child;
                child = child.getNextSibling();
            }
            System.out.println( "Item:" + item.getTitle() );
            System.out.println( "Item1:" + child1.getTitle() );
            output = new OutputStreamWriter( new FileOutputStream( "simple" + i + ".txt" ) );
            PDFTextStripperByArea stripper = null;
            stripper = new PDFTextStripperByArea();
            System.out.println( "The word separator is" + flag );
            stripper.setLineSeparator( "\n" );
            stripper.setPageSeparator( "\n\n\n\n" );
            stripper.setWordSeparator( " " );
            stripper.setStartBookmark( item );
            stripper.setEndBookmark( child1 );
            stripper.setShouldSeparateByBeads( stripper.shouldSeparateByBeads() );
            stripper.writeText( document, output );
            output.close();
            document.close();
        }
        catch( Exception ex )
        {
            System.out.println( ex );
        }
    }
}

-----Original Message-----
From: Srinivas Krishna [mailto:Sri...@sa...]
Sent: Monday, February 27, 2006 12:19 AM
To: rb...@br...
Subject: Extracting text from pdf

Hi Mr. Braman,

I am Srinivas from Pune, India, working for Saama Technologies. I am facing a problem extracting text from PDF files using PDFBox and need some help with it from you. It would be a great help if you could look into this problem.

My problem is that when I extract text from a PDF file that contains a table, some of the columns in the table are joined together. It does not happen with all the columns, only with some of them.

For your reference I am sending the PDF file, the text files I have extracted, and the Java code I used for the extraction. Please have a look at all of these and let me know what exactly is happening there, if you can work it out.

The attached files are one Java file, one PDF file, and the remaining text files, each of which contains the contents of one bookmark in the PDF file.

With Regards,
Srinivas Krishna
Software Developer
Saama Technologies India Pvt. Ltd., Pune.