[PDFBox-user] FW: Extracting text from pdf


The developer below must have been following my posts about text extraction
and figured I was the Braman of Text Extraction, which I am not yet :P
I wanted to post this code because I think it's a good way to open a
dialog about PDF text extraction and strategies for how it might be
done.  This developer is using PDFBox, but the concepts are the same no
matter what library you use.  Ben has just been gracious enough to put a
PDFTextStripper and a PDFTextStripperByArea in PDFBox. Hopefully someone
can explain what is going on and what is going wrong here, and what can
be done to improve this code.

The developer's problem is that when he runs his code to extract the text
from the PDF file, which consists of a table with several columns, some
columns of the table are joined together.  Attached are the input file
and output files.  If you look at page 2 of the PDF and the output in
simple1.txt, you can see what he is talking about.  The EntryCode and
Entry Description columns get concatenated together, as do ValueDate and
Entry Amount.  What's different about these columns?  It appears that
they are not spaced very far apart (EntryCode is the only right-justified
column), and I think the creator put them in as one fragment and the
stripper treats them as one as well.  I think that is why Christian and
Tamir came up with that whitespace algorithm for handling these columns,
so that subtleties like this can be caught and dealt with.
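
To make the whitespace idea concrete, here is a minimal sketch of splitting one line into cells by looking at the horizontal gap between neighbouring pieces of text. It is not Christian and Tamir's actual algorithm and it does not use PDFBox: the Glyph class, the splitIntoCells method, the gap threshold, and the sample values are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class GapColumnSplitter {
    /** Hypothetical stand-in for a positioned piece of text: left edge and width in page units. */
    static final class Glyph {
        final String text;
        final double x;
        final double width;

        Glyph(String text, double x, double width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Joins glyphs that are already ordered left to right, starting a new cell
     * whenever the gap to the previous glyph is wider than gapThreshold.
     */
    static List<String> splitIntoCells(List<Glyph> line, double gapThreshold) {
        List<String> cells = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        double previousRight = Double.NaN;
        for (Glyph g : line) {
            boolean wideGap = !Double.isNaN(previousRight) && g.x - previousRight > gapThreshold;
            if (wideGap && current.length() > 0) {
                cells.add(current.toString());
                current.setLength(0);
            }
            current.append(g.text);
            previousRight = g.x + g.width;
        }
        if (current.length() > 0) {
            cells.add(current.toString());
        }
        return cells;
    }

    public static void main(String[] args) {
        // Made-up sample row: the second piece starts only 3 units after the first ends.
        List<Glyph> line = List.of(
            new Glyph("TRF", 100, 18),
            new Glyph("Transfer", 121, 40),
            new Glyph("27/02/06", 200, 40));
        System.out.println(splitIntoCells(line, 5.0)); // prints [TRFTransfer, 27/02/06]
        System.out.println(splitIntoCells(line, 2.0)); // prints [TRF, Transfer, 27/02/06]
    }
}
```

With the larger threshold the two closely spaced pieces merge, which looks a lot like the symptom in the attached output; a smaller threshold keeps them apart, at the risk of splitting ordinary words in other documents.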

Here are the most relevant lines of code in PDFBox to start parsing
text. I put in some comments and links to the API for easy
reference:

//We get an outline of the PDF and the child of the first Outline item
PDDocumentOutline root =
document.getDocumentCatalog().getDocumentOutline();
PDOutlineItem item = root.getFirstChild();
//Then its sibling
PDOutlineItem item1 = item.getNextSibling();

//We then start a loop to iterate through the Outline Items
//One file gets written for each loop iteration
//For each outline item we are going to get a stripper

//This will extract text from a specified region in the PDF. 
//http://www.pdfbox.org/javadoc/org/pdfbox/util/PDFTextStripperByArea.html
//http://www.pdfbox.org/javadoc/org/pdfbox/util/PDFTextStripper.html

stripper=new PDFTextStripperByArea(); 

//Get the list of regions that have been setup. 
//returns a list of regions
List reg = stripper.getRegions();

//This will get the word separator.
//Returns a string with the word separator
wordsep = stripper.getWordSeparator();

//Set the bookmark where text extraction should start, inclusive.
//Passes in the base PDFOutline in the document
stripper.setStartBookmark(item); 

//Set the desired line separator for output text. 
stripper.setLineSeparator("\n");

//Set the desired word separator for output text. 
stripper.setWordSeparator("  ");

//Set the desired page separator for output text. 
stripper.setPageSeparator("\n\n\n\n");

//Why is this being set again?
stripper.setWordSeparator("   ");

//Set the bookmark where the text extraction should stop.
stripper.setEndBookmark(item1);

//Write the text out to the output file
stripper.writeText( document, output );
//changes output file
i++;

//Move to next sibling
item = item.getNextSibling();
item1 = item1.getNextSibling();

//The rest of the code iterates through the child node of the root
//outline item and its siblings, even though one could get confused
//by the statement

child = child.getNextSibling();
//because it appears to return a sibling called a child.

//I was also slightly confused by this line:
stripper.setShouldSeparateByBeads(stripper.shouldSeparateByBeads()); 
//as it seems that true or false should be passed in, not its current
//value

//To quote Tamir:
//Text in a PDF is held as a series of text fragments. These fragments
//may be written to the PDF file in any order. Each text fragment
//usually contains one full line of text, although changes in formatting
//and the inclusion of certain symbols require the line to be separated
//into separate fragments. Some PDF file creators place each word or
//character as a separate fragment.
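
Since fragments can arrive in any order, one common first step is to put them back into reading order before joining them. The following is only a rough sketch of that idea in plain Java, not PDFBox code: the Fragment class and both method names are hypothetical, y is assumed to be measured from the top of the page, and fragments on the same line are assumed to share exactly the same y value.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class FragmentSorter {
    /** Hypothetical text fragment with the position of its left edge. */
    static final class Fragment {
        final String text;
        final double x;
        final double y; // assumed: distance from the top of the page

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Returns a new list ordered top to bottom, then left to right. */
    static List<Fragment> sortByPosition(List<Fragment> fragments) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator
            .comparingDouble((Fragment f) -> f.y)
            .thenComparingDouble(f -> f.x));
        return sorted;
    }

    /** Joins the fragment texts in list order, separated by single spaces. */
    static String joinText(List<Fragment> fragments) {
        StringBuilder sb = new StringBuilder();
        for (Fragment f : fragments) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(f.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Fragments listed in "file order", which differs from how they look on the page.
        List<Fragment> fileOrder = List.of(
            new Fragment("world", 50, 10),
            new Fragment("Second", 10, 20),
            new Fragment("Hello", 10, 10));
        System.out.println(joinText(fileOrder));                 // prints world Second Hello
        System.out.println(joinText(sortByPosition(fileOrder))); // prints Hello world Second
    }
}
```

Real pages rarely have perfectly aligned baselines, so a practical version would need some tolerance when deciding that two fragments sit on the same line.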

//I noticed the developer never used the list of regions he created;
//maybe this should be used?
//I don't think it would help solve his problem

//I noticed that there were some other method calls in the stripper that
//the developer didn't use, such as:

//The order of the text tokens in a PDF file may not be the same as
//they appear visually on the screen.
setSortByPosition(boolean newSortByPosition);

//By default the text stripper will attempt to remove text that overlaps
//each other.
//This was not changed
setSuppressDuplicateOverlappingText(boolean suppressDuplicateOverlappingTextValue);
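
For anyone wondering what removing overlapping text could mean in practice, here is one plausible way to picture it, written as plain Java. It is a guess at the general idea and not PDFBox's implementation: the Glyph class, the suppressDuplicates method, and the tolerance value are all hypothetical. The sketch drops a piece of text when the same text has already been kept at nearly the same position.

```java
import java.util.ArrayList;
import java.util.List;

public class DuplicateSuppressor {
    /** Hypothetical positioned piece of text. */
    static final class Glyph {
        final String text;
        final double x;
        final double y;

        Glyph(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Keeps the first occurrence of each glyph and drops later glyphs that have
     * the same text and lie within tolerance of a kept glyph on both axes.
     */
    static List<Glyph> suppressDuplicates(List<Glyph> glyphs, double tolerance) {
        List<Glyph> kept = new ArrayList<>();
        for (Glyph g : glyphs) {
            boolean duplicate = false;
            for (Glyph k : kept) {
                if (k.text.equals(g.text)
                        && Math.abs(k.x - g.x) <= tolerance
                        && Math.abs(k.y - g.y) <= tolerance) {
                    duplicate = true;
                    break;
                }
            }
            if (!duplicate) {
                kept.add(g);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up input: "B" is drawn twice, slightly offset, as some creators do to fake bold.
        List<Glyph> glyphs = List.of(
            new Glyph("B", 10.0, 10.0),
            new Glyph("B", 10.3, 10.0),
            new Glyph("o", 18.0, 10.0));
        StringBuilder sb = new StringBuilder();
        for (Glyph g : suppressDuplicates(glyphs, 0.5)) {
            sb.append(g.text);
        }
        System.out.println(sb); // prints Bo
    }
}
```

Note that this kind of filtering removes text; it would not by itself separate two columns that were written as a single fragment.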

//Complete code***********************************************

import java.util.List;
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import org.pdfbox.exceptions.InvalidPasswordException;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.pdmodel.interactive.documentnavigation.outline.PDDocumentOutline;
import org.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem;
import org.pdfbox.util.PDFText2HTML;
import org.pdfbox.util.PDFTextStripper;
import org.pdfbox.util.PDFTextStripperByArea;

/**
 * This is the main program that simply parses the pdf document and
 * transforms it into text.
 */
public class ExtractText
{
    /**
     * private constructor.
    */
    private ExtractText()
    {
        //static class
    }

    public static void main( String[] args ) throws Exception
    {
        try
        {
            int i = 1;
            String wordsep = null;
            String str = null;
            boolean flag = false;
            Writer output = null;
            PDDocument document = null;
            document = PDDocument.load( "53 Nostro Ofc Cofc Daily Position_AUS.pdf" );

            PDDocumentOutline root = document.getDocumentCatalog().getDocumentOutline();
            PDOutlineItem item = root.getFirstChild();
            PDOutlineItem item1 = item.getNextSibling();

            while( item1 != null )
            {
                System.out.println( "Item:" + item.getTitle() );
                System.out.println( "Item1:" + item1.getTitle() );
                output = new OutputStreamWriter( new FileOutputStream( "simple"+i+".txt" ) );
                PDFTextStripperByArea stripper = null;
                stripper = new PDFTextStripperByArea();
                List reg = stripper.getRegions();
                System.out.println(reg.size());

                wordsep = stripper.getWordSeparator();

                stripper.setStartBookmark(item);

                stripper.setLineSeparator("\n");
                stripper.setWordSeparator("  ");
                stripper.setPageSeparator("\n\n\n\n");
                stripper.setWordSeparator("   ");
                stripper.setEndBookmark(item1);
                //str = stripper.getText(document);
                //output.write( str, 0, str.length());

                stripper.writeText( document, output );
                i++;
                item = item.getNextSibling();
                item1 = item1.getNextSibling();
            }

            PDOutlineItem child = item.getFirstChild();
            PDOutlineItem child1 = new PDOutlineItem();
            while( child != null )
            {
                child1 = child;
                child = child.getNextSibling();
            }
            System.out.println( "Item:" + item.getTitle() );
            System.out.println( "Item1:" + child1.getTitle() );
            output = new OutputStreamWriter( new FileOutputStream( "simple"+i+".txt" ) );
            PDFTextStripperByArea stripper = null;
            stripper = new PDFTextStripperByArea();

            System.out.println("The word separator is"+flag);

            stripper.setLineSeparator("\n");

            stripper.setPageSeparator("\n\n\n\n");
            stripper.setWordSeparator("  ");
            stripper.setStartBookmark(item);
            stripper.setEndBookmark(child1);

            stripper.setShouldSeparateByBeads(stripper.shouldSeparateByBeads());
            stripper.writeText( document, output );

            output.close();
            document.close();
        }
        catch(Exception ex)
        {
            System.out.println(ex);
        }
    }
}

-----Original Message-----
From: Srinivas Krishna [mailto:Sri...@sa...] 
Sent: Monday, February 27, 2006 12:19 AM
To: rb...@br...
Subject: Extracting text from pdf

Hi Mr. Braman, 

I am Srinivas from Pune, India, working for Saama Technologies. As I am
facing a problem extracting text from PDF files using PDFBox, I need
some help from you regarding the problem. It would be a great help if you
could look into this problem. 

    My problem is that when I extract text from the PDF file, which
consists of a table, some columns of the table are joined together. It
is not happening with all the columns, only with some columns. For your
reference I am sending the PDF file, the text files I have extracted,
and the Java code which I used for extraction. So please have a look at
all these and let me know what exactly is happening there, if you come
to know. The files attached are 1 Java file, 1 PDF file, and the
remaining are text files, each consisting of the contents of one
bookmark in the PDF file.

With Regards.
Srinivas Krishna 
software developer 
Saama technologies india Pvt.Ltd, Pune.