Processing horizontally first then vertically
Brought to you by:
benlitchfield
I would like to see coalescing implemented so that all words
are appended horizontally first, then vertically. If this
feature is implemented properly, all the fields of a table
will be extracted and printed in the same order as in the
original PDF document.
Sample: Page 2 of PDFBox References. All content of
column Project Name will be extracted before column
License.
===========
Centric CRM
(http://www.centriccrm.com)
Free To Use But
Restricted/Commercial
The Most Advanced Open
Source CRM Software.
=============
Thanks,
-tan
Logged In: YES
user_id=601708
Text in a PDF document is drawn at x/y locations, which
means there is no relationship between pieces of text drawn
in a column. If you can propose an algorithm to determine
columns of text, then I will implement it. As a side note,
there is no such thing as a 'table' in a PDF document, only
lines drawn between two points and text drawn at x/y
locations. The only way a 'column' of text could be determined
is by analyzing the lines on the PDF document, which is not an
easy thing to do.
Ben Litchfield
Logged In: YES
user_id=683822
Ben,
Thanks for the quick response. Generally speaking, I highly
appreciate your effort in developing such a wonderful
open-source package.
I am interested in developing a PDF to RTF converter. Its
main features include keeping all text attributes such as
strikethrough, underline, font attributes, and spacing. In
the past, I successfully developed an application in C++ using
the XPDF package and added code to do what I want.
Now I would like to implement these features using PDFBox
to deploy the application in a J2EE environment.
Here's the basic algorithm they use in XPDF. First, they
build a linked list of string nodes. These string nodes
contain the x-y coordinates of text strings, like your
TextPosition instance; however, their string nodes also
contain the full bounding box, i.e. the lower-left X,Y and
upper-right X,Y. They call these yMin, yMax and xMin, xMax.
They store all these string nodes sorted by y first, then
by x.
Then they coalesce and merge all string nodes with the
same Y-coordinate first. With that approach I was able to
extract the text, convert it into RTF, and maintain the same
content and format as the PDF file.
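The steps described above could be sketched roughly as follows. This is only an illustration of the idea, not XPDF's or PDFBox's actual code: StringNode and RowCoalescer are hypothetical names, the yTolerance parameter is an assumption for deciding when two strings share a line, and y is assumed to increase down the page.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical string node holding a text run and its bounding box. */
final class StringNode {
    final String text;
    final double xMin, yMin, xMax, yMax;

    StringNode(String text, double xMin, double yMin, double xMax, double yMax) {
        this.text = text;
        this.xMin = xMin;
        this.yMin = yMin;
        this.xMax = xMax;
        this.yMax = yMax;
    }
}

/** Hypothetical helper: merges nodes horizontally first, then vertically. */
final class RowCoalescer {

    /**
     * Groups nodes whose yMin lies within yTolerance of the first node of
     * the current row, then joins each row from left to right.
     * Assumption: y grows downward, so a smaller yMin means higher on the page.
     */
    static List<String> coalesce(List<StringNode> nodes, double yTolerance) {
        // Copy so the caller's list is left untouched, then order by y.
        List<StringNode> sorted = new ArrayList<>(nodes);
        sorted.sort(Comparator.comparingDouble((StringNode n) -> n.yMin));

        List<String> lines = new ArrayList<>();
        List<StringNode> row = new ArrayList<>();
        for (StringNode n : sorted) {
            // A jump in y larger than the tolerance starts a new row.
            if (!row.isEmpty() && Math.abs(n.yMin - row.get(0).yMin) > yTolerance) {
                lines.add(joinRow(row));
                row = new ArrayList<>();
            }
            row.add(n);
        }
        if (!row.isEmpty()) {
            lines.add(joinRow(row));
        }
        return lines;
    }

    /** Orders one row by xMin and joins the strings with single spaces. */
    private static String joinRow(List<StringNode> row) {
        row.sort(Comparator.comparingDouble((StringNode n) -> n.xMin));
        StringBuilder sb = new StringBuilder();
        for (StringNode n : row) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(n.text);
        }
        return sb.toString();
    }
}
```

Called as RowCoalescer.coalesce(nodes, 1.0), this returns one string per visual line, so the cells of a table row come out together instead of column by column. The tolerance value would need tuning against real font sizes.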
I am trying to figure out how to add extra information to your
TextPosition class so that, later on, I will be able to
traverse the nodes in y-major order and build a list of these
string nodes.
If you can tell me how to obtain all the coordinate or
position information for a text string, I think I will be
able to implement these features. I will contribute the code
to your project.
I uploaded a header file from XPDF, a sample PDF file which I
tried to convert and an RTF file.
I am not trying to convert "TABLE" from PDF file. I
understand that concept does not exist in PDF.
Thanks,
Tan V. Nguyen
This is the header file from PDFtoHTML
Logged In: YES
user_id=683822
I uploaded an RTF file converted from the PDF file using my
application developed in C++.
PDFBox has moved to Apache. Please log issues there.
http://pdfbox.apache.org