#85 Processing horizontally first then vertically

closed
5
2010-04-07
2005-08-24
No

I would like to see coalescing implemented so that all
words are appended horizontally first, then vertically. If
this feature is implemented properly, all the fields of a
table will be extracted and printed in the same order as in
the original PDF document.

Sample: Page 2 of PDFBox References. All content of
column Project Name will be extracted before column
License.

===========
Centric CRM
(http://www.centriccrm.com)
Free To Use But
Restricted/Commercial
The Most Advanced Open
Source CRM Software.
=============

Thanks,

-tan

Discussion

  • Ben Litchfield

    Ben Litchfield - 2005-08-25

    Text in a PDF document is drawn at x/y locations, which
    means there is no relationship between pieces of text drawn
    in the same column. If you can propose an algorithm to
    determine columns of text, then I will implement it. As a
    side note, there is no such thing as a 'table' in a PDF
    document, only lines drawn between two points and text drawn
    at x/y locations. The only way a 'column' of text could be
    determined is by analyzing the lines on the PDF document,
    which is not an easy thing to do.

    Ben Litchfield

     
  • Tan V. Nguyen

    Tan V. Nguyen - 2005-08-25

    Ben,

    Thanks for the quick response. Generally speaking, I highly
    appreciate your effort in developing such a wonderful
    open-source package.
    I am interested in developing a PDF to RTF converter. Its
    main features include keeping all text attributes such as
    strikethrough, underline, font attributes, and spacing. In
    the past, I successfully developed an application in C++
    using the XPDF package and added code to do what I want.
    Now I would like to implement these features using PDFBox
    so I can deploy the application in a J2EE environment.

    Here's the basic algorithm they use in XPDF. First, they
    build a linked list of string nodes. These string nodes
    contain the x-y coordinates of text strings, like your
    TextPosition instance. However, their string nodes also
    contain full bounding-box information, including the lower
    left X,Y and upper right X,Y. They call these yMin, yMax
    and xMin, xMax. They store all these string nodes in
    y-major, x-minor order.

    Then they coalesce and merge all string nodes with the
    same y-coordinate first. That is how I was able to extract
    the text, convert it into RTF, and maintain the same content
    and format as the PDF file.
    I am trying to figure out how to add extra information to your
    TextPosition class, so that later on I will be able to traverse
    along the major y-axis and build a list of these string nodes.
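
    The coalescing step described above can be sketched in Java roughly
    as follows. This is only an illustration under stated assumptions,
    not XPDF's or PDFBox's actual code: StringNode, LineCoalescer, and
    the yTolerance parameter are hypothetical names, the y-axis is
    assumed to grow downward from the top of the page, and the
    coordinates in main are invented for the sample rows quoted in the
    request.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class LineCoalescer {

    /** Hypothetical XPDF-style string node with a bounding box; not a PDFBox class. */
    public static final class StringNode {
        final String text;
        final double xMin, yMin, xMax, yMax;

        public StringNode(String text, double xMin, double yMin, double xMax, double yMax) {
            this.text = text;
            this.xMin = xMin;
            this.yMin = yMin;
            this.xMax = xMax;
            this.yMax = yMax;
        }
    }

    /**
     * Groups nodes whose yMin values lie within yTolerance of the first node
     * of the current line, then joins each group left to right.
     * Assumes smaller y means higher on the page.
     */
    public static List<String> coalesce(List<StringNode> nodes, double yTolerance) {
        List<StringNode> sorted = new ArrayList<>(nodes);
        // Major axis: order all nodes by their vertical position.
        sorted.sort(Comparator.comparingDouble((StringNode n) -> n.yMin));

        List<String> lines = new ArrayList<>();
        List<StringNode> current = new ArrayList<>();
        for (StringNode node : sorted) {
            boolean startsNewLine = !current.isEmpty()
                    && Math.abs(node.yMin - current.get(0).yMin) > yTolerance;
            if (startsNewLine) {
                lines.add(joinLine(current));
                current = new ArrayList<>();
            }
            current.add(node);
        }
        if (!current.isEmpty()) {
            lines.add(joinLine(current));
        }
        return lines;
    }

    /** Minor axis: order one line's nodes by xMin and join them with spaces. */
    private static String joinLine(List<StringNode> line) {
        line.sort(Comparator.comparingDouble((StringNode n) -> n.xMin));
        StringBuilder sb = new StringBuilder();
        for (StringNode n : line) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(n.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Invented coordinates: three table columns, two text rows.
        List<StringNode> nodes = Arrays.asList(
                new StringNode("Centric CRM", 50, 100, 120, 110),
                new StringNode("(http://www.centriccrm.com)", 50, 112, 190, 122),
                new StringNode("Free To Use But", 200, 100, 280, 110),
                new StringNode("Restricted/Commercial", 200, 112, 310, 122),
                new StringNode("The Most Advanced Open", 350, 100, 470, 110),
                new StringNode("Source CRM Software.", 350, 112, 460, 122));
        for (String line : coalesce(nodes, 1.0)) {
            System.out.println(line);
        }
    }
}
```

    With these made-up coordinates, each printed line contains the
    strings that share a baseline across all three columns, which is the
    "horizontally first, then vertically" order. The tolerance is there
    because strings on one visual line rarely share an exactly equal y
    value; a real converter would probably also use xMax to decide how
    much spacing to emit between neighbouring strings.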

    If you can tell me how to obtain all of the coordinate or
    position information for a text string, I think I will be
    able to implement these features. I will contribute this
    code to your project.
    I uploaded a header file from XPDF, a sample PDF file that I
    tried to convert, and the resulting RTF file.
    I am not trying to convert a "TABLE" from a PDF file. I
    understand that this concept does not exist in PDF.

    Thanks,

    Tan V. Nguyen

     
  • Tan V. Nguyen

    Tan V. Nguyen - 2005-08-25

    This is the header file from PDFtoHTML

     
  • Tan V. Nguyen

    Tan V. Nguyen - 2005-08-25

    I uploaded an RTF file converted from the PDF file using my
    application developed in C++.

     
  • Ben Litchfield

    Ben Litchfield - 2010-04-07

    PDFBox has moved to Apache. Please log the issue there.

    http://pdfbox.apache.org

     
  • Ben Litchfield

    Ben Litchfield - 2010-04-07
    • status: open --> closed
     