
#111 Text Extraction with Formatting

Status: closed
Priority: 5
Updated: 2010-04-07
Created: 2006-11-02
Private: No

Is it possible to extract text from a PDF without
ignoring the formatting?

HTML tags might be used for example. I thought the
PDFText2Html class would do the trick but it does not.
Thank you for reading.

Discussion

  • Ben Litchfield

    Ben Litchfield - 2006-11-02

    Logged In: YES
    user_id=601708

    HTML tags are not used to format a PDF document. Font information is available, but getting what you want out of
    it can be tricky. You will need to extend PDFTextStripper and override writeCharacters to get formatting such as
    bold/italic. Is that what you are looking for?

    Ben
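
A minimal sketch of the bold/italic part of this suggestion, assuming the style has to be guessed from the font name (for example "Times-BoldItalic"), since PDF has no bold or italic flag on the text itself. The class `FontStyleTagger` and its methods are hypothetical names, not part of PDFBox. In a PDFTextStripper subclass, the font name would presumably come from the font attached to each text position inside the overridden method; the exact accessor differs between PDFBox versions, so the sketch takes a plain string instead.

```java
import java.util.Locale;

/** Hypothetical helper, not part of PDFBox: guesses bold/italic from a font name. */
final class FontStyleTagger {

    /** True if the font name suggests a bold face (heuristic only). */
    static boolean isBold(String fontName) {
        return fontName != null
                && fontName.toLowerCase(Locale.ROOT).contains("bold");
    }

    /** True if the font name suggests an italic or oblique face (heuristic only). */
    static boolean isItalic(String fontName) {
        if (fontName == null) {
            return false;
        }
        String n = fontName.toLowerCase(Locale.ROOT);
        return n.contains("italic") || n.contains("oblique");
    }

    /** Escapes the characters that would otherwise be read as markup. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Wraps the escaped text in <i> and/or <b> according to the font name. */
    static String tag(String text, String fontName) {
        String out = escape(text);
        if (isItalic(fontName)) {
            out = "<i>" + out + "</i>";
        }
        if (isBold(fontName)) {
            out = "<b>" + out + "</b>";
        }
        return out;
    }

    public static void main(String[] args) {
        // Font names here are made-up sample inputs.
        System.out.println(tag("She ", "Times-Italic")
                + tag("told ", "Times-Roman")
                + tag("me", "Times-Bold") + ".");
    }
}
```

Running `main` prints `<i>She </i>told <b>me</b>.`. Font names are not standardized, so a name-based check like this will miss fonts whose names do not mention their style.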

     
  • Anonymous

    Anonymous - 2006-11-02

    Logged In: YES
    user_id=1562185

    That's exactly what I am looking for. But is this not a
    priority issue for the PDFBox package? It would take me
    quite some time to extend the stripper on my own. One of the
    PDFBox developers might do it better I think.

    If you insist that it's a user's issue and PDFBox developers
    would not invest their time in such an extension, could you
    at least tell me whether you have any links to any
    information regarding this matter?

     
  • Ben Litchfield

    Ben Litchfield - 2006-11-02

    Logged In: YES
    user_id=601708

    Specifically, are you looking only for bold and italic, or for other things as well?

     
  • Anonymous

    Anonymous - 2006-11-02

    Logged In: YES
    user_id=1562185

    Uhmm... well bold, italic, underlined etc... would be a good
    beginning but my ultimate wish would be something like
    quoted below:

    <?xml version="1.0" encoding="ISO-8859-1"?>
    <!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">

    <pdf2xml>
    <page number="1" position="absolute" top="0" left="0"
    height="1262" width="892">
    <fontspec id="0" size="16" family="Times" color="#000000"/>
    <fontspec id="1" size="16" family="Times" color="#000000"/>
    <fontspec id="2" size="16" family="Times" color="#000000"/>
    <text top="110" left="106" width="137" height="18"
    font="0"><i>She </i>told <b>me</b>. äµß </text>
    </page>
    </pdf2xml>

    I think I have made a mistake by naming it "Text Extraction
    with Formatting"... I should have put my question under a
    more fitting title, something like "PDF to (HTML/)XML
    Conversion with formatting".

    Thank you very much for your prompt replies. ^_^
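
For illustration, one `<text>` element in the format quoted above could be assembled roughly as follows. This is only a sketch: the class `Pdf2XmlTextElement` and its methods are hypothetical, the coordinates are passed in as plain integers, and obtaining them (and the font id) from a PDF is assumed to happen elsewhere.

```java
/** Hypothetical helper: formats one <text> element in the pdf2xml-style layout quoted above. */
final class Pdf2XmlTextElement {

    /** Escapes the characters that are significant in XML content. */
    static String escapeXml(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /**
     * Builds the element. styledContent must already be escaped and may
     * contain inline markup such as <i>...</i> or <b>...</b>.
     */
    static String textElement(int top, int left, int width, int height,
                              int fontId, String styledContent) {
        return String.format(
                "<text top=\"%d\" left=\"%d\" width=\"%d\" height=\"%d\" font=\"%d\">%s</text>",
                top, left, width, height, fontId, styledContent);
    }

    public static void main(String[] args) {
        // Values copied from the sample in the post; the content is built by hand.
        String content = "<i>" + escapeXml("She ") + "</i>"
                + escapeXml("told ")
                + "<b>" + escapeXml("me") + "</b>"
                + escapeXml(". ");
        System.out.println(textElement(110, 106, 137, 18, 0, content));
    }
}
```

Escaping the raw text before adding the inline tags keeps characters such as `<` or `&` in the PDF text from breaking the XML.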

     
  • Raimi Rufai

    Raimi Rufai - 2007-04-26

    Logged In: YES
    user_id=1776491
    Originator: NO

    Hi Ben,

    I've extended PDFText2Html to handle bold and new lines (with <br> tags). However, I'm having trouble figuring out how to handle underlines.

    Also, I don't know how to post updates.

    Regards,

    Raimi

     
  • Anonymous

    Anonymous - 2007-04-26

    Logged In: YES
    user_id=1562185
    Originator: YES

    @ rrufai
    what is the trouble you have with handling underlines?

    You might send a compiled 32-bit windows or linux binary personally to me. (I'm a user of pdftohtml.)

     
  • Anonymous

    Anonymous - 2007-04-26

    Logged In: YES
    user_id=1562185
    Originator: YES

    @ rrufai

    > You might send a compiled 32-bit windows or linux binary personally to me.
    > (I'm a user of pdftohtml.)

    I messed things up. This was also PDFBox. Hehe, sorry.

     
  • Raimi Rufai

    Raimi Rufai - 2007-04-26

    Logged In: YES
    user_id=1776491
    Originator: NO

    What email address should I send it to?

     
  • Raimi Rufai

    Raimi Rufai - 2007-04-27

    Logged In: YES
    user_id=1776491
    Originator: NO

    It's sent.

     
  • Ben Litchfield

    Ben Litchfield - 2010-04-07

    PDFBox has moved to Apache. Please log the issue there.

    http://pdfbox.apache.org

     
  • Ben Litchfield

    Ben Litchfield - 2010-04-07
    • status: open --> closed
     
