Logged In: YES
user_id=601708
HTML tags are not used to format a PDF document. Font information is available, but getting what you want out of it
can be tricky. You will need to extend PDFTextStripper and override writeCharacters to get formatting such as bold/italic.
Is that what you are looking for?
Ben
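Once the overridden method has the font for a piece of text, the remaining work is deciding whether that font counts as bold or italic. Below is a minimal, self-contained sketch of that decision only, assuming the font name is already available as a plain string. FontStyleGuess and its methods are hypothetical names, not part of PDFBox, and guessing from the name is an assumption that will miss fonts whose names do not mention their style.

```java
import java.util.Locale;

/** Heuristic sketch: guess bold/italic from a font name such as "ABCDEF+Times-BoldItalic". */
class FontStyleGuess {

    /** Drops the six-character subset tag ("ABCDEF+") that embedded subset fonts carry. */
    static String stripSubsetPrefix(String fontName) {
        int plus = fontName.indexOf('+');
        return (plus == 6) ? fontName.substring(plus + 1) : fontName;
    }

    /** True if the name mentions "bold" (case-insensitive). */
    static boolean isBold(String fontName) {
        String n = stripSubsetPrefix(fontName).toLowerCase(Locale.ROOT);
        return n.contains("bold");
    }

    /** True if the name mentions "italic" or "oblique" (case-insensitive). */
    static boolean isItalic(String fontName) {
        String n = stripSubsetPrefix(fontName).toLowerCase(Locale.ROOT);
        return n.contains("italic") || n.contains("oblique");
    }

    /** Wraps text (assumed to be escaped already) in <b>/<i> tags according to the guess. */
    static String wrap(String text, String fontName) {
        String out = text;
        if (isItalic(fontName)) {
            out = "<i>" + out + "</i>";
        }
        if (isBold(fontName)) {
            out = "<b>" + out + "</b>";
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(wrap("She ", "Times-Italic"));
        System.out.println(wrap("told ", "Times-Roman"));
        System.out.println(wrap("me", "ABCDEF+Times-Bold"));
    }
}
```

In a real stripper subclass the font name would come from whatever text object the override receives; that wiring is left out here because it depends on the PDFBox version in use.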
Anonymous - 2006-11-02
Logged In: YES
user_id=1562185
That's exactly what I am looking for. But is this not a
priority issue for the PDFBox package? It would take me
quite some time to extend the stripper on my own, and one of the
PDFBox developers could probably do it better.
If you consider this a user's issue that the PDFBox developers
would not invest their time in, could you
at least point me to any links or other
information on the matter?
Logged In: YES
user_id=601708
Specifically, are you looking only for bold and italic, or for other things as well?
Logged In: YES
user_id=1562185
Uhmm... well, bold, italic, underlined etc. would be a good
beginning, but my ultimate wish would be something like
the output quoted below:
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">
<pdf2xml>
  <page number="1" position="absolute" top="0" left="0" height="1262" width="892">
    <fontspec id="0" size="16" family="Times" color="#000000"/>
    <fontspec id="1" size="16" family="Times" color="#000000"/>
    <fontspec id="2" size="16" family="Times" color="#000000"/>
    <text top="110" left="106" width="137" height="18" font="0"><i>She </i>told <b>me</b>. äµß </text>
  </page>
</pdf2xml>
I think I have made a mistake by naming it "Text Extraction
with Formatting"... I should have put my question under a
more fitting title, something like "PDF to (HTML/)XML
Conversion with formatting".
Thank you very much for your prompt replies. ^_^
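For anyone attempting this kind of output, here is a rough sketch of how a single text element in the style quoted in this comment could be assembled from styled runs. It is not PDFBox code: Pdf2XmlTextSketch, Run, escape and textElement are hypothetical names, and the attribute set simply mirrors the sample. It assumes the position, size and style of each run have already been worked out elsewhere.

```java
import java.util.Arrays;
import java.util.List;

/** Sketch: build one pdf2xml-style <text> element from styled runs. */
class Pdf2XmlTextSketch {

    /** One run of text sharing the same style (hypothetical helper type). */
    static final class Run {
        final String text;
        final boolean bold;
        final boolean italic;

        Run(String text, boolean bold, boolean italic) {
            this.text = text;
            this.bold = bold;
            this.italic = italic;
        }
    }

    /** Escapes the characters that would otherwise break XML character data. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Emits the element with the same attributes as the sample: top, left, width, height, font. */
    static String textElement(int top, int left, int width, int height, int fontId, List<Run> runs) {
        StringBuilder sb = new StringBuilder();
        sb.append("<text top=\"").append(top)
          .append("\" left=\"").append(left)
          .append("\" width=\"").append(width)
          .append("\" height=\"").append(height)
          .append("\" font=\"").append(fontId).append("\">");
        for (Run r : runs) {
            String t = escape(r.text);
            if (r.italic) {
                t = "<i>" + t + "</i>";
            }
            if (r.bold) {
                t = "<b>" + t + "</b>";
            }
            sb.append(t);
        }
        return sb.append("</text>").toString();
    }

    public static void main(String[] args) {
        List<Run> runs = Arrays.asList(
                new Run("She ", false, true),
                new Run("told ", false, false),
                new Run("me", true, false),
                new Run(". ", false, false));
        System.out.println(textElement(110, 106, 137, 18, 0, runs));
    }
}
```

Escaping before wrapping matters: the extracted text may itself contain < or &, while the <b> and <i> tags added afterwards must stay as markup.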
Logged In: YES
user_id=1776491
Originator: NO
Hi Ben,
I've extended PDFText2Html to handle bold and new lines (with <br> tags). However, I'm having trouble figuring out how to handle underlines.
Also, I don't know how to post updates.
Regards,
Raimi
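One likely reason underlines are harder than bold: PDF has no underline attribute on text. An underline is normally just a thin line or rectangle drawn beneath the glyphs, so the font data alone will not reveal it. A possible approach is to collect horizontal rules from the page and test them against each text box. The sketch below shows only that geometric test, assuming both the text boxes and the rules have already been gathered. UnderlineSketch, Box, HLine, the tolerance and the 80% coverage threshold are all hypothetical choices, not PDFBox API.

```java
/** Sketch: decide whether a horizontal rule sits under a text box closely enough to be an underline. */
class UnderlineSketch {

    /** Axis-aligned text box in page units, with y growing downward. */
    static final class Box {
        final double left;
        final double top;
        final double width;
        final double height;

        Box(double left, double top, double width, double height) {
            this.left = left;
            this.top = top;
            this.width = width;
            this.height = height;
        }
    }

    /** A horizontal rule running from x1 to x2 at vertical position y. */
    static final class HLine {
        final double x1;
        final double x2;
        final double y;

        HLine(double x1, double x2, double y) {
            this.x1 = x1;
            this.x2 = x2;
            this.y = y;
        }
    }

    /**
     * True if the rule lies within `tolerance` of the bottom edge of the box
     * and overlaps at least 80% of the box width.
     */
    static boolean isUnderlined(Box text, HLine line, double tolerance) {
        double bottom = text.top + text.height;
        boolean nearBottom = Math.abs(line.y - bottom) <= tolerance;
        double overlap = Math.min(line.x2, text.left + text.width) - Math.max(line.x1, text.left);
        boolean coversMost = overlap >= 0.8 * text.width;
        return nearBottom && coversMost;
    }

    public static void main(String[] args) {
        Box text = new Box(106, 110, 137, 18);
        System.out.println(isUnderlined(text, new HLine(106, 243, 129), 2.0));
        System.out.println(isUnderlined(text, new HLine(106, 243, 200), 2.0));
    }
}
```

The thresholds would need tuning against real documents, and table borders or strike-through lines can produce false positives with a test this simple.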
Anonymous - 2007-04-26
Logged In: YES
user_id=1562185
Originator: YES
@ rrufai
What trouble are you having with handling underlines?
You could send a compiled 32-bit Windows or Linux binary to me personally. (I'm a user of pdftohtml.)
Anonymous - 2007-04-26
Logged In: YES
user_id=1562185
Originator: YES
@ rrufai
> You might send a compiled 32-bit windows or linux binary personally to me.
> (I'm a user of pdftohtml.)
I messed things up. This was also PDFBox. Hehe, sorry.
Logged In: YES
user_id=1776491
Originator: NO
What email address should I send it to?
Logged In: YES
user_id=1776491
Originator: NO
It's sent.
PDFBox has moved to Apache. Please log issues there.
http://pdfbox.apache.org