[pdftohtml] "Simpler" Output


First off, a big "thanks" for delivering something that compiles cleanly on 
OS X with no fuss.

I'm really impressed with the fidelity of the pdftohtml output.  In fact 
the fidelity is so high that it's causing me some problems :)

I'm trying to create a service that is similar to the one provided by 
www.citeseer.com.  Basically, this service will create a citation database 
automatically given a set of source documents.  It turns out that the 
project is currently using pstotxt, which is no longer supported.  I 
was hoping to transition to using pdftohtml.

The problem is that the pdftohtml output is, well, just too detailed.  For 
my purposes what I really need is the text, font size, and font style 
information; the layout information is really noise.  Is there any easy way 
to obtain this subset of the information (especially in XML form)?  If not, 
how hard would it be to add in this functionality?  I would figure that it 
would be easy, given that it's a matter of throwing information away as 
opposed to keeping it, but I don't know.

I could possibly do this with some post-processing of the XML, but doing it 
in the conversion strikes me as a better approach.  Being able to specify 
"keep sizing, discard positioning, discard style" on the command line would 
be pretty cool.
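
For reference, here is a rough sketch of the post-processing fallback I have 
in mind.  It assumes the XML output puts positioning in attributes named 
top, left, width, height and position, and keeps font information in a 
"font" attribute and in fontspec elements; those names, the strip_layout 
helper and the sample document are my assumptions for illustration, not 
something taken from the pdftohtml documentation.

```python
import xml.etree.ElementTree as ET

# Attributes ASSUMED to carry only layout (position/extent) information.
LAYOUT_ATTRS = ("top", "left", "width", "height", "position")


def strip_layout(xml_text):
    """Return the XML with the assumed layout attributes removed everywhere.

    Text content, inline style tags (e.g. <b>, <i>) and font references
    are left untouched.
    """
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        for name in LAYOUT_ATTRS:
            elem.attrib.pop(name, None)  # ignore elements that lack the attribute
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample input, made up for this sketch.
SAMPLE = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<fontspec id="0" size="12" family="Times" color="#000000"/>
<text top="100" left="90" width="300" height="14" font="0"><b>Abstract</b></text>
</page>
</pdf2xml>"""

if __name__ == "__main__":
    print(strip_layout(SAMPLE))
```

This keeps the text, the font size (via fontspec) and the bold/italic tags, 
and throws away the positioning, which is roughly the subset I'm after.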

Anyways, thanks for any help and for making such a useful tool.

Jason Foster