[pdftohtml] "Simpler" Output
From: Jason F. <jaf...@uw...> - 2002-07-20 01:23:57
First off, a big "thanks" for delivering something that compiles cleanly on OS X with no fuss. I'm really impressed with the fidelity of the pdftohtml output. In fact, the fidelity is so high that it's causing me some problems :)

I'm trying to create a service similar to the one provided by www.citeseer.com. Basically, this service will automatically create a citation database from a set of source documents. The project currently uses pstotxt, which is no longer supported, and I was hoping to transition to pdftohtml.

The problem is that the pdftohtml output is, well, just too detailed. For my purposes, what I really need is the text, font size, and font style information; the layout information is really noise. Is there an easy way to obtain this subset of the information (especially in XML form)? If not, how hard would it be to add this functionality? I would figure it would be easy, given that it's a matter of throwing information away as opposed to keeping it, but I don't know.

I could possibly do this with some post-processing of the XML, but doing it in the conversion strikes me as a better approach. Being able to specify "keep sizing, discard positioning, discard style" on the command line would be pretty cool.

Anyway, thanks for any help and for making such a useful tool.

Jason Foster
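
For reference, the post-processing route mentioned above could look roughly like the sketch below. It is only an illustration under stated assumptions: it assumes the XML output uses `page`, `fontspec`, and `text` elements with positional attributes named `top`, `left`, `width`, `height`, and `position`, which may not match what a given pdftohtml version actually emits. The names `strip_layout`, `LAYOUT_ATTRS`, and `SAMPLE` are made up for this example.

```python
import xml.etree.ElementTree as ET

# Attribute names assumed to carry layout information.
# Adjust this tuple to match the real pdftohtml XML output.
LAYOUT_ATTRS = ("top", "left", "width", "height", "position")


def strip_layout(xml_text: str) -> str:
    """Return the XML with assumed layout attributes removed everywhere."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        for name in LAYOUT_ATTRS:
            # pop() with a default ignores attributes that are absent.
            elem.attrib.pop(name, None)
    return ET.tostring(root, encoding="unicode")


# Hand-written sample in the assumed format, not real tool output.
SAMPLE = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<fontspec id="0" size="12" family="Times" color="#000000"/>
<text top="100" left="50" width="200" height="14" font="0"><b>A Title</b></text>
</page>
</pdf2xml>"""

if __name__ == "__main__":
    print(strip_layout(SAMPLE))
```

Font size stays available through the `fontspec` entries and the `font` reference on each `text` element, and inline style markup such as `<b>` is left untouched, so only the positioning is discarded.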