Re: [pdftohtml] "Simpler" Output
From: Mikhail K. <me...@cs...> - 2002-07-20 01:39:58
Hi Jason,

I'm very busy right now, so I won't be able to add any such options in the near future (next couple of months?).

Have you tried pdftohtml -xml? It outputs XML with approximately the same level of detail as the HTML output, but you can of course postprocess it. Let me know if you need the XML output tweaked, and I'll try to find some time...

> First off, a big "thanks" for delivering something that compiles cleanly on
> OS/X with no fuss.
>
> I'm really impressed with the fidelity of the pdftohtml output. In fact
> the fidelity is so high that it's causing me some problems :)
>
> I'm trying to create a service that is similar to that provided by
> www.citeseer.com. Basically this service will create a citation database
> automatically given a set of source documents. It turns out that the
> project is currently using pstotxt, which is no longer being supported. I
> was hoping to transition to using pdftohtml.
>
> The problem is that the pdftohtml output is, well, just too detailed. For
> my purposes what I really need is the text, font size, and font style
> information; the layout information is really noise. Is there any easy way
> to obtain this subset of the information (especially in XML form)? If not,
> how hard would it be to add in this functionality? I would figure that it
> would be easy, given that it's a matter of throwing information away as
> opposed to keeping it, but I don't know.
>
> I could possibly do this with some post-processing of the XML, but doing it
> in the conversion strikes me as a better approach. Being able to specify
> "keep sizing, discard positioning, discard style" on the command line would
> be pretty cool.
>
> Anyways, thanks for any help and for making such a useful tool.
>
> Jason Foster
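
For what it's worth, the post-processing route discussed in this thread might look roughly like the Python sketch below. It is only an illustration under some assumptions: that the -xml output contains <fontspec> elements with "id" and "size" attributes and <text> elements carrying a "font" reference plus positional attributes ("top", "left", "width", "height"), with bold and italic marked by nested <b> and <i> tags. That matches later pdftohtml releases, but the output of the version discussed here may differ. The sample XML and the function name extract_text_runs are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the assumed shape of `pdftohtml -xml` output.
SAMPLE = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<fontspec id="0" size="18" family="Times" color="#000000"/>
<fontspec id="1" size="10" family="Times" color="#000000"/>
<text top="100" left="80" width="300" height="20" font="0"><b>A Sample Title</b></text>
<text top="140" left="80" width="200" height="12" font="1">Jason Foster</text>
</page>
</pdf2xml>"""


def extract_text_runs(xml_text):
    """Keep text, font size, and bold/italic style; drop all layout info."""
    root = ET.fromstring(xml_text)

    # Map font id -> size, collected over the whole document, since a
    # fontspec may be declared on an earlier page than the text using it.
    sizes = {f.get("id"): int(f.get("size")) for f in root.iter("fontspec")}

    runs = []
    for el in root.iter("text"):
        runs.append({
            # itertext() flattens nested <b>/<i> markup into plain text.
            "text": "".join(el.itertext()),
            # None if the font id was never declared.
            "size": sizes.get(el.get("font")),
            "bold": el.find(".//b") is not None,
            "italic": el.find(".//i") is not None,
        })
    return runs


if __name__ == "__main__":
    for run in extract_text_runs(SAMPLE):
        print(run)
```

The positional attributes are simply never read, so they disappear from the result. Real input would come from running pdftohtml -xml on a PDF and passing the file contents to the function.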