Re: [pdftohtml] output is (mostly) nonsense
From: Mikhail K. <me...@cs...> - 2006-10-04 13:54:06
This might mean that the PDF does not contain the proper text at all.
Unfortunately a lot of PDF generators create files with font subsets and
assign arbitrary codes to the letters used. pdftohtml can't handle that...

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> This is a resend. Since writing it, I found your archives, and I now
> see why you don't allow non-member posting. I assume you'll get through
> all that spam to find the real posts some time in the next couple
> years... in the mean time, I'll post this as a member.
>
> - --
> Hi,
>
> Please excuse me if there is an archive for this list; I couldn't find
> one or links to one on http://pdftohtml.sourceforge.net/.
>
> I'm using pdftohtml for the first time, and having looked through the
> man page, and tried many different configurations of command line
> options, I'm getting nothing like the pdf document.
>
> The html is fine, index and links and all, but the content of the pages
> looks like the following (I have a screenshot I could send, if that
> would help):
>
> !
> " #
> $ $%
> ! !
> !!&$"
> $! &$"
> $! &
> $
> " !
> $ $ '
> $$
> Every once in a while (almost once per page, but not quite), there is a
> line, and sometimes a paragraph, of text from the pdf. The number of
> pages is correct.
>
> I tried with -enc UTF-8, but it looks like there isn't a switch for
> input encoding, if I felt adventurous enough to play with that.
>
> Anyway, I'm assuming there is something straightforward that I'm
> missing, but I'm not sure what, and I haven't found this discussed.
>
> btw, I'm running Ubuntu 6.06.
>
> - --
> Kent Rasmussen
> SIL Eastern Congo Group Linguist
> 020 608593/4/5 x130
> 0733-710235(office)
> 0722-620510(office)
> 0735-539687(Personal)
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.2.2 (GNU/Linux)
>
> iD8DBQFFI1p5c7tUjlKyxNMRAui6AKCQAomj+1Z0KUSwD+GmytDBsHQGpwCgmYQ1
> 7R8iDG2q8Hi4DJ8OS48lF7s=
> =1Lo6
> -----END PGP SIGNATURE-----
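
To illustrate the font-subset explanation given in the reply above, here is a toy Python sketch. It is not pdftohtml's code and does not parse a real PDF; the function names are hypothetical, and the choice of starting the codes at 0x21 and leaving the space at 0x20 is an assumption made only so that the result resembles the quoted sample. It shows how a generator that numbers glyphs in order of first use produces strings that look like punctuation when the codes are read as plain ASCII, and how the text is only recoverable when a code-to-character table is available.

```python
def build_subset_encoding(text, first_code=0x21):
    """Toy subsetter: give each distinct non-space character the next
    free code, in order of first use (assumed behaviour, for illustration)."""
    encoding = {}
    for ch in text:
        if ch != " " and ch not in encoding:
            encoding[ch] = first_code + len(encoding)
    return encoding


def encode(text, encoding):
    """Turn text into the code sequence the toy subset font would store.
    Spaces are kept as 0x20 (an assumption of this sketch)."""
    return [encoding.get(ch, 0x20) for ch in text]


def naive_extract(codes):
    """Read the codes as if they were ASCII, ignoring the font's own mapping."""
    return "".join(chr(c) for c in codes)


def extract_with_map(codes, encoding):
    """Recover the text using a code-to-character table, if one is available."""
    reverse = {code: ch for ch, code in encoding.items()}
    return "".join(reverse.get(c, " ") for c in codes)


text = "hello world"
enc = build_subset_encoding(text)
codes = encode(text, enc)

print(naive_extract(codes))          # !"##$ %$&#'
print(extract_with_map(codes, enc))  # hello world
```

The naive output uses only the first few printable ASCII characters, much like the sample in the quoted message. In a real PDF the reverse table would have to come from the file itself (for example a ToUnicode map, when the generator includes one); if it is absent, a converter has no reliable way to know which letter each code stands for.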