pdfreflow documentation

Help
2010-05-25
2013-04-24
  • Pdfreflow is a command line tool that reflows PDF text. It takes the XML output of pdftohtml as input, reflows the text, removes page numbers, headers, footers, and hyphenation, and generates an HTML output file.

    In the attached .zip file is a version that will run under Windows XP, Mac OSX 10.5 Leopard, and Ubuntu 8.04 (and later). There is also pdfreflow.html, which is documentation of how to use the command, and how to find a prebuilt version of pdftohtml from Poppler.

    Pdfreflow is open source, licensed under the GNU GPL, and the source is available at SourceForge.

    Synopsis

    pdfreflow [options] [inputfile]

    Description

    Pdfreflow, in conjunction with pdftohtml, will convert a PDF into a reflowed HTML file. Pdfreflow operates on the XML output from pdftohtml (from the Poppler utilities), converting it into an HTML file. To get the XML input for pdfreflow, use pdftohtml as follows:

    pdftohtml -xml mybook.pdf
    

    The output of pdftohtml is in the file mybook.xml.

    General Usage

    Pdfreflow is designed for text-only ebook PDFs with minimal formatting, the kind of formatting you would find in a fiction novel. By default pdfreflow expects justified text, but you can specify that the input is rag right with the following option:

    pdfreflow --ragright mybook.xml
    

    The output of pdfreflow is in the file mybook.html.

    You might not want to reflow every page in your ebook. To specify which pages are NOT to be reflowed, use the following option:

    pdfreflow --dontreflow="1-6,10,198-201" mybook.xml
    

    The ‑‑dontreflow option takes a comma separated list of page ranges. The first page in a book is page 1. Also, the page number is not the printed page number, but the page number that shows in the thumbnail view of PDF viewers like Acrobat, Preview, Evince, etc.
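
    The page-range syntax can be illustrated with a short Python sketch. This is not pdfreflow's own parser; the function name parse_page_ranges is hypothetical and only shows how a list such as "1-6,10,198-201" expands into individual page numbers.

```python
def parse_page_ranges(spec):
    """Expand a spec such as "1-6,10,198-201" into a sorted list of pages."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start, end = (int(n) for n in part.split("-", 1))
        else:
            start = end = int(part)
        if start < 1 or end < start:
            # Pages start at 1, and a range must not run backwards.
            raise ValueError(f"bad page range: {part!r}")
        pages.update(range(start, end + 1))
    return sorted(pages)


print(parse_page_ranges("1-6,10,198-201"))
# [1, 2, 3, 4, 5, 6, 10, 198, 199, 200, 201]
```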

    Cropping

    While pdfreflow does its best to remove page numbers, headers and footers, you may have to assist by specifying the cropping options, ‑‑top=TOP_Y and ‑‑bottom=BOTTOM_Y. To find the Y values of a header or footer, look inside the .xml file and find the line of text that contains the header or footer. Sample entries look as follows:

    <text top="36" left="203" width="209" height="11" font="0">Self Knowledge</text>
    <text top="506" left="506" width="209" height="11" font="0">Self Realization</text>

    To crop both lines, pass their "top" values to the cropping options:

    pdfreflow --top=36 --bottom=506 mybook.xml
    

    In this example, every text line that has a "top" value less than or equal to 36 will be cropped, and every text line that has a "top" value that is greater than or equal to 506 will be cropped.
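
    The cropping rule can be sketched in Python as follows. This is an illustration of the comparison described above, not pdfreflow's code: the function name kept_text is hypothetical, and the sample XML is a made-up fragment shaped like pdftohtml output (the page attributes and the body line are assumptions).

```python
import xml.etree.ElementTree as ET

# Made-up fragment shaped like "pdftohtml -xml" output.
SAMPLE = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="540" width="720">
<text top="36" left="203" width="209" height="11" font="0">Self Knowledge</text>
<text top="80" left="72" width="400" height="11" font="0">Body text line.</text>
<text top="506" left="506" width="209" height="11" font="0">Self Realization</text>
</page>
</pdf2xml>"""


def kept_text(xml_string, top=None, bottom=None):
    """Return the text of every <text> element that survives cropping."""
    kept = []
    for elem in ET.fromstring(xml_string).iter("text"):
        y = int(elem.get("top"))
        if top is not None and y <= top:
            continue  # at or above the header line: cropped
        if bottom is not None and y >= bottom:
            continue  # at or below the footer line: cropped
        kept.append("".join(elem.itertext()))
    return kept


print(kept_text(SAMPLE, top=36, bottom=506))  # ['Body text line.']
```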

    Centered Text

    Pdfreflow does its best to detect centered text. Sometimes, especially with rag right text, it is hard to detect the center point. To improve center detection, you can identify a centered line in your document by giving its page number and line number. For example, if the 2nd line on page 3 is centered, pass a page:line argument to the ‑‑center option as follows (page numbers and line numbers both start at 1):

    pdfreflow --center=3:2 mybook.xml
    

    To discover the line number to specify for the ‑‑center option, you can use the ‑‑print option to print the contents of a page with line numbers.

    pdfreflow --print=3 mybook.xml
    

    Reflow Specified Pages

    It is also possible to reflow only a subset of the ebook by specifying the ‑‑first=FIRSTPAGE and ‑‑last=LASTPAGE options. This is useful if a book has sections with vastly different formatting. Create a separate HTML file for each differently formatted section, then concatenate the files together. If you are creating an e-book, concatenation is not necessary, because ebook creation software can take multiple HTML files as input.

    pdfreflow --first=1 --last=100 mybook.xml
    cp mybook.html section1.html
    pdfreflow --first=101 --last=200 mybook.xml
    cp mybook.html section2.html
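
    For a book with many sections, the same commands can be generated rather than typed by hand. The Python sketch below only builds the argument lists; the function name section_commands and the section boundaries are hypothetical, and only the ‑‑first and ‑‑last options come from this documentation.

```python
def section_commands(sections, xml_file="mybook.xml"):
    """Build one pdfreflow argument list per (first, last) page range."""
    return [
        ["pdfreflow", f"--first={first}", f"--last={last}", xml_file]
        for first, last in sections
    ]


for number, argv in enumerate(section_commands([(1, 100), (101, 200)]), start=1):
    # Each run writes mybook.html again, so copy it before the next run.
    print(" ".join(argv), "&&", f"cp mybook.html section{number}.html")
```

    Each argument list can be passed to subprocess.run once pdfreflow is on your path.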
    

    Files

    If the filename command line argument is specified, the file suffix is replaced with .html and the output is written to that file, i.e. an input file of mybook.xml has an output file of mybook.html. If no input file is specified, standard input is used as the input, and standard output is the output.

    pdfreflow < mybook.xml > out.html
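
    The naming rule can be written as a small Python sketch, which is handy when a script needs to know where pdfreflow will put its output. The function name output_name is hypothetical. How pdfreflow treats an input name with no suffix is not stated here, so that case is left to pathlib's behavior as an assumption.

```python
from pathlib import Path


def output_name(input_file=None):
    """Return the output path for an input file, or None for standard output."""
    if input_file is None:
        return None  # no input file: stdin is read and stdout is written
    # Replace the final suffix (.xml) with .html.
    return str(Path(input_file).with_suffix(".html"))


print(output_name("mybook.xml"))  # mybook.html
```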
    

    Options
    Here is the usage output for pdfreflow.

    usage: pdfreflow [options] [inputfile]
    Options:
          --absolute          font sizes are the same as the original document
                              (not the default)
      -b, --bottom=MAXTOP     crop text whose top is greater than or equal to maxtop
      -c, --center=SPEC       argument is page:line, ie 2:1 is line 1 on page 2
                              is a centered line (sometimes this hint is needed)
      -d, --dontreflow=PAGES  don't reflow comma separated page ranges,
                              i.e. "1,2,4-9,100"
      -f, --first=FIRSTPAGE   starting page (default is 1)
      -l, --last=LASTPAGE     ending page (default is last page of the document)
          --nonfiction        for books that use block quoting at the same
                              inset as the paragraph indent
      -r, --ragright          text is rag-right, NOT justify (default is justify)
      -t, --top=MINTOP        crop text whose top is less than or equal to mintop
          --shortlines        paragraphs end with short lines (only necessary
                              for rag right documents with no paragraph
                              indent and no after paragraph spacing)
          --showdebug         print debugging options
      -v, --version           print current version
      -?, --help              print this help
    

    Example
    Options can be combined. An example using a combination of the options in the description section is:

    pdfreflow --dontreflow="1-6,10,198-201" --top=36 --bottom=506 mybook.xml
    

    Troubleshooting

    While pdfreflow tries its best, it cannot correctly reflow every document. Here are some tips to get a better output document.

    Paragraphs are too large

    If your book does not have paragraph indenting or vertical spacing after every paragraph, too much text may be reflowed into each paragraph. You might try the ‑‑shortlines option. The argument is a percentage between 1 and 100. If 0 is specified, you get the default value (currently 80). This percentage is used against the longest line width in the document, and lines that are shorter than this percentage are considered the end of a paragraph.

    pdfreflow --shortlines=0 mybook.xml
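
    The percentage rule can be sketched in Python as follows. This shows the arithmetic described above and is not pdfreflow's implementation; the function name paragraph_ends and the sample line widths are made up for illustration.

```python
DEFAULT_PERCENT = 80  # the default this documentation gives for --shortlines=0


def paragraph_ends(line_widths, percent=0):
    """Flag each line whose width is below percent% of the longest line."""
    if percent == 0:
        percent = DEFAULT_PERCENT
    threshold = max(line_widths) * percent / 100
    return [width < threshold for width in line_widths]


# Made-up line widths: the longest line is 401, so the threshold is 320.8.
widths = [400, 398, 120, 401, 399, 250]
print(paragraph_ends(widths))  # [False, False, True, False, False, True]
```

    A lower percentage treats fewer lines as paragraph endings: with percent=50 only the line of width 120 is flagged.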
    

    Paragraphs are incorrectly reflowed

    If your input document is not justified, make sure you specified the ‑‑ragright option.

    Pdfreflow is configured to deal with fiction, which often has indented paragraphs and/or vertical spacing after a paragraph. If your book has indenting, but is not fiction with dialog, try using the ‑‑nonfiction option.

    pdfreflow --nonfiction mybook.xml
    

    If your book has sections with vastly different formatting, see the Reflow Specified Pages section above.

    Limitations

    • Only simple book formats are supported. This is not a general purpose reflower for a MS Word or desktop publishing document. Pictures are not supported.

    • Multiple columns are not supported.

    • Footnotes will cause problems. At this point they just show up wherever they are in the paragraph, potentially splitting a paragraph into two pieces.

    Getting pdfreflow

    There are binaries for Windows XP, Ubuntu 8.04, and Mac OSX 10.5 (and later) attached to this post. Pdfreflow is open source, licensed under the GNU GPL, and the source is available at SourceForge.

    Getting pdftohtml

    To get a copy of pdftohtml, without building it from source, here are some options:

    Ubuntu: Use Synaptic Package Manager to fetch poppler-utils

    Macintosh: Download Calibre for Mac. There is a copy of pdftohtml inside of Calibre.app under /Applications/calibre.app/Contents/Frameworks/

    PATH=$PATH:/Applications/calibre.app/Contents/Frameworks
    pdftohtml -xml mybook.pdf
    

    Windows: Download Calibre for Windows. There is a copy of pdftohtml inside of Calibre under C:\Program Files\Calibre2. Make sure to add C:\Program Files\Calibre2 and C:\Program Files\Calibre2\DLLs to your path, i.e.:

    PATH=%PATH%;C:\Program Files\Calibre2;C:\Program Files\Calibre2\DLLs
    pdftohtml -xml mybook.pdf
    

    Prana