#17 Pure text support?


This program is the closest I've come to a pure PDF-to-text
converter. If I could just get the program to skip the HTML
formatting, it would be perfect. Maybe a -text command line
argument could be added in the future?


  • Nobody/Anonymous

    Logged In: NO

    The xpdf package already comes with a command line utility
    that does this: pdftotext

    The beauty of pdftohtml/xml is that you get to keep all the
    information about formatting and layout so that you have
    more useful information.

  • David A. Gatwood

    Logged In: YES

    I don't know about the person who filed this bug, but personally, I'd like
    something halfway in-between the two.

    What I'd like is fundamentally a converter that preserves the spirit of the
    original flow rather than the letter of the original formatting.

    * Paragraph coalescing. Each paragraph wrapped in a <p> tag. Formatting
    changes marked up by wrapping individual words with <span> within it.

    * Table coalescing. Tabular-looking data wrapped into an actual <table> tag
    using a nearest-match algorithm.

    * Bullet list coalescing. Symbol font 'm' is a bullet. If it starts a line and it is
    indented more than the previous line, wrap it in <ul><li>.

    * Numbered list coalescing. If it is indented farther than the previous line and
    starts with a number, it's a list. Wrap it in <ol><li>.

    * Term-and-definition coalescing. Hanging lists and similar coalesced into
    term-and-definition lists. Wrap it in <dl><dt><dd>.

    * Section heading detection. A bold-faced chunk of text with everything
    under it indented is a section heading. Top level is <H1>, next nesting level
    is <H2>, and so on.
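
A rough sketch of how the bullet and numbered list rules above might look, purely as an illustration and not anything pdftohtml currently does. It assumes each line has already been reduced to a hypothetical (indent, text) pair, and it matches literal bullet characters instead of the Symbol font 'm', since font information is not available in plain text. The name coalesce_lists is made up for this example.

```python
import re
from html import escape

# Assumption: bullets appear as literal characters, not as Symbol font 'm'.
BULLET = re.compile(r"^[•*\-]\s+(.*)$")
NUMBER = re.compile(r"^\d+[.)]\s+(.*)$")


def coalesce_lists(lines):
    """Turn (indent, text) pairs into a list of HTML fragments.

    A line opens a list only if it looks like a list item AND is
    indented more than the previous line; following items of the same
    kind at the same indent are added to the open list.
    """
    out = []
    open_tag = None     # "ul", "ol", or None when no list is open
    list_indent = None  # indent of the currently open list
    prev_indent = 0
    for indent, text in lines:
        kind, body = None, text
        match = BULLET.match(text)
        if match:
            kind, body = "ul", match.group(1)
        else:
            match = NUMBER.match(text)
            if match:
                kind, body = "ol", match.group(1)

        continues = (open_tag is not None and kind == open_tag
                     and indent == list_indent)
        if continues:
            out.append(f"<li>{escape(body)}</li>")
        else:
            if open_tag:
                out.append(f"</{open_tag}>")
                open_tag = None
            if kind is not None and indent > prev_indent:
                open_tag, list_indent = kind, indent
                out.append(f"<{kind}>")
                out.append(f"<li>{escape(body)}</li>")
            else:
                # Not a list item by these rules: emit a plain paragraph.
                out.append(f"<p>{escape(text)}</p>")
        prev_indent = indent
    if open_tag:
        out.append(f"</{open_tag}>")
    return out


if __name__ == "__main__":
    sample = [(0, "Intro"), (4, "• one"), (4, "• two"), (0, "End")]
    print("\n".join(coalesce_lists(sample)))
```

Note that a numbered line which is not indented past the previous line stays a plain paragraph, which matches the indentation condition in the rules above. Nested lists and multi-line items are left out to keep the sketch short.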

    With such a feature, it would be far more practical to use this to generate
    intermediate HTML suitable for conversion into other, non-formatting-
    oriented formats such as DocBook XML.

    While it might be possible to implement such enhancements as a wrapper
    around this tool as-is, it is certainly not trivial. From the way you describe
    it, it sounds like it might be easier to add some formatting to pdftotext than
    to try to make this do what I want, but having never seen the output of
    pdftotext, I could be wrong.

