Extracting Words from PDF

Help
MG Balaji
2009-02-16
2013-04-24
  • MG Balaji
    2009-02-16

    Hi,

    pdftohtml is an excellent tool. I downloaded the "pdftohtml-0.39-win32" version and tried converting some PDFs to XML. It extracts the text line by line, with top, left, width and height information for each line.

    But I want to extract the text word by word, with top, left, width and height info for each word. Is that possible? Can anyone tell me how to get this?

    Thanks...

     
    • Matthew Potter
      2009-08-18

      Have you heard anything or figured anything out regarding this? I know there is a tool, "pdftoxml", which uses pdftohtml, but I can't seem to compile it on any of my Macs. pdftohtml works, but it extracts text line by line rather than word by word.
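
One possible workaround, sketched below, is to post-process the line-level XML and estimate a box for each word. This is only an approximation, not something pdftohtml does itself: it assumes each line is a `text` element carrying `top`, `left`, `width` and `height` attributes (as described in the first post), and it assumes every character in a line has the same width, which is not true for proportional fonts. The function names `split_line_into_words` and `words_from_xml` are made up for this example.

```python
import xml.etree.ElementTree as ET


def split_line_into_words(text, left, top, width, height):
    """Estimate a box per word by sharing the line width evenly per character.

    Assumption: all characters are equally wide, so boxes are approximate.
    """
    if not text:
        return []
    char_width = width / len(text)
    words = []
    pos = 0  # character offset of the current word within the line
    for word in text.split(" "):
        if word:
            words.append({
                "word": word,
                "left": round(left + pos * char_width),
                "top": top,
                "width": round(len(word) * char_width),
                "height": height,
            })
        pos += len(word) + 1  # +1 for the space that followed the word
    return words


def words_from_xml(xml_string):
    """Collect estimated word boxes from every <text> element in the XML."""
    root = ET.fromstring(xml_string)
    result = []
    for node in root.iter("text"):
        # itertext() also picks up text inside inline tags such as <b> or <i>
        line = "".join(node.itertext())
        result.extend(split_line_into_words(
            line,
            float(node.get("left")),
            float(node.get("top")),
            float(node.get("width")),
            float(node.get("height")),
        ))
    return result


if __name__ == "__main__":
    # Hypothetical one-line sample standing in for real pdftohtml XML output
    sample = (
        '<pdf2xml><page number="1">'
        '<text top="100" left="50" width="110" height="17" font="0">'
        'Hello <b>world</b></text>'
        '</page></pdf2xml>'
    )
    for box in words_from_xml(sample):
        print(box)
```

To try it on a real file, read the XML produced by pdftohtml into a string and pass it to `words_from_xml`. The estimated left and width values will drift on lines with mixed wide and narrow glyphs, so treat them as rough positions rather than exact coordinates.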