Extracting Words from PDF

MG Balaji
  • MG Balaji

    MG Balaji - 2009-02-16


    pdftohtml is an excellent tool. I downloaded the "pdftohtml-0.39-win32" version and tried converting some PDFs to XML. It extracts the text line by line, with top, left, width and height information for each line.

    But I want to extract the text word by word, with top, left, width and height information for each word. Is that possible? Can anyone tell me how to get this?


    • Matthew Potter

      Matthew Potter - 2009-08-18

      Have you heard anything or figured anything out regarding this? I know there is a tool, "pdftoxml", which uses pdftohtml, but I can't seem to compile it on any of my Macs. pdftohtml works, but it outputs line by line rather than word by word.
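
One possible workaround is to post-process the line-level XML and estimate a box for each word. The sketch below is not a pdftohtml feature. It assumes the XML contains `<text>` elements with `top`, `left`, `width` and `height` attributes, as described above, and it assumes every character in a line is equally wide, which is only roughly true for proportional fonts. The function names and the sample XML are made up for illustration.

```python
import xml.etree.ElementTree as ET


def split_line_into_words(top, left, width, height, text):
    """Estimate a box per word by sharing the line width equally among characters."""
    if not text:
        return []
    char_width = width / len(text)  # assumption: all characters are equally wide
    words = []
    pos = 0  # character offset of the current word within the line
    for word in text.split(" "):
        if word:
            words.append({
                "text": word,
                "top": top,
                "left": left + pos * char_width,
                "width": len(word) * char_width,
                "height": height,
            })
        pos += len(word) + 1  # +1 for the space that followed the word
    return words


def words_from_xml(xml_string):
    """Collect estimated word boxes from every <text> element in the XML."""
    root = ET.fromstring(xml_string)
    result = []
    for node in root.iter("text"):
        # itertext() also picks up text nested in inline tags such as <b> or <i>
        text = "".join(node.itertext())
        result.extend(split_line_into_words(
            float(node.get("top")), float(node.get("left")),
            float(node.get("width")), float(node.get("height")), text))
    return result


# Hypothetical sample in the assumed line-level format
sample = (
    '<page number="1">'
    '<text top="10" left="100" width="110" height="12" font="0">hello world</text>'
    '</page>'
)

for w in words_from_xml(sample):
    print(w["text"], w["top"], w["left"], w["width"], w["height"])
```

The estimated positions drift on lines that mix narrow and wide glyphs, so this is only useful when approximate word boxes are good enough. Exact boxes would need the per-glyph widths from the PDF fonts themselves.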

