pdftohtml is an excellent tool. I downloaded the "pdftohtml-0.39-win32" version and tried converting some PDFs to XML. It extracts the text line by line, with top, left, width and height information for each line.
But I want to extract the text word by word, with top, left, width and height information for each word. Is that possible? Can anyone tell me how to get this?
Have you heard anything or figured anything out regarding this? I know there is a tool, "pdftoxml", which uses pdftohtml, but I can't seem to compile it on any of my Macs. pdftohtml works, but its output is line by line rather than word by word.
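One rough workaround, in case it helps: post-process the line-level XML and split each line's box among its words in proportion to character count. The sketch below is only an approximation, not something pdftohtml itself does. It assumes the XML has `page` elements containing `text` elements with `top`, `left`, `width` and `height` attributes, and it assumes every character in a line has the same width, which is wrong for proportional fonts, so the horizontal positions will be slightly off. The function name `approximate_word_boxes` and the sample XML are made up for illustration.

```python
import xml.etree.ElementTree as ET


def approximate_word_boxes(xml_text):
    """Estimate per-word boxes from line-level <text> elements.

    Assumption: each character in a line is equally wide, so a word's
    box is its share of the line width by character position.
    """
    root = ET.fromstring(xml_text)
    words = []
    for page in root.iter("page"):
        page_number = int(page.get("number", "0"))
        for node in page.iter("text"):
            # itertext() also collects text inside nested <b>/<i>/<a> tags.
            line = "".join(node.itertext())
            if not line.strip():
                continue
            top = float(node.get("top"))
            left = float(node.get("left"))
            width = float(node.get("width"))
            height = float(node.get("height"))
            char_width = width / len(line)  # equal-width assumption
            pos = 0
            for word in line.split():
                start = line.index(word, pos)
                words.append({
                    "page": page_number,
                    "text": word,
                    "top": top,
                    "left": left + start * char_width,
                    "width": len(word) * char_width,
                    "height": height,
                })
                pos = start + len(word)
    return words


# Hypothetical sample in the assumed layout, not real tool output.
SAMPLE_XML = (
    '<pdf2xml><page number="1">'
    '<text top="100" left="50" width="110" height="15" font="0">'
    'Hello world</text>'
    '</page></pdf2xml>'
)

if __name__ == "__main__":
    for box in approximate_word_boxes(SAMPLE_XML):
        print(box)
```

To use it on real output you would read the XML file that pdftohtml wrote and pass its contents to the function. If you need exact word boxes, this estimate will not be good enough, because the true positions depend on the font's glyph widths.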