Question: match pdf output with pdf source
Brought to you by:
aldobu
I have a pdf file. I run mutool clean -d test.pdf out.pdf to decompress the streams.
I would like to know which part of the PDF source is responsible for rendering a given part of the page.
Ideal situation: clicking a letter on the PDF, I get the exact position of that letter in the source.
Hi, is it possible to achieve this?
Thanks
Hi,
interesting request, but I'm not sure I completely understand it.
Let's split the request:
Q1) Indicate a point (x,y) on page K, and then extract the word, or rather the indicated letter.
ANSWER: Yes, this is feasible, but it will require some new methods in tclMuPdf.
Currently tclMuPdf can only extract the text of a 'block' or a 'line' (see "pageObj lines" and "pageObj text ..." methods), or it can search for the position (bbox) of a given word or letter on one page, but honestly this is quite complicated for the stated purpose.
Anyway, this is an interesting request, and I'll think about it...
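To make Q1 concrete, here is a minimal Python sketch of the hit-test step, assuming a list of per-character bounding boxes is already available. The `Glyph` type, the `glyph_at` helper and the sample coordinates are all hypothetical; this is not the tclMuPdf API, only an illustration of the lookup itself.

```python
from typing import NamedTuple, Optional, Sequence


class Glyph(NamedTuple):
    """One character and its bounding box (hypothetical structure)."""
    char: str
    x0: float
    y0: float
    x1: float
    y1: float


def glyph_at(glyphs: Sequence[Glyph], x: float, y: float) -> Optional[Glyph]:
    """Return the first glyph whose box contains the point (x, y), or None."""
    for g in glyphs:
        if g.x0 <= x <= g.x1 and g.y0 <= y <= g.y1:
            return g
    return None


# Made-up boxes for the two letters of "Hi"; real values would come
# from whatever text-extraction layer is in use.
glyphs = [
    Glyph("H", 72.0, 700.0, 80.0, 712.0),
    Glyph("i", 80.0, 700.0, 83.0, 712.0),
]

hit = glyph_at(glyphs, 81.5, 705.0)
print(hit.char if hit else None)
```

The hard part in practice is obtaining the per-character boxes, not the lookup.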
Q2) After identifying the word or letter at position (x,y), find where this word or letter is physically present in the 'source' pdf. ...
ANSWER: I honestly think it's impossible, partly because the text in a PDF is often stored in compressed form.
E.g. this PDF shows the text "Hello World", but if you open the PDF with a text-editor, you will not find the words "Hello World"...
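The effect can be reproduced with a small Python sketch. It assumes the content stream is compressed with the common Flate (zlib) filter; the sample stream below is made up for illustration and is not taken from the attached file.

```python
import zlib

# A made-up page content stream that draws two lines of text.
content = (
    b"BT\n/F1 24 Tf\n72 700 Td\n(Hello World) Tj\n"
    b"0 -30 Td\n(Hello again) Tj\nET\n"
)

# What a Flate-encoded stream would hold instead of the plain operators.
compressed = zlib.compress(content)

found_in_plain = b"Hello World" in content
found_in_compressed = b"Hello World" in compressed
round_trip_ok = zlib.decompress(compressed) == content

print(found_in_plain, found_in_compressed, round_trip_ok)
```

A text editor sees the compressed bytes, so a plain search for the words fails until the stream is decompressed.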
Hi, thanks for considering this!
Running
mutool clean -d Hello.pdf out.pdf
I got the attached PDF, which has the string in clear text in the source.
Maybe we could get the object or stream corresponding to the (x,y) coordinates, and if the stream is not compressed we could get the exact location.
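As a rough illustration of that idea, the Python sketch below scans decompressed PDF bytes for literal strings followed by the `Tj` (show text) operator and reports their byte offsets. It rests on strong assumptions: the text is stored as a plain literal string in a simple encoding, not as a hex string, a `TJ` array, or glyph IDs of a subsetted font. The helper name `find_shown_strings` and the sample bytes are hypothetical.

```python
import re

# "(...) Tj": a literal string, allowing backslash escapes, then the operator.
TJ_PATTERN = re.compile(rb"\(((?:\\.|[^\\()])*)\)\s*Tj")


def find_shown_strings(pdf_bytes: bytes):
    """Return (byte_offset, raw_string) for every '(...) Tj' in pdf_bytes."""
    return [(m.start(1), m.group(1)) for m in TJ_PATTERN.finditer(pdf_bytes)]


# Made-up fragment resembling a decompressed content stream object.
sample = (
    b"4 0 obj\n<< /Length 43 >>\nstream\n"
    b"BT\n/F1 24 Tf\n72 700 Td\n(Hello World) Tj\nET\n"
    b"\nendstream\nendobj\n"
)

results = find_shown_strings(sample)
for offset, text in results:
    print(offset, text)
```

With a real file, one would pass in the bytes of the decompressed PDF, for example `open("out.pdf", "rb").read()`. Mapping a clicked (x,y) position to one of these offsets is a separate problem that this sketch does not attempt.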
Regarding the first topic:
this has been solved in tclMuPDF 2.5 with the introduction of the new "textfrombbox" method.
Regarding the second topic:
honestly, I don't understand the motivation (a practical use case) for this request. Please open a new ticket with more details.
I'm going to close this ticket.