Question: match pdf output with pdf source

Brought to you by: aldobu

#7 Question: match pdf output with pdf source

Milestone: tclMuPdf 2.x

Status: Done

Owner: nobody

Labels: None

Updated: 2025-05-17

Created: 2024-04-30

Creator: Eloi

Private: No

I have a PDF file. I run mutool clean -d test.pdf out.pdf to decompress the streams.
I would like to know which part of the PDF source is responsible for rendering a given part of the page.

Ideal situation: clicking a letter on the rendered PDF gives me the exact position of that letter in the source.
Hi, is it possible to achieve this?

Thanks

Discussion

Aldo Buratti - 2024-05-01

Hi,
interesting request, but I'm not sure I completely understand it.

Let's split the request:
Q1) Indicate a point (x,y) on page K, and then extract the word, or rather the indicated letter.

ANSWER: Yes, this is feasible, but it will require some new methods in tclMuPdf.
Currently tclMuPdf can only extract the text of a 'block' or a 'line' (see "pageObj lines" and "pageObj text ..." methods), or it can search for the position (bbox) of a given word or letter on one page, but honestly this is quite complicated for the stated purpose.

Anyway, this is an interesting request, and I'll think about it...
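The lookup described in Q1 amounts to a hit test of the point (x,y) against per-letter bounding boxes. The following is a minimal Python sketch of that idea only. It does not use tclMuPdf, and the Glyph type, the glyph_at function and the coordinates are all made up for illustration; how the boxes are obtained from a real page is left open.

```python
from typing import NamedTuple, Optional, Sequence


class Glyph(NamedTuple):
    """One letter with its bounding box (hypothetical type, not a tclMuPdf one)."""
    char: str
    x0: float
    y0: float
    x1: float
    y1: float


def glyph_at(glyphs: Sequence[Glyph], x: float, y: float) -> Optional[Glyph]:
    """Return the first glyph whose bounding box contains (x, y), or None."""
    for g in glyphs:
        if g.x0 <= x <= g.x1 and g.y0 <= y <= g.y1:
            return g
    return None


# Made-up boxes for the first letters of a line of text.
glyphs = [
    Glyph("H", 72.0, 700.0, 89.0, 724.0),
    Glyph("e", 89.0, 700.0, 102.0, 724.0),
    Glyph("l", 102.0, 700.0, 108.0, 724.0),
]

hit = glyph_at(glyphs, 95.0, 710.0)   # a point inside the box of "e"
miss = glyph_at(glyphs, 10.0, 10.0)   # a point outside every box
print(hit, miss)
```

A linear scan is enough for the few thousand letters on a typical page; a spatial index would only matter for much larger inputs.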

Q2) After identifying the word or letter at position (x,y), find where this word or letter is physically present in the 'source' pdf. ...

ANSWER: !! I honestly think it's impossible, also because the text in a PDF is often stored in compressed form.

E.g. this PDF shows the text "Hello World", but if you open the PDF with a text-editor, you will not find the words "Hello World"...

Hello.pdf
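One common reason the words cannot be found with a text editor is that the page content stream is zlib-compressed (the /FlateDecode filter). A minimal Python sketch of the effect follows. The content stream is made up for illustration and is not taken from Hello.pdf, and contains_text is a hypothetical helper.

```python
import zlib

# Made-up page content stream that shows "Hello World" twice (illustration only).
content = (
    b"BT\n"
    b"/F1 24 Tf\n"
    b"72 720 Td\n"
    b"(Hello World) Tj\n"
    b"0 -30 Td\n"
    b"(Hello World) Tj\n"
    b"ET\n"
)

# A /FlateDecode stream is stored in the file as zlib-compressed bytes.
compressed = zlib.compress(content)


def contains_text(data: bytes, text: bytes) -> bool:
    """Plain byte search, which is what a text editor's search amounts to."""
    return text in data


found_in_compressed = contains_text(compressed, b"Hello World")
found_in_decompressed = contains_text(zlib.decompress(compressed), b"Hello World")
print(found_in_compressed, found_in_decompressed)
```

Compression is not the only obstacle: text can also be written as hex strings, through fonts with custom encodings, or split into fragments, so a readable stream does not guarantee a searchable one.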


Eloi - 2024-05-02

Hi, thanks for considering this!

Running mutool clean -d Hello.pdf out.pdf I got the attached PDF, which has the string in clear text in the source.

Maybe we could get the object or stream corresponding to the (x,y) coordinates, and if the stream is not compressed we could get the exact location.

out.pdf
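For a decompressed file, the "exact location" part of this idea can be sketched as a byte search for the literal string operand. This is a minimal Python sketch under strong assumptions: the text is stored as a single PDF literal string such as (Hello World) with a simple one-byte encoding, and it is not escaped, hex-encoded or split across several operands. The locate_literal function and the sample bytes are made up for illustration, and the sketch does not map (x,y) coordinates to a stream.

```python
def locate_literal(data: bytes, text: str) -> list[tuple[int, int]]:
    """Return (byte_offset, line_number) for each literal string "(text)" in data."""
    needle = b"(" + text.encode("latin-1") + b")"
    hits = []
    start = data.find(needle)
    while start != -1:
        # Line numbers are 1-based: count the newlines before the match.
        line_number = data.count(b"\n", 0, start) + 1
        hits.append((start, line_number))
        start = data.find(needle, start + 1)
    return hits


# Made-up fragment of an uncompressed PDF, not the attached out.pdf.
sample = (
    b"%PDF-1.4\n"
    b"4 0 obj\n"
    b"<< /Length 42 >>\n"
    b"stream\n"
    b"BT /F1 24 Tf 72 720 Td (Hello World) Tj ET\n"
    b"endstream\n"
    b"endobj\n"
)

hits = locate_literal(sample, "Hello World")
for offset, line in hits:
    print(f"byte offset {offset}, line {line}")
```

If the same text is shown more than once, every occurrence is reported, so the result still has to be matched against the position clicked on the page.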


Aldo Buratti - 2024-05-06

status: New --> Accepted

Aldo Buratti - 2025-05-17

Regarding the first topic:

Q1) Indicate a point (x,y) on page K, and then extract the word, or rather the indicated letter.
This has been solved in tclMuPDF 2.5 with the introduction of the new "textfrombbox" method.

Regarding the second topic:

Q2) After identifying the word or letter at position (x,y), find where this word or letter is physically present in the 'source' pdf. ...
Honestly, I don't understand the motivation (a practical use case) for this request. Please open a new ticket with more details.

I'm going to close this ticket.

Aldo Buratti - 2025-05-17

status: Accepted --> Done