
#7 Question: match pdf output with pdf source

Milestone: tclMuPdf 2.x
Status: Done
Owner: nobody
Labels: None
Updated: 2025-05-17
Created: 2024-04-30
Creator: Eloi
Private: No

I have a pdf file. I run mutool clean -d test.pdf out.pdf to decompress the streams.
I would like to know which part of the pdf source is responsible for rendering a given part of the page.

Ideal situation: clicking a letter on the rendered pdf, I get the exact position of that letter in the source.
Hi, is it possible to achieve this?

Thanks

Discussion

  • Aldo Buratti

    Aldo Buratti - 2024-05-01

    Hi,
    interesting request, but I'm not sure I completely understand it.

    Let's split the request:
    Q1) Indicate a point (x,y) on page K, and then extract the word, or rather the indicated letter.

    ANSWER: Yes, this is feasible, but it will require some new methods in tclMuPdf.
    Currently tclMuPdf can only extract the text of a 'block' or a 'line' (see "pageObj lines" and "pageObj text ..." methods), or it can search for the position (bbox) of a given word or letter on one page, but honestly this is quite complicated for the stated purpose.

    Anyway, this is an interesting request, and I'll think about it...
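
The lookup described in Q1 can be sketched as a simple hit test. This is only an illustration in Python, not tclMuPdf code: the glyph list and the `glyph_at` helper are hypothetical, and it assumes that per-character bounding boxes are already available from some text extraction step.

```python
from typing import List, Optional, Tuple

# (character, (x0, y0, x1, y1)) pairs; hypothetical sample data, not tclMuPdf output.
Glyph = Tuple[str, Tuple[float, float, float, float]]


def glyph_at(glyphs: List[Glyph], x: float, y: float) -> Optional[str]:
    """Return the first character whose bounding box contains (x, y), or None."""
    for ch, (x0, y0, x1, y1) in glyphs:
        if x0 <= x <= x1 and y0 <= y <= y1:
            return ch
    return None


# Two made-up glyphs placed side by side on one text line.
glyphs = [
    ("H", (72.0, 700.0, 80.0, 712.0)),
    ("i", (80.0, 700.0, 83.0, 712.0)),
]
print(glyph_at(glyphs, 75.0, 705.0))
```

A linear scan is enough for a single page; a real implementation would also have to agree on the coordinate system (PDF user space versus screen pixels).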

    Q2) After identifying the word or letter at position (x,y), find where this word or letter is physically present in the 'source' pdf. ...

    ANSWER: !! I honestly think it's impossible, also because the text in a PDF is often stored in compressed form.

    E.g. this PDF shows the text "Hello World", but if you open the PDF with a text-editor, you will not find the words "Hello World"...
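
A minimal sketch of why a text editor does not show the words, assuming the page content is stored in a Flate (zlib) compressed stream, which is common but not universal. The content stream below is a made-up example, not taken from the attached file.

```python
import zlib

# A made-up page content stream that shows two lines of text.
content_stream = (
    b"BT /F1 24 Tf 72 700 Td (Hello World) Tj ET\n"
    b"BT /F1 24 Tf 72 670 Td (Hello World again) Tj ET\n"
)

# What a Flate-compressed PDF stream would hold instead of the readable operators.
compressed = zlib.compress(content_stream)

found_in_compressed = b"Hello World" in compressed
found_after_inflate = b"Hello World" in zlib.decompress(compressed)

print(found_in_compressed, found_after_inflate)
```

Even after decompression the string may not be readable: fonts with custom encodings, hex strings, or text split across several operators can all hide it.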

     
  • Eloi

    Eloi - 2024-05-02

    Hi, thanks for considering this!

    Running mutool clean -d Hello.pdf out.pdf I got the attached pdf, which has the string in clear text in the source.

    Maybe we could get the object or stream corresponding to the (x,y) coordinates, and if the stream is not compressed we could get the exact location.
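
A rough sketch of that last step, assuming the stream is already uncompressed and the text is written as a plain literal string followed by the `Tj` operator. The `find_shown_text` helper and the sample bytes are hypothetical; hex strings, `TJ` arrays, escaped parentheses, and non-standard font encodings are not handled.

```python
import re
from typing import List, Tuple

# Matches a simple literal string followed by the Tj (show text) operator.
# Escaped or nested parentheses, hex strings, and TJ arrays are not covered.
TJ_PATTERN = re.compile(rb"\(([^()\\]*)\)\s*Tj")


def find_shown_text(pdf_bytes: bytes, needle: bytes) -> List[Tuple[int, int]]:
    """Return (start, end) byte offsets of literal strings equal to needle shown with Tj."""
    return [
        m.span(1)
        for m in TJ_PATTERN.finditer(pdf_bytes)
        if m.group(1) == needle
    ]


# Made-up fragment of an uncompressed PDF object.
sample = (
    b"4 0 obj\n<< /Length 42 >>\nstream\n"
    b"BT /F1 24 Tf 72 700 Td (Hello World) Tj ET\n"
    b"endstream\nendobj\n"
)
offsets = find_shown_text(sample, b"Hello World")
print(offsets)
```

This only locates the string in the file; linking a clicked (x,y) position to one particular occurrence would still require tracking the text matrix while interpreting the stream.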

     
  • Aldo Buratti

    Aldo Buratti - 2024-05-06
    • status: New --> Accepted
     
  • Aldo Buratti

    Aldo Buratti - 2025-05-17

    Regarding the first topic:

    • Q1) Indicate a point (x,y) on page K, and then extract the word, or rather the indicated letter.
      This has been solved in tclMuPDF 2.5 with the introduction of the new "textfrombbox" method.

    Regarding the second topic:

    • Q2) After identifying the word or letter at position (x,y), find where this word or letter is physically present in the 'source' pdf. ...
      Honestly, I don't understand the motivation (a practical use case) for this request. Please open a new ticket with more details.

    I'm going to close this ticket.

     
  • Aldo Buratti

    Aldo Buratti - 2025-05-17
    • status: Accepted --> Done
     
