
Searching within PDFs

nurit
2021-09-17
2021-09-18
  • nurit

    nurit - 2021-09-17

    Hey Hypernomicon Community,

    Thank you for this amazing software!!

    There's a feature I've been wondering about: can we search within readable
    PDFs? My limited experience with Python has told me it's not too wild to
    extract the text from a readable PDF, but I'm not sure whether (a) this
    already exists or (b) it's way too complicated. In any case, I'd find it
    very useful (and I imagine many others would too).

    Thank you,
    Nurit

     
  • Jason Winning

    Jason Winning - 2021-09-17

    Thanks for your message. This is a feature I've been planning to add since almost the start of this project, and it is definitely feasible (in fact the application already uses third-party libraries to extract text from PDFs for purposes of getting information like ISBN to auto-fill bibliographic details, and there are other libraries available for indexing). I will get to it sooner or later (sooner if more people express interest).

    In the meantime, what I do (as a regular user of the application) when I want to search the full text of certain PDFs is to use the Queries tab to select the works that I want to search, then click the "Files" dropdown in the top-right of the tab and select "Clear Search Results Folder and Add All Results". This copies the listed PDFs into the "Search results" folder of your database (this is what that folder is for). With those PDFs isolated in one folder, it is easy to use a third-party application to do any kind of full-text search you want. On Windows I use a freeware application called Agent Ransack, which is loaded with features and probably more powerful than any built-in functionality added to Hypernomicon would be. It also searches many file types besides PDF (alternatives to Agent Ransack exist for Mac as well). I have meant to create a video showing this workflow but haven't gotten around to it yet.

    This way of doing things serves my purposes well enough, which is why I haven't added full-text search to the application, but I would do so if there is popular demand. One benefit of the built-in approach is that it would continually maintain an index, so that thousands of entire PDFs could be searched very quickly; some tools like Agent Ransack don't do that, so their searches take longer.
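
To illustrate why a maintained index makes repeated searches fast, here is a minimal Python sketch of an inverted index. It is not how Hypernomicon or Agent Ransack work internally; it only shows the general idea. It assumes the text of each PDF has already been extracted by some other tool, and the file names, sample text, and function names (`tokenize`, `build_index`, `search`) are all hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each word to the set of document names that contain it.

    `documents` is a dict of {name: extracted_text}. The text is assumed
    to have been extracted from each PDF beforehand.
    """
    index = defaultdict(set)
    for name, text in documents.items():
        for word in tokenize(text):
            index[word].add(name)
    return index


def search(index, query):
    """Return the sorted names of documents containing every query word."""
    words = tokenize(query)
    if not words:
        return []
    matches = set(index.get(words[0], set()))
    for word in words[1:]:
        matches &= index.get(word, set())
    return sorted(matches)


# Hypothetical extracted text standing in for real PDF contents.
documents = {
    "paper_a.pdf": "Mechanistic explanation in cognitive science",
    "paper_b.pdf": "Explanation and causation in biology",
    "paper_c.pdf": "A survey of cognitive architectures",
}
index = build_index(documents)
print(search(index, "cognitive explanation"))  # prints ['paper_a.pdf']
```

The index is built once (and would be updated as files change), so each query only looks up a few words instead of rereading every file. A brute-force tool has to scan all of the files on every search, which is the trade-off described above.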

    Of course, the more information you add in the form of database records, the less frequently you will (hopefully) need to rely on brute-force full-text searching through multiple PDFs. That is probably the other reason why I haven't added this feature yet: thanks to the other features of the application, I need the above workflow less and less often. But occasionally you do need to be able to search through multiple PDFs (sometimes even through ALL of your PDFs), and at those times I find myself wishing there were a convenient way to do that within Hypernomicon.

     
  • nurit

    nurit - 2021-09-18

    Thank you for your answer! I have DevonThink so I think that should do the trick for the time being but, as you say, it'd be awesome to just use Hypernomicon for all needs.

    PS: we're spreading the word of Hypernomicon here at NIU!

     
