pdf-doc-search Code

Pure Python tool for searching PDFs saved on local storage

Brought to you by: skrishnanv

File	Date	Author	Commit
pdf_searcher	2023-06-16	V Srikrishnan	[517f28] bugfix for setp.py installation
.gitignore	2023-05-02	V Srikrishnan	[db88ca] bug fixes
README.md	2023-05-03	V Srikrishnan	[42dae2] fix readme
install.sh	2023-05-02	V Srikrishnan	[3538d2] simple install script for gnu/linux
requirements.txt	2023-05-01	V Srikrishnan	[3e5d95] pdf word processing
setup.py	2023-06-16	V Srikrishnan	[517f28] bugfix for setp.py installation

Read Me

Motivation

This project grew out of a long-standing desire
to search easily within locally stored PDF files.
By "locally stored", I also mean files on network
drives (basically, any file that is accessible from
the local computer through a reachable file path).
I am aware of a feature called "Search within files"
on a particular commercial operating system.
I have not used this feature, so I am not sure how well
it works. My aim is not to compete with these
behemoths.
My requirements were simple:
- should work cross-platform
- should be easy to implement (and use)
- should target very specific applications, such as searching in
papers, publications, etc.

Quick start

Simple steps.
1. Clone the repository from the main branch.
2. In the cloned directory, do cd pdf_searcher and run
./install.sh there. An Internet connection is required
to download a couple of dependencies of NLTK, which you
can download yourself if you know what you are doing.

Hopefully, all goes well and there is no error message.

Usage

There is a boot-up step in which the user tells the
application where to look for PDFs. Once this has
completed successfully, the user can search for the desired
keywords.
There are two ways to run the application.
1. You can call the application from any directory as follows: python -m pdf_searcher.main -h. The output should be something like:

(python3.7) krishnan@DS[~]$ python -m pdf_searcher.main -h
usage: main.py [-h] [-u | -s SEARCH | -e EDIT | -E | -i | -c | -C] [-v]

optional arguments:
  -h, --help            show this help message and exit
  -u, --update          Update the database.
  -s SEARCH, --search SEARCH
                        Search the database for the word(s). Please use quotes
                        for multiple words
  -e EDIT, --edit EDIT  Edit the directory list. Note that no update will be
                        called automatically.
  -E, --edit_update     Edit and immediate run
  -i, --init            Initialise the database.
  -c, --clean           Clean DB only. Suggest to run with -u option after
                        this command.
  -C, --CLEAN           Clean DB AND scan directory list. Recommended to run
                        with -i after this option.
  -v, --verbose         Message detail ON/OFF(default)

As you can see, I am now in a different directory from the one where
the project is located. You can put the
above command in a script file to save
some typing each time.
2. The other way is to change the working directory to
where the code was downloaded. Note that there is a sub-directory
called pdf_searcher. Change into it, and you should see the following:

(python3.7) krishnan@DS[~/projects/pdf_searcher/pdf_searcher]$ ls
bg_process/  common_utils/  config.yaml  fg_process/  main.py

Then you can run it as follows:

(python3.7) krishnan@DS[~/projects/pdf_searcher/pdf_searcher]$ python main.py -h
usage: main.py [-h] [-u | -s SEARCH | -e EDIT | -E | -i | -c | -C] [-v]

optional arguments:
  -h, --help            show this help message and exit
  -u, --update          Update the database.
  -s SEARCH, --search SEARCH
                        Search the database for the word(s). Please use quotes
                        for multiple words
  -e EDIT, --edit EDIT  Edit the directory list. Note that no update will be
                        called automatically.
  -E, --edit_update     Edit and immediate run
  -i, --init            Initialise the database.
  -c, --clean           Clean DB only. Suggest to run with -u option after
                        this command.
  -C, --CLEAN           Clean DB AND scan directory list. Recommended to run
                        with -i after this option.
  -v, --verbose         Message detail ON/OFF(default)

Yes, it is a repeat of what you saw earlier.
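
For the first method, the script file that saves typing could be a small Python wrapper. The following is only a sketch: the function names build_command and run, and the file name pdfsearch.py used below, are assumptions made for illustration and are not part of the project. It simply forwards its arguments to python -m pdf_searcher.main.

```python
import subprocess
import sys


def build_command(args):
    """Return the command line for `python -m pdf_searcher.main <args>`."""
    # sys.executable keeps the call inside the currently active Python environment.
    return [sys.executable, "-m", "pdf_searcher.main", *args]


def run(args):
    """Run pdf_searcher with the given arguments and return its exit code."""
    return subprocess.call(build_command(args))
```

Saved as pdfsearch.py with a final line sys.exit(run(sys.argv[1:])), this could be called from any directory as python pdfsearch.py -s "faster spline".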

Initialise the database

Assume that you have chosen one of the above methods.
To save typing, I will use the second method
described above. To initialise, run
python main.py -i
If you are on GNU/Linux, after a few seconds of waiting (introduced deliberately)
you should see a simple console editor. In this editor, please
enter the full path of the directory where you store your PDFs.
Of course, you can enter multiple directories, but take care
to enter one directory per line. Please note that you do not have
to enter sub-directories. Also, avoid duplicate entries. Duplicates will
still work, but you will get duplicate results.
For example, I have added two directories,
/home/krishnan/Downloads
and /home/krishnan/papers.
Save and close the editor. If you have used pico, you know that
you first press Ctrl-O to write out the buffer and then Ctrl-X to exit.
You can see the keyboard shortcuts in the console window.
If you want to change the editor, that is not a problem.
Please open $HOME/.pdf_searcher/config.yaml
and change editor: pico to, say, editor: vim
or editor: emacs, or anything else for that matter.
If the OS is one of those wildly popular ones, the most commonly
available text editor on that platform should open up. I have not tested on
this platform yet.
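
To make the "one directory per line, no duplicates" rule concrete, here is a minimal sketch of how such a list could be cleaned up before scanning. It assumes the saved list is plain text with one path per line; the function name parse_directory_list is hypothetical, and this is not the project's own code.

```python
def parse_directory_list(text):
    """Return the directories listed in `text`, one per line, without duplicates."""
    seen = set()
    directories = []
    for line in text.splitlines():
        path = line.strip()
        if not path:
            continue  # skip blank lines
        # Treat "/a/b" and "/a/b/" as the same entry, but keep a bare "/" intact.
        key = path.rstrip("/\\") or path
        if key in seen:
            continue  # a duplicate entry would only produce duplicate results
        seen.add(key)
        directories.append(key)
    return directories
```

With the two example directories above plus a repeated entry, only /home/krishnan/Downloads and /home/krishnan/papers would remain, in the order they were entered.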

Once the file is saved and closed, the application will scan each of the
directories entered and
- look up PDF files
- extract keywords from the PDF files. Note that this won't work with PDFs that have scanned pages. Handling this is WIP.
- store them in an internal DB. This step may take some time, depending
on the number of files. There is ample scope for improvement, from batch processing of SQL writes to multi-processing.
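
The three steps above can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: the keywords table with its file and word columns is a made-up schema, the function names are hypothetical, and extract_keywords stands for whatever keyword extractor is used (it is passed in as a callable). The sketch writes one batch per file with executemany, in the spirit of the batch processing mentioned as a possible improvement.

```python
import sqlite3
from pathlib import Path


def find_pdfs(directories):
    """Yield every PDF below the given directories, sub-directories included."""
    for directory in directories:
        for path in sorted(Path(directory).rglob("*")):
            if path.is_file() and path.suffix.lower() == ".pdf":
                yield path


def index_pdfs(connection, directories, extract_keywords):
    """Store (file, word) rows for every PDF found and return the file count."""
    # Assumed schema: one row per (file, keyword) pair.
    connection.execute(
        "CREATE TABLE IF NOT EXISTS keywords (file TEXT, word TEXT)"
    )
    count = 0
    for pdf in find_pdfs(directories):
        # Drop repeated words so each keyword is stored once per file.
        rows = [(str(pdf), word) for word in sorted(set(extract_keywords(pdf)))]
        # executemany hands the whole batch to SQLite in a single call.
        connection.executemany("INSERT INTO keywords VALUES (?, ?)", rows)
        count += 1
    connection.commit()
    return count
```

Because find_pdfs walks sub-directories itself, only the top-level directories need to be listed.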

I hope the instructions are clear enough.

Search

The query is very simple. Assuming, as above, that you are using method 2, run
python main.py -s query
The query can be one word or a set of words within quotes.
The quotes are needed, and the words inside the quotes are ANDed for searching.
For example, "faster spline" will return files which have the keywords
"faster" AND "spline". Naturally, you can have more than two words.
Note that these words need not co-occur in any particular order.
As I said, this is a very simple search.
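
The AND behaviour described above can be illustrated with a small SQLite query. This is a sketch only: the keywords table with file and word columns is an assumed schema, keywords are assumed to be stored in lower case, and the function name search is hypothetical. A file matches when the number of distinct query words found for it equals the number of words in the query, so word order does not matter.

```python
import sqlite3


def search(connection, query):
    """Return the files whose stored keywords include every word in `query`."""
    words = sorted(set(query.lower().split()))
    if not words:
        return []
    placeholders = ", ".join("?" for _ in words)
    sql = (
        "SELECT file FROM keywords "
        f"WHERE word IN ({placeholders}) "
        "GROUP BY file "
        # Keep only files that matched all of the query words.
        "HAVING COUNT(DISTINCT word) = ? "
        "ORDER BY file"
    )
    return [row[0] for row in connection.execute(sql, (*words, len(words)))]
```

With this sketch, search(connection, "faster spline") and search(connection, "spline faster") return the same files.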

Update the database

The main intent is that, if you add more directories, or for whatever other reason, you can simply update
the DB again. For instance, suppose we have a better feature extractor and
wish to regenerate the keywords without re-entering the directories.
You can simply run python main.py -u
For other commands, please run python main.py -h or mail me (mailto:v. srikrishnan @gmailcom)
Mind the gaps and insert dots accordingly.

License

As of this version, there is no license. Basically, please
use it as you want and, if it is useful, please acknowledge by email.

Wishlist

If you really want to "reward" me (though I see no need for you
to do so), check here.
Again, this is not required.