File | Date | Author | Commit |
---|---|---|---|
pdf_searcher | 2023-06-16 | | [517f28] bugfix for setp.py installation |
.gitignore | 2023-05-02 | | [db88ca] bug fixes |
README.md | 2023-05-03 | | [42dae2] fix readme |
install.sh | 2023-05-02 | | [3538d2] simple install script for gnu/linux |
requirements.txt | 2023-05-01 | | [3e5d95] pdf word processing |
setup.py | 2023-06-16 | | [517f28] bugfix for setp.py installation |
This project grew out of a long-standing desire
to search easily within locally stored PDF files.
By "locally stored", I also mean network
drives (basically, any file that is
accessible from the local computer through a reachable
file path). I am aware of a feature called "Search
within files" on a particular commercial operating system.
I have not used this feature, so I am not sure how well
it works. My aim is not to compete with these
behemoths.
My requirements were simple:
- should work cross-platform
- should be easy to implement (and use)
- should serve a very specific application: searching in
papers, publications, etc.
Simple steps.
1. Clone the repository from the main branch.
2. In the cloned directory, do cd pdf_searcher
and run
./install.sh
there. An internet connection is required, mainly to
download a couple of NLTK dependencies,
which you can also download yourself if you know what
you are doing.
Hopefully, all goes well and no error message appears.
There is a boot-up step, in which the user
tells the application where to look for PDFs. Once this
completes successfully, the user can search for the desired
keywords.
There are two ways to run the application.
1. You can call the application from any directory location as follows: python -m pdf_searcher.main -h. The output should be something like:

    (python3.7) krishnan@DS[~]$ python -m pdf_searcher.main -h
    usage: main.py [-h] [-u | -s SEARCH | -e EDIT | -E | -i | -c | -C] [-v]

    optional arguments:
      -h, --help            show this help message and exit
      -u, --update          Update the database.
      -s SEARCH, --search SEARCH
                            Search the database for the word(s). Please use quotes
                            for multiple words
      -e EDIT, --edit EDIT  Edit the directory list. Note that no update will be
                            called automatically.
      -E, --edit_update     Edit and immediate run
      -i, --init            Initialise the database.
      -c, --clean           Clean DB only. Suggest to run with -u option after
                            this command.
      -C, --CLEAN           Clean DB AND scan directory list. Recommended to run
                            with -i after this option.
      -v, --verbose         Message detail ON/OFF(default)

As you can see, I am now in a different directory than where
the project was initially located. You can probably put the
above command in another script file so that you are saved
some typing each time.
2. The other way to run is to change the working directory to
where the code was downloaded. Note that there will be a sub-directory
called pdf_searcher. That is, you should see the below:

    (python3.7) krishnan@DS[~/projects/pdf_searcher/pdf_searcher]$ ls
    bg_process/ common_utils/ config.yaml fg_process/ main.py

Then, you can run it as below:

    (python3.7) krishnan@DS[~/projects/pdf_searcher/pdf_searcher]$ python main.py -h
    usage: main.py [-h] [-u | -s SEARCH | -e EDIT | -E | -i | -c | -C] [-v]

    optional arguments:
      -h, --help            show this help message and exit
      -u, --update          Update the database.
      -s SEARCH, --search SEARCH
                            Search the database for the word(s). Please use quotes
                            for multiple words
      -e EDIT, --edit EDIT  Edit the directory list. Note that no update will be
                            called automatically.
      -E, --edit_update     Edit and immediate run
      -i, --init            Initialise the database.
      -c, --clean           Clean DB only. Suggest to run with -u option after
                            this command.
      -C, --CLEAN           Clean DB AND scan directory list. Recommended to run
                            with -i after this option.
      -v, --verbose         Message detail ON/OFF(default)

Yes, it is a repeat of what you saw earlier.
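The suggestion in method 1 of putting the command into a script can be sketched roughly as follows. This is only an illustration: the helper names build_command and run are made up here and are not part of pdf_searcher, and it assumes the package is importable from the Python environment in use.

```python
import subprocess
import sys


def build_command(args):
    """Return the command line that runs pdf_searcher with the given arguments."""
    # sys.executable keeps the call inside the currently active Python environment.
    return [sys.executable, "-m", "pdf_searcher.main", *args]


def run(args):
    """Run pdf_searcher (assumed to be installed) and return its exit status."""
    return subprocess.call(build_command(args))


# Example call (not executed here): run(["-s", "faster spline"])
```

Saved in a small script of your own that calls run(sys.argv[1:]), this would let you type the script name instead of the full module invocation.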
Assume that you have chosen one of the above methods.
To save typing, I will use the second method
described above. To initialise, run
python main.py -i
If you are on GNU/Linux, after a few seconds of waiting (introduced deliberately)
you should see a simple console editor. In this editor, please
enter the full path of the directory where you store your PDFs.
Of course, you can enter multiple directories, but take care
to enter one directory per line. Please note that you do not have
to enter sub-directories. Also, avoid duplicate entries. Duplicates will
still work, but you will get duplicate results.
For example, I have added two directories:
/home/krishnan/Downloads
and /home/krishnan/papers
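If you keep your directory list in a file of your own, a small check like the one below can catch duplicate entries before you paste them into the editor. It is a sketch under assumptions: clean_directory_list is a hypothetical helper, not something pdf_searcher provides, and it only handles the one-path-per-line layout described above.

```python
import os


def clean_directory_list(text):
    """Return unique, normalised directories from one-path-per-line text."""
    seen = set()
    result = []
    for line in text.splitlines():
        path = line.strip()
        if not path:
            continue  # skip blank lines
        # normpath removes trailing slashes, so "/a/b/" and "/a/b" count as the same entry
        path = os.path.normpath(path)
        if path not in seen:
            seen.add(path)
            result.append(path)
    return result
```

The original order of the entries is preserved; only repeats and blank lines are dropped.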
Save and close the editor. If you have used pico, you know that
you have to first do Ctrl-O
to write out the buffer and then Ctrl-X
to exit.
You can see the keyboard shortcuts in the console window.
If you want to change the editor, that is not a problem.
Please open $HOME/.pdf_searcher/config.yaml
and change editor: pico
to, say, editor: vim
or editor: emacs
or anything else for that matter.
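If you prefer to make this change from a script, the following sketch rewrites the editor line in the text of the config file. It assumes that the file has a single top-level line starting with editor:, as in the example above; set_editor is a hypothetical helper and nothing else about the layout of config.yaml is assumed.

```python
import re


def set_editor(config_text, editor):
    """Return config_text with the value of the top-level `editor:` line replaced."""
    pattern = re.compile(r"^editor:.*$", flags=re.MULTILINE)
    # A function is used as the replacement so the editor name is inserted literally.
    new_text, count = pattern.subn(lambda match: f"editor: {editor}", config_text, count=1)
    if count == 0:
        raise KeyError("no top-level 'editor:' line found")
    return new_text
```

You would read $HOME/.pdf_searcher/config.yaml, pass its contents through set_editor, and write the result back; editing the file by hand is just as good.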
If the OS is one of those wildly popular ones, the most commonly
available text editor of that platform should open up. I have not tested this on
such platforms yet.
Once saved and closed, the application will scan each of the
directories entered and
- look up PDF files
- extract keywords from the PDF files. Note that this won't work with PDFs which have scanned pages. Handling this is WIP.
- store them in an internal DB. This step may take some time depending
on the number of files. There is ample scope for improvement, for example by batching SQL writes and by using multi-processing.
I hope the instructions are clear enough.
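To make the scan-and-store steps concrete, here is a minimal sketch of how they could look with batched SQL writes. It is not the code used in pdf_searcher: the table keywords(word, path), the function names, and the extract_keywords callable (which stands in for the real keyword extraction) are all assumptions made for illustration, and it assumes sub-directories are scanned recursively.

```python
import os
import sqlite3


def find_pdfs(directories):
    """Yield the path of every .pdf file below the given directories."""
    for directory in directories:
        for root, _dirs, files in os.walk(directory):
            for name in files:
                if name.lower().endswith(".pdf"):
                    yield os.path.join(root, name)


def index_pdfs(conn, directories, extract_keywords):
    """Store (word, path) rows for every PDF found and return the number of files."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS keywords "
        "(word TEXT, path TEXT, UNIQUE(word, path))"
    )
    count = 0
    for path in find_pdfs(directories):
        # extract_keywords is a placeholder supplied by the caller.
        rows = [(word.lower(), path) for word in set(extract_keywords(path))]
        # One executemany call per file batches the writes for that file.
        conn.executemany("INSERT OR IGNORE INTO keywords VALUES (?, ?)", rows)
        count += 1
    conn.commit()  # a single commit at the end avoids one transaction per row
    return count
```

Passing the extractor in as a function keeps the sketch independent of any particular PDF library, and it would also make it easy to hand files to worker processes later.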
The query is very simple. Assuming, as above, that you are using method 2, run
python main.py -s query
The query can be one word or a set of words within quotes.
Quotes are needed for multiple words, and the words inside the quotes are ANDed for searching.
For example, "faster spline" will return files which have the keywords
"faster" AND "spline". Naturally, you can have more than two words.
Note that these words need not co-occur in any particular order.
As I said, this is a very simple search.
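The AND behaviour described above can be illustrated with a short sketch. It assumes a hypothetical SQLite table keywords(word, path) with lower-case words; the actual schema and query used by pdf_searcher may well differ.

```python
import sqlite3


def search_all(conn, query):
    """Return the paths whose stored keywords include every word of the query."""
    words = sorted({word.lower() for word in query.split()})
    if not words:
        return []
    placeholders = ", ".join("?" for _ in words)
    sql = (
        "SELECT path FROM keywords "
        f"WHERE word IN ({placeholders}) "
        # a file qualifies only if it matched all distinct query words
        "GROUP BY path HAVING COUNT(DISTINCT word) = ? "
        "ORDER BY path"
    )
    return [row[0] for row in conn.execute(sql, (*words, len(words)))]
```

Because the words are matched as a set, "faster spline" and "spline faster" give the same result, which matches the note that word order does not matter.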
The main intent is that if you add more directories, or for whatever other reason, you can simply update
the DB again. For instance, suppose we have a better feature extractor and
wish to re-generate the DB contents without re-entering the directories.
You can simply run python main.py -u
For other commands, please run python main.py -h
or mail me (mailto:v. srikrishnan @gmailcom)
Mind the gaps and insert dots accordingly.
As of this version, there is no license. Basically, please
use it as you want and, if it is useful, please acknowledge by email.
If you really want to "reward" me (though I see no need for you
to do so), check here.
Again, this is not required.