Version 1.0 files

Name                     Modified      Size       Downloads / Week
HAWK-v1.0 JavaDoc.zip    2014-01-25    96.2 kB
README.txt               2014-01-25    8.8 kB
HAWK-v1.0 API.zip        2014-01-25    30.8 MB
Totals: 3 items                        30.9 MB    0

#################################################################
#################################################################
########## TITLE    =>  HAWK PDF Search Project        ##########
########## AUTHOR   =>  Swapnil Ashok Jadhav (saj1919) ##########
########## EMAIL-ID =>  dadajibudhau@gmail.com         ##########
########## WEBSITE  =>  http://geekdadaji.com          ##########
########## VERSION  =>  API-v1.0                       ##########
#################################################################
#################################################################

############################
### CONTENTS OF ZIP FILE ###
############################
	1) HAWK DOCUMENTATION
	2) HAWK JAVA API (Add to your project path and use its functionality)
	3) Wordnet Dictionary
	
####################			
#### ABOUT HAWK ####
####################

	1) HAWK is a Java project for searching inside a pdf file with several kinds of queries.
	2) A query can be a line, paragraph or page from the pdf, and may contain spelling mistakes, words in the wrong order, or both.
	3) The pdf must be indexed before it can be searched. Indexing takes ~5-10 minutes, depending on the size of the pdf file.
	4) Indexing is needed only once; the index files can be reused for the same pdf file.
	5) Keep all index files in the same folder, and keep wordnetDict in the main project folder.
	6) When indexing, give the path of the pdf file itself.
	7) When initializing and searching, give the path as "index_folder_path/pdf_name.txt" instead.
	8) The index must be initialized before searching.
	9) The index can be initialized for line-search, paragraph-search or page-search.
	10) A search takes ~5-15 seconds, depending on the length of the query.
	11) The longer the query and the more closely it resembles the target text, the more accurate the result.
	
	
###################################			
#### WHY HAWK PDF SEARCH JAVA  ####
###################################
	
	1) Pdf readers and viewers offer only exact keyword search. Smart text search has been missing for many years.
	2) HAWK improves pdf search: find the desired text without reading through the full document.
	3) The Java API can be integrated into other programs and tools related to document search.
	4) Why pdf search? Because documents of other types can be converted to pdf!


############################			
#### ABOUT THIS VERSION ####
############################

	1) This version uses an 'Arabian Nights' pdf, freely available on the internet, for the examples below.
	2) Indexes are created for line, paragraph and page search, and all pages are searchable.
	3) Search results show the details of each match: page number, paragraph number, score and matched text.
	4) All functions are public and accessible (except a few, kept non-public by programming convention).
	5) A new, improved full version will be available soon :)
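
	The match details listed in point 3 lend themselves to simple post-processing, such as keeping only strong
	matches. The sketch below is not HAWK code: 'Match' is a stand-in for HAWK's resultInfo that borrows its field
	names from this README, the field types are guesses, and it assumes that a higher score means a better match.
	The sample values in main are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class ResultFilterSketch {

    /** Stand-in for HAWK's resultInfo. Field names follow the README; the types are assumptions. */
    static final class Match {
        final int pageNumber;
        final int paraNumberInPage;
        final double textScore;
        final String matchedText;

        Match(int pageNumber, int paraNumberInPage, double textScore, String matchedText) {
            this.pageNumber = pageNumber;
            this.paraNumberInPage = paraNumberInPage;
            this.textScore = textScore;
            this.matchedText = matchedText;
        }
    }

    /** Keeps matches scoring at least minScore, highest score first (assumes higher is better). */
    static List<Match> bestFirst(List<Match> matches, double minScore) {
        List<Match> kept = new ArrayList<>();
        for (Match m : matches) {
            if (m.textScore >= minScore) {
                kept.add(m);
            }
        }
        kept.sort(Comparator.comparingDouble((Match m) -> m.textScore).reversed());
        return kept;
    }

    public static void main(String[] args) {
        // Made-up sample values, only to exercise the filter.
        List<Match> matches = Arrays.asList(
            new Match(12, 1, 0.35, "some other paragraph"),
            new Match(33, 4, 0.91, "Sinbad said, and he blew down to the beach"),
            new Match(40, 2, 0.62, "a ship that had just tied up"));
        for (Match m : bestFirst(matches, 0.5)) {
            System.out.println("PAGE => " + m.pageNumber + "  SCOR => " + m.textScore);
        }
    }
}
```

	With the real API, the same filter could be applied to the array returned by querySearchResultArray after
	checking the actual field types in the JavaDoc.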
	
	
#################################	
#### HOW TO USE HAWK v1.0    ####
#################################

	1) In Eclipse or NetBeans put "WordnetDict" in the project folder.
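	2) Add the following imports
	
		import geekSearch.indexUtils;
		import geekSearch.queryUtils;
		import infoClasses.resultInfo;
		import java.io.BufferedReader;      /// only needed for the interactive examples in step 5
		import java.io.InputStreamReader;   /// only needed for the interactive examples in step 5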
		
	3) How to Index ????
	
		public static void main(String args[])throws Exception
		{	
			String pdfpath = "/home/saj/Desktop/geekDadaji/pdfpath/arabian_nights.pdf";
			indexUtils.runIndexUtil(pdfpath);
		}
	
	4) How To Initialize ????
		
		public static void main(String args[])throws Exception
		{
			int option = 2;
			int topmatches = 10;
			String indexpath = "/home/saj/Desktop/geekDadaji/indexpath/arabian_nights.pqr";
			queryUtils.initPdfSearch(indexpath, option);
		}

		=> 'indexpath' (in 4) can be the same as 'pdfpath' (in 3) if the generated index files are kept next to the pdf.
		=> The index files and the pdf need not be in the same folder, but all index files must be in one folder.
		=> For indexpath, give the folder that contains the indexes (here '/home/saj/Desktop/geekDadaji/indexpath/'),
		   followed by the pdf file name with a '.pqr' extension, as shown above. The extension can be any 3 characters,
		   for example pdfname.xyz or pdfname.abc.
		=> This convention applies only to steps 4 and 5, not to step 3.
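
	The index path convention described above can be sketched as a small helper. IndexPathHelper and indexPathFor
	are hypothetical names, not part of the HAWK API; the helper only builds the string that steps 4 and 5 expect
	(index folder + pdf base name + a 3-character extension) and does not check that the index files exist.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper, not part of HAWK: builds the index path used in steps 4 and 5. */
final class IndexPathHelper {

    /**
     * @param pdfPath     path of the pdf that was indexed in step 3
     * @param indexFolder folder that holds all generated index files
     * @param extension   any 3-character extension, e.g. "txt" or "pqr"
     */
    static String indexPathFor(String pdfPath, String indexFolder, String extension) {
        if (extension == null || extension.length() != 3) {
            throw new IllegalArgumentException("Extension must be exactly 3 characters");
        }
        Path fileNamePath = Paths.get(pdfPath).getFileName();
        if (fileNamePath == null) {
            throw new IllegalArgumentException("No file name in path: " + pdfPath);
        }
        String fileName = fileNamePath.toString();
        // Strip the last extension ("arabian_nights.pdf" becomes "arabian_nights").
        int dot = fileName.lastIndexOf('.');
        String baseName = (dot > 0) ? fileName.substring(0, dot) : fileName;
        return Paths.get(indexFolder, baseName + "." + extension).toString();
    }

    public static void main(String[] args) {
        String indexpath = indexPathFor(
            "/home/saj/Desktop/geekDadaji/pdfpath/arabian_nights.pdf",
            "/home/saj/Desktop/geekDadaji/indexpath",
            "pqr");
        System.out.println(indexpath);
    }
}
```

	The returned string is what would then be passed to queryUtils.initPdfSearch as 'indexpath'.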
	
	5) How to Search ????
	
		public static void main(String args[])throws Exception
		{
			int option = 2;
			int topmatches = 10;
			String indexpath = "/home/saj/Desktop/geekDadaji/indexpath/arabian_nights.pqr";
			queryUtils.initPdfSearch(indexpath, option);
		
			String query = "ride people sindabad beach ship tied hitching rack place";
			String searchResult = queryUtils.querySearch(query, topmatches, option);
		
			System.out.println(searchResult);
		}
		
	    OR
	    
	    public static void main(String[] args) throws Exception
		{
			/*
			 * Enter a search query from the indexed book and after some seconds you will see the result.
			 */
		
			int option = 2; /// option 1 for line, option 2 for paragraph, option 3 for pages
			int topmatches = 10; /// number of matches to show
			String indexpath = "/home/saj/Desktop/geekDadaji/indexpath/arabian_nights.txt";
		
			queryUtils.initPdfSearch(indexpath, option); /// initializing indexes
		
			BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
			while(true)
			{
				System.out.print("Insert Query : ");
				String query = br.readLine();
				if(query == null) break; /// stop at end of input
				String searchResult = queryUtils.querySearch(query, topmatches, option);
				System.out.println(searchResult);
			}
		}
		
		OR
		
		public static void main(String[] args) throws Exception
		{
			/*
			 * Enter a search query from the book and after some seconds you will see the result.
			 */
		
			int option = 2; /// option 1 for line, option 2 for paragraph, option 3 for pages
			int topmatches = 10; /// number of matches required
			String indexpath = "/home/saj/Desktop/geekDadaji/indexpath/arabian_nights.txt";
		
			queryUtils.initPdfSearch(indexpath, option); /// initializing indexes
		
			BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
			while(true)
			{
				System.out.print("Insert Query : ");
				String query = br.readLine();
				if(query == null) break; /// stop at end of input
				resultInfo search_results[] = queryUtils.querySearchResultArray(query, topmatches, option);
				for(resultInfo result : search_results)
				{
					System.out.println("PAGE => "+result.pageNumber);
					System.out.println("PARA => "+result.paraNumberInPage);
					System.out.println("SCOR => "+result.textScore);
					System.out.println("TEXT => "+result.matchedText);
					System.out.println("");
				}
				System.out.println("=========================================================");
				System.out.println("=========================================================");
			}
		}
		
		Refer to the documentation for more details.
		Simple, isn't it?
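
	The examples above pass the search type as a bare integer (1 for line, 2 for paragraph, 3 for page). A small
	enum can make call sites easier to read. SearchMode is a hypothetical helper, not part of the HAWK API; only
	the three integer codes come from this README.

```java
/** Hypothetical wrapper around HAWK's integer 'option' codes (1 = line, 2 = paragraph, 3 = page). */
enum SearchMode {
    LINE(1),
    PARAGRAPH(2),
    PAGE(3);

    private final int option;

    SearchMode(int option) {
        this.option = option;
    }

    /** The integer to pass as 'option' to the HAWK calls shown in steps 4 and 5. */
    int option() {
        return option;
    }

    /** Maps an integer code back to its mode; rejects anything other than 1, 2 or 3. */
    static SearchMode fromOption(int option) {
        for (SearchMode mode : values()) {
            if (mode.option == option) {
                return mode;
            }
        }
        throw new IllegalArgumentException("Unknown search option: " + option);
    }

    public static void main(String[] args) {
        for (SearchMode mode : values()) {
            System.out.println(mode + " => option " + mode.option());
        }
    }
}
```

	A call would then read, for example, queryUtils.initPdfSearch(indexpath, SearchMode.PARAGRAPH.option()).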


########################		
#### WHAT TO SEARCH ####
########################

	Try these queries after indexing and initializing the given pdf (arabian_nights.pdf). The target text is on page 33 of the pdf.
	
		1) "You won’t ride any more people around this place, I guess,” Sinbad said, and he blew down to the beach and got on board a ship that had just tied up to the hitching-rack"
			(Exact match. Surely you will never write such a long query!)
		2) "You ride more people around this place Sinbad said blew down beach got on board ship to the hitching-rack"
			(fewer words used)
		3) "place got the hitching-rack on board ship to blew down Sinbad said around this beach You ride more people"
			(word order changed and fewer words used)
		4) "u rid mor piple arund that plc Seenbad told blown dawn bich get con bord sheep 2 d htchng rck"
			(many spelling mistakes and fewer words, yet still ranked in the top 10!)
		5) "u rid mor peple arund ths plc Seenbad sad blow dawn bech get o bord sheep 2 d htchng rck"
			(spelling mistakes and fewer words used)
		6) "dawn bech get o bord peple plc Seenbad 2 d htchng rck sheep arund ths sad blow u rid mor"
			(spelling mistakes, word order changed and fewer words used)
		7) "ride people sindabad beach ship tied hitching rack place"
			(Try only a few important words!)
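
	This README does not describe how HAWK scores a query, so the following is NOT HAWK's algorithm. It is only a
	minimal sketch of one common idea that tolerates the misspelled and reordered queries listed above: compare
	each query word with every word of the text using edit distance, keep the best similarity per query word, and
	average the results, ignoring word order. All names in it are hypothetical.

```java
import java.util.Locale;

/** Illustration only, not HAWK's method: order-insensitive fuzzy word overlap. */
final class FuzzyOverlapSketch {

    /** Classic dynamic-programming Levenshtein edit distance between two words. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Similarity in [0, 1]: 1 minus the edit distance divided by the longer word's length. */
    static double similarity(String a, String b) {
        int longer = Math.max(a.length(), b.length());
        return (longer == 0) ? 1.0 : 1.0 - (double) editDistance(a, b) / longer;
    }

    /** Lower-cases the input and splits it on anything that is not a letter or digit. */
    static String[] tokens(String s) {
        return s.toLowerCase(Locale.ROOT).trim().split("[^a-z0-9]+");
    }

    /** Average, over the query words, of the best similarity to any text word. */
    static double score(String query, String text) {
        String[] textWords = tokens(text);
        double total = 0.0;
        int counted = 0;
        for (String queryWord : tokens(query)) {
            if (queryWord.isEmpty()) {
                continue;
            }
            double best = 0.0;
            for (String textWord : textWords) {
                if (!textWord.isEmpty()) {
                    best = Math.max(best, similarity(queryWord, textWord));
                }
            }
            total += best;
            counted++;
        }
        return (counted == 0) ? 0.0 : total / counted;
    }

    public static void main(String[] args) {
        String text = "he blew down to the beach and got on board a ship";
        System.out.println(score("beach ship board", text));
        System.out.println(score("bech sheep bord", text));
        System.out.println(score("bord sheep bech", text));
    }
}
```

	Because each query word is matched on its own, changing the word order does not change the score, and a
	misspelled word still earns partial credit from its closest match in the text.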
	
	
######################		
#### FROM CREATOR ####
######################	
	
		=> FEEL FREE TO GIVE ME ANY SUGGESTIONS OR INFORMATION !!!
		=> THANK YOU FOR DOWNLOADING/USING THIS STAND-ALONE SEARCHING PROJECT.
		=> SUPPORT/DONATE THE PROJECT.
		=> WANT TO COLLABORATE ... EMAIL ME !!!
		
		
#################################################################
######################## END OF README ##########################
#################################################################
Source: README.txt, updated 2014-01-25