Spencer has three components:

    * The Indexer (spendex)
    * The search interface (spencer)
    * The browse interface (spencer)

This document takes spendex and spencer in turn, as they are fundamentally different applications. The Spencer web application, which provides both the search and browse interfaces, is discussed below.

The suite requires a MySQL database in which the index is stored and an SMB or Samba share which can be accessed over the network. In addition, it requires a standards-compliant servlet container such as Tomcat with an appropriate Java JDK (1.4.2_06 or later) in which to deploy the search/browse tools.

The filestore host server must also have an appropriate JRE (1.4.2_06 or later) installed to run the indexer.

Database

The database used in developing this application was MySQL 4.0.18-nt and this version or later should work well with Spencer. Create a new database by running:

mysqladmin -p create <database>

where <database> should be replaced with the name of the database you wish to use (e.g. spencer). You will be prompted for the root password during this process. Next, create the database structure by placing the supplied spencer.sql file in the mysql bin folder and typing:

mysql -p <database> < spencer.sql

again replacing <database> with the name of your newly created database.
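
To confirm that the structure was created, one option is to list the tables over JDBC. The following is only a sketch: the class name SchemaCheck is made up for illustration, it is written for a current JDK rather than 1.4.2, and it assumes the MySQL Connector/J jar (listed later in this document) is on the classpath and that MySQL listens on its default port.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;

// Hypothetical helper, not part of Spencer: lists the tables in the new database.
class SchemaCheck {

    // Builds a MySQL JDBC URL for the given server and database (default port).
    static String jdbcUrl(String server, String database) {
        return "jdbc:mysql://" + server + "/" + database;
    }

    public static void main(String[] args) throws SQLException {
        if (args.length != 4) {
            System.err.println("usage: SchemaCheck <server> <database> <username> <password>");
            return;
        }
        try (Connection conn = DriverManager.getConnection(jdbcUrl(args[0], args[1]), args[2], args[3]);
             // Restrict the listing to the database we connected to.
             ResultSet tables = conn.getMetaData().getTables(conn.getCatalog(), null, "%", new String[] {"TABLE"})) {
            while (tables.next()) {
                System.out.println(tables.getString("TABLE_NAME"));
            }
        }
    }
}
```

If no table names are printed, the spencer.sql import probably did not run against the database you expected.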

All done, now let's start building the index.

Spendex

The indexer must be executed on the server where the files reside and the index must be populated (at least partially) before the search and browse interfaces will work. The initial indexing exercise will take some time. My reference implementation, around 330,000 files in around 50GB of storage, took about three days to index on a dual 800MHz Xeon box with 2GB RAM.

First, create a new folder outside of the file hierarchy to be indexed, for example e:\bin. (This of course refers to a Windows server, but there is no reason why this will not work on a UNIX/Linux box.)

Now place the following requisite files in this folder:

null.xslt

log4j.xml

spendex.properties

spendex.jar

At this point, open spendex.properties in a text editor and amend the settings appropriately:

server = Name or IP of server on which the database resides

database = Name of the database you created above

username = A user with rights to the database (read/write/delete)

password = The user's password

rootDir = The location of the root of the file hierarchy on the local server (NB: Replace single backslashes with double backslashes - e.g. c:\\Documents and Settings)

logsepth = The verbosity of the output between 0 and 4. I recommend 0 or 1 unless you are experiencing problems.
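
The double-backslash rule for rootDir makes sense if the settings are read with java.util.Properties, which treats a backslash as an escape character. That is an assumption; this readme does not say how spendex loads the file. The sketch below (class name RootDirEscaping is made up, and it is written for a current JDK rather than 1.4.2) shows what Java would see in each case:

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Illustrative only: assumes the properties file is parsed by java.util.Properties.
class RootDirEscaping {

    // Parses one "rootDir = ..." line and returns the value Java would see.
    static String parseRootDir(String propertiesLine) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesLine));
        } catch (IOException e) {
            throw new UncheckedIOException(e); // not expected when reading from a String
        }
        return props.getProperty("rootDir");
    }

    public static void main(String[] args) {
        // Java source needs its own escaping: "\\\\" below is two backslashes in the file.
        String doubled = "rootDir = c:\\\\Documents and Settings";
        String single = "rootDir = c:\\Documents and Settings";

        System.out.println(parseRootDir(doubled)); // prints c:\Documents and Settings
        System.out.println(parseRootDir(single));  // prints c:Documents and Settings
    }
}
```

With a single backslash the separator is silently dropped, so the path no longer points where you intended.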

Now find the following files on the Internet. These are required third-party libraries. Download them and place them in the same folder as the other files:

jxl.jar : http://www.andykhan.com/jexcelapi/jexcelapi_2_5_1.tar.gz

mysql-connector-java-3.0.15-ga-bin.jar:
http://dev.mysql.com/downloads/connector/j/3.0.html

PDFBox-0.6.7a.jar:http://prdownloads.sourceforge.net/pdfbox/PDFBox-0.6.7a.zip?download

poi-2.5.1-final-20040804.jar: http://www.mirrorservice.org/sites/ftp.apache.org/jakarta/poi/release/bin/poi-bin-2.5.1-final-20040804.tar.gz

tm-extractors-0.4.jar: http://www.textmining.org/modules.php?op=modload&name=Downloads&file=index&req=getit&lid=2

xsdlib.jar: http://javashoplm.sun.com/ECom/docs/Welcome.jsp?StoreId=22&PartDetailId=jwsdp-1_5-oth-JPR&SiteId=JSC&TransactionId=noreg

log4j-1.2.8.jar: http://logging.apache.org/log4j/docs/download.html

Now it is time to create a script/batch file to actually execute the indexing process. On Windows the appropriate batch file might look something like this:

e:
cd bin
"C:\Program Files\Java\j2re1.4.2_05\bin\java.exe" -Xms250M -Xmx500M -Dlog4j.configuration=file:///e:/bin/log4j.xml -cp "e:\bin\spendex.jar";"e:\bin\poi-2.5.1-final-20040804.jar";"e:\bin\mysql-connector-java-3.0.15-ga-bin.jar";"e:\bin\xsdlib.jar";"e:\bin\tm-extractors-0.4.jar";"e:\bin\PDFBox-0.6.7a.jar";"e:\bin\log4j-1.2.8.jar";"e:\bin\jxl.jar" spendex.Index >e:\bin\log.txt 2>&1

The redirections at the end ensure that all of the output is recorded in the log.txt file for subsequent perusal. Note that for very large indices with a high logging level, this file can reach 100MB very quickly.

The best way to ensure that your index is maintained is to use your operating system to schedule execution of the batch/script at regular intervals. I choose to run it overnight, every night, by creating a Windows Scheduled Task which points at the batch file. You could use a cron job on UNIX.

Periodically, check the log file and make sure that nothing too untoward is going on. Running the batch file will begin the long task of building the index and you can now get on with deploying the search/browse interfaces (see below).

The indexing task will recursively examine every file and folder in the filesystem under the root that you specify in the properties file. If the file/folder already exists in the database and the date stamp on it has not changed since the last index, nothing is done. If it is not in the database or the time stamp has changed, however, the file/folder is (re)added to the database and, if the file extension is recognised (MS Office, OpenOffice/StarOffice, PDF, zip or plain/marked up text), the text is stripped out of the file, the words counted and sorted, cross referenced and added to the index.
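
The skip-or-reindex rule can be sketched roughly as follows. This is not spendex's actual code: the class and method names are hypothetical, an in-memory map stands in for the database, the text extraction step is left out, and the syntax is that of a current JDK rather than 1.4.2.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the incremental scan; the map stands in for the database.
class IncrementalScan {

    // An entry needs (re)indexing if it is unknown or its timestamp has changed.
    static boolean needsIndexing(Long storedStamp, long currentStamp) {
        return storedStamp == null || storedStamp.longValue() != currentStamp;
    }

    // Recursively collects every file and folder under root that needs (re)indexing.
    // 'indexed' maps absolute path -> last-modified stamp recorded at the previous run.
    static List<File> collectStale(File root, Map<String, Long> indexed) {
        List<File> stale = new ArrayList<>();
        walk(root, indexed, stale);
        return stale;
    }

    private static void walk(File entry, Map<String, Long> indexed, List<File> stale) {
        if (needsIndexing(indexed.get(entry.getAbsolutePath()), entry.lastModified())) {
            stale.add(entry);
        }
        File[] children = entry.listFiles(); // null for plain files or unreadable folders
        if (children != null) {
            for (File child : children) {
                walk(child, indexed, stale);
            }
        }
    }

    public static void main(String[] args) {
        File root = new File(args.length > 0 ? args[0] : ".");
        // With nothing indexed yet, every entry is reported, as on the first run.
        for (File entry : collectStale(root, new HashMap<>())) {
            System.out.println(entry);
        }
    }
}
```

This also suggests why the first run is slow and later runs are much quicker: once the stamps are recorded, only new or changed entries are touched.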

There are a couple of other features of the indexer:

Running it with spendex.getCommonWords instead of spendex.Index on the command line populates the commons table of the database with the most common words currently listed. An example script might look like:

e:
cd bin
"C:\Program Files\Java\j2re1.4.2_05\bin\java.exe" -Xms250M -Xmx500M -Dlog4j.configuration=file:///e:/bin/log4j.xml -cp "e:\bin\spendex.jar";"e:\bin\poi-2.5.1-final-20040804.jar";"e:\bin\mysql-connector-java-3.0.15-ga-bin.jar";"e:\bin\xsdlib.jar";"e:\bin\tm-extractors-0.4.jar";"e:\bin\PDFBox-0.6.7a.jar";"e:\bin\log4j-1.2.8.jar";"e:\bin\jxl.jar" spendex.getCommonWords 250 >e:\bin\common.txt 2>&1

This will populate 250 rows of the table (note the 250 in the command - there's the clue!)
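
The idea of picking the N most common words can be illustrated in memory. The sketch below is an assumption about the general technique, not the query spendex runs against the database; the class name CommonWords, the tokenising rule and the tie-breaking are all made up for illustration, and it needs a current JDK rather than 1.4.2.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: count words, then keep the most frequent ones.
class CommonWords {

    // Counts case-insensitive occurrences; a "word" here is a run of letters or digits.
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase().split("[^\\p{L}\\p{Nd}]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Returns up to 'limit' words, most frequent first; ties are broken alphabetically.
    static List<String> mostCommon(Map<String, Integer> counts, int limit) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = b.getValue().compareTo(a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> top = new ArrayList<>();
        for (int i = 0; i < entries.size() && i < limit; i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countWords("the cat and the dog and the bird");
        System.out.println(mostCommon(counts, 2)); // prints [the, and]
    }
}
```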

A less useful option forces a reindex of certain file types in a rather dramatic way. Running:

<classpath> spendex.delete aaa_bbb_ccc

will DELETE all files with aaa, bbb or ccc as their extensions AND all of the associated word indices. The next time an index runs, the deleted files will be indexed anew. This is only really useful if there has been a problem with selected portions of the index.
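
The underscore-separated argument can be read as a simple list of extensions. The sketch below shows one plausible interpretation; the class name is hypothetical, and treating the match as case-sensitive is an assumption that this readme does not confirm. It is written for a current JDK rather than 1.4.2.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of how an argument such as aaa_bbb_ccc might be interpreted.
class DeleteArgument {

    // Splits "aaa_bbb_ccc" into the set {aaa, bbb, ccc}.
    static Set<String> parseExtensions(String argument) {
        return new HashSet<>(Arrays.asList(argument.split("_")));
    }

    // True if the text after the last dot of the file name is one of the extensions.
    static boolean isSelected(String fileName, Set<String> extensions) {
        int dot = fileName.lastIndexOf('.');
        return dot >= 0 && extensions.contains(fileName.substring(dot + 1));
    }

    public static void main(String[] args) {
        Set<String> extensions = parseExtensions("doc_xls_pdf");
        System.out.println(isSelected("report.pdf", extensions)); // prints true
        System.out.println(isSelected("notes.txt", extensions));  // prints false
    }
}
```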

The Web App

The Spencer.war file can be deployed using your preferred method into an appropriate servlet container (I recommend Tomcat) and the following required files must be present in the Tomcat classpath or the Spencer lib folder:

mysql-connector-java-3.0.15-ga-bin.jar

jcifs-0.8.0.jar

Once deployed, the first page to visit MUST be http://server/spencer/admin.jsp. This is where you specify the connection and look and feel parameters of the app. It will look pretty ugly the first time you use it, but to prettify it quickly, enter spencer.css in the stylesheet box and click save for an easier on the eye experience.

Most of the fields on this page are pretty self-explanatory. Those in the database section will usually match those that you specified in spendex.properties above. Those in the second section need to refer to the server, share and an account with appropriate read access to the contents. The final portions are HTML fragments that will be included in every page and can be used to customise the look and feel of your implementation.

All done! Have a browse, have a search once the indexing is complete and see what you find.
Source: readme.txt, updated 2004-12-01