From: Demian K. <dem...@vi...> - 2011-05-16 13:04:49
There are a few problems you need to think about here, beyond simply indexing the full-text content itself:

1.) How do you uniquely identify the files in the Solr index?
2.) What kind of metadata do you need, and how can you obtain it?
3.) How do you want to represent these files in the search results?

I think the process to implement this will go something like this:

1.) Write a PHP (or language of choice) script that loops through the target directory, passes each file to Aperture, and then posts a Solr document into the index based on the Aperture results. You can probably borrow some of the logic from the existing XML indexing tool. (There's a rough sketch of this step after the quoted message below.)
2.) Write a custom record driver to display these results the way you need to - in the very simplest case, you might decide that you don't need a record view for these documents at all and instead want to link search results directly to pages that allow the original files to be downloaded.

This is just a broad outline - I'll be happy to elaborate and/or provide some helpful Wiki links if you need more details. Let me know!

- Demian

From: Byron Smith [mailto:by...@we...]
Sent: Monday, May 16, 2011 12:37 AM
To: vuf...@li...
Subject: [VuFind-General] VuFind Harvesting

Hi All,

I am currently in the middle of testing a new VuFind 1.1 instance running on a Windows 2003 server. I have been reading through the VuFind Wiki pages on harvesting in the Importing Records section. We are currently looking into the possibility of full-text searching a local or network directory containing a series of Word and PDF documents. Does anyone have experience in this area, or know of a good place to start in setting something like this up?

Regards,
Byron
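
To make step 1 above a little more concrete, here is a rough, untested sketch of the indexing script. The Solr URL, the field names ("id", "title", "fulltext"), and especially the "aperture_extract" command are placeholders rather than confirmed details - check your own Solr core settings and however you actually invoke Aperture before trying anything like this:

<?php
// Rough sketch: crawl a directory and index each file's full text into Solr.
// Assumptions (adjust for your own setup):
//   - $solrUrl points at your VuFind Solr core's update handler; the
//     port and core name below are just a guess at a default install.
//   - "aperture_extract" stands in for however you call Aperture to
//     pull plain text out of a Word or PDF file; it is not a real
//     command that ships with Aperture.

$dir     = 'C:\\documents';   // directory holding the Word/PDF files
$solrUrl = 'http://localhost:8080/solr/biblio/update';

// Small helper to POST an XML message to Solr.
function postToSolr($url, $xml)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, $xml);
    curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-Type: text/xml'));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $response = curl_exec($ch);
    curl_close($ch);
    return $response;
}

foreach (glob($dir . DIRECTORY_SEPARATOR . '*') as $file) {
    if (!is_file($file)) {
        continue;
    }

    // Placeholder: run Aperture (or any other extractor) for the full text.
    $fullText = shell_exec('aperture_extract ' . escapeshellarg($file));

    // Use a hash of the file path as the unique ID (problem #1 above);
    // the file name doubles as a crude title (problem #2).
    $doc = '<add><doc>'
         . '<field name="id">' . md5($file) . '</field>'
         . '<field name="title">' . htmlspecialchars(basename($file)) . '</field>'
         . '<field name="fulltext">' . htmlspecialchars($fullText) . '</field>'
         . '</doc></add>';

    postToSolr($solrUrl, $doc);
}

// Commit so the new documents actually become searchable.
postToSolr($solrUrl, '<commit/>');

The field names would need to match whatever your Solr schema actually defines, and you'd want real error handling and richer metadata in practice, but the overall shape - walk the directory, extract text, POST an <add> document per file, then <commit/> - should hold.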