From: Eoghan Ó C. <eog...@gm...> - 2011-08-15 21:44:21
Hi,

VuFind will work with anything you can map to the Solr index, and some VuFind sites are already indexing websites alongside their catalogue (e.g. http://katalog.tub.tu-harburg.de/). As you know, VuFind comes with some built-in tools for loading MARC as well as some XML-based record formats, and can also use Aperture to extract full text from URLs found in those records. At the moment, though, I can't think of any out-of-the-box way to crawl a website, so you would have to do some work to get your content into Solr. Here are some of the options that come to mind:

- If your website has an SQL database as a backend, you could look at using Solr's Data Import Request Handler (http://wiki.apache.org/solr/DataImportHandler). Or, if you can export from the database in either XML (http://wiki.apache.org/solr/UpdateXmlMessages) or CSV (http://wiki.apache.org/solr/UpdateCSV), you could write a script to post the data to Solr (there's a rough sketch of this at the end of this message).

- The latest version (1.3) of the Apache Nutch web crawler can write directly to a Solr index (see http://wiki.apache.org/nutch/RunningNutchAndSolr). I haven't tried it, but it looks like you might be able to change {NUTCH_HOME}/conf/schema.xml and {NUTCH_HOME}/conf/solrindex-mapping.xml to map the Nutch output to VuFind's Solr schema ({VUFIND_HOME}/solr/biblio/conf/schema.xml); the second sketch below is a guess at what that mapping might look like.

- If you have a list of all the URLs, you should be able to get Aperture to work. You could write a script to loop through the URLs (see /import/xsl/vufind.php and/or /import/index_scripts/getFulltext.bsh as a starting point). Aperture returns an XML document with some metadata fields (title, author, etc.) as well as the full text/body of the document. You would need to map these various elements to a Solr XML document and post it to the update URL (http://wiki.apache.org/solr/UpdateXmlMessages). You could also roll your own and parse the HTML pages directly, but then you'd have to worry about stripping HTML tags, and this wouldn't help with PDF, DOC, etc. (the last sketch below shows the general shape of that loop).

I'm sure there are other ways to skin this cat. You might also want to look into sharding (see http://vufind.org/wiki/using_solr_shards and http://wiki.apache.org/solr/DistributedSearch). Depending on your setup, it might make sense to keep the website in a separate Solr index from your catalogue; however, you could also map everything to a single Solr instance and rely on filtering/faceting to distinguish the different blocks in the UI.

Hope this helps as a starting point.

Eoghan

On 15 August 2011 20:21, Nathan Tallman <nta...@gm...> wrote:
> Hello all,
>
> Can VuFind be used to crawl/index the hosting institution's website? Perhaps
> through Aperture? I'm investigating a single-search function for our
> website, catalog, finding aids, digital content, etc. Thought there might be
> a way.
>
> Thanks!
>
> Nathan Tallman
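P.S. Here are the sketches mentioned above. All of them are untested, and the details (port, core name, field names) are assumptions, so treat them as starting points rather than working code.

First, posting a record to Solr from a script. This assumes Solr is on VuFind's default port (8080) with the standard "biblio" core; the field names are only examples, so check them against {VUFIND_HOME}/solr/biblio/conf/schema.xml. The postToSolr() helper is just something I made up for illustration:

    <?php
    // Build a Solr <add> message from an associative array of fields and
    // POST it to the update handler. Requires the PHP curl extension.
    function postToSolr(array $fields,
                        $url = 'http://localhost:8080/solr/biblio/update')
    {
        $xml = '<add><doc>';
        foreach ($fields as $name => $value) {
            $xml .= '<field name="' . htmlspecialchars($name) . '">'
                  . htmlspecialchars($value) . '</field>';
        }
        $xml .= '</doc></add>';

        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_POST, true);
        curl_setopt($ch, CURLOPT_POSTFIELDS, $xml);
        curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-Type: text/xml'));
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        $response = curl_exec($ch);
        curl_close($ch);
        return $response;
    }

    // Example record; adjust the id scheme and fields to suit your schema.
    postToSolr(array(
        'id'       => 'website-0001',
        'title'    => 'About the Library',
        'fulltext' => 'Plain text extracted from the page goes here.',
    ));

Remember that Solr won't show new records until you commit, e.g.:

    curl http://localhost:8080/solr/biblio/update -H 'Content-Type: text/xml' --data-binary '<commit/>'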
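Second, for the Nutch route, my guess (again, untested) is that the mapping file would end up looking something like this, where "source" is the field Nutch produces and "dest" is the VuFind schema field it should land in:

    <!-- {NUTCH_HOME}/conf/solrindex-mapping.xml -->
    <mapping>
      <fields>
        <field source="title"   dest="title"/>
        <field source="content" dest="fulltext"/>
        <field source="url"     dest="id"/>
      </fields>
      <uniqueKey>id</uniqueKey>
    </mapping>

You would probably also have to fill in any fields VuFind's schema marks as required (format, etc.), since Nutch won't supply those on its own.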
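Finally, the general shape of the "roll your own" loop for plain HTML pages (Aperture would do a better job and also handles PDF, DOC, etc.). This assumes a urls.txt with one URL per line and reuses the hypothetical postToSolr() helper from the first sketch:

    <?php
    $urls = file('urls.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);

    foreach ($urls as $i => $url) {
        $html = @file_get_contents($url);
        if ($html === false) {
            fwrite(STDERR, "Could not fetch $url\n");
            continue;
        }

        // Grab the <title> before flattening the page to plain text.
        $title = preg_match('/<title[^>]*>(.*?)<\/title>/is', $html, $m)
            ? trim(html_entity_decode($m[1])) : $url;

        // Drop script/style blocks, then strip the remaining tags.
        $html = preg_replace('/<(script|style)[^>]*>.*?<\/\1>/is', ' ', $html);
        $body = trim(preg_replace('/\s+/', ' ', strip_tags($html)));

        postToSolr(array(
            'id'       => 'website-' . $i,
            'title'    => $title,
            'fulltext' => $body,
        ));
    }

If you go the Aperture route instead, the loop stays the same, but each iteration would hand the URL to Aperture and map the metadata fields in its XML output (title, author, etc.) onto your Solr fields before posting.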