From: Eoghan Ó C. <eog...@gm...> - 2011-08-15 21:44:21
Hi,

VuFind will work with anything you can map to the Solr index, and some VuFind sites are already indexing websites alongside their catalogue (e.g. http://katalog.tub.tu-harburg.de/). As you know, VuFind comes with some built-in tools for loading MARC as well as some XML-based record formats, and can also use Aperture to extract full text from URLs found in those records. At the moment, though, I can't think of any out-of-the-box way to crawl a website, so you would have to do some work to get your content into Solr. Here are some of the options that come to mind:

- If your website has an SQL database as a backend, you could look at using Solr's Data Import Request Handler (http://wiki.apache.org/solr/DataImportHandler). Or, if you can export from the database in either XML (http://wiki.apache.org/solr/UpdateXmlMessages) or CSV (http://wiki.apache.org/solr/UpdateCSV), you could write a script to post the data to Solr (there's a rough sketch of this at the end of this message).

- The latest version (1.3) of the Apache Nutch web crawler can write directly to a Solr index (see http://wiki.apache.org/nutch/RunningNutchAndSolr). I haven't tried it, but it looks like you might be able to change {NUTCH_HOME}/conf/schema.xml and {NUTCH_HOME}/conf/solrindex-mapping.xml to map the Nutch output to VuFind's Solr schema ({VUFIND_HOME}/solr/biblio/conf/schema.xml); the second sketch below is a guess at what that mapping might look like.

- If you have a list of all the URLs, you should be able to get Aperture to work. You could write a script to loop through the URLs (see /import/xsl/vufind.php and/or /import/index_scripts/getFulltext.bsh as a starting point). Aperture returns an XML document with some metadata fields (title, author, etc.) as well as the full text/body of the document. You would need to map these various elements to a Solr XML document and post it to the update URL (http://wiki.apache.org/solr/UpdateXmlMessages). You could also roll your own and parse the HTML pages directly, but then you'd have to worry about stripping HTML tags, and this wouldn't help with PDF, DOC, etc. (the last sketch below shows the general shape of that loop).

I'm sure there are other ways to skin this cat. You might also want to look into sharding (see http://vufind.org/wiki/using_solr_shards and http://wiki.apache.org/solr/DistributedSearch). Depending on your setup, it might make sense to keep the website in a separate Solr index from your catalogue; however, you could also map everything to a single Solr instance and rely on filtering/faceting to distinguish the different blocks in the UI.

Hope this helps as a starting point.

Eoghan

On 15 August 2011 20:21, Nathan Tallman <nta...@gm...> wrote:
> Hello all,
>
> Can VuFind be used to crawl/index the hosting institution's website? Perhaps
> through Aperture? I'm investigating a single-search function for our
> website, catalog, finding aids, digital content, etc. Thought there might be
> a way.
>
> Thanks!
>
> Nathan Tallman
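P.S. Here are the sketches mentioned above. All of them are untested, and the details (port, core name, field names) are assumptions, so treat them as starting points rather than working code.

First, posting a record to Solr from a script. This assumes Solr is on VuFind's default port (8080) with the standard "biblio" core; the field names are only examples, so check them against {VUFIND_HOME}/solr/biblio/conf/schema.xml. The postToSolr() helper is just something I made up for illustration:

    <?php
    // Build a Solr <add> message from an associative array of fields and
    // POST it to the update handler. Requires the PHP curl extension.
    function postToSolr(array $fields,
                        $url = 'http://localhost:8080/solr/biblio/update')
    {
        $xml = '<add><doc>';
        foreach ($fields as $name => $value) {
            $xml .= '<field name="' . htmlspecialchars($name) . '">'
                  . htmlspecialchars($value) . '</field>';
        }
        $xml .= '</doc></add>';

        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_POST, true);
        curl_setopt($ch, CURLOPT_POSTFIELDS, $xml);
        curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-Type: text/xml'));
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        $response = curl_exec($ch);
        curl_close($ch);
        return $response;
    }

    // Example record; adjust the id scheme and fields to suit your schema.
    postToSolr(array(
        'id'       => 'website-0001',
        'title'    => 'About the Library',
        'fulltext' => 'Plain text extracted from the page goes here.',
    ));

Remember that Solr won't show new records until you commit, e.g.:

    curl http://localhost:8080/solr/biblio/update -H 'Content-Type: text/xml' --data-binary '<commit/>'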
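Second, for the Nutch route, my guess (again, untested) is that the mapping file would end up looking something like this, where "source" is the field Nutch produces and "dest" is the VuFind schema field it should land in:

    <!-- {NUTCH_HOME}/conf/solrindex-mapping.xml -->
    <mapping>
      <fields>
        <field source="title"   dest="title"/>
        <field source="content" dest="fulltext"/>
        <field source="url"     dest="id"/>
      </fields>
      <uniqueKey>id</uniqueKey>
    </mapping>

You would probably also have to fill in any fields VuFind's schema marks as required (format, etc.), since Nutch won't supply those on its own.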
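Finally, the general shape of the "roll your own" loop for plain HTML pages (Aperture would do a better job and also handles PDF, DOC, etc.). This assumes a urls.txt with one URL per line and reuses the hypothetical postToSolr() helper from the first sketch:

    <?php
    $urls = file('urls.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);

    foreach ($urls as $i => $url) {
        $html = @file_get_contents($url);
        if ($html === false) {
            fwrite(STDERR, "Could not fetch $url\n");
            continue;
        }

        // Grab the <title> before flattening the page to plain text.
        $title = preg_match('/<title[^>]*>(.*?)<\/title>/is', $html, $m)
            ? trim(html_entity_decode($m[1])) : $url;

        // Drop script/style blocks, then strip the remaining tags.
        $html = preg_replace('/<(script|style)[^>]*>.*?<\/\1>/is', ' ', $html);
        $body = trim(preg_replace('/\s+/', ' ', strip_tags($html)));

        postToSolr(array(
            'id'       => 'website-' . $i,
            'title'    => $title,
            'fulltext' => $body,
        ));
    }

If you go the Aperture route instead, the loop stays the same, but each iteration would hand the URL to Aperture and map the metadata fields in its XML output (title, author, etc.) onto your Solr fields before posting.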