From: Demian K. <dem...@vi...> - 2011-08-16 12:52:41
For what it's worth, the solution I'm working on at Villanova involves a separate website index (we didn't want to integrate website results into our catalog). The index is pretty simple, with a limited number of fields (URL, title, full text, category). I populate the index using an XSL sheet that runs against sitemap XML files from the website. It's admittedly a pretty crude solution (since it reindexes the entire site every time), but so far it's working well as a proof of concept, and it can always be refined later...

- Demian

________________________________________
From: Eoghan Ó Carragáin [eog...@gm...]
Sent: Monday, August 15, 2011 5:44 PM
To: Nathan Tallman
Cc: vuf...@li...; vuf...@li...
Subject: Re: [VuFind-Tech] [VuFind-General] Indexing Institutional Website

Hi,

VuFind will work with anything you can map to the Solr index, & some VuFind sites are already indexing websites with their catalogue (e.g. http://katalog.tub.tu-harburg.de/). As you know, VuFind comes with some built-in tools for loading MARC as well as some XML-based record formats, & can also use Aperture to extract full text from URLs found in those records. At the moment, though, I can't think of any out-of-the-box way to crawl a website, so you would have to do some work to get your content into Solr. Here are some of the options that come to mind:

* If your website has an SQL database as a backend, you could look at using Solr's Data Import Request Handler (http://wiki.apache.org/solr/DataImportHandler). Or, if you can export from the database in either XML (http://wiki.apache.org/solr/UpdateXmlMessages) or CSV (http://wiki.apache.org/solr/UpdateCSV), you could write a script to post the data to Solr.

* The latest version (1.3) of the Apache Nutch web crawler can write directly to a Solr index (see http://wiki.apache.org/nutch/RunningNutchAndSolr).
I haven't tried it, but it looks like you might be able to change {NUTCH_HOME}/conf/schema.xml and {NUTCH_HOME}/conf/solrindex-mapping.xml to map the Nutch output to VuFind's Solr schema ({VUFIND_HOME}/solr/biblio/conf/schema.xml).

* If you have a list of all the URLs, you should be able to get Aperture to work. You could write a script to loop through the URLs (see /import/xsl/vufind.php and/or /import/index_scripts/getFulltext.bsh as a starting point). Aperture returns an XML document with some metadata fields (title, author, etc.) as well as the full text/body of the document. You would need to map these various elements to a Solr XML document and post it to the update URL (http://wiki.apache.org/solr/UpdateXmlMessages).

You could also roll your own & directly parse the HTML pages, but you'd have to worry about stripping HTML tags, & this wouldn't help with PDF, DOC, etc. I'm sure there are other ways to skin this cat.

You might also want to look into sharding (see http://vufind.org/wiki/using_solr_shards and http://wiki.apache.org/solr/DistributedSearch). Depending on your setup, it might make sense to keep the website in a separate Solr index from your catalogue; however, you could also map everything to a single Solr instance and rely on filtering/faceting to distinguish the different blocks in the UI.

Hope this helps as a starting point.

Eoghan

On 15 August 2011 20:21, Nathan Tallman <nta...@gm...> wrote:

Hello all,

Can VuFind be used to crawl/index the hosting institution's website? Perhaps through Aperture? I'm investigating a single-search function for our website, catalog, finding aids, digital content, etc. Thought there might be a way.

Thanks!
Nathan Tallman

_______________________________________________
VuFind-General mailing list
VuF...@li...
https://lists.sourceforge.net/lists/listinfo/vufind-general
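
For anyone who wants to try the sitemap-driven approach discussed in this thread without XSL, here is a minimal Python sketch: it reads page URLs from a sitemap XML file, builds a Solr XML update message, and posts it to the update URL. The field names (id, url, title, fulltext, category), the function names, and whatever update URL the caller passes in are assumptions for illustration only; they are not Villanova's actual stylesheet or VuFind's schema. Fetching each page and extracting its text (e.g. with Aperture) is left out.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap files (sitemaps.org protocol).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical field names; adjust them to match your own Solr schema.
FIELD_NAMES = ("id", "url", "title", "fulltext", "category")


def sitemap_urls(sitemap_xml):
    """Return the <loc> URLs listed in a sitemap XML string."""
    root = ET.fromstring(sitemap_xml)
    locs = root.findall("sm:url/sm:loc", SITEMAP_NS)
    return [loc.text.strip() for loc in locs if loc.text]


def solr_add_message(pages):
    """Build a Solr XML <add> message from a list of page dicts.

    Keys that are missing or empty in a page dict are skipped.
    """
    add = ET.Element("add")
    for page in pages:
        doc = ET.SubElement(add, "doc")
        for name in FIELD_NAMES:
            value = page.get(name)
            if value:
                field = ET.SubElement(doc, "field", {"name": name})
                field.text = value
    return ET.tostring(add, encoding="unicode")


def post_to_solr(update_url, message):
    """POST an XML update message to the given Solr update URL."""
    request = urllib.request.Request(
        update_url,
        data=message.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

A caller would loop over sitemap_urls(...), fill in one dict per page (using the URL as the id is one simple choice), and pass the list to solr_add_message before posting. Like the approach described above, this reindexes every page on each run; the sitemap's optional lastmod element would be one place to start if incremental updates are needed.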