From: Bradley B. <bbu...@uc...> - 2014-07-31 19:02:53
Thanks Demian! That's all super helpful! We truly appreciate you taking the time to answer the questions. I think we'll need to stick with the quick-and-dirty idea for the moment since we're about to launch. I'll discuss with my teammates.

________________________________
From: Demian Katz [dem...@vi...]
Sent: Thursday, July 31, 2014 1:53 PM
To: Bradley Busenius; vuf...@li...
Subject: RE: webcrawl.php questions

First of all, by way of explanation: the current web crawling mechanism VuFind uses was my first quick-and-dirty idea to test the concept. It definitely wasn’t designed to scale to huge sites, and honestly, I’m amazed that it works as well as it does – I expected that I would have to rewrite it at some point, but I’ve never gotten around to doing that since the original code continues to meet our needs over here. It’s entirely possible that you’ve hit a threshold, though, and may need to come up with a different strategy.

Anyway, general comments aside, some answers to your questions:

1. If you can set your memory limit independently for command-line PHP vs. Apache’s PHP, I don’t think it’s too dangerous to set the limit fairly high for the command-line version. The main purpose of the memory limit (in my limited understanding) is to prevent a huge number of simultaneous web requests from swamping the server… but if you’re running a single CLI process, that’s not an issue. (A command-line sketch follows after this message.)

2. You’ll find the logic for the web crawl in VuFindConsole\Controller\ImportController::webcrawlAction(). You’ll see that the way this works is: it first remembers the current time, then it harvests all of the sitemaps listed in your .ini file, then it deletes any outdated records from prior to the start time, and finally it commits/optimizes the index. If you wanted to do this incrementally, you could break this apart into multiple tools – one to save the start time to a file, one to load sitemaps (possibly with a mechanism to handle each sitemap in a different process), and a third to read the time file written by the first tool and do the necessary cleanup actions. (Side note: if you want to build new tools, note that controller name / action name are mapped to VuFind directory / PHP filename – so you can copy the existing util/webcrawl.php file to a different filename, and then by running that file, you will execute an action with a name matching the new file.) Regarding having to delete the index every time, I wonder if the problem is that your script is running out of memory before the Solr commit happens, so your changes aren’t taking effect right away. It’s possible that if you manually optimized the index at the end of the process, you would see the new records. (See the configuration and commit/optimize sketches after this message.)

3. Right now, we use sitemap.xml because we already had a tool chain set up for running XSLT on XML and dumping the results into the Solr index (as I said above, quick-and-dirty!). It’s not that hard to build Solr XML documents and post them to the index… but if you want a different input format, you would have to build tools to read it in, retrieve the data, and post it to Solr. All the pieces you need can be found in the existing XSLT code, but it would take some rearrangement and customization to get everything right. If you want to go down this road, I’d be happy to provide support. As I say, I think we could really use smarter/more flexible tools in this area, but I haven’t had time to build them due to lack of pressing need. (A minimal posting sketch follows after this message.)

4. No, VuFind is ignoring the priority field.

I hope this is helpful!

- Demian
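To illustrate point 1 above: PHP's command-line interpreter accepts a -d switch that overrides memory_limit for a single run without touching Apache's php.ini. The invocation below is only a sketch and assumes the stock util/webcrawl.php location; adjust the path and value to your own install.

    cd $VUFIND_HOME
    php -d memory_limit=1024M util/webcrawl.php
    # or, for a one-off run on a trusted machine, remove the cap entirely:
    php -d memory_limit=-1 util/webcrawl.php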
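On point 2: the list of sitemaps the crawler harvests lives in an .ini file (sitemaps.ini in a stock VuFind install, if memory serves), so one incremental option is to split the huge sitemap into several smaller files and list each of them there. The section and key names below reflect my recollection of that file and the URLs are placeholders; double-check against your own configuration.

    [Sitemaps]
    url[] = http://www.example.edu/sitemap-part1.xml
    url[] = http://www.example.edu/sitemap-part2.xml
    url[] = http://www.example.edu/sitemap-part3.xml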
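Also on point 2: if the script is dying before its final commit, a quick way to test that theory is to ask Solr to commit/optimize by hand and see whether the new records then appear. Solr's update handler accepts these operations as URL parameters; the host, port, and core name below are assumptions based on a default VuFind install and may differ on your system.

    curl 'http://localhost:8080/solr/website/update?commit=true'
    curl 'http://localhost:8080/solr/website/update?optimize=true'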
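And on point 3: the Solr XML update format is simple enough to generate from whatever source you can parse, so a flat file could feed the index with a small amount of glue code. A minimal sketch of posting a single document follows, assuming the same local Solr URL as above; the field names are only illustrative and would need to match the website core's schema.xml.

    curl 'http://localhost:8080/solr/website/update?commit=true' \
         -H 'Content-Type: text/xml' \
         --data-binary \
    '<add>
      <doc>
        <field name="id">http://www.example.edu/some-page.pdf</field>
        <field name="title">Some Page</field>
        <field name="fulltext">Plain text extracted from the page goes here.</field>
      </doc>
    </add>'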
From: Bradley Busenius [mailto:bbu...@uc...]
Sent: Thursday, July 31, 2014 2:00 PM
To: vuf...@li...
Subject: [VuFind-Tech] webcrawl.php questions

We're having trouble getting webcrawl.php to index our full sitemap.xml, which is 128,293 lines. We can successfully index a smaller file but not the big one. Currently we're getting the PHP memory limit exhausted error (again). We had originally gotten rid of the error by upping the limit to 1024M. I'm trying to explore all options and have the following questions:

1. What's a reasonable memory limit for PHP?

2. Is there a way to incrementally index multiple sitemap.xml files, and if so, how should this be done? Currently I need to delete everything in /vufind/solr/website/index/ before running the script each time. If I don't, the script executes but the new entries don't appear in the index. I might be doing something wrong here. I don't know.

3. Do we need to use XML for this, or can we set this up as a flat file with tab-separated fields?

4. Is the priority field necessary? Our current XML format follows the Google sitemap standard. We have something like this:

<url>
  <loc>http://www.lib.uchicago.edu/storage/law/2004/law2004-001-36.pdf</loc>
  <priority>0.000014</priority>
</url>

I apologize for so many questions. I'm very ignorant as to how this works.

-Brad