From: Osullivan L. <L.O...@sw...> - 2013-09-12 13:30:39
Hi Benjamin,

We launched VuFind 2 about 6 weeks ago, and for the first two days we pretty much had a DOS as a result of the Googlebot because I put our robots.txt file in the wrong directory. Here's what we have:

    User-agent: *
    Disallow: /discover/Search/Results
    Disallow: /discover/Record/*/Details
    Disallow: /discover/Record/*/Export
    Disallow: /discover/Record/*/Cite
    Disallow: /discover/Record/*/Email
    Disallow: /discover/Record/*/Holdings

I've just checked the user_stats and user_stats_fields tables, and even with what I thought were pretty conservative rules, they still had over 7.5 million rows in them, mostly due to the Googlebot!

Can you share your robots.txt file with us as a comparison? I think I'll need to at least add

    Disallow: /discover/Author/*

to the file as well.

I have a script which runs at 23:59 and selects stats as follows:

    // All-time top 10 search phrases
    $phrases = "select value, count(*) as count from user_stats_fields"
        . " where field = 'phrase'"
        . " group by value order by count desc limit 10";
    if ($getPhrases = mysqli_query($con, $phrases)) {
        while ($row = mysqli_fetch_array($getPhrases)) {
            $phraseRes[] = $row;
        }
    }

    // Top 10 search phrases between $lastWeek and $now
    $phrases = "select user_stats_fields.value, count(*) as count"
        . " from user_stats_fields"
        . " left join user_stats on user_stats.id = user_stats_fields.id"
        . " where user_stats_fields.field = 'phrase'"
        . " and user_stats.datestamp between '" . $lastWeek . "' and '" . $now . "'"
        . " group by value order by count desc limit 10";
    if ($getPhrases = mysqli_query($con, $phrases)) {
        while ($row = mysqli_fetch_array($getPhrases)) {
            $weeklyPhraseRes[] = $row;
        }
    }

    // All-time top 10 viewed records
    $records = "select value, count(*) as count from user_stats_fields"
        . " where field = 'recordId'"
        . " group by value order by count desc limit 10";
    if ($getRecords = mysqli_query($con, $records)) {
        while ($row = mysqli_fetch_array($getRecords)) {
            $recordRes[] = $row;
        }
    }

    // Top 10 viewed records between $lastWeek and $now
    $records = "select user_stats_fields.value, count(*) as count"
        . " from user_stats_fields"
        . " left join user_stats on user_stats.id = user_stats_fields.id"
        . " where user_stats_fields.field = 'recordId'"
        . " and user_stats.datestamp between '" . $lastWeek . "' and '" . $now . "'"
        . " group by value order by count desc limit 10";
    if ($getRecords = mysqli_query($con, $records)) {
        while ($row = mysqli_fetch_array($getRecords)) {
            $weeklyRecordRes[] = $row;
        }
    }

It wasn't (or shouldn't have been!) running at the time of our system meltdown, but I don't think running select commands on a database which gains 7.5 million rows every 6 weeks is sustainable in the long run.

Thanks,

Luke

On 09/12/2013 02:06 PM, Mosior, Benjamin wrote:

I apologize if I missed it, but I didn’t see any mention of search indexing. Google (and others) can DOS the Solr index if allowed to crawl the VuFind installation unchecked. To see if that’s an issue, you can watch your access logs during peak times. An example entry might look like this:

    [12/Sep/2013:07:46:15 -0400] "GET /vufind/etc. HTTP/1.1" 200 46 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

We mitigated our issues by using an appropriate robots.txt file, though we did have to temporarily disable all bot traffic with an Apache rewrite.

Benjamin Mosior

From: Osullivan L. [mailto:L.O...@sw...]
Sent: Thursday, September 12, 2013 8:27 AM
To: Brown A.T.; Whitcombe N.; 'vuf...@li...<mailto:vuf...@li...>'
Cc: Roberts A.L.
Subject: [VuFind-Tech] VuFind Problems

Hi Folks,

After reading a few articles and chatting with folk via the VuFind mailing list, I have a few things which I will attempt to monitor over the next few days. I've included them here in case anyone is interested:

Monitoring

1) Confirm that Garbage Collection is running correctly
   a) /var/log/vufind2/gc.log should be constantly updated
   b) Simulate a garbage full scenario by adding the word "full" to the log (I will do this at night sometime because, if it works correctly, it will restart VuFind)
2) Apache Servers
   a) Keep an eye on the number of apache clients being created (ps -e | grep "apache" | wc -l)
   b) Keep an eye on the amount of memory and average process size for apache (cron runs the script at /home/ifind/scripts/apacheMonitor.sh every minute and outputs data to /var/log/apache2/process.log)
3) Monitor /tmp
   a) Make sure /tmp is not mounted as a ramdisk (Ere Maijala)
   b) Make sure /tmp is not reaching maximum size (potential issue with VuFind sessions not expiring)
4) Virtual Server
   a) Is the virtual machine being moved too aggressively from one physical machine to another by a load balancer? (Ere Maijala)
   b) Does the virtual machine support the type of swap / tmp directory in use? (Ere Maijala)
5) Netstat
   a) Look for a large number of hung connections, anything with *WAIT* statuses (Joe Atzberger)
6) Jetty
   a) Check to see if direct access to Jetty / Solr is possible by calling the port directly (Demian Katz)

Things to consider

1) It has been reported that the G1GC garbage collector would be great with Lucene/Solr. Ere is now running with it (so far without issues) with the following settings: JAVA_OPTIONS="-server -Xms1G -Xmx16G -XX:+UseG1GC" (Ere Maijala)
2) Separation of the Solr Server from the Web / PHP Front End (Tuan Nguyen / Demian Katz)
3) Reducing the amount of memory allocated to the Java Server may actually improve performance - http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html (Al Rykhus)
4) Use JMeter to simulate heavy loading (Jason Stirnaman)

I really appreciate all your help!

Thanks,

Luke
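
Benjamin's suggestion of watching the access logs for crawler traffic could be sketched in shell roughly as follows. This is only a minimal sketch, assuming the log is in Apache's "combined" format, where the user agent is the sixth field when a line is split on double quotes (as in the example entry quoted in the thread). The function names count_bot_hits and top_user_agents are hypothetical and not part of VuFind or Apache.

```shell
#!/bin/sh
# Sketch only: assumes Apache "combined" log format, in which the
# user agent is the sixth field when a line is split on double quotes.

# Print how many requests in LOGFILE have a user agent containing
# PATTERN (plain substring match; defaults to "Googlebot").
count_bot_hits() {
    logfile=$1
    pattern=${2:-Googlebot}
    awk -F'"' -v pat="$pattern" \
        'index($6, pat) > 0 { n++ } END { print n + 0 }' "$logfile"
}

# Print the LIMIT busiest user agents in LOGFILE with their request
# counts, most frequent first (LIMIT defaults to 10).
top_user_agents() {
    logfile=$1
    limit=${2:-10}
    awk -F'"' '{ print $6 }' "$logfile" \
        | sort | uniq -c | sort -rn | head -n "$limit"
}
```

After sourcing the file, something like `count_bot_hits /var/log/apache2/access.log` would print the number of Googlebot requests, and `top_user_agents /var/log/apache2/access.log 5` would list the five busiest user agents. The access.log path is an assumption; use whatever path your Apache configuration actually writes to.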