From: Greg P. <gre...@gm...> - 2013-09-13 03:49:32
Hey Luke, I've been watching these couple of threads for the last few days and haven't had the time to write up a reply until now. The variety of responses re-affirms my opinion though, which is that there are a lot of factors to consider, so good luck :) Couple of things I find helpful:

1) Keep in mind what your ratio of index size to RAM allocation is when deciding garbage collection approaches and directory selection (ie. to use MMap or not). If you can't fit your index in RAM entirely you will ALWAYS have garbage collection that adversely affects you, and your primary bottleneck is most likely CPU speed (ie. how fast can you generate garbage... user load, weighted by CPUs, then collect it afterwards... pure CPU).

2) For the NLA, MMap is generally bad because we don't have the resources to allocate enough RAM to even come close to holding our indexes in memory. For example, if we allocate 20GB of RAM to the JVM but the index is 500GB in size (a ratio of 25:1), then using MMap tells Solr to ask the OS to handle 500GB of index data off-heap. The OS will try like a champ, but you'll be struggling with I/O wait issues and the performance of the machine as a whole will suffer. For the older versions of Solr we are still running this isn't an issue since MMap is not the default, but a lot of our newer machines running large indexes get changed to 'solr.NIOFSDirectoryFactory' as standard (there's a config sketch after point 6).

3) Garbage collection really needs to be benchmarked and tuned for your index and user base to know what works. We have some indexes that go great with G1GC, and others that crash within an hour but work fine with CMS.

4) Use a tool like jvisualvm and the visualgc plugin to connect to your machines (you might need to run jstatd on the host to facilitate the connection). It will give you a much better feel for what (if any) garbage collection problems you really have. For example, in my experience, for those machines running CMS the problems in Solr are actually in the new generation heap (ie. not CMS, but ParNewGC), so fine tuning of the CMS settings doesn't do diddly squat. I ended up focusing on NewRatio and survivor generations to get those graphs in visualgc looking a bit more healthy. Unfortunately, when using CMS as the old gen collector your new gen choices are very limited by compatibility. If you can get G1GC working I do like it, though; it is just harder to predict, since it adjusts your heap arrangement on the fly in response to current usage (ie. it notices new gen is stressed and will re-allocate empty old gen space into new gen, or vice versa). I find that it sometimes trips over itself and makes some bad choices.

5) If you have spare CPU resources, strongly consider GC tuning settings that adjust thread counts to match (eg. 6 CPUs on the host, configure GC to use maybe 3 of them). I have no idea if there is a secret-sauce answer for finding the number; I just benchmark different settings using jmeter to simulate the same search load until I get the best response times on average.

6) Ingest rate and ingest patterns matter... A LOT. The nature is different depending on whether you are doing Master/Slave replication or writing directly to the same host that is serving search traffic to users, but in all cases, receiving a commit and opening new searchers in memory are the most problematic time periods for the heap and garbage collection. Make sure any benchmarking and load testing includes this, and watch what those heap graphs do under those circumstances. We found that for our worst index, the simplest fix was to adjust our business rules; instead of replicating Master to Slave every 2 minutes, we made it 5 minutes, and those extra 3 minutes were all the JVM needed to get the heap back under control and avoid huge GC events.
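To make points 2 and 6 concrete, a minimal solrconfig.xml excerpt along these lines covers both settings; the masterUrl, core name and exact timings below are placeholders for illustration, not a real configuration:

  <!-- point 2: avoid MMap when the index dwarfs available RAM -->
  <directoryFactory name="DirectoryFactory" class="solr.NIOFSDirectoryFactory"/>

  <!-- point 6: slave-side replication handler, with the poll stretched from 2 to 5 minutes -->
  <requestHandler name="/replication" class="solr.ReplicationHandler">
    <lst name="slave">
      <str name="masterUrl">http://solr-master.example.org:8080/solr/biblio</str>
      <str name="pollInterval">00:05:00</str>
    </lst>
  </requestHandler>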
7) Stop-the-world (ie. Full) GC events are not inherently bad and shouldn't warrant a reboot unless (after all the above tuning) they are going on for massive time periods that affect your business adversely. Personally, I think a reboot, which drops all your search caches etc., is actually worse than a full GC event under most circumstances. We do them occasionally still, but they are becoming very rare (I haven't seen one in several months) as the GC settings get tuned more. We use an automated tool similar to what you've described, but it looks backwards in the logs and only intervenes if it detects excess amounts of Full GC.

8) Having a layer in front of the application to handle load spikes can work wonders. We are using nginx and shape all bot traffic to ensure it does not get too out of control. From memory we received ~20 hits/second against the catalogue from Google and Baidu combined before implementing that, which was a tad excessive and caused quite a few problems (although it was mostly Voyager that couldn't handle the load, not Solr or VuFind).

That's all I can think of off the top of my head. Happy to dig up specific settings for individual points if you need them, but so much of this is blurred together in my head since I've looked at it so much over the last year. We have nearly 10 TBs of Solr data I think, spread across versions 1, 3 and 4, and they each have different issues based on what we put in them and how they get used. I wish there was a silver bullet :)

Ta, Greg

On 13 September 2013 04:03, Joe Atzberger <jo...@bo...> wrote:
> There are several netstat options (which vary by platform), usually including -a (show all). This output is useful enough though. It shows that an unusually large number of connections are stuck in the TIME_WAIT state. This correlates to extra memory consumption, as described here:
>
> http://stackoverflow.com/questions/1803566/what-is-the-cost-of-many-time-wait-on-the-server-side
>
> And put into canonical detail here:
> http://www.isi.edu/touch/pubs/infocomm99/infocomm99-web/
>
> Are you running Solr on port 8888? Whatever it is has no *active* connections and 100+ that are just waiting around to close, so that seems like a good place to investigate.
>
> And "ccs-voyager.system:1521" must be your ILS? Oracle? Note that this type of performance bottleneck is partially addressed by asset pipeline architecture (combining CSS/JS so that each page requires fewer connections). You can attempt to lower TCP's MSL as a mitigation strategy, but that is not a real fix. In my experience the culprit is usually software that doesn't fully or correctly implement the protocol, i.e. socket-level scripting... but of course my experience is with a host of ancient software!
>
> --joe
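A quick way to get the state breakdown described above on a Linux host is something along these lines (netstat column layouts vary by platform, so treat this as a sketch rather than a recipe):

  # tally TCP connections by state; on Linux netstat the State field is column 6
  netstat -ant | awk 'NR > 2 {print $6}' | sort | uniq -c | sort -rn

  # narrow it to a single suspect listener, e.g. the port 8888 service mentioned above
  netstat -ant | awk 'NR > 2 && $4 ~ /:8888$/ {print $6}' | sort | uniq -c

A TIME_WAIT count that dwarfs ESTABLISHED is the pattern being described here.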
> On Thu, Sep 12, 2013 at 8:10 AM, Osullivan L. <L.O...@sw...> wrote:
>
>> Hi Joe,
>>
>> Thanks for your e-mail.
>>
>> With regards to netstat, are there any variables I should set, or would just the base information be enough?
>>
>> I've attached the output of the basic command during a stable time to this e-mail in case you can spot anything. ccs-voyager is our Ex Libris Voyager ILS Server, and I should point out that our VuFind Server sits behind a Microsoft ISA Server.
>>
>> With regards to Network Stability, I think this is tricky because of the aforementioned ISA Server - it basically sits in front of our VuFind Server as a gateway. As a result, all the IP addresses logged by Apache, for example, are always one of its two gateways. I have to admit complete ignorance of the operation of the upstream switch port - can you let me know how I can go about checking how it is operating?
>>
>> Thanks,
>>
>> Luke
>>
>> On 09/11/2013 06:39 PM, Joe Atzberger wrote:
>>
>> It is most disconcerting that *rebooting* the server did not affect performance. Is this because load was immediately returned to a high level? Otherwise, the behavior is incomprehensible, or perhaps related to something on-disk that persists between restarts.
>>
>> You have 3 different states:
>>
>> - overtaxed, slow even for SSH, but regular webserving is OK - does sound like a memory problem, confirmed by top
>> - SSH OK, but VuFind and webserving blocked - presumably just related to /tmp? Does this occur at any point when /tmp is mounted correctly?
>> - regular operation
>>
>> The limiting factor seems to be having enough resources to create additional sessions, prompting the question... how many sessions is too many? How few is not enough? Is there anything exacerbating the number of sessions needed? It would be nice to make a page that just reported the number of open sessions that you could poll/log regularly...
>>
>> I would be interested to see a couple of things:
>>
>> - netstat during load -- look for a large number of hung connections, anything with *WAIT* statuses
>> - network stability - does your upstream switch port auto-negotiate (e.g. duplex)? How are the interfaces configured, and do they stay correct under load?
>>
>> I've seen all kinds of weird behavior that caused sockets or sessions to be recreated en masse. I agree that the most interesting observation would be to see whether Solr remained nominally responsive while the application deteriorated, and then whether it would also be responsive for a novel query at the same time.
>>
>> There are many monitoring platforms available that might provide additional insight (Nagios, OpenNMS, etc.), but the easiest to use and interpret is probably the commercial version from New Relic. (You might make use of their free trial period.)
>>
>> Good luck and thanks for the detailed reporting!
>> --joe
>>
>> On Wed, Sep 11, 2013 at 12:07 PM, Alan Rykhus <ala...@mn...> wrote:
>>
>>> Hello,
>>>
>>> I have not fully read this blog post that I saw on the Solr list, but I plan on looking into it further and adjusting some settings that I have. Since we run on a 64 bit platform, I believe I will lower the amount of memory I assign to java when starting things up.
>>>
>>> http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
>>>
>>> al
>>>
>>> On 09/11/2013 10:53 AM, Osullivan L. wrote:
>>> > Hi Tuan,
>>> >
>>> > Thanks for this - I've already done some research along these lines, based on stuff I read here:
>>> >
>>> > http://cloudinservice.com/tune-apache-performance-using-mpm-prefork-module/
>>> >
>>> > The article contains a link to a script which you can run to calculate the average amount of memory each client is using.
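(Lacking that script, a rough equivalent is a one-liner; this assumes the worker processes are named apache2, as on Ubuntu/Debian - substitute httpd on RHEL/CentOS:)

  # average and total resident memory (RSS) of Apache workers, in KB and MB
  ps -o rss= -C apache2 | awk '{sum += $1; n++} END {if (n) printf "workers=%d avg_rss_kb=%.0f total_mb=%.0f\n", n, sum/n, sum/1024}'

Dividing the RAM you are willing to give Apache by the average RSS gives a rough ceiling for MaxClients, which is the calculation Tuan describes below.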
>>> > Our Solr specs seem quite modest compared to yours:
>>> >
>>> > 374M alphabetical_browse
>>> > 60K  authority
>>> > 1.5G biblio
>>> > 19M  jetty
>>> > 17M  lib
>>> > 40K  reserves
>>> > 4.0K solr.xml
>>> > 56K  stats
>>> > 60K  website
>>> >
>>> > When you say the 18GB Solr index fits into RAM, do you mean via the java server settings in VuFind.sh (-Xmx16384m), or do you calculate this based on the total amount of RAM available (24GB)?
>>> >
>>> > Having talked to Demian and read about your experiences, I will definitely investigate separating the solr server from the web server. It's not something I can do quickly though owing to University policy, and I think 10 - 12 GB is the most RAM I'm ever going to be given for one server. With that said, our index is a lot smaller, so perhaps 6GB allocated to Jetty to leave 4GB for the system would be fine?
>>> >
>>> > Thanks,
>>> >
>>> > Luke
>>> >
>>> > On 09/11/2013 04:34 PM, Tuan Nguyen wrote:
>>> >> Luke, I feel your pain :).
>>> >>
>>> >> We had similar issues a couple of years ago with VuFind locking up during busy periods. We tried various memory and garbage collection parameters to no avail.
>>> >>
>>> >> At that time we had BOTH vufind and SOLR on the same box. The server itself had 16GB of RAM shared among Mysql, Apache, VuFind and the OS. We allocated 8GB to solr.
>>> >>
>>> >> During busy times, our Apache server uses a lot of memory (calculated by multiplying the MaxClients parameter by the amount of memory each client is using - I forgot what the rough number was). Long story short, the amount of memory that was actually needed was more than 16GB, thus causing the server to constantly swap, which is why it took a while to SSH in.
>>> >>
>>> >> We decided to have solr on a dedicated server with 24GB of RAM, and 16GB allocated to solr. Our solr index is about 18GB, so the whole index fits nicely into RAM, and this resulted in a huge improvement.
>>> >>
>>> >> Now, we only have this in our java memory parameters for SOLR: -server -Xmx16384m
>>> >>
>>> >> Our SOLR server is RHEL/CentOS 5.7.
>>> >> Java(TM) SE Runtime Environment (build 1.7.0_03-b04)
>>> >> Java HotSpot(TM) 64-Bit Server VM (build 22.1-b02, mixed mode)
>>> >>
>>> >> On 2013-09-11, at 11:11 AM, Osullivan L. wrote:
>>> >>
>>> >>> Hi Folks,
>>> >>>
>>> >>> iFind Discover (VuFind 2) was recently unresponsive for a considerable amount of time and I am having trouble isolating the cause. I hope that by sharing the timeline plus relevant technologies with you, someone might be able to point me in the right direction.
>>> >>>
>>> >>> Summary
>>> >>>
>>> >>> Server
>>> >>> Ubuntu 12.04.2 LTS 64bit on Virtual Server
>>> >>> 10GB Memory
>>> >>> 4 CPU
>>> >>> 90.30GB HD (19.8% use)
>>> >>>
>>> >>> Primary Software
>>> >>> VuFind (vufind.org)
>>> >>> PHP (modified with Swansea University Custom Code)
>>> >>> Lucene Solr Enterprise Search (Java based)
>>> >>> Runs on Jetty Server (http://www.eclipse.org/jetty/) using
>>> >>> JAVA_OPTIONS="-server -d64 -Xms4096m -Xmx4096m -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+AggressiveOpts -XX:NewRatio=9 -Xloggc:/var/log/vufind2/gc.log"
>>> >>>
>>> >>> Jetty produces a garbage collection log at /var/log/vufind2/gc.log - when garbage collection is unable to match demand, it outputs "Full", and a cron script restarts VuFind, thus emptying all garbage. As garbage collection nears this limit, the VuFind service begins to slow down.
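(A watchdog along those lines might look roughly like the sketch below; the log path matches the JAVA_OPTIONS above, but the init script name, window and threshold are guesses for illustration, not the actual cron job.)

  #!/bin/sh
  # Restart VuFind if recent GC log lines report Full GC events.
  GC_LOG=/var/log/vufind2/gc.log
  FULLS=$(tail -n 200 "$GC_LOG" | grep -c "Full GC")
  if [ "$FULLS" -gt 0 ]; then
      logger "gc-watchdog: $FULLS Full GC events in the last 200 log lines; restarting VuFind"
      /etc/init.d/vufind restart
  fi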
>>> >>>
>>> >>> Historic Problems
>>> >>>
>>> >>> Over the past three years, VuFind has always struggled during induction sessions when 40 - 50 students have accessed it at the same time. The problem primarily appeared to be related to garbage collection, as the limit would constantly be reached, necessitating a restart. We now use VuFind 2.0, which is based on Solr 4.0 and is supposed to vastly improve the garbage issues.
>>> >>>
>>> >>> System Outage - Wednesday 11th September
>>> >>>
>>> >>> First reports of performance problems were received around 2:15pm and I began investigations around 2:30pm.
>>> >>>
>>> >>> ssh login to the server took over a minute.
>>> >>>
>>> >>> http://ifind.swan.ac.uk/discover (VuFind) was unresponsive, but the landing page of http://ifind.swan.ac.uk (plain HTML / images) appeared to be unaffected.
>>> >>>
>>> >>> An initial investigation using top revealed almost maximum memory usage and CPU usage of around 40%.
>>> >>>
>>> >>> Standard practice is to restart VuFind to clear any garbage issues, but this had no effect on performance.
>>> >>>
>>> >>> Apache was then restarted, but this also had no effect on performance.
>>> >>>
>>> >>> The server was then rebooted, but this had no effect on performance - the landing page of ifind.swan.ac.uk was also unresponsive.
>>> >>>
>>> >>> /tmp was not mounted correctly, but after that issue was resolved ssh login was instant. Top revealed memory usage to be less than 5GB and CPU usage around 33%.
>>> >>>
>>> >>> I then edited apache2.conf, decreasing MaxClients from 200 to 50, and then restarted Apache. Both ifind.swan.ac.uk/discover and ifind.swan.ac.uk became responsive immediately after this, around 3:10pm.
>>> >>>
>>> >>> Conclusions
>>> >>>
>>> >>> I believe that Garbage Collection is operating correctly and is actually vastly improved from VuFind 1: when we first launched VuFind 2, Googlebot hammered our pages, generating 1.7 million hits in 2 days. VuFind restarted three times over those two days, with intervals of 7 - 8 hours, which is a massive improvement over the 45 min - 90 min intervals sometimes experienced during induction sessions last year for much less traffic.
>>> >>>
>>> >>> Though I would like to think that reducing MaxClients resolved the issue, I cannot rule out that the improvement was due to the end of the induction session or some other unknown factor. The fact that the landing page seemed to be initially responsive when VuFind was not makes me concerned that the issue may be unrelated to the apache2.conf settings.
>>> >>>
>>> >>> Any help on this issue would be most gratefully received!
>>> >>>
>>> >>> Thanks,
>>> >>>
>>> >>> Luke
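(For reference, the apache2.conf change described above lives in the prefork MPM block on Ubuntu 12.04's Apache 2.2; apart from MaxClients, the values here are illustrative defaults rather than Swansea's actual settings.)

  <IfModule mpm_prefork_module>
      StartServers          5
      MinSpareServers       5
      MaxSpareServers      10
      # reduced from 200; keep MaxClients x average per-client RSS comfortably
      # below the RAM left after the 4GB JVM heap, MySQL and the OS
      MaxClients           50
      MaxRequestsPerChild   0
  </IfModule>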