From: Demian K. <dem...@vi...> - 2017-08-22 13:37:46
One other thought/question: has anyone investigated whether the new JSON facet functionality has different performance characteristics than the legacy mechanism used by VuFind? It's long been on the to-do list to investigate this (see https://vufind.org/jira/browse/VUFIND-1210), but since the existing code works fine, there hasn't been a lot of urgency to work on it.

Whether or not using JSON facets makes a difference, I wonder if this mechanism provides the answer to what changed the performance characteristics: perhaps there was some underlying refactoring related to facets that was needed to support new features but had a negative impact on speed.

- Demian

-----Original Message-----
From: Ere Maijala [mailto:ere...@he...]
Sent: Tuesday, August 22, 2017 5:37 AM
To: Günter Hipler
Cc: vuf...@li...
Subject: Re: [VuFind-Tech] facet processing in Solr version 6.x

Thanks for the insight, Günter!

Regarding stored=true, Solr doesn't guarantee the order of the fields with docValues, and it can be relevant in some situations.

I took a quick look at GBV's Solr configuration and couldn't find anything particularly special in it compared to our config. Puzzling stuff! I'll see if I can run a bit of profiling on 6.6 to see if there's anything in particular that sticks out.

--Ere

Günter Hipler wrote on 22.8.2017 at 11.04:
> Good morning Ere,
>
> thanks a lot for this valuable information, Ere.
>
> I'm getting the impression that there is some kind of regression in the course of the development from version 4 to 6 - and I'm a little bit disappointed that nobody on the Solr user list seems to be interested in this topic. I already looked into your configurations and compared them with GBV's (Göttingen, Germany - Till Kinstler, you probably know him).
> GBV is running a big index and it seems they found a solution based on version 6.3 (which didn't work for us so far). You can find their configurations here [1]. By the way, you can find our current various definitions here [2] - with and without docvalues.
>
> I have seen that you defined all your docvalue types with stored=true [3]. Is there any specific reason for this?
>
> Anyway - actually I have the following strategy in mind:
> - I will contact external support. One possibility is a consultant company in Germany we had some contact with last year. The other possibility is to speak with Lucidworks directly. I think I will contact them in this sequence - and of course I will let you know if we come across a solution.
>
> - I hadn't planned to switch directly to distributed mode, but this alternative gives us the possibility of having smaller shards. We have different sizes of indexes for specialized services, and I noticed that an index with 8 million docs still processes the facets more slowly than version 4, but the difference is less significant than for our index with more than 30 million docs. I wanted to avoid doing two things at the same time (version update and introduction of distributed mode) but ok...
>
> - and, once I have a little bit of time (which never happens...) I would like to evaluate our classic environment (with SOLR) on ElasticSearch. I'm curious about the performance of their facet algorithm. Both search servers use their own implementation (not the Lucene facet module). And because we already use ElasticSearch for our linked.swissbib.ch (with both servers in the background) and data.swissbib.ch services, the first step would be done. But anyway - only in the medium or long term.
>
> And thanks for your thoughtful hint related to our open SOLR search API [4]. Only the select request handler is available to external services worldwide - all others are blocked by firewall rules.
> (hopefully we haven't overlooked any hole....)
>
> Best wishes from Basel to Finland!
>
> Günter
>
> [1] https://github.com/gbv/findex-config/tree/master/SolrCloud
> [2] https://github.com/swissbib/searchconf/tree/docvalues/solr6/solr/SOLR_HOME
> [3] https://github.com/NatLibFi/finna-solr/blob/master/vufind/biblio/conf/schema.xml
> [4] http://search.swissbib.ch/solr/sb-biblio/select?q=*%3A*&wt=xml&indent=true
>
> On 21.08.2017 16:13, Ere Maijala wrote:
>> Hi Günter!
>>
>> I think Demian tested with and without docvalues and didn't find docvalues to be faster, but in my experience they make a difference. That's probably because our index is much larger (test installation now at 57 million records). However, docvalues don't make a big enough impact to offset the slowness.
>> If you want to cross-reference your Solr setup with ours, our current Solr configuration can be found at <https://github.com/NatLibFi/finna-solr>.
>>
>> So, faceting for a *:* search is very slow for us too (even slower than yours), and for searches with a high result count it's generally slow. I can't test it with an older Solr, but I recall that we introduced asynchronously loading facets around the same time we moved to Solr 5.x as a sort of workaround. There were some optimizations made for 5.x or 6.x and they helped a bit, but the performance has never been the same as with 4.x. Unfortunately I haven't had time to investigate it further. I really hope you and others can find out what the reason could be.
>>
>> By the way, I noticed that your production Solr is publicly accessible. It's generally considered a really bad idea, since a malicious user could e.g. wipe your index with a simple query.
>>
>> Regards,
>> Ere
>>
>> Günter Hipler wrote on 21.8.2017 at 16.52:
>>> Hi Ere,
>>>
>>> sorry for using your direct email account (and not the vufind list) but I think Finna might actually be the service with the highest expertise in this area.
>>>
>>> I'm trying to update our SOLR environment from Solr version 4.10 to 6 (best would be the latest 6.6).
>>>
>>> Things are in general ok but I have a lot of trouble with the processing time for facets, which is significantly longer and a blocker for changing the environment.
>>>
>>> I just posted a message to the solr list (copy at the end of the email).
>>> I have different versions (without docvalues, as in version 4, and with docvalues, although I would like to use the docvalues approach).
>>> - What are your experiences in this area?
>>> - Do you have any hints on how to tackle this?
>>>
>>> Thanks a lot and best wishes from Basel
>>>
>>> Günter
>>>
>>> *** mail to the list ***
>>>
>>> Hi,
>>>
>>> I can't figure out the reason why the facet processing in version 6 needs significantly more time compared to version 4.
>>>
>>> The debugging response (for 30 million documents)
>>>
>>> solr 4
>>> <lst name="process"><double name="time">280.0</double><lst name="query"><double name="time">0.0</double></lst><lst name="facet"><double name="time">280.0</double></lst>
>>> (once the query is cached) before caching: between 1.5 and 2 sec
>>>
>>> solr 6.x (my last try was with 6.6)
>>> without docvalues for faceting fields (same schema as version 4)
>>> <lst name="process"><double name="time">5874.0</double><lst name="query"><double name="time">0.0</double></lst><lst name="facet"><double name="time">5873.0</double></lst><lst name="facet_module"><double name="time">0.0</double></lst>
>>> the time is not getting better even after repeating the query several times
>>>
>>> solr 6.6 with docvalues for faceting fields
>>> <lst name="process"><double name="time">9837.0</double><lst name="query"><double name="time">0.0</double></lst><lst name="facet"><double name="time">9837.0</double></lst><lst name="facet_module"><double name="time">0.0</double></lst>
>>>
>>> used query (our production system with version 4)
>>> http://search.swissbib.ch/solr/sb-biblio/select?debugQuery=true&q=*:*&facet=true&facet.field=union&facet.field=navAuthor_full&facet.field=format&facet.field=language&facet.field=navSub_green&facet.field=navSubform&facet.field=publishDate&qt=edismax&ps=2&json.nl=arrarr&bf=recip(abs(ms(NOW/DAY,freshness)),3.16e-10,100,100)&fl=*,score&hl.fragsize=250&start=0&q.op=AND&sort=score+desc&rows=0&hl.simple.pre={{{{START_HILITE}}}}&facet.limit=100&hl.simple.post={{{{END_HILITE}}}}&spellcheck=false&qf=title_short^1000+title_alt^200+title_sub^200+title_old^200+title_new^200+author^750+author_additional^100+author_additional_dsv11_txt_mv^100+title_additional_dsv11_txt_mv^100+series^200+topic^500+addfields_txt_mv^50+publplace_txt_mv^25+publplace_dsv11_txt_mv^25+fulltext+callnumber^1000+ctrlnum^1000+publishDate+isbn+variant_isbn_isn_mv+issn+localcode+id&pf=title_short^1000&facet.mincount=1&hl.fl=fulltext&&wt=xml&facet.sort=count
>>>
>>> Running the queries on smaller indices (8 million docs), the difference is similar, although the absolute figures for processing time are smaller.
>>>
>>> Any hints as to why there are such huge differences?
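For a quick sense of scale, the facet timings quoted in the message above can be compared with a few lines of Python. This is only a sketch: the three fragments are copied from the debug output in the mail, while the labels and the helper names (facet_ms, slowdowns) are made up for illustration. Against the cached Solr 4 figure of 280 ms, the two Solr 6 runs come out roughly 21 and 35 times slower.

```python
import re

# Fragments copied from the debug output quoted above. They are partial
# XML (the outer <lst> is never closed), so a regex is used, not a parser.
SAMPLES = {
    "solr 4 (cached)": (
        '<lst name="process"><double name="time">280.0</double>'
        '<lst name="query"><double name="time">0.0</double></lst>'
        '<lst name="facet"><double name="time">280.0</double></lst>'
    ),
    "solr 6.x, no docValues": (
        '<lst name="process"><double name="time">5874.0</double>'
        '<lst name="query"><double name="time">0.0</double></lst>'
        '<lst name="facet"><double name="time">5873.0</double></lst>'
        '<lst name="facet_module"><double name="time">0.0</double></lst>'
    ),
    "solr 6.6, docValues": (
        '<lst name="process"><double name="time">9837.0</double>'
        '<lst name="query"><double name="time">0.0</double></lst>'
        '<lst name="facet"><double name="time">9837.0</double></lst>'
        '<lst name="facet_module"><double name="time">0.0</double></lst>'
    ),
}

# Matches the "facet" component only; "facet_module" has a different name.
FACET_TIME = re.compile(r'<lst name="facet"><double name="time">([0-9.]+)</double>')


def facet_ms(fragment):
    """Return the facet component time in milliseconds (hypothetical helper)."""
    match = FACET_TIME.search(fragment)
    if match is None:
        raise ValueError("no facet timing found in fragment")
    return float(match.group(1))


def slowdowns(samples, baseline):
    """Ratio of each sample's facet time to the baseline sample's facet time."""
    base = facet_ms(samples[baseline])
    return {name: facet_ms(frag) / base for name, frag in samples.items()}


if __name__ == "__main__":
    for name, ratio in slowdowns(SAMPLES, "solr 4 (cached)").items():
        print(f"{name}: {facet_ms(SAMPLES[name]):.0f} ms, {ratio:.1f}x")
```

Note that the baseline is the cached Solr 4 response; the mail gives 1.5 to 2 seconds before caching, so the gap against an uncached Solr 4 query is considerably smaller.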
>>>
>>> Günter
>>>
>>> --
>>> Günter Hipler
>>> Universität Basel | Universitätsbibliothek | Projekt swissbib
>>> Schönbeinstrasse 18-20 | 4056 Basel | Schweiz
>>> Tel +41 61 207 31 12 | Fax +41 61 207 31 03
>>> E-M...@un... | http://www.ub.unibas.ch | https://www.swissbib.ch

--
Ere Maijala
Kansalliskirjasto / The National Library of Finland

_______________________________________________
Vufind-tech mailing list
Vuf...@li...
https://lists.sourceforge.net/lists/listinfo/vufind-tech
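On the JSON facet question raised at the top of this thread: for anyone who wants to run the comparison, here is a rough sketch of how the facet part of the quoted query might be expressed both ways. The field names, limit, mincount and sort come from the query in the original mail; the helper names are hypothetical, the edismax, boost and highlighting parameters are left out, and this is not VuFind code. It only builds the two requests and says nothing about which one is faster.

```python
import json
from urllib.parse import urlencode

# Facet fields taken from the query quoted in this thread.
FACET_FIELDS = ["union", "navAuthor_full", "format", "language",
                "navSub_green", "navSubform", "publishDate"]


def legacy_facet_params(fields, limit=100, mincount=1):
    """Classic facet.* request parameters as a list of (key, value) pairs."""
    params = [("q", "*:*"), ("rows", "0"), ("facet", "true"),
              ("facet.limit", str(limit)),
              ("facet.mincount", str(mincount)),
              ("facet.sort", "count")]
    # facet.field is repeated once per field, as in the quoted query.
    params += [("facet.field", field) for field in fields]
    return params


def json_facet_body(fields, limit=100, mincount=1):
    """JSON request body with one terms facet per field."""
    return {
        "query": "*:*",
        "limit": 0,  # no documents, facet counts only (like rows=0)
        "facet": {
            field: {"type": "terms", "field": field, "limit": limit,
                    "mincount": mincount, "sort": "count desc"}
            for field in fields
        },
    }


if __name__ == "__main__":
    print(urlencode(legacy_facet_params(FACET_FIELDS)))
    print(json.dumps(json_facet_body(FACET_FIELDS), indent=2))
```

The first output is a query string for the select handler; the second is a body that could be POSTed as JSON to the same handler. Running both with debugQuery=true against the same index would give directly comparable timings.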