From: Ere M. <ere...@he...> - 2017-08-23 06:02:09
It's one of the many things I've been meaning to try out... It wouldn't be that difficult to convert a couple of queries by hand and see how it goes, but I've never gotten around to actually doing it. I'll see what I can find out soon.

--Ere

Demian Katz wrote on 22.8.2017 at 16.37:
> One other thought/question: has anyone investigated whether the new JSON facet functionality has different performance characteristics than the legacy mechanism used by VuFind? It's long been on the to-do list to investigate this (see https://vufind.org/jira/browse/VUFIND-1210) but since the existing code works fine, there hasn't been a lot of urgency to work on it.
>
> Whether or not using JSON facets makes a difference, I wonder if this mechanism provides the answer to what changed the performance characteristics: perhaps there was some underlying refactoring related to facets that was needed to support new features but had a negative impact on speed.
>
> - Demian
>
> -----Original Message-----
> From: Ere Maijala [mailto:ere...@he...]
> Sent: Tuesday, August 22, 2017 5:37 AM
> To: Günter Hipler
> Cc: vuf...@li...
> Subject: Re: [VuFind-Tech] facet processing in Solr version 6.x
>
> Thanks for the insight, Günter!
>
> Regarding stored=true, Solr doesn't guarantee the order of the values in fields with docValues, and that can be relevant in some situations.
>
> I took a quick look at GBV's Solr configuration and couldn't find anything particularly special in it compared to our config. Puzzling stuff!
>
> I'll see if I can run a bit of profiling on 6.6 to see if there's anything in particular that sticks out.
>
> --Ere
>
> Günter Hipler wrote on 22.8.2017 at 11.04:
>> Good morning Ere,
>>
>> Thanks a lot for this valuable information.
>>
>> I'm getting the impression that there is some kind of regression in
>> the course of development from version 4 to 6, and I'm a little
>> disappointed that nobody on the Solr user list seems to be
>> interested in this topic.
>> I already looked into your configurations
>> and compared them with GBV's (Göttingen, Germany - Till Kinstler, you
>> probably know him). GBV is running a big index and it seems they found
>> a solution based on version 6.3 (which hasn't worked for us so far). You
>> can find their configurations here [1]. By the way, you can find our
>> various current definitions here [2], with and without docvalues.
>>
>> I have seen that you defined all your docvalue types with stored=true
>> [3]. Is there any specific reason for this?
>>
>> Anyway, I currently have the following strategy in mind:
>> - I will contact external support. One possibility is a consulting
>> company in Germany we had some contact with last year. The other
>> possibility is to speak with Lucidworks directly. I think I will contact
>> them in this order, and of course I will let you know if we come
>> across a solution.
>>
>> - I hadn't planned to switch directly to distributed mode, but
>> this alternative would give us the possibility of smaller shards. We
>> have indexes of different sizes for specialized services, and I noticed
>> that an index with 8 million docs still processes facets more slowly
>> than under version 4, but the gap is less significant than for
>> our index with more than 30 million docs. I wanted to avoid doing
>> two things at the same time (version update and introduction of
>> distributed mode) but ok...
>>
>> - And once I have a little time (which never happens...) I would
>> like to evaluate our classic environment (with SOLR) on ElasticSearch.
>> I'm curious about the performance of their facet algorithm. Both
>> search servers use their own implementation (not the Lucene facet
>> module). And because we already use ElasticSearch for our
>> linked.swissbib.ch (with both servers in the background) and
>> data.swissbib.ch services, the first step would already be done.
>> But anyway, only in the medium or long term.
>>
>>
>> And thanks for your thoughtful hint about our open SOLR search
>> API [4]. Only the select request handler is available to external
>> services worldwide - all others are blocked by firewall rules.
>> (Hopefully we haven't overlooked any hole...)
>>
>> Best wishes from Basel to Finland!
>>
>> Günter
>>
>>
>>
>> [1]
>> https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fgbv%2Ffindex-config%2Ftree%2Fmaster%2FSolrCloud&data=02%7C01%7Cdemian.katz%40villanova.edu%7C6454f7f6b10f444cf61508d4e94153cf%7C765a8de5cf9444f09cafae5bf8cfa366%7C0%7C0%7C636389914166382347&sdata=WYFkQHPgUaNDsgT%2FuEI%2BoG0I%2FEyXotFDPc4WmyeZOhE%3D&reserved=0
>> [2]
>> https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fswissbib%2Fsearchconf%2Ftree%2Fdocvalues%2Fsolr6%2Fsolr%2FSOLR_HOME&data=02%7C01%7Cdemian.katz%40villanova.edu%7C6454f7f6b10f444cf61508d4e94153cf%7C765a8de5cf9444f09cafae5bf8cfa366%7C0%7C0%7C636389914166382347&sdata=YWdewbk6lUEOOm4iF%2BD1hZv4u5mAcDfcxE%2FeMwWtTJ8%3D&reserved=0
>> [3]
>> https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FNatLibFi%2Ffinna-solr%2Fblob%2Fmaster%2Fvufind%2Fbiblio%2Fconf%2Fschema.xml&data=02%7C01%7Cdemian.katz%40villanova.edu%7C6454f7f6b10f444cf61508d4e94153cf%7C765a8de5cf9444f09cafae5bf8cfa366%7C0%7C0%7C636389914166382347&sdata=hBSD2sqVnXriQwnl%2FvgF%2BovxElQUoxIJkHK19RP3gdc%3D&reserved=0
>>
>> [4]
>> https://na01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fsearch.swissbib.ch%2Fsolr%2Fsb-biblio%2Fselect%3Fq%3D*%253A*%26wt%3Dxml%26indent%3Dtrue&data=02%7C01%7Cdemian.katz%40villanova.edu%7C6454f7f6b10f444cf61508d4e94153cf%7C765a8de5cf9444f09cafae5bf8cfa366%7C0%7C0%7C636389914166382347&sdata=nCGVXvbubINbA4l3O4TILG9bPXb0MGx%2B3GLO8ICCIHI%3D&reserved=0
>>
>>
>>
>> On 21.08.2017 16:13, Ere Maijala wrote:
>>> Hi Günter!
>>>
>>> I think Demian tested with and without docvalues and didn't find
>>> docvalues to be faster, but in my experience they make a difference.
>>> That's probably because our index is much larger (test installation
>>> now at 57 million records). However, docvalues don't make a big
>>> enough impact to offset the slowness. If you want to
>>> cross-reference your Solr setup with ours, our current Solr
>>> configuration can be found at <https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FNatLibFi%2Ffinna-solr&data=02%7C01%7Cdemian.katz%40villanova.edu%7C6454f7f6b10f444cf61508d4e94153cf%7C765a8de5cf9444f09cafae5bf8cfa366%7C0%7C0%7C636389914166382347&sdata=ESVWNIGvSHXTTWTwwyIPKCK9drkE8J1erVYZ6VFBKkw%3D&reserved=0>.
>>>
>>> So, faceting for a *:* search is very slow for us too (even slower
>>> than yours), and it's generally slow for searches with a high result count.
>>> I can't test it with an older Solr, but I recall that we introduced
>>> asynchronously loading facets around the same time we moved to Solr
>>> 5.x as a sort of workaround. Some optimizations were made in
>>> 5.x or 6.x and they helped a bit, but the performance has never been
>>> the same as with 4.x. Unfortunately I haven't had time to investigate
>>> it further. I really hope you and others can find out what the reason
>>> could be.
>>>
>>> By the way, I noticed that your production Solr is publicly
>>> accessible. That's generally considered a really bad idea, since a
>>> malicious user could e.g. wipe your index with a simple query.
>>>
>>> Regards,
>>> Ere
>>>
>>> Günter Hipler wrote on 21.8.2017 at 16.52:
>>>> Hi Ere,
>>>>
>>>> sorry for using your direct email account (and not the VuFind
>>>> list), but I think Finna might actually be the service with the highest
>>>> expertise in this area.
>>>>
>>>> I'm trying to update our SOLR environment from Solr version 4.10 to
>>>> 6 (ideally the latest, 6.6).
>>>>
>>>> Things are in general ok, but I have a lot of trouble with the
>>>> processing time for facets, which is significantly longer and a
>>>> blocker for changing the environment.
>>>>
>>>> I just posted a message to the solr list (copy at the end of this
>>>> email). I have different versions (without docvalues, as in version 4,
>>>> and with docvalues, although I would like to use the docvalues
>>>> approach).
>>>> - What are your experiences in this area?
>>>> - Do you have any hints on how to tackle this?
>>>>
>>>> Thanks a lot and best wishes from Basel
>>>>
>>>> Günter
>>>>
>>>> *** mail to the list ***
>>>>
>>>> Hi,
>>>>
>>>> I can't figure out why facet processing in version 6
>>>> needs significantly more time compared to version 4.
>>>>
>>>> The debugging response (for 30 million documents)
>>>>
>>>> solr 4
>>>> <lst name="process"><double name="time">280.0</double><lst name="query"><double name="time">0.0</double></lst><lst name="facet"><double name="time">280.0</double></lst>
>>>> (once the query is cached) before caching: between 1.5 and 2 sec
>>>>
>>>>
>>>> solr 6.x (my last try was with 6.6)
>>>> without docvalues for faceting fields (same schema as version 4)
>>>> <lst name="process"><double name="time">5874.0</double><lst name="query"><double name="time">0.0</double></lst><lst name="facet"><double name="time">5873.0</double></lst><lst name="facet_module"><double name="time">0.0</double></lst>
>>>> the time does not improve even after repeating the query several
>>>> times
>>>>
>>>>
>>>> solr 6.6 with docvalues for faceting fields
>>>> <lst name="process"><double name="time">9837.0</double><lst name="query"><double name="time">0.0</double></lst><lst name="facet"><double name="time">9837.0</double></lst><lst name="facet_module"><double name="time">0.0</double></lst>
>>>>
>>>> used
>>>> query (our production system with version 4)
>>>> https://na01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fsearch.swissbib.ch%2Fsolr%2Fsb-biblio%2Fselect%3FdebugQuery%3Dtrue%26q%3D*%3A*%26facet%3Dtrue%26facet.field%3Dunion%26facet.field%3DnavAuthor_full%26facet.field%3Dformat%26facet.field%3Dlanguage%26facet.field%3DnavSub_green%26facet.field%3DnavSubform%26facet.field%3DpublishDate%26qt%3Dedismax%26ps%3D2%26json.nl%3Darrarr%26bf%3Drecip(abs(ms(NOW%2FDAY%2Cfreshness))%2C3.16e-10%2C100%2C100)%26fl%3D*%2Cscore%26hl.fragsize%3D250%26start%3D0%26q.op%3DAND%26sort%3Dscore%2Bdesc%26rows%3D0%26hl.simple.pre&data=02%7C01%7Cdemian.katz%40villanova.edu%7C6454f7f6b10f444cf61508d4e94153cf%7C765a8de5cf9444f09cafae5bf8cfa366%7C0%7C0%7C636389914166382347&sdata=zDdCTBpjtptHjfyI5CRm6y9yc2TWbFoP4OCnxiyvq3s%3D&reserved=0={{{{START_HILITE}}}}&facet.limit=100&hl.simple.post={{{{END_HILITE}}}}&spellcheck=false&qf=title_short^1000+title_alt^200+title_sub^200+title_old^200+title_new^200+author^750+author_additional^100+author_additional_dsv11_txt_mv^100+title_additional_dsv11_txt_mv^100+series^200+topic^500+addfields_txt_mv^50+publplace_txt_mv^25+publplace_dsv11_txt_mv^25+fulltext+callnumber^1000+ctrlnum^1000+publishDate+isbn+variant_isbn_isn_mv+issn+localcode+id&pf=title_short^1000&facet.mincount=1&hl.fl=fulltext&&wt=xml&facet.sort=count
>>>> <https://na01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fsearch.swissbib.ch%2Fsolr%2Fsb-biblio%2Fselect%3FdebugQuery%3Dtrue%26q%3D*%3A*%26facet%3Dtrue%26facet.field%3Dunion%26facet.field%3DnavAuthor_full%26facet.field%3Dformat%26facet.field%3Dlanguage%26facet.field%3DnavSub_green%26facet.field%3DnavSubform%26facet.field%3DpublishDate%26qt%3Dedismax%26ps%3D2%26json.nl%3Darrarr%26bf%3Drecip%2528abs%2528ms%2528NOW%2FDAY%2Cfreshness%2529%2529%2C3.16e-10%2C100%2C100%2529%26fl%3D*%2Cscore%26hl.fragsize%3D250%26start%3D0%26q.op%3DAND%26sort%3Dscore%2Bdesc%26rows%3D0%26hl.simple.pre%3D%257B%257B%257B%257BSTART_HILITE%257D%257D%257D%257D%26facet.limit%3D100%26hl.simple.post%3D%257B%257B%257B%257BEND_HILITE%257D%257D%257D%257D%26spellcheck%3Dfalse%26qf%3Dtitle_short%255E1000%2Btitle_alt%255E200%2Btitle_sub%255E200%2Btitle_old%255E200%2Btitle_new%255E200%2Bauthor%255E750%2Bauthor_additional%255E100%2Bauthor_additional_dsv11_txt_mv%255E100%2Btitle_additional_dsv11_txt_mv%255E100%2Bseries%255E200%2Btopic%255E500%2Baddfields_txt_mv%255E50%2Bpublplace_txt_mv%255E25%2Bpublplace_dsv11_txt_mv%255E25%2Bfulltext%2Bcallnumber%255E1000%2Bctrlnum%255E1000%2BpublishDate%2Bisbn%2Bvariant_isbn_isn_mv%2Bissn%2Blocalcode%2Bid%26pf%3Dtitle_short%255E1000%26facet.mincount%3D1%26hl.fl%3Dfulltext%26%26wt%3Dxml%26facet.sort%3Dcount&data=02%7C01%7Cdemian.katz%40villanova.edu%7C6454f7f6b10f444cf61508d4e94153cf%7C765a8de5cf9444f09cafae5bf8cfa366%7C0%7C0%7C636389914166382347&sdata=ae0jjS%2FsVNk4kAODunQ5te%2FeOGTOUAp6CtTk4YQQ%2BAY%3D&reserved=0>
>>>>
>>>>
>>>>
>>>> Running the queries on smaller indices (8 million docs), the
>>>> difference is similar, although the absolute figures for processing
>>>> time are smaller.
>>>>
>>>>
>>>> Any hints as to why the differences are so large?
>>>>
>>>> Günter
>>>>
>>>>
>>>>
>>>> --
>>>> Günter Hipler
>>>>
>>>> Universität Basel | Universitätsbibliothek | Projekt swissbib
>>>>
>>>> Schönbeinstrasse 18-20 | 4056 Basel | Schweiz
>>>>
>>>> Tel +41 61 207 31 12 | Fax +41 61 207 31 03
>>>>
>>>> E-M...@un... |https://na01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fwww.ub.unibas.ch&data=02%7C01%7Cdemian.katz%40villanova.edu%7C6454f7f6b10f444cf61508d4e94153cf%7C765a8de5cf9444f09cafae5bf8cfa366%7C0%7C0%7C636389914166382347&sdata=yc4KqiovHdEU7W%2F0cRm4r%2F3Gk2VvwOt2SmMhTTq7lGg%3D&reserved=0
>>>> |https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.swissbib.ch&data=02%7C01%7Cdemian.katz%40villanova.edu%7C6454f7f6b10f444cf61508d4e94153cf%7C765a8de5cf9444f09cafae5bf8cfa366%7C0%7C0%7C636389914166382347&sdata=OtYYydWKcARNuKrWQdJiwg1HFGdb2yY94OpwBrhyshk%3D&reserved=0
>>>>
>>>
>>
>

--
Ere Maijala
Kansalliskirjasto / The National Library of Finland
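
For anyone who wants to try the hand conversion discussed in this thread, here is a minimal sketch of how the legacy facet.field / facet.limit / facet.mincount parameters from Günter's query might be expressed as a json.facet request. The helper names (legacy_to_json_facet, build_params) are made up for illustration and are not part of VuFind or Solr; the field names are taken from the query quoted in the thread. It only builds the request parameters, so it says nothing about whether JSON facets are actually faster.

```python
import json
from urllib.parse import urlencode

# Facet fields taken from the legacy query quoted in the thread.
FACET_FIELDS = [
    "union", "navAuthor_full", "format", "language",
    "navSub_green", "navSubform", "publishDate",
]


def legacy_to_json_facet(fields, limit=100, mincount=1):
    """Map facet.field/facet.limit/facet.mincount to JSON Facet API terms facets.

    Hypothetical helper: one "terms" facet per field, keyed by the field name.
    Terms facets sort by count descending by default, which corresponds to
    facet.sort=count in the legacy query.
    """
    return {
        field: {
            "type": "terms",
            "field": field,
            "limit": limit,
            "mincount": mincount,
        }
        for field in fields
    }


def build_params(fields, limit=100, mincount=1):
    """Return a URL-encoded query string for a *:* search with JSON facets only."""
    facets = legacy_to_json_facet(fields, limit, mincount)
    return urlencode({
        "q": "*:*",
        "rows": 0,            # facets only, no documents
        "wt": "json",
        "debugQuery": "true",  # keep timing output for comparison
        "json.facet": json.dumps(facets),
    })


if __name__ == "__main__":
    # Append this to .../select? on a test core (not a production one).
    print(build_params(FACET_FIELDS))
```

Running the same search once with the legacy parameters and once with this query string, and comparing the timing section of the debug output, would be one way to check Demian's question on a given index.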