From: Christopher M. <chr...@mc...> - 2004-04-07 15:53:30
|
Greetings htdig folks, In my search.conf my exclude_urls is set to: exclude_urls: `/usr/local/htdig/common/excludeURL/exclude_urls` This works as expected, and all URLs that are in that file are excluded from being indexed. However, I've noticed that adding URLs *seriously* degrades digging performance, to the point that with 30 or so patterns, I got 8k pages in over 8 hours, whereas without them, I could do 15k pages in an hour. With such a drastic difference, I'm assuming that there's a bug somewhere. I'll try to go digging through the code to find it, but I imagine that someone on this list will have better luck than me. :-) Cheers, Chris -- Christopher Murtagh Enterprise Systems Administrator ISR / Web Communications Group McGill University Montreal, Quebec Canada Tel.: (514) 398-3122 Fax: (514) 398-2017 |
From: Lachlan A. <lh...@us...> - 2004-04-10 04:23:23
|
Greetings Chris and team, Profiling should help a lot here. One possibility is that it is because ht://Dig now allows attributes to be specified on a per-host basis. As a result, all of the exclude_urls patterns are re-compiled for each url checked (even if you don't specify host-dependent attributes). If (for example) there is a memory leak in the regex compiler, this would slow the code down as you describe. Even if it is not causing Chris's problem, I've been thinking for a long time that this re-parsing may partially account for the big slowdown in digging with recent 3.2.0 betas. From speaking with Gabriele, I know that he finds the feature very useful. However, I can't find the config file format documented, so I don't think that many people can benefit from it, and it is currently just (unquantified) bloat. If Chris finds that that is in fact the problem, I suggest that: 1. We set a flag if per-host attributes are used at all 2. If no per-host attributes are used, all expensive-to-parse attributes (like regular expressions) should be cached in their parsed forms. Thoughts? Lachlan On Thu, 8 Apr 2004 01:52, Christopher Murtagh wrote: > I've noticed that adding URLs *seriously* degrades > digging performance. To a point that with 30 or so patterns, I got > 8k pages in over 8 hours, and without them, I could do 15k pages in > an hour. > > With such a drastic difference, I'm assuming that there's a bug > somewhere. I'll try to go digging through the code to find it, but > I imagine that someone on this list will have better luck than me. > :-) -- lh...@us... ht://Dig developer DownUnder (http://www.htdig.org) |
From: Gilles D. <gr...@sc...> - 2004-04-14 18:53:04
|
According to Lachlan Andrew: > Even if it is not causing Chris's problem, I've been thinking for a > long time that this re-parsing may partially account for the big > slowdown in digging with recent 3.2.0 betas. From speaking with > Gabriele, I know that he finds the feature very useful. However, I > can't find the config file format documented, so I don't think that > many people can benefit from it, and it is currently just > (unquantified) bloat. > > If Chris finds that that is in fact the problem, I suggest that: > 1. We set a flag if per-host attributes are used at all > 2. If no per-host attributes are used, all expensive-to-parse > attributes (like regular expressions) should be cached in their > parsed forms. Remember that there are URL blocks as well as server blocks and globals, so we actually have 3 different levels. It might be useful to maintain a status for each attribute that tells us the lowest level at which the attribute is defined. That way even if some attributes are used in URL blocks, we don't need to reparse the ones that aren't. -- Gilles R. Detillieux E-mail: <gr...@sc...> Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/ Dept. Physiology, U. of Manitoba Winnipeg, MB R3E 3J7 (Canada) |
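
A rough sketch of the bookkeeping Gilles describes above, purely for illustration (the enum, class and method names are invented and are not ht://Dig code): the configuration parser records the narrowest scope each attribute is assigned in, and the retriever only re-reads and re-compiles attributes that can actually vary between blocks.

    // Hypothetical sketch: track the lowest level at which each attribute is
    // defined, so attributes that are only set globally are parsed just once.
    #include <map>
    #include <string>

    enum AttrScope { SCOPE_GLOBAL = 0, SCOPE_SERVER = 1, SCOPE_URL = 2 };

    class AttrScopeTable
    {
    public:
        // Called by the config parser whenever an attribute is assigned.
        void Note(const std::string &name, AttrScope where)
        {
            AttrScope &s = table[name];      // defaults to SCOPE_GLOBAL
            if (where > s)
                s = where;                   // remember the narrowest scope seen
        }
        // The retriever asks this before deciding whether a cached, compiled
        // value (e.g. an HtRegexList) can be reused for the next URL.
        bool MayVaryByBlock(const std::string &name) const
        {
            std::map<std::string, AttrScope>::const_iterator it = table.find(name);
            return it != table.end() && it->second != SCOPE_GLOBAL;
        }
    private:
        std::map<std::string, AttrScope> table;
    };
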
From: Lachlan A. <lh...@us...> - 2004-04-17 00:00:09
|
Greetings Gilles, That is a very good suggestion. I suppose it comes down to the question of how many people actually use per-server blocks. If they are rare enough, then I vote we optimise for the common case. The cost of having per-attribute flags is (a) an extra two bits per attribute (which probably becomes 32 bits after alignment issues), (b) an extra attribute-lookup each time the attribute may be used, and (c) extra code to write/debug/document/maintain. If, say, under 1% of users use per-host attributes, then we could use a single flag. If it is more than 1%, we could "do the right thing" as Gilles suggests. As Robert Ribnitz recently pointed out, many people will only index documentation on their own box (that is what brought me to ht://Dig). Of the rest, most probably only index a single site. Of the rest, most probably don't know about the per-host option, because its documentation isn't very prominent. Thoughts? Lachlan On Thu, 15 Apr 2004 04:52, Gilles Detillieux wrote: > According to Lachlan Andrew: > > If Chris finds that that is in fact the problem, I suggest that: > > 1. We set a flag if per-host attributes are used at all > > 2. If no per-host attributes are used, all expensive-to-parse > > attributes (like regular expressions) should be cached in > > their parsed forms. > > Remember that there are URL blocks as well as server blocks and > globals, so we actually have 3 different levels. It might be > useful to maintain a status for each attribute that tells us the > lowest level at which the attribute is defined. That way even if > some attributes are used in URL blocks, we don't need to reparse > the ones that aren't. -- lh...@us... ht://Dig developer DownUnder (http://www.htdig.org) |
From: Lachlan A. <lh...@us...> - 2004-04-10 05:02:41
|
Greetings all, One of the biggest show-stoppers for 3.2.0rc1 seems to be speed. We had agreed to complete testing and then release, but ht://Dig would get a very bad reputation if we released the current slow product. I propose that we suspend testing and all concentrate on getting performance to within a factor of 2 of the 3.1.6 speed. Some possible approaches are: 1. Keep the database format entirely unchanged, but write to it more efficiently. Neal reported good results by caching updates using the STL. That would improve the "locality" of the database accesses, both improving its own caching and reducing the performance hit from compression. Neal, could you tell us a bit more about this? 2. Give users the option to store only the first occurrence of each word in each document. That will kill phrase searching, but it should make the database smaller, and (if done correctly) eliminate most of the writes to it in the build phase. This keeps the database format nominally the same (so search/purge are unaffected). 3. Totally rewrite the database format to avoid the significant redundancy. This really should be done at some stage, and I vote the sooner the better. As I understand it, the entries are all of the format Word, Doc ID (32 bits), flags (8 bits), location (16 bits). Does anyone know how well BDB handles variable-length records? If they are OK, how about a format like: Word, Doc ID (32 bits), count (16), <flags (8), offset (16)>+ Here, the "offset" field is the *difference* between the locations of consecutive occurrences. These numbers will be more likely to be under 255, and so should compress better. Because entries are made for cross-references from other documents, we could allow multiple entries of that form for the same word/document, but still massively reduce the number of redundant Word/DocID fields, and (more importantly?) the number of database writes. Could I ask for a "show of hands" of people who can help here? We need people who know why the current database format was selected, understand the current code, understand BDB, or are willing to help code. (I'm only in the last of those categories, unfortunately.) Cheers, Lachlan -- lh...@us... ht://Dig developer DownUnder (http://www.htdig.org) |
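
For concreteness, here is one way an entry in the proposed format could be serialised. This is purely illustrative (the struct, the typedefs and the helper are assumptions, not existing ht://Dig code), but it shows the delta-encoded offsets that should keep most stored values small:

    #include <cstddef>
    #include <vector>

    typedef unsigned char  uint8;
    typedef unsigned short uint16;
    typedef unsigned int   uint32;

    struct Occurrence
    {
        uint8  flags;     // per-occurrence flags, as in the current format
        uint16 location;  // absolute word position within the document
    };

    static void put16(std::vector<uint8> &out, uint16 v)
    {
        out.push_back((uint8)(v & 0xff));
        out.push_back((uint8)(v >> 8));
    }

    // Layout: DocID (32) | count (16) | { flags (8) | offset (16) } * count,
    // where each offset is the gap since the previous occurrence, so most
    // values are small and should compress well.
    std::vector<uint8> EncodeEntry(uint32 docId, const std::vector<Occurrence> &occ)
    {
        std::vector<uint8> out;
        put16(out, (uint16)(docId & 0xffff));
        put16(out, (uint16)(docId >> 16));
        put16(out, (uint16)occ.size());
        uint16 prev = 0;
        for (size_t i = 0; i < occ.size(); ++i)
        {
            out.push_back(occ[i].flags);
            put16(out, (uint16)(occ[i].location - prev));   // delta, not absolute
            prev = occ[i].location;
        }
        return out;
    }

The word itself would remain the BDB key, so only the per-document payload changes; whether BDB copes well with such variable-length values is exactly the open question above.
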
From: Lachlan A. <lh...@us...> - 2004-04-12 02:54:29
|
Greetings Chris, What are the actual 30 patterns that cause problems? It could well be a single pattern which is giving the regular-expression handler grief. Cheers, Lachlan On Thu, 8 Apr 2004 01:52, Christopher Murtagh wrote: > In my search.conf my exculde_urls is set to: > exclude_urls: `/usr/local/htdig/common/excludeURL/exclude_urls` > > However, I've noticed that adding URLs *seriously* degrades > digging performance, to the point that with 30 or so patterns, I got > 8k pages in over 8 hours, and without them, I could do 15k pages in > an hour. > > With such a drastic difference, I'm assuming that there's a bug > somewhere. -- lh...@us... ht://Dig developer DownUnder (http://www.htdig.org) |
From: Christopher M. <chr...@mc...> - 2004-04-13 12:04:16
|
On Sun, 2004-04-11 at 22:50, Lachlan Andrew wrote: > Greetings Chris, > > What are the actual 30 patterns that cause problems? It could well be > a single pattern which is giving the regular-expression handler > grief. Yeah, I had thought the same as well. However, I tried changing the patterns, and also leaving very simple patterns. They all ended up the same. Also, the more patterns I added, the slower the performance was. Cheers, Chris -- Christopher Murtagh Enterprise Systems Administrator ISR / Web Communications Group McGill University Montreal, Quebec Canada Tel.: (514) 398-3122 Fax: (514) 398-2017 |
From: Christopher M. <chr...@mc...> - 2004-04-13 12:06:17
|
On Sat, 2004-04-10 at 00:19, Lachlan Andrew wrote: > If Chris finds that that is in fact the problem, I suggest that: > 1. We set a flag if per-host attributes are used at all > 2. If no per-host attributes are used, all expensive-to-parse > attributes (like regular expressions) should be cached in their > parsed forms. > > Thoughts? Just let me know what you want/need me to do. I can poke my head into the code again if need be, but my C++ is starting to get really rusty. If you have any general direction/pointers to throw my way, it would be much appreciated, and I'll do what I can to help out. Thanks for looking into this. Cheers, Chris -- Christopher Murtagh Enterprise Systems Administrator ISR / Web Communications Group McGill University Montreal, Quebec Canada Tel.: (514) 398-3122 Fax: (514) 398-2017 |
From: Lachlan A. <lh...@us...> - 2004-04-16 14:32:21
|
Greetings Chris, As I recall, you listed the patterns in a file, and included that file in the htdig.conf file using backquotes. It just occurred to me that the file listing the patterns is probably being read in each time the attribute is read (each time a url is parsed). What happens to the speed if you list the patterns explicitly in the htdig.conf file? If that is the problem, one solution would be for HtConfig::Find() to replace any `file` argument by the contents of the file the first time it is read. That may chew up some memory for things like long lists for start_url, but it would only use memory for attributes that are actually used. (Of course, it would still be better not to re-parse the attributes unnecessarily.) Opinions? Cheers, Lachlan On Tue, 13 Apr 2004 22:04, Christopher Murtagh wrote: > On Sun, 2004-04-11 at 22:50, Lachlan Andrew wrote: > > Greetings Chris, > > > > What are the actual 30 patterns that cause problems? It could > > well be a single pattern which is giving the regular-expression > > handler grief. > > Yeah, I had thought the same as well. However, I tried changing > the patterns, and also leaving very simple patterns. They all ended > up the same. Also, the more patterns I added, the slower the > performance was. > > Cheers, > > Chris -- lh...@us... ht://Dig developer DownUnder (http://www.htdig.org) |
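
A sketch of what that expansion might look like, with an invented helper name; the real change would live wherever HtConfiguration resolves attribute values, and would store the expanded string back so the file is read only once per dig:

    #include <fstream>
    #include <sstream>
    #include <string>

    // Hypothetical helper: expand a `filename` value into the file's contents.
    std::string ExpandBackquotedFile(const std::string &value)
    {
        if (value.size() < 2 || value[0] != '`' || value[value.size() - 1] != '`')
            return value;                    // not a backquoted file reference
        std::ifstream in(value.substr(1, value.size() - 2).c_str());
        std::ostringstream patterns;
        std::string line;
        while (std::getline(in, line))
            patterns << line << ' ';         // file becomes one space-separated list
        return patterns.str();               // caller caches this as the new value
    }
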
From: Christopher M. <chr...@mc...> - 2004-04-16 16:26:12
|
On Fri, 2004-04-16 at 10:27, Lachlan Andrew wrote: > As I recall, you listed the patterns in a file, and included that file > in the htdig.conf file using backquotes. It just occurred to me that > the file listing the patterns is probably being read in each time the > attribute is read (each time a url is parsed). > > What happens to the speed if you list the patterns explicitly in the > htdig.conf file? Yeah, I thought the same thing and had tried that. Same results. From what I can see in the source of htdig/Retriever.cc (line 998-1000), the URL list is re-parsed at *every* URL: // // If the URL contains any of the patterns in the exclude list, // mark it as invalid // tmpList.Create(config->Find(&aUrl, "exclude_urls"), " \t"); HtRegexList excludes; excludes.setEscaped(tmpList, config->Boolean("case_sensitive")); This would explain some of the problems that it is having. The other place I have been looking at is the HtRegexList::setEscaped method (htlib/HtRegexList.cc) which seems to be really expensive, but I've really lost touch with C++ and I'm definitely not a good judge of it anymore. Cheers, Chris -- Christopher Murtagh Enterprise Systems Administrator ISR / Web Communications Group McGill University Montreal, Quebec Canada Tel.: (514) 398-3122 Fax: (514) 398-2017 |
From: Christopher M. <chr...@mc...> - 2004-04-19 19:51:48
|
Ok, I've solved my problem, and can now have a list of working exclude_urls without the serious performance decrease. Here are the changes I made (sorry I'm not sending a proper diff file... need guidance on how to do that properly): htdig/htdig.h -------------------- added: extern int exclude_checked; extern int badquerystr_checked; extern HtRegexList excludes; extern HtRegexList badquerystr; htdig/htdig.cc ---------------------- added these as global variable definitions: int exclude_checked = 0; int badquerystr_checked = 0; HtRegexList excludes; HtRegexList badquerystr; htdig/Retriever.cc added these conditionals and removed the previous tmplist creates and .setEscaped() calls: if(!(exclude_checked)){ //only parse this once and store into global variable tmpList.Destroy(); tmpList.Create(config->Find(&aUrl, "exclude_urls"), " \t"); excludes.setEscaped(tmpList, config->Boolean("case_sensitive")); exclude_checked = 1; } if(!(badquerystr_checked)){ //only parse this once and store into global variable tmpList.Destroy(); tmpList.Create(config->Find(&aUrl, "bad_querystr"), " \t"); badquerystr.setEscaped(tmpList, config->Boolean("case_sensitive")); badquerystr_checked = 1; } The difference in performance is night and day, and the excludes list is only parsed once per dig rather than at *every* URL found. If this is at all useful to anyone, let me know. I can send files or if someone would enlighten me (even RTFM me) I can send diff/patches. Cheers, Chris -- Christopher Murtagh Enterprise Systems Administrator ISR / Web Communications Group McGill University Montreal, Quebec Canada Tel.: (514) 398-3122 Fax: (514) 398-2017 |
From: Gilles D. <gr...@sc...> - 2004-04-20 21:45:12
|
Hi, Chris and other developers. The problem with this fix is that exclude_urls and bad_querystr can no longer be used in server blocks or URL blocks, as they'll only be parsed once regardless of how they're used. That's OK if you don't use them in blocks, but for the distributed code, we need to find a more generalized solution. Also, there may be other regex-based attributes that also need optimizing. However, your fix and the discussion that led up to it did give me an inspiration for a fairly simple fix to the Regex class to optimize all uses of it in a general way. Instead of just having a flag in the Regex object that says if the pattern has been compiled, why not have it hold on to a copy of the compiled string? That way, whenever the set() method is called again, it can check to see if it's being asked to compile the same pattern over again. If it is, it knows it doesn't need to call regcomp() again. Presumably, it's regcomp() that's the real time killer here, and not all the attribute and string handling done at a higher level. The repeated string comparisons should be cheap in comparison to all the regcomp() calls on those strings. I'll see if I can find some time to work out a patch for this. Complications: 1) The variables that use the HtRegex and HtRegexList classes will need to be global, or otherwise made persistent, so that they can take advantage of the optimization. Is this going to be a problem with libhtdig, Neal? What is the best way to approach this issue? 2) The savings in not calling regcomp(), which likely will be substantial, may not be entirely enough to get the performance gain that the quick fix below gets. If so, we could look into higher-level fixes as well, e.g. in HtRegexList::setEscaped() as well, to save us some of the string handling that takes place there. Chris, would you be willing to do some comparative performance testing on various patches, if it comes to that? 3) We may also need to determine if the repeated calls to config->Find() at each URL are having an impact on performance as well. E.g. what is the performance cost of doing thousands of calls like this one? tmpList.Create(config->Find(&aUrl, "exclude_urls"), " \t"); We might need to do some more profiling as well. Thoughts? According to Christopher Murtagh: > Ok, I've solved my problem, and can now have a list of working > exclude_urls without the serious performance decrease. Here are the > changes I made (sorry I'm not sending a proper diff file... 
need > guidance on how to do that properly): > > > htdig/htdig.h > -------------------- > > added: > > extern int exclude_checked; > extern int badquerystr_checked; > extern HtRegexList excludes; > extern HtRegexList badquerystr; > > > > htdig/htdig.cc > ---------------------- > > added these as global variable definitions: > > int exclude_checked = 0; > int badquerystr_checked = 0; > > HtRegexList excludes; > HtRegexList badquerystr; > > > htdig/Retriever.cc > > added these conditionals and removed the previous tmplist creates and > .setEscaped() calls: > > if(!(exclude_checked)){ > //only parse this once and store into global variable > tmpList.Destroy(); > tmpList.Create(config->Find(&aUrl, "exclude_urls"), " \t"); > excludes.setEscaped(tmpList, config->Boolean("case_sensitive")); > exclude_checked = 1; > } > > if(!(badquerystr_checked)){ > //only parse this once and store into global variable > tmpList.Destroy(); > tmpList.Create(config->Find(&aUrl, "bad_querystr"), " \t"); > badquerystr.setEscaped(tmpList, config->Boolean("case_sensitive")); > badquerystr_checked = 1; > } > > The difference in performance is night and day, and the excludes list > is only parsed once per dig rather than at *every* URL found. -- Gilles R. Detillieux E-mail: <gr...@sc...> Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/ Dept. Physiology, U. of Manitoba Winnipeg, MB R3E 3J7 (Canada) |
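
A rough sketch of the optimisation Gilles proposes, assuming the real class wraps POSIX regcomp()/regexec(); the class and member names here are invented and this is not the actual HtRegex code:

    #include <regex.h>
    #include <string>

    class CachedRegex
    {
    public:
        CachedRegex() : compiled(false) { }
        ~CachedRegex() { if (compiled) regfree(&re); }

        // Recompile only if the pattern actually changed since the last set().
        int set(const std::string &pattern, bool caseSensitive)
        {
            if (compiled && pattern == lastPattern)
                return 0;                    // same pattern: skip regcomp()
            if (compiled)
                regfree(&re);
            int flags = REG_EXTENDED | (caseSensitive ? 0 : REG_ICASE);
            int err = regcomp(&re, pattern.c_str(), flags);
            compiled = (err == 0);
            lastPattern = pattern;
            return err;
        }
        bool match(const char *s) const
        {
            return compiled && regexec(&re, s, 0, 0, 0) == 0;
        }
    private:
        regex_t re;
        bool compiled;
        std::string lastPattern;
    };

The repeated string comparisons should be cheap next to regcomp(), as noted above, so the win depends entirely on how often set() is handed the same pattern twice in a row.
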
From: Lachlan A. <lh...@us...> - 2004-04-21 13:14:11
Attachments:
slightly-better.0
|
Greetings Gilles + all, Yes, I agree that we need a more "polished" patch for the distribution. I still like my intermediate path: If *any* server blocks or URL blocks are used, then the user takes the performance hit and re-parses each time. If *no* server/URL blocks are used, we use Chris's patch. This should be just as fast as Chris's patch (in the "3.1-compatibility mode" without server/URL blocks), and just as flexible as the current status (if blocks are used). If that can get ht://Dig fast enough to get into sarge, then I suggest we implement it first, and then work on Gilles's more complete solution at more leisure. A first hack at this (not even compile-tested) is attached, patched relative to Chris's patched version, so you can see what I mean. If people are in favour, I'll try to work on it over the weekend. One issue with caching input strings is that we would have to have some sort of cache-flushing, or just let the storage grow as HtRegex is called repeatedly. Cheers, Lachlan On Wed, 21 Apr 2004 07:45 am, Gilles Detillieux wrote: > Hi, Chris and other developers. The problem with this fix is that > exclude_urls and bad_querystr can no longer be used in server > blocks or URL blocks, as they'll only be parsed once regardless of > how they're used. -- lh...@us... ht://Dig developer DownUnder (http://www.htdig.org) |
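
In outline (only the config_server_URL_blocks variable comes from the attached patch; the helper around it is an invented illustration):

    // Set by the configuration parser whenever a server or URL block is seen.
    int config_server_URL_blocks = 0;

    // Retriever-side guard: if no blocks exist at all, every URL sees the same
    // attribute values, so compiled regex lists can be built once and reused.
    bool CanReuseCompiledAttributes()
    {
        return config_server_URL_blocks == 0;
    }
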
From: Christopher M. <chr...@mc...> - 2004-04-21 15:07:00
|
On Wed, 2004-04-21 at 09:13, Lachlan Andrew wrote: > Yes, I agree that we need a more "polished" patch for the > distribution. I still like my intermediate path: If *any* server > blocks or URL blocks are used, then the user takes the performance > hit and re-parses each time. That sounds like a decent plan to me. However, 'performance hit' is a serious understatement with the current code. Without my patch, my Dual 3GHz Xeon had one CPU pegged at 100% for 8 hours and still only managed to index a little under 1000 pages per hour. Once I stopped re-parsing the URL list for each URL, the CPU usage was down to about 6-10%, essentially making it an I/O-bound problem rather than a CPU-bound one. I suspect, as Gilles suggests, that there are probably other optimizations in the Regex code that could help out a lot in this matter. > If *no* server/URL blocks are used, we use Chris's patch. This should > be just as fast as Chris's patch (in the "3.1-compatibly mode" without > server/URL blocks), and just as flexible as the current status > (if blocks are used). I can't believe it took me until today to find the block configuration in the documentation. Once I found it, it seems to be in an obvious place, but perhaps it needs a mention as a feature on the front page of the docs? Cheers, Chris -- Christopher Murtagh Enterprise Systems Administrator ISR / Web Communications Group McGill University Montreal, Quebec Canada Tel.: (514) 398-3122 Fax: (514) 398-2017 |
From: Gilles D. <gr...@sc...> - 2004-04-22 03:02:20
|
According to Christopher Murtagh: > On Wed, 2004-04-21 at 09:13, Lachlan Andrew wrote: > > Yes, I agree that we need a more "polished" patch for the > > distribution. I still like my intermediate path: If *any* server > > blocks or URL blocks are used, then the user takes the performance > > hit and re-parses each time. > > That sounds like a decent plan to me. However, 'performance hit' is a > serious understatement with the current code. Without my patch, my Dual > 3GHz Xeon had one CPU pegged at 100% for 8 hours and still only managed > to index a little under 1000 pages per hour. Once I stopped re-parsing > the URL list at for each URL, the CPU usage was down to about 6-10%, > essentially making it an I/O bound problem rather than CPU. > > I suspect, as Gilles suggests that there are probably other > optimizations in the Regex code that could help out a lot in this > matter. Yes, what I initially had in mind was pretty simple. The Regex object already stores the compiled pattern for the last value used. I'd just add to that object the string that was compiled to get that binary pattern, so that we could avoid repeatedly compiling the same pattern. That, in itself, would be a trivial addition. It wouldn't be a full caching scheme with multiple patterns stored, so we don't need to worry about cache flushing. However, it occurs to me that this fix wouldn't be a huge help to users who do use exclude_urls or bad_querystr within URL or server blocks. The reason for this is that htdig will alternate between servers, to distribute the load, so once it has used a pattern within an URL or server block it will move quickly on to another server, causing htdig to have to go to another pattern -- forcing a recompile of the pattern. Still, it would likely be an improvement over what it does now, for the vast majority of users. > > If *no* server/URL blocks are used, we use Chris's patch. This should > > be just as fast as Chris's patch (in the "3.1-compatibly mode" without > > server/URL blocks), and just as flexible as the current status > > (if blocks are used). > > I can't believe it took me until today to find the block configuration > in the documentation. Once I found it, it seems to be in an obvious > place, but perhaps it needs a mention as a feature in the front page of > the docs? Yeah, it's not very prominent in the docs right now, and it ought to be made more obvious. Most users of 3.2 don't even know about this feature. -- Gilles R. Detillieux E-mail: <gr...@sc...> Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/ Dept. Physiology, U. of Manitoba Winnipeg, MB R3E 3J7 (Canada) |
From: Lachlan A. <lh...@us...> - 2004-04-22 11:55:24
|
Yes, the heavy I/O is why we should also look at optimising the database format / access. Cheers, Lachlan On Thu, 22 Apr 2004 01:05 am, Christopher Murtagh wrote: > Once I stopped re-parsing the URL list at for each URL, the CPU > usage was down to about 6-10%, essentially making it an I/O bound > problem rather than CPU. -- lh...@us... ht://Dig developer DownUnder (http://www.htdig.org) |
From: Joe R. J. <jj...@cl...> - 2004-04-21 23:16:42
|
On Wed, 21 Apr 2004, Lachlan Andrew wrote: > Date: Wed, 21 Apr 2004 23:13:27 +1000 > From: Lachlan Andrew <lh...@us...> > To: Gilles Detillieux <gr...@sc...>, Christopher Murtagh <chr...@mc...> > Cc: htd...@li... > Subject: [htdig-dev] Re: Performance issue with exclude_urls > > Greetings Gilles + all, > > Yes, I agree that we need a more "polished" patch for the > distribution. I still like my intermediate path: If *any* server > blocks or URL blocks are used, then the user takes the performance > hit and re-parses each time. If *no* server/URL blocks are used, we > use Chris's patch. This should be just as fast as Chris's patch (in > the "3.1-compatibly mode" without server/URL blocks), and just as > flexible as the current status (if blocks are used). If that can get > ht://Dig fast enough to get into sarge, then I suggest we implement > it first, and then work on Gilles's more complete solution at more > leisure. I applied Chris' patch and ran htdig on the same site as before for profile; htdig ran ~40% faster than last time;) Here is the profile: ftp://ftp.ccsf.org/htdig-patches/3.2.0b5/htdig.gmon.exclude_perform.gz > A first hack at this (not even compile-tested) is attached, patched > relative to Chris's patched version, so you can see what I mean. If > people are in favour, I'll try to work on it over the weekend. The "slightly-better.0" patch applies, but it does not compile: Retriever.cc: In method `int Retriever::IsValidURL(const String &)': Retriever.cc:998: `config_server_URL_blocks' undeclared (first use this function) Retriever.cc:998: (Each undeclared identifier is reported only once Retriever.cc:998: for each function it appears in.) gmake[1]: *** [Retriever.o] Error 1 Regards, Joe -- _/ _/_/_/ _/ ____________ __o _/ _/ _/ _/ ______________ _-\<,_ _/ _/ _/_/_/ _/ _/ ......(_)/ (_) _/_/ oe _/ _/. _/_/ ah jj...@cl... > One issue with caching input strings is that we would have to have > some sort of cache-flushing, or just let the storage grow as HtRegEx > is called repeatedly. > > Cheers, > Lachlan > > On Wed, 21 Apr 2004 07:45 am, Gilles Detillieux wrote: > > Hi, Chris and other developers. The problem with this fix is that > > exclude_urls and bad_querystr can no longer be used in server > > blocks or URL blocks, as they'll only be parsed once regardless of > > how they're used. |
From: Gilles D. <gr...@sc...> - 2004-04-22 03:33:47
|
According to Joe R. Jah: > On Wed, 21 Apr 2004, Lachlan Andrew wrote: > > Date: Wed, 21 Apr 2004 23:13:27 +1000 > > From: Lachlan Andrew <lh...@us...> > > To: Gilles Detillieux <gr...@sc...>, > Christopher Murtagh <chr...@mc...> > > Cc: htd...@li... > > Subject: [htdig-dev] Re: Performance issue with exclude_urls > > > > Greetings Gilles + all, > > > > Yes, I agree that we need a more "polished" patch for the > > distribution. I still like my intermediate path: If *any* server > > blocks or URL blocks are used, then the user takes the performance > > hit and re-parses each time. If *no* server/URL blocks are used, we > > use Chris's patch. This should be just as fast as Chris's patch (in > > the "3.1-compatibly mode" without server/URL blocks), and just as > > flexible as the current status (if blocks are used). If that can get > > ht://Dig fast enough to get into sarge, then I suggest we implement > > it first, and then work on Gilles's more complete solution at more > > leisure. > > I applied Chris' patch and ran htdig on the same site as before for > profile; htdig ran ~40% faster than last time;) Here is the profile: > > ftp://ftp.ccsf.org/htdig-patches/3.2.0b5/htdig.gmon.exclude_perform.gz > > > A first hack at this (not even compile-tested) is attached, patched > > relative to Chris's patched version, so you can see what I mean. If > > people are in favour, I'll try to work on it over the weekend. > > The "slightly-better.0" patch applies, but it does not compile: > > Retriever.cc: In method `int Retriever::IsValidURL(const String &)': > Retriever.cc:998: `config_server_URL_blocks' undeclared (first use this function) > Retriever.cc:998: (Each undeclared identifier is reported only once > Retriever.cc:998: for each function it appears in.) > gmake[1]: *** [Retriever.o] Error 1 The patch declares config_server_URL_blocks in conf_parser.h, but not in any header file that Retriever.cc includes. Try either adding an include of conf_parser.h to Retriever.cc, or copy the declaration of config_server_URL_blocks from conf_parser.h to Retriever.cc. -- Gilles R. Detillieux E-mail: <gr...@sc...> Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/ Dept. Physiology, U. of Manitoba Winnipeg, MB R3E 3J7 (Canada) |
From: Christopher M. <chr...@mc...> - 2004-04-21 15:14:39
|
On Tue, 2004-04-20 at 17:45, Gilles Detillieux wrote: > Hi, Chris and other developers. The problem with this fix is that > exclude_urls and bad_querystr can no longer be used in server blocks or > URL blocks, as they'll only be parsed once regardless of how they're used. > That's OK if you don't use them in blocks, but for the distributed code, > we need to find a more generalized solution. Right. Having just found the block documentation, I can indeed see this as a useful feature, and probably something that I would use if the performance hit wasn't so bad. One thing I could think of that could help performance quite considerably is to have an array of HtRegexList pointers that could contain the parsed exclude/bad-querystr lists, etc., per block. Or perhaps a struct that contains all parsed config attributes per block, with an array of pointers to it. This way the config could be loaded and still only need to be parsed once. The only downside I could see is that this would mean htdig would have a slightly larger memory footprint, but I don't really see that as a big problem. We're probably talking about a couple of KB more; by today's standards, even a couple of MB more wouldn't be a big deal. > 3) We may also need to determine if the repeated calls to config->Find() > at each URL are having an impact on performance as well. E.g. what is > the performance cost of doing thousands of calls like this one? > > tmpList.Create(config->Find(&aUrl, "exclude_urls"), " \t"); Easy thing to test. I'll give it a try later this week if I can, perhaps tomorrow, and report back. Cheers, Chris -- Christopher Murtagh Enterprise Systems Administrator ISR / Web Communications Group McGill University Montreal, Quebec Canada Tel.: (514) 398-3122 Fax: (514) 398-2017 |
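
A rough sketch of the per-block caching Chris suggests, keyed by the raw attribute value so each distinct exclude_urls or bad_querystr string is compiled at most once per dig. Everything here is invented for illustration; in particular, CompiledPatterns is a stand-in that only does substring matching, where the real code would build an HtRegexList:

    #include <cstddef>
    #include <map>
    #include <sstream>
    #include <string>
    #include <vector>

    // Stand-in for a compiled HtRegexList.
    struct CompiledPatterns
    {
        std::vector<std::string> patterns;
        void Compile(const std::string &rawValue)
        {
            std::istringstream in(rawValue);
            std::string p;
            while (in >> p)
                patterns.push_back(p);       // real code would regcomp() here
        }
        bool Match(const std::string &url) const
        {
            for (size_t i = 0; i < patterns.size(); ++i)
                if (url.find(patterns[i]) != std::string::npos)
                    return true;
            return false;
        }
    };

    // One compiled list per distinct attribute value, shared by all URLs that
    // resolve to the same server/URL block.
    class PerBlockRegexCache
    {
    public:
        const CompiledPatterns &Lookup(const std::string &rawValue)
        {
            std::map<std::string, CompiledPatterns>::iterator it = cache.find(rawValue);
            if (it == cache.end())
            {
                it = cache.insert(std::make_pair(rawValue, CompiledPatterns())).first;
                it->second.Compile(rawValue);
            }
            return it->second;
        }
    private:
        std::map<std::string, CompiledPatterns> cache;
    };

The memory cost is one compiled list per distinct value, which is in line with the "couple of KB" estimate above for typical configurations.
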
From: Gilles D. <gr...@sc...> - 2004-04-22 03:25:22
|
According to Christopher Murtagh: > On Tue, 2004-04-20 at 17:45, Gilles Detillieux wrote: > > Hi, Chris and other developers. The problem with this fix is that > > exclude_urls and bad_querystr can no longer be used in server blocks or > > URL blocks, as they'll only be parsed once regardless of how they're used. > > That's OK if you don't use them in blocks, but for the distributed code, > > we need to find a more generalized solution. > > Right. Having just found the block documentation, I can indeed see this > as a useful feature, and probably something that I would use if the > performance hit wasn't so bad. > > One thing I could think of that could help performance quite > considerably is to have an array of type *HtRegexList that could contain > the parsed excludes list/badquery lists, etc. per block. Or perhaps a > struct that contains all parsed config attributes per block and have an > array of pointers to it. This way the config could be loaded and still > only need to be parsed once. The only downside I could see is that this > would mean htdig would have a slightly larger memory footprint, but I > don't really see that as a big problem. We're probably talking about a > couple k more, by today's standards, even a couple meg more wouldn't be > a big deal. There's an idea worth considering. It's quite a bit more complicated than the quick fix I had in mind, but probably much simpler than a full-blown caching scheme. It would also help out the case where regex-based attributes are used in URL or server blocks, which my proposed fix would only marginally help. > > 3) We may also need to determine if the repeated calls to config->Find() > > at each URL are having an impact on performance as well. E.g. what is > > the performance cost of doing thousands of calls like this one? > > > > tmpList.Create(config->Find(&aUrl, "exclude_urls"), " \t"); > > Easy thing to test. I'll give it a try later this week if I can, > perhaps tomorrow, and report back. Great. I'll try to get my fix to Regex.cc in by the end of the week too, so it would be great if you could give it a whirl. It would probably mean having to back out your own patch, though, or it wouldn't really get tested. Neal, I'd still like your opinion on the matter of making these HtRegexList variables global, and whether that will be a problem for libhtdig. Looking at the code, I see that "limits" and "limitsn", set by limit_urls_to and limit_normalized, are already global. But these are defined in htdig.cc, rather than Retriever.cc. Does this matter? I imagine it just means making parallel changes to libhtdig_htdig.cc, but right now it doesn't even seem to be making use of URL blocks, as it doesn't pass aUrl to HtConfiguration::Find(). Is this an oversight, or am I missing something? -- Gilles R. Detillieux E-mail: <gr...@sc...> Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/ Dept. Physiology, U. of Manitoba Winnipeg, MB R3E 3J7 (Canada) |
From: Gilles D. <gr...@sc...> - 2004-04-22 05:05:39
|
Earlier I wrote: > According to Christopher Murtagh: > > Easy thing to test. I'll give it a try later this week if I can, > > perhaps tomorrow, and report back. > > Great. I'll try to get my fix to Regex.cc in by the end of the week too, > so it would be great if you could give it a whirl. It would probably > mean having to back out your own patch, though, or it wouldn't really > get tested. OK, the simple fix I had in mind was indeed very simple to implement, but unfortunately ineffective. The problem is the way HtRegexList uses (or abuses) the HtRegex class. It repeatedly calls HtRegex::set() with an increasingly complex pattern, until it fails, to see how big a pattern it can build before breaking it up. Of course, this completely defeats any attempt to fix the problem at the level of HtRegex. Here's the result of a trace print I added to HtRegex::set() which shows how htdig deals with these two simple patterns: exclude_urls: /cgi-bin/ .cgi bad_querystr: C=D C=M C=N C=S O=A O=D For every URL htdig encountered within every document it parsed, it spat out the following: compiling pattern: /cgi-bin/ compiling pattern: /cgi-bin/|\.cgi compiling pattern: C=D compiling pattern: C=D|C=M compiling pattern: C=D|C=M|C=N compiling pattern: C=D|C=M|C=N|C=S compiling pattern: C=D|C=M|C=N|C=S|O=A compiling pattern: C=D|C=M|C=N|C=S|O=A|O=D You can see how this would rapidly degenerate with lots of documents containing lots of links, and with more complex patterns in those two attributes! No wonder this was such a big problem. I'll try again tomorrow with a similar fix in HtRegexList::setEscaped() to see how much that helps matters. It's a bit more complicated there because we're dealing with lists instead of strings, but it shouldn't be too nasty. On the other hand, if these two attributes are the only ones that pose a problem, maybe we should just deal with them in Retriever.cc and be done with it! Rather than using a flag, as Chris and Lachlan's patches do, I was thinking of saving the string returned by config->Find(&aUrl, "exclude_urls") and comparing it to the string we had the last time through. -- Gilles R. Detillieux E-mail: <gr...@sc...> Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/ Dept. Physiology, U. of Manitoba Winnipeg, MB R3E 3J7 (Canada) |
From: Lachlan A. <lh...@us...> - 2004-04-22 11:52:36
|
Greetings Gilles, Your patch sounds nice -- totally contained within Retriever.cc and yet still working (and providing some benefit) with server blocks. If the overhead of a Find is significant, we could still disable even that test when no server blocks are used. We could also eliminate the generation of the dictionaries of valid and invalid extensions... I believe in the "software RISC" principle -- the fastest, least buggy and most memory efficient function calls are those that aren't there. Anyway, Gilles's elegant solution will certainly get rid of almost all of the problem. Cheers, Lachlan On Thu, 22 Apr 2004 02:55 pm, Gilles Detillieux wrote: > On the other hand, if these two attributes are the only ones that > pose a problem, maybe we should just deal with them in Retriever.cc > and be done with it! Rather than using a flag, as Chris and > Lachlan's patches do, I was thinking of saving the string returned > by > > config->Find(&aUrl, "exclude_urls") > > and comparing it to the string we had the last time through. -- lh...@us... ht://Dig developer DownUnder (http://www.htdig.org) |
From: Christopher M. <chr...@mc...> - 2004-04-22 20:25:51
|
On Wed, 2004-04-21 at 23:21, Gilles Detillieux wrote: > > > 3) We may also need to determine if the repeated calls to config->Find() > > > at each URL are having an impact on performance as well. E.g. what is > > > the performance cost of doing thousands of calls like this one? > > > > > > tmpList.Create(config->Find(&aUrl, "exclude_urls"), " \t"); > > > > Easy thing to test. I'll give it a try later this week if I can, > > perhaps tomorrow, and report back. > > Great. I'll try to get my fix to Regex.cc in by the end of the week too, > so it would be great if you could give it a whirl. It would probably > mean having to back out your own patch, though, or it wouldn't really > get tested. Here's what I did. I simply added a new variable called 'fooList' of type StringList. Then I inserted extra Create() method calls: //these lines are simply to test the CPU load of the Create() method fooList.Create(config->Find(&aUrl, "exclude_urls"), " \t"); fooList.Destroy(); //these lines are simply to test the CPU load of the Create() method fooList.Destroy(); fooList.Create(config->Find(&aUrl, "bad_querystr"), " \t"); fooList.Destroy(); Then I re-compiled and ran with my normal excludes URL list. It didn't seem to have much of an impact on performance. This means that the performance hit is definitely in the HtRegexList::setEscaped method. Cheers, Chris -- Christopher Murtagh Enterprise Systems Administrator ISR / Web Communications Group McGill University Montreal, Quebec Canada Tel.: (514) 398-3122 Fax: (514) 398-2017 |
From: Gilles D. <gr...@sc...> - 2004-04-22 21:12:16
|
According to Christopher Murtagh: > On Wed, 2004-04-21 at 23:21, Gilles Detillieux wrote: > > > > 3) We may also need to determine if the repeated calls to config->Find() > > > > at each URL are having an impact on performance as well. E.g. what is > > > > the performance cost of doing thousands of calls like this one? > > > > > > > > tmpList.Create(config->Find(&aUrl, "exclude_urls"), " \t"); > > > > > > Easy thing to test. I'll give it a try later this week if I can, > > > perhaps tomorrow, and report back. > > > > Great. I'll try to get my fix to Regex.cc in by the end of the week too, > > so it would be great if you could give it a whirl. It would probably > > mean having to back out your own patch, though, or it wouldn't really > > get tested. > > Here's what I did. I simply added a new variable called 'fooList' of > type StringList. Then I inserted extra Create() method calls: > > > //these lines are simply to test the CPU load of the Create() method > fooList.Create(config->Find(&aUrl, "exclude_urls"), " \t"); > fooList.Destroy(); > > //these lines are simply to test the CPU load of the Create() method > fooList.Destroy(); > fooList.Create(config->Find(&aUrl, "bad_querystr"), " \t"); > fooList.Destroy(); > > Then I re-compiled and ran with my normal excludes URL list. It didn't > seem to have much of an impact on performance. This means that the > performance hit is definitely in the HtRegexList::setEscaped method. Thanks, Chris. That's good to know! Maybe then we don't need to use Lachlan's patch to the config parser to track whether server or URL blocks were defined. I'll try a quick fix to Retriever.cc, but I'll also try to find if there are other uses of HtRegexList that may need attention. -- Gilles R. Detillieux E-mail: <gr...@sc...> Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/ Dept. Physiology, U. of Manitoba Winnipeg, MB R3E 3J7 (Canada) |
From: Gilles D. <gr...@sc...> - 2004-04-22 22:17:06
|
According to me: > According to Christopher Murtagh: > > Then I re-compiled and ran with my normal excludes URL list. It didn't > > seem to have much of an impact on performance. This means that the > > performance hit is definitely in the HtRegexList::setEscaped method. > > Thanks, Chris. That's good to know! Maybe then we don't need to use > Lachlan's patch to the config parser to track whether server or URL > blocks were defined. I'll try a quick fix to Retriever.cc, but I'll > also try to find if there are other uses of HtRegexList that may need > attention. Well, there are some uses of it in htsearch related code that I'm not all that sure about, but let's just cross that bridge when we get to it. In any case, the optimizations tend to need to be done right where the HtRegexList object is created anyway, rather than buried in the class definition, because you need to make sure the object sticks around or any optimization will have no effect. So, here's my simple stab at fixing Retriever.cc, with no other files needing patches. It should be used instead of Chris's and Lachlan's patches from earlier this week, not put on top of them. I've built this patch over the current CVS code for Retriever.cc (as of Apr 7), but it does seem to apply to a vanilla 3.2.0b5 source as well. Please test this out and make sure it doesn't cause any problems, and that it helps! Apply using "patch -p0 << this-message-file". --- htdig/Retriever.cc.orig 2004-04-07 17:02:00.000000000 -0500 +++ htdig/Retriever.cc 2004-04-22 16:45:52.000000000 -0500 @@ -995,10 +995,21 @@ int Retriever::IsValidURL(const String & // If the URL contains any of the patterns in the exclude list, // mark it as invalid // - tmpList.Create(config->Find(&aUrl, "exclude_urls"), " \t"); - HtRegexList excludes; - excludes.setEscaped(tmpList, config->Boolean("case_sensitive")); - if (excludes.match(url, 0, 0) != 0) + String exclude_urls = config->Find(&aUrl, "exclude_urls"); + static String *prevexcludes = 0; + static HtRegexList *excludes = 0; + if (!excludes || !prevexcludes || prevexcludes->compare(exclude_urls) != 0) + { + if (!excludes) + excludes = new HtRegexList; + if (prevexcludes) + delete prevexcludes; + prevexcludes = new String(exclude_urls); + tmpList.Create(exclude_urls, " \t"); + excludes->setEscaped(tmpList, config->Boolean("case_sensitive")); + tmpList.Destroy(); + } + if (excludes->match(url, 0, 0) != 0) { if (debug > 2) cout << endl << " Rejected: item in exclude list "; @@ -1009,12 +1020,22 @@ int Retriever::IsValidURL(const String & // If the URL has a query string and it is in the bad query list // mark it as invalid // - tmpList.Destroy(); - tmpList.Create(config->Find(&aUrl, "bad_querystr"), " \t"); - HtRegexList badquerystr; - badquerystr.setEscaped(tmpList, config->Boolean("case_sensitive")); + String bad_querystr = config->Find(&aUrl, "bad_querystr"); + static String *prevbadquerystr = 0; + static HtRegexList *badquerystr = 0; + if (!badquerystr || !prevbadquerystr || prevbadquerystr->compare(bad_querystr) != 0) + { + if (!badquerystr) + badquerystr = new HtRegexList; + if (prevbadquerystr) + delete prevbadquerystr; + prevbadquerystr = new String(bad_querystr); + tmpList.Create(bad_querystr, " \t"); + badquerystr->setEscaped(tmpList, config->Boolean("case_sensitive")); + tmpList.Destroy(); + } char *ext = strrchr((char *) url, '?'); - if (ext && badquerystr.match(ext, 0, 0) != 0) + if (ext && badquerystr->match(ext, 0, 0) != 0) { if (debug > 2) cout << endl << " Rejected: item in bad query list "; -- Gilles R. 
Detillieux E-mail: <gr...@sc...> Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/ Dept. Physiology, U. of Manitoba Winnipeg, MB R3E 3J7 (Canada) |