The attached patch provides two changes that together significantly improve the performance of building the inverted index:
1. Built-in external-sort implementation for the postings file (instead of using the system's sort command).
2. Replace a linear search with a binary one for a file index in invmake().
On my machine, these changes reduce the time of building a new database for the Linux kernel (~35K files) from over 52 seconds to 33 (with a warm cache).
I ran some cscope timing tests earlier, so I ran this patch through the evaluation script. Results (all timings in seconds):
A bit more detail and the script used to run the tests is here: http://notes.secretsauce.net/notes/2014/04/20_more-cscope-benchmarks.html
Clearly this produces good performance gains, and hopefully it gets merged (I'm not the maintainer).
Thanks, Dima. I find the slight difference in query times surprising, as the change has no effect on the resulting database, nor does it impact the query path in the code. Can it just be test noise?
I sent an email to cscope-devel with more information about this patch, but for some reason it seems to not have made it to the mailing list. The list may be moderated, but I do recall using it successfully in the past.
Here is the information from that email:
I was made aware that the sorting code I wrote for min-cscope a few years ago (primarily targeted at Windows) was not performing very well on Linux and FreeBSD. I re-wrote the code which now runs faster than GNU sort (at least for the postings list produced by Cscope and compared with the sort version on Ubuntu 12.04). The code is also much more compact and fits nicely within a single source file. Consequently, I would like to propose its inclusion in the main Cscope code. The attached patch provides the new sort.c and sort.h files, the necessary changes to build.c, crossref.c, invlib.c and Makefile.am. It also replaces a linear search within invmake() with a binary search, which gives an additional performance boost.
Performance results:
Ran a clean Cscope build on the Linux 3.13.7 tree (~35K files) with a warm cache (i.e., ran a clean build first and discarded the results).
Vanilla Cscope:
elahav@elahav-mbp:~/src/linux-3.13.7$ time /usr/bin/cscope -b -k -q
real 0m52.349s
user 1m13.193s
sys 0m2.680s
Modified Cscope:
elahav@elahav-mbp:~/src/linux-3.13.7$ time ~/tmp/cscope/src/cscope -b -k -q
real 0m33.800s
user 0m30.874s
sys 0m1.224s
Verified that the generated DB is sane by running a query for all references to '.*kernel.*' (there are 7647 of those) and made sure the vanilla and modified versions produced the same results.
Note: after applying the patch, you need to run autoreconf to update the various build configuration files.
--Elad
P.S.,
There are annoying inconsistencies in indentation throughout the code, with a mixture of tabs and spaces, even within single files. Should probably be cleaned-up at some point.
I think it's real. Unless you pass -d, cscope updates the database on
each query, so presumably that's what we're seeing.
No. It only updates once per start of the program.
You're right about that detail. The benchmarks in question start a new cscope process with each query, so in this case a database update does happen with each query, and we should see that affect the performance.
I neglected to notice that Cscope is using a BSD license. Attaching a new patch that uses a two-clause BSD license for the new files (instead of GPLv3).
A few early comments:
+
+#define _GNU_SOURCE
+#include <stdio.h>
Don't define _GNU_SOURCE like that, please, modify the config script so that the compilation command includes it properly for any files that need it.
+#ifndef PAGE_SIZE_ORDER
+#define PAGE_SIZE_ORDER 22 /* log2 of page size. /
+#endif
Um, most systems have a 4k page size, which would be 12 here, not 22. Unless you mean something other than pages here, in which case please rename this to something that doesn't as easily get convoluted with the notion of memory pages.
+#define DIRECTORY_LEN_BITS STRING_LEN_ORDER
Why call your other macros ORDER, but this one BITS?
+#define PAGE_SIZE (1 << PAGE_SIZE_ORDER)
You can get this dynamically from the getpagesize() function, unless again you're not actually referring to memory pages, in which case, please rename (see sys/user.h)
ostensibly these are configurable by system developers, and are set in response to the amount of RAM available on a given system, and in relation to just how fast you want the sort to be? If so I would suggest making these config knobs represent maximum allowable sane values, and then just code the patch so that cscope makes a reasonable decision for the memory sizing at run time based on the system its running on. Cscope runs on a huge variety of large and small systems, so building for just one doesn't make too much sense in my mind.
--Elad
Last edit: Elad Lahav 2014-04-21
On Mon, Apr 21, 2014 at 06:27:02PM +0000, Elad Lahav wrote:
Related
Patches: #86
I gave some thought to the "page" nomenclature, and think that it is appropriate, given that it is a well-established terminology for extrenal sorting (pages are internally sorted and then merged into runs). That said, I agree that using symbols such as PAGE_SIZE is unfortunate. I therefore changed all 'page*' symbols to 'sort_page*', to avoid such ambiguity.
Also, I tested the use of a run-time defined page size, and see no significant performance impact. The only question left is whether there is a reasonable, platform-agnostic way to come up with a good number.
The attached patch incorporates these changes, as well as drops the use of qsort_r and _GNU_SOURCE.