#86 Speed up the building of the inverted index

open
nobody
None
5
2014-04-21
2014-04-20
Elad Lahav
No

The attached patch provides two changes that together significantly improve the performance of building the inverted index:
1. Built-in external-sort implementation for the postings file (instead of using the system's sort command).
2. Replace a linear search with a binary one for a file index in invmake().

On my machine, these changes reduce the time of building a new database for the Linux kernel (~35K files) from over 52 seconds to 33 (with a warm cache).

1 Attachments

Related

Patches: #86

Discussion

  • dima
    dima
    2014-04-21

    I ran some cscope timing tests earlier, so I ran this patch through the evaluation script. Results (all timings in seconds):

    **** Cold disk cache
    
    |                                              |   Stock | Patched |
    |----------------------------------------------+---------+---------|
    | Initial database build                       | 123.572 | 95.5225 |
    | Database re-build after touching a file      | 57.2912 |   30.91 |
    | Initial search                               | 9.11125 |   8.415 |
    | Re-search after touching a file              | 59.6287 |   31.92 |
    | Initial no-db-update search                  | 0.80625 |  1.2075 |
    | No-db-update re-search after touching a file |   0.805 |    0.95 |
    
    **** Warm disk cache
    
    |                                              |   Stock | Patched |
    |----------------------------------------------+---------+---------|
    | Initial database build                       | 55.3537 | 29.5287 |
    | Database re-build after touching a file      | 45.4975 |  18.805 |
    | Initial search                               | 0.12125 |    0.12 |
    | Re-search after touching a file              |  45.985 | 19.0437 |
    | Initial no-db-update search                  |       0 |       0 |
    | No-db-update re-search after touching a file |       0 | 0.00125 |
    

    A bit more detail and the script used to run the tests is here: http://notes.secretsauce.net/notes/2014/04/20_more-cscope-benchmarks.html

    Clearly this produces good performance gains, and hopefully it gets merged (I'm not the maintainer).

     
  • Elad Lahav
    Elad Lahav
    2014-04-21

    Thanks, Dima. I find the slight difference in query times surprising, as the change has no effect on the resulting database, nor does it impact the query path in the code. Can it just be test noise?

    I sent an email to cscope-devel with more information about this patch, but for some reason it seems to not have made it to the mailing list. The list may be moderated, but I do recall using it successfully in the past.

    Here is the information from that email:

    I was made aware that the sorting code I wrote for min-cscope a few years ago (primarily targeted at Windows) was not performing very well on Linux and FreeBSD. I re-wrote the code which now runs faster than GNU sort (at least for the postings list produced by Cscope and compared with the sort version on Ubuntu 12.04). The code is also much more compact and fits nicely within a single source file. Consequently, I would like to propose its inclusion in the main Cscope code. The attached patch provides the new sort.c and sort.h files, the necessary changes to build.c, crossref.c, invlib.c and Makefile.am. It also replaces a linear search within invmake() with a binary search, which gives an additional performance boost.

    Performance results:
    Ran a clean Cscope build on the Linux 3.13.7 tree (~35K files) with a warm cache (i.e., ran a clean build first and discarded the results).

    Vanilla Cscope:
    elahav@elahav-mbp:~/src/linux-3.13.7$ time /usr/bin/cscope -b -k -q

    real 0m52.349s
    user 1m13.193s
    sys 0m2.680s

    Modified Cscope:
    elahav@elahav-mbp:~/src/linux-3.13.7$ time ~/tmp/cscope/src/cscope -b -k -q

    real 0m33.800s
    user 0m30.874s
    sys 0m1.224s

    Verified that the generated DB is sane by running a query for all references to '.*kernel.*' (there are 7647 of those) and made sure the vanilla and modified versions produced the same results.

    Note: after applying the patch, you need to run autoreconf to update the various build configuration files.

    --Elad

    P.S.,
    There are annoying inconsistencies in indentation throughout the code, with a mixture of tabs and spaces, even within single files. Should probably be cleaned-up at some point.

     
    • dima
      dima
      2014-04-21

      I think it's real. Unless you pass -d, cscope updates the database on
      each query, so presumably that's what we're seeing.

       
      • Unless you pass -d, cscope updates the database on each query,

        No. It only updates once per start of the program.

         
        • dima
          dima
          2014-04-22

          You're right about that detail. The benchmarks in question start a new cscope process with each query, so in this case a database update does happen with each query, and we should see that affect the performance.

           
  • Elad Lahav
    Elad Lahav
    2014-04-21

    I neglected to notice that Cscope is using a BSD license. Attaching a new patch that uses a two-clause BSD license for the new files (instead of GPLv3).

     
  • Neil Horman
    Neil Horman
    2014-04-21

    A few early comments:

    • ************/
      +
      +#define _GNU_SOURCE
      +#include <stdio.h>

    Don't define _GNU_SOURCE like that, please, modify the config script so that the compilation command includes it properly for any files that need it.

    +#ifndef PAGE_SIZE_ORDER
    +#define PAGE_SIZE_ORDER 22 /* log2 of page size. /
    +#endif

    Um, most systems have a 4k page size, which would be 12 here, not 22. Unless you mean something other than pages here, in which case please rename this to something that doesn't as easily get convoluted with the notion of memory pages.

    +#define DIRECTORY_LEN_BITS STRING_LEN_ORDER

    Why call your other macros ORDER, but this one BITS?

    +#define PAGE_SIZE (1 << PAGE_SIZE_ORDER)
    You can get this dynamically from the getpagesize() function, unless again you're not actually referring to memory pages, in which case, please rename (see sys/user.h)

    ostensibly these are configurable by system developers, and are set in response to the amount of RAM available on a given system, and in relation to just how fast you want the sort to be? If so I would suggest making these config knobs represent maximum allowable sane values, and then just code the patch so that cscope makes a reasonable decision for the memory sizing at run time based on the system its running on. Cscope runs on a huge variety of large and small systems, so building for just one doesn't make too much sense in my mind.

     
    • Elad Lahav
      Elad Lahav
      2014-04-21

      Don't define _GNU_SOURCE like that, please, modify the config script so that the compilation command includes it properly for any files that need it.
      This was for the qsort_r definition. I can drop it, as the code does not pretend to be re-entrant anyway, so might as well just use qsort().

      Um, most systems have a 4k page size, which would be 12 here, not 22.
      A "page" here is not necessarily a virtual memory-manager's page. It is a collection of strings with a header and directory. It made the most sense to use a VMM page for this purpose, so I started with 4K, but 4M yielded better results. If you find the term "page" inappropriate then I can probably come up with someting else, though I think it is legitimate to use it here.

      Why call your other macros ORDER, but this one BITS?
      ORDER refers to a log base-2 value of a size. BITS is the number of bits within a single word used for a sub-field, in this case the length value. There is no requirement that this value be directly related to the ORDER value, it just happens to be the case here.

      ostensibly these are configurable by system developers, and are set in response to the amount of RAM available on a given system
      Not really, unless you are targeting extremely limited systems, that are unable to support a virtual, non-backed, allocation of 256MB. The only potential downside of a 4M page is thrashing, but, again, you would need to have a really limited system to encounter a problem. I can't think of any existing system, even in the embedded space, with such a limit.

      --Elad

       
      Last edit: Elad Lahav 2014-04-21
      • Neil Horman
        Neil Horman
        2014-04-22

        On Mon, Apr 21, 2014 at 06:27:02PM +0000, Elad Lahav wrote:

        Don't define _GNU_SOURCE like that, please, modify the config script so that the compilation command includes it properly for any files that need it.
        This was for the qsort_r definition. I can drop it, as the code does not pretend to be re-entrant anyway, so might as well just use qsort().

        That sounds good to me too.

        Um, most systems have a 4k page size, which would be 12 here, not 22.
        A "page" here is not necessarily a virtual memory-manager's page. It is a collection of strings with a header and directory. It made the most sense to use a VMM page for this purpose, so I started with 4K, but 4M yielded better results. If you find the term "page" inappropriate then I can probably come up with someting else, though I think it is legitimate to use it here.

        I would still change it, pages, particularly macros that use PAGE_* are common
        in sys/user.h, and so can be quite confusing. I would suggest the BLOCK or SEG
        prefix.

        Why call your other macros ORDER, but this one BITS?
        ORDER refers to a log base-2 value of a size. BITS is the number of bits within a single word used for a sub-field, in this case the length value. There is no requirement that this value be directly related to the ORDER value, it just happens to be the case here.

        Ah, ok

        ostensibly these are configurable by system developers, and are set in response to the amount of RAM available on a given system
        Not really, unless you are targeting extremely limited systems, that are unable to support a virtual, non-backed, allocation of 256MB. The only potential downside of a 4M page is thrashing, but, again, you would need to have a really limited system to encounter a problem. I can't think of any existing system, even in the embedded space, with such a limit.

        Well, ok, but that then begs the question, is there value to making it
        configurable? Are there performance gains to be had on systems with larger
        amounts of memory? If so, again, making this sizing dynamic in its compuation
        seems adventageous.
        Neil

        --Elad


        [patches:#86] Speed up the building of the inverted index

        Status: open
        Group:
        Created: Sun Apr 20, 2014 11:03 PM UTC by Elad Lahav
        Last Updated: Mon Apr 21, 2014 05:51 PM UTC
        Owner: nobody

        The attached patch provides two changes that together significantly improve the performance of building the inverted index:
        1. Built-in external-sort implementation for the postings file (instead of using the system's sort command).
        2. Replace a linear search with a binary one for a file index in invmake().

        On my machine, these changes reduce the time of building a new database for the Linux kernel (~35K files) from over 52 seconds to 33 (with a warm cache).


        Sent from sourceforge.net because you indicated interest in https://sourceforge.net/p/cscope/patches/86/

        To unsubscribe from further messages, please visit https://sourceforge.net/auth/subscriptions/

         

        Related

        Patches: #86

        • Elad Lahav
          Elad Lahav
          2014-04-22

          I gave some thought to the "page" nomenclature, and think that it is appropriate, given that it is a well-established terminology for extrenal sorting (pages are internally sorted and then merged into runs). That said, I agree that using symbols such as PAGE_SIZE is unfortunate. I therefore changed all 'page*' symbols to 'sort_page*', to avoid such ambiguity.

          Also, I tested the use of a run-time defined page size, and see no significant performance impact. The only question left is whether there is a reasonable, platform-agnostic way to come up with a good number.

          The attached patch incorporates these changes, as well as drops the use of qsort_r and _GNU_SOURCE.