#34 Add ignorefiles and extend ignoredirs

Status: closed
Owner: Andre-Littoz
Labels: General (16)
Priority: 5
Updated: 2012-11-15
Created: 2012-10-20
Creator: Lukasz M
Private: No

It would be nice to have an ignorefiles option for files, just like ignoredirs for directories. I have added this to my LXR and it's just 3 lines of code.
Would it also be possible for the ignoredirs option to handle regexps? I would like to exclude the /include dir from indexing (as header files are also within libs).

Discussion

  • Andre-Littoz
    2012-10-20

    Well, this is a new feature which could be included in the next release.

    1/ regexp for 'ignoredirs'
    I had a quick look at the code sections related to 'ignoredirs'. The change seems to involve the 'getdir' sub in the various Files/ handlers.

    2/ files exclusion
    I wonder if it is not already there. Look at the end of 'source' script (lines 401-406 in release 1.0). There is an undocumented call to lxr.conf's parameter 'filter'. It looks like it should be a regexp SELECTING (not excluding) which file or directory is displayed. This has been lurking in 'source' for ages and I really never succeeded in setting it up correctly. The main difficulty for the regexp is to be valid both for directories (otherwise they can't be listed) and for wanted files. All failures end up with 'fil does not exist' (which is what I always got!). You might experiment with it.

    Note that this does not exclude files from indexing, meaning you have no speed improvement in genxref.

    I suppose your 3-line solution is something equivalent to line 242 (release 1.0) which excludes *.o, *.a, core files (and also the index files of the initial LXR implementation). Can you send your patch?

    Best regards
    ajl

     
  • Lukasz M
    2012-10-21

    Actually, I do not mind displaying the file/directory. I would rather it not be indexed.
    For ignoring files (from indexing) I just modified the following:

    LXR/Files/Plain.pm

    # Check directories to ignore
    if (-d $dir . $node) {
        foreach my $ignoredir (@{$config->{'ignoredirs'}}) {
            next FILE if $node eq $ignoredir;
        }
        # Directory to keep: suffix name with a slash
        push(@dirs, $node . '/');
    } else {
    -->     foreach my $ignorefile ($config->ignorefiles) {
    -->         next FILE if $node eq $ignorefile;
    -->     }
        # File: don't change the name
        push(@files, $node);
    }

     
  • Andre-Littoz
    2012-10-21

    I experimented with 'filter' and finally got it right. To include only Perl files for instance, add in lxr.conf:

    , 'filter' => '(\\/$|\\.pm$)'

    The first alternative keeps directories (they have a canonical trailing slash, as fixed in LXR::Common::httpinit); the second keeps only .pm files.
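
To see what this pattern lets through, here is a small standalone check. It is not LXR code: the entry names are made up for illustration, and the pattern is simply applied to each name the way a selecting filter would be.

```perl
use strict;
use warnings;

# Same string as in lxr.conf. Inside single quotes, '\\' yields one
# backslash, so the effective pattern is (\/$|\.pm$).
my $filter = '(\\/$|\\.pm$)';

# Hypothetical directory entries; directories carry a trailing slash.
my @entries = ('lib/', 'Common.pm', 'genxref', 'README.txt', 'Files/');

# Keep only the entries selected by the filter (an INCLUDE rule).
my @kept = grep { $_ =~ /$filter/ } @entries;

print join(' ', @kept), "\n";   # prints: lib/ Common.pm Files/
```

Dropping the first alternative would also drop 'lib/' and 'Files/', which matches the difficulty described earlier: directories must pass the filter or they cannot be listed.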

    I admit that this INCLUDE rule is probably less flexible than an EXCLUDE rule. Also, it does not prevent genxref from indexing. I'll add an 'ignorefiles' parameter for both genxref and source.

    Could you better explain your "exclude /include dir from indexing (as header files are also within libs)"? Do you mean there is a link resulting in duplicate files: one set accessed through /include and another accessed through /libs? I'll see if the "already indexed" feature can cope with this. Otherwise, add one of the sets to 'ignoredirs'.

     
  • Lukasz M
    2012-10-21

    Exactly, I have duplicate files in the /include folder. Because of that, every search returns duplicate results. I also cannot add this folder to ignoredirs because I have some other include dirs in some libs. So really I would like to ignore only the /include folder and not /somelib/include.

     
  • Andre-Littoz
    2012-10-24

    Mmmh! Your "specification" is hard to twist into the present implementation. It was designed to be rather efficient: 'ignoredirs' is taken into consideration when function getdir() is invoked to enumerate the content of a directory. 'ignoredirs' subdirectories are filtered here. This is also where 'ignorefiles' could be filtered. But, only this very "local" path element is compared, not the whole absolute path.
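
The limitation can be sketched as follows. This is only an illustration of last-segment comparison, not the actual getdir() code; the helper name, the sample paths, and the 'ignoredirs' value are all assumptions.

```perl
use strict;
use warnings;

# Hypothetical configuration value, for illustration only.
my @ignoredirs = ('include');

# Return 1 when the last path segment is listed in @ignoredirs.
# Only the local directory name is compared, never the whole path.
sub ignored_by_last_segment {
    my ($path) = @_;
    (my $trimmed = $path) =~ s{/+$}{};       # drop trailing slash(es)
    my ($node) = $trimmed =~ m{([^/]+)$};    # isolate the last segment
    return 0 unless defined $node;
    my $hits = grep { $node eq $_ } @ignoredirs;
    return $hits ? 1 : 0;
}

for my $path ('/include/', '/somelib/include/', '/src/') {
    printf "%-18s %s\n", $path,
        ignored_by_last_segment($path) ? 'skipped' : 'kept';
}
```

Both /include/ and /somelib/include/ come out as skipped, because the two paths share the same last segment. That is exactly the case the request wants to tell apart.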

    This is very good for large projects such as the Linux kernel (~37 000 files and hundreds, maybe thousands, of directories). I want to keep performance on such projects.

    'ignoredirs' is also scanned in the toreal() function with pattern matching. This is compatible with a longer path fragment (i.e. containing path separators). But this function exists only in Plain.pm and CVS.pm, not in GIT.pm or Subversion.pm. Consequently, this is not the place for implementation.

    While I think about an angle of attack, what about the following strategy since your concern is to prevent duplicates from entering into the DB:
    - before genxref step, disable (or remove) the links (ln) causing the duplicates,
    - launch genxref to create the DB without duplicates,
    - recreate the links.

    This could temporarily solve your problem. If there are too many links, you can design a small script so that you only type a short command to do the removal/creation.

     
  • Andre-Littoz
    2012-11-01

    Transferred from "support request" to "feature request"

     
  • Andre-Littoz
    2012-11-01

    • milestone: 122753 -->
    • labels: 398381 -->
     
  • Andre-Littoz
    2012-11-02

    I rearchitected the "storage" backend by factoring out the common 'ignoredirs' and file-filtering processing. They are now located in a single Files.pm method which can be referenced from the specific classes.

    dirs: I can add a new parameter to filter out directories based on the full path instead of the last segment. It would preferably be a regexp, to allow accurate exclusion. However, I fear the performance impact on kernel indexing (more than 38'000 files which would trigger the regexp -- mostly to tell "go ahead").

    What would you suggest for the name of the global directory-excluding parameter?

    files: I replaced the various hard-coded regexps in the storage backends with a call to the new method, which uses the regexp contained in 'ignorefiles'. I also removed the filter in source's direxpand since the regexp already excludes the previously discarded files (and it is more efficient since the removal is done when enumerating the directory).

     
  • Andre-Littoz
    2012-11-02

    • assigned_to: nobody --> ajlittoz
    • labels: --> General
     
  • Lukasz M
    2012-11-05

    Would it be possible to leave 'ignoredirs'? The list could be extended to handle something like r'abc', where abc would be a regexp. If just a name is given, it would work as it did before.

     
  • Andre-Littoz
    2012-11-05

    I was thinking of a new pair of parameters.

    In your specification proposal, you want to be able to filter the full path.

    Presently, 'ignoredirs' and the new 'ignorefiles' are activated in sub getdir() when scanning the "current" directory. It is thus very fast to check only the last segment of the path. I could extend 'ignoredirs' to be a mixed list of strings and regexps (if I can find an efficient Perl way to discriminate between them) but still on the last path segment.
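
One possible way to tell the two kinds of entries apart, offered only as a sketch: if the patterns in the list are written as precompiled qr// objects, Perl's ref() returns 'Regexp' for them and an empty string for plain strings. The helper name and the list contents below are hypothetical, and nothing here is taken from the LXR sources.

```perl
use strict;
use warnings;

# Hypothetical mixed list: plain names and precompiled patterns (qr//).
my @ignoredirs = ('CVS', '.git', qr/^build-\d+$/);

# Return 1 if $node (a last path segment) is excluded by the mixed list.
sub is_ignored {
    my ($node, @rules) = @_;
    for my $rule (@rules) {
        if (ref($rule) eq 'Regexp') {
            return 1 if $node =~ $rule;    # pattern entry
        } else {
            return 1 if $node eq $rule;    # literal entry: cheap string compare
        }
    }
    return 0;
}

for my $node ('CVS', 'build-42', 'build', 'src') {
    printf "%-10s %s\n", $node,
        is_ignored($node, @ignoredirs) ? 'skipped' : 'kept';
}
```

Plain names keep the inexpensive string comparison, so only the entries that really are patterns pay for a regexp match. This relies on the configuration file being evaluated as Perl, so that qr// can appear in it.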

    The new set (or maybe a single parameter; a path is a string after all) would be an indication that full-path filtering is wanted. The reason I'd like to keep both sets separate is that I fear the cost of repeatedly regexp-testing the full path when genxref'ing the kernel (38'000 files and hundreds of directories with an average path length over 60 characters, max. around 110 characters). Presently, my best indexing time on my high-end computer (3.4GHz) is 2 hours 40 minutes on a 3.1 kernel. I had a hard time squeezing it from 3:50 down to 2:40 (this was through DB request restructuring, but directory tree traversal also seems expensive -- I know the worst step is reference collecting because the parser is written in Perl [interpretation not execution!!] with regexp instead of a good LR finite state automaton).

    If the set does not exist, I can quickly skip the test. If it exists, I can launch a "long" test on the full path.

    In the single set solution, I don't see how I can keep the fast last-segment test and switch to the long full-path regexp test.

    On what kind of tree do you need such detailed exclusion control? (number of files/directories, any conventional pattern in names?, mixture of languages, ...) This information could give me leads in better understanding your needs.

    ajl

    PS I've uploaded a beta version of the User Manual with a description of 'ignorefiles'. You can download it through a link in http://lxr.sf.net/en/index.html. Please give me your feedback.

     
  • Andre-Littoz
    2012-11-15

    • status: open --> closed
     
  • Andre-Littoz
    2012-11-15

    Extension implemented as 3 new configuration parameters while retaining the present simple and fast 'ignoredirs'.

    a/ 'ignorefiles' is a regexp against the final path segment (aka. filename). If it matches, the file is skipped.
    b/ 'filterdirs' is an array of regexps against the full path. If one matches, the directory is skipped.
    c/ 'filterfiles' is an array of regexps against the full path. If one matches, the file is skipped.

    These exclusion rules are tested inside the getdir() function. This function is a method inside the storage engines (Files/*.pm). It provides a list of a directory's content, considering sub-directories and files separately. The order of rule application is 'ignore***' first, then 'filter***' if the first rule did not exclude the candidate directory/file.
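
The order of application described above can be sketched as follows. This is a guess at the logic, not the code shipped in Files.pm: the %config layout, the keep_entry() helper, the sample patterns, and the exact form of the full path (parent path plus entry name, no trailing slash) are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical configuration, shaped after the description above.
my %config = (
    ignoredirs  => ['CVS', '.git'],        # exact names, last segment
    ignorefiles => '\\.(o|a)$|^core$',     # one regexp, last segment
    filterdirs  => ['^/include$'],         # regexps, full path
    filterfiles => ['^/doc/.*\\.bak$'],    # regexps, full path
);

# Decide whether a directory entry is kept (1) or skipped (0).
# $dir is the parent path ending with '/', $node the entry name,
# $is_dir tells whether the entry is a directory.
sub keep_entry {
    my ($dir, $node, $is_dir) = @_;
    my $full = $dir . $node;
    if ($is_dir) {
        # 'ignore' rule first: cheap comparison on the last segment.
        return 0 if grep { $node eq $_ } @{ $config{ignoredirs} || [] };
        # 'filter' rule next: regexps on the full path, if configured.
        return 0 if grep { $full =~ /$_/ } @{ $config{filterdirs} || [] };
    } else {
        my $re = $config{ignorefiles};
        return 0 if defined $re && $node =~ /$re/;
        return 0 if grep { $full =~ /$_/ } @{ $config{filterfiles} || [] };
    }
    return 1;
}

print keep_entry('/',         'include', 1) ? "kept\n" : "skipped\n";  # skipped
print keep_entry('/somelib/', 'include', 1) ? "kept\n" : "skipped\n";  # kept
```

With a full-path pattern such as '^/include$', the top-level /include is excluded while /somelib/include stays, which is the case raised at the start of this request. When 'filterdirs' or 'filterfiles' is absent, the corresponding test runs over an empty list, so trees that do not need full-path filtering pay almost nothing.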

    The exclusion rules are checked only in getdir(). This makes it possible to bypass them by typing an otherwise forbidden path as a URL in the browser address bar. Of course, the locally declared variables or functions will not be highlighted since they have not been indexed by genxref. There's no free meal!