LXR Cross Referencer / Bugs / #41 Indexing of files once indexed is buggy!

Nobody/Anonymous - 2002-02-16

Logged In: NO

I was in the impression that a file may never ever change again, except if (and only if) the file was changed and has either got a new CVS revision (or tag) or if there is a new directory for a new version of the whole project (if it is not managed by CVS).

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Shree - 2002-02-16

Logged In: YES
user_id=142912

There are two cases where the scenario that I've referred to applies:

1. Files are not in CVS [ ie usage of "Files.pm" ]. You run genxref, then change a file & genxref again

2. Files are in CVS, and you want to index the "head" tag. Files change regularly, and you want to keep the

cross reference in sync - probably by running genxref once an hour or so [as a cron job].

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Malcolm Box - 2002-02-18

Logged In: YES
user_id=215386

Yes, you're right, this is a bug. The underlying assumption
that is being broken is that the files in a version are
static - which is true if one is indexing released software,
but not if it is a development tree.

The simplest work-around is to drop and recreate the
database each time, thus avoiding the problem. For small to
medium repositories with the index updated nightly this
should work fine, but it doesn't work for large repositories.

The full solution would appear to be to check for an
existing entry for the (filename, release) pair and if it is
found delete it and all associated information.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Malcolm Box - 2002-02-18

milestone: 161102 --> current_cvs

priority: 5 --> 7

assigned_to: nobody --> mbox
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Shree - 2002-02-19

Logged In: YES
user_id=142912

Here's my fix for this bug:

Add a field "timestamp" to the "status" table. And remove
the "status" field.

Before finding identifiers in a file, check whether it's
modification time is greater that it was previously. If
yes, then remove all the identifier definitions due to this
file [and release] from the database. Store the new
timestamp in the database.

Before finding references in a file, remove all identifier
references due to this file [and release] from the database.
[ No need to check the timestamp in this case since
the "definitions" are always found before the references].

In a large CVS tree, it is quite possible that a file may
change between the time it is "indexed" and "referenced".
An easy way out of this seems to be to "index" a file and
immediately "reference" it.

Related to this there is a problem in "Plain.pm" - the
current "filerev" function returns a value based on the
timestamp. Problem arises if a file changes between runs
of genxref. What happens is that different values are
returned by "filerev" even though it is the same
(file,revision) pair is being indexed [or referenced].

I have changed filerev() for this purpose as

sub filerev {
my ($self, $filename, $release) = @_;

# TODO: length of filename+revision
# might turn out to be > 255 chars
# [length used in the db]
return join("-", $filename,
$release);
}

With this modification filerev() will return the same value
for (file,revision) pair everytime - thus solving the
problem.

I have a patch ready for this.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Gregor Hartmann - 2002-06-07

Logged In: YES
user_id=559509

Another similar problem would be files ore whole directories that are deleted from the source tree. They would stay
in the database forever as well.

Maybe it could be fixed by iterating through all files in the database and removing those (from the database) which
have changed or were removed in the source tree.
then proceed indexing as before.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Richard Kisley - 2003-04-08

Logged In: YES
user_id=102080

(1) How about an intermediate solution, where someone writes
a VERIFY script which compares the paths in the database
with the version they refer to and deletes entries for
invalid paths? Same options as genxref?

I indexed a subtree of a sourcetree, then realized I needed
to index the whole source tree. So I moved the revision
main dir (since it was really a subdir) up a level and added
the other directories at their proper top level as other
subdirs, then re-indexed. Now the original links in tree
one are all dead links, with live dupes. No files changed.

(2) So we all don't have to scurry for our SQL books, buried
in a box in the back of a closet at home (not work) how
about posting the exact drop syntax? That might also be a
good thing to add to doc short-term, since genxref doesn't
work (prune) as expected.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Heikki Toivonen - 2004-03-12

Logged In: YES
user_id=972898

Anyone have a patch for this?

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Nobody/Anonymous - 2004-07-20

Logged In: NO

Still no fix for this?

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Nobody/Anonymous - 2004-07-20

Logged In: NO

Shree, you've said you had a patch, could you attach to the
Tracker?
Thanks,
Dennis

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Dave Brondsema - 2004-07-20

Logged In: YES
user_id=341298

I have recently added the --reindexall option to genxref (in
CVS). Please try and see if this works. Perhaps it should
be default and not an option if it does.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Nobody/Anonymous - 2004-07-20

Logged In: NO

Sorry, I can't access CVS (firewall).
Does your switch do "intelligent" job like described in message
dated "2002-02-18 23:21" below?

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Dave Brondsema - 2004-07-20

Logged In: YES
user_id=341298

Yes. For the version being indexed, it deletes all data
directly related to that version.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Brian J. Johnson - 2004-10-21

Logged In: YES
user_id=85501

Here's a patch to provide incremental reindexing. The
--reindexall
option was useful, but not really what's required (at least,
not what
I really wanted): I want to avoid reindexing files which
haven't
changed, remove all index info. for files which have been
deleted, and
correctly reindex (i.e. without duplication) files which
which have
changed. This patch does so. ItWorksForMe, please let me
know if it
breaks for you. (And if someone wants to add support for
databases
besides mysql, please do!)

I _think_ I got the database manipulation right; please let
me know if
I've made any bad assumptions about how the tables relate,
or about
the semantics of the various fields. I've tested this with
Files::Plain and Index::Mysql.

The patch is against lxr 0.9.3.

Thanks,
Brian

----------------------------------------------------------------------
Patch to add incremental indexing to lxr-0.9.3
By Brian J. Johnson 10/21/04

This patch adds incremental indexing to lxr-0.9.3. That is,
files
which have not changed are not reindexed, and files which
have changed
or been removed have their old information erased from the
database
before they are reindexed. On my machine, this saves _hours_ of
reindexing time on trees which don't change much from day to
day.

The patch modifies genxref to add an extra pass before the
existing
gensearch, genindex, and genrefs passes. For each file in
the current
release, it retrieves the file's revision from the database
and checks
if the file has changed in the file store: i.e.
$files->filerev() is
different from the revision in the database. If the file
has changed,
it calls $index->purgefile() to remove that fileid (i.e.
[filename,
revision] pair) from the release. If that fileid is no
longer active
in any release, $index->purgefile() removes it from the
other tables
as well. Then the genindex and genrefs passes add the new
revision of
the file (with a new fileid) to the database.

Symbols can still be left around (as with the "reindexall"
option), so
the administrator should drop and regenerate the database
completely
from scratch every so often.

I haven't tried to add support for any database besides
mysql, as I
have no way to test the changes. It should be pretty easy
for others
to add it, though. (I don't really even know SQL, and I
could do the
mysql port in a few hours, working solely from the
documentation at
mysql.com.)

Questions and comments:
- $self->{releases_select} was not being undef-ed in the DESTROY
routine. Added that.
- When is it necessary to call $self->{xxx}->finish()? Some dbi
queries seem to do so, and others don't.

Index: lxr-0.9.3/genxref

--- lxr-0.9.3.orig/genxref 2004-10-21 13:19:44.000000000 -0500
+++ lxr-0.9.3/genxref 2004-10-21 13:38:51.000000000 -0500
@@ -89,11 +89,29 @@

foreach my $version (@versions) {
$index->purge($version) if $option{'reindexall'};
+ cleanindex($version);
gensearch($version);
genindex('/', $version);
genrefs('/', $version);
}

+sub cleanindex {
+ my ($release) = @_;
+ my ($f, @files);
+
+ @files = $index->getfiles($release);
+ foreach $f (@files) {
+ # $f == [filename, fileid, revision]
+ # Skip this file if it is still at the same revision
+ next if $files->filerev($f->[0], $release) == $f->[2];
+
+ print(STDERR "%%% DELETED/MODIFIED: ", join(" ", @$f), "\n");
+
+ # Remove old revision from this release.
+ $index->purgefile($f->[1], $release);
+ }
+}
+
sub genindex {
my ($pathname, $release) = @_;

Index: lxr-0.9.3/lib/LXR/Index/Mysql.pm

--- lxr-0.9.3.orig/lib/LXR/Index/Mysql.pm 2004-10-21
13:19:44.000000000 -0500
+++ lxr-0.9.3/lib/LXR/Index/Mysql.pm 2004-10-21
13:39:07.000000000 -0500
@@ -56,6 +56,11 @@
$self->{dbh}
->prepare("insert into ${prefix}files (filename,
revision, fileid) values (?, ?, NULL)");

+ $self->{allfiles_select} =
+ $self->{dbh}->prepare("select f.filename, f.fileid,
f.revision "
+ . "from ${prefix}files f, ${prefix}releases r "
+ . "where f.fileid = r.fileid and r.release = ?");
+
$self->{symbols_byname} =
$self->{dbh}->prepare("select symid from
${prefix}symbols where symname = ?");
$self->{symbols_byid} =
@@ -81,6 +86,10 @@
$self->{dbh}->prepare("select * from ${prefix}releases
where fileid = ? and release = ?");
$self->{releases_insert} =
$self->{dbh}->prepare("insert into ${prefix}releases
(fileid, release) values (?, ?)");
+ $self->{releases_delete} =
+ $self->{dbh}->prepare("delete from ${prefix}releases
where fileid = ? and release = ?");
+ $self->{releases_select_file} =
+ $self->{dbh}->prepare("select * from ${prefix}releases
where fileid = ?");

$self->\{status\_get\} = $self->\{dbh\}->prepare\("select status from

${prefix}status where fileid = ?");
@@ -134,6 +143,19 @@
. "where f.fileid = r.fileid "
. "and r.release = ?");

+ $self->{indexes_del_fileid} =
+ $self->{dbh}->prepare("delete from ${prefix}indexes "
+ . "where ${prefix}indexes.fileid = ?");
+ $self->{useage_del_fileid} =
+ $self->{dbh}->prepare("delete from ${prefix}useage "
+ . "where ${prefix}useage.fileid = ?");
+ $self->{status_del_fileid} =
+ $self->{dbh}->prepare("delete from ${prefix}status "
+ . "where ${prefix}status.fileid = ?");
+ $self->{files_del_fileid} =
+ $self->{dbh}->prepare("delete from ${prefix}files "
+ . "where ${prefix}files.fileid = ?");
+
return $self;
}

@@ -326,6 +348,47 @@
return $id;
}

+# List all files in this release
+sub getfiles {
+ my ($self, $release) = @_;
+ my ($rows, @ret);
+
+ $rows = $self->{allfiles_select}->execute("$release");
+
+ while ($rows-- > 0) {
+ push(@ret, [ $self->{allfiles_select}->fetchrow_array ] );
+ }
+
+ $self->{allfiles_select}->finish();
+
+ return @ret;
+}
+
+# Remove all references to $fileid from $release
+sub purgefile {
+ my ($self, $fileid, $release) = @_;
+
+ # Remove $fileid from $release
+ $self->{releases_delete}->execute($fileid, $release);
+ $self->{releases_delete}->finish();
+
+ # Find how many releases still reference $fileid
+ my $rows = $self->{releases_select_file}->execute($fileid
+ 0);
+ $self->{releases_select_file}->finish();
+
+ # If none, remove fileid from all other tables
+ unless ($rows > 0) {
+ # we don't delete symbols, because they might be used by
other
+ # versions so we can end up with unused symbols, but that
+ # doesn't cause any problems. Drop and rebuild the database
+ # from time to time if it bothers you.
+ $self->{indexes_del_fileid}->execute($fileid);
+ $self->{useage_del_fileid}->execute($fileid);
+ $self->{status_del_fileid}->execute($fileid);
+ $self->{files_del_fileid}->execute($fileid);
+ }
+}
+
sub purge {
my ($self, $version) = @_;

@@ -342,11 +405,15 @@
my ($self) = @_;
$self->{files_select} = undef;
$self->{files_insert} = undef;
+ $self->{allfiles_select} = undef;
$self->{symbols_byname} = undef;
$self->{symbols_byid} = undef;
$self->{symbols_insert} = undef;
$self->{indexes_insert} = undef;
+ $self->{releases_select} = undef;
$self->{releases_insert} = undef;
+ $self->{releases_delete} = undef;
+ $self->{releases_select_file} = undef;
$self->{status_insert} = undef;
$self->{status_update} = undef;
$self->{usage_insert} = undef;
@@ -358,6 +425,10 @@
$self->{delete_status} = undef;
$self->{delete_releases} = undef;
$self->{delete_files} = undef;
+ $self->{indexes_del_fileid} = undef;
+ $self->{useage_del_fileid} = undef;
+ $self->{status_del_fileid} = undef;
+ $self->{files_del_fileid} = undef;

if $$self->\{dbh\}$ \{ $self->\{dbh\}->disconnect;
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Andre-Littoz - 2012-09-10

At least fixed in 1.0 release after 10 years! Always be patient.

The fix is based on the suggestion submitted by bjjohnson but requires extra subtleties to avoid erasing a base revision still in use in another version. Since the DB schemas now include reference counts, some tests are easier and faster because they do not involve a extensive query through the whole DB only to count referencing records.

One feature is not handled is the suggestion: the "member of" relation in a definition. It is necessary to take it into account to update a symbol reference count before removing a definition. This action disturbs the "sequential" processing of the definition. Consequently a symbol cache is used to minimize DB queries (important for large projects such as the Linux kernel).

Symbols are left in the DB on the assumption that cleaning was caused by an indexing request: the new file revision will likely differ little from the previous one in the parsed symbols.

Since the fix is rather complex, read CVS for details.

Note: with Index.pm rewriting (common factoring what is DB-engine independent), the fix is valid for all the currently supported DB.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Andre-Littoz - 2012-09-10

assigned_to: mbox --> ajlittoz

status: open --> closed-fixed
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Indexing of files once indexed is buggy!

Group

Searches

Help

#41 Indexing of files once indexed is buggy!

Discussion

Index: lxr-0.9.3/genxref

Index: lxr-0.9.3/lib/LXR/Index/Mysql.pm