[Lxr-dev] [ lxr-Bugs-518365 ] Indexing of files once indexed is buggy!
Brought to you by:
ajlittoz
From: SourceForge.net <no...@so...> - 2004-10-21 19:01:02
|
Bugs item #518365, was opened at 2002-02-16 04:04 Message generated for change (Comment added) made by bjjohnson You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=390117&aid=518365&group_id=27350 Category: genxref Group: current cvs Status: Open Resolution: None Priority: 7 Submitted By: Shree Kumar (shreekumar) Assigned to: Malcolm Box (mbox) Summary: Indexing of files once indexed is buggy! Initial Comment: I am using LXR-0.9.1 Consider this scenario : There is a source tree "test" having only one file - test.c test.c ------- #define TEST 100 now, I run genxref & when I search for TEST in identifiers, I get that it is a macro defined in test.c at line 1 now I change test.c to ------- #define T 1 #define TEST 100 & run genxref Now what I get is - TEST is defined as a macro in test.c in line 1 and line 2 ! The culprit is this piece of code in function processfile() [ Tagger.pm ] ------ if ($index->toindex($fileid)) { $index->empty_cache(); print(STDERR "--- $pathname $fileid\n"); my $path = $files->tmpfile($pathname, $release); $lang->indexfile($pathname, $path, $fileid, $index, $config); unlink($path); } else { print(STDERR "$pathname was already indexed\n"); } ------ The problem is that if the file already existed and has changed since then [based on the timestamp], the identifiers added to the database due to this file in the previous run of genxref are not removed from the database, hence the number of definitions will keep on growing... The same problem is also present in processrefs(). ---------------------------------------------------------------------- Comment By: Brian J. Johnson (bjjohnson) Date: 2004-10-21 14:01 Message: Logged In: YES user_id=85501 Here's a patch to provide incremental reindexing. The --reindexall option was useful, but not really what's required (at least, not what I really wanted): I want to avoid reindexing files which haven't changed, remove all index info. for files which have been deleted, and correctly reindex (i.e. without duplication) files which which have changed. This patch does so. ItWorksForMe, please let me know if it breaks for you. (And if someone wants to add support for databases besides mysql, please do!) I _think_ I got the database manipulation right; please let me know if I've made any bad assumptions about how the tables relate, or about the semantics of the various fields. I've tested this with Files::Plain and Index::Mysql. The patch is against lxr 0.9.3. Thanks, Brian ---------------------------------------------------------------------- Patch to add incremental indexing to lxr-0.9.3 By Brian J. Johnson 10/21/04 This patch adds incremental indexing to lxr-0.9.3. That is, files which have not changed are not reindexed, and files which have changed or been removed have their old information erased from the database before they are reindexed. On my machine, this saves _hours_ of reindexing time on trees which don't change much from day to day. The patch modifies genxref to add an extra pass before the existing gensearch, genindex, and genrefs passes. For each file in the current release, it retrieves the file's revision from the database and checks if the file has changed in the file store: i.e. $files->filerev() is different from the revision in the database. If the file has changed, it calls $index->purgefile() to remove that fileid (i.e. [filename, revision] pair) from the release. If that fileid is no longer active in any release, $index->purgefile() removes it from the other tables as well. Then the genindex and genrefs passes add the new revision of the file (with a new fileid) to the database. Symbols can still be left around (as with the "reindexall" option), so the administrator should drop and regenerate the database completely from scratch every so often. I haven't tried to add support for any database besides mysql, as I have no way to test the changes. It should be pretty easy for others to add it, though. (I don't really even know SQL, and I could do the mysql port in a few hours, working solely from the documentation at mysql.com.) Questions and comments: - $self->{releases_select} was not being undef-ed in the DESTROY routine. Added that. - When is it necessary to call $self->{xxx}->finish()? Some dbi queries seem to do so, and others don't. Index: lxr-0.9.3/genxref =================================================================== --- lxr-0.9.3.orig/genxref 2004-10-21 13:19:44.000000000 -0500 +++ lxr-0.9.3/genxref 2004-10-21 13:38:51.000000000 -0500 @@ -89,11 +89,29 @@ foreach my $version (@versions) { $index->purge($version) if $option{'reindexall'}; + cleanindex($version); gensearch($version); genindex('/', $version); genrefs('/', $version); } +sub cleanindex { + my ($release) = @_; + my ($f, @files); + + @files = $index->getfiles($release); + foreach $f (@files) { + # $f == [filename, fileid, revision] + # Skip this file if it is still at the same revision + next if $files->filerev($f->[0], $release) == $f->[2]; + + print(STDERR "%%% DELETED/MODIFIED: ", join(" ", @$f), "\n"); + + # Remove old revision from this release. + $index->purgefile($f->[1], $release); + } +} + sub genindex { my ($pathname, $release) = @_; Index: lxr-0.9.3/lib/LXR/Index/Mysql.pm =================================================================== --- lxr-0.9.3.orig/lib/LXR/Index/Mysql.pm 2004-10-21 13:19:44.000000000 -0500 +++ lxr-0.9.3/lib/LXR/Index/Mysql.pm 2004-10-21 13:39:07.000000000 -0500 @@ -56,6 +56,11 @@ $self->{dbh} ->prepare("insert into ${prefix}files (filename, revision, fileid) values (?, ?, NULL)"); + $self->{allfiles_select} = + $self->{dbh}->prepare("select f.filename, f.fileid, f.revision " + . "from ${prefix}files f, ${prefix}releases r " + . "where f.fileid = r.fileid and r.release = ?"); + $self->{symbols_byname} = $self->{dbh}->prepare("select symid from ${prefix}symbols where symname = ?"); $self->{symbols_byid} = @@ -81,6 +86,10 @@ $self->{dbh}->prepare("select * from ${prefix}releases where fileid = ? and release = ?"); $self->{releases_insert} = $self->{dbh}->prepare("insert into ${prefix}releases (fileid, release) values (?, ?)"); + $self->{releases_delete} = + $self->{dbh}->prepare("delete from ${prefix}releases where fileid = ? and release = ?"); + $self->{releases_select_file} = + $self->{dbh}->prepare("select * from ${prefix}releases where fileid = ?"); $self->{status_get} = $self->{dbh}->prepare("select status from ${prefix}status where fileid = ?"); @@ -134,6 +143,19 @@ . "where f.fileid = r.fileid " . "and r.release = ?"); + $self->{indexes_del_fileid} = + $self->{dbh}->prepare("delete from ${prefix}indexes " + . "where ${prefix}indexes.fileid = ?"); + $self->{useage_del_fileid} = + $self->{dbh}->prepare("delete from ${prefix}useage " + . "where ${prefix}useage.fileid = ?"); + $self->{status_del_fileid} = + $self->{dbh}->prepare("delete from ${prefix}status " + . "where ${prefix}status.fileid = ?"); + $self->{files_del_fileid} = + $self->{dbh}->prepare("delete from ${prefix}files " + . "where ${prefix}files.fileid = ?"); + return $self; } @@ -326,6 +348,47 @@ return $id; } +# List all files in this release +sub getfiles { + my ($self, $release) = @_; + my ($rows, @ret); + + $rows = $self->{allfiles_select}->execute("$release"); + + while ($rows-- > 0) { + push(@ret, [ $self->{allfiles_select}->fetchrow_array ] ); + } + + $self->{allfiles_select}->finish(); + + return @ret; +} + +# Remove all references to $fileid from $release +sub purgefile { + my ($self, $fileid, $release) = @_; + + # Remove $fileid from $release + $self->{releases_delete}->execute($fileid, $release); + $self->{releases_delete}->finish(); + + # Find how many releases still reference $fileid + my $rows = $self->{releases_select_file}->execute($fileid + 0); + $self->{releases_select_file}->finish(); + + # If none, remove fileid from all other tables + unless ($rows > 0) { + # we don't delete symbols, because they might be used by other + # versions so we can end up with unused symbols, but that + # doesn't cause any problems. Drop and rebuild the database + # from time to time if it bothers you. + $self->{indexes_del_fileid}->execute($fileid); + $self->{useage_del_fileid}->execute($fileid); + $self->{status_del_fileid}->execute($fileid); + $self->{files_del_fileid}->execute($fileid); + } +} + sub purge { my ($self, $version) = @_; @@ -342,11 +405,15 @@ my ($self) = @_; $self->{files_select} = undef; $self->{files_insert} = undef; + $self->{allfiles_select} = undef; $self->{symbols_byname} = undef; $self->{symbols_byid} = undef; $self->{symbols_insert} = undef; $self->{indexes_insert} = undef; + $self->{releases_select} = undef; $self->{releases_insert} = undef; + $self->{releases_delete} = undef; + $self->{releases_select_file} = undef; $self->{status_insert} = undef; $self->{status_update} = undef; $self->{usage_insert} = undef; @@ -358,6 +425,10 @@ $self->{delete_status} = undef; $self->{delete_releases} = undef; $self->{delete_files} = undef; + $self->{indexes_del_fileid} = undef; + $self->{useage_del_fileid} = undef; + $self->{status_del_fileid} = undef; + $self->{files_del_fileid} = undef; if ($self->{dbh}) { $self->{dbh}->disconnect(); ---------------------------------------------------------------------- Comment By: Dave Brondsema (brondsem) Date: 2004-07-20 10:25 Message: Logged In: YES user_id=341298 Yes. For the version being indexed, it deletes all data directly related to that version. ---------------------------------------------------------------------- Comment By: Nobody/Anonymous (nobody) Date: 2004-07-20 09:42 Message: Logged In: NO Sorry, I can't access CVS (firewall). Does your switch do "intelligent" job like described in message dated "2002-02-18 23:21" below? ---------------------------------------------------------------------- Comment By: Dave Brondsema (brondsem) Date: 2004-07-20 07:21 Message: Logged In: YES user_id=341298 I have recently added the --reindexall option to genxref (in CVS). Please try and see if this works. Perhaps it should be default and not an option if it does. ---------------------------------------------------------------------- Comment By: Nobody/Anonymous (nobody) Date: 2004-07-20 06:36 Message: Logged In: NO Shree, you've said you had a patch, could you attach to the Tracker? Thanks, Dennis ---------------------------------------------------------------------- Comment By: Nobody/Anonymous (nobody) Date: 2004-07-20 06:31 Message: Logged In: NO Still no fix for this? ---------------------------------------------------------------------- Comment By: Heikki Toivonen (hjtoi) Date: 2004-03-12 11:48 Message: Logged In: YES user_id=972898 Anyone have a patch for this? ---------------------------------------------------------------------- Comment By: Richard Kisley (kisley) Date: 2003-04-07 19:39 Message: Logged In: YES user_id=102080 (1) How about an intermediate solution, where someone writes a VERIFY script which compares the paths in the database with the version they refer to and deletes entries for invalid paths? Same options as genxref? I indexed a subtree of a sourcetree, then realized I needed to index the whole source tree. So I moved the revision main dir (since it was really a subdir) up a level and added the other directories at their proper top level as other subdirs, then re-indexed. Now the original links in tree one are all dead links, with live dupes. No files changed. (2) So we all don't have to scurry for our SQL books, buried in a box in the back of a closet at home (not work) how about posting the exact drop syntax? That might also be a good thing to add to doc short-term, since genxref doesn't work (prune) as expected. ---------------------------------------------------------------------- Comment By: Gregor Hartmann (grex) Date: 2002-06-07 08:10 Message: Logged In: YES user_id=559509 Another similar problem would be files ore whole directories that are deleted from the source tree. They would stay in the database forever as well. Maybe it could be fixed by iterating through all files in the database and removing those (from the database) which have changed or were removed in the source tree. then proceed indexing as before. ---------------------------------------------------------------------- Comment By: Shree Kumar (shreekumar) Date: 2002-02-19 01:21 Message: Logged In: YES user_id=142912 Here's my fix for this bug: Add a field "timestamp" to the "status" table. And remove the "status" field. Before finding identifiers in a file, check whether it's modification time is greater that it was previously. If yes, then remove all the identifier definitions due to this file [and release] from the database. Store the new timestamp in the database. Before finding references in a file, remove all identifier references due to this file [and release] from the database. [ No need to check the timestamp in this case since the "definitions" are always found before the references]. In a large CVS tree, it is quite possible that a file may change between the time it is "indexed" and "referenced". An easy way out of this seems to be to "index" a file and immediately "reference" it. Related to this there is a problem in "Plain.pm" - the current "filerev" function returns a value based on the timestamp. Problem arises if a file changes between runs of genxref. What happens is that different values are returned by "filerev" even though it is the same (file,revision) pair is being indexed [or referenced]. I have changed filerev() for this purpose as sub filerev { my ($self, $filename, $release) = @_; # TODO: length of filename+revision # might turn out to be > 255 chars # [length used in the db] return join("-", $filename, $release); } With this modification filerev() will return the same value for (file,revision) pair everytime - thus solving the problem. I have a patch ready for this. ---------------------------------------------------------------------- Comment By: Malcolm Box (mbox) Date: 2002-02-18 08:20 Message: Logged In: YES user_id=215386 Yes, you're right, this is a bug. The underlying assumption that is being broken is that the files in a version are static - which is true if one is indexing released software, but not if it is a development tree. The simplest work-around is to drop and recreate the database each time, thus avoiding the problem. For small to medium repositories with the index updated nightly this should work fine, but it doesn't work for large repositories. The full solution would appear to be to check for an existing entry for the (filename, release) pair and if it is found delete it and all associated information. ---------------------------------------------------------------------- Comment By: Shree Kumar (shreekumar) Date: 2002-02-16 07:32 Message: Logged In: YES user_id=142912 There are two cases where the scenario that I've referred to applies: 1. Files are not in CVS [ ie usage of "Files.pm" ]. You run genxref, then change a file & genxref again 2. Files are in CVS, and you want to index the "head" tag. Files change regularly, and you want to keep the cross reference in sync - probably by running genxref once an hour or so [as a cron job]. ---------------------------------------------------------------------- Comment By: Nobody/Anonymous (nobody) Date: 2002-02-16 06:47 Message: Logged In: NO I was in the impression that a file may never ever change again, except if (and only if) the file was changed and has either got a new CVS revision (or tag) or if there is a new directory for a new version of the whole project (if it is not managed by CVS). ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=390117&aid=518365&group_id=27350 |