Re: [Lxr-dev] [ lxr-Bugs-518365 ] Indexing of files once indexed is buggy!

SourceForge Headquarters 1320 Columbia Street Suite 310 San Diego, CA 92101 +1 (858) 422-6466

Again, moving discussion of this to the list.

no...@so... wrote:

<bug is that if the version of a file changes in a revision, all the old 
data hangs around>

>The culprit is this piece of code in function processfile() [ Tagger.pm ]
>------
>          if ($index->toindex($fileid)) {
>                $index->empty_cache();
>                print(STDERR "--- $pathname $fileid\n");
>
>                my $path = $files->tmpfile($pathname, $release);
>
>                $lang->indexfile($pathname, $path, $fileid, $index, $config);
>                unlink($path);
>          } else {
>                print(STDERR "$pathname was already indexed\n");
>          }
>------
>The problem is that if the file already existed and has changed since then [based on the 
>timestamp], the identifiers added to the database due to this file in the previous run of genxref are 
>not removed from the database, hence the number of definitions will keep on growing...
>
>The same problem is also present in processrefs().
>
Yes, that's exactly what happens.  Essentially we notice that we haven't 
already indexed this file at this version, so we go ahead and do so, but 
without removing any old data for this (filename, release) pair.  This 
comes from the underlying assumption in LXR that the mapping from 
(filename, revision) to release is static for the life of the database.

>>Comment By: Shree Kumar (shreekumar)
>>
>Here's my fix for this bug:
>
>Add a field "timestamp" to the "status" table. And remove 
>the "status" field.
>
If there's no status field, how do you know whether you've indexed and 
referenced the file, or just indexed it?  Or even started indexing but 
not finished?  It's perfectly possible for an indexing run to fail 
half-way through (out of memory, crash, out of diskspace, whatever). 
 With no way of telling what's been done, you'd have to drop and 
recreate the database from scratch.

>Before finding identifiers in a file, check whether it's 
>modification time is greater that it was previously. If 
>yes, then remove all the identifier definitions due to this 
>file [and release] from the database. Store the new 
>timestamp in the database.
>
How do you know the modification time of a file in CVS or other 
non-Plain backends?  If you mean "filerev" by timestamp then this might 
work.

>Before finding references in a file, remove all identifier 
>references due to this file [and release] from the database.
>[ No need to check the timestamp in this case since 
>the "definitions" are always found before the references]. 
>
I don't understand this.  How do you know that you need to reference 
this file - is it because you needed to index it, so it must need 
referencing?

>In a large CVS tree, it is quite possible that a file may 
>change between the time it is "indexed" and "referenced". 
>An easy way out of this seems to be to "index" a file and 
>immediately "reference" it.
>
No, that won't work, since until you've indexed all the files in a 
release, you don't know all the identifiers and therefore you can't 
reference the file.  Hence the way the code works at the moment.

>Related to this there is a problem in "Plain.pm" - the 
>current  "filerev" function returns a value based on the 
>timestamp.  Problem arises if a file changes between runs 
>of genxref. What happens is that different values are 
>returned by "filerev" even though it is the same 
>(file,revision) pair is being indexed [or referenced].
>
That's by design - a filerev is not the same as a release.  If the file 
changes between runs of genxref, then it's a different file, so it needs 
to be processed again.  What Plain.pm is trying to emulate is the 
versioning 1.1, 1.2, 1.3 etc provided by CVS for a file.  The timestamp 
acts as this.

The basic underlying organisation of lxr is that there are a set of 
files with various revisions - the tuple (filename, filerev) uniquely 
identifies each distinct file.  So (filename, 1.1) is not the same 
entity as (filename, 1.2) which is correct.  Then a release is defined 
as a set of (filename, filerev) pairs - which means that the same file 
entity can appear in multiple releases.

This is intuitively correct - if you have a CVS repository with files 
evolving through revisions as they are modified, each release will tag a 
certain set of those files.  However, two releases will have the same 
file entity in them if the file has not been modified between releases. 
 Because LXR deals with file entities, if the file appears in several 
releases at the same revision number, it will be indexed only once.

Plain.pm then emulates this by using timestamps as the revision history 
of the file.  Therefore if you have two directories v1 and v2, and they 
both contain a file X with the same timestamp, LXR will treat them as 
the same file.  The easiest way to see this is to symlink between the 
two version directories - indexing will occur only once.  This mechanism 
only has one way of breaking - two files with the same name and 
timestamp that are actually not identical.  However, it would be a 
strange revision control system that would give you such files - most 
systems either give you the time of checkin (in which case the 
timestamps would not be identical) or the time of checkout (similarly, 
the timestamps should not be equal since the two files probably weren't 
written at the same time).  

The result of this is that with Plain.pm LXR will sometimes index the 
file more often that it needs to, but it should not decide not to index 
it when it does need to.   Using the Plain.pm backend essentially trades 
off diskspace for ease of use.

>I have changed filerev() for this purpose as
>
>sub filerev {
> my ($self, $filename, $release) = @_;
>
> # TODO: length of filename+revision
> #       might turn out to be > 255 chars 
> #       [length used in the db]
>        return join("-", $filename,
>                                $release);
>}
>
>With this modification filerev() will return the same value 
>for (file,revision) pair everytime - thus solving the 
>problem.
>
Nope, this is wrong for the reasons outlined above.  This returns the 
same value for filename, release every time - which is not the same as 
filename, revision.

>I have a patch ready for this.
>
When you have patches, please send them to the list.  It's usually 
easier to discuss code changes with the actual diff than to talk about 
the description - since the description and the diff may in fact differ, 
and either could be the Right Thing.

Cheers,

Malcolm

>