Re: [Lxr-dev] [ lxr-Bugs-518365 ] Indexing of files once indexed is buggy!
Brought to you by:
ajlittoz
From: Malcolm B. <ma...@br...> - 2002-02-20 03:44:13
|
Again, moving discussion of this to the list. no...@so... wrote: <bug is that if the version of a file changes in a revision, all the old data hangs around> >The culprit is this piece of code in function processfile() [ Tagger.pm ] >------ > if ($index->toindex($fileid)) { > $index->empty_cache(); > print(STDERR "--- $pathname $fileid\n"); > > my $path = $files->tmpfile($pathname, $release); > > $lang->indexfile($pathname, $path, $fileid, $index, $config); > unlink($path); > } else { > print(STDERR "$pathname was already indexed\n"); > } >------ >The problem is that if the file already existed and has changed since then [based on the >timestamp], the identifiers added to the database due to this file in the previous run of genxref are >not removed from the database, hence the number of definitions will keep on growing... > >The same problem is also present in processrefs(). > Yes, that's exactly what happens. Essentially we notice that we haven't already indexed this file at this version, so we go ahead and do so, but without removing any old data for this (filename, release) pair. This comes from the underlying assumption in LXR that the mapping from (filename, revision) to release is static for the life of the database. >>Comment By: Shree Kumar (shreekumar) >> >Here's my fix for this bug: > >Add a field "timestamp" to the "status" table. And remove >the "status" field. > If there's no status field, how do you know whether you've indexed and referenced the file, or just indexed it? Or even started indexing but not finished? It's perfectly possible for an indexing run to fail half-way through (out of memory, crash, out of diskspace, whatever). With no way of telling what's been done, you'd have to drop and recreate the database from scratch. >Before finding identifiers in a file, check whether it's >modification time is greater that it was previously. If >yes, then remove all the identifier definitions due to this >file [and release] from the database. Store the new >timestamp in the database. > How do you know the modification time of a file in CVS or other non-Plain backends? If you mean "filerev" by timestamp then this might work. >Before finding references in a file, remove all identifier >references due to this file [and release] from the database. >[ No need to check the timestamp in this case since >the "definitions" are always found before the references]. > I don't understand this. How do you know that you need to reference this file - is it because you needed to index it, so it must need referencing? >In a large CVS tree, it is quite possible that a file may >change between the time it is "indexed" and "referenced". >An easy way out of this seems to be to "index" a file and >immediately "reference" it. > No, that won't work, since until you've indexed all the files in a release, you don't know all the identifiers and therefore you can't reference the file. Hence the way the code works at the moment. >Related to this there is a problem in "Plain.pm" - the >current "filerev" function returns a value based on the >timestamp. Problem arises if a file changes between runs >of genxref. What happens is that different values are >returned by "filerev" even though it is the same >(file,revision) pair is being indexed [or referenced]. > That's by design - a filerev is not the same as a release. If the file changes between runs of genxref, then it's a different file, so it needs to be processed again. What Plain.pm is trying to emulate is the versioning 1.1, 1.2, 1.3 etc provided by CVS for a file. The timestamp acts as this. The basic underlying organisation of lxr is that there are a set of files with various revisions - the tuple (filename, filerev) uniquely identifies each distinct file. So (filename, 1.1) is not the same entity as (filename, 1.2) which is correct. Then a release is defined as a set of (filename, filerev) pairs - which means that the same file entity can appear in multiple releases. This is intuitively correct - if you have a CVS repository with files evolving through revisions as they are modified, each release will tag a certain set of those files. However, two releases will have the same file entity in them if the file has not been modified between releases. Because LXR deals with file entities, if the file appears in several releases at the same revision number, it will be indexed only once. Plain.pm then emulates this by using timestamps as the revision history of the file. Therefore if you have two directories v1 and v2, and they both contain a file X with the same timestamp, LXR will treat them as the same file. The easiest way to see this is to symlink between the two version directories - indexing will occur only once. This mechanism only has one way of breaking - two files with the same name and timestamp that are actually not identical. However, it would be a strange revision control system that would give you such files - most systems either give you the time of checkin (in which case the timestamps would not be identical) or the time of checkout (similarly, the timestamps should not be equal since the two files probably weren't written at the same time). The result of this is that with Plain.pm LXR will sometimes index the file more often that it needs to, but it should not decide not to index it when it does need to. Using the Plain.pm backend essentially trades off diskspace for ease of use. >I have changed filerev() for this purpose as > >sub filerev { > my ($self, $filename, $release) = @_; > > # TODO: length of filename+revision > # might turn out to be > 255 chars > # [length used in the db] > return join("-", $filename, > $release); >} > >With this modification filerev() will return the same value >for (file,revision) pair everytime - thus solving the >problem. > Nope, this is wrong for the reasons outlined above. This returns the same value for filename, release every time - which is not the same as filename, revision. >I have a patch ready for this. > When you have patches, please send them to the list. It's usually easier to discuss code changes with the actual diff than to talk about the description - since the description and the diff may in fact differ, and either could be the Right Thing. Cheers, Malcolm > |