Re: [Lxr-dev] [ lxr-Bugs-518365 ] Indexing of files once indexed is buggy!

Arne Georg Gleditsch wrote:

>This is a bit non-obvious, and I'm happy to see that you're able to
>summarise the issue so clearly when someone suggests to mangle stuff
>we've actually struggled a fair bit with in the past. :)  As a note,
>Plain.pm includes the file size in the revision string, which means
>that files would have to have the same timestamp and size as well as
>different contents for LXR to fail to index changed files.
>
I hadn't realised that size was included - that makes it even more
robust.  Certainly the terminology of releases, revisions etc. is not
that clear - it took me a while to get my head round it.  Moving to the
idea of being able to index a "HEAD" revision (i.e. one that is evolving)
clearly challenges some of the assumptions in the code, though not the
overall semantic model.
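
To make the revision-string point concrete, here is a minimal sketch in Python rather than LXR's Perl.  It assumes the revision string is simply the modification time and the size joined with a "-"; the exact format Plain.pm uses is not shown in this thread, and file_revision is a hypothetical name.

```python
import os
import tempfile

def file_revision(path):
    """Hypothetical revision string: modification time and size joined by '-'."""
    st = os.stat(path)
    return f"{int(st.st_mtime)}-{st.st_size}"

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "example.c")

    with open(path, "w") as f:
        f.write("int a;\n")
    os.utime(path, (1000, 1000))   # force a known timestamp
    rev1 = file_revision(path)

    # Different contents, same length, timestamp forced back to the same value:
    # the revision string cannot tell these two versions apart.
    with open(path, "w") as f:
        f.write("int b;\n")
    os.utime(path, (1000, 1000))
    rev2 = file_revision(path)

    # Different length: the size component changes even with the same timestamp.
    with open(path, "w") as f:
        f.write("int abc;\n")
    os.utime(path, (1000, 1000))
    rev3 = file_revision(path)
```

Only the rev1/rev2 case (same timestamp, same size, different contents) would slip past a check based on this string.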

>As far as solutions to this problem go; even with Plain.pm we have
>some notion of the set of files belonging to a particular release.
>Thus, when indexing a release and encountering a (filename,
>revision)-tuple belonging to it, we could invalidate all non-matching
>(filename, *)-tuples marked as belonging to the same release (and no
>other releases).  In doing this, we would also need to invalidate the
>reference-information for this release.  As long as we do that we'd be
>home free as far as database integrity is concerned, as far as I can
>see.
>
Indeed, this works very well.  I have got it going for the Postgres 
backend, since the nice referential integrity triggers make this kind of 
cascading delete very easy.  Unfortunately I haven't completed the port 
to the MySQL backend, since that takes much more manual grovelling to 
clean up.  This also won't be hitting the CVS repository for a while 
since the code is on a laptop which is being shipped from Japan to the
UK and so is now bobbing around on the Pacific Ocean at a guess :-)  Of
course, I might get frustrated enough with the bug to just re-code the
fix, but the "drop and rebuild the db every now and then" workaround is
working for me at the moment.
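
For anyone following along, the invalidation step can be sketched roughly as below.  This is not the code from the laptop: it uses Python with SQLite instead of Postgres, and the three-table schema (files, releases, symbols) and the function names are simplified assumptions, not the real LXR tables.  The point is only that ON DELETE CASCADE lets one DELETE on the file row clean up everything hanging off it.

```python
import sqlite3

# Hypothetical, simplified schema; the real LXR tables differ.
SCHEMA = """
CREATE TABLE files (
    fileid   INTEGER PRIMARY KEY,
    filename TEXT NOT NULL,
    revision TEXT NOT NULL
);
CREATE TABLE releases (
    fileid  INTEGER NOT NULL REFERENCES files(fileid) ON DELETE CASCADE,
    release TEXT NOT NULL
);
CREATE TABLE symbols (
    symname TEXT NOT NULL,
    fileid  INTEGER NOT NULL REFERENCES files(fileid) ON DELETE CASCADE,
    line    INTEGER NOT NULL
);
"""

def open_db():
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys (and cascades) when this is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn

def invalidate_stale(conn, release, filename, revision):
    """Delete (filename, *) rows that belong to `release` and to no other
    release, and whose revision differs from the one just indexed.
    Dependent rows in releases and symbols go away via the cascade."""
    cur = conn.execute(
        """
        DELETE FROM files
        WHERE filename = ?
          AND revision <> ?
          AND fileid IN (SELECT fileid FROM releases WHERE release = ?)
          AND fileid NOT IN (SELECT fileid FROM releases WHERE release <> ?)
        """,
        (filename, revision, release, release),
    )
    conn.commit()
    return cur.rowcount  # number of file rows deleted directly
```

The sketch leaves out the reference information for the release, which would also have to be invalidated as Arne describes.  On a backend without cascading deletes, each dependent table needs its own explicit DELETE.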

>(A possible shortcut would be to index (filename, rev2) before
>(possibly) invalidating the information for (filename, rev1) and only
>invalidate the reference-information if we find that the two define
>non-matching sets of symbols.)
>
It's probably more effort to track the new stuff and compare with the 
old than simply to delete and re-add.  The big problem (that's just 
occurred to me) is with the useage table.  If the new revision of the 
file defines new symbols, then for total accuracy all existing files 
need to be re-referenced to see if they use that symbol.  Luckily it's 
extremely unlikely that someone would add a new symbol in a file that 
retrospectively re-defines symbols in other files, but I guess it is a 
theoretical possibility.

This is also the reason why an "index file, reference file" loop doesn't
work, and why the "index all files", "reference all files" approach is
taken at the moment.
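
A toy illustration of the ordering problem, in Python and with made-up data structures (each file is just a list of symbols it defines and a list of tokens it contains; none of this is LXR code):

```python
def interleaved(files):
    """Index one file, then reference it straight away."""
    symbols, usage = set(), []
    for name, (defines, tokens) in files.items():
        symbols.update(defines)
        # Only symbols seen so far are known, so uses of symbols
        # defined in later files are silently missed.
        usage += [(name, t) for t in tokens if t in symbols]
    return usage

def two_pass(files):
    """Index all files first, then reference all files."""
    symbols = set()
    for defines, _ in files.values():
        symbols.update(defines)
    usage = []
    for name, (_, tokens) in files.items():
        usage += [(name, t) for t in tokens if t in symbols]
    return usage

# Hypothetical two-file tree: main.c uses helper, which util.c defines.
files = {
    "main.c": (["main"], ["main", "helper"]),
    "util.c": (["helper"], ["helper"]),
}
```

With this input the interleaved loop never records that main.c uses helper, because helper has not been indexed yet when main.c is referenced; the two-pass version does record it.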

Cheers,

Malcolm

P.S.  Good to see you back on the list again :-)