Thread: [Lxr-dev] [ lxr-Bugs-518365 ] Indexing of files once indexed is buggy!

lxr-developer

[Lxr-dev] [ lxr-Bugs-518365 ] Indexing of files once indexed is buggy!

From: <no...@so...> - 2002-02-16 10:04:42

Bugs item #518365, was opened at 2002-02-16 02:04
You can respond by visiting: 
http://sourceforge.net/tracker/?func=detail&atid=390117&aid=518365&group_id=27350

Category: genxref
Group: v0.9
Status: Open
Resolution: None
Priority: 5
Submitted By: Shree Kumar (shreekumar)
Assigned to: Nobody/Anonymous (nobody)
Summary: Indexing of files once indexed is buggy!

Initial Comment:
I am using LXR-0.9.1

Consider this scenario :

There is a source tree "test" having only one file - test.c

test.c
-------
#define TEST 100

Now I run genxref. When I search for TEST in identifiers, it is reported as a macro defined in 
test.c at line 1.

Now I change test.c to
-------
#define T 1
#define TEST 100

and run genxref again.

Now TEST is reported as a macro defined in test.c at both line 1 and line 2!

The culprit is this piece of code in function processfile() [ Tagger.pm ]
------
          if ($index->toindex($fileid)) {
                $index->empty_cache();
                print(STDERR "--- $pathname $fileid\n");

                my $path = $files->tmpfile($pathname, $release);

                $lang->indexfile($pathname, $path, $fileid, $index, $config);
                unlink($path);
          } else {
                print(STDERR "$pathname was already indexed\n");
          }
------
The problem is that if the file already existed and has changed since then [based on the 
timestamp], the identifiers added to the database due to this file in the previous run of genxref are 
not removed from the database, hence the number of definitions will keep on growing...

The same problem is also present in processrefs().


----------------------------------------------------------------------


[Lxr-dev] [ lxr-Bugs-518365 ] Indexing of files once indexed is buggy!

From: <no...@so...> - 2002-02-16 12:47:56

Comment By: Nobody/Anonymous (nobody)
Date: 2002-02-16 04:47

Message:
Logged In: NO 

I was under the impression that a file never changes once it has been indexed: if it is modified, it either gets a new CVS revision (or tag) or, if the project is not managed by CVS, it appears in a new directory for a new version of the whole project.

----------------------------------------------------------------------


[Lxr-dev] [ lxr-Bugs-518365 ] Indexing of files once indexed is buggy!

From: <no...@so...> - 2002-02-16 13:32:06

>Comment By: Shree Kumar (shreekumar)
Date: 2002-02-16 05:32

Message:
Logged In: YES 
user_id=142912

There are two cases where the scenario that I've referred to applies:

1. Files are not in CVS [i.e. "Files.pm" is used]. You run genxref, then change a file and run genxref again.

2. Files are in CVS, and you want to index the "head" tag. Files change regularly, and you want to keep the cross reference in sync, probably by running genxref once an hour or so [as a cron job].


[Lxr-dev] [ lxr-Bugs-518365 ] Indexing of files once indexed is buggy!

From: <no...@so...> - 2002-02-18 14:21:05

Category: genxref
>Group: current cvs
Status: Open
Resolution: None
>Priority: 7
Submitted By: Shree Kumar (shreekumar)
>Assigned to: Malcolm Box (mbox)
Summary: Indexing of files once indexed is buggy!

>Comment By: Malcolm Box (mbox)
Date: 2002-02-18 06:20

Message:
Logged In: YES 
user_id=215386

Yes, you're right, this is a bug.  The underlying assumption
that is being broken is that the files in a version are
static - which is true if one is indexing released software,
but not if it is a development tree.

The simplest work-around is to drop and recreate the
database each time, thus avoiding the problem.  For small to
medium repositories with the index updated nightly this
should work fine, but it doesn't work for large repositories.

The full solution would appear to be to check for an
existing entry for the (filename, release) pair and if it is
found delete it and all associated information.
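
The duplicate definition and the proposed delete-then-reindex fix can be modelled in a few lines. This is only a sketch in Python with an in-memory SQLite table, not LXR's Perl code or its real schema: the table name "defs", its columns, and the helper names are all assumptions made for illustration.

```python
import sqlite3

def make_db():
    # Hypothetical one-table schema; LXR's real schema is different.
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE defs (symbol TEXT, filename TEXT, release TEXT, line INTEGER)"
    )
    return db

def index_file(db, filename, release, macros, purge_old):
    """Record macro definitions; macros is a list of (symbol, line) pairs."""
    if purge_old:
        # The proposed fix: drop everything recorded earlier for this pair.
        db.execute(
            "DELETE FROM defs WHERE filename = ? AND release = ?",
            (filename, release),
        )
    db.executemany(
        "INSERT INTO defs VALUES (?, ?, ?, ?)",
        [(sym, filename, release, line) for sym, line in macros],
    )

def lines_defining(db, symbol):
    rows = db.execute(
        "SELECT line FROM defs WHERE symbol = ? ORDER BY line", (symbol,)
    )
    return [line for (line,) in rows]

def run(purge_old):
    """Index test.c twice, before and after '#define T 1' is added on top."""
    db = make_db()
    index_file(db, "test.c", "head", [("TEST", 1)], purge_old)
    index_file(db, "test.c", "head", [("T", 1), ("TEST", 2)], purge_old)
    return lines_defining(db, "TEST")
```

Calling run(False) reproduces the report (TEST listed at lines 1 and 2), while run(True) leaves only line 2.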


[Lxr-dev] [ lxr-Bugs-518365 ] Indexing of files once indexed is buggy!

From: <no...@so...> - 2002-02-19 07:21:41

>Comment By: Shree Kumar (shreekumar)
Date: 2002-02-18 23:21

Message:
Logged In: YES 
user_id=142912

Here's my fix for this bug:

Add a field "timestamp" to the "status" table, and remove 
the "status" field.

Before finding identifiers in a file, check whether its 
modification time is later than it was previously. If 
so, remove all the identifier definitions due to this 
file [and release] from the database, then store the new 
timestamp in the database.

Before finding references in a file, remove all identifier 
references due to this file [and release] from the database.
[ No need to check the timestamp in this case since 
the "definitions" are always found before the references]. 

In a large CVS tree, it is quite possible that a file may 
change between the time it is "indexed" and "referenced". 
An easy way out of this seems to be to "index" a file and 
immediately "reference" it.

Related to this, there is a problem in "Plain.pm": the 
current "filerev" function returns a value based on the 
timestamp. A problem arises if a file changes between runs 
of genxref: "filerev" returns different values even though 
the same (file,revision) pair is being indexed [or 
referenced].

I have changed filerev() for this purpose as

sub filerev {
        my ($self, $filename, $release) = @_;

        # TODO: length of filename+revision
        #       might turn out to be > 255 chars
        #       [length used in the db]
        return join("-", $filename, $release);
}

With this modification, filerev() will return the same value 
for a (file,revision) pair every time, thus solving the 
problem.

I have a patch ready for this.



Re: [Lxr-dev] [ lxr-Bugs-518365 ] Indexing of files once indexed is buggy!

From: Malcolm B. <ma...@br...> - 2002-02-20 03:44:13

Again, moving discussion of this to the list.

no...@so... wrote:

<bug is that if the version of a file changes in a revision, all the old 
data hangs around>

>The culprit is this piece of code in function processfile() [ Tagger.pm ]
>------
>          if ($index->toindex($fileid)) {
>                $index->empty_cache();
>                print(STDERR "--- $pathname $fileid\n");
>
>                my $path = $files->tmpfile($pathname, $release);
>
>                $lang->indexfile($pathname, $path, $fileid, $index, $config);
>                unlink($path);
>          } else {
>                print(STDERR "$pathname was already indexed\n");
>          }
>------
>The problem is that if the file already existed and has changed since then [based on the 
>timestamp], the identifiers added to the database due to this file in the previous run of genxref are 
>not removed from the database, hence the number of definitions will keep on growing...
>
>The same problem is also present in processrefs().
>
Yes, that's exactly what happens.  Essentially we notice that we haven't 
already indexed this file at this version, so we go ahead and do so, but 
without removing any old data for this (filename, release) pair.  This 
comes from the underlying assumption in LXR that the mapping from 
(filename, revision) to release is static for the life of the database.

>>Comment By: Shree Kumar (shreekumar)
>>
>Here's my fix for this bug:
>
>Add a field "timestamp" to the "status" table. And remove 
>the "status" field.
>
If there's no status field, how do you know whether you've indexed and 
referenced the file, or just indexed it?  Or even started indexing but 
not finished?  It's perfectly possible for an indexing run to fail 
half-way through (out of memory, crash, out of disk space, whatever). 
With no way of telling what's been done, you'd have to drop and 
recreate the database from scratch.

>Before finding identifiers in a file, check whether its 
>modification time is greater than it was previously. If 
>yes, then remove all the identifier definitions due to this 
>file [and release] from the database. Store the new 
>timestamp in the database.
>
How do you know the modification time of a file in CVS or other 
non-Plain backends?  If you mean "filerev" by timestamp then this might 
work.

>Before finding references in a file, remove all identifier 
>references due to this file [and release] from the database.
>[ No need to check the timestamp in this case since 
>the "definitions" are always found before the references]. 
>
I don't understand this.  How do you know that you need to reference 
this file - is it because you needed to index it, so it must need 
referencing?

>In a large CVS tree, it is quite possible that a file may 
>change between the time it is "indexed" and "referenced". 
>An easy way out of this seems to be to "index" a file and 
>immediately "reference" it.
>
No, that won't work, since until you've indexed all the files in a 
release, you don't know all the identifiers and therefore you can't 
reference the file.  Hence the way the code works at the moment.

>Related to this there is a problem in "Plain.pm" - the 
>current  "filerev" function returns a value based on the 
>timestamp.  Problem arises if a file changes between runs 
>of genxref. What happens is that different values are 
>returned by "filerev" even though the same 
>(file,revision) pair is being indexed [or referenced].
>
That's by design - a filerev is not the same as a release.  If the file 
changes between runs of genxref, then it's a different file, so it needs 
to be processed again.  What Plain.pm is trying to emulate is the 
versioning 1.1, 1.2, 1.3 etc provided by CVS for a file.  The timestamp 
acts as this.

The basic underlying organisation of lxr is that there are a set of 
files with various revisions - the tuple (filename, filerev) uniquely 
identifies each distinct file.  So (filename, 1.1) is not the same 
entity as (filename, 1.2) which is correct.  Then a release is defined 
as a set of (filename, filerev) pairs - which means that the same file 
entity can appear in multiple releases.

This is intuitively correct - if you have a CVS repository with files 
evolving through revisions as they are modified, each release will tag a 
certain set of those files.  However, two releases will have the same 
file entity in them if the file has not been modified between releases. 
Because LXR deals with file entities, if the file appears in several 
releases at the same revision number, it will be indexed only once.
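
The model described above can be sketched as a small data structure. This is an illustration in Python rather than LXR's Perl, and the class, attribute and method names are invented for the example: a release is a set of (filename, filerev) entities, and an entity is parsed only the first time it is seen, no matter how many releases contain it.

```python
from collections import defaultdict

class Index:
    """Toy model: releases are sets of (filename, filerev) entities."""

    def __init__(self):
        self.indexed = set()               # entities already parsed
        self.releases = defaultdict(set)   # release -> set of entities
        self.parse_count = 0               # how many times a file was parsed

    def add(self, release, filename, filerev):
        entity = (filename, filerev)
        # Membership in the release is always recorded...
        self.releases[release].add(entity)
        # ...but the (expensive) parse happens once per distinct entity.
        if entity not in self.indexed:
            self.indexed.add(entity)
            self.parse_count += 1
```

Adding ("a.c", "1.1") to both "v1" and "v2" records it in each release but increments parse_count only once.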

Plain.pm then emulates this by using timestamps as the revision history 
of the file.  Therefore if you have two directories v1 and v2, and they 
both contain a file X with the same timestamp, LXR will treat them as 
the same file.  The easiest way to see this is to symlink between the 
two version directories - indexing will occur only once.  This mechanism 
only has one way of breaking - two files with the same name and 
timestamp that are actually not identical.  However, it would be a 
strange revision control system that would give you such files - most 
systems either give you the time of checkin (in which case the 
timestamps would not be identical) or the time of checkout (similarly, 
the timestamps should not be equal since the two files probably weren't 
written at the same time).  

The result of this is that with Plain.pm LXR will sometimes index the 
file more often than it needs to, but it should not decide not to index 
it when it does need to.   Using the Plain.pm backend essentially trades 
off diskspace for ease of use.

>I have changed filerev() for this purpose as
>
>sub filerev {
> my ($self, $filename, $release) = @_;
>
> # TODO: length of filename+revision
> #       might turn out to be > 255 chars 
> #       [length used in the db]
>        return join("-", $filename,
>                                $release);
>}
>
>With this modification filerev() will return the same value 
>for (file,revision) pair every time - thus solving the 
>problem.
>
Nope, this is wrong for the reasons outlined above.  This returns the 
same value for filename, release every time - which is not the same as 
filename, revision.

>I have a patch ready for this.
>
When you have patches, please send them to the list.  It's usually 
easier to discuss code changes with the actual diff than to talk about 
the description - since the description and the diff may in fact differ, 
and either could be the Right Thing.

Cheers,

Malcolm


Re: [Lxr-dev] [ lxr-Bugs-518365 ] Indexing of files once indexed is buggy!

From: Arne G. G. <ar...@li...> - 2002-05-01 11:31:22

* Malcolm Box
> Again, moving discussion of this to the list.

Forgive me for butting in this late, I'm just trying to catch up.
I'll just add a few comments to this issue.  Just whack me if I'm
stating stuff you've already covered.

> Plain.pm then emulates this by using timestamps as the revision
> history of the file.  Therefore if you have two directories v1 and v2,
> and they both contain a file X with the same timestamp, LXR will treat
> them as the same file.  The easiest way to see this is to symlink
> between the two version directories - indexing will occur only once.
> This mechanism only has one way of breaking - two files with the same
> name and timestamp that are actually not identical.  However, it would
> be a strange revision control system that would give you such files -
> most systems either give you the time of checkin (in which case the
> timestamps would not be identical) or the time of checkout (similarly,
> the timestamps should not be equal since the two files probably
> weren't written at the same time).  The result of this is that with
> Plain.pm LXR will sometimes index the file more often than it needs
> to, but it should not decide not to index it when it does need to.
> Using the Plain.pm backend essentially trades off diskspace for ease
> of use.

This is a bit non-obvious, and I'm happy to see that you're able to
summarise the issue so clearly when someone suggests to mangle stuff
we've actually struggled a fair bit with in the past. :)  As a note,
Plain.pm includes the file size in the revision string, which means
that files would have to have the same timestamp and size as well as
different contents for LXR to fail to index changed files.
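
A rough illustration of a revision string built from timestamp and size, written in Python. The thread does not give the exact format Plain.pm uses, so the "mtime-size" layout and the function names here are assumptions.

```python
import os
import tempfile

def plain_filerev(path):
    """Hypothetical revision string: modification time plus file size."""
    st = os.stat(path)
    return f"{int(st.st_mtime)}-{st.st_size}"

def demo():
    """Rewrite a file with a forced identical timestamp and compare revisions."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "test.c")
        with open(path, "w") as f:
            f.write("#define TEST 100\n")
        os.utime(path, (1000, 1000))       # pin atime and mtime
        before = plain_filerev(path)
        with open(path, "w") as f:
            f.write("#define T 1\n#define TEST 100\n")
        os.utime(path, (1000, 1000))       # same timestamp as before
        after = plain_filerev(path)
        return before, after
```

Even with the timestamp pinned, the two strings returned by demo() differ because the size changed. Only an edit that preserves both timestamp and size would go unnoticed.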

As far as solutions to this problem go; even with Plain.pm we have
some notion of the set of files belonging to a particular release.
Thus, when indexing a release and encountering a (filename,
revision)-tuple belonging to it, we could invalidate all non-matching
(filename, *)-tuples marked as belonging to the same release (and no
other releases).  In doing this, we would also need to invalidate the
reference-information for this release.  As long as we do that we'd be
home free as far as database integrity is concerned, as far as I can
see.

(A possible shortcut would be to index (filename, rev2) before
(possibly) invalidating the information for (filename, rev1) and only
invalidate the reference-information if we find that the two define
non-matching sets of symbols.)
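
The invalidation rule can be sketched as follows, in Python and with invented names. The function replaces any non-matching (filename, *) tuple in the release being indexed, and reports which of the replaced tuples no longer belong to any release and could therefore have their index data deleted. Invalidating the reference information for the release is not modelled here.

```python
def record(releases, release, filename, filerev):
    """Add (filename, filerev) to a release and return orphaned entities.

    releases maps a release name to a set of (filename, filerev) tuples.
    """
    current = releases.setdefault(release, set())
    # Other revisions of the same file recorded for this release.
    stale = {e for e in current if e[0] == filename and e[1] != filerev}
    current -= stale                  # in-place: updates the set held in releases
    current.add((filename, filerev))
    # An entity still listed by some other release must be kept.
    in_use = set().union(*releases.values())
    return {e for e in stale if e not in in_use}
```

For example, if "head" and "v1" both contain ("a.c", "1.1"), recording ("a.c", "1.2") for "head" returns an empty set, because "v1" still needs the old entity.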


							Arne.

Re: [Lxr-dev] [ lxr-Bugs-518365 ] Indexing of files once indexed is buggy!

From: Malcolm B. <ma...@br...> - 2002-05-01 12:01:02

Arne Georg Gleditsch wrote:

>This is a bit non-obvious, and I'm happy to see that you're able to
>summarise the issue so clearly when someone suggests to mangle stuff
>we've actually struggled a fair bit with in the past. :)  As a note,
>Plain.pm includes the file size in the revision string, which means
>that files would have to have the same timestamp and size as well as
>different contents for LXR to fail to index changed files.
>
I hadn't realised that size was included - that makes it even more 
robust.  Certainly the terminology of releases, revisions etc is not 
that clear - it took me a while to get my head round it.  Moving to the 
idea of being able to index a "HEAD" revision (ie one that is evolving) 
clearly challenges some of the assumptions in the code, though not the 
overall semantic model.

>As far as solutions to this problem go; even with Plain.pm we have
>some notion of the set of files belonging to a particular release.
>Thus, when indexing a release and encountering a (filename,
>revision)-tuple belonging to it, we could invalidate all non-matching
>(filename, *)-tuples marked as belonging to the same release (and no
>other releases).  In doing this, we would also need to invalidate the
>reference-information for this release.  As long as we do that we'd be
>home free as far as database integrity is concerned, as far as I can
>see.
>
Indeed, this works very well.  I have got it going for the Postgres 
backend, since the nice referential integrity triggers make this kind of 
cascading delete very easy.  Unfortunately I haven't completed the port 
to the MySQL backend, since that takes much more manual grovelling to 
clean up.  This also won't be hitting the CVS repository for a while 
since the code is on a laptop which is being shipped from Japan to the 
UK and so is now bobbing around on the Pacific ocean at a guess :-)  Of 
course, I might get frustrated enough with the bug to just re-code the 
fix, but the "drop  and rebuild the db every now and then" fix is 
working for me at the moment.

>(A possible shortcut would be to index (filename, rev2) before
>(possibly) invalidating the information for (filename, rev1) and only
>invalidate the reference-information if we find that the two define
>non-matching sets of symbols.)
>
It's probably more effort to track the new stuff and compare with the 
old than simply to delete and re-add.  The big problem (that's just 
occurred to me) is with the useage table.  If the new revision of the 
file defines new symbols, then for total accuracy all existing files 
need to be re-referenced to see if they use that symbol.  Luckily it's 
extremely unlikely that someone would add a new symbol in a file that 
retrospectively re-defines symbols in other files, but I guess it is a 
theoretical possibility.

This is also the reason why a "index file, reference file" loop doesn't 
work, rather than the "index all files", "reference all files"  approach 
taken at the moment.
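
A toy comparison of the two orderings, in Python. It only recognises "#define NAME" lines as definitions and is in no way LXR's parser; all function names are made up for the example.

```python
import re

IDENT = re.compile(r"[A-Za-z_]\w*")

def definitions(text):
    """Names introduced by '#define NAME ...' lines."""
    return set(re.findall(r"^#define\s+(\w+)", text, re.MULTILINE))

def references(text, known):
    """Known identifiers that the text mentions."""
    return {tok for tok in IDENT.findall(text) if tok in known}

def two_pass(files):
    """Index all files first, then reference all files."""
    known = set()
    for text in files.values():
        known |= definitions(text)
    return {name: references(text, known) for name, text in files.items()}

def interleaved(files):
    """Index a file and reference it immediately."""
    known, refs = set(), {}
    for name, text in files.items():
        known |= definitions(text)
        refs[name] = references(text, known)
    return refs
```

With files = {"a.c": "int x = LIMIT;\n", "b.h": "#define LIMIT 10\n"}, the two-pass version finds the use of LIMIT in a.c, whereas the interleaved version misses it because b.h has not been indexed when a.c is referenced.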

Cheers,

Malcolm

P.S.  Good to see you back on the list again :-)

[Lxr-dev] [ lxr-Bugs-518365 ] Indexing of files once indexed is buggy!

From: <no...@so...> - 2002-06-07 13:10:35

Comment By: Gregor Hartmann (grex)
Date: 2002-06-07 13:10

Message:
Logged In: YES 
user_id=559509

Another similar problem would be files or whole directories that are deleted from the source tree. They would stay 
in the database forever as well. 

Maybe it could be fixed by iterating through all files in the database and removing those (from the database) which 
have changed or were removed in the source tree, 
then proceeding with indexing as before.
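
The pruning pass suggested here might look roughly like this. It is a Python sketch over plain dictionaries; how LXR would actually enumerate database entries is not stated in the thread, so both inputs and the function name are assumptions.

```python
def prune(indexed, tree):
    """Split database entries into those to keep and those to delete.

    indexed: dict of filename -> revision string stored in the database
    tree:    dict of filename -> revision string currently in the source tree
    """
    # Keep an entry only if the file still exists with the same revision.
    keep = {f: rev for f, rev in indexed.items() if tree.get(f) == rev}
    # Everything else was either changed or removed from the tree.
    drop = sorted(set(indexed) - set(keep))
    return keep, drop
```

Files listed in the second return value would be removed from the database before the normal indexing run, which then re-adds the changed ones.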


----------------------------------------------------------------------

Comment By: Shree Kumar (shreekumar)
Date: 2002-02-19 07:21

Message:
Logged In: YES 
user_id=142912

Here's my fix for this bug:

Add a field "timestamp" to the "status" table. And remove 
the "status" field.

Before finding identifiers in a file, check whether it's 
modification time is greater that it was previously. If 
yes, then remove all the identifier definitions due to this 
file [and release] from the database. Store the new 
timestamp in the database.

Before finding references in a file, remove all identifier 
references due to this file [and release] from the database.
[ No need to check the timestamp in this case since 
the "definitions" are always found before the references]. 

In a large CVS tree, it is quite possible that a file may 
change between the time it is "indexed" and "referenced". 
An easy way out of this seems to be to "index" a file and 
immediately "reference" it.
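
As a rough sketch of the steps above (an illustration under stated assumptions, not the actual patch): %timestamp and %defs are hypothetical in-memory stand-ins for the "status" and definitions tables, and find_defines is a toy indexer that only recognises #define lines.

```perl
use strict;
use warnings;

# Hypothetical stand-ins: fileid => last indexed mtime, and
# fileid => [ [symbol, line], ... ].
my (%timestamp, %defs);

# Toy indexer: record every "#define NAME" with its line number.
sub find_defines {
    my ($text) = @_;
    my ($line, @found) = (0);
    for (split /\n/, $text) {
        $line++;
        push @found, [ $1, $line ] if /^\s*#\s*define\s+(\w+)/;
    }
    return \@found;
}

# Re-index only when the file is new or its mtime has advanced, and purge
# the old definitions first so stale rows cannot accumulate.
sub reindex {
    my ($fileid, $mtime, $text) = @_;
    return 0 if exists $timestamp{$fileid} && $mtime <= $timestamp{$fileid};
    delete $defs{$fileid};                  # drop rows from the previous run
    $defs{$fileid}      = find_defines($text);
    $timestamp{$fileid} = $mtime;           # remember what was indexed
    return 1;
}

# The scenario from the initial report: test.c gains a line above TEST.
reindex('test.c-head', 100, "#define TEST 100\n");
reindex('test.c-head', 200, "#define T 1\n#define TEST 100\n");
print join(', ', map { "$_->[0]\@$_->[1]" } @{ $defs{'test.c-head'} }), "\n";
```

With the purge in place, TEST is listed once, at line 2, instead of at both line 1 and line 2.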

Related to this, there is a problem in "Plain.pm": the
current "filerev" function returns a value based on the
timestamp. The problem arises if a file changes between runs
of genxref: different values are
returned by "filerev" even though the same
(file, revision) pair is being indexed [or referenced].

I have changed filerev() for this purpose as follows:

sub filerev {
    my ($self, $filename, $release) = @_;

    # TODO: length of filename+revision
    #       might turn out to be > 255 chars
    #       [length used in the db]
    return join("-", $filename, $release);
}

With this modification filerev() will return the same value
for a given (file, revision) pair every time, thus solving the
problem.
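
One possible way to deal with the TODO about key length is sketched below. This is an assumption, not part of the patch described here: stable_filerev is a hypothetical standalone function, the 255 limit is taken from the TODO comment, and falling back to an MD5 digest is a swapped-in technique chosen only because Digest::MD5 ships with Perl.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Assumed column width, taken from the TODO in the posted filerev().
my $MAX_LEN = 255;

# Return the same key as the posted filerev() when it fits in the column;
# otherwise return a fixed-length hex digest of that key. Either way the
# result depends only on (filename, release), never on the timestamp.
sub stable_filerev {
    my ($filename, $release) = @_;
    my $key = join('-', $filename, $release);
    return length($key) <= $MAX_LEN ? $key : md5_hex($key);
}

print stable_filerev('/test.c', 'head'), "\n";
```

The trade-off is that digested keys are no longer human-readable, so this would only make sense if over-long paths actually occur in the indexed tree.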

I have a patch ready for this.


----------------------------------------------------------------------

Comment By: Malcolm Box (mbox)
Date: 2002-02-18 14:20

Message:
Logged In: YES 
user_id=215386

Yes, you're right, this is a bug.  The underlying assumption
that is being broken is that the files in a version are
static - which is true if one is indexing released software,
but not if it is a development tree.

The simplest work-around is to drop and recreate the
database each time, thus avoiding the problem.  For small to
medium repositories with the index updated nightly this
should work fine, but it doesn't work for large repositories.

The full solution would appear to be to check for an
existing entry for the (filename, release) pair and if it is
found delete it and all associated information.

----------------------------------------------------------------------

Comment By: Shree Kumar (shreekumar)
Date: 2002-02-16 13:32

Message:
Logged In: YES 
user_id=142912

There are two cases where the scenario that I've referred to applies:

1. Files are not in CVS [i.e. "Files.pm" is used]. You run genxref, then change a file and run genxref again.

2. Files are in CVS, and you want to index the "head" tag. Files change regularly, and you want to keep the
cross reference in sync, probably by running genxref once an hour or so [as a cron job].

----------------------------------------------------------------------

Comment By: Nobody/Anonymous (nobody)
Date: 2002-02-16 12:47

Message:
Logged In: NO 

I was under the impression that a file never changes once it has been indexed, unless it has either got a new CVS revision (or tag) or there is a new directory for a new version of the whole project (if it is not managed by CVS).

----------------------------------------------------------------------

You can respond by visiting: 
http://sourceforge.net/tracker/?func=detail&atid=390117&aid=518365&group_id=27350

[Lxr-dev] [ lxr-Bugs-518365 ] Indexing of files once indexed is buggy!

From: SourceForge.net <no...@so...> - 2003-04-08 00:22:45

Bugs item #518365, was opened at 2002-02-16 01:04
Message generated for change (Comment added) made by kisley
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=390117&aid=518365&group_id=27350

Category: genxref
Group: current cvs
Status: Open
Resolution: None
Priority: 7
Submitted By: Shree Kumar (shreekumar)
Assigned to: Malcolm Box (mbox)
Summary: Indexing of files once indexed is buggy!

----------------------------------------------------------------------

Comment By: Richard Kisley (kisley)
Date: 2003-04-07 16:39

Message:
Logged In: YES 
user_id=102080

(1) How about an intermediate solution, where someone writes
a VERIFY script which compares the paths in the database
with the version they refer to and deletes entries for
invalid paths?  Same options as genxref?

I indexed a subtree of a sourcetree, then realized I needed
to index the whole source tree.  So I moved the revision
main dir (since it was really a subdir) up a level and added
the other directories at their proper top level as other
subdirs, then re-indexed.  Now the original links in tree
one are all dead links, with live dupes.  No files changed. 

(2) So we all don't have to scurry for our SQL books, buried
in a box in the back of a closet at home (not work) how
about posting the exact drop syntax?  That might also be a
good thing to add to doc short-term, since genxref doesn't
work (prune) as expected. 

----------------------------------------------------------------------


[Lxr-dev] [ lxr-Bugs-518365 ] Indexing of files once indexed is buggy!

From: SourceForge.net <no...@so...> - 2004-03-12 18:07:45

Bugs item #518365, was opened at 2002-02-16 02:04
Message generated for change (Comment added) made by hjtoi
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=390117&aid=518365&group_id=27350

Category: genxref
Group: current cvs
Status: Open
Resolution: None
Priority: 7
Submitted By: Shree Kumar (shreekumar)
Assigned to: Malcolm Box (mbox)
Summary: Indexing of files once indexed is buggy!

----------------------------------------------------------------------

Comment By: Heikki Toivonen (hjtoi)
Date: 2004-03-12 09:48

Message:
Logged In: YES 
user_id=972898

Anyone have a patch for this?

----------------------------------------------------------------------


[Lxr-dev] [ lxr-Bugs-518365 ] Indexing of files once indexed is buggy!

From: SourceForge.net <no...@so...> - 2004-07-20 11:31:32

Bugs item #518365, was opened at 2002-02-16 02:04
Message generated for change (Comment added) made by nobody
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=390117&aid=518365&group_id=27350

Category: genxref
Group: current cvs
Status: Open
Resolution: None
Priority: 7
Submitted By: Shree Kumar (shreekumar)
Assigned to: Malcolm Box (mbox)
Summary: Indexing of files once indexed is buggy!

----------------------------------------------------------------------

Comment By: Nobody/Anonymous (nobody)
Date: 2004-07-20 04:31

Message:
Logged In: NO 

Still no fix for this?

----------------------------------------------------------------------


[Lxr-dev] [ lxr-Bugs-518365 ] Indexing of files once indexed is buggy!

From: SourceForge.net <no...@so...> - 2004-07-20 11:36:04

Bugs item #518365, was opened at 2002-02-16 02:04
Message generated for change (Comment added) made by nobody
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=390117&aid=518365&group_id=27350

Category: genxref
Group: current cvs
Status: Open
Resolution: None
Priority: 7
Submitted By: Shree Kumar (shreekumar)
Assigned to: Malcolm Box (mbox)
Summary: Indexing of files once indexed is buggy!

----------------------------------------------------------------------

Comment By: Nobody/Anonymous (nobody)
Date: 2004-07-20 04:36

Message:
Logged In: NO 

Shree, you said you had a patch; could you attach it to the
Tracker?
Thanks,
Dennis

----------------------------------------------------------------------

Comment By: Nobody/Anonymous (nobody)
Date: 2004-07-20 04:31

Message:
Logged In: NO 

Still no fix for this?

----------------------------------------------------------------------

Comment By: Heikki Toivonen (hjtoi)
Date: 2004-03-12 09:48

Message:
Logged In: YES 
user_id=972898

Anyone have a patch for this?

----------------------------------------------------------------------

Comment By: Richard Kisley (kisley)
Date: 2003-04-07 17:39

Message:
Logged In: YES 
user_id=102080

(1) How about an intermediate solution, where someone writes
a VERIFY script which compares the paths in the database
with the version they refer to and deletes entries for
invalid paths?  Same options as genxref?

I indexed a subtree of a sourcetree, then realized I needed
to index the whole source tree.  So I moved the revision
main dir (since it was really a subdir) up a level and added
the other directories at their proper top level as other
subdirs, then re-indexed.  Now the original links in tree
one are all dead links, with live dupes.  No files changed. 

(2) So we all don't have to scurry for our SQL books, buried
in a box in the back of a closet at home (not work) how
about posting the exact drop syntax?  That might also be a
good thing to add to doc short-term, since genxref doesn't
work (prune) as expected. 

----------------------------------------------------------------------

Comment By: Gregor Hartmann (grex)
Date: 2002-06-07 06:10

Message:
Logged In: YES 
user_id=559509

Another similar problem would be files ore whole directories that are deleted from the source tree. They would stay 
in the database forever as well. 

Maybe it could be fixed by iterating through all files in the database and removing those (from the database) which 
have changed or were removed in the source tree.
then proceed indexing as before.


----------------------------------------------------------------------

Comment By: Shree Kumar (shreekumar)
Date: 2002-02-18 23:21

Message:
Logged In: YES 
user_id=142912

Here's my fix for this bug:

Add a field &quot;timestamp&quot; to the &quot;status&quot; table. And remove 
the &quot;status&quot; field.

Before finding identifiers in a file, check whether it's 
modification time is greater that it was previously. If 
yes, then remove all the identifier definitions due to this 
file [and release] from the database. Store the new 
timestamp in the database.

Before finding references in a file, remove all identifier 
references due to this file [and release] from the database.
[ No need to check the timestamp in this case since 
the &quot;definitions&quot; are always found before the references]. 

In a large CVS tree, it is quite possible that a file may 
change between the time it is &quot;indexed&quot; and &quot;referenced&quot;. 
An easy way out of this seems to be to &quot;index&quot; a file and 
immediately &quot;reference&quot; it.

Related to this there is a problem in &quot;Plain.pm&quot; - the 
current  &quot;filerev&quot; function returns a value based on the 
timestamp.  Problem arises if a file changes between runs 
of genxref. What happens is that different values are 
returned by &quot;filerev&quot; even though it is the same 
(file,revision) pair is being indexed [or referenced].

I have changed filerev() for this purpose as

sub filerev {
 my ($self, $filename, $release) = @_;

 # TODO: length of filename+revision
 #       might turn out to be &gt; 255 chars 
 #       [length used in the db]
        return join(&quot;-&quot;, $filename,
                                $release);
}

With this modification filerev() will return the same value 
for (file,revision) pair everytime - thus solving the 
problem.

I have a patch ready for this.


----------------------------------------------------------------------

Comment By: Malcolm Box (mbox)
Date: 2002-02-18 06:20

Message:
Logged In: YES 
user_id=215386

Yes, you're right, this is a bug.  The underlying assumption
that is being broken is that the files in a version are
static - which is true if one is indexing released software,
but not if it is a development tree.

The simplest work-around is to drop and recreate the
database each time, thus avoiding the problem.  For small to
medium repositories with the index updated nightly this
should work fine, but it doesn't work for large repositories.

The full solution would appear to be to check for an
existing entry for the (filename, release) pair and if it is
found delete it and all associated information.

----------------------------------------------------------------------

Comment By: Shree Kumar (shreekumar)
Date: 2002-02-16 05:32

Message:
Logged In: YES 
user_id=142912

There are two cases where the scenario that I've referred to applies:

1. Files are not in CVS [i.e. "Files.pm" is used]. You run genxref, then change a file and run genxref again.

2. Files are in CVS, and you want to index the "head" tag. Files change regularly, and you want to keep the cross reference in sync, probably by running genxref once an hour or so [as a cron job].

----------------------------------------------------------------------

Comment By: Nobody/Anonymous (nobody)
Date: 2002-02-16 04:47

Message:
Logged In: NO 

I was under the impression that an indexed file never changes again, unless it gets a new CVS revision (or tag), or a new directory is created for a new version of the whole project (if it is not managed by CVS).

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=390117&aid=518365&group_id=27350

[Lxr-dev] [ lxr-Bugs-518365 ] Indexing of files once indexed is buggy!

From: SourceForge.net <no...@so...> - 2004-07-20 12:21:27

Bugs item #518365, was opened at 2002-02-16 05:04
Message generated for change (Comment added) made by brondsem
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=390117&aid=518365&group_id=27350

Category: genxref
Group: current cvs
Status: Open
Resolution: None
Priority: 7
Submitted By: Shree Kumar (shreekumar)
Assigned to: Malcolm Box (mbox)
Summary: Indexing of files once indexed is buggy!

----------------------------------------------------------------------

>Comment By: Dave Brondsema (brondsem)
Date: 2004-07-20 08:21

Message:
Logged In: YES 
user_id=341298

I have recently added the --reindexall option to genxref (in
CVS).  Please try and see if this works.  Perhaps it should
be default and not an option if it does.

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=390117&aid=518365&group_id=27350

[Lxr-dev] [ lxr-Bugs-518365 ] Indexing of files once indexed is buggy!

From: SourceForge.net <no...@so...> - 2004-07-20 14:42:25

Bugs item #518365, was opened at 2002-02-16 02:04
Message generated for change (Comment added) made by nobody
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=390117&aid=518365&group_id=27350

Category: genxref
Group: current cvs
Status: Open
Resolution: None
Priority: 7
Submitted By: Shree Kumar (shreekumar)
Assigned to: Malcolm Box (mbox)
Summary: Indexing of files once indexed is buggy!

----------------------------------------------------------------------

Comment By: Nobody/Anonymous (nobody)
Date: 2004-07-20 07:42

Message:
Logged In: NO 

Sorry, I can't access CVS (firewall).
Does your switch do the "intelligent" job described in the message
dated "2002-02-18 23:21" below?

----------------------------------------------------------------------

Comment By: Dave Brondsema (brondsem)
Date: 2004-07-20 05:21

Message:
Logged In: YES 
user_id=341298

I have recently added the --reindexall option to genxref (in
CVS).  Please try and see if this works.  Perhaps it should
be default and not an option if it does.

----------------------------------------------------------------------

Comment By: Nobody/Anonymous (nobody)
Date: 2004-07-20 04:36

Message:
Logged In: NO 

Shree, you've said you had a patch, could you attach to the 
Tracker?
Thanks,
Dennis

----------------------------------------------------------------------

Comment By: Nobody/Anonymous (nobody)
Date: 2004-07-20 04:31

Message:
Logged In: NO 

Still no fix for this?

----------------------------------------------------------------------

Comment By: Heikki Toivonen (hjtoi)
Date: 2004-03-12 09:48

Message:
Logged In: YES 
user_id=972898

Anyone have a patch for this?

----------------------------------------------------------------------

Comment By: Richard Kisley (kisley)
Date: 2003-04-07 17:39

Message:
Logged In: YES 
user_id=102080

(1) How about an intermediate solution, where someone writes
a VERIFY script which compares the paths in the database
with the version they refer to and deletes entries for
invalid paths?  Same options as genxref?

I indexed a subtree of a sourcetree, then realized I needed
to index the whole source tree.  So I moved the revision
main dir (since it was really a subdir) up a level and added
the other directories at their proper top level as other
subdirs, then re-indexed.  Now the original links in tree
one are all dead links, with live dupes.  No files changed. 

(2) So we all don't have to scurry for our SQL books, buried
in a box in the back of a closet at home (not work) how
about posting the exact drop syntax?  That might also be a
good thing to add to doc short-term, since genxref doesn't
work (prune) as expected. 
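
As a starting point for the "exact syntax" question, the sketch below prints SQL that empties the index tables rather than dropping them, so the schema stays in place. The table names are the ones that appear in the Mysql.pm statements quoted in this thread; the "lxr_" prefix is an assumption, and the statements have not been checked against a real LXR installation.

```perl
use strict;
use warnings;

# Assumption: the table prefix. Replace it with the prefix your installation uses.
my $prefix = "lxr_";

# Table names as they appear in the Mysql.pm statements quoted in this thread.
# Tables that refer to file or symbol ids are emptied first.
my @tables = qw(useage indexes status releases files symbols);

my @sql = map { "DELETE FROM ${prefix}$_;" } @tables;
print "$_\n" for @sql;
```

The output can be reviewed and then fed to the mysql client by hand; genxref has to be run from scratch afterwards.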

----------------------------------------------------------------------

Comment By: Gregor Hartmann (grex)
Date: 2002-06-07 06:10

Message:
Logged In: YES 
user_id=559509

Another similar problem is files or whole directories that are deleted from the source tree. They would stay
in the database forever as well.

Maybe it could be fixed by iterating through all files in the database and removing those (from the database) which
have changed or were removed in the source tree, then proceeding with indexing as before.


----------------------------------------------------------------------

Comment By: Shree Kumar (shreekumar)
Date: 2002-02-18 23:21

Message:
Logged In: YES 
user_id=142912

Here's my fix for this bug:

Add a "timestamp" field to the "status" table, and remove
the "status" field.

Before finding identifiers in a file, check whether its
modification time is later than the one recorded previously. If
so, remove all the identifier definitions due to this
file [and release] from the database. Store the new
timestamp in the database.

Before finding references in a file, remove all identifier
references due to this file [and release] from the database.
[No need to check the timestamp in this case, since
the "definitions" are always found before the references.]

In a large CVS tree, it is quite possible that a file may
change between the time it is "indexed" and "referenced".
An easy way out of this seems to be to "index" a file and
immediately "reference" it.
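
The steps above could look roughly like the following sketch. It uses in-memory hashes in place of the database, and process_file() with its callback arguments is a hypothetical helper invented for this illustration. It also folds the two passes into one per-file step, as the last paragraph suggests.

```perl
use strict;
use warnings;

# In-memory stand-ins for the database; all names here are hypothetical.
my %stored_mtime;    # plays the role of the proposed "timestamp" column
my (%definitions, %references);

sub process_file {
    my ($key, $mtime, $scan_defs, $scan_refs) = @_;

    # Unchanged since the last run: nothing to do.
    my $old = $stored_mtime{$key};
    return 0 if defined $old && $mtime <= $old;

    # Changed or new: discard stale rows, then index and reference back to
    # back so the file cannot change between the two steps.
    delete $definitions{$key};
    delete $references{$key};
    $definitions{$key}  = [ $scan_defs->() ];
    $references{$key}   = [ $scan_refs->() ];
    $stored_mtime{$key} = $mtime;
    return 1;
}

my $key    = "/test.c-head";
my $first  = process_file($key, 100, sub { ("TEST") },      sub { () });
my $second = process_file($key, 100, sub { ("TEST") },      sub { () });
my $third  = process_file($key, 200, sub { ("T", "TEST") }, sub { () });
print "$first $second $third\n";
```

The second call is skipped because the modification time has not advanced; the third call replaces the old definitions instead of appending to them.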

Related to this, there is a problem in "Plain.pm": the
current "filerev" function returns a value based on the
timestamp. A problem arises if a file changes between runs
of genxref: different values are returned by "filerev"
even though the same (file, revision) pair is being
indexed [or referenced].

I have changed filerev() as follows:

sub filerev {
        my ($self, $filename, $release) = @_;

        # TODO: length of filename+revision
        #       might turn out to be > 255 chars
        #       [length used in the db]
        return join("-", $filename, $release);
}

With this modification filerev() will return the same value
for a given (file, revision) pair every time, thus solving
the problem.

I have a patch ready for this.


----------------------------------------------------------------------

Comment By: Malcolm Box (mbox)
Date: 2002-02-18 06:20

Message:
Logged In: YES 
user_id=215386

Yes, you're right, this is a bug.  The underlying assumption
that is being broken is that the files in a version are
static - which is true if one is indexing released software,
but not if it is a development tree.

The simplest work-around is to drop and recreate the
database each time, thus avoiding the problem.  For small to
medium repositories with the index updated nightly this
should work fine, but it doesn't work for large repositories.

The full solution would appear to be to check for an
existing entry for the (filename, release) pair and if it is
found delete it and all associated information.

----------------------------------------------------------------------

Comment By: Shree Kumar (shreekumar)
Date: 2002-02-16 05:32

Message:
Logged In: YES 
user_id=142912

There are two cases where the scenario that I've referred to applies:

1. Files are not in CVS [i.e. "Files.pm" is used]. You run genxref, then change a file and run genxref again.

2. Files are in CVS, and you want to index the "head" tag. Files change regularly, and you want to keep the cross reference in sync, probably by running genxref once an hour or so [as a cron job].

----------------------------------------------------------------------

Comment By: Nobody/Anonymous (nobody)
Date: 2002-02-16 04:47

Message:
Logged In: NO 

I was under the impression that an indexed file never changes again, unless it gets a new CVS revision (or tag), or a new directory is created for a new version of the whole project (if it is not managed by CVS).

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=390117&aid=518365&group_id=27350

[Lxr-dev] [ lxr-Bugs-518365 ] Indexing of files once indexed is buggy!

From: SourceForge.net <no...@so...> - 2004-07-20 15:25:19

Bugs item #518365, was opened at 2002-02-16 05:04
Message generated for change (Comment added) made by brondsem
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=390117&aid=518365&group_id=27350

Category: genxref
Group: current cvs
Status: Open
Resolution: None
Priority: 7
Submitted By: Shree Kumar (shreekumar)
Assigned to: Malcolm Box (mbox)
Summary: Indexing of files once indexed is buggy!

----------------------------------------------------------------------

>Comment By: Dave Brondsema (brondsem)
Date: 2004-07-20 11:25

Message:
Logged In: YES 
user_id=341298

Yes.  For the version being indexed, it deletes all data
directly related to that version.

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=390117&aid=518365&group_id=27350

[Lxr-dev] [ lxr-Bugs-518365 ] Indexing of files once indexed is buggy!

From: SourceForge.net <no...@so...> - 2004-10-21 19:01:02

Bugs item #518365, was opened at 2002-02-16 04:04
Message generated for change (Comment added) made by bjjohnson
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=390117&aid=518365&group_id=27350

Category: genxref
Group: current cvs
Status: Open
Resolution: None
Priority: 7
Submitted By: Shree Kumar (shreekumar)
Assigned to: Malcolm Box (mbox)
Summary: Indexing of files once indexed is buggy!

----------------------------------------------------------------------

Comment By: Brian J. Johnson (bjjohnson)
Date: 2004-10-21 14:01

Message:
Logged In: YES 
user_id=85501

Here's a patch to provide incremental reindexing.  The --reindexall
option was useful, but not really what's required (at least, not what
I really wanted):  I want to avoid reindexing files which haven't
changed, remove all index info for files which have been deleted, and
correctly reindex (i.e. without duplication) files which have
changed.  This patch does so.  ItWorksForMe, please let me know if it
breaks for you.  (And if someone wants to add support for databases
besides mysql, please do!)

I _think_ I got the database manipulation right; please let me know if
I've made any bad assumptions about how the tables relate, or about
the semantics of the various fields.  I've tested this with
Files::Plain and Index::Mysql.

The patch is against lxr 0.9.3.

Thanks,
Brian

----------------------------------------------------------------------
Patch to add incremental indexing to lxr-0.9.3
By Brian J. Johnson 10/21/04

This patch adds incremental indexing to lxr-0.9.3.  That is, files
which have not changed are not reindexed, and files which have changed
or been removed have their old information erased from the database
before they are reindexed.  On my machine, this saves _hours_ of
reindexing time on trees which don't change much from day to day.

The patch modifies genxref to add an extra pass before the existing
gensearch, genindex, and genrefs passes.  For each file in the current
release, it retrieves the file's revision from the database and checks
if the file has changed in the file store:  i.e. $files->filerev() is
different from the revision in the database.  If the file has changed,
it calls $index->purgefile() to remove that fileid (i.e. [filename,
revision] pair) from the release.  If that fileid is no longer active
in any release, $index->purgefile() removes it from the other tables
as well.  Then the genindex and genrefs passes add the new revision of
the file (with a new fileid) to the database.
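
The purge rule described above can be modelled with a few hashes. This is only a sketch of the described behaviour under assumed data structures; it is not the purgefile() from the patch, which works through prepared SQL statements.

```perl
use strict;
use warnings;

# In-memory stand-ins for the tables the patch touches (contents are made up).
my %releases = ( 1 => { "v1" => 1, "v2" => 1 }, 2 => { "v2" => 1 } );
my %files    = ( 1 => [ "/a.c", "rev1" ], 2 => [ "/b.c", "rev1" ] );
my %indexes  = ( 1 => ["foo"], 2 => ["bar"] );

sub purgefile {
    my ($fileid, $release) = @_;

    # Detach the fileid from this release only.
    delete $releases{$fileid}{$release};

    # Still active in another release: keep its data.
    return if %{ $releases{$fileid} };

    # No release uses it any more: remove it from the other tables too.
    delete $releases{$fileid};
    delete $files{$fileid};
    delete $indexes{$fileid};
    return;
}

purgefile(1, "v2");    # fileid 1 is still part of v1, so it survives
purgefile(2, "v2");    # fileid 2 was only in v2, so it is removed everywhere
print join(",", sort keys %files), "\n";
```

Only fileid 1 remains, because it is still referenced by release v1.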

Symbols can still be left around (as with the "reindexall" option), so
the administrator should drop and regenerate the database completely
from scratch every so often.

I haven't tried to add support for any database besides mysql, as I
have no way to test the changes.  It should be pretty easy for others
to add it, though.  (I don't really even know SQL, and I could do the
mysql port in a few hours, working solely from the documentation at
mysql.com.)

Questions and comments:
- $self->{releases_select} was not being undef-ed in the DESTROY
  routine.  Added that.
- When is it necessary to call $self->{xxx}->finish()?  Some dbi
  queries seem to do so, and others don't.



Index: lxr-0.9.3/genxref
===================================================================
--- lxr-0.9.3.orig/genxref	2004-10-21 13:19:44.000000000 -0500
+++ lxr-0.9.3/genxref	2004-10-21 13:38:51.000000000 -0500
@@ -89,11 +89,29 @@
 
 foreach my $version (@versions) {
 	$index->purge($version) if $option{'reindexall'};
+	cleanindex($version);
 	gensearch($version);
 	genindex('/', $version);
 	genrefs('/', $version);
 }
 
+sub cleanindex {
+	my ($release) = @_;
+	my ($f, @files);
+
+	@files = $index->getfiles($release);
+	foreach $f (@files) {
+		# $f == [filename, fileid, revision]
+		# Skip this file if it is still at the same revision
+		next if $files->filerev($f->[0], $release) eq $f->[2];
+
+		print(STDERR "%%% DELETED/MODIFIED: ", join(" ", @$f), "\n");
+
+		# Remove old revision from this release.
+		$index->purgefile($f->[1], $release);
+	}
+}
+
 sub genindex {
 	my ($pathname, $release) = @_;
 
Index: lxr-0.9.3/lib/LXR/Index/Mysql.pm
===================================================================
--- lxr-0.9.3.orig/lib/LXR/Index/Mysql.pm	2004-10-21 13:19:44.000000000 -0500
+++ lxr-0.9.3/lib/LXR/Index/Mysql.pm	2004-10-21 13:39:07.000000000 -0500
@@ -56,6 +56,11 @@
 	  $self->{dbh}
 	  ->prepare("insert into ${prefix}files (filename, revision, fileid) values (?, ?, NULL)");
 
+	$self->{allfiles_select} =
+		$self->{dbh}->prepare("select f.filename, f.fileid, f.revision "
+		  . "from ${prefix}files f, ${prefix}releases r "
+		  . "where  f.fileid = r.fileid and r.release = ?");
+
 	$self->{symbols_byname} =
 	  $self->{dbh}->prepare("select symid from ${prefix}symbols where  symname = ?");
 	$self->{symbols_byid} =
@@ -81,6 +86,10 @@
 	  $self->{dbh}->prepare("select * from ${prefix}releases where fileid = ? and  release = ?");
 	$self->{releases_insert} =
 	  $self->{dbh}->prepare("insert into ${prefix}releases (fileid, release) values (?, ?)");
+	$self->{releases_delete} =
+	  $self->{dbh}->prepare("delete from ${prefix}releases where fileid = ? and  release = ?");
+	$self->{releases_select_file} =
+	  $self->{dbh}->prepare("select * from ${prefix}releases where fileid = ?");
 
 	$self->{status_get} =
 	  $self->{dbh}->prepare("select status from ${prefix}status where fileid = ?");
@@ -134,6 +143,19 @@
 		  . "where f.fileid = r.fileid "
 		  . "and r.release = ?");
 
+	$self->{indexes_del_fileid} =
+	  $self->{dbh}->prepare("delete from ${prefix}indexes "
+		  . "where ${prefix}indexes.fileid = ?");
+	$self->{useage_del_fileid} =
+	  $self->{dbh}->prepare("delete from ${prefix}useage "
+		  . "where ${prefix}useage.fileid = ?");
+	$self->{status_del_fileid} =
+	  $self->{dbh}->prepare("delete from ${prefix}status "
+		  . "where ${prefix}status.fileid = ?");
+	$self->{files_del_fileid} =
+	  $self->{dbh}->prepare("delete from ${prefix}files "
+		  . "where ${prefix}files.fileid = ?");
+
 	return $self;
 }
 
@@ -326,6 +348,47 @@
 	return $id;
 }
 
+# List all files in this release
+sub getfiles {
+	my ($self, $release) = @_;
+	my ($rows, @ret);
+
+	$rows = $self->{allfiles_select}->execute("$release");
+
+	while ($rows-- > 0) {
+		push(@ret, [ $self->{allfiles_select}->fetchrow_array ] );
+	}
+
+	$self->{allfiles_select}->finish();
+
+	return @ret;
+}
+
+# Remove all references to $fileid from $release
+sub purgefile {
+	my ($self, $fileid, $release) = @_;
+
+    # Remove $fileid from $release
+	$self->{releases_delete}->execute($fileid, $release);
+	$self->{releases_delete}->finish();
+
+	# Find how many releases still reference $fileid
+	my $rows = $self->{releases_select_file}->execute($fileid + 0);
+	$self->{releases_select_file}->finish();
+
+	# If none, remove fileid from all other tables
+	unless ($rows > 0) {
+		# we don't delete symbols, because they might be used by other
+		# versions so we can end up with unused symbols, but that
+		# doesn't cause any problems.  Drop and rebuild the database
+		# from time to time if it bothers you.
+		$self->{indexes_del_fileid}->execute($fileid);
+		$self->{useage_del_fileid}->execute($fileid);
+		$self->{status_del_fileid}->execute($fileid);
+		$self->{files_del_fileid}->execute($fileid);
+	}
+}
+	
 sub purge {
 	my ($self, $version) = @_;
 
@@ -342,11 +405,15 @@
 	my ($self) = @_;
 	$self->{files_select}    = undef;
 	$self->{files_insert}    = undef;
+	$self->{allfiles_select} = undef;
 	$self->{symbols_byname}  = undef;
 	$self->{symbols_byid}    = undef;
 	$self->{symbols_insert}  = undef;
 	$self->{indexes_insert}  = undef;
+	$self->{releases_select} = undef;
 	$self->{releases_insert} = undef;
+	$self->{releases_delete} = undef;
+	$self->{releases_select_file} = undef;
 	$self->{status_insert}   = undef;
 	$self->{status_update}   = undef;
 	$self->{usage_insert}    = undef;
@@ -358,6 +425,10 @@
 	$self->{delete_status}   = undef;
 	$self->{delete_releases} = undef;
 	$self->{delete_files}    = undef;
+	$self->{indexes_del_fileid} = undef;
+	$self->{useage_del_fileid}  = undef;
+	$self->{status_del_fileid}  = undef;
+	$self->{files_del_fileid}   = undef;
 
 	if ($self->{dbh}) {
 		$self->{dbh}->disconnect();



----------------------------------------------------------------------

Comment By: Dave Brondsema (brondsem)
Date: 2004-07-20 10:25

Message:
Logged In: YES 
user_id=341298

Yes.  For the version being indexed, it deletes all data
directly related to that version.

----------------------------------------------------------------------

Comment By: Nobody/Anonymous (nobody)
Date: 2004-07-20 09:42

Message:
Logged In: NO 

Sorry, I can't access CVS (firewall).
Does your switch do the "intelligent" job described in the message 
dated "2002-02-18 23:21" below? 

----------------------------------------------------------------------

Comment By: Dave Brondsema (brondsem)
Date: 2004-07-20 07:21

Message:
Logged In: YES 
user_id=341298

I have recently added the --reindexall option to genxref (in
CVS).  Please try and see if this works.  Perhaps it should
be default and not an option if it does.

----------------------------------------------------------------------

Comment By: Nobody/Anonymous (nobody)
Date: 2004-07-20 06:36

Message:
Logged In: NO 

Shree, you said you had a patch; could you attach it to the 
Tracker?
Thanks,
Dennis

----------------------------------------------------------------------

Comment By: Nobody/Anonymous (nobody)
Date: 2004-07-20 06:31

Message:
Logged In: NO 

Still no fix for this?

----------------------------------------------------------------------

Comment By: Heikki Toivonen (hjtoi)
Date: 2004-03-12 11:48

Message:
Logged In: YES 
user_id=972898

Anyone have a patch for this?

----------------------------------------------------------------------

Comment By: Richard Kisley (kisley)
Date: 2003-04-07 19:39

Message:
Logged In: YES 
user_id=102080

(1) How about an intermediate solution, where someone writes
a VERIFY script which compares the paths in the database
with the version they refer to and deletes entries for
invalid paths?  Same options as genxref?

I indexed a subtree of a sourcetree, then realized I needed
to index the whole source tree.  So I moved the revision
main dir (since it was really a subdir) up a level and added
the other directories at their proper top level as other
subdirs, then re-indexed.  Now the original links in tree
one are all dead links, with live dupes.  No files changed. 

(2) So we all don't have to scurry for our SQL books, buried
in a box in the back of a closet at home (not work) how
about posting the exact drop syntax?  That might also be a
good thing to add to doc short-term, since genxref doesn't
work (prune) as expected. 

----------------------------------------------------------------------

Comment By: Gregor Hartmann (grex)
Date: 2002-06-07 08:10

Message:
Logged In: YES 
user_id=559509

Another similar problem would be files or whole directories that are deleted from the source tree. They would stay 
in the database forever as well. 

Maybe it could be fixed by iterating through all files in the database and removing those (from the database) which 
have changed or were removed in the source tree, and then proceeding with indexing as before.
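
The iteration suggested above might look roughly like this.  It is a
minimal Python sketch under stated assumptions: the database contents and
the current tree are passed in as plain dictionaries, revisions are
compared as strings, and find_stale is a hypothetical name, not an LXR
function.

```python
def find_stale(indexed, current):
    """Return (filename, fileid) pairs whose indexed revision is out of date.

    indexed: {filename: (fileid, revision)} as recorded in the database
    current: {filename: revision} for files present in the tree right now
    """
    stale = []
    for filename, (fileid, revision) in sorted(indexed.items()):
        # A deleted file has no current revision, so it never matches.
        if current.get(filename) != revision:
            stale.append((filename, fileid))
    return stale


indexed = {"/a.c": (1, "1.1"), "/b.c": (2, "1.10"), "/gone.c": (3, "1.4")}
current = {"/a.c": "1.1", "/b.c": "1.1"}
stale = find_stale(indexed, current)
# /b.c changed revision and /gone.c was deleted; /a.c is untouched.
```

Comparing revisions as strings matters here: "1.10" and "1.1" are
different revisions, although they would be equal as numbers.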


----------------------------------------------------------------------

Comment By: Shree Kumar (shreekumar)
Date: 2002-02-19 01:21

Message:
Logged In: YES 
user_id=142912

Here's my fix for this bug:

Add a field "timestamp" to the "status" table, and remove 
the "status" field.

Before finding identifiers in a file, check whether its 
modification time is later than it was previously. If 
so, remove all the identifier definitions due to this 
file [and release] from the database, then store the new 
timestamp in the database.

Before finding references in a file, remove all identifier 
references due to this file [and release] from the database.
[ No need to check the timestamp in this case since 
the "definitions" are always found before the references]. 

In a large CVS tree, it is quite possible that a file may 
change between the time it is "indexed" and "referenced". 
An easy way out of this seems to be to "index" a file and 
immediately "reference" it.
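
As an illustration of the timestamp check described above, here is a
minimal Python sketch that replays the test.c scenario from the original
report.  Everything in it is assumed for the example: the in-memory Index
class, the macro_defs helper, and the "head" release name are hypothetical
and only handle #define lines.

```python
import re

DEFINE = re.compile(r"\s*#\s*define\s+(\w+)")


def macro_defs(text):
    """Return [(name, line_number)] for each #define line in text."""
    found = []
    for number, line in enumerate(text.splitlines(), 1):
        match = DEFINE.match(line)
        if match:
            found.append((match.group(1), number))
    return found


class Index:
    def __init__(self):
        self.timestamps = {}  # (path, release) -> mtime seen at last indexing
        self.defs = {}        # (path, release) -> [(name, line_number)]

    def process(self, path, release, mtime, text):
        """Reindex path if it is newer than the stored timestamp."""
        key = (path, release)
        if key in self.timestamps and mtime <= self.timestamps[key]:
            return False  # unchanged: keep the existing definitions
        # Assigning a fresh list discards definitions from the old contents.
        self.defs[key] = macro_defs(text)
        # References would be rebuilt here too, right after the definitions.
        self.timestamps[key] = mtime
        return True


def demo():
    index = Index()
    index.process("test.c", "head", 100, "#define TEST 100\n")
    index.process("test.c", "head", 200, "#define T 1\n#define TEST 100\n")
    skipped = not index.process("test.c", "head", 200, "ignored")
    return index.defs[("test.c", "head")], skipped
```

After the second run TEST is recorded once, at line 2, because the old
definitions are dropped before the file is indexed again.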

Related to this, there is a problem in "Plain.pm": the 
current "filerev" function returns a value based on the 
timestamp.  A problem arises if a file changes between runs 
of genxref: different values are returned by "filerev" 
even though the same (file,revision) pair is being 
indexed [or referenced].

I have changed filerev() for this purpose as

sub filerev {
    my ($self, $filename, $release) = @_;

    # TODO: length of filename+revision
    #       might turn out to be > 255 chars
    #       [length used in the db]
    return join("-", $filename, $release);
}

With this modification filerev() will return the same value 
for a (file,revision) pair every time, thus solving the 
problem.
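
To make the difference concrete, here is a small Python sketch of the two
styles of revision key.  Both function names are hypothetical, the
timestamp variant is only a stand-in for the current behaviour, and the
length guard is one possible way to handle the TODO rather than part of
the change above.

```python
MAX_REVISION_LENGTH = 255  # column width mentioned in the TODO above


def filerev_by_timestamp(mtime):
    """Stand-in for a timestamp-based revision: changes whenever the file does."""
    return str(mtime)


def filerev_by_name(filename, release):
    """Revision in the proposed style: fixed for a (file, release) pair."""
    revision = "-".join((filename, release))
    if len(revision) > MAX_REVISION_LENGTH:
        raise ValueError("revision too long for the database column")
    return revision


# Two runs of genxref with an edit in between (mtime 100, then 200):
changed = filerev_by_timestamp(100) != filerev_by_timestamp(200)
stable = filerev_by_name("/test.c", "head") == filerev_by_name("/test.c", "head")
```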

I have a patch ready for this.


----------------------------------------------------------------------

Comment By: Malcolm Box (mbox)
Date: 2002-02-18 08:20

Message:
Logged In: YES 
user_id=215386

Yes, you're right, this is a bug.  The underlying assumption
that is being broken is that the files in a version are
static - which is true if one is indexing released software,
but not if it is a development tree.

The simplest work-around is to drop and recreate the
database each time, thus avoiding the problem.  For small to
medium repositories with the index updated nightly this
should work fine, but it doesn't work for large repositories.

The full solution would appear to be to check for an
existing entry for the (filename, release) pair and if it is
found delete it and all associated information.

----------------------------------------------------------------------

Comment By: Shree Kumar (shreekumar)
Date: 2002-02-16 07:32

Message:
Logged In: YES 
user_id=142912

There are two cases where the scenario that I've referred to applies:

1. Files are not in CVS [ i.e. usage of "Files.pm" ]. You run genxref, then change a file & run genxref again

2. Files are in CVS, and you want to index the "head" tag. Files change regularly, and you want to keep the 
cross reference in sync - probably by running genxref once an hour or so [as a cron job].

----------------------------------------------------------------------

Comment By: Nobody/Anonymous (nobody)
Date: 2002-02-16 06:47

Message:
Logged In: NO 

I was under the impression that a file never changes again unless it has either got a new CVS revision (or tag), or there is a new directory for a new version of the whole project (if it is not managed by CVS).

----------------------------------------------------------------------
