I have a multiple-tree, multiple-branch setup of LXR. For one of the trees I have 4 versions, which I reindex regularly with the -reindex all option. It used to work in version 0.11, but now each indexing run creates duplicate entries in lxr_usages, so an identifier search returns each hit as many times as I have run lxr.
Check that you typed -reindexall as a single word. The Perl documentation says options can be abbreviated to a unique prefix (so -reindex is OK), but I haven't tested what happens with a stray argument 'all' (it is not an option, so it should be ignored, but who knows with computers?)
-reindexall uses brute force: it deletes the DB before indexing. Consequently you should not get duplicates.
Incremental indexing has been fully rewritten in v1.0 to prune the DB. My tests were correct but they only targeted small trees; I was not patient enough to try with a kernel tree.
1- There are differences between the DB engines. Which is yours? MySQL, PostgreSQL or SQLite? I made no tests on Oracle.
2- When you run genxref, are the correct files purged? You can tell from what genxref prints immediately after the initial checks ("Selective database purge" section):
-- release_code filename version: result
Result may be "purged" when all symbols for this version have been purged from the DB, or "not purgeable yet" when genxref has detected that this file version is shared between different branches of the tree (with the other branches still alive). But even in the latter case, references for the reindexed version should be removed from the DB.
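The expected behaviour can be illustrated with a minimal sketch. It is written in Python with sqlite3, not taken from genxref's Perl code, and the table and column names are simplified assumptions rather than the real LXR schema. The point is only that purging a file's rows before reinserting them makes a second indexing run leave the same rows as the first.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified table: one row per symbol usage in a file.
conn.execute("CREATE TABLE usages (symid INTEGER, fileid INTEGER, line INTEGER)")

def reindex_file(conn, fileid, found):
    """Replace all usages recorded for fileid with the freshly parsed ones."""
    with conn:  # one transaction: purge and reinsert succeed or fail together
        conn.execute("DELETE FROM usages WHERE fileid = ?", (fileid,))
        conn.executemany(
            "INSERT INTO usages VALUES (?, ?, ?)",
            [(symid, fileid, line) for symid, line in found],
        )

parsed = [(1, 42), (1, 57), (2, 60)]  # (symid, line) pairs, as a parser might report
reindex_file(conn, 10, parsed)
reindex_file(conn, 10, parsed)        # second run must not add rows

row_count = conn.execute("SELECT COUNT(*) FROM usages").fetchone()[0]
print(row_count)
```

If the row count grows with every run instead of staying constant, the purge step is not matching the rows that the previous run inserted.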
Waiting for your information, best regards
For sure I had -reindexall typed as one word (I switched it off). I should mention that I parse a different version every day (I have 4 versions), so I also use the -version parameter. I am running on PostgreSQL. I also found that lxr_definitions has duplicates. Why is there no primary key on those tables? It would prevent such situations … I think (or maybe I don't know the DB structure well enough).
As I said, I switched off -reindexall and I still get duplicates.
As for the "Selective database purge" log, there is almost nothing between this log message and "Indexing Data Source: "External-Program". Across my different versions most of the files are the same; only the code inside changes.
I can't reproduce the bug on my LXR tree under Postgres. In my "selective database purge" section I have both "purged" and "not purgeable yet" files. (The latter means some other version is still using the base file.)
The "purged" files are correctly reindexed, no duplicates.
The "not purgeable yet" files are also OK.
From this test, I'm rather confident in the implementation of the DB queries. Maybe the problem lies in one of the parsers. Which language leads to duplicates? Does duplication happen on declarations, usages or both?
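To answer the last question, a grouped count on each table shows which rows are stored more than once. Here is a minimal sketch in Python with sqlite3; the table name usages and its three columns are a simplified assumption, not the real lxr_usages or lxr_definitions layout, so the column list must be adapted. The SELECT itself is plain SQL and should run unchanged on PostgreSQL.

```python
import sqlite3

# Hypothetical, simplified stand-in for an LXR usages table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE usages (symid INTEGER, fileid INTEGER, line INTEGER)")
rows = [
    (1, 10, 42),
    (1, 10, 42),  # same symbol, file and line stored twice: a true duplicate
    (1, 10, 57),  # same symbol on another line: legitimate
    (2, 11, 42),
]
conn.executemany("INSERT INTO usages VALUES (?, ?, ?)", rows)

def find_duplicates(conn):
    """Return (symid, fileid, line, count) for every row stored more than once."""
    return conn.execute(
        "SELECT symid, fileid, line, COUNT(*) AS n "
        "FROM usages GROUP BY symid, fileid, line "
        "HAVING COUNT(*) > 1 ORDER BY symid, fileid, line"
    ).fetchall()

duplicates = find_duplicates(conn)
print(duplicates)
```

Running the same query after each indexing run shows whether the count in the last column grows by one per run.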
Can you make a simple test case? A tree with only one of your problematic files, the shorter the better. Index it, modify the file, then reindex incrementally. If it still produces duplicates, I'll give you my private mail @dress so you can forward the offending file.
Why no primary key?
A primary key works on a column. With modern languages, you can have several variables with the same name in different contexts. You can then have (apparent) duplication and cannot constrain the table with a primary key.
What is unique is the combination of name+context, i.e. columns symid, fileid and lineid. Unless duplication happens on the same line, constraint "unique key" is meaningless.
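For illustration, SQL does accept a UNIQUE constraint spanning several columns. The sketch below uses Python with sqlite3 and a hypothetical table that has only the three columns named above; the real LXR tables may differ. It shows what such a constraint accepts and what it rejects.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table; column names follow the post (symid, fileid, lineid).
conn.execute(
    "CREATE TABLE usages ("
    " symid INTEGER, fileid INTEGER, lineid INTEGER,"
    " UNIQUE (symid, fileid, lineid))"
)

def try_insert(conn, symid, fileid, lineid):
    """Insert one row; return False if the composite UNIQUE constraint rejects it."""
    try:
        conn.execute("INSERT INTO usages VALUES (?, ?, ?)", (symid, fileid, lineid))
        return True
    except sqlite3.IntegrityError:
        return False

results = [
    try_insert(conn, 1, 10, 42),  # first occurrence: accepted
    try_insert(conn, 1, 10, 57),  # same symbol on another line: accepted
    try_insert(conn, 1, 10, 42),  # exact repeat of the first row: rejected
]
print(results)
```

The constraint cannot tell why a row repeats. It rejects a row left over from an earlier indexing run, but it equally rejects a symbol that is legitimately used twice on one source line, as in x = x + 1.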
Second thought on your problem:
After carefully rereading genxref: you get "Selective database purge" only with -reindexall, and nothing else is printed. When -reindexall is NOT specified and -version=… is given (i.e. incremental reindexing), a recursive tree traversal is done and changed files are logged.
If you do not see this tree traversal, something is wrong with genxref.
I index C++ code and duplication happens on both. I am now trying to create a new LXR installation on Ubuntu (on a local computer) with apache2, and I found another probable bug in the .htaccess file:
Shouldn't Apache be replaced with ModPerl? But that's off topic; I'll continue with the installation. :)
C/C++ is the most tested parser, so it should not be the cause of the problem.
"duplication happens on both": you mean with -reindexall and without? If you have duplications with -reindexall, something is really wrong with the DB because -reindexall removes all content in the tables. Duplicates in this context means stale data remained in the tables. Tell me how long it takes to index your trees (plural = all versions in all trees). If it is not too long, I'll suggest some radical and brute method.
Apache::Registry: this is bug #3583172. Change Apache to ModPerl. Note there is no impact if your Apache is configured for (default) prefork MPM mode. The bug shows up only in worker MPM mode.
I had this problem both with -reindexall and without this option. However, with -reindexall it happened more often. I had to reinstall the whole database and reindex everything. I will now set up incremental indexing again and collect the Postgres logs. It might take some time before I get back with anything.
I have also created a separate LXR installation with one source file that caused the problem, and it did not happen there: no duplicates.