[Lxr-dev] Some patches for review (CVS tree indexing working)

Finally, I have a version of LXR working that indexes our CVS tree
directly.

Here are the significant patches against the current CVS tree on
SourceForge.  I'd be very happy if someone took a look and told me
whether there is something fundamentally wrong with them (I'd also
appreciate being told what a genius I am, naturally ;) ).

If I don't hear anything, I will assume the changes don't bother
anyone, and check them in in a day or two.

The bulk of the changes are in CVS.pm, which no longer segfaults when
faced with binary files.  It now largely uses external commands (co
and rcsdiff) for getting files out of CVS.  Unfortunately, it builds
the command lines by string concatenation and runs them through a pipe
open, which is probably exploitable.  While this is not a big issue on
our intranet, you'd probably not want to run this on a public server
yet.

My next task will be to fix this.  I just wanted to get these patches
out the door, as they've taken far too long already.
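
For reference, one possible direction for that fix is the list form of
a pipe open, which hands the arguments straight to exec without going
through a shell.  This is only a sketch and not part of the patches:
read_command is a hypothetical helper, the running perl ($^X) stands in
for "co", and the list form needs a newer perl than some installations
may have (5.8 or later).

```perl
use strict;
use warnings;

# Hypothetical helper (not in the patches): run a command without a
# shell and return its output lines.
sub read_command {
    my (@cmd) = @_;

    # List-form pipe open: each element is passed to exec as one
    # argument, so ';' or '|' in a filename never reaches a shell.
    open(my $fh, '-|', @cmd) or return;
    my @lines = <$fh>;
    close($fh);
    return @lines;
}

# The running perl stands in for "co"; it just echoes its argument.
my $hostile = 'file; echo pwned';
my @out = read_command($^X, '-e', 'print "$ARGV[0]\n"', $hostile);
print @out;    # prints the hostile name as plain data
```

A revision string could additionally be checked against something like
/^[\d.]+$/ before it is used as an argument.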

See also comments in between the parts of the patch.

===================================================================

These patches to the data model may be just me being dense, but because
a CVS file contains a lot of versions, there is no single file->version
mapping, so fileid alone cannot be the primary key of the releases
table.  genxref complained loudly without this change.  Also, the
usage table wasn't dropped along with the others in Postgres.
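
To illustrate the key problem outside the database, here is a small
sketch with made-up rows (the file id and release names are
hypothetical).  A hash keyed on fileid alone keeps one entry per file,
while a key built from both columns keeps every row.  A real database
would reject the duplicate insert rather than overwrite, but the
uniqueness rule is the same.

```perl
use strict;
use warnings;

# Made-up rows: one file id that belongs to several releases.
my @rows = ([42, 'REL_1_0'], [42, 'REL_1_1'], [42, '1.3']);

my (%by_fileid, %by_pair);
for my $row (@rows) {
    my ($fileid, $release) = @$row;
    $by_fileid{$fileid} = $release;       # key (fileid): one slot per file
    $by_pair{"$fileid\0$release"} = 1;    # key (fileid,release): one slot per row
}

print scalar(keys %by_fileid), "\n";    # 1
print scalar(keys %by_pair), "\n";      # 3
```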

Index: initdb
===================================================================
RCS file: /cvsroot/lxr/lxr/initdb,v
retrieving revision 1.10
diff -u -r1.10 initdb

--- initdb	1999/09/17 09:37:37	1.10
+++ initdb	2001/07/24 11:16:34
@@ -4,6 +4,7 @@
 drop table symbols;
 drop table indexes;
 drop table releases;
+drop table usage;
 drop table status;
 
 create sequence filenum cache 50;
@@ -37,7 +38,7 @@
 create table releases 
 	(fileid		int		references files,
 	release		varchar,
-	primary key	(fileid)
+	primary key	(fileid,release)
 );
 
 create table usage
Index: initdb-mysql
===================================================================
RCS file: /cvsroot/lxr/lxr/initdb-mysql,v
retrieving revision 1.4
diff -u -r1.4 initdb-mysql
--- initdb-mysql	2001/05/23 05:33:12	1.4
+++ initdb-mysql	2001/07/24 11:16:34
@@ -34,7 +34,7 @@
 create table releases 
         (fileid         int not null references files,
         release         char(255) binary not null,
-        primary key     (fileid)
+        primary key     (fileid,release)
 );
 
 create table useage
Index: initdb-postgres
===================================================================
RCS file: /cvsroot/lxr/lxr/initdb-postgres,v
retrieving revision 1.1
diff -u -r1.1 initdb-postgres
--- initdb-postgres	1999/12/25 21:58:27	1.1
+++ initdb-postgres	2001/07/24 11:16:34
@@ -4,6 +4,7 @@
 drop table symbols;
 drop table indexes;
 drop table releases;
+drop table usage;
 drop table status;
 
 create sequence filenum cache 50;
@@ -37,7 +38,7 @@
 create table releases 
 	(fileid		int		references files,
 	release		varchar,
-	primary key	(fileid)
+	primary key	(fileid,release)
 );
 
 create table usage
===================================================================

The comment says it all, really.  I just wanted to illustrate a new
way of specifying the range.

Index: lxr.conf.template
===================================================================
RCS file: /cvsroot/lxr/lxr/lxr.conf.template,v
retrieving revision 1.5
diff -u -r1.5 lxr.conf.template
--- lxr.conf.template	1999/06/16 09:17:48	1.5
+++ lxr.conf.template	2001/07/24 11:16:34
@@ -22,6 +22,15 @@
 	 # Define typed variable "v", read valueset from file.
 	 'v' => {'name'    => 'Version',
 		 'range'   => [ readfile('src/cvsversions') ], 
+
+                 # If files within a tree can have different versions,
+		 # e.g in a CVS tree, 'range' can be specified as a
+		 # function to call for each file:
+		 #'range'   => sub { return 
+		 #			($files->allreleases($LXR::Common::pathname),
+		 #			 $files->allrevisions($LXR::Common::pathname))
+		 #			}, # deferred function call.
+
 		 'default' => '1.0.6'},
 	 
 	 # Define typed variable "a".  First value is default.
===================================================================

And this is what implements this new way of specifying the range.
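
A stand-alone sketch of the same dispatch, for anyone who wants to try
it without an LXR installation.  The %variables hash and its values
are made up for the example; only the shape of varrange follows the
patch.

```perl
use strict;
use warnings;

# Made-up stand-in for the 'variables' section of an LXR config.
my %variables = (
    v => { range => sub { return ('1.1', '1.2') } },    # computed on each call
    a => { range => [ 'i386', 'alpha' ] },              # fixed list
);

sub varrange {
    my ($var) = @_;
    my $range = $variables{$var}{range};

    # A code reference is called; anything else is treated as an
    # array reference, with an empty list as the fallback.
    return $range->() if ref($range) eq 'CODE';
    return @{ $range || [] };
}

print join(',', varrange('v')), "\n";    # 1.1,1.2
print join(',', varrange('a')), "\n";    # i386,alpha
```

One difference from the patch: calling the reference as &{...} without
parentheses passes the caller's @_ through to the sub, whereas
$range->() calls it with no arguments.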

Index: lib/LXR/Config.pm
===================================================================
RCS file: /cvsroot/lxr/lxr/lib/LXR/Config.pm,v
retrieving revision 1.23
diff -u -r1.23 Config.pm
--- lib/LXR/Config.pm	2000/09/04 19:26:28	1.23
+++ lib/LXR/Config.pm	2001/07/24 11:16:36
@@ -117,6 +117,10 @@
 sub varrange {
     my ($self, $var) = @_;
 
+	if (ref($self->{variables}{$var}{range}) eq "CODE") {
+		return &{$self->{variables}{$var}{range}};
+	}
+
     return @{$self->{variables}{$var}{range} || []};
 }

===================================================================

The shebang parsing didn't work previously.  Now it does.
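
As far as I can tell the cause was context: a pattern match assigned to
a scalar yields its success flag, not the captured text.  The sketch
below shows the difference on a made-up script header.  It uses a list
assignment rather than $1, which is an alternative to what the patch
does and also gives undef when the match fails (after a failed match,
$1 still holds whatever the last successful match captured).

```perl
use strict;
use warnings;

my $text = "#!/usr/bin/perl -w\nprint 1;\n";

# Scalar context: the result is the match's truth value.
my $flag = $text =~ /^#!\s*(\S+)/s;

# List context: the result is the list of captures.
my ($interp) = $text =~ /^#!\s*(\S+)/s;

# No match: the list assignment leaves the variable undefined.
my ($none) = "no shebang here\n" =~ /^#!\s*(\S+)/s;

print "scalar context: $flag\n";      # scalar context: 1
print "list context:   $interp\n";    # list context:   /usr/bin/perl
print "no match:       ", defined($none) ? $none : 'undef', "\n";
```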

Index: lib/LXR/Lang.pm
===================================================================
RCS file: /cvsroot/lxr/lxr/lib/LXR/Lang.pm,v
retrieving revision 1.19
diff -u -r1.19 Lang.pm
--- lib/LXR/Lang.pm	1999/12/25 21:58:27	1.19
+++ lib/LXR/Lang.pm	2001/07/24 11:16:38
@@ -30,9 +30,9 @@
 		$lang = new LXR::Lang::Perl($pathname, $release);
 	}
 	else {
-		my ($shebang);
+		$files->getfile($pathname, $release) =~ /^#!\s*(\S+)/s;
 
-		$shebang = $files->getfile($pathname, $release) =~ /^#!\s*(\S+)/s;
+		my $shebang = $1;
 
 		if ($shebang =~ /perl/) {
 			require LXR::Lang::Perl;
===================================================================

Quite a lot of changes here.  Some minor bugfixes for undefined
values, and then the big change of having parsecvs parse only the
header of the CVS file, because it would get confused by binary
content in the controlled file.  All places that depended on the
content of the file being available in the %cvs hash have been changed
to use the proper RCS/CVS commands instead (unfortunately not in a
safe way - yet).
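
The header-only read can be tried in isolation with the sketch below.
The RCS content is a made-up, heavily shortened sample, and it is read
from an in-memory filehandle (which needs perl 5.8 or later) instead of
a real ,v file.  The loop stops at the first line consisting of "text",
so the file body is never loaded.

```perl
use strict;
use warnings;

# Made-up, shortened RCS file for illustration only.
my $rcs = <<'EOF';
head    1.2;
access;
symbols
    REL_1:1.1;
locks; strict;

1.2
date    2001.07.24.11.16.34;    author someone;    state Exp;
branches;
next    1.1;

desc
@@

1.2
log
@second revision@
text
@binary junk here
@
EOF

open(my $fh, '<', \$rcs) or die "cannot open in-memory file: $!";
my $header = '';
while (my $line = <$fh>) {
    last if $line =~ /^text\s*$/;    # stop before the file contents
    $header .= $line;
}
close($fh);

my ($head) = $header =~ /^head\s+([\d.]+);/m;
print "head revision: $head\n";    # head revision: 1.2
```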

Index: lib/LXR/Files/CVS.pm
===================================================================
RCS file: /cvsroot/lxr/lxr/lib/LXR/Files/CVS.pm,v
retrieving revision 1.11
diff -u -r1.11 CVS.pm
--- lib/LXR/Files/CVS.pm	1999/11/24 14:48:53	1.11
+++ lib/LXR/Files/CVS.pm	2001/07/24 11:16:38
@@ -29,6 +29,9 @@
 	if ($release =~ /rev_([\d\.]+)/) {
 		return $1;
 	}
+	elsif ($release =~ /([\d\.]+)/) {
+		return $1;
+	}
 	else {
 		$self->parsecvs($filename);
 		return $cvs{'header'}{'symbols'}{$release};
@@ -47,8 +50,10 @@
 	return undef unless defined($rev);
 
 	my @t = reverse(split(/\./, $cvs{'branch'}{$rev}{'date'}));
-	$t[4]--;
 
+	return undef unless @t;
+
+	$t[4]--;
 	return timegm(@t);
 }
 
@@ -60,38 +65,10 @@
 
 sub getfile {
 	my ($self, $filename, $release) = @_;
-
-	$self->parsecvs($filename);
-
-	my $rev = $self->filerev($filename, $release);
-	return undef unless defined($rev);
-
-	my $hrev = $cvs{'header'}{'head'};
-	my @head = $cvs{'history'}{$hrev}{'text'} =~ /([^\n]*\n)/gs;
-
-	while ($hrev ne $rev && $cvs{'branch'}{$hrev}{'branches'} ne $rev) {
-		$hrev = $cvs{'branch'}{$hrev}{'next'};
-		my @diff = $cvs{'history'}{$hrev}{'text'} =~ /([^\n]*\n)/gs;
- 		my $off = 0;
-
-		while (@diff) {
-			my $dir = shift(@diff);
 
-			if ($dir =~ /^a(\d+)\s+(\d+)/) {
-				splice(@head, $1-$off, 0, splice(@diff, 0, $2));
-				$off -= $2;
-			}
-			elsif ($dir =~ /^d(\d+)\s+(\d+)/) {
-				splice(@head, $1-$off-1, $2);
-				$off += $2;
-			}
-			else {
-				warning("Oops! Out of sync!");
-			}
-		}
-	}
-
-	return join('', @head);
+	my $fileh = $self->getfilehandle($filename, $release);
+	return undef unless $fileh;
+	return join('', $fileh->getlines);
 }
 
 sub getannotations {
@@ -105,22 +82,23 @@
 	my $hrev = $cvs{'header'}{'head'};
 	my $lrev;
 	my @anno;
-	my @head = $cvs{'history'}{$hrev}{'text'} =~ /\n()/gs;
+	my $headfh = $self->getfilehandle($filename, $release);
+	my @head = $headfh->getlines;
 
 	while (1) {
 		if ($rev eq $hrev) {
 			@head = 0..$#head;
 		}
-
+		
 		$lrev = $hrev;
 		$hrev = $cvs{'branch'}{$hrev}{'next'} || last;
-
-		my @diff = $cvs{'history'}{$hrev}{'text'} =~ /([^\n]*\n)/gs;
- 		my $off = 0;
-
+		
+		my @diff = $self->getdiff($filename, $lrev, $hrev);
+		my $off = 0;
+		
 		while (@diff) {
 			my $dir = shift(@diff);
-
+			
 			if ($dir =~ /^a(\d+)\s+(\d+)/) {
 				splice(@diff, 0, $2);
 				splice(@head, $1-$off, 0, ('') x $2);
@@ -130,7 +108,7 @@
 				map {
 					$anno[$_] = $lrev if $_ ne '';
 				} splice(@head, $1-$off-1, $2);
-
+				
 				$off += $2;
 			}
 			else {
@@ -159,22 +137,33 @@
 	my ($self, $filename, $release) = @_;
 	my ($fileh);
 
-#	$fileh = new FileHandle("co -q -pv$release ".
-#							$self->toreal($filename, $release).
-#							" |"); # FIXME: Exploitable?
-
-	my $buffer = $self->getfile($filename, $release);
-
-	&LXR::Common::fflush;
-	my ($readh, $writeh) = FileHandle::pipe;
-	unless (fork) {
-		$writeh->autoflush(1);
-		$writeh->print($buffer);
-		exec("/bin/true");		# Exit without cleanup.
-		exit;
-	}
+	$self->parsecvs($filename);
+
+	my $rev = $self->filerev($filename, $release);
+	return undef unless defined($rev);
+
+	$fileh = new FileHandle("co -q -p$rev ".
+							$self->toreal($filename, $release).
+							" |"); # FIXME: Exploitable?
+	return $fileh;
+}
+
+sub getdiff {
+	my ($self, $filename, $release1, $release2) = @_;
+	my ($fileh);
+
+	$self->parsecvs($filename);
+
+	my $rev1 = $self->filerev($filename, $release1);
+	return undef unless defined($rev1);
+
+	my $rev2 = $self->filerev($filename, $release2);
+	return undef unless defined($rev2);
 
-	return $readh;
+	$fileh = new FileHandle("rcsdiff -q -a -n -r$rev1 -r$rev2 ".
+							$self->toreal($filename, $release1).
+							" |"); # FIXME: Exploitable?
+	return $fileh->getlines;
 }
 
 sub tmpfile {
@@ -297,15 +286,35 @@
 	return sort(keys(%{$cvs{'header'}{'symbols'}}));
 }
 
+sub allrevisions {
+	my ($self, $filename) = @_;
+
+	$self->parsecvs($filename);
+
+	return sort(keys(%{$cvs{'branch'}}));
+}
+
 sub parsecvs {
+	# Actually, these days it just parses the header.
+	# RCS tools are much better at parsing RCS files.
+	# -pok
 	my ($self, $filename) = @_;
 
 	return if $cache_filename eq $filename;
 	$cache_filename = $filename;
+
+	my $file = '';
+	open (CVS, $self->toreal($filename, undef));
+	while (<CVS>) {
+		if (/^text\s*$/) {
+			# stop reading when we hit the text.
+			last;
+		}
+		$file .= $_;
+	}
+	close (CVS);
 
-	open(CVS, $self->toreal($filename, undef));
-	my @cvs = join('', <CVS>) =~ /((?:(?:[^\n@]+|@[^@]*@)\n?)+)/gs;
-	close(CVS);
+	my @cvs = $file =~ /((?:(?:[^\n@]+|@[^@]*@)\n?)+)/gs;
 
 	$cvs{'header'} = { map { s/@@/@/gs;
 							 /^@/s && substr($_, 1, -1) || $_ }
@@ -313,7 +322,7 @@
 
 	$cvs{'header'}{'symbols'}
 	= { $cvs{'header'}{'symbols'} =~ /(\S+?):(\S+)/g };
-
+	
 	my ($orel, $nrel, $rev);
 	while (($orel, $rev) = each %{$cvs{'header'}{'symbols'}}) {
 		$nrel = $config->cvsversion($orel);
@@ -337,13 +346,6 @@
 	$cvs{'desc'} = shift(@cvs) =~ /\s*desc\s+((?:[^\n@]+|@[^@]*@)*)\n/s;
 	$cvs{'desc'} =~ s/^@|@($|@)/$1/gs;
 
-	while (@cvs) {
-		my ($r, $v) = shift(@cvs) =~ /\s*(\S+)\s*(.*)/s;
-		$cvs{'history'}{$r} = { map { s/@@/@/gs; 
-									  /^@/s && substr($_, 1, -1) || $_ }
-								$v =~ /(\w+)\s*((?:[^\n@]+|@[^@]*@)*)\n/gs };
-	}
 }
-
 
 1;
===================================================================

Also, some small changes to the Postgres interface.  I dislike
hardcoded numbers that occur in more than one place, so I put the
number of inserts before a commit into a variable.  I also fixed a bug
resulting from a wrong assumption about the return value of execute on
a select statement: the code now fetches a row to see whether one
exists.
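
The commit counter logic on its own, as a sketch without a database:
$commits is a stand-in for the $dbh->commit() calls, and 250 inserts is
an arbitrary number picked for the example.

```perl
use strict;
use warnings;

my $commitlimit  = 100;    # inserts per commit, as in the patch
my $transactions = 0;
my $commits      = 0;      # stand-in for calls to $dbh->commit()

sub record_insert {
    # The remainder is 0 (false) on every $commitlimit-th insert.
    $commits++ unless ++$transactions % $commitlimit;
}

record_insert() for 1 .. 250;

my $pending = $transactions % $commitlimit;
print "commits: $commits, uncommitted inserts: $pending\n";
# commits: 2, uncommitted inserts: 50
```

Whatever is left after the last full batch still needs a final commit
somewhere; these hunks don't show where that happens.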

Index: lib/LXR/Index/Postgres.pm
===================================================================
RCS file: /cvsroot/lxr/lxr/lib/LXR/Index/Postgres.pm,v
retrieving revision 1.4
diff -u -r1.4 Postgres.pm
--- lib/LXR/Index/Postgres.pm	2000/10/31 12:52:12	1.4
+++ lib/LXR/Index/Postgres.pm	2001/07/24 11:16:41
@@ -9,7 +9,7 @@
 use strict;
 use DBI;
 
-use vars qw($dbh $transactions %files %symcache 
+use vars qw($dbh $transactions %files %symcache $commitlimit
 			$files_select $filenum_nextval $files_insert
 			$symbols_byname $symbols_byid $symnum_nextval
 			$symbols_remove $symbols_insert $indexes_select $indexes_insert
@@ -26,7 +26,8 @@
 
 	$$dbh{'AutoCommit'} = 0;
 #	$dbh->trace(1);
-	
+
+	$commitlimit = 100;
 	$transactions = 0;
 	%files = ();
 	%symcache = ();
@@ -96,7 +97,7 @@
 							 $line,
 							 $type,
 							 $relsym ? $self->symid($relsym) : undef);
-	unless (++$transactions % 500) {
+	unless (++$transactions % $commitlimit) {
 		$dbh->commit();
 	}
 }
@@ -108,7 +109,7 @@
 						   $line,
 						   $self->symid($symname));
 
-	unless (++$transactions % 500) {
+	unless (++$transactions % $commitlimit) {
 		$dbh->commit();
 	}
 }
@@ -178,12 +179,16 @@
 # Indicate that this filerevision is part of this release
 sub release {
 	my ($self, $fileid, $release) = @_;
+
+
+	$releases_select->execute($fileid+0, $release);
+	my $firstrow = $releases_select->fetchrow_array();
+
 
-	my $rows = $releases_select->execute($fileid+0, $release);
-	$releases_select->finish();
+#	$releases_select->finish();
 
-	unless ($rows > 0) {
-		$releases_insert->execute($fileid, $release);
+	unless ($firstrow) {
+		$releases_insert->execute($fileid+0, $release);
 	}
 }
 
@@ -239,7 +244,7 @@
 	my ($self, $fileid) = @_;
 
 	$status_insert->execute($fileid+0, $fileid+0);
-	return $status_update->execute(1, $fileid, 0) > 0;
+	return $status_update->execute(1, $fileid+0, 0) > 0;
 }
 
 sub toreference {


...Peder...
-- 
Cogito ergo panta rei.