[Xapian-devel] Re: Quartz Performance Improvements

Olly Betts wrote:
> On Sun, Jan 11, 2004 at 01:07:14AM +0100, Arjen van der Meijden wrote:
>
> In a mail off the list, Arjen noted that the speedup is greater when
> adding to a database which already contains a lot of data - more than 4
> times faster per 1000 documents when the database has about 50000
> documents!

Our production 0.7.5 box (dual Xeon, five 10k rpm 36G disks in RAID 5,
4GB RAM) doesn't show the same drop in performance as my local
IDE-powered development box; or rather it does, but the drop is much
smaller. It starts off at 0:45 minutes per run and slows to somewhere
near 3:00 for a similar batch (I don't know the preprocessing time,
probably somewhere near 0:15) after having done over 830k documents
(about 6.4G of text data).

But the load on the machine while indexing may have been a bit higher
towards the end of the data set than at the start (another Xapian
database on the machine was being actively searched).
Indexing the 6.4G of text took a bit over 1 day and 16 hours.
When a new stable Xapian is out, I'll probably reindex the whole lot,
simply to benefit from the proposed changes which will result in a yet
smaller database (it is now 15G, or 10G compacted) and to see whether
it is that much faster on that box as well.
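
As a rough back-of-the-envelope check, the overall throughput implied by
those totals can be sketched as below. It assumes exactly 40 hours (the
run took "a bit over" that), exactly 830k documents, and 1G = 1000 MB, so
the result is only an upper-bound estimate; the function name is made up
for illustration.

```python
def throughput(documents, gigabytes, hours):
    """Return (documents per second, megabytes per second) for a whole run."""
    seconds = hours * 3600
    # Assumption: 1 G is taken as 1000 MB.
    return documents / seconds, gigabytes * 1000 / seconds


# Totals quoted above: over 830k documents, about 6.4G, a bit over 40 hours.
docs_per_s, mb_per_s = throughput(830_000, 6.4, 40)
print(f"{docs_per_s:.1f} docs/s, {mb_per_s * 1000:.0f} kB/s")
```

That works out to somewhat under 6 documents per second averaged over the
whole run, including preprocessing and the competing search load.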

My local box (Athlon 850, 512MB RAM, data set in a MySQL database on the
same box (actually, even the same drive, but the data set is not read
from disk while sending it to scriptindex)) starts at ~2:00 and slows to
~14:00 minutes per run, of which about 1 minute is preprocessing time,
after having done only 45k documents.
The cvs-head version went from ~2:00 to ~3:30 with the same data set.
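
To put the figures side by side, here is a small sketch that turns the
first and last per-run times into a slowdown factor. The times are the
approximate ones quoted in this mail, preprocessing time is included, and
the production figure is after 830k documents while the local ones are
after 45k, so the factors are only indicative. The helper names and run
labels are made up for illustration.

```python
def to_seconds(mmss):
    """Convert an 'm:ss' string such as '3:30' to seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def slowdown(first, last):
    """How many times longer the last run took than the first."""
    return to_seconds(last) / to_seconds(first)


# (first run, last run) per-batch times as quoted in this mail.
runs = {
    "production, 0.7.5": ("0:45", "3:00"),
    "local, 0.7.5": ("2:00", "14:00"),
    "local, cvs-head": ("2:00", "3:30"),
}
factors = {name: slowdown(*times) for name, times in runs.items()}

for name, factor in factors.items():
    print(f"{name}: {factor:.2f}x slower per run")
```

On these numbers the local box slows down about 7x with 0.7.5 against
1.75x with cvs-head, and the production box about 4x with 0.7.5.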

> Anyway, this is really good news for scaling!

As shown above, my IDE-powered development box shows the poorer scaling
of 0.7.5 much more clearly than our SCSI-RAID-powered production box,
even though the production box was actively used while my development
box was otherwise idle.

> Actually, this is no longer true - the only difference between my
> working sources and CVS is that I've temporarily reverted to a "every
> 1000 documents" flush criterion to give a fairer comparison between
> the old and new code (better to benchmark one change at a time!)

I applied the patch you sent me, so my tests used the same flush
criterion.
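
For readers unfamiliar with the term, an "every 1000 documents" flush
criterion amounts to something like the sketch below. This is not the
patch itself or Xapian's actual code: CountingDatabase is a hypothetical
stand-in that only counts calls, and index_all is a made-up helper.

```python
class CountingDatabase:
    """Hypothetical stand-in for a writable index; it only counts calls."""

    def __init__(self):
        self.pending = 0   # documents added since the last flush
        self.flushes = 0   # number of flushes performed

    def add_document(self, doc):
        self.pending += 1

    def flush(self):
        self.pending = 0
        self.flushes += 1


def index_all(db, documents, flush_every=1000):
    """Add every document, flushing after each full batch of flush_every."""
    for count, doc in enumerate(documents, start=1):
        db.add_document(doc)
        if count % flush_every == 0:
            db.flush()
    if db.pending:
        db.flush()  # write out the final partial batch


db = CountingDatabase()
index_all(db, range(2500))
print(db.flushes)
```

Keeping the batch size fixed at 1000 for both versions means the timings
differ only because of the code change being measured.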

> Does your source data contain anything confidential, or is it something
> I could take a copy of for testing?  I've been contemplating setting up
> some nightly tests - graphing the speed and memory requirements for
> indexing and searching with the CVS HEAD version would help keep Xapian
> lean and mean.

The data does not contain confidential texts; it's composed of data which
anyone could extract from our website if they were willing to do so.
But I'll have to discuss this with my colleagues, since the data is not
100% our own property (i.e. the copyrights on the contents are not
really ours, while the copyright on/ownership of the data itself is
ours; that's a disadvantage of running a forum ;) ).

If the data is not distributed in any way, it will probably be
relatively easy to get permission for this. If you intend to spread it
around, I'm not sure whether I'll be allowed to provide the data set
(for that reason).

Anyway, I'm going to ask my colleagues right now (especially the legal 
guys).

Best regards,

Arjen