From: Arjen v. d. M. <ar...@gl...> - 2004-01-11 10:50:07
Olly Betts wrote:
> On Sun, Jan 11, 2004 at 01:07:14AM +0100, Arjen van der Meijden wrote:
>
> In a mail off the list, Arjen noted that the speedup is greater when
> adding to a database which already contains a lot of data - more than 4
> times faster per 1000 documents when the database has about 50000
> documents!

Our production 0.7.5 (on a dual Xeon, five 10k rpm 36G disks in RAID 5, 4GB RAM) doesn't really show the same drop in performance as my local IDE-powered development box; or rather it does, but the drop is much smaller. It starts off at 0:45 minutes per run and slows to somewhere near 3:00 for a similar batch (I don't know the preprocessing time, probably somewhere near 0:15) after having done over 830k documents (about 6.4G of text data). However, the load on the machine while indexing may have been a bit higher towards the end of the data set than at the start (another Xapian database on the machine was being actively searched).

The process of indexing 6.4G of text took a bit over 1 day and 16 hours. When a new stable Xapian is out, I'll probably reindex the whole lot again, simply to benefit from the proposed changes which will result in a yet smaller database (it is now 15G, or 10G compacted) and to see whether it is that much faster on that box as well.

My local box (Athlon 850, 512MB RAM, data set in a MySQL database on the same box (actually, even on the same drive, but the data set is not read from disk while sending it to scriptindex)) starts at ~2:00 and slows to ~14:00 minutes per run, of which about 1 minute is preprocessing time, after having done only 45k documents. The CVS HEAD version went from ~2:00 to ~3:30 with the same data set.

> Anyway, this is really good news for scaling!

As shown above, my IDE-powered development box shows the poorer scaling of 0.7.5 much more clearly than our SCSI-RAID-powered production box, even though the production box was actively used while my development box was simply idle.

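As a rough check of the figures above, the slowdown factors can be worked out with a short Python sketch. The helper names (`to_seconds`, `slowdown`) are made up for illustration, and the timings are the approximate per-run values quoted above. They include preprocessing time, so the ratios are only indicative.

```python
def to_seconds(mmss: str) -> int:
    """Convert an 'M:SS' string such as '3:30' to seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def slowdown(first: str, last: str) -> float:
    """Ratio of the last run's time to the first run's time."""
    return to_seconds(last) / to_seconds(first)


# Approximate (first run, last run) timings taken from the message above.
runs = {
    "production, 0.7.5 (after ~830k documents)": ("0:45", "3:00"),
    "development, 0.7.5 (after ~45k documents)": ("2:00", "14:00"),
    "development, CVS HEAD (same data set)": ("2:00", "3:30"),
}

for label, (first, last) in runs.items():
    print(f"{label}: {slowdown(first, last):.2f}x slower per run")

# Speedup of CVS HEAD over 0.7.5 on the development box at ~45k documents.
head_speedup = to_seconds("14:00") / to_seconds("3:30")
print(f"CVS HEAD vs 0.7.5 at ~45k documents: {head_speedup:.1f}x faster")

# Average production throughput: ~830k documents in roughly 1 day 16 hours.
total_seconds = (24 + 16) * 3600
docs_per_second = 830_000 / total_seconds
print(f"production average: {docs_per_second:.2f} documents/second")
```

On these numbers the development box gets about 7x slower per run with 0.7.5 but only about 1.75x slower with CVS HEAD, which is consistent with the "more than 4 times faster" remark quoted at the top.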
> Actually, this is no longer true - the only difference between my
> working sources and CVS is that I've temporarily reverted to a "every
> 1000 documents" flush criterion to give a fairer comparison between
> the old and new code (better to benchmark one change at a time!)

I applied the patch you sent me, so that was the same for my tests.

> Does your source data contain anything confidential, or is it something
> I could take a copy of for testing? I've been contemplating setting up
> some nightly tests - graphing the speed and memory requirements for
> indexing and searching with the CVS HEAD version would help keep Xapian
> lean and mean.

The data does not contain confidential texts; it's composed of data which anyone could extract from our website if they were willing to do so. But I'll have to discuss this with my colleagues, since the data is not 100% our own property (i.e. the copyrights on the contents are not really ours, while the copyrights on/ownership of the data itself are ours; that's a disadvantage of running a forum ;) ).

If the data is not distributed in any way, it will probably be relatively easy to get permission for this. If you intended to spread it around, I'm not sure whether I'll be allowed to provide the data set (for that reason). Anyway, I'm going to ask my colleagues right now (especially the legal guys).

Best regards,

Arjen