Re: [PyIndexer] MySQL Stuff
Status: Pre-Alpha
Brought to you by:
cduncan
From: Marcus C. <ma...@wr...> - 2001-12-17 10:42:21
|
On Sun, 16 Dec 2001 at 19:17:00 -0800, Casey Duncan wrote: [ using IN ] > I was going to suggest this, but what if both words > return 10,000 document hits? I doubt stuffing 10K ids > into the IN operand of the SQL statment would work... *gulp* yes... I've done some informal tests (real results to follow when they're formalised), and this might not be something to worry about so much. The time spent in processing the query is: - send query to MySQL server (linear) - parse query (doesn't look as bad as I thought it might; appears to be linear) - select using explicit document ids (cheaper than joining to find them) Additionally, I think the space complexity becomes linear rather than exponential for the last item, when compared to a self join. The app-side time to join a list of document ids returned from MySQL into a MySQL list should not be too bad. (The 'obvious' way of doing it takes 50ms for 10000 ids on my box, once the ids have been returned from MySQL. The main time expense, I think, would be inside MySQLdb.) > But I do think that using heap tables would be quite > efficient. I would think that would be the way to go, > and just join against it in the second query. It's still the join that bugs me, because I don't think it operates in linear time or space. But that said, the lookup in the heap tables, using hashes, would be constant. > Maybe both methods could be used, using IN when there > are 50 or fewer hits and heap tables otherwise. The major problem with the join (or at least with the self join) is that MySQL wants to create a temporary table to join two 10K-row tables. Using the heap temporary to store intermediate results would mitigate this to some extent, since MySQL would then be joining one 10K row table with another of fewer rows. I don't think we'd find the same behaviour using IN, but I'm not yet certin. I'll try to get some comparative results using IN for different numbers of return results. > Again > I think storing a document count for each word could > help here. Indeed. Storing the document count will definitely push up indexing time, but the benefits are great. Cheers -- Marcus |