Re: [PyIndexer] MySQL Stuff

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 422-6466

On Sun, 16 Dec 2001 at 19:17:00 -0800, Casey Duncan wrote:

[ using IN ]

> I was going to suggest this, but what if both words
> return 10,000 document hits? I doubt stuffing 10K ids
> into the IN operand of the SQL statment would work...

*gulp* yes...

I've done some informal tests (real results to follow when they're
formalised), and this might not be something to worry about so much. 
The time spent in processing the query is:

- send query to MySQL server (linear)
- parse query (doesn't look as bad as I thought it might; appears to be
  linear)
- select using explicit document ids (cheaper than joining to find them)

Additionally, I think the space complexity becomes linear rather than
exponential for the last item, when compared to a self join.

The app-side time to join a list of document ids returned from MySQL
into a MySQL list should not be too bad. (The 'obvious' way of doing it
takes 50ms for 10000 ids on my box, once the ids have been returned from
MySQL. The main time expense, I think, would be inside MySQLdb.)

> But I do think that using heap tables would be quite
> efficient. I would think that would be the way to go,
> and just join against it in the second query.

It's still the join that bugs me, because I don't think it operates in
linear time or space. But that said, the lookup in the heap tables, 
using hashes, would be constant.

> Maybe both methods could be used, using IN when there
> are 50 or fewer hits and heap tables otherwise. 

The major problem with the join (or at least with the self join) is that
MySQL wants to create a temporary table to join two 10K-row tables.
Using the heap temporary to store intermediate results would mitigate
this to some extent, since MySQL would then be joining one 10K row table
with another of fewer rows. 

I don't think we'd find the same behaviour using IN, but I'm not yet
certin. I'll try to get some comparative results using IN for different 
numbers of return results. 

> Again
> I think storing a document count for each word could
> help here.

Indeed. Storing the document count will definitely push up indexing
time, but the benefits are great.

Cheers

-- Marcus