Re: [PyIndexer] Thoughts on MySQL Implementation

On Sun, 16 Dec 2001 at 09:00:53 -0800, Casey Duncan wrote:

[ Data population and LOAD DATA ]

This would definitely be a lot faster than individual INSERTs. The main
reason for the speed-up is that in this case MySQL treats the whole
load as one transaction, so it defers flushing the key buffers until
all the data is loaded. The same can be accomplished (although not quite
as efficiently) by wrapping the whole batch of INSERTs in BEGIN/COMMIT
(if using transaction-safe BDB or InnoDB tables) or in LOCK TABLES/
UNLOCK TABLES (if using non-transaction-safe tables).
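
As a rough sketch of that bracketing (not tested against a live server):
the helper below only builds the statement list, so it runs without a
database. The function name wrap_batch and the table and column names
are made up for illustration; only BEGIN, COMMIT, LOCK TABLES ... WRITE
and UNLOCK TABLES are actual MySQL statements.

```python
def wrap_batch(inserts, table, transactional):
    """Return the INSERT statements bracketed for bulk loading.

    transactional=True  -> BEGIN / COMMIT (BDB or InnoDB tables)
    transactional=False -> LOCK TABLES ... WRITE / UNLOCK TABLES
    """
    if transactional:
        head, tail = "BEGIN", "COMMIT"
    else:
        head, tail = "LOCK TABLES %s WRITE" % table, "UNLOCK TABLES"
    return [head] + list(inserts) + [tail]


# Hypothetical table and column names. The values are fixed literals
# here; real code should use the driver's parameter binding instead of
# string formatting.
rows = [("python", 1), ("mysql", 2)]
inserts = [
    "INSERT INTO textindex (word, doc_id) VALUES ('%s', %d)" % row
    for row in rows
]

for stmt in wrap_batch(inserts, "textindex", transactional=False):
    print(stmt + ";")
```

The statements would then be sent over one connection, in order, since
the table lock (or the transaction) belongs to that connection.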

[ App-side processing ]

> At the search end, it seems to me that app side
> processing will be best for positional matches, such
> as for phrases. I'd imagine such processing should be
> eventually coded in C, but it should be acceptable in
> Python if done efficiently (like using Python arrays,
> IISets or some-such).

I'd tend to do more of the processing on the database side, as
described, in which case use of C or Python becomes less of an issue.

> http://www.python.org/doc/essays/list2str.html

Very interesting and inspiring :-)

> I agree that storing a document count for each word
> could help with optimizing since you could start with
> the smallest dataset first and prune it from there.
> Perhaps IISets could be used to get UNION/INTERSECT
> functionality efficiently if mySQL can't do it for
> you.

I reckon MySQL can do it pretty efficiently given a set of ids, but
doing processing such as coalescing duplicates on the app side would
likely be faster than letting the MySQL query parser and optimiser do
it for you. That said, the improvement would only be a matter of
milliseconds; the main benefit of app-side processing is in eliminating
joins.
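
For what it's worth, a minimal sketch of the app-side variant, assuming
the document ids for each word have already been fetched as plain
Python lists (the function name and the sample ids are invented). It
uses built-in sets rather than IISets; the sets coalesce duplicates, and
sorting by size means the smallest dataset is pruned first, as you
suggest.

```python
def intersect_smallest_first(postings):
    """Intersect lists of document ids, starting from the shortest one."""
    if not postings:
        return set()
    # Converting to sets drops duplicate ids; sorting by size means the
    # running result can never grow beyond the smallest set.
    ordered = sorted((set(ids) for ids in postings), key=len)
    result = ordered[0]
    for ids in ordered[1:]:
        result = result & ids
        if not result:
            break  # nothing left to match, skip the remaining words
    return result


# Invented document ids for three query words.
docs_per_word = [
    [1, 2, 2, 3, 7, 9],
    [2, 3, 9],
    [3, 4, 9, 9],
]
print(sorted(intersect_smallest_first(docs_per_word)))
```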

[ prefix indexes ]

> Sometimes less is more 8^). 

:-)

> impossible to tell which would be faster on a given
> architecture without real-world testing.

Indeed. I don't see a knob to change the size of the key block, though,
so AFAIK it's set at 1024 bytes. Caching will likely have different
effects on different platforms, and the setting of key_buffer_size will
have an effect on MySQL's own caching.

[ MySQL's FULLTEXT index ]

> http://www.mysql.org/documentation/mysql/bychapter/manual_Reference.html#Fulltext_Search

AFAIK, it doesn't do phrase matching. Their relevance calculation looks
very interesting, though.

Cheers

-- Marcus