Re: [PyIndexer] Searching and Indexing Redux II
From: Marcus C. <ma...@wr...> - 2001-12-15 23:13:30
On Thu, 13 Dec 2001 at 17:09:23 +0000, Chris Withers wrote:
> Marcus Collins wrote:
[ snip MySQL threads hanging on Windows ]
Ick! Any update yet from MySQL AB? Does it freeze or spin?
[ snip tuning indexing ]
See some comments below.
[ snip InnoDB ]
[ ORs in WHERE not using index ]
> Bit disappointed to see that MySQL's performance goes bad when you have OR's in
> a WHERE, or something like that, which could make boolean stuff unpleasant :-S
Hang, yes... I completely forgot about that...
I believe it does a table scan in that case! (This could be another tick
next to sub-selects; see later) I'm not sure what the behaviour of the
other database engines is...
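To illustrate what I mean (the word_id values here are made up, but the
columns are from your DDL): a query like

SELECT document_id FROM textindex
WHERE word_id = 42 OR prev_word_id = 17;

would, as I understand it, scan the whole table, whereas each of these
can use its own index -- at the cost of merging the two result sets in
the application:

SELECT document_id FROM textindex WHERE word_id = 42;
SELECT document_id FROM textindex WHERE prev_word_id = 17;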
[ Redundant prev_textindex_id key in DDL ]
> > Nope, it appears only in possible_keys and ref. The keys in key are used
> > for the index lookup, as I understand it.
>
> Indeed, so we could drop it?
You can indeed drop it; since it's not used in the SELECTs, dropping it
won't hurt them at all, and it'll speed up your population.
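Something like this should do it (assuming the key really is named
prev_textindex_id in your DDL):

ALTER TABLE textindex DROP INDEX prev_textindex_id;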
Another thing you *should* be able to do is index on only the first n
characters of the word, which will (a) make your indexes smaller and (b)
therefore allow more values to be stored in a key buffer of the same
size. It should also reduce the number of seeks needed, as more keys
will be stored in each block. Result: faster population, faster lookup,
better cache hit ratio. You'd probably have to determine a good figure
empirically, but seeing the distribution of word lengths should help:
SELECT
    COUNT(word),
    LENGTH(word) AS len
FROM dictionary
GROUP BY len
will give you the distribution of word lengths within your dictionary.
Since MySQL's default key block size is 1024 bytes, you might want to
select a size by which 1024 is divisible. Thirty-two or sixteen, maybe?
Remember that your word index will then not be unique, so don't declare
it as such.
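For example, with a cut-off of sixteen characters (a sketch only -- I'm
assuming your existing key on word is simply named word):

ALTER TABLE dictionary
    DROP INDEX word,
    ADD INDEX word_prefix (word(16));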
The idea is that the index then just helps narrow down the search -- the
actual rows are still read in order to compare against the word
you're searching for. For words that are shorter than your key length, you
should see a real improvement; for longer words, there might be a loss
in performance.
[ Counting result columns ]
> > You can use COUNT(colname | * ), but that's basically going to save time
> > only because there's less data for it to allocate memory for and
> > transfer :-S.
>
> How does that differ from just COUNT (*), which didn't work for me?
COUNT(*) retrieves all columns from all rows that meet your criteria,
and then counts the number of rows. This is just extra work for the
server, as it retrieves additional columns that are then just thrown
away. You can use COUNT(colname) and it won't retrieve all the columns.
However, the query still has to be executed, and the rows retrieved from
the tables -- so using COUNT() doesn't cut down all that significantly
on the time the query is going to take.
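So, using your tables, something like this counts the matching rows
without dragging the other columns along:

SELECT COUNT(i.document_id)
FROM dictionary AS d, textindex AS i
WHERE d.word_id = i.word_id AND d.word = '2000';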
[ functionality in PostGreSQL ]
> Nope... I wonder if it'd be good to look at PostGreSQL here?
I can't answer that, since I haven't used PostGreSQL in absolutely ages,
but it does implement a greater subset of SQL92 (including subselects,
union, etc.) Not sure where table and index optimisation currently
stands -- a while back the chief complaint about pgsql was that you
basically had to run vacuum every so often, and that doing so in a
production environment was not viable. But that was a MySQL-camp
argument and may have changed in any case.
[ sub-selects vs joins ]
> > Indeed. It's the damned join. That's why I was thinking again of
> > sub-selects, which may be more efficient for very large result sets
> > (but less so for small ones, I imagine).
>
> What's the difference between a sub-select and a join? (spot the SQL newbie ;-)
Well, it looks like what is taking the time is the large-ish self joins
that have to be done, where MySQL ends up creating large temporary tables.
If you can use sub-selects, you basically still generate an intermediate
temporary table, but it should be much smaller than in the case of a large
join. The (current?) query then becomes:
SELECT
    documents.document_id
FROM
    documents,
    dictionary AS d,
    textindex AS i
WHERE
    i.document_id = documents.document_id AND
    d.word_id = i.word_id AND
    d.word = '2000' AND
    i.prev_word_id IN
        (SELECT
            d2.word_id
        FROM
            dictionary AS d2
        WHERE
            d2.word = 'windows'
        )
GROUP BY
    documents.document_id
So basically, instead of joining a 20 000 row table to a 20 000 row
table, you're finding a single row from a 20 000 row table, and then
finding (from an index) all the rows in the text index that reference
that row.
You can also use sub-selects instead of the INTERSECT operator, by looking
up the document IDs of docs that meet your criteria. If you know ahead
of time how many documents contain each word (by storing that info in
the dictionary table, and doing a quick lookup first), then you can
order your use of sub-selects in such a way as to start with the
smallest result set and continue pruning it.
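A sketch of what I mean, pretending dictionary already had such a
doc_count column and that 'windows' turned out to be the rarer of the
two words:

-- quick lookup first, to decide which word prunes hardest:
SELECT word, doc_count FROM dictionary
WHERE word IN ('windows', '2000')
ORDER BY doc_count;

-- then start from the rarer word and intersect via sub-selects:
SELECT i.document_id
FROM textindex AS i
WHERE
    i.word_id =
        (SELECT word_id FROM dictionary WHERE word = 'windows') AND
    i.document_id IN
        (SELECT document_id FROM textindex
         WHERE word_id =
            (SELECT word_id FROM dictionary WHERE word = '2000'))
GROUP BY i.document_id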
[ UNION and INTERSECT operators ]
> Nope, but it is in Postgres...
Hmmm, indeed. Is your framework such that you can plug in different SQL
engines? It might be nice to play with PostGreSQL...
[ snip UNION not yet in MySQL -stable ]
> > ... which means ugly LEFT OUTER JOINs and other hacks. Or
> > application-side processing.
>
> or both :-S
Well, application-side processing is not necessarily that bad... You can
quite easily emulate sub-selects in your app, for example.
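E.g., to emulate the sub-select in the query above, the app would run
the inner query on its own first:

SELECT word_id FROM dictionary WHERE word = 'windows';

-- ...and then splice whatever that returned (say it was 42) into the
-- outer query as a literal list:
SELECT documents.document_id
FROM documents, dictionary AS d, textindex AS i
WHERE
    i.document_id = documents.document_id AND
    d.word_id = i.word_id AND
    d.word = '2000' AND
    i.prev_word_id IN (42)
GROUP BY documents.document_id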
Cheers
-- Marcus