Re: [PyIndexer] MySQL Stuff
From: Marcus C. <ma...@wr...> - 2001-12-16 22:01:10
On Sun, 16 Dec 2001 at 14:44:46 +0000, Chris Withers wrote:
> Marcus Collins wrote:

[ MySQL threads hanging on Windows ]

> It freezes, the response I got was essentially 'upgrade to Win2K Server', which
> may be the problem, although I remain unconvinced...

Is this happening consistently with the same query/sequence of actions? Is it
only one thread that hangs?

Does MySQL's trace log tell you anything useful if you start mysqld with
--debug? You'll also want to start it with --standalone and direct the trace
log to a big disk. See:

<URL:http://www.mysql.org/documentation/mysql/bychapter/manual_porting.html#Making_trace_files>

(AFAIK, the Windows binary is compiled with trace debugging built in.)

--log might also be useful, just to get an indication of whether the problem
arises once the query is being executed (in which case you'll know what query
is breaking it), or if the thread freezes even before the query is attempted
(in which case there's possibly a problem with the threading). Other than
that, hopefully the mysql-windows list (IIRC) can help more...

[ ORs in WHERE clause ]

I distinctly remember reading about this somewhere, and I cannot remember
where... Any pointers? I can't find anything about this in the manual, but I
know that (under certain conditions?) MySQL won't use the index if you use an
OR in your WHERE clause.

[ prefix key indexes ]

> > Another thing you *should* be able to do is index on only the first n
> > characters of the word,
>
> What's the SQL I need to do that?

<URL:http://www.mysql.com/documentation/mysql/bychapter/manual_Reference.html#CREATE_INDEX>

    CREATE INDEX word_idx ON dictionary (word(16))

will create a (non-unique) index keyed on the first 16 characters of
dictionary.word. Your application should ensure that the values inserted are
unique.

[ snip word length distribution ]

> So I guess we're looking at a key length of 8 or 9?
> > Since MySQL's default key block size is 1024 bytes, you might want to
> > select a size by which 1024 is divisible. Thirty-two or sixteen, maybe?
>
> 8? Although 32 would catch just about every word and is a lot less than 256 ;-)

Possibly. The *only* way you're going to find out is empirically, although I
think it'll be 8, 16, or 32... If 8 does very well, and 16 and 32 do only
marginally worse, I'd be tempted to go with the bigger length. Especially if
the engine is likely to be used in, e.g., indexing medical literature and the
like.

> > Remember that your word index will then not be unique, so don't declare
> > it as such.
>
> well, my SQL green-ness again, how would I do that?

As above. Indexes are never unique unless you specify them as such (UNIQUE
INDEX | UNIQUE KEY) or they are the PRIMARY KEY.

> I'm wondering whether the filter-index architecture I'm building will help this?
> http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/pythonindexer/Documentation/interfaces/filter.html?rev=1.1&content-type=text/vnd.viewcvs-markup

[ snip benefits of prefix key indexes ]

[ COUNT aggregate function ]

> > COUNT(*) retrieves all columns from all rows that meet your criteria,
> > and then counts the number of rows.
>
> Ah, by 'didn't work', I meant I didn't get a single number back, I got a two
> column table, I think the GROUP BY clause was interfering, any ideas how to
> stop that?

Yes, it would've been the GROUP BY. To get the behaviour you're after, you
would have to use COUNT(DISTINCT document_id) and drop the GROUP BY. The
DISTINCT will ensure that no duplicate doc ids are counted.

When you use GROUP BY, each group of equal values for the columns in the GROUP
BY clause is subjected to your aggregate function. That would be why you got
the two-column result.
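For what it's worth, the difference can be demonstrated with a tiny runnable
sketch. This uses Python's sqlite3 as a stand-in for MySQL, and the two-column
textindex table is an assumed miniature of the real schema, not the project's
actual one:

```python
import sqlite3

# Assumed miniature of the textindex table: duplicate rows appear when a
# word occurs more than once in the same document.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE textindex (word_id INTEGER, document_id INTEGER)")
con.executemany("INSERT INTO textindex VALUES (?, ?)",
                [(1, 10), (1, 10), (1, 20), (2, 20)])

# With GROUP BY, COUNT() is applied once per group, so the result is one
# row per document -- the "two column table" effect:
rows = con.execute(
    "SELECT document_id, COUNT(*) FROM textindex "
    "WHERE word_id = 1 GROUP BY document_id ORDER BY document_id").fetchall()
print(rows)   # [(10, 2), (20, 1)]

# Dropping the GROUP BY and using COUNT(DISTINCT ...) yields the single
# number of matching documents:
(n,) = con.execute(
    "SELECT COUNT(DISTINCT document_id) FROM textindex "
    "WHERE word_id = 1").fetchone()
print(n)      # 2
```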
See:

<URL:http://www.mysql.com/documentation/mysql/bychapter/manual_Reference.html#Group_by_functions>

> > However, the query still has to be executed, and the rows retrieved from
> > the tables -- so using COUNT() doesn't cut down all that significantly
> > on the time the query is going to take.
>
> :-S

There's really no way of predicting ahead of time how many results a query is
going to return... You could look at the relative number of occurrences of the
search terms, but that would by no means be foolproof.

What you can do is use LIMIT to return only a certain number of rows, which
can help somewhat because (a) if you're ordering the results, MySQL will stop
once it's found that many results; and (b) there's less data to send to the
client.

[ snip PostGreSQL ]

[ sub-selects vs joins ]

> does MySQL support sub-selects yet? I certainly remember it as not doing so...

No, it doesn't:

<URL:http://www.mysql.com/doc/A/N/ANSI_diff_Sub-selects.html>

> Well, since the filter process will do this bit already, maybe that'll cure
> the problem?

> > You can also use sub-selects instead of the INTERSECT operator, by looking
> > up the document id's of docs that meet your criteria. If you know ahead
> > of time how many documents contain each word (by storing that info in
> > the dictionary table, and doing a quick lookup first), then you can
> > order your use of sub-selects in such a way as to start with the
> > smallest result set and continue pruning it.
>
> Ah, cunning, how do we do this without sub-selects?

Well, this would require app-side processing in any case, so refer to the end
of the post...

> > Hmmm, indeed. Is your framework such that you can plug in different SQL
> > engines? It might be nice to play with PostGreSQL...
>
> Sort of, not with ease, mainly due to the different techniques needed for
> each engine.
> In MySQL, I was planning to return a group of join and where clauses from
> each index, which the engine would then combine and perform as a single
> select. In PostGres, that's not necessary, each index could just return a
> complete SELECT which the engine could combine using UNION, INTERSECT, etc.
> Could be a lot of inertia moving across to PostGreSQL though, particularly
> the other parts of the project. I'll have to chat with the other guys
> working on it on Monday...

Well, we can definitely do better using MySQL, but it's looking like it will
require app-side processing. The only MySQL-only alternative I can think of at
this stage would be to use temporary heap tables to store intermediate
results, and then join these tables (which use hash indexing and are in
memory, and therefore very fast). Since the tables are temporary, you can do
pretty much anything with them:

- UNIONs would be implemented by additional INSERT INTO ... SELECT FROM
  statements. Ditto for those ORs, possibly.
- JOIN operations would be much faster, since the tables would be small and
  they use hashes for indexing.
- You can specify a maximum number of rows that the table can hold (and
  therefore limit the number of results returned).

I don't actually think this would be the most efficient -- app-side processing
probably will be more efficient -- but if you're interested, then see:

<URL:http://www.mysql.com/documentation/mysql/bychapter/manual_Reference.html#CREATE_TABLE>
<URL:http://www.mysql.com/documentation/mysql/bychapter/manual_Table_types.html#HEAP>
<URL:http://www.mysql.com/documentation/mysql/bychapter/manual_Reference.html#INSERT_SELECT>

It's also nice that this (excepting specifying the table type) still complies
with SQL92, which makes provision for temporary tables.

[ Application-side processing ]

> > Well, application-side processing is not necessarily that bad... You can
> > quite easily emulate sub-selects in your app, for example.
>
> Ooo...
> do tell :-)

Take for example the query:

    SELECT documents.document_id
    FROM documents, dictionary AS d, textindex AS i
    WHERE i.document_id = documents.document_id
      AND d.word_id = i.word_id
      AND d.word = '2000'
      AND i.prev_word_id IN
          (SELECT d.word_id FROM dictionary AS d WHERE d.word = 'windows')
    GROUP BY documents.document_id

1. Start with your inner-most SELECT:

       SELECT word_id FROM dictionary WHERE word = 'windows'

2. Retrieve the word id returned (it *should* be a single id).

3. Form the outer SELECT:

       SELECT documents.document_id
       FROM documents, dictionary AS d, textindex AS i
       WHERE i.document_id = documents.document_id
         AND d.word_id = i.word_id
         AND d.word = '2000'
         AND i.prev_word_id = the_id_returned_for_windows
       GROUP BY documents.document_id

That eliminates one of the present self joins, replacing it with a very cheap
single-row result lookup. You could nest any number of queries like that.

When it comes to boolean searching, you can apply the same logic, and retrieve
all document ids that you need to test further. Then generate an SQL list and
use the IN membership operator. For example, if you have the above query and
need to test further for articles that also refer to FreeBSD, you'd perform
two additional SELECTs:

1. Get the list of documents referring to FreeBSD:

       SELECT documents.document_id
       FROM documents, dictionary AS d, textindex AS i
       WHERE i.document_id = documents.document_id
         AND d.word_id = i.word_id
         AND d.word = 'freebsd'
       GROUP BY documents.document_id

2.
Assuming the document ids returned are (1, 3, 5, 7, 11, 13):

       SELECT documents.document_id
       FROM documents, dictionary AS d, textindex AS i
       WHERE i.document_id = documents.document_id
         AND d.word_id = i.word_id
         AND d.word = '2000'
         AND i.prev_word_id = the_id_returned_for_windows
         AND documents.document_id IN (1, 3, 5, 7, 11, 13)
       GROUP BY documents.document_id

That's a total of three queries (although you would do well to use an
additional one first to determine the optimum order), one of which has no
joins, and the others only two. If you don't mind doing the app-side
processing, then this is the way to go. Just make sure that, if you're using
lots of document ids in your list, your query buffer (query_buffer_size) is
large enough.

I think you should also be able to use this *very* efficiently instead of the
INTERSECT operator. If MySQL sees the following:

    documents.document_id IN (1, 3, 5, 7, 11, 13)
    AND documents.document_id IN (5, 7, 11, 13, 17, 19, 23)

it will do constant folding and eliminate the impossibilities. You can also
manipulate your app's lists to get around the OR problem by merging multiple
doc id lists into one.

Cheers

-- Marcus
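As a postscript: the app-side emulation described above can be sketched end to
end in Python. This uses sqlite3 as a stand-in for MySQL, and the toy schema,
data, and the set-based merge at the end are all illustrative assumptions, not
the project's actual code:

```python
import sqlite3

# Assumed miniature schema from the thread: dictionary, documents, and a
# textindex of (word_id, prev_word_id, document_id) triples.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dictionary (word_id INTEGER PRIMARY KEY, word TEXT);
    CREATE TABLE documents (document_id INTEGER PRIMARY KEY);
    CREATE TABLE textindex (word_id INTEGER, prev_word_id INTEGER,
                            document_id INTEGER);
""")
con.executemany("INSERT INTO dictionary VALUES (?, ?)",
                [(1, 'windows'), (2, '2000'), (3, 'freebsd')])
con.executemany("INSERT INTO documents VALUES (?)", [(1,), (2,), (3,)])
# doc 1: "windows 2000"; doc 2: "windows 2000" and "freebsd"; doc 3: "freebsd"
con.executemany("INSERT INTO textindex VALUES (?, ?, ?)",
                [(2, 1, 1), (2, 1, 2), (3, None, 2), (3, None, 3)])

# Step 1: the emulated inner-most SELECT -- look up the id for 'windows'.
(windows_id,) = con.execute(
    "SELECT word_id FROM dictionary WHERE word = 'windows'").fetchone()

# Step 2/3: splice that id into the outer query as a constant, eliminating
# the sub-select (and one self join).
phrase_docs = [r[0] for r in con.execute(
    "SELECT i.document_id FROM textindex AS i, dictionary AS d "
    "WHERE d.word_id = i.word_id AND d.word = '2000' "
    "AND i.prev_word_id = ? "
    "GROUP BY i.document_id ORDER BY i.document_id", (windows_id,))]

# Boolean AND with 'freebsd': fetch its doc ids, then generate an IN (...)
# list from the ids found so far.
freebsd_docs = [r[0] for r in con.execute(
    "SELECT i.document_id FROM textindex AS i, dictionary AS d "
    "WHERE d.word_id = i.word_id AND d.word = 'freebsd' "
    "GROUP BY i.document_id ORDER BY i.document_id")]
placeholders = ", ".join("?" * len(freebsd_docs))
result = [r[0] for r in con.execute(
    "SELECT i.document_id FROM textindex AS i, dictionary AS d "
    "WHERE d.word_id = i.word_id AND d.word = '2000' "
    "AND i.prev_word_id = ? "
    f"AND i.document_id IN ({placeholders}) "
    "GROUP BY i.document_id ORDER BY i.document_id",
    [windows_id, *freebsd_docs])]
print(result)  # [2]

# The same lists stand in for INTERSECT (and, by union, for OR): merge the
# id lists in the app before building the IN (...) clause.
assert set(phrase_docs) & set(freebsd_docs) == set(result)
```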