Re: [PyIndexer] Searching and Indexing Redux II
From: Marcus C. <ma...@wr...> - 2001-12-15 23:13:30
On Thu, 13 Dec 2001 at 17:09:23 +0000, Chris Withers wrote:
> Marcus Collins wrote:

[ snip MySQL threads hanging on Windows ]

Ick! Any update yet from MySQL AB? Does it freeze or spin?

[ snip tuning indexing ]

See some comments below.

[ snip InnoDB ]

[ ORs in WHERE not using index ]

> Bit disappointed to see that MySQL's performance goes bad when you
> have ORs in a WHERE, or something like that, which could make boolean
> stuff unpleasant :-S

Hang, yes... I completely forgot about that... I believe it does a
table scan in that case! (This could be another tick next to
sub-selects; see later.) I'm not sure what the behaviour of the other
database engines is...

[ Redundant prev_textindex_id key in DDL ]

> > Nope, it appears only in possible_keys and ref. The keys in key are
> > used for the index lookup, as I understand it.
>
> Indeed, so we could drop it?

You can indeed drop it; since it's not used in the SELECTs, dropping it
won't harm them at all, and it'll speed up your population.

Another thing you *should* be able to do is index on only the first n
characters of the word, which will (a) make your indexes smaller and
(b) therefore allow more values to be stored in a key buffer of the
same size. It should also reduce the number of seeks needed, as more
keys will be stored in each block. Result: faster population, faster
lookup, better cache hit ratio.

You'd probably have to determine a good figure empirically, but seeing
the distribution of word lengths should help:

  SELECT COUNT(word), LENGTH(word) AS len
  FROM dictionary
  GROUP BY len

will give you the distribution of word lengths within your dictionary.
Since MySQL's default key block size is 1024 bytes, you might want to
select a size by which 1024 is divisible. Thirty-two or sixteen, maybe?

Remember that your word index will then not be unique, so don't declare
it as such. The idea is that the index then just helps narrow down the
search -- the actual rows are still read in order to compare against
the word you're searching for. For words that are shorter than your key
length, you should see a real improvement; for longer words, there
might be a loss in performance.
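To make that concrete, something like the following should do it
(untested; I'm guessing the existing UNIQUE index is named 'word', and
sixteen is just a guess pending the distribution query above):

  -- Drop the old UNIQUE index (assuming it's named 'word')
  ALTER TABLE dictionary DROP INDEX word;

  -- Index only the first 16 characters of the word; note that it's
  -- deliberately *not* declared UNIQUE
  ALTER TABLE dictionary ADD INDEX word_prefix (word(16));

MySQL uses the prefix to narrow down the candidate rows and then
compares the full column value, so the results stay correct either way.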
[ Counting result columns ]

> > You can use COUNT(colname | * ), but that's basically going to save
> > time only because there's less data for it to allocate memory for
> > and transfer :-S.
>
> How does that differ from just COUNT(*), which didn't work for me?

COUNT(*) retrieves all columns from all rows that meet your criteria,
and then counts the number of rows. This is just extra work for the
server, as it retrieves additional columns that are then just thrown
away. You can use COUNT(colname) and it won't retrieve all the columns.
However, the query still has to be executed, and the rows retrieved
from the tables -- so using COUNT() doesn't cut down all that
significantly on the time the query is going to take.

[ functionality in PostgreSQL ]

> Nope... I wonder if it'd be good to look at PostgreSQL here?

I can't answer that, since I haven't used PostgreSQL in absolute ages,
but it does implement a greater subset of SQL92 (including sub-selects,
UNION, etc.). I'm not sure where table and index optimisation currently
stands -- a while back the chief problem with pgsql was that you
basically had to run vacuum every so often, and that doing so in a
production environment was not viable. But that was a MySQL argument
and may have changed in any case.

[ sub-selects vs joins ]

> > Indeed. It's the damned join. That's why I was thinking again of
> > sub-selects, which may be more efficient for very large result sets
> > (but less so for small ones, I imagine).
>
> What's the difference between a sub-select and a join? (spot the SQL
> newbie ;-)

Well, it looks like what is taking the time is the large-ish self joins
that have to be done, where MySQL ends up creating large temporary
tables. If you can use sub-selects, you basically still generate an
intermediate temporary table, but it should be much smaller than in the
case of a large join. The (current?) query then becomes:

  SELECT documents.document_id
  FROM documents, dictionary AS d, textindex AS i
  WHERE i.document_id = documents.document_id
    AND d.word_id = i.word_id
    AND d.word = '2000'
    AND i.prev_word_id IN
      ( SELECT d.word_id
        FROM dictionary AS d
        WHERE d.word = 'windows' )
  GROUP BY documents.document_id

So basically, instead of joining a 20 000 row table to a 20 000 row
table, you're finding a single row from a 20 000 row table, and then
finding (from an index) all the rows in the text index that reference
that row.

You can also use sub-selects instead of the INTERSECT operator, by
looking up the document IDs of docs that meet your criteria. If you
know ahead of time how many documents contain each word (by storing
that info in the dictionary table, and doing a quick lookup first),
then you can order your use of sub-selects in such a way as to start
with the smallest result set and continue pruning it.

[ UNION and INTERSECT operators ]

> Nope, but it is in Postgres...

Hmmm, indeed. Is your framework such that you can plug in different SQL
engines? It might be nice to play with PostgreSQL...

[ snip UNION not yet in MySQL-stable ]

> > ... which means ugly LEFT OUTER JOINs and other hacks. Or
> > application-side processing.
>
> or both :-S

Well, application-side processing is not necessarily that bad... You
can quite easily emulate sub-selects in your app, for example; see the
P.S. below for a sketch.

Cheers

-- Marcus
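P.S. To show what I mean by emulating sub-selects in the app -- a rough
sketch using the query above (untested, and the literal 42 is just a
placeholder for whatever the first query actually returns):

  -- Step 1: the application runs the inner query by itself
  SELECT word_id FROM dictionary WHERE word = 'windows';

  -- Step 2: the application splices the returned id(s) into an IN list
  SELECT documents.document_id
  FROM documents, dictionary AS d, textindex AS i
  WHERE i.document_id = documents.document_id
    AND d.word_id = i.word_id
    AND d.word = '2000'
    AND i.prev_word_id IN (42)
  GROUP BY documents.document_id;

Same result as the sub-select, at the cost of one extra round trip.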