Re: [PyIndexer] MySQL Stuff
From: Chris W. <ch...@ni...> - 2001-12-16 14:44:31
Marcus Collins wrote:

> > [ snip MySQL threads hanging on Windows ]
>
> Ick! Any update yet from MySQL AB? Does it freeze or spin?

It freezes. The response I got was essentially 'upgrade to Win2K Server',
which may be the problem, although I remain unconvinced...

> > Bit disappointed to see that MySQL's performance goes bad when you have
> > OR's in a WHERE, or something like that, which could make boolean stuff
> > unpleasant :-S
>
> Hang, yes... I completely forgot about that...

Anyone got ideas how/if we can mitigate this?

[snip remove prev_textindex_id key in DDL]

> Another thing you *should* be able to do is index on only the first n
> characters of the word,

What's the SQL I need to do that?

> which will (a) make your indexes smaller

that's not currently a problem...

> and (b) therefore allow more values to be stored in a key buffer of the
> same size.

Ah, yes...

> SELECT
>     COUNT(word),
>     LENGTH(word) AS len
> FROM dictionary
> GROUP BY len

gave:

+-------------+------+
| COUNT(word) | len  |
+-------------+------+
|          41 |    1 |
|        1224 |    2 |
|       15818 |    3 |
|       27076 |    4 |
|       21709 |    5 |
|       38393 |    6 |
|       26047 |    7 |
|       24425 |    8 |
|       18607 |    9 |
|       14351 |   10 |
|       10923 |   11 |
|        6931 |   12 |
|        5081 |   13 |
|        3835 |   14 |
|        2839 |   15 |
|        2144 |   16 |
|        1680 |   17 |
|        1216 |   18 |
|         873 |   19 |
|         649 |   20 |
|         434 |   21 |
|         281 |   22 |
|         208 |   23 |
|         138 |   24 |
|          84 |   25 |
|          65 |   26 |
|          45 |   27 |
|          27 |   28 |
|          15 |   29 |
|          10 |   30 |
|           6 |   31 |
|           4 |   32 |
|           4 |   33 |
|           2 |   34 |
|           1 |   36 |
|           1 |   44 |
|           2 |   46 |
|           1 |   60 |
|           1 |   64 |
|           1 |   99 |
+-------------+------+

So I guess we're looking at a key length of 8 or 9?

> Since MySQL's default key block size is 1024 bytes, you might want to
> select a size by which 1024 is divisible. Thirty-two or sixteen, maybe?

8? Although 32 would catch just about every word and is a lot less than
256 ;-)

> Remember that your word index will then not be unique, so don't declare
> it as such.

Well, my SQL green-ness again: how would I do that?
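For the archives, here's the sort of DDL I believe answers both questions at once. This is only a sketch, assuming the `dictionary` table and `word` column from the query above; the index name is invented, and I haven't run it against our actual schema:

```sql
-- Index only the first 16 characters of `word` (a "prefix index").
-- ADD INDEX (rather than ADD UNIQUE) declares it non-unique, which is
-- required once you index prefixes: distinct words can share the same
-- 16-character prefix.
ALTER TABLE dictionary
    ADD INDEX dictionary_word_prefix (word(16));
```

The same `word(16)` length syntax should also work in a `KEY (...)` clause inside `CREATE TABLE`, if we'd rather bake it into the DDL from the start.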
I'm wondering whether the filter-index architecture I'm building will help
with this?

http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/pythonindexer/Documentation/interfaces/filter.html?rev=1.1&content-type=text/vnd.viewcvs-markup

> The idea is that the index then just helps narrow down the search -- the
> actual rows are still read in order to compare with the word for which
> you're searching. For words that are shorter than your key length, you
> should see a real improvement; for longer words, there might be a loss
> in performance.

32 might be the best length then...

> COUNT(*) retrieves all columns from all rows that meet your criteria,
> and then counts the number of rows.

Ah, by 'didn't work' I meant I didn't get a single number back: I got a
two-column table. I think the GROUP BY clause was interfering; any ideas
how to stop that?

> This is just extra work for the server, as it retrieves additional
> columns that are then just thrown away. You can use COUNT(colname) and
> it won't retrieve all the columns.

Cool...

> However, the query still has to be executed, and the rows retrieved from
> the tables -- so using COUNT() doesn't cut down all that significantly
> on the time the query is going to take.

:-S

> I can't answer that, since I haven't used PostgreSQL in absolute ages,
> but it does implement a greater subset of SQL92 (including subselects,
> union, etc.). Not sure where table and index optimisation currently
> stands -- a while back the chief problem with pgsql was that you
> basically had to run vacuum every so often, and doing so in a
> production environment was not viable. But that was a MySQL argument and
> may have changed in any case.

Hehe, if I get a chance I'll take a look, if only out of curiosity...

> [ sub-selects vs joins ]
>
> > > Indeed. It's the damned join. That's why I was thinking again of
> > > sub-selects, which may be more efficient for very large result sets
> > > (but less so for small ones, I imagine).
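Actually, thinking about my own GROUP BY question some more: I suspect the fix is just to drop the GROUP BY (and the LENGTH column) when all I want is the total. A guess, using the same `dictionary` table as before:

```sql
-- One row, one column: the total number of words in the dictionary.
SELECT COUNT(word) FROM dictionary;

-- By contrast, with GROUP BY len the COUNT() is computed per group,
-- so you get one row per distinct word length -- which would explain
-- the two-column table I was seeing.
```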
Does MySQL support sub-selects yet? I certainly remember it as not doing
so...

[snip sub select]

Well, since the filter process will do this bit already, maybe that'll
cure the problem?

> You can also use sub-selects instead of the INTERSECT operator, by
> looking up the document ids of docs that meet your criteria. If you know
> ahead of time how many documents contain each word (by storing that info
> in the dictionary table, and doing a quick lookup first), then you can
> order your use of sub-selects in such a way as to start with the
> smallest result set and continue pruning it.

Ah, cunning, how do we do this without sub-selects?

> Hmmm, indeed. Is your framework such that you can plug in different SQL
> engines? It might be nice to play with PostgreSQL...

Sort of, though not with ease, mainly due to the different techniques
needed for each engine. In MySQL, I was planning to return a group of
join and where clauses from each index, which the engine would then
combine and perform as a single select. In Postgres, that's not
necessary: each index could just return a complete SELECT, which the
engine could combine using UNION, INTERSECT, etc. There could be a lot of
inertia in moving across to PostgreSQL, though, particularly for the
other parts of the project. I'll have to chat with the other guys working
on it on Monday...

> > > ... which means ugly LEFT OUTER JOINs and other hacks. Or
> > > application-side processing.
> >
> > or both :-S
>
> Well, application-side processing is not necessarily that bad... You can
> quite easily emulate sub-selects in your app, for example.

Ooo... do tell :-)

cheers,

Chris
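P.S. While waiting for the answer, here's roughly what I imagine "emulate sub-selects in your app" means: run the inner query first, then splice its results into the outer one. A sketch in Python (since the project is Python anyway), using an in-memory SQLite database as a stand-in for MySQL; the table and column names here are invented for illustration, not our real schema:

```python
import sqlite3

# Toy stand-in schema: a word dictionary plus a word->document index.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE dictionary (word_id INTEGER, word TEXT)")
cur.execute("CREATE TABLE textindex (document_id INTEGER, word_id INTEGER)")
cur.executemany("INSERT INTO dictionary VALUES (?, ?)",
                [(1, "python"), (2, "index")])
cur.executemany("INSERT INTO textindex VALUES (?, ?)",
                [(10, 1), (11, 1), (11, 2)])

# Step 1: the "inner" query -- look up the word ids for the search terms.
cur.execute("SELECT word_id FROM dictionary WHERE word IN ('python', 'index')")
word_ids = [row[0] for row in cur.fetchall()]

# Step 2: splice those ids into the outer query as an IN (...) list,
# instead of nesting a sub-SELECT the server would have to support.
placeholders = ", ".join("?" * len(word_ids))
cur.execute(
    "SELECT DISTINCT document_id FROM textindex "
    "WHERE word_id IN (%s)" % placeholders,
    word_ids)
doc_ids = sorted(row[0] for row in cur.fetchall())
print(doc_ids)  # -> [10, 11]
```

The pruning trick Marcus describes would then just be a matter of running the per-word id lookups in ascending order of document count, and intersecting the id sets in the application before issuing the final query.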