Re: [PyIndexer] MySQL Stuff
From: Chris W. <ch...@ni...> - 2001-12-16 14:44:31
Marcus Collins wrote:

> > [ snip MySQL threads hanging on Windows ]
>
> Ick! Any update yet from MySQL AB? Does it freeze or spin?

It freezes. The response I got was essentially 'upgrade to Win2K Server',
which may be the problem, although I remain unconvinced...

> > Bit disappointed to see that MySQL's performance goes bad when you have
> > OR's in a WHERE, or something like that, which could make boolean stuff
> > unpleasant :-S
>
> Hang, yes... I completely forgot about that...

Anyone got ideas how/if we can mitigate this?

[snip remove prev_textindex_id key in DDL]

> Another thing you *should* be able to do is index on only the first n
> characters of the word,

What's the SQL I need to do that?

> which will (a) make your indexes smaller

that's not currently a problem...

> and (b) therefore allow more values to be stored in a key buffer of the
> same size.

Ah, yes...

> SELECT
>     COUNT(word),
>     LENGTH(word) AS len
> FROM dictionary
> GROUP BY len

gave:

+-------------+------+
| COUNT(word) | len  |
+-------------+------+
|          41 |    1 |
|        1224 |    2 |
|       15818 |    3 |
|       27076 |    4 |
|       21709 |    5 |
|       38393 |    6 |
|       26047 |    7 |
|       24425 |    8 |
|       18607 |    9 |
|       14351 |   10 |
|       10923 |   11 |
|        6931 |   12 |
|        5081 |   13 |
|        3835 |   14 |
|        2839 |   15 |
|        2144 |   16 |
|        1680 |   17 |
|        1216 |   18 |
|         873 |   19 |
|         649 |   20 |
|         434 |   21 |
|         281 |   22 |
|         208 |   23 |
|         138 |   24 |
|          84 |   25 |
|          65 |   26 |
|          45 |   27 |
|          27 |   28 |
|          15 |   29 |
|          10 |   30 |
|           6 |   31 |
|           4 |   32 |
|           4 |   33 |
|           2 |   34 |
|           1 |   36 |
|           1 |   44 |
|           2 |   46 |
|           1 |   60 |
|           1 |   64 |
|           1 |   99 |
+-------------+------+

So I guess we're looking at a key length of 8 or 9?

> Since MySQL's default key block size is 1024 bytes, you might want to
> select a size by which 1024 is divisible. Thirty-two or sixteen, maybe?

8? Although 32 would catch just about every word and is a lot less than
256 ;-)

> Remember that your word index will then not be unique, so don't declare
> it as such.

Well, my SQL green-ness again: how would I do that?
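For the archives, here's the sort of DDL I believe answers both questions at once. This is only a sketch, assuming the `dictionary` table and `word` column from the query above; the index name is invented, and I haven't run it against our actual schema:

```sql
-- Index only the first 16 characters of `word` (a "prefix index").
-- ADD INDEX (rather than ADD UNIQUE) declares it non-unique, which is
-- required once you index prefixes: distinct words can share the same
-- 16-character prefix.
ALTER TABLE dictionary
    ADD INDEX dictionary_word_prefix (word(16));
```

The same `word(16)` length syntax should also work in a `KEY (...)` clause inside `CREATE TABLE`, if we'd rather bake it into the DDL from the start.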
I'm wondering whether the filter-index architecture I'm building will help
with this?

http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/pythonindexer/Documentation/interfaces/filter.html?rev=1.1&content-type=text/vnd.viewcvs-markup

> The idea is that the index then just helps narrow down the search -- the
> actual rows are still read in order to compare with the word for which
> you're searching. For words that are shorter than your key length, you
> should see a real improvement; for longer words, there might be a loss
> in performance.

32 might be the best length then...

> COUNT(*) retrieves all columns from all rows that meet your criteria,
> and then counts the number of rows.

Ah, by 'didn't work' I meant I didn't get a single number back: I got a
two-column table. I think the GROUP BY clause was interfering; any ideas
how to stop that?

> This is just extra work for the server, as it retrieves additional
> columns that are then just thrown away. You can use COUNT(colname) and
> it won't retrieve all the columns.

Cool...

> However, the query still has to be executed, and the rows retrieved from
> the tables -- so using COUNT() doesn't cut down all that significantly
> on the time the query is going to take.

:-S

> I can't answer that, since I haven't used PostgreSQL in absolute ages,
> but it does implement a greater subset of SQL92 (including subselects,
> union, etc.). Not sure where table and index optimisation currently
> stands -- a while back the chief problem with pgsql was that you
> basically had to run vacuum every so often, and doing so in a
> production environment was not viable. But that was a MySQL argument and
> may have changed in any case.

Hehe, if I get a chance I'll take a look, if only out of curiosity...

> [ sub-selects vs joins ]
>
> > > Indeed. It's the damned join. That's why I was thinking again of
> > > sub-selects, which may be more efficient for very large result sets
> > > (but less so for small ones, I imagine).
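Actually, thinking about my own GROUP BY question some more: I suspect the fix is just to drop the GROUP BY (and the LENGTH column) when all I want is the total. A guess, using the same `dictionary` table as before:

```sql
-- One row, one column: the total number of words in the dictionary.
SELECT COUNT(word) FROM dictionary;

-- By contrast, with GROUP BY len the COUNT() is computed per group,
-- so you get one row per distinct word length -- which would explain
-- the two-column table I was seeing.
```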
Does MySQL support sub-selects yet? I certainly remember it as not doing
so...

[snip sub select]

Well, since the filter process will do this bit already, maybe that'll
cure the problem?

> You can also use sub-selects instead of the INTERSECT operator, by
> looking up the document ids of docs that meet your criteria. If you know
> ahead of time how many documents contain each word (by storing that info
> in the dictionary table, and doing a quick lookup first), then you can
> order your use of sub-selects in such a way as to start with the
> smallest result set and continue pruning it.

Ah, cunning, how do we do this without sub-selects?

> Hmmm, indeed. Is your framework such that you can plug in different SQL
> engines? It might be nice to play with PostgreSQL...

Sort of, though not with ease, mainly due to the different techniques
needed for each engine. In MySQL, I was planning to return a group of
join and where clauses from each index, which the engine would then
combine and perform as a single select. In Postgres, that's not
necessary: each index could just return a complete SELECT, which the
engine could combine using UNION, INTERSECT, etc. There could be a lot of
inertia in moving across to PostgreSQL, though, particularly for the
other parts of the project. I'll have to chat with the other guys working
on it on Monday...

> > > ... which means ugly LEFT OUTER JOINs and other hacks. Or
> > > application-side processing.
> >
> > or both :-S
>
> Well, application-side processing is not necessarily that bad... You can
> quite easily emulate sub-selects in your app, for example.

Ooo... do tell :-)

cheers,

Chris
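P.S. While waiting for the answer, here's roughly what I imagine "emulate sub-selects in your app" means: run the inner query first, then splice its results into the outer one. A sketch in Python (since the project is Python anyway), using an in-memory SQLite database as a stand-in for MySQL; the table and column names here are invented for illustration, not our real schema:

```python
import sqlite3

# Toy stand-in schema: a word dictionary plus a word->document index.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE dictionary (word_id INTEGER, word TEXT)")
cur.execute("CREATE TABLE textindex (document_id INTEGER, word_id INTEGER)")
cur.executemany("INSERT INTO dictionary VALUES (?, ?)",
                [(1, "python"), (2, "index")])
cur.executemany("INSERT INTO textindex VALUES (?, ?)",
                [(10, 1), (11, 1), (11, 2)])

# Step 1: the "inner" query -- look up the word ids for the search terms.
cur.execute("SELECT word_id FROM dictionary WHERE word IN ('python', 'index')")
word_ids = [row[0] for row in cur.fetchall()]

# Step 2: splice those ids into the outer query as an IN (...) list,
# instead of nesting a sub-SELECT the server would have to support.
placeholders = ", ".join("?" * len(word_ids))
cur.execute(
    "SELECT DISTINCT document_id FROM textindex "
    "WHERE word_id IN (%s)" % placeholders,
    word_ids)
doc_ids = sorted(row[0] for row in cur.fetchall())
print(doc_ids)  # -> [10, 11]
```

The pruning trick Marcus describes would then just be a matter of running the per-word id lookups in ascending order of document count, and intersecting the id sets in the application before issuing the final query.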