[PyIndexer] Searching and Indexing Redux II

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 422-6466

I'm cc'ing the list on this, in case anyone finds this interesting or can help,
sorry about the lack of context for anyone reading it...

Marcus Collins wrote:
> 
> On Tue,  4 Dec 2001 at 00:31:10 +0000, Chris Withers wrote:
> 
> > Indeed it is and I've found it particularly helpful, especially when diagnosing
> > MySQL hangs...
> 
> So why was MySQL hanging? 

No idea yet, still trying to help the MySQL AB guys reproduce the problem...

> Also useful is that you can kill (mysqladmin
> kill) the offending threads.

Hmmm... it hangs in such a way that 'mysqladmin kill' doesnt' work, only the
windows task manager can kill it :-(

> You can still use ANALYZE TABLE with BDB, and apparently it does do some
> stuff. That's the equivalent of the {,my}isamchk -a yourtablename.

Ah, okay...

<snip insert speed>

Well, we can optimize indexign speed later :-)

> It would be useful to know how those eight seconds are being used --
> what proportion is used by the Python script; by the MySQL server; by
> the OS (on I/O). 

Well, most of it is inside the loop that does the inserts, so I'm guessing split
between MySQLdb, MySQL and the OS.

> Your best bet in determining this would be to test on
> your Linux box. times (under bash(1)) will give you wall clock time, CPU
> time, and system time.

Hmmm... will have to give that a go when it comes to optimising...

> What's the SQL nut book like?

Looks pretty cool, haven't had to use it in anger yet...

[snip RDB normalisation]
> FWIW, the chapter on Performance and Design in the Oracle reference
> begins, "No major application will run in Third Normal Form."

Hehe, 'cept searchign and indexing ;-)

> > [slow InnoDB]
> > > Yep, it'll be interesting to see. Did you broach the subject at all to
> > > the list?
> >
> > Yup, and one of the Innobase guys has got back to me, so we shall see...
> 
> What did he have to say, BTW?

Turns out the recent windows binaries were compiled with what was effectively a
"make_me_cripplingly_slow" flag, so I might try the next release with InnoDB and
see what happens...

> Are they using some sort of hash function? I'm guessing yes, since
> they're also compressing. Probably a page index system. That's
> impressive performace, but how long does it take them to index? Also, is
> their index updatable?

Dunno, guess 'll find out when I get there...

> > That said, I think the SQL engine could be pretty f'ing cool...

Hmmm... well, we have searching not-quite-quick-enough and indexing slow so we
need some kind of tweak still.

Anyone got any ideas?

Bit disappointed to see that MySQL's performance goes bad when you have OR's in
a WHERE, or something like that, which could make boolean stuff unpleasant :-S

> Quick reply now, and hopefully more in-depth after Friday... Do you want
> to post this on the SF list maybe, or personal?

SF List it is :-)

> > > BTW, I'm not sure if the key on prev_textindex_id is necessary... Not
> > > sure how MySQL handles that query... Wanna post EXPLAIN output?

<snip>

> Nope, it appears only in possible_keys and ref. The keys in key are used
> for the index lookup, as I understand it.

Indeed, so we could drop it?

> > searching for "db 2" brings back 138 results...
> >
> > I hope there's not a bug lurking here :-S
> 
> Well, that's obviously the first thing to determine. Maybe post the
> actual SQL used, the DDL, and the EXPLAIN output?

Indeed. I think I'm gonna tackle this from the other end now.
Get the framework and unit tests up and running before returning to build the
proper SQL engine now that I know what it should smell like...

> > > - search terms that would return many, many results (impacts GROUP BY)
> >
> > well, when "windows 2000" brought back 1-2K results, it took 10 seconds
> 
> Hmmm... Ten secs still too long.

Indeed :-( (seen higher figures than that since...)

> You can use COUNT(colname | * ), but that's basically going to save time
> only because there's less data for it to allocate memory for and
> transfer :-S.

How does that differ from just COUNT (*), which didn't work for me?

> I'm also leaning towards experimenting with sub-selects for this, but
> not sure if they're yet implemented in MySQL (don't have latest version
> of manual handy...)

Nope... I wonder if it'd be good to look at PostGreSQL here?

> Hmm, was reading through the MySQLdb code the other day, and saw that
> the cursor.execute method can take a list. Don't remember details, but
> may be worth looking at.

Will do...

> Indeed. It's the damned join. That's why I was thinking again of
> sub-selects, which may be more efficient for very large results sets
> (but less so for small ones, I imagine).

What's the difference between a sub-select and a join? (spot the SQL newbie ;-)

> I think I mentioned UNION and INTERSECT a while back, but looking at my
> (out-of-date) MySQL reference, it looks like it's not supported.

Nope, but it is in Postgres...

> And it playing right now, it doesn't look like UNION is in the latest
> version either :-S

It is, but the latest unstable release...

> ... which means ugly LEFT OUTER JOINs and other hacks. Or
> application-side processing.

or both :-S

Well, hope to hear from Marcus some time, but if anyone else can dive in in the
meantime, all the better...

cheers,

Chris