Re: [PyIndexer] MySQL Stuff
From: Marcus C. <ma...@wr...> - 2001-12-16 22:01:10
On Sun, 16 Dec 2001 at 14:44:46 +0000, Chris Withers wrote:
> Marcus Collins wrote:

[ MySQL threads hanging on Windows ]

> It freezes, the response I got was essentially 'upgrade to Win2K Server', which
> may be the problem, although I remain unconvinced...

Is this happening consistently with the same query/sequence of actions? Is it
only one thread that hangs?

Does MySQL's trace log tell you anything useful if you start mysqld with
--debug? You'll also want to start it with --standalone and direct the trace
log to a big disk. See:

<URL:http://www.mysql.org/documentation/mysql/bychapter/manual_porting.html#Making_trace_files>

(AFAIK, the Windows binary is compiled with trace debugging built in.)

--log might also be useful, just to get an indication of whether the problem
arises once the query is being executed (in which case you'll know what query
is breaking it), or if the thread freezes even before the query is attempted
(in which case there's possibly a problem with the threading). Other than
that, hopefully the mysql-windows list (IIRC) can help more...

[ ORs in WHERE clause ]

I distinctly remember reading about this somewhere, and I cannot remember
where... Any pointers? I can't find anything about this in the manual, but I
know that (under certain conditions?) MySQL won't use the index if you use an
OR in your WHERE clause.

[ prefix key indexes ]

> > Another thing you *should* be able to do is index on only the first n
> > characters of the word,
>
> What's the SQL I need to do that?

<URL:http://www.mysql.com/documentation/mysql/bychapter/manual_Reference.html#CREATE_INDEX>

    CREATE INDEX word_idx ON dictionary (word(16))

will create a (non-unique) index keyed on the first 16 characters of
dictionary.word. Your application should ensure that the values inserted are
unique.

[ snip word length distribution ]

> So I guess we're looking at a key length of 8 or 9?
> > Since MySQL's default key block size is 1024 bytes, you might want to
> > select a size by which 1024 is divisible. Thirty-two or sixteen, maybe?
>
> 8? Although 32 would catch just about every word and is a lot less than 256 ;-)

Possibly. The *only* way you're going to find out is empirically, although I
think it'll be 8, 16, or 32... If 8 does very well, and 16 and 32 do only
marginally worse, I'd be tempted to go with the bigger length. Especially if
the engine is likely to be used in, e.g., indexing medical literature and the
like.

> > Remember that your word index will then not be unique, so don't declare
> > it as such.
>
> well, my SQL green-ness again, how would I do that?

As above. Indexes are never unique unless you specify them as such (UNIQUE
INDEX | UNIQUE KEY) or they are the PRIMARY KEY.

> I'm wondering whether the filter-index architecture I'm building will help this?
> http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/pythonindexer/Documentation/interfaces/filter.html?rev=1.1&content-type=text/vnd.viewcvs-markup

[ snip benefits of prefix key indexes ]

[ COUNT aggregate function ]

> > COUNT(*) retrieves all columns from all rows that meet your criteria,
> > and then counts the number of rows.
>
> Ah, by 'didn't work', I meant I didn't get a single number back, I got a two
> column table, I think the GROUP BY clause was interfering, any ideas how to
> stop that?

Yes, it would've been the GROUP BY. To get the behaviour you're after, you
would have to use COUNT(DISTINCT document_id) and drop the GROUP BY. The
DISTINCT will ensure that no duplicate doc ids are counted.

When you use GROUP BY, each group of equal values for the columns in the GROUP
BY clause is subjected to your aggregate function. That would be why you got
the two-column result.
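For what it's worth, the difference can be demonstrated with a tiny runnable
sketch. This uses Python's sqlite3 as a stand-in for MySQL, and the two-column
textindex table is an assumed miniature of the real schema, not the project's
actual one:

```python
import sqlite3

# Assumed miniature of the textindex table: duplicate rows appear when a
# word occurs more than once in the same document.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE textindex (word_id INTEGER, document_id INTEGER)")
con.executemany("INSERT INTO textindex VALUES (?, ?)",
                [(1, 10), (1, 10), (1, 20), (2, 20)])

# With GROUP BY, COUNT() is applied once per group, so the result is one
# row per document -- the "two column table" effect:
rows = con.execute(
    "SELECT document_id, COUNT(*) FROM textindex "
    "WHERE word_id = 1 GROUP BY document_id ORDER BY document_id").fetchall()
print(rows)   # [(10, 2), (20, 1)]

# Dropping the GROUP BY and using COUNT(DISTINCT ...) yields the single
# number of matching documents:
(n,) = con.execute(
    "SELECT COUNT(DISTINCT document_id) FROM textindex "
    "WHERE word_id = 1").fetchone()
print(n)      # 2
```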
See:

<URL:http://www.mysql.com/documentation/mysql/bychapter/manual_Reference.html#Group_by_functions>

> > However, the query still has to be executed, and the rows retrieved from
> > the tables -- so using COUNT() doesn't cut down all that significantly
> > on the time the query is going to take.
>
> :-S

There's really no way of predicting ahead of time how many results a query is
going to return... You could look at the relative number of occurrences of the
search terms, but that would by no means be foolproof.

What you can do is use LIMIT to return only a certain number of rows, which
can help somewhat because (a) if you're ordering the results, MySQL will stop
once it's found that many results; and (b) there's less data to send to the
client.

[ snip PostGreSQL ]

[ sub-selects vs joins ]

> does MySQL support sub-selects yet? I certainly remember it as not doing so...

No, it doesn't:

<URL:http://www.mysql.com/doc/A/N/ANSI_diff_Sub-selects.html>

> Well, since the filter process will do this bit already, maybe that'll cure
> the problem?

> > You can also use sub-selects instead of the INTERSECT operator, by looking
> > up the document id's of docs that meet your criteria. If you know ahead
> > of time how many documents contain each word (by storing that info in
> > the dictionary table, and doing a quick lookup first), then you can
> > order your use of sub-selects in such a way as to start with the
> > smallest result set and continue pruning it.
>
> Ah, cunning, how do we do this without sub-selects?

Well, this would require app-side processing in any case, so refer to the end
of the post...

> > Hmmm, indeed. Is your framework such that you can plug in different SQL
> > engines? It might be nice to play with PostGreSQL...
>
> Sort of, not with ease, mainly due to the different techniques needed for
> each engine.
> In MySQL, I was planning to return a group of join and where clauses from
> each index, which the engine would then combine and perform as a single
> select. In PostGres, that's not necessary, each index could just return a
> complete SELECT which the engine could combine using UNION, INTERSECT, etc.
> Could be a lot of inertia moving across to PostGreSQL though, particularly
> the other parts of the project. I'll have to chat with the other guys
> working on it on Monday...

Well, we can definitely do better using MySQL, but it's looking like it will
require app-side processing. The only MySQL-only alternative I can think of at
this stage would be to use temporary heap tables to store intermediate
results, and then join these tables (which use hash indexing and are in
memory, and therefore very fast). Since the tables are temporary, you can do
pretty much anything with them:

- UNIONs would be implemented by additional INSERT INTO ... SELECT FROM
  statements. Ditto for those ORs, possibly.
- JOIN operations would be much faster, since the tables would be small and
  they use hashes for indexing.
- You can specify a maximum number of rows that the table can hold (and
  therefore limit the number of results returned).

I don't actually think this would be the most efficient -- app-side processing
probably will be more efficient -- but if you're interested, then see:

<URL:http://www.mysql.com/documentation/mysql/bychapter/manual_Reference.html#CREATE_TABLE>
<URL:http://www.mysql.com/documentation/mysql/bychapter/manual_Table_types.html#HEAP>
<URL:http://www.mysql.com/documentation/mysql/bychapter/manual_Reference.html#INSERT_SELECT>

It's also nice that this (excepting specifying the table type) still complies
with SQL92, which makes provision for temporary tables.

[ Application-side processing ]

> > Well, application-side processing is not necessarily that bad... You can
> > quite easily emulate sub-selects in your app, for example.
>
> Ooo...
> do tell :-)

Take for example the query:

    SELECT documents.document_id
    FROM documents, dictionary AS d, textindex AS i
    WHERE i.document_id = documents.document_id
      AND d.word_id = i.word_id
      AND d.word = '2000'
      AND i.prev_word_id IN
          (SELECT d.word_id FROM dictionary AS d WHERE d.word = 'windows')
    GROUP BY documents.document_id

1. Start with your inner-most SELECT:

       SELECT word_id FROM dictionary WHERE word = 'windows'

2. Retrieve the word id returned (it *should* be a single id).

3. Form the outer SELECT:

       SELECT documents.document_id
       FROM documents, dictionary AS d, textindex AS i
       WHERE i.document_id = documents.document_id
         AND d.word_id = i.word_id
         AND d.word = '2000'
         AND i.prev_word_id = the_id_returned_for_windows
       GROUP BY documents.document_id

That eliminates one of the present self joins, replacing it with a very cheap
single-row result lookup. You could nest any number of queries like that.

When it comes to boolean searching, you can apply the same logic, and retrieve
all document ids that you need to test further. Then generate an SQL list and
use the IN membership operator. For example, if you have the above query and
need to test further for articles that also refer to FreeBSD, you'd perform
two additional SELECTs:

1. Get the list of documents referring to FreeBSD:

       SELECT documents.document_id
       FROM documents, dictionary AS d, textindex AS i
       WHERE i.document_id = documents.document_id
         AND d.word_id = i.word_id
         AND d.word = 'freebsd'
       GROUP BY documents.document_id

2.
Assuming the document ids returned are (1, 3, 5, 7, 11, 13):

       SELECT documents.document_id
       FROM documents, dictionary AS d, textindex AS i
       WHERE i.document_id = documents.document_id
         AND d.word_id = i.word_id
         AND d.word = '2000'
         AND i.prev_word_id = the_id_returned_for_windows
         AND documents.document_id IN (1, 3, 5, 7, 11, 13)
       GROUP BY documents.document_id

That's a total of three queries (although you would do well to use an
additional one first to determine the optimum order), one of which has no
joins, and the others only two. If you don't mind doing the app-side
processing, then this is the way to go. Just make sure that, if you're using
lots of document ids in your list, your query buffer (query_buffer_size) is
large enough.

I think you should also be able to use this *very* efficiently instead of the
INTERSECT operator. If MySQL sees the following:

    documents.document_id IN (1, 3, 5, 7, 11, 13)
    AND documents.document_id IN (5, 7, 11, 13, 17, 19, 23)

it will do constant folding and eliminate the impossibilities. You can also
manipulate your app's lists to get around the OR problem by merging multiple
doc id lists into one.

Cheers

-- Marcus
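As a postscript: the app-side emulation described above can be sketched end to
end in Python. This uses sqlite3 as a stand-in for MySQL, and the toy schema,
data, and the set-based merge at the end are all illustrative assumptions, not
the project's actual code:

```python
import sqlite3

# Assumed miniature schema from the thread: dictionary, documents, and a
# textindex of (word_id, prev_word_id, document_id) triples.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dictionary (word_id INTEGER PRIMARY KEY, word TEXT);
    CREATE TABLE documents (document_id INTEGER PRIMARY KEY);
    CREATE TABLE textindex (word_id INTEGER, prev_word_id INTEGER,
                            document_id INTEGER);
""")
con.executemany("INSERT INTO dictionary VALUES (?, ?)",
                [(1, 'windows'), (2, '2000'), (3, 'freebsd')])
con.executemany("INSERT INTO documents VALUES (?)", [(1,), (2,), (3,)])
# doc 1: "windows 2000"; doc 2: "windows 2000" and "freebsd"; doc 3: "freebsd"
con.executemany("INSERT INTO textindex VALUES (?, ?, ?)",
                [(2, 1, 1), (2, 1, 2), (3, None, 2), (3, None, 3)])

# Step 1: the emulated inner-most SELECT -- look up the id for 'windows'.
(windows_id,) = con.execute(
    "SELECT word_id FROM dictionary WHERE word = 'windows'").fetchone()

# Step 2/3: splice that id into the outer query as a constant, eliminating
# the sub-select (and one self join).
phrase_docs = [r[0] for r in con.execute(
    "SELECT i.document_id FROM textindex AS i, dictionary AS d "
    "WHERE d.word_id = i.word_id AND d.word = '2000' "
    "AND i.prev_word_id = ? "
    "GROUP BY i.document_id ORDER BY i.document_id", (windows_id,))]

# Boolean AND with 'freebsd': fetch its doc ids, then generate an IN (...)
# list from the ids found so far.
freebsd_docs = [r[0] for r in con.execute(
    "SELECT i.document_id FROM textindex AS i, dictionary AS d "
    "WHERE d.word_id = i.word_id AND d.word = 'freebsd' "
    "GROUP BY i.document_id ORDER BY i.document_id")]
placeholders = ", ".join("?" * len(freebsd_docs))
result = [r[0] for r in con.execute(
    "SELECT i.document_id FROM textindex AS i, dictionary AS d "
    "WHERE d.word_id = i.word_id AND d.word = '2000' "
    "AND i.prev_word_id = ? "
    f"AND i.document_id IN ({placeholders}) "
    "GROUP BY i.document_id ORDER BY i.document_id",
    [windows_id, *freebsd_docs])]
print(result)  # [2]

# The same lists stand in for INTERSECT (and, by union, for OR): merge the
# id lists in the app before building the IN (...) clause.
assert set(phrase_docs) & set(freebsd_docs) == set(result)
```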