Thread: [PyIndexer] Searching and Indexing Redux II
From: Chris W. <ch...@ni...> - 2001-12-13 17:11:02
I'm cc'ing the list on this, in case anyone finds this interesting or can
help; sorry about the lack of context for anyone reading it...

Marcus Collins wrote:
>
> On Tue, 4 Dec 2001 at 00:31:10 +0000, Chris Withers wrote:
>
> > Indeed it is and I've found it particularly helpful, especially when
> > diagnosing MySQL hangs...
>
> So why was MySQL hanging?

No idea yet, still trying to help the MySQL AB guys reproduce the problem...

> Also useful is that you can kill (mysqladmin kill) the offending threads.

Hmmm... it hangs in such a way that 'mysqladmin kill' doesn't work, only the
Windows task manager can kill it :-(

> You can still use ANALYZE TABLE with BDB, and apparently it does do some
> stuff. That's the equivalent of the {,my}isamchk -a yourtablename.

Ah, okay...

<snip insert speed>

Well, we can optimise indexing speed later :-)

> It would be useful to know how those eight seconds are being used --
> what proportion is used by the Python script; by the MySQL server; by
> the OS (on I/O).

Well, most of it is inside the loop that does the inserts, so I'm guessing
it's split between MySQLdb, MySQL and the OS.

> Your best bet in determining this would be to test on your Linux box.
> times (under bash(1)) will give you wall clock time, CPU time, and
> system time.

Hmmm... will have to give that a go when it comes to optimising...

> What's the SQL nut book like?

Looks pretty cool, haven't had to use it in anger yet...

[snip RDB normalisation]

> FWIW, the chapter on Performance and Design in the Oracle reference
> begins, "No major application will run in Third Normal Form."

Hehe, 'cept searching and indexing ;-)

> > [slow InnoDB]
>
> > > Yep, it'll be interesting to see. Did you broach the subject at all to
> > > the list?
> >
> > Yup, and one of the Innobase guys has got back to me, so we shall see...
>
> What did he have to say, BTW?

Turns out the recent Windows binaries were compiled with what was effectively
a "make_me_cripplingly_slow" flag, so I might try the next release with
InnoDB and see what happens...

> Are they using some sort of hash function? I'm guessing yes, since
> they're also compressing. Probably a page index system. That's
> impressive performance, but how long does it take them to index? Also, is
> their index updatable?

Dunno, guess I'll find out when I get there...

> > That said, I think the SQL engine could be pretty f'ing cool...

Hmmm... well, we have searching not-quite-quick-enough and indexing slow, so
we need some kind of tweak still. Anyone got any ideas?

Bit disappointed to see that MySQL's performance goes bad when you have ORs
in a WHERE, or something like that, which could make boolean stuff
unpleasant :-S

> Quick reply now, and hopefully more in-depth after Friday... Do you want
> to post this on the SF list maybe, or personal?

SF list it is :-)

> > > BTW, I'm not sure if the key on prev_textindex_id is necessary... Not
> > > sure how MySQL handles that query... Wanna post EXPLAIN output?

<snip>

> Nope, it appears only in possible_keys and ref. The keys in key are used
> for the index lookup, as I understand it.

Indeed, so we could drop it?

> > searching for "db 2" brings back 138 results...
> >
> > I hope there's not a bug lurking here :-S
>
> Well, that's obviously the first thing to determine. Maybe post the
> actual SQL used, the DDL, and the EXPLAIN output?

Indeed. I think I'm gonna tackle this from the other end now.
Get the framework and unit tests up and running before returning to build
the proper SQL engine, now that I know what it should smell like...

> > > - search terms that would return many, many results (impacts GROUP BY)
> >
> > well, when "windows 2000" brought back 1-2K results, it took 10 seconds
>
> Hmmm... Ten secs still too long.

Indeed :-( (seen higher figures than that since...)

> You can use COUNT(colname | * ), but that's basically going to save time
> only because there's less data for it to allocate memory for and
> transfer :-S.

How does that differ from just COUNT(*), which didn't work for me?

> I'm also leaning towards experimenting with sub-selects for this, but
> not sure if they're yet implemented in MySQL (don't have latest version
> of manual handy...)

Nope... I wonder if it'd be good to look at PostGreSQL here?

> Hmm, was reading through the MySQLdb code the other day, and saw that
> the cursor.execute method can take a list. Don't remember details, but
> may be worth looking at.

Will do...

> Indeed. It's the damned join. That's why I was thinking again of
> sub-selects, which may be more efficient for very large result sets
> (but less so for small ones, I imagine).

What's the difference between a sub-select and a join? (spot the SQL
newbie ;-)

> I think I mentioned UNION and INTERSECT a while back, but looking at my
> (out-of-date) MySQL reference, it looks like it's not supported.

Nope, but it is in Postgres...

> And in playing right now, it doesn't look like UNION is in the latest
> version either :-S

It is, but only in the latest unstable release...

> ... which means ugly LEFT OUTER JOINs and other hacks. Or
> application-side processing.

or both :-S

Well, hope to hear from Marcus some time, but if anyone else can dive in in
the meantime, all the better...

cheers,

Chris
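[ sketch: batching the index inserts with MySQLdb ]

The MySQLdb feature referred to above is presumably cursor.executemany(),
which takes one INSERT statement plus a sequence of parameter tuples and
sends the batch in a single call. A minimal sketch, assuming the textindex
columns used in the queries elsewhere in this thread; connection parameters
and the row values are illustrative:

    import MySQLdb

    conn = MySQLdb.connect(db="pyindexer")  # connection params illustrative
    cursor = conn.cursor()

    # rows would be built by the indexer: one
    # (document_id, word_id, prev_word_id) tuple per word occurrence.
    rows = [(1, 42, 17), (1, 43, 42), (1, 44, 43)]

    # One executemany() call replaces the per-row execute() calls in the
    # insert loop, cutting the per-statement overhead.
    cursor.executemany(
        "INSERT INTO textindex (document_id, word_id, prev_word_id)"
        " VALUES (%s, %s, %s)",
        rows)
    conn.commit()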
From: Chris W. <ch...@ni...> - 2001-12-15 13:00:35
The SQL code referred to in this mail can be found at:

http://sourceforge.net/tracker/index.php?func=detail&aid=489417&group_id=41154&atid=430556

cheers,

Chris
From: Marcus C. <ma...@wr...> - 2001-12-15 23:13:30
On Thu, 13 Dec 2001 at 17:09:23 +0000, Chris Withers wrote:
> Marcus Collins wrote:

[ snip MySQL threads hanging on Windows ]

Ick! Any update yet from MySQL AB? Does it freeze or spin?

[ snip tuning indexing ]

See some comments below.

[ snip InnoDB ]

[ ORs in WHERE not using index ]

> Bit disappointed to see that MySQL's performance goes bad when you have
> ORs in a WHERE, or something like that, which could make boolean stuff
> unpleasant :-S

Hang, yes... I completely forgot about that... I believe it does a table
scan in that case! (This could be another tick next to sub-selects; see
later.) I'm not sure what the behaviour of the other database engines is...

[ Redundant prev_textindex_id key in DDL ]

> > Nope, it appears only in possible_keys and ref. The keys in key are used
> > for the index lookup, as I understand it.
>
> Indeed, so we could drop it?

You can indeed drop it; as it's not used in the SELECTs it won't harm them
at all, and it'll speed up your population.

Another thing you *should* be able to do is index on only the first n
characters of the word, which will (a) make your indexes smaller and (b)
therefore allow more values to be stored in a key buffer of the same size.
It should also reduce the number of seeks needed, as more keys will be
stored in each block. Result: faster population, faster lookup, better
cache hit ratio.

You'd probably have to determine a good figure empirically, but seeing the
distribution of word lengths should help:

    SELECT
        COUNT(word),
        LENGTH(word) AS len
    FROM dictionary
    GROUP BY len

will give you the distribution of word lengths within your dictionary.
Since MySQL's default key block size is 1024 bytes, you might want to
select a size by which 1024 is divisible. Thirty-two or sixteen, maybe?

Remember that your word index will then not be unique, so don't declare it
as such. The idea is that the index then just helps narrow down the search
-- the actual rows are still read in order to compare with the word for
which you're searching. For words that are shorter than your key length,
you should see a real improvement; for longer words, there might be a loss
in performance.

[ Counting result columns ]

> > You can use COUNT(colname | * ), but that's basically going to save time
> > only because there's less data for it to allocate memory for and
> > transfer :-S.
>
> How does that differ from just COUNT(*), which didn't work for me?

COUNT(*) retrieves all columns from all rows that meet your criteria, and
then counts the number of rows. This is just extra work for the server, as
it retrieves additional columns that are then just thrown away. You can use
COUNT(colname) and it won't retrieve all the columns. However, the query
still has to be executed, and the rows retrieved from the tables -- so
using COUNT() doesn't cut down all that significantly on the time the query
is going to take.

[ functionality in PostGreSQL ]

> Nope... I wonder if it'd be good to look at PostGreSQL here?

I can't answer that, since I haven't used PostGreSQL in absolute ages, but
it does implement a greater subset of SQL92 (including sub-selects, UNION,
etc.). Not sure where table and index optimisation currently stands -- a
while back the chief problem with pgsql was that you basically had to run
vacuum every so often, and that doing so in a production environment was
not viable. But that was a MySQL argument and may have changed in any case.

[ sub-selects vs joins ]

> > Indeed. It's the damned join. That's why I was thinking again of
> > sub-selects, which may be more efficient for very large result sets
> > (but less so for small ones, I imagine).
>
> What's the difference between a sub-select and a join? (spot the SQL
> newbie ;-)

Well, it looks like what is taking the time is the large-ish self joins
that have to be done, where MySQL ends up creating large temporary tables.
If you can use sub-selects, you basically still generate an intermediate
temporary table, but it should be much smaller than in the case of a large
join. The (current?) query then becomes:

    SELECT documents.document_id
    FROM documents, dictionary AS d, textindex AS i
    WHERE i.document_id = documents.document_id
        AND d.word_id = i.word_id
        AND d.word = '2000'
        AND i.prev_word_id IN
            (SELECT d.word_id
             FROM dictionary AS d
             WHERE d.word = 'windows')
    GROUP BY documents.document_id

So basically, instead of joining a 20 000 row table to a 20 000 row table,
you're finding a single row from a 20 000 row table, and then finding (from
an index) all the rows in the text index that reference that row.

You can also use sub-selects instead of the INTERSECT operator, by looking
up the document ids of docs that meet your criteria. If you know ahead of
time how many documents contain each word (by storing that info in the
dictionary table, and doing a quick lookup first), then you can order your
use of sub-selects in such a way as to start with the smallest result set
and continue pruning it.

[ UNION and INTERSECT operators ]

> Nope, but it is in Postgres...

Hmmm, indeed. Is your framework such that you can plug in different SQL
engines? It might be nice to play with PostGreSQL...

[ snip UNION not yet in MySQL -stable ]

> > ... which means ugly LEFT OUTER JOINs and other hacks. Or
> > application-side processing.
>
> or both :-S

Well, application-side processing is not necessarily that bad... You can
quite easily emulate sub-selects in your app, for example.

Cheers

-- Marcus
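[ sketch: ordering lookups from rarest word to most common ]

A minimal sketch of the pruning order described above, assuming a
hypothetical doc_count column in the dictionary table (not in the current
DDL -- the indexer would have to maintain it) and MySQLdb:

    import MySQLdb

    def prune_order(cursor, words):
        # Look up how many documents contain each word; doc_count is an
        # assumed column holding that figure per word.
        counts = {}
        for word in words:
            cursor.execute(
                "SELECT doc_count FROM dictionary WHERE word = %s", (word,))
            row = cursor.fetchone()
            counts[word] = row[0] if row else 0
        # Rarest word first, so each successive sub-select (or emulated
        # sub-select) starts from the smallest result set and prunes it.
        return sorted(words, key=lambda w: counts[w])

    conn = MySQLdb.connect(db="pyindexer")  # connection params illustrative
    print(prune_order(conn.cursor(), ["windows", "2000", "freebsd"]))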
From: Chris W. <ch...@ni...> - 2001-12-16 14:44:31
Marcus Collins wrote:
>
> [ snip MySQL threads hanging on Windows ]
>
> Ick! Any update yet from MySQL AB? Does it freeze or spin?

It freezes. The response I got was essentially 'upgrade to Win2K Server',
which may be the problem, although I remain unconvinced...

> > Bit disappointed to see that MySQL's performance goes bad when you have
> > ORs in a WHERE, or something like that, which could make boolean stuff
> > unpleasant :-S
>
> Hang, yes... I completely forgot about that...

Anyone got ideas how/if we can mitigate this?

[snip remove prev_textindex_id key in DDL]

> Another thing you *should* be able to do is index on only the first n
> characters of the word,

What's the SQL I need to do that?

> which will (a) make your indexes smaller

that's not currently a problem...

> and (b) therefore allow more values to be stored in a key buffer of the
> same size.

Ah, yes...

> SELECT
>     COUNT(word),
>     LENGTH(word) AS len
> FROM dictionary
> GROUP BY len

gave:

+-------------+------+
| COUNT(word) | len  |
+-------------+------+
|          41 |    1 |
|        1224 |    2 |
|       15818 |    3 |
|       27076 |    4 |
|       21709 |    5 |
|       38393 |    6 |
|       26047 |    7 |
|       24425 |    8 |
|       18607 |    9 |
|       14351 |   10 |
|       10923 |   11 |
|        6931 |   12 |
|        5081 |   13 |
|        3835 |   14 |
|        2839 |   15 |
|        2144 |   16 |
|        1680 |   17 |
|        1216 |   18 |
|         873 |   19 |
|         649 |   20 |
|         434 |   21 |
|         281 |   22 |
|         208 |   23 |
|         138 |   24 |
|          84 |   25 |
|          65 |   26 |
|          45 |   27 |
|          27 |   28 |
|          15 |   29 |
|          10 |   30 |
|           6 |   31 |
|           4 |   32 |
|           4 |   33 |
|           2 |   34 |
|           1 |   36 |
|           1 |   44 |
|           2 |   46 |
|           1 |   60 |
|           1 |   64 |
|           1 |   99 |
+-------------+------+

So I guess we're looking at a key length of 8 or 9?

> Since MySQL's default key block size is 1024 bytes, you might want to
> select a size by which 1024 is divisible. Thirty-two or sixteen, maybe?

8? Although 32 would catch just about every word and is a lot less than
256 ;-)

> Remember that your word index will then not be unique, so don't declare
> it as such.

well, my SQL green-ness again, how would I do that? I'm wondering whether
the filter-index architecture I'm building will help this?
http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/pythonindexer/Documentation/interfaces/filter.html?rev=1.1&content-type=text/vnd.viewcvs-markup

> The idea is that the index then just helps narrow down the search -- the
> actual rows are still read in order to compare with the word for which
> you're searching. For words that are shorter than your key length, you
> should see a real improvement; for longer words, there might be a loss
> in performance.

32 might be the best length then...

> COUNT(*) retrieves all columns from all rows that meet your criteria,
> and then counts the number of rows.

Ah, by 'didn't work' I meant I didn't get a single number back: I got a
two-column table. I think the GROUP BY clause was interfering; any ideas
how to stop that?

> This is just extra work for the server, as it retrieves additional
> columns that are then just thrown away. You can use COUNT(colname) and
> it won't retrieve all the columns.

Cool...

> However, the query still has to be executed, and the rows retrieved from
> the tables -- so using COUNT() doesn't cut down all that significantly
> on the time the query is going to take.

:-S

> I can't answer that, since I haven't used PostGreSQL in absolute ages,
> but it does implement a greater subset of SQL92 (including sub-selects,
> UNION, etc.). Not sure where table and index optimisation currently
> stands -- a while back the chief problem with pgsql was that you
> basically had to run vacuum every so often, and that doing so in a
> production environment was not viable. But that was a MySQL argument and
> may have changed in any case.

Hehe, if I get a chance I'll take a look, if only out of curiosity...

> [ sub-selects vs joins ]
>
> > > Indeed. It's the damned join. That's why I was thinking again of
> > > sub-selects, which may be more efficient for very large result sets
> > > (but less so for small ones, I imagine).

does MySQL support sub-selects yet? I certainly remember it as not doing
so...

[snip sub select]

Well, since the filter process will do this bit already, maybe that'll cure
the problem?

> You can also use sub-selects instead of the INTERSECT operator, by
> looking up the document ids of docs that meet your criteria. If you know
> ahead of time how many documents contain each word (by storing that info
> in the dictionary table, and doing a quick lookup first), then you can
> order your use of sub-selects in such a way as to start with the
> smallest result set and continue pruning it.

Ah, cunning, how do we do this without sub-selects?

> Hmmm, indeed. Is your framework such that you can plug in different SQL
> engines? It might be nice to play with PostGreSQL...

Sort of, not with ease, mainly due to the different techniques needed for
each engine. In MySQL, I was planning to return a group of join and where
clauses from each index, which the engine would then combine and perform as
a single select. In Postgres, that's not necessary: each index could just
return a complete SELECT which the engine could combine using UNION,
INTERSECT, etc. Could be a lot of inertia moving across to PostGreSQL
though, particularly for the other parts of the project. I'll have to chat
with the other guys working on it on Monday...

> > > ... which means ugly LEFT OUTER JOINs and other hacks. Or
> > > application-side processing.
> >
> > or both :-S
>
> Well, application-side processing is not necessarily that bad... You can
> quite easily emulate sub-selects in your app, for example.

Ooo... do tell :-)

cheers,

Chris
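[ sketch: turning the distribution into a prefix length ]

For what it's worth, accumulating that distribution makes the trade-off
concrete: of the 225,192 dictionary words, roughly 69% fit in 8 characters
and roughly 97% in 16. A quick sketch, using the counts from the table
above (truncated at length 20; lengths 21-99 contribute only ~1,300 more
words):

    # (length, count) pairs from the distribution above.
    dist = [(1, 41), (2, 1224), (3, 15818), (4, 27076), (5, 21709),
            (6, 38393), (7, 26047), (8, 24425), (9, 18607), (10, 14351),
            (11, 10923), (12, 6931), (13, 5081), (14, 3835), (15, 2839),
            (16, 2144), (17, 1680), (18, 1216), (19, 873), (20, 649)]
    total = 225192  # sum over the whole table, including lengths > 20

    running = 0
    for length, count in dist:
        running += count
        print("words of <= %2d chars: %5.1f%%"
              % (length, 100.0 * running / total))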
From: Marcus C. <ma...@wr...> - 2001-12-16 22:01:10
On Sun, 16 Dec 2001 at 14:44:46 +0000, Chris Withers wrote:
> Marcus Collins wrote:

[ MySQL threads hanging on Windows ]

> It freezes. The response I got was essentially 'upgrade to Win2K Server',
> which may be the problem, although I remain unconvinced...

Is this happening consistently with the same query/sequence of actions? Is
it only one thread that hangs? Does MySQL's trace log tell you anything
useful if you start mysqld with --debug? You'll also want to start it with
--standalone and direct the trace log to a big disk. See
<URL:http://www.mysql.org/documentation/mysql/bychapter/manual_porting.html#Making_trace_files>
(AFAIK, the Windows binary is compiled with trace debugging built in.)

--log might also be useful, just to get an indication of whether the
problem arises once the query is being executed (in which case you'll know
what query is breaking it), or if the thread freezes even before the query
is attempted (in which case there's possibly a problem with the threading).
Other than that, hopefully the mysql-windows list (IIRC) can help more...

[ ORs in WHERE clause ]

I distinctly remember reading about this somewhere, and I cannot remember
where... Any pointers? I can't find anything about this in the manual, but
I know that (under certain conditions?) MySQL won't use the index if you
use an OR in your WHERE clause.

[ prefix key indexes ]

> > Another thing you *should* be able to do is index on only the first n
> > characters of the word,
>
> What's the SQL I need to do that?

<URL:http://www.mysql.com/documentation/mysql/bychapter/manual_Reference.html#CREATE_INDEX>

    CREATE INDEX word_idx ON dictionary (word(16))

will create a (non-unique) index keyed on the first 16 characters of
dictionary.word. Your application should ensure that the values inserted
are unique.

[ snip word length distribution ]

> So I guess we're looking at a key length of 8 or 9?
>
> > Since MySQL's default key block size is 1024 bytes, you might want to
> > select a size by which 1024 is divisible. Thirty-two or sixteen, maybe?
>
> 8? Although 32 would catch just about every word and is a lot less than
> 256 ;-)

Possibly. The *only* way you're going to find out is empirically, although
I think it'll be 8, 16, or 32... If 8 does very well, and 16 and 32 do only
marginally worse, I'd be tempted to go with the bigger length. Especially
if the engine is likely to be used in, e.g., indexing medical literature
and the like.

> > Remember that your word index will then not be unique, so don't declare
> > it as such.
>
> well, my SQL green-ness again, how would I do that?

As above. Indexes are never unique unless you specify them as such (UNIQUE
INDEX | UNIQUE KEY) or they are the PRIMARY KEY.

> I'm wondering whether the filter-index architecture I'm building will
> help this?
> http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/pythonindexer/Documentation/interfaces/filter.html?rev=1.1&content-type=text/vnd.viewcvs-markup

[ snip benefits of prefix key indexes ]

[ COUNT aggregate function ]

> > COUNT(*) retrieves all columns from all rows that meet your criteria,
> > and then counts the number of rows.
>
> Ah, by 'didn't work' I meant I didn't get a single number back: I got a
> two-column table. I think the GROUP BY clause was interfering; any ideas
> how to stop that?

Yes, it would've been the GROUP BY. To get the behaviour you're after, you
would have to use COUNT(DISTINCT document_id) and drop the GROUP BY. The
DISTINCT will ensure that no duplicate doc ids are counted.

When you use GROUP BY, each group of equal values for the columns in the
GROUP BY clause is subjected to your aggregate function. That would be why
you got two columns. See:
<URL:http://www.mysql.com/documentation/mysql/bychapter/manual_Reference.html#Group_by_functions>

> > However, the query still has to be executed, and the rows retrieved
> > from the tables -- so using COUNT() doesn't cut down all that
> > significantly on the time the query is going to take.
>
> :-S

There's really no way of predicting ahead of time how many results a query
is going to return... You could look at the relative number of occurrences
of the search terms, but that would by no means be foolproof. What you can
do is use LIMIT to return only a certain number of rows, which can help
somewhat because (a) if you're ordering the results, MySQL will stop once
it's found that many results; and (b) there's less data to send to the
client.

[ snip PostGreSQL ]

[ sub-selects vs joins ]

> does MySQL support sub-selects yet? I certainly remember it as not doing
> so...

No, it doesn't:
<URL:http://www.mysql.com/doc/A/N/ANSI_diff_Sub-selects.html>

> Well, since the filter process will do this bit already, maybe that'll
> cure the problem?
>
> > You can also use sub-selects instead of the INTERSECT operator, by
> > looking up the document ids of docs that meet your criteria. If you
> > know ahead of time how many documents contain each word (by storing
> > that info in the dictionary table, and doing a quick lookup first),
> > then you can order your use of sub-selects in such a way as to start
> > with the smallest result set and continue pruning it.
>
> Ah, cunning, how do we do this without sub-selects?

Well, this would require app-side processing in any case, so refer to the
end of the post...

> > Hmmm, indeed. Is your framework such that you can plug in different SQL
> > engines? It might be nice to play with PostGreSQL...
>
> Sort of, not with ease, mainly due to the different techniques needed for
> each engine. In MySQL, I was planning to return a group of join and where
> clauses from each index, which the engine would then combine and perform
> as a single select. In Postgres, that's not necessary: each index could
> just return a complete SELECT which the engine could combine using UNION,
> INTERSECT, etc. Could be a lot of inertia moving across to PostGreSQL
> though, particularly for the other parts of the project. I'll have to
> chat with the other guys working on it on Monday...

Well, we can definitely do better using MySQL, but it's looking like it
will require app-side processing. The only MySQL-only alternative I can
think of at this stage would be to use temporary heap tables to store
intermediate results, and then join these tables (which use hash indexing
and are in memory, and are therefore very fast). Since the tables are
temporary, you can do pretty much anything with them:

- UNIONs would be implemented by additional INSERT INTO ... SELECT FROM
  statements. Ditto for those ORs, possibly.
- JOIN operations would be much faster, since the tables would be small
  and they use hashes for indexing.
- You can specify a maximum number of rows that the table can hold (and
  therefore limit the number of results returned).

I don't actually think this would be the most efficient -- app-side
processing probably will be more efficient -- but if you're interested,
then see:
<URL:http://www.mysql.com/documentation/mysql/bychapter/manual_Reference.html#CREATE_TABLE>
<URL:http://www.mysql.com/documentation/mysql/bychapter/manual_Table_types.html#HEAP>
<URL:http://www.mysql.com/documentation/mysql/bychapter/manual_Reference.html#INSERT_SELECT>

It's also nice that this (excepting specifying the table type) still
complies with SQL92, which makes provision for temporary tables.

[ Application-side processing ]

> > Well, application-side processing is not necessarily that bad... You
> > can quite easily emulate sub-selects in your app, for example.
>
> Ooo... do tell :-)

Take for example the query:

    SELECT documents.document_id
    FROM documents, dictionary AS d, textindex AS i
    WHERE i.document_id = documents.document_id
        AND d.word_id = i.word_id
        AND d.word = '2000'
        AND i.prev_word_id IN
            (SELECT d.word_id
             FROM dictionary AS d
             WHERE d.word = 'windows')
    GROUP BY documents.document_id

1. Start with your inner-most SELECT:

    SELECT word_id
    FROM dictionary
    WHERE word = 'windows'

2. Retrieve the word id returned (it *should* be a single id).

3. Form the outer SELECT:

    SELECT documents.document_id
    FROM documents, dictionary AS d, textindex AS i
    WHERE i.document_id = documents.document_id
        AND d.word_id = i.word_id
        AND d.word = '2000'
        AND i.prev_word_id = the_id_returned_for_windows
    GROUP BY documents.document_id

That eliminates one of the present self joins, replacing it with a very
cheap single-row lookup.

You could nest any number of queries like that. When it comes to boolean
searching, you can apply the same logic, and retrieve all document ids that
you need to test further. Then generate an SQL list and use the IN
membership operator. For example, if you have the above query and need to
test further for articles that also refer to FreeBSD, you'd perform two
additional SELECTs:

1. Get the list of documents referring to FreeBSD:

    SELECT documents.document_id
    FROM documents, dictionary AS d, textindex AS i
    WHERE i.document_id = documents.document_id
        AND d.word_id = i.word_id
        AND d.word = 'freebsd'
    GROUP BY documents.document_id

2. Assuming the document ids returned are (1, 3, 5, 7, 11, 13):

    SELECT documents.document_id
    FROM documents, dictionary AS d, textindex AS i
    WHERE i.document_id = documents.document_id
        AND d.word_id = i.word_id
        AND d.word = '2000'
        AND i.prev_word_id = the_id_returned_for_windows
        AND documents.document_id IN (1, 3, 5, 7, 11, 13)
    GROUP BY documents.document_id

That's a total of three queries (although you would do well to use an
additional one first to determine the optimum order), one of which has no
joins, the others only two. If you don't mind doing the app-side
processing, then this is the way to go. Just make sure that, if you're
using lots of document ids in your list, your query buffer
(query_buffer_size) is large enough.

I think you should also be able to use this *very* efficiently instead of
the INTERSECT operator. If MySQL sees the following:

    documents.document_id IN (1, 3, 5, 7, 11, 13)
    AND documents.document_id IN (5, 7, 11, 13, 17, 19, 23)

it will do constant folding and eliminate the impossibilities.

You can also manipulate your app's lists to get around the OR problem by
merging multiple doc id lists into one.

Cheers

-- Marcus
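[ sketch: the temporary heap-table alternative, driven from Python ]

A sketch of the heap-table approach described above. The table name hits is
made up, the column follows the queries in this thread, and TYPE=HEAP /
MAX_ROWS are the MySQL 3.23-era spellings (later versions write
ENGINE=MEMORY):

    import MySQLdb

    conn = MySQLdb.connect(db="pyindexer")  # connection params illustrative
    cursor = conn.cursor()

    # Intermediate results live in an in-memory, hash-indexed table that
    # disappears when the connection closes; MAX_ROWS caps the result set.
    cursor.execute(
        "CREATE TEMPORARY TABLE hits (document_id INT NOT NULL,"
        " PRIMARY KEY (document_id)) TYPE=HEAP MAX_ROWS=1000")

    # One INSERT ... SELECT per term; IGNORE plus the primary key drops
    # duplicate document ids. Further INSERTs would implement the UNIONs.
    cursor.execute(
        "INSERT IGNORE INTO hits"
        " SELECT i.document_id"
        " FROM dictionary AS d, textindex AS i"
        " WHERE d.word_id = i.word_id AND d.word = 'freebsd'")

    # Joining the small heap table against the big text index should be
    # much cheaper than a self join of two large tables.
    cursor.execute(
        "SELECT hits.document_id"
        " FROM hits, dictionary AS d, textindex AS i"
        " WHERE i.document_id = hits.document_id"
        " AND d.word_id = i.word_id AND d.word = '2000'"
        " GROUP BY hits.document_id")
    print(cursor.fetchall())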
From: Casey D. <cas...@ya...> - 2001-12-17 03:17:00
[snip lots 'o stuff]

> You could nest any number of queries like that. When it comes to boolean
> searching, you can apply the same logic, and retrieve all document ids
> that you need to test further. Then generate an SQL list and use the IN
> membership operator.

[snip example]

I was going to suggest this, but what if both words return 10,000 document
hits? I doubt stuffing 10K ids into the IN operand of the SQL statement
would work...

But I do think that using heap tables would be quite efficient. I would
think that would be the way to go, and just join against it in the second
query. Maybe both methods could be used, using IN when there are 50 or
fewer hits and heap tables otherwise. Again, I think storing a document
count for each word could help here.

-Casey
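[ sketch: picking IN or a heap table per word ]

A rough sketch of that hybrid, assuming the per-word document count lives
in a hypothetical doc_count column (as suggested above); the 50-row cutoff
is Casey's illustrative figure, and the heap table route is the one
sketched in the previous post:

    IN_LIMIT = 50  # illustrative cutoff

    def doc_filter_clause(cursor, word):
        """Return a WHERE fragment restricting documents to those that
        contain `word`, picking the cheaper strategy for its hit count."""
        # doc_count is an assumed column; this also assumes the word is
        # actually present in the dictionary.
        cursor.execute(
            "SELECT doc_count FROM dictionary WHERE word = %s", (word,))
        (count,) = cursor.fetchone()
        if count <= IN_LIMIT:
            # Few hits: inline the document ids with IN.
            cursor.execute(
                "SELECT i.document_id"
                " FROM dictionary AS d, textindex AS i"
                " WHERE d.word_id = i.word_id AND d.word = %s"
                " GROUP BY i.document_id", (word,))
            ids = ", ".join(str(row[0]) for row in cursor.fetchall())
            return "documents.document_id IN (%s)" % ids
        # Many hits: assume a temporary heap table `hits` has been
        # populated for this word, and join against it instead.
        return "documents.document_id = hits.document_id"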
From: Marcus C. <ma...@wr...> - 2001-12-17 10:42:21
On Sun, 16 Dec 2001 at 19:17:00 -0800, Casey Duncan wrote:

[ using IN ]

> I was going to suggest this, but what if both words return 10,000
> document hits? I doubt stuffing 10K ids into the IN operand of the SQL
> statement would work...

*gulp* yes... I've done some informal tests (real results to follow when
they're formalised), and this might not be something to worry about so
much. The time spent in processing the query is:

- send query to MySQL server (linear)
- parse query (doesn't look as bad as I thought it might; appears to be
  linear)
- select using explicit document ids (cheaper than joining to find them)

Additionally, I think the space complexity becomes linear rather than
quadratic for the last item, when compared to a self join.

The app-side time to join a list of document ids returned from MySQL into
a MySQL list should not be too bad. (The 'obvious' way of doing it takes
50ms for 10000 ids on my box, once the ids have been returned from MySQL.
The main time expense, I think, would be inside MySQLdb.)

> But I do think that using heap tables would be quite efficient. I would
> think that would be the way to go, and just join against it in the
> second query.

It's still the join that bugs me, because I don't think it operates in
linear time or space. But that said, the lookup in the heap tables, using
hashes, would be constant-time.

> Maybe both methods could be used, using IN when there are 50 or fewer
> hits and heap tables otherwise.

The major problem with the join (or at least with the self join) is that
MySQL wants to create a temporary table to join two 10K-row tables. Using
the heap temporary to store intermediate results would mitigate this to
some extent, since MySQL would then be joining one 10K-row table with
another of fewer rows. I don't think we'd find the same behaviour using
IN, but I'm not yet certain. I'll try to get some comparative results
using IN for different numbers of return results.

> Again, I think storing a document count for each word could help here.

Indeed. Storing the document count will definitely push up indexing time,
but the benefits are great.

Cheers

-- Marcus
From: Chris W. <ch...@ni...> - 2001-12-17 13:50:42
Marcus Collins wrote:
>
> The app-side time to join a list of document ids returned from MySQL
> into a MySQL list should not be too bad. (The 'obvious' way of doing it
> takes 50ms for 10000 ids on my box, once the ids have been returned from
> MySQL. The main time expense, I think, would be inside MySQLdb.)

I'm guessing by this you mean doing each boolean term as a separate SELECT
and using Python to combine the returned results? What was the method you
used?

cheers,

Chris
From: Marcus C. <ma...@wr...> - 2001-12-17 15:51:18
On Mon, 17 Dec 2001 at 13:49:05 +0000, Chris Withers wrote:
> Marcus Collins wrote:
> >
> > The app-side time to join a list of document ids returned from MySQL
> > into a MySQL list should not be too bad. (The 'obvious' way of doing
> > it takes 50ms for 10000 ids on my box, once the ids have been returned
> > from MySQL. The main time expense, I think, would be inside MySQLdb.)
>
> I'm guessing by this you mean doing each boolean term as a separate
> SELECT and using Python to combine the returned results?
> What was the method you used?

Oh hang, sorry... I meant fetching all the results from a previous query
and using string.join() to generate an SQL list (the list of doc ids to be
checked in the next query). Those are both cheap.

Results to follow, but essentially MySQL has no problem handling IN tests
with 10K members on a 1M-row table -- the worst case, where all results
are returned, is under two seconds, of which 10% is spent by Python
fetching the results and string.join()ing the list. With one in ten
results being returned, you're looking at a query taking around 0.5s. This
excludes the initial lookups and other joins, but it is positive.

Cheers

-- Marcus
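[ sketch: the 'obvious' way of building the IN list ]

Presumably the method is along these lines: fetch the ids from the first
query, join them into an SQL list, and splice that into the follow-up
query (queries abbreviated from the earlier examples):

    import MySQLdb

    cursor = MySQLdb.connect(db="pyindexer").cursor()  # params illustrative

    # First query: ids of documents containing 'freebsd'.
    cursor.execute(
        "SELECT i.document_id"
        " FROM dictionary AS d, textindex AS i"
        " WHERE d.word_id = i.word_id AND d.word = 'freebsd'"
        " GROUP BY i.document_id")
    ids = [str(row[0]) for row in cursor.fetchall()]

    # Build the SQL list for the IN test. With 10K ids this string runs
    # to tens of KB -- hence the query_buffer_size caveat earlier.
    id_list = ", ".join(ids)
    cursor.execute(
        "SELECT documents.document_id"
        " FROM documents"
        " WHERE documents.document_id IN (%s)" % id_list)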
From: Chris W. <ch...@ni...> - 2001-12-17 13:42:13
Marcus Collins wrote:
>
> [ MySQL threads hanging on Windows ]
>
> > It freezes. The response I got was essentially 'upgrade to Win2K
> > Server', which may be the problem, although I remain unconvinced...
>
> Is this happening consistently with the same query/sequence of actions?

Well, after indexing a certain number of docs, yup...

> Is it only one thread that hangs?

Nope, anything that tries to do any operation hangs on opening the table.
MySQL can't be shut down and I have to resort to the Windows task manager
to kill the process. I'm taking Andy McKay's MS approach and ignoring it
for now, since the production system will be Debian, not Win2K ;-)

cheers,

Chris
From: Marcus C. <ma...@wr...> - 2001-12-17 15:51:23
On Mon, 17 Dec 2001 at 13:40:36 +0000, Chris Withers wrote:
> Marcus Collins wrote:

[ MySQL threads hanging on Windows ]

> > Is this happening consistently with the same query/sequence of actions?
>
> Well, after indexing a certain number of docs, yup...

Always the same number, if you've restarted the server? If you turn query
logging on, does it still break?

> > Is it only one thread that hangs?
>
> Nope, anything that tries to do any operation hangs on opening the table.

Can you open other tables, or connect to the server in any other way?

> MySQL can't be shut down and I have to resort to the Windows task
> manager to kill the process.
> I'm taking Andy McKay's MS approach and ignoring it for now, since the
> production system will be Debian, not Win2K ;-)

*grins*

Cheers

-- Marcus