pythonindexer-discuss Mailing List for PythonIndexer
Status: Pre-Alpha
Brought to you by: cduncan
Archive: December 2001 (36 messages), January 2002 (6 messages)
From: Chris W. <ch...@ni...> - 2002-01-18 16:17:02
Casey Duncan wrote:
> > > Zope 2.4.3 (http://www.zope.org)
> > > JPE (http://sf.net/projects/jpe)
> > > Lucene (http://jakarta.apache.org/lucene/)
> > >
> > > ...I got the indexing I want.
> >
> > I'm happy for you, but sad for me...

Any reason why you can't use Lucene in your project?

> From the looks it seemed like very cool technology. I guess I'm just a Python bigot, tho. But hey, you can't argue with something that works!

Well, I'm a Python bigot too, that's why all the code I wrote was in Python. I _love_ JPE ;-)

cheers, Chris
From: Casey D. <c.d...@nl...> - 2002-01-18 14:43:32
On Friday 18 January 2002 09:14 am, Chris Withers allegedly wrote:
> Hi,
>
> Just to let you guys know, using a combination of:
>
> Zope 2.4.3 (http://www.zope.org)
> JPE (http://sf.net/projects/jpe)
> Lucene (http://jakarta.apache.org/lucene/)
>
> ...I got the indexing I want.

I'm happy for you, but sad for me...

> JPE is really very cool. Lets you use Python from Java and Java from Python.
>
> Sadly, this means I don't really have any driving need for this project anymore, which means it won't get a lot of attention from me.
>
> Of course, this is open source, so if anyone finds it useful, they're welcome to pick it up :-)

I intend to continue working on this. I doubt I will have much time until next month, but hey, time flies anyway!

> Thanks to everyone for your time and help. I'm more than happy to answer any questions I can on Lucene/JPE/etc.
>
> I'll probably drop off this list in a coupla weeks time though...

Chris,

Thanks so much for jump-starting this project. The work you have done on the interfaces is very valuable (to me). I really appreciate the effort. Hey, maybe we'll come up with something so compelling, you'll come groveling back 8^)

> cheers,
>
> Chris
>
> PS: Lucene indexed 30,000 docs in just over an hour, indexes totalled 40Mb when optimised and queries return in an average of 0.2s

From the looks it seemed like very cool technology. I guess I'm just a Python bigot, tho. But hey, you can't argue with something that works! If I get hot and heavy into this project, I'm sure I will do a bit of code skimming from them. Use the source, Luke!

I'm sure I'll "see" you around...

/---------------------------------------------------\
 Casey Duncan, Sr. Web Developer
 National Legal Aid and Defender Association
 c.d...@nl...
\---------------------------------------------------/
From: Chris W. <ch...@ni...> - 2002-01-18 14:13:54
Hi,

Just to let you guys know, using a combination of:

Zope 2.4.3 (http://www.zope.org)
JPE (http://sf.net/projects/jpe)
Lucene (http://jakarta.apache.org/lucene/)

...I got the indexing I want. JPE is really very cool. Lets you use Python from Java and Java from Python.

Sadly, this means I don't really have any driving need for this project anymore, which means it won't get a lot of attention from me. Of course, this is open source, so if anyone finds it useful, they're welcome to pick it up :-)

Thanks to everyone for your time and help. I'm more than happy to answer any questions I can on Lucene/JPE/etc. I'll probably drop off this list in a coupla weeks time though...

cheers, Chris

PS: Lucene indexed 30,000 docs in just over an hour, indexes totalled 40Mb when optimised, and queries return in an average of 0.2s
From: Chris W. <ch...@ni...> - 2002-01-02 23:34:53
Casey Duncan wrote:
> > > AND, with the amazing software that is JPE, I might not actually have to write any indexing code, which would certainly make my life easier :-)
>
> Maybe I missed this when I was sleeping, but what is JPE?

Lets you run Java from CPython and Python from Java. Pretty cool :-)

> > Of course, Lucene might not quite cut it, and JPE might not quite do what I need, in which case we'll be back to SQL and then moving onto a C implementation.
>
> BTW: I came across another SQL-based indexing system in my travels that might be worth a look for implementation ideas:
>
> http://www.mnogosearch.org/

That's tailored more to 'web site' searches, isn't it?

> In case things don't work out, I should have much more time to help come February, when my own project that I want indexing for gets cranking. That is assuming I don't find a tool that I can integrate with Zope that already does everything I need.

Indeed :-) Of course, if you do find such a tool, please let me know!

> Regardless I have a vested (if sick) interest in getting something better in Zope with regard to indexing.

Likewise, and Lucene will never be that. I guess I have an unhealthy interest in 'difficult' computing *grinz* Especially since anything that ends up in 'Zope Core' will have to use ZODB, which basically sucks for indexes...

cheers, Chris
From: Casey D. <c.d...@nl...> - 2002-01-02 20:04:32
On Wednesday 02 January 2002 02:50 pm, Chris Withers allegedly wrote:
> Hi Guys,
>
> Just to let you know, I'm gonna be experimenting shortly with actually using Lucene from python. The more I see of it, the more it looks like what I need.
>
> AND, with the amazing software that is JPE, I might not actually have to write any indexing code, which would certainly make my life easier :-)

Maybe I missed this when I was sleeping, but what is JPE?

> Of course, Lucene might not quite cut it, and JPE might not quite do what I need, in which case we'll be back to SQL and then moving onto a C implementation.

BTW: I came across another SQL-based indexing system in my travels that might be worth a look for implementation ideas:

http://www.mnogosearch.org/

> I'll let you know how I get on...

In case things don't work out, I should have much more time to help come February, when my own project that I want indexing for gets cranking. That is assuming I don't find a tool that I can integrate with Zope that already does everything I need.

Regardless, I have a vested (if sick) interest in getting something better in Zope with regard to indexing.

> cheers,
>
> Chris

/---------------------------------------------------\
 Casey Duncan, Sr. Web Developer
 National Legal Aid and Defender Association
 c.d...@nl...
\---------------------------------------------------/
From: Chris W. <ch...@ni...> - 2002-01-02 19:49:23
Hi Guys,

Just to let you know, I'm gonna be experimenting shortly with actually using Lucene from python. The more I see of it, the more it looks like what I need.

AND, with the amazing software that is JPE, I might not actually have to write any indexing code, which would certainly make my life easier :-)

Of course, Lucene might not quite cut it, and JPE might not quite do what I need, in which case we'll be back to SQL and then moving onto a C implementation.

I'll let you know how I get on...

cheers, Chris
From: Chris W. <ch...@ni...> - 2001-12-21 20:34:47
Hi Andreas,

This looks cool :-) Hopefully it won't be too hard to turn it into a Filter for the PythonIndexer...

Andreas Jung wrote:
> FYI, I released PyStemmer this week (www.sf.net/projects/pystemmer). It's a Python library with stemmers for eleven languages. Maybe you can use it.

Guys, whatcha think?

cheers, Chris
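A minimal sketch of what such a stemming Filter might look like. It assumes PyStemmer's current API (a Stemmer module exposing Stemmer(language).stemWord()), which may differ from the 2001 release, and the filter() signature is only a guess at the PythonIndexer Filter interface linked later in this archive.

import Stemmer

class StemmingFilter:
    # Hypothetical PythonIndexer Filter: maps each incoming word to its stem.
    def __init__(self, language='english'):
        self.stemmer = Stemmer.Stemmer(language)

    def filter(self, words):
        # Stop-word removal would happen in an earlier filter in the chain.
        return [self.stemmer.stemWord(word.lower()) for word in words]

# e.g. StemmingFilter().filter(['connected', 'connections']) should yield
# roughly ['connect', 'connect'].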
From: Marcus C. <ma...@wr...> - 2001-12-17 18:54:10
I was stunned. Passing a list of 10K document ids to MySQL hardly made it sweat.

Testing environment:
  FreeBSD 4.4-STABLE
  MySQL 3.23.46, compiled from source, all knobs set to default
  Python 2.1.1, compiled from source
  GCC 2.95.3
  MySQLdb 0.9.0
  Celeron 500, 384MB SDRAM
  Seagate Barracuda IDE 28GB / 7 200 RPM / 2MB cache

Test table:

  CREATE TABLE document (
    id INT NOT NULL PRIMARY KEY AUTO_INCREMENT,
    foo VARCHAR(255)
  ) TYPE=BDB

Test data: 1M rows with sequence ids and a 16-character random string for foo.

Method: I generated a list of random ids and made an SQL list from that. A simple SELECT, returning the whole row, was used. The WHERE clause restricted results to those that had id IN (the_random_list). The standard cursor class's fetchall() was then used to retrieve the results. The list length and percentage of ID hits were varied. Percentage ID hits is the percentage of the random IDs in the list which actually appear in the table. Ideally, one would want to vary the number of rows also... All tests were repeated 1 000 times, and times are wallclock times. Load from other processes was reasonably constant, accounting for < 5% of total CPU.

Results:

  Table rows   IN list length   % ID hits   Time / select    Parse time / select
  1 000 000            10          10 %        2.472 ms          2.068 ms
  1 000 000            10          50 %        2.524 ms
  1 000 000            10         100 %        2.800 ms
  1 000 000            50          10 %        4.183 ms          2.661 ms
  1 000 000            50          50 %        5.253 ms
  1 000 000            50         100 %        6.366 ms
  1 000 000           100          10 %        6.918 ms          3.367 ms
  1 000 000           100          50 %        9.242 ms
  1 000 000           100         100 %       11.817 ms
  1 000 000           500          10 %       28.891 ms         12.129 ms
  1 000 000           500          50 %       40.089 ms
  1 000 000           500         100 %       54.389 ms
  1 000 000         1 000          10 %       57.018 ms         21.876 ms
  1 000 000         1 000          50 %       81.902 ms
  1 000 000         1 000         100 %      108.954 ms
  1 000 000         5 000          10 %      301.138 ms        118.765 ms
  1 000 000         5 000          50 %      719.837 ms
  1 000 000         5 000         100 %    1 168.819 ms
  1 000 000        10 000          10 %      628.415 ms        255.822 ms
  1 000 000        10 000          50 %    1 425.179 ms
  1 000 000        10 000         100 %    1 962.224 ms

Some notes:

- The parse / select time is iffy -- it includes the time taken to send the query to the server and return the EXPLAIN output. It is listed only once for each combination of table size and list length, as the % ID hits is irrelevant to the time taken by the parser. It should probably be thought of more properly as MySQL overhead time.
- Python accounts for 10 - 15% of CPU usage while huge result sets are being returned, and system CPU utilisation shoots up to as much as 30%. Both these factors indicate that dividing the load between app and db server over separate machines might be quite worthwhile, although neither process was starved for CPU.
- Both the time taken to parse the query, and the overall time, are almost linear functions of the number of list items. This will probably become less relevant with real-world data, especially in the case of the parsing, but it's a Good Thing!
- The process is not as scientific as it might be. For one thing, later queries might benefit from MySQL or OS caching, as I did not restart the MySQL server or the machine between tests. However, since 1 000 tests are run in each case, this might not affect the overall picture significantly. I did test a few "dry runs", and found a variance of less than 5%.
- No joins or other comparisons are done in the test. It is therefore not a very real-world example (in fact, it's basically selecting values we already know ;-). We still need to determine how it will run along with joins.
- It might be useful to compare this to a "recipe" SELECT returning the same number of rows using a range for the id.
- I should have varied the length of the random data inserted into the foo column. That would possibly have exercised the table handler and/or I/O subsystem more.

I guess the next step would be to incorporate this in the searching proper, unless there are other tests I should run first?

Cheers -- Marcus
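For anyone wanting to reproduce the figures, a sketch of the timing loop described above, assuming MySQLdb 0.9 and the document test table; the database name and the hit/miss id ranges are assumptions, not part of the original test script.

import random, string, time
import MySQLdb

TABLE_ROWS = 1000000   # rows in the document table
LIST_LEN   = 10000     # length of the IN list
HIT_RATIO  = 0.5       # fraction of the random ids that actually exist
REPEATS    = 1000

conn = MySQLdb.connect(db='indexer_test')   # assumed database name
cur = conn.cursor()

def random_id_list():
    # ids <= TABLE_ROWS exist; larger ids are guaranteed misses
    ids = []
    for i in range(LIST_LEN):
        if random.random() < HIT_RATIO:
            ids.append(str(random.randrange(1, TABLE_ROWS + 1)))
        else:
            ids.append(str(random.randrange(TABLE_ROWS + 1, 2 * TABLE_ROWS)))
    return string.join(ids, ',')

total = 0.0
for n in range(REPEATS):
    sql_list = random_id_list()
    start = time.time()
    cur.execute("SELECT * FROM document WHERE id IN (%s)" % sql_list)
    rows = cur.fetchall()
    total = total + (time.time() - start)

print "average time per select: %.3f ms" % (total * 1000.0 / REPEATS)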
From: Marcus C. <ma...@wr...> - 2001-12-17 15:51:27
On Mon, 17 Dec 2001 at 09:08:06 -0500, Casey Duncan wrote:

[ LOAD DATA and temporary files ]

> On Monday 17 December 2001 07:57 am, Chris Withers allegedly wrote:
> [snip load data stuff]
> > Well, my reluctance on this is that it needs a temporary file. Where do we put this temporary file? How do we know we're going to be allowed to write to the filesystem?
>
> Yup, other than using mktemp or creating some sort of var directory in the install, dunno. That said, I don't think it's too much to ask for a program to have write access to the temp dir... 8^)

I think you can use a FIFO with LOAD DATA LOCAL INFILE... Anyone got experience with this? Probably something for the mysql list ;-)

[ snip separating app and db server ]

A good idea, particularly if the app is handling some of the grunt work :-) TCP/IP comms between app and server may be slower than communication using a unix socket, though.

> > That said, the only other option is to build a big
> > INSERT INTO tbl_name VALUES (expression,...),(...),...
> >
> > ...and then we have to worry about max. sql length I guess?
> > Anyone know what the maximum length of SQL you can shove down a single c.execute() is?

Whatever you set it to ;-) The buffer starts off at net_buffer_length, and grows to a maximum of max_allowed_packet (default 16MB, which should be enough ;-). See the manual entries for SHOW VARIABLES.

[ app-side processing for positional matches ]

> > Can you elaborate?
>
> My thought was that you would treat the query just like A and B and C on the SQL side of things. But you would bring in word position information with the queries, so that you could on the application side determine which queries actually satisfy the phrase match. I think there will be relatively few comparisons there for any meaningful search terms.
>
> I just think that trying to do that type of vertical comparison on the SQL side will be a pain.

IIRC a search for 'windows 2000' on the test data Chris has been using has ~ 10K rows for each of 'windows' and '2000'. Doing that processing app-side would be costly (first have to match doc ids, then compare each in O(n**2)), whereas using the previous word id in the textindex when you already know what the word must be should be cheaper.

> Also, how are you planning to deal with stop-words in phrase searches? I notice you have included a previous word reference in the index. I'm assuming stop words are thrown out of both the word index and the search words, correct?

The same splitter is applied to source docs and search string.

> > > I agree that storing a document count for each word could help with optimizing since you could start with the smallest dataset first and prune it from there.
> >
> > Does this still hold true when you're OR'ing terms together?
>
> For simplicity, I would just treat each part of the OR as a separate query and combine the results on the application side. I think ORs are going to be expensive no matter what you do, so you might as well keep it simple.

Fully agree. I'm not sure what the other DBMSs do with ORs in the WHERE, but this looks like the best way of doing it all-round.

[ MySQL's FULLTEXT index ]

> Bleah. Real phrase matching is a definite requirement for me too. I wonder if MySQL stores any positional info in its full-text index that you could get your hands on... I'd imagine it is using a table internally to store the index...

No, MySQL doesn't store the position; it uses relative occurrences only in determining relevance, and does so at the row level.

Cheers -- Marcus
From: Marcus C. <ma...@wr...> - 2001-12-17 15:51:23
On Mon, 17 Dec 2001 at 13:40:36 +0000, Chris Withers wrote:
> Marcus Collins wrote:

[ MySQL threads hanging on Windows ]

> > Is this happening consistently with the same query/sequence of actions?
>
> Well, after indexing a certain number of docs, yup...

Always the same number, if you've restarted the server? If you turn query logging on, does it still break?

> > Is it only one thread that hangs?
>
> Nope, anything that tries to do any operation hangs on opening the table.

Can you open other tables, or connect to the server in any other way?

> MySQL can't be shut down and I have to resort to the Windows task manager to kill the process.
> I'm taking Andy McKay's MS approach and ignoring it for now, since the production system will be Debian, not Win2K ;-)

*grins*

Cheers -- Marcus
From: Marcus C. <ma...@wr...> - 2001-12-17 15:51:18
On Mon, 17 Dec 2001 at 13:49:05 +0000, Chris Withers wrote:
> Marcus Collins wrote:
> >
> > The app-side time to join a list of document ids returned from MySQL into a MySQL list should not be too bad. (The 'obvious' way of doing it takes 50ms for 10000 ids on my box, once the ids have been returned from MySQL. The main time expense, I think, would be inside MySQLdb.)
>
> I'm guessing by this you mean doing each boolean term as a separate SELECT and using python to combine the returned results?
> What was the method you used?

Oh hang, sorry... I meant fetching all the results from a previous query and using string.join() to generate an SQL list (the list of doc ids to be checked in the next query). Those are both cheap.

Results to follow, but essentially MySQL has no problem handling IN tests with 10K members on a 1M row table -- worst-case, where all results are returned, is under two seconds, of which 10% is spent by Python fetching the results and string.joining the list. With one in ten results being returned, you're looking at a query taking around 0.5s. This excludes the initial lookups and other joins, but it is positive.

Cheers -- Marcus
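The 'obvious' way mentioned above, as a sketch: fetch the ids from the previous query and string.join() them into an SQL list for the next one. The cursor and the example queries are illustrative only.

import string

def sql_id_list(cursor):
    # Turn the document ids returned by the previous query into an SQL
    # list suitable for '... WHERE document_id IN (...)'.
    rows = cursor.fetchall()
    return string.join([str(row[0]) for row in rows], ',')

# e.g.
#   cursor.execute(first_term_query)
#   id_list = sql_id_list(cursor)
#   cursor.execute(second_term_query + " AND document_id IN (%s)" % id_list)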
From: Marcus C. <ma...@wr...> - 2001-12-17 14:47:56
A few notes on index population in the MySQL implementation:

0. The time to index each document really does need to be improved, especially since reads on the textindex table are largely blocked during this time.

1. All the indexing is already within a transaction block per document. LOAD DATA should still increase INSERT performance, as might using MySQL's extended INSERT syntax.

2. Indexing on the first 32 characters of dictionary.word, instead of the full length, resulted in an overall 5% increase in performance, with a < 1% loss in performance in the 'words to wordids' block. Not clear yet why there was a _loss_ of performance in the lookup; would have to do more tests with other key lengths... (Only one test performed)

3. Deferring indexing of the textindex indexes over prev_textindex_id and (word_id, prev_word_id) (IOW, creating indexes initially only for identity columns and dictionary.word) did not result in a pronounced improvement ( < 5%) on textindex generation with 0.7M rows. Overall performance actually suffered (by 5%). This could be because MySQL creates a temporary working copy of the table so that it can index it in a consistent state. Anyway, this wouldn't be feasible for real-world practice... (Only one test performed)

4. I haven't yet played with recording the document count for dictionary words, but that will increase the time a bit. How much depends...

More positive test results on searching to follow...

Cheers -- Marcus
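A sketch of the prefix-key DDL that point 2 refers to, driven through MySQLdb. The dictionary/word naming follows the usage elsewhere in this thread; the project's real schema lives in the SQL attachment linked at the end of this archive, so treat the statements as illustrative.

import MySQLdb

conn = MySQLdb.connect(db='indexer')   # assumed database name
cur = conn.cursor()

# Index only the first 32 characters of each word.  The index is
# deliberately not UNIQUE: uniqueness of the full word is enforced by
# the application, and the prefix index just narrows the lookup.
cur.execute("CREATE INDEX word_idx ON dictionary (word(32))")

# The distribution query suggested later in this archive, for choosing
# a sensible prefix length:
cur.execute("SELECT COUNT(word), LENGTH(word) AS len FROM dictionary GROUP BY len")
for count, length in cur.fetchall():
    print "%6d words of length %s" % (count, length)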
From: Casey D. <c.d...@nl...> - 2001-12-17 14:05:13
On Monday 17 December 2001 07:57 am, Chris Withers allegedly wrote:
[snip load data stuff]
> Well, my reluctance on this is that it needs a temporary file. Where do we put this temporary file? How do we know we're going to be allowed to write to the filesystem?

Yup, other than using mktemp or creating some sort of var directory in the install, dunno. That said, I don't think it's too much to ask for a program to have write access to the temp dir... 8^)

I think you could potentially see a big speed improvement, and even more so if the mySQL server is on another box...

> That said, the only other option is to build a big
> INSERT INTO tbl_name VALUES (expression,...),(...),...
>
> ...and then we have to worry about max. sql length I guess?
> Anyone know what the maximum length of SQL you can shove down a single c.execute() is?

I dunno. Perhaps the C API would allow you to do a LOAD DATA from an in-memory data structure? Just a thought, we can't be the only ones trying to stuff the proverbial goose here 8^)

> > As the search end, it seems to me that app side processing will be best for positional matches, such as for phrases.
>
> Can you elaborate?

My thought was that you would treat the query just like A and B and C on the SQL side of things. But you would bring in word position information with the queries, so that you could on the application side determine which queries actually satisfy the phrase match. I think there will be relatively few comparisons there for any meaningful search terms.

I just think that trying to do that type of vertical comparison on the SQL side will be a pain.

Also, how are you planning to deal with stop-words in phrase searches? I notice you have included a previous word reference in the index. I'm assuming stop words are thrown out of both the word index and the search words, correct?

> > I agree that storing a document count for each word could help with optimizing since you could start with the smallest dataset first and prune it from there.
>
> Does this still hold true when you're OR'ing terms together?

For simplicity, I would just treat each part of the OR as a separate query and combine the results on the application side. I think ORs are going to be expensive no matter what you do, so you might as well keep it simple. So I guess that would be a no 8^)

> I prefer to not require the BTrees module, as it's not part of the standard python library, but if needs must ;-)

I hear you. Just thinking out loud.

> Thanks :-) Once I get the initial implementation and scalability testing package finalized, maybe you guys could try some tweakage?

What are friends for?

[snip full-text]
> Yeah, sadly doesn't do phrase matching :-S

8^(

> We could use this and do the usual cheap hack that ZCatalog does, matching the phrase "x y z" simply gets turned into "x" AND "y" AND "z", but that won't be good enough for the specific application I need the indexer for :-S

Bleah. Real phrase matching is a definite requirement for me too. I wonder if MySQL stores any positional info in its full-text index that you could get your hands on... I'd imagine it is using a table internally to store the index...

/---------------------------------------------------\
 Casey Duncan, Sr. Web Developer
 National Legal Aid and Defender Association
 c.d...@nl...
\---------------------------------------------------/
From: Chris W. <ch...@ni...> - 2001-12-17 13:50:42
Marcus Collins wrote:
> The app-side time to join a list of document ids returned from MySQL into a MySQL list should not be too bad. (The 'obvious' way of doing it takes 50ms for 10000 ids on my box, once the ids have been returned from MySQL. The main time expense, I think, would be inside MySQLdb.)

I'm guessing by this you mean doing each boolean term as a separate SELECT and using python to combine the returned results?
What was the method you used?

cheers, Chris
From: Chris W. <ch...@ni...> - 2001-12-17 13:42:13
Marcus Collins wrote:
>
> [ MySQL threads hanging on Windows ]
>
> > It freezes, the response I got was essentially 'upgrade to Win2K Server', which may be the problem, although I remain unconvinced...
>
> Is this happening consistently with the same query/sequence of actions?

Well, after indexing a certain number of docs, yup...

> Is it only one thread that hangs?

Nope, anything that tries to do any operation hangs on opening the table. MySQL can't be shut down and I have to resort to the Windows task manager to kill the process.

I'm taking Andy McKay's MS approach and ignoring it for now, since the production system will be Debian, not Win2K ;-)

cheers, Chris
From: Chris W. <ch...@ni...> - 2001-12-17 13:25:09
Dammit, reply-all :-)

Chris Withers wrote:
> Marcus Collins wrote:
> >
> > I reckon MySQL can do it pretty efficiently given a set of ids, but doing such processing as coalescing duplicates, etc., on the app side would likely be faster than letting the MySQL query parser and optimiser do it for you. That said, it would be only a matter of ms improvement; the main benefit of app-side processing is in eliminating joins.
>
> In that case, shall we try and keep it on the SQL side? ;-)
>
> > http://www.mysql.org/documentation/mysql/bychapter/manual_Reference.html#Fulltext_Search
> >
> > AFAIK, it doesn't do phrase matching. Their relevance calculation looks very interesting, though.
>
> Oooo, I'll have to take a look :-)
> (Although relevance ranking is pretty low on the list right now...)
>
> cheers,
>
> Chris
From: Chris W. <ch...@ni...> - 2001-12-17 12:59:30
Casey Duncan wrote:
> This would involve writing the rows to be inserted into textindex to a temporary file and then using LOAD DATA to batch load it to mySQL. See the following for docs on LOAD DATA:
>
> http://www.mysql.org/documentation/mysql/bychapter/manual_Reference.html#LOAD_DATA

Well, my reluctance on this is that it needs a temporary file. Where do we put this temporary file? How do we know we're going to be allowed to write to the filesystem?

That said, the only other option is to build a big

INSERT INTO tbl_name VALUES (expression,...),(...),...

...and then we have to worry about max. sql length I guess?
Anyone know what the maximum length of SQL you can shove down a single c.execute() is?

> As the search end, it seems to me that app side processing will be best for positional matches, such as for phrases.

Can you elaborate?

> I agree that storing a document count for each word could help with optimizing since you could start with the smallest dataset first and prune it from there.

Does this still hold true when you're OR'ing terms together?

> Perhaps IISets could be used to get UNION/INTERSECT functionality efficiently if mySQL can't do it for you.

I prefer to not require the BTrees module, as it's not part of the standard python library, but if needs must ;-)

> As for partial indexes, they may help, especially reducing memory paging and improving cache usage. Here are the docs for that:
>
> http://www.mysql.org/documentation/mysql/bychapter/manual_Reference.html#CREATE_INDEX

Thanks :-) Once I get the initial implementation and scalability testing package finalized, maybe you guys could try some tweakage?

> BTW: Have you seen mySQL's built-in full-text indexing support?
>
> http://www.mysql.org/documentation/mysql/bychapter/manual_Reference.html#Fulltext_Search

Yeah, sadly doesn't do phrase matching :-S We could use this and do the usual cheap hack that ZCatalog does, matching the phrase "x y z" simply gets turned into "x" AND "y" AND "z", but that won't be good enough for the specific application I need the indexer for :-S

thanks for the comments :-)

Chris
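For concreteness, a sketch of the LOAD DATA approach under discussion: write the rows for one document to a temporary file, then bulk-load them in a single statement. MySQLdb and a textindex table with document_id/word_id/prev_word_id columns are assumed (the real column list is in the SQL attachment at the end of this archive), and LOCAL INFILE has to be enabled in the client library.

import os, string, tempfile
import MySQLdb

def batch_load_textindex(conn, rows):
    # rows: sequence of (document_id, word_id, prev_word_id) tuples
    path = tempfile.mktemp()
    f = open(path, 'w')
    for row in rows:
        # tab-separated fields, newline-terminated rows: LOAD DATA's defaults
        f.write(string.join(map(str, row), '\t') + '\n')
    f.close()
    try:
        cur = conn.cursor()
        cur.execute("LOAD DATA LOCAL INFILE '%s' INTO TABLE textindex "
                    "(document_id, word_id, prev_word_id)" % path)
    finally:
        os.remove(path)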
From: Marcus C. <ma...@wr...> - 2001-12-17 10:42:21
On Sun, 16 Dec 2001 at 19:17:00 -0800, Casey Duncan wrote:

[ using IN ]

> I was going to suggest this, but what if both words return 10,000 document hits? I doubt stuffing 10K ids into the IN operand of the SQL statement would work...

*gulp* yes... I've done some informal tests (real results to follow when they're formalised), and this might not be something to worry about so much. The time spent in processing the query is:

- send query to MySQL server (linear)
- parse query (doesn't look as bad as I thought it might; appears to be linear)
- select using explicit document ids (cheaper than joining to find them)

Additionally, I think the space complexity becomes linear rather than exponential for the last item, when compared to a self join.

The app-side time to join a list of document ids returned from MySQL into a MySQL list should not be too bad. (The 'obvious' way of doing it takes 50ms for 10000 ids on my box, once the ids have been returned from MySQL. The main time expense, I think, would be inside MySQLdb.)

> But I do think that using heap tables would be quite efficient. I would think that would be the way to go, and just join against it in the second query.

It's still the join that bugs me, because I don't think it operates in linear time or space. But that said, the lookup in the heap tables, using hashes, would be constant.

> Maybe both methods could be used, using IN when there are 50 or fewer hits and heap tables otherwise.

The major problem with the join (or at least with the self join) is that MySQL wants to create a temporary table to join two 10K-row tables. Using the heap temporary to store intermediate results would mitigate this to some extent, since MySQL would then be joining one 10K-row table with another of fewer rows. I don't think we'd find the same behaviour using IN, but I'm not yet certain. I'll try to get some comparative results using IN for different numbers of return results.

> Again I think storing a document count for each word could help here.

Indeed. Storing the document count will definitely push up indexing time, but the benefits are great.

Cheers -- Marcus
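A sketch of the heap-table staging being weighed up here, assuming MySQLdb and the dictionary/textindex naming used in this thread; the column names and staging strategy are illustrative, not the project's actual code.

import MySQLdb

def doc_ids_for_word(cur, word):
    # Stage the matching document ids for one search term in an in-memory
    # HEAP table, so later terms can join against a small, hash-indexed
    # table instead of re-joining two large textindex result sets.
    cur.execute("""CREATE TEMPORARY TABLE hits (
                       document_id INT NOT NULL PRIMARY KEY
                   ) TYPE=HEAP""")
    cur.execute("""INSERT INTO hits
                   SELECT DISTINCT i.document_id
                   FROM dictionary AS d, textindex AS i
                   WHERE d.word = %s AND i.word_id = d.word_id""", (word,))
    cur.execute("SELECT document_id FROM hits")
    ids = [row[0] for row in cur.fetchall()]
    cur.execute("DROP TABLE hits")
    return ids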
From: Casey D. <cas...@ya...> - 2001-12-17 03:17:00
[snip lots 'o stuff]
> You could nest any number of queries like that. When it comes to boolean searching, you can apply the same logic, and retrieve all document ids that you need to test further. Then generate an SQL list and use the IN membership operator.
[snip example]

I was going to suggest this, but what if both words return 10,000 document hits? I doubt stuffing 10K ids into the IN operand of the SQL statement would work...

But I do think that using heap tables would be quite efficient. I would think that would be the way to go, and just join against it in the second query. Maybe both methods could be used, using IN when there are 50 or fewer hits and heap tables otherwise. Again I think storing a document count for each word could help here.

-Casey
From: Marcus C. <ma...@wr...> - 2001-12-16 22:54:03
On Sun, 16 Dec 2001 at 09:00:53 -0800, Casey Duncan wrote:

[ Data population and LOAD DATA ]

This would definitely be a lot faster than individual INSERTs. The main reason for the speed-up is that in this case MySQL treats the whole thing as one transaction, so it defers flushing the key buffers until all the data is loaded. The same can be accomplished (although not quite as efficiently) by wrapping the whole lot of inserts inside a transaction block (if using BDB or InnoDB tables) or LOCK/UNLOCK TABLES (if using non-transaction-safe tables). Use BEGIN/COMMIT in the former case.

[ App-side processing ]

> As the search end, it seems to me that app side processing will be best for positional matches, such as for phrases. I'd imagine such processing should be eventually coded in C, but it should be acceptable in Python if done efficiently (like using Python arrays, IISets or some-such).

I'd tend to do more of the processing on the database side, as described, in which case use of C or Python becomes less of an issue.

> http://www.python.org/doc/essays/list2str.html

Very interesting and inspiring :-)

> I agree that storing a document count for each word could help with optimizing since you could start with the smallest dataset first and prune it from there. Perhaps IISets could be used to get UNION/INTERSECT functionality efficiently if mySQL can't do it for you.

I reckon MySQL can do it pretty efficiently given a set of ids, but doing such processing as coalescing duplicates, etc., on the app side would likely be faster than letting the MySQL query parser and optimiser do it for you. That said, it would be only a matter of ms improvement; the main benefit of app-side processing is in eliminating joins.

[ prefix indexes ]

> Sometimes less is more 8^).

:-)

> impossible to tell which would be faster on a given architecture without real-world testing.

Indeed. Although I don't see a knob to change the size of the key block, so AFAIK it's set at 1024 bytes. Caching will likely have different effects on different platforms, though, and the setting of key_buffer_size will have an effect on MySQL's own caching.

[ MySQL's FULLTEXT index ]

> http://www.mysql.org/documentation/mysql/bychapter/manual_Reference.html#Fulltext_Search

AFAIK, it doesn't do phrase matching. Their relevance calculation looks very interesting, though.

Cheers -- Marcus
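A sketch of the "wrap the whole batch" alternative described above, assuming MySQLdb and an illustrative textindex column list: LOCK/UNLOCK TABLES for non-transaction-safe tables, BEGIN/COMMIT for BDB or InnoDB.

import MySQLdb

def insert_textindex_rows(conn, rows, transactional=0):
    # rows: (document_id, word_id, prev_word_id) tuples for one document.
    # Batching the INSERTs defers key-buffer flushing until the end,
    # which is the effect LOAD DATA gets for free.
    cur = conn.cursor()
    if transactional:
        cur.execute("BEGIN")                        # BDB / InnoDB tables
    else:
        cur.execute("LOCK TABLES textindex WRITE")  # MyISAM / ISAM tables
    try:
        cur.executemany(
            "INSERT INTO textindex (document_id, word_id, prev_word_id) "
            "VALUES (%s, %s, %s)", rows)
        if transactional:
            cur.execute("COMMIT")
    finally:
        if not transactional:
            cur.execute("UNLOCK TABLES")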
From: Marcus C. <ma...@wr...> - 2001-12-16 22:01:10
On Sun, 16 Dec 2001 at 14:44:46 +0000, Chris Withers wrote:
> Marcus Collins wrote:

[ MySQL threads hanging on Windows ]

> It freezes, the response I got was essentially 'upgrade to Win2K Server', which may be the problem, although I remain unconvinced...

Is this happening consistently with the same query/sequence of actions? Is it only one thread that hangs? Does MySQL's trace log tell you anything useful if you start mysqld with --debug? You'll also want to start it with --standalone and direct the trace log to a big disk. See <URL:http://www.mysql.org/documentation/mysql/bychapter/manual_porting.html#Making_trace_files> (AFAIK, the Windows binary is compiled with trace debugging built in)

--log might also be useful, just to get an indication of whether the problem arises once the query is being executed (in which case you'll know what query is breaking it), or if the thread freezes even before the query is attempted (in which case there's possibly a problem with the threading). Other than that, hopefully the mysql-windows list (IIRC) can help more...

[ ORs in WHERE clause ]

I distinctly remember reading about this somewhere, and I cannot remember where... Any pointers? I can't find anything about this in the manual, but I know that (under certain conditions?) MySQL won't use the index if you use an OR in your WHERE clause.

[ prefix key indexes ]

> > Another thing you *should* be able to do is index on only the first n characters of the word,
>
> What's the SQL I need to do that?

<URL:http://www.mysql.com/documentation/mysql/bychapter/manual_Reference.html#CREATE_INDEX>

CREATE INDEX word_idx ON dictionary (word(16))

will create a (non-unique) index keyed on the first 16 characters of dictionary.word. Your application should ensure that the values inserted are unique.

[ snip word length distribution ]

> So I guess we're looking at a key length of 8 or 9?
>
> > Since MySQL's default key block size is 1024 bytes, you might want to select a size by which 1024 is divisible. Thirty-two or sixteen, maybe?
>
> 8? Although 32 would catch just about every word and is a lot less than 256 ;-)

Possibly. The *only* way you're going to find out is empirically, although I think it'll be 8, 16, or 32... If 8 does very well, and 16 and 32 do only marginally worse, I'd be tempted to go with the bigger length. Especially if the engine is likely to be used in, e.g., indexing medical literature and the like.

> > Remember that your word index will then not be unique, so don't declare it as such.
>
> well, my SQL green-ness again, how would I do that?

As above. Indexes are never unique unless you specify them as such (UNIQUE INDEX | UNIQUE KEY) or they are the PRIMARY KEY.

> I'm wondering whether the filter-index architecture I'm building will help this?
> http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/pythonindexer/Documentation/interfaces/filter.html?rev=1.1&content-type=text/vnd.viewcvs-markup

[ snip benefits of prefix key indexes ]

[ COUNT aggregate function ]

> > COUNT(*) retrieves all columns from all rows that meet your criteria, and then counts the number of rows.
>
> Ah, by 'didn't work', I meant I didn't get a single number back, I got a two column table; I think the GROUP BY clause was interfering, any ideas how to stop that?

Yes, it would've been the GROUP BY. To get the behaviour you're after, you would have to use COUNT(DISTINCT document_id) and drop the GROUP BY. The DISTINCT will ensure that no duplicate doc ids are counted.

When you use GROUP BY, each group of equal values for the columns in the GROUP BY clause is subjected to your aggregate function. That would be why you got two columns. See:

<URL:http://www.mysql.com/documentation/mysql/bychapter/manual_Reference.html#Group_by_functions>

> > However, the query still has to be executed, and the rows retrieved from the tables -- so using COUNT() doesn't cut down all that significantly on the time the query is going to take.
>
> :-S

There's really no way of predicting ahead of time how many results a query is going to return... You could look at the relative number of occurrences of the search terms, but that would by no means be foolproof. What you can do is use LIMIT to return only a certain number of rows, which can help somewhat because (a) if you're ordering the results, MySQL will stop once it's found that many results; and (b) there's less data to send to the client.

[ snip PostGreSQL ]

[ sub-selects vs joins ]

> does MySQL support sub-selects yet? I certainly remember it as not doing so...

No, it doesn't: <URL:http://www.mysql.com/doc/A/N/ANSI_diff_Sub-selects.html>

> Well, since the filter process will do this bit already, maybe that'll cure the problem?
>
> > You can also use sub-selects instead of the INTERSECT operator, by looking up the document id's of docs that meet your criteria. If you know ahead of time how many documents contain each word (by storing that info in the dictionary table, and doing a quick lookup first), then you can order your use of sub-selects in such a way as to start with the smallest result set and continue pruning it.
>
> Ah, cunning, how do we do this without sub-selects?

Well, this would require app-side processing in any case, so refer to the end of the post...

> > Hmmm, indeed. Is your framework such that you can plug in different SQL engines? It might be nice to play with PostGreSQL...
>
> Sort of, not with ease, mainly due to the different techniques needed for each engine.
> In MySQL, I was planning to return a group of join and where clauses from each index, which the engine would then combine and perform as a single select.
> In PostGres, that's not necessary, each index could just return a complete SELECT which the engine could combine using UNION, INTERSECT, etc. Could be a lot of inertia moving across to PostGreSQL though, particularly the other parts of the project. I'll have to chat with the other guys working on it on Monday...

Well, we can definitely do better using MySQL, but it's looking like it will require app-side processing. The only MySQL-only alternative I can think of at this stage would be to use temporary heap tables to store intermediate results, and then join these tables (which use hash indexing and are in memory, and therefore very fast). Since the tables are temporary, you can do pretty much anything with them:

- UNIONs would be implemented by additional INSERT INTO ... SELECT FROM statements. Ditto for those ORs, possibly.
- JOIN operations would be much faster, since the tables would be small and they use hashes for indexing
- you can specify a maximum number of rows that the table can hold (and therefore limit the number of results returned)

I don't actually think this would be the most efficient -- app-side processing probably will be more efficient -- but if you're interested, then see:

<URL:http://www.mysql.com/documentation/mysql/bychapter/manual_Reference.html#CREATE_TABLE>
<URL:http://www.mysql.com/documentation/mysql/bychapter/manual_Table_types.html#HEAP>
<URL:http://www.mysql.com/documentation/mysql/bychapter/manual_Reference.html#INSERT_SELECT>

It's also nice that this (excepting specifying the table type) still complies with SQL92, which makes provision for temporary tables.

[ Application-side processing ]

> > Well, application-side processing is not necessarily that bad... You can quite easily emulate sub-selects in your app, for example.
>
> Ooo... do tell :-)

Take for example the query:

  SELECT documents.document_id
  FROM documents, dictionary AS d, textindex AS i
  WHERE i.document_id = documents.document_id
    AND d.word_id = i.word_id
    AND d.word = '2000'
    AND i.prev_word_id IN (SELECT d.word_id
                           FROM dictionary AS d
                           WHERE d.word = 'windows')
  GROUP BY documents.document_id

1. Start with your inner-most SELECT:

  SELECT word_id FROM dictionary WHERE word = 'windows'

2. Retrieve the word id returned (it *should* be a single id)

3. Form the outer SELECT:

  SELECT documents.document_id
  FROM documents, dictionary AS d, textindex AS i
  WHERE i.document_id = documents.document_id
    AND d.word_id = i.word_id
    AND d.word = '2000'
    AND i.prev_word_id = the_id_returned_for_windows
  GROUP BY documents.document_id

That eliminates one of the present self joins, replacing it with a very cheap single-row result lookup.

You could nest any number of queries like that. When it comes to boolean searching, you can apply the same logic, and retrieve all document ids that you need to test further. Then generate an SQL list and use the IN membership operator. For example, if you have the above query and need to test further for articles that also refer to FreeBSD, you'd perform two additional SELECTs:

1. Get the list of documents referring to FreeBSD:

  SELECT documents.document_id
  FROM documents, dictionary AS d, textindex AS i
  WHERE i.document_id = documents.document_id
    AND d.word_id = i.word_id
    AND d.word = 'freebsd'
  GROUP BY documents.document_id

2. Assuming the document ids returned are (1, 3, 5, 7, 11, 13):

  SELECT documents.document_id
  FROM documents, dictionary AS d, textindex AS i
  WHERE i.document_id = documents.document_id
    AND d.word_id = i.word_id
    AND d.word = '2000'
    AND i.prev_word_id = the_id_returned_for_windows
    AND documents.document_id IN (1, 3, 5, 7, 11, 13)
  GROUP BY documents.document_id

Total of three queries (although you would do well to use an additional one first to determine optimum order), one of which has no joins, the others only two. If you don't mind doing the app-side processing, then this is the way to go. Just make sure that, if you're using lots of document ids in your list, your query buffer (query_buffer_size) is large enough.

I think you should also be able to use this *very* efficiently instead of the INTERSECT operator. If MySQL sees the following:

  documents.document_id IN (1, 3, 5, 7, 11, 13)
  AND documents.document_id IN (5, 7, 11, 13, 17, 19, 23)

it will do constant folding and eliminate the impossibilities. You can also manipulate your app's lists to get around the OR problem by merging multiple doc id lists into one.

Cheers -- Marcus
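A short sketch of how that two-step emulation might look from Python, reusing the queries above; MySQLdb is assumed and the helper names are made up for illustration.

import string

def docs_for_phrase_word(cur, word, prev_word, restrict_ids=None):
    # Step 1: the inner-most SELECT -- look up the id of the previous word.
    cur.execute("SELECT word_id FROM dictionary WHERE word = %s", (prev_word,))
    row = cur.fetchone()
    if row is None:
        return []        # previous word not in the dictionary: no matches
    prev_word_id = row[0]

    # Steps 2 and 3: the outer SELECT, with the sub-select replaced by the
    # single id just fetched, optionally pruned by an IN list of document
    # ids returned for an earlier term.
    sql = ("SELECT documents.document_id "
           "FROM documents, dictionary AS d, textindex AS i "
           "WHERE i.document_id = documents.document_id "
           "AND d.word_id = i.word_id "
           "AND d.word = %s "
           "AND i.prev_word_id = %s ")
    if restrict_ids:
        sql = sql + "AND documents.document_id IN (%s) " % \
              string.join(map(str, restrict_ids), ',')
    sql = sql + "GROUP BY documents.document_id"
    cur.execute(sql, (word, prev_word_id))
    return [row[0] for row in cur.fetchall()]

# e.g. the phrase 'windows 2000', restricted to documents already known to
# mention freebsd:
#   ids = docs_for_phrase_word(cur, '2000', 'windows', freebsd_doc_ids)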
From: Casey D. <cas...@ya...> - 2001-12-16 17:00:54
A couple of observations:

The index population code seems pretty inefficient to me in sqlindexer. You are essentially sending one SQL INSERT per word when something is indexed. I think it would be worthwhile to test an implementation that uses a single LOAD DATA statement when the text is indexed. This would involve writing the rows to be inserted into textindex to a temporary file and then using LOAD DATA to batch load it to mySQL. See the following for docs on LOAD DATA:

http://www.mysql.org/documentation/mysql/bychapter/manual_Reference.html#LOAD_DATA

As the search end, it seems to me that app side processing will be best for positional matches, such as for phrases. I'd imagine such processing should be eventually coded in C, but it should be acceptable in Python if done efficiently (like using Python arrays, IISets or some-such). You might want to check out this Guido essay for some ideas here:

http://www.python.org/doc/essays/list2str.html

I agree that storing a document count for each word could help with optimizing since you could start with the smallest dataset first and prune it from there. Perhaps IISets could be used to get UNION/INTERSECT functionality efficiently if mySQL can't do it for you.

As for partial indexes, they may help, especially reducing memory paging and improving cache usage. Here are the docs for that:

http://www.mysql.org/documentation/mysql/bychapter/manual_Reference.html#CREATE_INDEX

It would probably be worthwhile to test times for a relatively large size (32 bytes?) vs a small one (4 or 8 bytes?). Sometimes less is more 8^). It's near impossible to tell which would be faster on a given architecture without real-world testing.

BTW: Have you seen mySQL's built-in full-text indexing support?

http://www.mysql.org/documentation/mysql/bychapter/manual_Reference.html#Fulltext_Search

Not sure whether it does everything we need, but it's probably worth a look...

-Casey
From: Chris W. <ch...@ni...> - 2001-12-16 14:44:31
Marcus Collins wrote:
>
> [ snip MySQL threads hanging on Windows ]
>
> Ick! Any update yet from MySQL AB? Does it freeze or spin?

It freezes, the response I got was essentially 'upgrade to Win2K Server', which may be the problem, although I remain unconvinced...

> > Bit disappointed to see that MySQL's performance goes bad when you have OR's in a WHERE, or something like that, which could make boolean stuff unpleasant :-S
>
> Hang, yes... I completely forgot about that...

Anyone got ideas how/if we can mitigate this?

[snip remove prev_textindex_id key in DDL]

> Another thing you *should* be able to do is index on only the first n characters of the word,

What's the SQL I need to do that?

> which will (a) make your indexes smaller

that's not currently a problem...

> and (b) therefore allow more values to be stored in a key buffer of the same size.

Ah, yes...

> SELECT
>   COUNT(word),
>   LENGTH(word) AS len
> FROM dictionary
> GROUP BY len

gave:

+-------------+------+
| COUNT(word) | len  |
+-------------+------+
|          41 |    1 |
|        1224 |    2 |
|       15818 |    3 |
|       27076 |    4 |
|       21709 |    5 |
|       38393 |    6 |
|       26047 |    7 |
|       24425 |    8 |
|       18607 |    9 |
|       14351 |   10 |
|       10923 |   11 |
|        6931 |   12 |
|        5081 |   13 |
|        3835 |   14 |
|        2839 |   15 |
|        2144 |   16 |
|        1680 |   17 |
|        1216 |   18 |
|         873 |   19 |
|         649 |   20 |
|         434 |   21 |
|         281 |   22 |
|         208 |   23 |
|         138 |   24 |
|          84 |   25 |
|          65 |   26 |
|          45 |   27 |
|          27 |   28 |
|          15 |   29 |
|          10 |   30 |
|           6 |   31 |
|           4 |   32 |
|           4 |   33 |
|           2 |   34 |
|           1 |   36 |
|           1 |   44 |
|           2 |   46 |
|           1 |   60 |
|           1 |   64 |
|           1 |   99 |
+-------------+------+

So I guess we're looking at a key length of 8 or 9?

> Since MySQL's default key block size is 1024 bytes, you might want to select a size by which 1024 is divisible. Thirty-two or sixteen, maybe?

8? Although 32 would catch just about every word and is a lot less than 256 ;-)

> Remember that your word index will then not be unique, so don't declare it as such.

well, my SQL green-ness again, how would I do that?

I'm wondering whether the filter-index architecture I'm building will help this?
http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/pythonindexer/Documentation/interfaces/filter.html?rev=1.1&content-type=text/vnd.viewcvs-markup

> The idea is that the index then just helps narrow down the search -- the actual rows are still read in order to compare with the word for which you're searching. For words that are shorter than your key length, you should see a real improvement; for longer words, there might be a loss in performance.

32 might be the best length then...

> COUNT(*) retrieves all columns from all rows that meet your criteria, and then counts the number of rows.

Ah, by 'didn't work', I meant I didn't get a single number back, I got a two column table; I think the GROUP BY clause was interfering, any ideas how to stop that?

> This is just extra work for the server, as it retrieves additional columns that are then just thrown away. You can use COUNT(colname) and it won't retrieve all the columns.

Cool...

> However, the query still has to be executed, and the rows retrieved from the tables -- so using COUNT() doesn't cut down all that significantly on the time the query is going to take.

:-S

> I can't answer that, since I haven't used PostGreSQL in absolute ages, but it does implement a greater subset of SQL92 (including subselects, union, etc.) Not sure about where table and index optimisation currently stand -- a while back the chief problem with pgsql was that you basically had to run vacuum every so often and that doing so in a production environment was not viable. But that was a MySQL argument and may have changed in any case.

Hehe, if I get a chance I'll take a look, if only out of curiosity...

> [ sub-selects vs joins ]
>
> > Indeed. It's the damned join. That's why I was thinking again of sub-selects, which may be more efficient for very large result sets (but less so for small ones, I imagine).

does MySQL support sub-selects yet? I certainly remember it as not doing so...

[snip sub select]

Well, since the filter process will do this bit already, maybe that'll cure the problem?

> You can also use sub-selects instead of the INTERSECT operator, by looking up the document id's of docs that meet your criteria. If you know ahead of time how many documents contain each word (by storing that info in the dictionary table, and doing a quick lookup first), then you can order your use of sub-selects in such a way as to start with the smallest result set and continue pruning it.

Ah, cunning, how do we do this without sub-selects?

> Hmmm, indeed. Is your framework such that you can plug in different SQL engines? It might be nice to play with PostGreSQL...

Sort of, not with ease, mainly due to the different techniques needed for each engine.
In MySQL, I was planning to return a group of join and where clauses from each index, which the engine would then combine and perform as a single select.
In PostGres, that's not necessary, each index could just return a complete SELECT which the engine could combine using UNION, INTERSECT, etc. Could be a lot of inertia moving across to PostGreSQL though, particularly the other parts of the project. I'll have to chat with the other guys working on it on Monday...

> > > ... which means ugly LEFT OUTER JOINs and other hacks. Or application-side processing.
> >
> > or both :-S
>
> Well, application-side processing is not necessarily that bad... You can quite easily emulate sub-selects in your app, for example.

Ooo... do tell :-)

cheers, Chris
From: Marcus C. <ma...@wr...> - 2001-12-15 23:13:30
On Thu, 13 Dec 2001 at 17:09:23 +0000, Chris Withers wrote:
> Marcus Collins wrote:

[ snip MySQL threads hanging on Windows ]

Ick! Any update yet from MySQL AB? Does it freeze or spin?

[ snip tuning indexing ]

See some comments below.

[ snip InnoDB ]

[ ORs in WHERE not using index ]

> Bit disappointed to see that MySQL's performance goes bad when you have OR's in a WHERE, or something like that, which could make boolean stuff unpleasant :-S

Hang, yes... I completely forgot about that... I believe it does a table scan in that case! (This could be another tick next to sub-selects; see later) I'm not sure what the behaviour of the other database engines is...

[ Redundant prev_textindex_id key in DDL ]

> > Nope, it appears only in possible_keys and ref. The keys in key are used for the index lookup, as I understand it.
>
> Indeed, so we could drop it?

You can indeed drop it; as it's not used in the SELECTs it won't harm that at all, and it'll speed up your population.

Another thing you *should* be able to do is index on only the first n characters of the word, which will (a) make your indexes smaller and (b) therefore allow more values to be stored in a key buffer of the same size. It should also reduce the number of seeks needed, as more keys will be stored in each block. Result: faster population, faster lookup, better cache hit ratio. You'd probably have to determine a good figure empirically, but seeing the distribution of word lengths should help:

  SELECT
    COUNT(word),
    LENGTH(word) AS len
  FROM dictionary
  GROUP BY len

will give you the distribution of word lengths within your dictionary. Since MySQL's default key block size is 1024 bytes, you might want to select a size by which 1024 is divisible. Thirty-two or sixteen, maybe?

Remember that your word index will then not be unique, so don't declare it as such. The idea is that the index then just helps narrow down the search -- the actual rows are still read in order to compare with the word for which you're searching. For words that are shorter than your key length, you should see a real improvement; for longer words, there might be a loss in performance.

[ Counting result columns ]

> > You can use COUNT(colname | * ), but that's basically going to save time only because there's less data for it to allocate memory for and transfer :-S.
>
> How does that differ from just COUNT (*), which didn't work for me?

COUNT(*) retrieves all columns from all rows that meet your criteria, and then counts the number of rows. This is just extra work for the server, as it retrieves additional columns that are then just thrown away. You can use COUNT(colname) and it won't retrieve all the columns. However, the query still has to be executed, and the rows retrieved from the tables -- so using COUNT() doesn't cut down all that significantly on the time the query is going to take.

[ functionality in PostGreSQL ]

> Nope... I wonder if it'd be good to look at PostGreSQL here?

I can't answer that, since I haven't used PostGreSQL in absolute ages, but it does implement a greater subset of SQL92 (including subselects, union, etc.) Not sure about where table and index optimisation currently stand -- a while back the chief problem with pgsql was that you basically had to run vacuum every so often and that doing so in a production environment was not viable. But that was a MySQL argument and may have changed in any case. Hehe, if I get a chance I'll take a look, if only out of curiosity...

[ sub-selects vs joins ]

> > Indeed. It's the damned join. That's why I was thinking again of sub-selects, which may be more efficient for very large result sets (but less so for small ones, I imagine).
>
> What's the difference between a sub-select and a join? (spot the SQL newbie ;-)

Well, it looks like what is taking the time is the large-ish self joins that have to be done, where MySQL ends up creating large temporary tables. If you can use sub-selects, you basically still generate an intermediate temporary table, but it should be much smaller than in the case of a large join. The (current?) query then becomes:

  SELECT documents.document_id
  FROM documents, dictionary AS d, textindex AS i
  WHERE i.document_id = documents.document_id
    AND d.word_id = i.word_id
    AND d.word = '2000'
    AND i.prev_word_id IN (SELECT d.word_id
                           FROM dictionary AS d
                           WHERE d.word = 'windows')
  GROUP BY documents.document_id

So basically, instead of joining a 20 000 row table to a 20 000 row table, you're finding a single row from a 20 000 row table, and then finding (from an index) all the rows in the text index that reference that row.

You can also use sub-selects instead of the INTERSECT operator, by looking up the document id's of docs that meet your criteria. If you know ahead of time how many documents contain each word (by storing that info in the dictionary table, and doing a quick lookup first), then you can order your use of sub-selects in such a way as to start with the smallest result set and continue pruning it.

[ UNION and INTERSECT operators ]

> Nope, but it is in Postgres...

Hmmm, indeed. Is your framework such that you can plug in different SQL engines? It might be nice to play with PostGreSQL...

[ snip UNION not yet in MySQL -stable ]

> > ... which means ugly LEFT OUTER JOINs and other hacks. Or application-side processing.
>
> or both :-S

Well, application-side processing is not necessarily that bad... You can quite easily emulate sub-selects in your app, for example.

Cheers -- Marcus
From: Chris W. <ch...@ni...> - 2001-12-15 13:00:35
The SQL code referred to in this mail can be found at:

http://sourceforge.net/tracker/index.php?func=detail&aid=489417&group_id=41154&atid=430556

cheers, Chris