pythonindexer-discuss Mailing List for PythonIndexer (Page 2)
Status: Pre-Alpha
Brought to you by: cduncan
From: Casey D. <c.d...@nl...> - 2001-12-14 13:40:04

On Thursday 13 December 2001 11:34 am, Chris Withers allegedly wrote:
> Hi,
>
> Just to let you guys know, the interfaces for the indexer are now finished.
>
> I'm gonna start working on the framework, unit tests and MySQL engine
> today...
>
> cheers,
>
> Chris

Yay, thanks for your work on this. I'll let you know if I have any comments.

/---------------------------------------------------\
Casey Duncan, Sr. Web Developer
National Legal Aid and Defender Association
c.d...@nl...
\---------------------------------------------------/

From: Chris W. <ch...@ni...> - 2001-12-13 17:11:02

I'm cc'ing the list on this, in case anyone finds this interesting or can help,
sorry about the lack of context for anyone reading it...

Marcus Collins wrote:
>
> On Tue, 4 Dec 2001 at 00:31:10 +0000, Chris Withers wrote:
>
> > Indeed it is and I've found it particularly helpful, especially when diagnosing
> > MySQL hangs...
>
> So why was MySQL hanging?

No idea yet, still trying to help the MySQL AB guys reproduce the problem...

> Also useful is that you can kill (mysqladmin
> kill) the offending threads.

Hmmm... it hangs in such a way that 'mysqladmin kill' doesn't work, only the
windows task manager can kill it :-(

> You can still use ANALYZE TABLE with BDB, and apparently it does do some
> stuff. That's the equivalent of the {,my}isamchk -a yourtablename.

Ah, okay...

<snip insert speed>

Well, we can optimize indexing speed later :-)

> It would be useful to know how those eight seconds are being used --
> what proportion is used by the Python script; by the MySQL server; by
> the OS (on I/O).

Well, most of it is inside the loop that does the inserts, so I'm guessing split
between MySQLdb, MySQL and the OS.

> Your best bet in determining this would be to test on
> your Linux box. times (under bash(1)) will give you wall clock time, CPU
> time, and system time.

Hmmm... will have to give that a go when it comes to optimising...

> What's the SQL nut book like?

Looks pretty cool, haven't had to use it in anger yet...

[snip RDB normalisation]

> FWIW, the chapter on Performance and Design in the Oracle reference
> begins, "No major application will run in Third Normal Form."

Hehe, 'cept searching and indexing ;-)

> > [slow InnoDB]
> > > Yep, it'll be interesting to see. Did you broach the subject at all to
> > > the list?
> >
> > Yup, and one of the Innobase guys has got back to me, so we shall see...
>
> What did he have to say, BTW?

Turns out the recent windows binaries were compiled with what was effectively a
"make_me_cripplingly_slow" flag, so I might try the next release with InnoDB and
see what happens...

> Are they using some sort of hash function? I'm guessing yes, since
> they're also compressing. Probably a page index system. That's
> impressive performance, but how long does it take them to index? Also, is
> their index updatable?

Dunno, guess I'll find out when I get there...

> > That said, I think the SQL engine could be pretty f'ing cool...

Hmmm... well, we have searching not-quite-quick-enough and indexing slow so we
need some kind of tweak still. Anyone got any ideas?

Bit disappointed to see that MySQL's performance goes bad when you have OR's in
a WHERE, or something like that, which could make boolean stuff unpleasant :-S

> Quick reply now, and hopefully more in-depth after Friday... Do you want
> to post this on the SF list maybe, or personal?

SF List it is :-)

> > > BTW, I'm not sure if the key on prev_textindex_id is necessary... Not
> > > sure how MySQL handles that query... Wanna post EXPLAIN output?

<snip>

> Nope, it appears only in possible_keys and ref. The keys in key are used
> for the index lookup, as I understand it.

Indeed, so we could drop it?

> > searching for "db 2" brings back 138 results...
> >
> > I hope there's not a bug lurking here :-S
>
> Well, that's obviously the first thing to determine. Maybe post the
> actual SQL used, the DDL, and the EXPLAIN output?

Indeed. I think I'm gonna tackle this from the other end now. Get the framework
and unit tests up and running before returning to build the proper SQL engine
now that I know what it should smell like...

> > > - search terms that would return many, many results (impacts GROUP BY)
> >
> > well, when "windows 2000" brought back 1-2K results, it took 10 seconds
>
> Hmmm... Ten secs still too long.

Indeed :-( (seen higher figures than that since...)

> You can use COUNT(colname | * ), but that's basically going to save time
> only because there's less data for it to allocate memory for and
> transfer :-S.

How does that differ from just COUNT (*), which didn't work for me?

> I'm also leaning towards experimenting with sub-selects for this, but
> not sure if they're yet implemented in MySQL (don't have latest version
> of manual handy...)

Nope... I wonder if it'd be good to look at PostgreSQL here?

> Hmm, was reading through the MySQLdb code the other day, and saw that
> the cursor.execute method can take a list. Don't remember details, but
> may be worth looking at.

Will do...

> Indeed. It's the damned join. That's why I was thinking again of
> sub-selects, which may be more efficient for very large results sets
> (but less so for small ones, I imagine).

What's the difference between a sub-select and a join?
(spot the SQL newbie ;-)

> I think I mentioned UNION and INTERSECT a while back, but looking at my
> (out-of-date) MySQL reference, it looks like it's not supported.

Nope, but it is in Postgres...

> And it playing right now, it doesn't look like UNION is in the latest
> version either :-S

It is, but the latest unstable release...

> ... which means ugly LEFT OUTER JOINs and other hacks. Or
> application-side processing.

or both :-S

Well, hope to hear from Marcus some time, but if anyone else can dive in in the
meantime, all the better...

cheers,

Chris

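The join, sub-select, INTERSECT and application-side options discussed in this message can be compared side by side. The following is only a sketch under stated assumptions: it uses Python's built-in sqlite3 module rather than MySQL, and the table name `textindex` with columns `doc_id` and `word` is a hypothetical schema, not the one used in the thread. It shows that all four approaches answer the same "documents containing both words" question; it says nothing about which is faster on a given engine.

```python
import sqlite3

# Hypothetical schema: one row per (document, word) pair.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE textindex (doc_id INTEGER, word TEXT)")
conn.executemany(
    "INSERT INTO textindex (doc_id, word) VALUES (?, ?)",
    [(1, "windows"), (1, "2000"), (2, "windows"), (3, "db"), (3, "2")],
)

# Self-join: pair up rows of the same document, one per search word.
JOIN_SQL = """
    SELECT DISTINCT a.doc_id
    FROM textindex AS a
    JOIN textindex AS b ON a.doc_id = b.doc_id
    WHERE a.word = ? AND b.word = ?
"""

# Sub-select: the inner query yields the documents holding the second
# word, and the outer query filters against that set.
SUBSELECT_SQL = """
    SELECT DISTINCT doc_id
    FROM textindex
    WHERE word = ?
      AND doc_id IN (SELECT doc_id FROM textindex WHERE word = ?)
"""

# INTERSECT: run one query per word and keep the common rows.
INTERSECT_SQL = """
    SELECT doc_id FROM textindex WHERE word = ?
    INTERSECT
    SELECT doc_id FROM textindex WHERE word = ?
"""


def doc_ids(sql, word_a, word_b):
    """Run one of the two-word queries and return sorted document ids."""
    return sorted(row[0] for row in conn.execute(sql, (word_a, word_b)))


def doc_ids_app_side(word_a, word_b):
    """Application-side processing: two simple queries, intersected in Python."""
    one_word = "SELECT doc_id FROM textindex WHERE word = ?"
    ids_a = {row[0] for row in conn.execute(one_word, (word_a,))}
    ids_b = {row[0] for row in conn.execute(one_word, (word_b,))}
    return sorted(ids_a & ids_b)


for sql in (JOIN_SQL, SUBSELECT_SQL, INTERSECT_SQL):
    print(doc_ids(sql, "windows", "2000"))  # [1]
print(doc_ids_app_side("windows", "2000"))  # [1]
```

The `executemany` call also illustrates passing a list of parameter tuples in one call, which is the kind of batching hinted at in the remark about handing a list to the cursor.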
From: Chris W. <ch...@ni...> - 2001-12-13 16:36:22

Hi,

Just to let you guys know, the interfaces for the indexer are now finished.

I'm gonna start working on the framework, unit tests and MySQL engine today...

cheers,

Chris

From: Chris W. <ch...@ni...> - 2001-12-13 16:05:20

Marcus Collins wrote:
>
> I think it's probably implemented like this because returning it as a
> string is the most universal

Yeah, I think I actually see why it needs to be like that back now, just a
shame we need to take the implicit performance hit...

> However, it would indeed be good to see the C API support the data types
> C knows about natively -- and these do account for the majority of
> MySQL's native data types. It would be interesting to determine what it
> would take to add this.

That sounds like a question for the MySQL lists ;-)

<snip>

> to examine the column type, treat the binary data appropriately, and
> return a native Python object of the appropriate type.

That would be cool, if I understood it entirely... I got the native Python
object bit :-S

cheers,

Chris

From: Marcus C. <ma...@wr...> - 2001-12-12 17:09:56

On Mon, 10 Dec 2001 at 15:53:47 +0000, Chris Withers wrote:

> Marcus Collins wrote:
>
> > The MySQL C API returns rows as an array of strings. See the typedefs
> > and prototypes in mysql.h...
>
> Booo hiss and shame on MySQL :-(
>
> Has anyone requested this 'feature' be changed?

I think it's probably implemented like this because returning it as a
string is the most universal -- some of MySQL's data types (such as
DECIMAL, for example) have no native counterparts in C. Also, MySQL has
locale support which may not be available in the C libraries on all the
platforms on which the C API might be used (which could very well be
different to the platform on which the server is running!). And most
importantly, the return type of the C function needs to be known!

However, it would indeed be good to see the C API support the data types
C knows about natively -- and these do account for the majority of
MySQL's native data types. It would be interesting to determine what it
would take to add this.

I've looked only at the typedefs and declarations, and not at the
definitions in the source, but I think that it could be done by
allocating a pointer to the binary representation in MYSQL_RES, and
having an access function in the Python wrapper (not in C; it wouldn't
be all that useful since the return type would be ambiguous at compile
time) to examine the column type, treat the binary data appropriately,
and return a native Python object of the appropriate type.

Then again, it depends how these are returned by the server itself ;-)

Cheers
-- Marcus

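The access function proposed in this message (examine the column type, treat the binary data appropriately, return a native Python object) could look roughly like the sketch below. Everything specific here is an assumption for illustration: the column-type tags, the little-endian byte order and the field widths are hypothetical and do not describe how MySQL or its C API actually encodes values.

```python
import struct

# Hypothetical column-type tags mapped to struct formats.
# Byte order and widths are assumptions, not MySQL's real encoding.
BINARY_FORMATS = {
    "INT": "<i",     # 4-byte signed integer
    "BIGINT": "<q",  # 8-byte signed integer
    "DOUBLE": "<d",  # 8-byte IEEE 754 float
}


def decode_field(column_type, raw_bytes):
    """Return a native Python object for one field's binary representation."""
    fmt = BINARY_FORMATS.get(column_type)
    if fmt is None:
        # Assumption: any type without a binary format is treated as text.
        return raw_bytes.decode("utf-8")
    (value,) = struct.unpack(fmt, raw_bytes)
    return value


print(decode_field("INT", struct.pack("<i", 138)))
print(decode_field("DOUBLE", struct.pack("<d", 1.5)))
print(decode_field("VARCHAR", b"db 2"))
```

Because the return type depends on the column type seen at run time, the dispatch sits naturally in the Python wrapper, which matches the point made above about the return type being ambiguous at compile time in C.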
From: Chris W. <ch...@ni...> - 2001-12-10 16:00:53

Hi,

Had a bit of a revelation today...

There are only three types of index:

SingleValueIndex
  Indexes a single value per identifier.
  These can be sorted on, grouped by, etc.

MultipleValueIndex
  Indexes multiple values for an identifier.
  Can these be sorted on, grouped by, etc?

OrderedIndex
  Indexes multiple values for an identifier and remembers the order
  the values were presented in.

Now I defy someone to prove me wrong ;-)

Chris

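A minimal in-memory sketch of the three index types named in this message. Only the three class names come from the message; the method names (`index`, `sorted_identifiers`, `search`, `search_phrase`) and the dictionary-based storage are assumptions for illustration, not the PythonIndexer interfaces.

```python
from collections import defaultdict


class SingleValueIndex:
    """One value per identifier, so identifiers can be sorted by value."""

    def __init__(self):
        self._values = {}

    def index(self, identifier, value):
        # A later value for the same identifier replaces the earlier one.
        self._values[identifier] = value

    def sorted_identifiers(self):
        return sorted(self._values, key=self._values.get)


class MultipleValueIndex:
    """An unordered set of values per identifier."""

    def __init__(self):
        self._values = defaultdict(set)

    def index(self, identifier, value):
        self._values[identifier].add(value)

    def search(self, value):
        return {i for i, vals in self._values.items() if value in vals}


class OrderedIndex:
    """Multiple values per identifier, kept in the order presented."""

    def __init__(self):
        self._values = defaultdict(list)

    def index(self, identifier, value):
        self._values[identifier].append(value)

    def search_phrase(self, phrase):
        """Identifiers whose values contain `phrase` (a list) as a consecutive run."""
        n = len(phrase)
        return {
            i for i, vals in self._values.items()
            if any(vals[k:k + n] == phrase for k in range(len(vals) - n + 1))
        }
```

The ordered variant is what makes phrase matching possible: a document indexed as "2000", "windows" contains both words but would not match the phrase ["windows", "2000"].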
From: Chris W. <ch...@ni...> - 2001-12-10 15:55:28

Marcus Collins wrote:
>
> On Mon, 10 Dec 2001 at 12:37:11 +0000, Chris Withers wrote:
>
> > Marcus Collins wrote:
> > >
> > > probably most DBMS)? The database server returns the float as a string,
> > > the database API converts it back to a float,
> >
> > It does? why? that sucks :-S
>
> The MySQL C API returns rows as an array of strings. See the typedefs
> and prototypes in mysql.h...

Booo hiss and shame on MySQL :-(

Has anyone requested this 'feature' be changed?

cheers,

Chris

From: Marcus C. <ma...@wr...> - 2001-12-10 14:22:34

On Mon, 10 Dec 2001 at 12:37:11 +0000, Chris Withers wrote:

> Marcus Collins wrote:
> >
> > probably most DBMS)? The database server returns the float as a string,
> > the database API converts it back to a float,
>
> It does? why? that sucks :-S

The MySQL C API returns rows as an array of strings. See the typedefs and
prototypes in mysql.h...

I don't know the dark magic, but AFAIK in MySQLdb there's a 'converters' module
(which you can override) which examines the column type reported by the API and
converts the string output to native Python objects.

[ snip sample data ]

Cheers
-- Marcus

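The idea of a converters table, as described above, can be sketched without MySQLdb at all: a mapping from the reported column type to a callable that turns the string into a native object. The type tags and the `convert_row` helper below are hypothetical stand-ins for illustration and are not MySQLdb's actual names or API.

```python
from decimal import Decimal

# Hypothetical column-type tags mapped to converter callables.
# Overriding a converter means replacing an entry in this dictionary.
CONVERTERS = {
    "INT": int,
    "FLOAT": float,
    "DECIMAL": Decimal,
}


def convert_row(column_types, raw_row, converters=CONVERTERS):
    """Turn a row of strings into native Python objects, column by column."""
    converted = []
    for column_type, raw in zip(column_types, raw_row):
        if raw is None:
            # SQL NULL stays None whatever the column type.
            converted.append(None)
        else:
            # Types without a converter are left as strings.
            converted.append(converters.get(column_type, str)(raw))
    return converted


print(convert_row(["INT", "FLOAT", "VARCHAR"], ["138", "1.5", "db 2"]))
```

The string-to-object step is where the per-row conversion cost mentioned in the thread is paid, once for every field returned.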
From: Chris W. <ch...@ni...> - 2001-12-10 12:39:05

Marcus Collins wrote:
>
> probably most DBMS)? The database server returns the float as a string,
> the database API converts it back to a float,

It does? why? that sucks :-S

> Probably better to use a long int for handling dates and times, and if
> you end up doing and searching, the comparisons will be faster in any
> case.

True :-)

> Any reason not to? Alternatively, map to the underlying storage's
> date / time types if that fits in with your architecture.

Yeah, that was my plan...

> Maybe check out all the public domain works at Project Gutenberg?
>
> <URL:http://promo.net/pg/>
>
> and many FTP mirrors.
>
> There's a good few thousand books, etc., in text form. Easily a few
> hundred megs of text, I reckon.

cool... more sample data :-)

cheers,

Chris

From: Chris W. <ch...@ni...> - 2001-12-07 18:15:16

Looks interesting:
http://jakarta.apache.org/lucene/docs/index.html

Chris

From: Marcus C. <ma...@wr...> - 2001-12-07 17:49:16

On Fri, 7 Dec 2001 at 15:57:45 +0000, Chris Withers wrote:

[ DateTime handling ]

> Casey Duncan wrote:
> >
> > We probably don't need a specific index type for dates and times (because
> > they can be distilled down to floats), unless there is some date-specific
> > search functionality that is needed (none comes to mind).
>
> Hmmm... maybe not at first, but indexing Dates and Times in this way feels very unnatural to me.
> In addition, DateTime combined objects also feel pretty unnatural :-(

Coming in late here (*wave* everyone!), so excuse me if I'm missing the
mark...

Couldn't using floats for the dates result in an accuracy problem,
particularly if they're stored in a database which may have a default
display width assigned to its column (as is the case with MySQL and
probably most DBMS)? The database server returns the float as a string,
the database API converts it back to a float, and you may end up with a
different float to the one you had originally.

Probably better to use a long int for handling dates and times, and if
you end up doing and searching, the comparisons will be faster in any
case. Any reason not to? Alternatively, map to the underlying storage's
date / time types if that fits in with your architecture.

[ snip index identifiers ]

[ Big text files ]

Maybe check out all the public domain works at Project Gutenberg?

<URL:http://promo.net/pg/>

and many FTP mirrors.

There's a good few thousand books, etc., in text form. Easily a few
hundred megs of text, I reckon.

Cheers
-- Marcus

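The accuracy concern raised here is easy to reproduce in plain Python. This sketch assumes, purely for illustration, that the column's display width amounts to two decimal places; the helper names are hypothetical. A float that passes through text at limited precision comes back changed, while an integer count of microseconds survives the same trip.

```python
def round_trip_float(value, decimals=2):
    """Format a float as text with a fixed number of decimals, then parse it back."""
    return float("%.*f" % (decimals, value))


def round_trip_int(value):
    """Format an integer as text, then parse it back."""
    return int(str(value))


# A timestamp from December 2001, in seconds since the epoch.
timestamp = 1007740665.123456

# The same instant as an integer number of microseconds.
micros = int(round(timestamp * 1_000_000))

print(round_trip_float(timestamp) == timestamp)  # False
print(round_trip_int(micros) == micros)          # True
```

Integers also compare exactly, which fits the point above that comparisons on them are cheaper and safer than on floats.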
From: Chris W. <ch...@ni...> - 2001-12-07 16:01:30

Casey Duncan wrote:
>
> +1 on posting commits to the mailing list.

*grinz*

Still waiting for the others to comment...

Chris

From: Chris W. <ch...@ni...> - 2001-12-07 15:59:28

Casey Duncan wrote:
>
> We probably don't need a specific index type for dates and times (because
> they can be distilled down to floats), unless there is some date-specific
> search functionality that is needed (none comes to mind).

Hmmm... maybe not at first, but indexing Dates and Times in this way feels very
unnatural to me. In addition, DateTime combined objects also feel pretty
unnatural :-(

What do other people think?

> It would be beneficial for the index to have a lower level identifier (a 64
> bit int probably) that is exposed in the indexer interface. That way a
> reverse mapping to the string identifier would not be necessary for
> applications that don't need it (such as the ZODB, which has (at least
> currently) an 8-byte object id).

Yeah, this is a toughy. Dunno how to nicely expose it in the interface. Ideas?

Personally, I think a string covers it all, just turn your number into a string
if you need to use it as an identifier. Of course, I'd like to change my mind
later ;-)

> Whether the string ids are supported should probably be decided at the time
> the index is instantiated. We could achieve that by having two index classes.
> A basic class that only supports integer ids and a subclass that supports
> strings.

Well, what the Indexing Engine does internally is entirely up to it ;-)

> On the testing front, I found in my travels some big old piles of text data
> for use in testing IR software. Perhaps a sample database such as one of
> these can be used for scalability testing.
>
> see: http://192.115.216.71/webir/resources.html
> under "Free for all text/web files collection"

Coooool... I wonder if they mind having them hit every time you run the
scalability test?

Chris

From: Casey D. <c.d...@nl...> - 2001-12-06 22:03:54

+1 on posting commits to the mailing list.

/---------------------------------------------------\
Casey Duncan, Sr. Web Developer
National Legal Aid and Defender Association
c.d...@nl...
\---------------------------------------------------/

From: Casey D. <c.d...@nl...> - 2001-12-06 18:51:05

Here are a few notes on the docs:

We probably don't need a specific index type for dates and times (because
they can be distilled down to floats), unless there is some date-specific
search functionality that is needed (none comes to mind).

It would be beneficial for the index to have a lower level identifier (a 64
bit int probably) that is exposed in the indexer interface. That way a
reverse mapping to the string identifier would not be necessary for
applications that don't need it (such as the ZODB, which has (at least
currently) an 8-byte object id).

Whether the string ids are supported should probably be decided at the time
the index is instantiated. We could achieve that by having two index classes.
A basic class that only supports integer ids and a subclass that supports
strings.

On the testing front, I found in my travels some big old piles of text data
for use in testing IR software. Perhaps a sample database such as one of
these can be used for scalability testing.

see: http://192.115.216.71/webir/resources.html
under "Free for all text/web files collection"

/---------------------------------------------------\
Casey Duncan, Sr. Web Developer
National Legal Aid and Defender Association
c.d...@nl...
\---------------------------------------------------/

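The two-class arrangement suggested in this message (a basic class for integer ids, a subclass adding string ids) might be sketched as follows. The class and method names are hypothetical, and the in-memory dictionaries stand in for whatever storage an indexing engine would really use.

```python
class IntegerIdIndex:
    """Basic index: maps each value to the set of integer ids holding it."""

    def __init__(self):
        self._postings = {}

    def index(self, int_id, value):
        self._postings.setdefault(value, set()).add(int_id)

    def search(self, value):
        return set(self._postings.get(value, ()))


class StringIdIndex(IntegerIdIndex):
    """Adds a string <-> integer id mapping on top of the basic index."""

    def __init__(self):
        super().__init__()
        self._to_int = {}   # string id -> integer id
        self._to_str = []   # integer id -> string id (the reverse mapping)

    def _intern(self, str_id):
        # Assign the next free integer the first time a string id is seen.
        if str_id not in self._to_int:
            self._to_int[str_id] = len(self._to_str)
            self._to_str.append(str_id)
        return self._to_int[str_id]

    def index(self, str_id, value):
        super().index(self._intern(str_id), value)

    def search(self, value):
        return {self._to_str[i] for i in super().search(value)}
```

An application that already has integer ids, such as one built on 8-byte object ids, would use the base class and never pay for the reverse mapping; only callers that want string ids instantiate the subclass.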
From: Chris W. <ch...@ni...> - 2001-12-06 15:36:17

Hi,

Quick straw poll:

Should the spec have "Numeric Index" covering integer and floating point numbers
or "Integer Index" and "Floating Point" index?

+1 in a reply please :-)

The argument goes that Numeric is easier to use, but splitting it into two makes
it more efficient...

cheers,

Chris

From: Chris W. <ch...@ni...> - 2001-12-06 13:29:15

Hi,

Welcome to the PythonIndexer list :-)

I'm wondering if it'd be useful to get CVS change notifications sent to this
list...

Would it be useful?
Should they go to another list?
What notifications should go to which lists?

answers-on-a-postcard-ly-yours,

Chris