Thread: [PyIndexer] Notes on Docs and Testing
Status: Pre-Alpha
Brought to you by:
cduncan
From: Casey D. <c.d...@nl...> - 2001-12-06 18:51:05
|
Here are a few notes on the docs:

We probably don't need a specific index type for dates and times (because they can be distilled down to floats), unless there is some date-specific search functionality that is needed (none comes to mind).

It would be beneficial for the index to have a lower-level identifier (a 64-bit int probably) that is exposed in the indexer interface. That way a reverse mapping to the string identifier would not be necessary for applications that don't need it (such as the ZODB, which has (at least currently) an 8-byte object id).

Whether the string ids are supported should probably be decided at the time the index is instantiated. We could achieve that by having two index classes: a basic class that only supports integer ids and a subclass that supports strings.

On the testing front, I found in my travels some big old piles of text data for use in testing IR software. Perhaps a sample database such as one of these can be used for scalability testing.

See: http://192.115.216.71/webir/resources.html
under "Free for all text/web files collection"

/---------------------------------------------------\
Casey Duncan, Sr. Web Developer
National Legal Aid and Defender Association
c.d...@nl...
\---------------------------------------------------/
|
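A minimal sketch of the two-class idea, for illustration only. The class names `IntIdIndex` and `StringIdIndex`, the `index`/`search` methods, and the in-memory dictionaries are all assumptions made up for this example; none of them come from the PyIndexer docs.

```python
class IntIdIndex:
    """Hypothetical basic index: documents are identified by 64-bit ints only."""

    MAX_ID = 2**64 - 1

    def __init__(self):
        self._postings = {}  # term -> set of integer ids

    def index(self, int_id, terms):
        if not 0 <= int_id <= self.MAX_ID:
            raise ValueError("id must fit in an unsigned 64-bit integer")
        for term in terms:
            self._postings.setdefault(term, set()).add(int_id)

    def search(self, term):
        # Integer ids come straight back; no reverse mapping is involved.
        return sorted(self._postings.get(term, ()))


class StringIdIndex(IntIdIndex):
    """Hypothetical subclass: layers a string <-> integer mapping on top."""

    def __init__(self):
        super().__init__()
        self._to_int = {}  # string id -> integer id
        self._to_str = []  # integer id -> string id (list position is the id)

    def index(self, str_id, terms):
        int_id = self._to_int.get(str_id)
        if int_id is None:
            int_id = len(self._to_str)
            self._to_int[str_id] = int_id
            self._to_str.append(str_id)
        super().index(int_id, terms)

    def search(self, term):
        # Only this subclass pays for the reverse mapping.
        return [self._to_str[i] for i in super().search(term)]
```

With this split, the choice between integer and string ids is made when the index is instantiated, and an application that already has integer ids never touches the mapping tables.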
From: Chris W. <ch...@ni...> - 2001-12-07 15:59:28
|
Casey Duncan wrote:
>
> We probably don't need a specific index type for dates and times (because
> they can be distilled down to floats), unless there is some date-specific
> search functionality that is needed (none comes to mind).

Hmmm... maybe not at first, but indexing Dates and Times in this way feels very unnatural to me. In addition, DateTime combined objects also feel pretty unnatural :-(

What do other people think?

> It would be beneficial for the index to have a lower level identifier (a 64
> bit int probably) that is exposed in the indexer interface. That way a
> reverse mapping to the string identifier would not be necessary for
> applications that don't need it (such as the ZODB, which has (at least
> currently) an 8-byte object id).

Yeah, this is a toughy. Dunno how to nicely expose it in the interface. Ideas?

Personally, I think a string covers it all, just turn your number into a string if you need to use it as an identifier. Of course, I'd like to change my mind later ;-)

> Whether the string ids are supported should probably be decided at the time
> the index is instantiated. We could achieve that by having two index classes.
> A basic class that only supports integer ids and a subclass that supports
> strings.

Well, what the Indexing Engine does internally is entirely up to it ;-)

> On the testing front, I found in my travels some big old piles of text data
> for use in testing IR software. perhaps a sample database such as one of
> these can be used for scalability testing.
>
> see: http://192.115.216.71/webir/resources.html
> under "Free for all text/web files collection"

Coooool... I wonder if they mind having them hit every time you run the scalability test?

Chris
|
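One way to read "just turn your number into a string" is to pack the integer into a fixed-width 8-byte string, which matches the size of the object id mentioned above. This is only a sketch: the helper names `int_to_oid` and `oid_to_int` are invented here, and the big-endian layout is an assumption chosen so that byte-wise ordering matches numeric ordering.

```python
import struct


def int_to_oid(n):
    """Pack an unsigned 64-bit integer into an 8-byte big-endian string."""
    return struct.pack(">Q", n)


def oid_to_int(oid):
    """Inverse of int_to_oid: unpack 8 bytes back into an integer."""
    return struct.unpack(">Q", oid)[0]
```

Because the encoding is fixed-width and big-endian, sorting the packed strings gives the same order as sorting the integers, so an index keyed on strings would not lose range ordering.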
From: Marcus C. <ma...@wr...> - 2001-12-07 17:49:16
|
On Fri, 7 Dec 2001 at 15:57:45 +0000, Chris Withers wrote:

[ DateTime handling ]

> Casey Duncan wrote:
> >
> > We probably don't need a specific index type for dates and times (because
> > they can be distilled down to floats), unless there is some date-specific
> > search functionality that is needed (none comes to mind).
>
> Hmmm... maybe not at first, but indexing Dates and Times in this way feels very unnatural to me.
> In addition, DateTime combined objects also feel pretty unnatural :-(

Coming in late here (*wave* everyone!), so excuse me if I'm missing the mark...

Couldn't using floats for the dates result in an accuracy problem, particularly if they're stored in a database which may have a default display width assigned to its column (as is the case with MySQL and probably most DBMS)? The database server returns the float as a string, the database API converts it back to a float, and you may end up with a different float to the one you had originally.

Probably better to use a long int for handling dates and times, and if you end up doing any searching, the comparisons will be faster in any case.

Any reason not to? Alternatively, map to the underlying storage's date / time types if that fits in with your architecture.

[ snip index identifiers ]

[ Big text files ]

Maybe check out all the public domain works at Project Gutenberg?

<URL:http://promo.net/pg/>

and many FTP mirrors.

There's a good few thousand books, etc., in text form. Easily a few hundred megs of text, I reckon.

Cheers

-- Marcus
|
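The round-trip concern can be illustrated without a database. In this sketch the `"%.6g"` format merely stands in for a column with a limited display width; it is an assumption for demonstration, not a claim about how MySQL formats floats. The function name `roundtrip_via_text` and the microsecond integer encoding are likewise made up for the example.

```python
def roundtrip_via_text(value, fmt="%.6g"):
    """Simulate float -> limited-width text -> float, as a driver might do."""
    return float(fmt % value)


# A timestamp in seconds since the epoch, with a fractional part.
timestamp = 1007664665.123456
as_float = roundtrip_via_text(timestamp)

# Hypothetical alternative: store whole microseconds as an integer.
# An integer survives the trip through text unchanged.
as_int = int(round(timestamp * 1_000_000))
restored_int = int(str(as_int))

print(as_float == timestamp)   # the float came back different
print(restored_int == as_int)  # the integer came back identical
```

With six significant digits the float comes back as 1007660000.0, several thousand seconds away from the original, while the integer is restored exactly.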
From: Chris W. <ch...@ni...> - 2001-12-10 12:39:05
|
Marcus Collins wrote:
>
> probably most DBMS)? The database server returns the float as a string,
> the database API converts it back to a float,

It does? Why? That sucks :-S

> Probably better to use a long int for handling dates and times, and if
> you end up doing any searching, the comparisons will be faster in any
> case.

True :-)

> Any reason not to? Alternatively, map to the underlying storage's
> date / time types if that fits in with your architecture.

Yeah, that was my plan...

> Maybe check out all the public domain works at Project Gutenberg?
>
> <URL:http://promo.net/pg/>
>
> and many FTP mirrors.
>
> There's a good few thousand books, etc., in text form. Easily a few
> hundred megs of text, I reckon.

Cool... more sample data :-)

cheers,

Chris
|
From: Marcus C. <ma...@wr...> - 2001-12-10 14:22:34
|
On Mon, 10 Dec 2001 at 12:37:11 +0000, Chris Withers wrote:

> Marcus Collins wrote:
> >
> > probably most DBMS)? The database server returns the float as a string,
> > the database API converts it back to a float,
>
> It does? why? that sucks :-S

The MySQL C API returns rows as an array of strings. See the typedefs and prototypes in mysql.h...

I don't know the dark magic, but AFAIK in MySQLdb there's a 'converters' module (which you can override) which examines the column type reported by the API and converts the string output to native Python objects.

[ snip sample data ]

Cheers

-- Marcus
|
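The general shape of such a conversion step might look like the following. This is a standalone sketch, not MySQLdb's actual code: the type codes `TYPE_INT`, `TYPE_FLOAT`, `TYPE_STRING`, the `DEFAULT_CONVERTERS` table, and the `convert_row` function are all hypothetical names introduced for illustration.

```python
# Hypothetical column type codes; a real driver defines its own constants.
TYPE_INT, TYPE_FLOAT, TYPE_STRING = range(3)

# Maps a column type to the callable that turns its string form into
# a native Python object. Callers can pass their own table to override it.
DEFAULT_CONVERTERS = {TYPE_INT: int, TYPE_FLOAT: float, TYPE_STRING: str}


def convert_row(row, column_types, converters=DEFAULT_CONVERTERS):
    """Convert a row of strings (None for NULL) into native Python objects."""
    result = []
    for raw, col_type in zip(row, column_types):
        if raw is None:
            result.append(None)  # NULL stays None whatever the column type
        else:
            # Unknown column types fall back to plain strings.
            result.append(converters.get(col_type, str)(raw))
    return tuple(result)
```

For example, `convert_row(("42", "3.5", None), (TYPE_INT, TYPE_FLOAT, TYPE_STRING))` gives `(42, 3.5, None)`. Overriding is just a matter of passing a different table, which is presumably why a driver would expose one.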
From: Chris W. <ch...@ni...> - 2001-12-10 15:55:28
|
Marcus Collins wrote:
>
> On Mon, 10 Dec 2001 at 12:37:11 +0000, Chris Withers wrote:
>
> > Marcus Collins wrote:
> > >
> > > probably most DBMS)? The database server returns the float as a string,
> > > the database API converts it back to a float,
> >
> > It does? why? that sucks :-S
>
> The MySQL C API returns rows as an array of strings. See the typedefs
> and prototypes in mysql.h...

Booo hiss and shame on MySQL :-(

Has anyone requested this 'feature' be changed?

cheers,

Chris
|
From: Marcus C. <ma...@wr...> - 2001-12-12 17:09:56
|
On Mon, 10 Dec 2001 at 15:53:47 +0000, Chris Withers wrote:

> Marcus Collins wrote:
>
> > The MySQL C API returns rows as an array of strings. See the typedefs
> > and prototypes in mysql.h...
>
> Booo hiss and shame on MySQL :-(
>
> Has anyone requested this 'feature' be changed?

I think it's probably implemented like this because returning it as a string is the most universal -- some of MySQL's data types (such as DECIMAL, for example) have no native counterparts in C. Also, MySQL has locale support which may not be available in the C libraries on all the platforms on which the C API might be used (which could very well be different to the platform on which the server is running!). And most importantly, the return type of the C function needs to be known!

However, it would indeed be good to see the C API support the data types C knows about natively -- and these do account for the majority of MySQL's native data types. It would be interesting to determine what it would take to add this.

I've looked only at the typedefs and declarations, and not at the definitions in the source, but I think that it could be done by allocating a pointer to the binary representation in MYSQL_RES, and having an access function in the Python wrapper (not in C; it wouldn't be all that useful since the return type would be ambiguous at compile time) to examine the column type, treat the binary data appropriately, and return a native Python object of the appropriate type.

Then again, it depends how these are returned by the server itself ;-)

Cheers

-- Marcus
|
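The proposed access function could be sketched roughly as below, purely as a thought experiment. Nothing here reflects the real MySQL client library: the column type names, the `BINARY_FORMATS` table, the `decode_field` function, and the little-endian layout are all assumptions for illustration.

```python
import struct

# Hypothetical column type names mapped to struct format strings.
# Little-endian layouts are assumed here; a real protocol may differ.
BINARY_FORMATS = {"int32": "<i", "int64": "<q", "double": "<d"}


def decode_field(column_type, raw):
    """Examine the column type and turn raw bytes into a native object."""
    fmt = BINARY_FORMATS.get(column_type)
    if fmt is None:
        # Types with no native C counterpart (DECIMAL, say) stay textual.
        return raw.decode("utf-8")
    (value,) = struct.unpack(fmt, raw)
    return value
```

The point of doing this in the wrapper rather than in C is visible in the return statement: the Python function can hand back an int, a float, or a string depending on the column, which a single C function signature cannot express. A double decoded this way is bit-for-bit the value that was stored, so the text round-trip problem discussed earlier would not arise.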
From: Chris W. <ch...@ni...> - 2001-12-13 16:05:20
|
Marcus Collins wrote:
>
> I think it's probably implemented like this because returning it as a
> string is the most universal

Yeah, I think I actually see why it needs to be like that now, just a shame we need to take the implicit performance hit...

> However, it would indeed be good to see the C API support the data types
> C knows about natively -- and these do account for the majority of
> MySQL's native data types. It would be interesting to determine what it
> would take to add this.

That sounds like a question for the MySQL lists ;-)

<snip>
> to examine the column type, treat the binary data appropriately, and
> return a native Python object of the appropriate type.

That would be cool, if I understood it entirely... I got the native Python object bit :-S

cheers,

Chris
|