From: Mike O. <slu...@gm...> - 2006-02-10 02:05:38
I've got some complicated queries coming up and I'm wondering what the
options are.  The current task is to find the chemicals whose canonical
name starts with a certain letter, or find the chemicals whose name
starts with a non-alphabetic character.  There are 6000 chemicals with
200 fields each, so I'm not sure I want the entire objects in memory.
I need only the ID and name (for hyperlinks in TurboGears).  The
relevant fields are:

    class Chemical(SQLObject):
        canonical_name = ForeignKey("Name")
        names = RelatedJoin("Name")

    class Name(SQLObject):
        name = UnicodeCol(length=500, notNone=True)
        chemicals = RelatedJoin("Chemical")

'.names' is a list of all names (canonical and synonyms).
'.canonical_name' points to one of the '.names' records.  I'm ignoring
synonyms for this task.  (The reason it's a many:many relationship is
there are 85,000 names.  Sometimes the same name is canonical for one
chemical but a synonym for one or more others.)

I can see the SQL in my head and could just do connection.queryAll with:

    SELECT Chemical.id AS id, Name.name AS name
    FROM Chemical, Chemical_Name, Name
    WHERE Chemical.id = Chemical_Name.Chemical_id
      AND Name.id = Chemical_Name.Name_id
      AND Chemical.canonical_nameID = Name.id
      AND Name.name LIKE 'A%'
    ORDER BY 2

But I'm looking for more SQLObject-ish alternatives.

    Chemical.select(??.startswith('A'))
=> Er, but my criterion is actually in the related table, not Chemical.

    Chemical.q.canonical_name.name.startswith('A')
=> Illegal.  .canonical_name doesn't exist but .canonical_nameID does.

    >>> Chemical.select(Name.q.name.startswith('A')).count()
     1/QueryOne: SELECT COUNT(*) FROM Chemical, Name WHERE (Name.name LIKE 'A%')
=> Missing the join expression, will return too many records.
    >>> Chemical.select(AND(Name.q.name.startswith('A'),
    ... Chemical.q.canonical_nameID == Name.q.id)).count()
     1/QueryOne: SELECT COUNT(*) FROM Chemical, Name WHERE ((Name.name
    LIKE 'A%') AND (Chemical.canonical_nameID = Name.id))
    446L
=> This works but it requires me manually setting the join condition.
I'm not supposed to know that .canonical_nameID exists, right?

There's a Select object mentioned in "How to use database agnostic SQL
in SQLObject", but it doesn't seem to be officially documented.
http://www.groovie.org/articles/2005/11/01/how-to-use-database-agnostic-sql-in-sqlobject

I played with that and got:

    >>> sql = conn.sqlrepr(Select([Chemical.q.id, Name.q.name],
    ... where=AND(Name.q.name.startswith('A'),
    ... Chemical.q.canonical_nameID == Name.q.id), orderBy=Name.q.name))
    >>> sql
    "SELECT Chemical.id, Name.name FROM Chemical, Name WHERE ((Name.name
    LIKE 'A%') AND (Chemical.canonical_nameID = Name.id)) ORDER BY Name.name"
    >>> conn.queryAll(sql)
    ( tuple of several 2-tuples listed )

=> That works.  I guess that's what I want.  Any other ideas?

--
Mike Orr <slu...@gm...> (ms...@oz... address is semi-reliable)
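[Editor's note: for readers following along without a SQLObject setup, the canonical-name join above can be sketched with the stdlib sqlite3 module and a toy version of the schema. Table and column names follow the message; the rows are invented.]

```python
import sqlite3

# Toy stand-in for the schema described in the message, using stdlib
# sqlite3.  Only the canonical-name join is modeled; the data is invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Name (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE Chemical (id INTEGER PRIMARY KEY, canonical_nameID INTEGER);
    INSERT INTO Name VALUES (1, 'Acetone'), (2, 'Benzene'), (3, 'Aniline');
    INSERT INTO Chemical VALUES (10, 1), (20, 2), (30, 3);
""")

# Only ID and name come back, not whole 200-field records.
rows = conn.execute("""
    SELECT Chemical.id, Name.name
    FROM Chemical, Name
    WHERE Chemical.canonical_nameID = Name.id
      AND Name.name LIKE 'A%'
    ORDER BY Name.name
""").fetchall()
print(rows)  # [(10, 'Acetone'), (30, 'Aniline')]
```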
From: Oleg B. <ph...@ma...> - 2006-02-10 09:12:26
On Thu, Feb 09, 2006 at 06:05:29PM -0800, Mike Orr wrote:
> I can see the SQL in my head and could just do connection.queryAll with:
>     SELECT Chemical.id AS id, Name.name AS name
>     FROM Chemical, Chemical_Name, Name
>     WHERE Chemical.id = Chemical_Name.Chemical_id
>       AND Name.id = Chemical_Name.Name_id
>       AND Chemical.canonical_nameID = Name.id
>       AND Name.name LIKE 'A%'
>     ORDER BY 2
>
> But I'm looking for more SQLObject-ish alternatives.

http://svn.colorstudy.com/SQLObject/docs/FAQ.txt

"What about GROUP BY, UNION, etc?"
"In short - not every query can be represented in SQLObject..."

> There's a Select object mentioned in "How to use database agnostic SQL
> in SQLObject", but it doesn't seem to be officially documented.

   It is not "officially" documented not because it is "internal" but
because SQLObject desperately needs documentation writers!

> I played with that and got:
>
> >>> sql = conn.sqlrepr(Select([Chemical.q.id, Name.q.name],
> ... where=AND(Name.q.name.startswith('A'),
> ... Chemical.q.canonical_nameID == Name.q.id), orderBy=Name.q.name))
> >>> sql
> "SELECT Chemical.id, Name.name FROM Chemical, Name WHERE ((Name.name
> LIKE 'A%') AND (Chemical.canonical_nameID = Name.id)) ORDER BY
> Name.name"
> >>> conn.queryAll(sql)
> ( tuple of several 2-tuples listed )
>
> => That works.  I guess that's what I want.  Any other ideas?

   That's what you want, and it works, and it's SQLObject-ish enough...

Oleg.
--
 Oleg Broytmann            http://phd.pp.ru/            ph...@ph...
           Programmers don't die, they just GOSUB without RETURN.
From: Mike O. <slu...@gm...> - 2006-02-10 11:16:26
On 2/10/06, Oleg Broytmann <ph...@ma...> wrote:
> It is not "officially" documented not because it is "internal" but
> because SQLObject desperately needs documentation writers!

Yes, I understand.  Just pointing it out in case it got forgotten.  I
can help with docs as soon as I understand it.

It looks like I'll be doing a lot with SQLBuilder and "unorthodox"
queries because my dataset is pretty ornery in places, but I'm tired of
maintaining ad hoc SQL libraries and am looking for something standard
like SQLObject.  So far what I've learned is:

1) SQLObject has three "levels".
   * Col.get(), Col.byName(), etc, are the most OO but can only
     represent simple queries.
   * Col.select() can do more complicated joins and WHEREs but you're
     still getting entire records back; that is, all the fields in one
     table with the usual access to related tables, but not some fields
     from one table and some from another.
   * Select() with connection.queryAll(sql) can return arbitrary values
     rather than entire sqlobjects, and also handle unsupported queries.
     connection.query(sql) does backend-specific commands, can
     create/drop databases, etc.

2) Table.select(RelatedTable.q...) works, and RelatedTable.q can also
be on the right side of an expression and in an orderBy.

> > I played with that and got:
> >
> > >>> sql = conn.sqlrepr(Select([Chemical.q.id, Name.q.name],
> > ... where=AND(Name.q.name.startswith('A'),
> > ... Chemical.q.canonical_nameID == Name.q.id), orderBy=Name.q.name))
> > >>> sql
> > "SELECT Chemical.id, Name.name FROM Chemical, Name WHERE ((Name.name
> > LIKE 'A%') AND (Chemical.canonical_nameID = Name.id)) ORDER BY
> > Name.name"
> > >>> conn.queryAll(sql)
> > ( tuple of several 2-tuples listed )
> >
> > => That works.  I guess that's what I want.  Any other ideas?
>
> That's what you want, and it works, and it's SQLObject-ish enough...

OK, thanks.  Good to have developer confirmation.  Now I just have to
find out where TurboGears stores the connection object.  But I do have
it working with Chemical.select() for the meantime.

Getting heavy performance differences between MySQL and SQLite.  My
original query takes two seconds in MySQL but around 20 in SQLite.  I
added a unique index on the Name table and that halved the MySQL time
(1 second) but didn't help the SQLite time.  So I'll either have to get
a lot smarter with indexes, or build some cache tables, or forget about
SQLite for this.

One thing with indexes.  My darned dataset has names up to 407
characters, so I defined the column as length 500.  That's too long for
an alternateID index in MySQL 5.  I had to create a special index on
just the left part: "CREATE UNIQUE INDEX name_index ON Name (name(255))".
So I don't get a byName method. :(  SQLite meanwhile can index the
entire field but doesn't allow a length, so it needs a different CREATE
INDEX statement.

In MySQL 4 it's worse because VARCHAR columns cannot be > 255 chars, so
a length-500 UnicodeCol is illegal.  An unlengthed UnicodeCol is OK,
but you can't have an alternateID index on unlengthed columns (which
are TEXT).  You can have an index on the left 255 characters, but
that's the same problem as above.  The 255 limit on VARCHAR columns is
what finally made me upgrade to MySQL 5. :)

I may just put the canonical name directly in the Chemical table and
ignore the normalization faux pas.  Either that or make a .name
property for it.

--
Mike Orr <slu...@gm...> (ms...@oz... address is semi-reliable)
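[Editor's note: the dialect difference described above can be sketched with stdlib sqlite3. The MySQL prefix-index statement is shown only as a string, since this sketch has no MySQL server to run it against; the table layout is the toy version from earlier in the thread.]

```python
import sqlite3

# MySQL 5 allows indexing just a prefix of a long VARCHAR column; the
# statement from the message (not executed here):
mysql_stmt = "CREATE UNIQUE INDEX name_index ON Name (name(255))"

# SQLite indexes the whole column and rejects a prefix length, so a
# portable app needs a per-backend CREATE INDEX statement:
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Name (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("CREATE UNIQUE INDEX name_index ON Name (name)")

# With the SQLite form, uniqueness is enforced on the full value:
conn.execute("INSERT INTO Name (name) VALUES ('Acetone')")
duplicate_rejected = False
try:
    conn.execute("INSERT INTO Name (name) VALUES ('Acetone')")
except sqlite3.IntegrityError:
    duplicate_rejected = True  # second identical name is refused
print(duplicate_rejected)  # True
```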
From: Oleg B. <ph...@ph...> - 2006-02-10 11:37:08
On Fri, Feb 10, 2006 at 03:16:13AM -0800, Mike Orr wrote:
> I can help with docs as soon as I understand it.

   It would be very kind of you!

> So I don't get a byName method.

   .selectBy(name="long_name")

> SQLite meanwhile can index the entire field but doesn't allow a
> length, so it needs a different CREATE INDEX statement.

   Writing portable applications is a hard task, either portable GUI or
portable SQL...

Oleg.
--
 Oleg Broytmann            http://phd.pp.ru/            ph...@ph...
           Programmers don't die, they just GOSUB without RETURN.
From: Mike O. <slu...@gm...> - 2006-02-11 01:59:21
On 2/10/06, Mike Orr <slu...@gm...> wrote:
> So far what I've learned is:
>
> 1) SQLObject has three "levels".
>    * Col.get(), Col.byName(), etc, are the most OO but can only
>      represent simple queries.
>    * Col.select() can do more complicated joins and WHEREs but you're
>      still getting entire records back; that is, all the fields in one
>      table with the usual access to related tables, but not some
>      fields from one table and some from another.
>    * Select() with connection.queryAll(sql) can return arbitrary
>      values rather than entire sqlobjects, and also handle unsupported
>      queries.  connection.query(sql) does backend-specific commands,
>      can create/drop databases, etc.

... but .queryAll() returns strings for UnicodeCols, so you have to
convert them to unicode yourself:

    value = unicode(value, 'utf-8')

or, to be pedantic:

    encoding = THE_TABLE.sqlmeta.columns[col_name].dbEncoding
    value = unicode(value, encoding)

Or face the wrath of Kid if you dare to pass it a string with high-byte
"ascii" characters.

--
Mike Orr <slu...@gm...> (ms...@oz... address is semi-reliable)
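[Editor's note: in today's Python 3 terms, the decode step described above looks roughly like this. The row data and the column name are invented for illustration; the 'utf-8' default and the dbEncoding lookup follow the message.]

```python
# A DB-API driver hands back byte strings; the caller decodes them
# itself.  The tuple below mimics one fetched row of (id, name).
raw_row = (1, b'caf\xc3\xa9')

encoding = 'utf-8'  # or look it up: Table.sqlmeta.columns['name'].dbEncoding
decoded = tuple(
    col.decode(encoding) if isinstance(col, bytes) else col
    for col in raw_row
)
print(decoded)  # (1, 'café')
```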
From: Jorge G. <go...@ie...> - 2006-02-11 02:03:27
Mike Orr <slu...@gm...> writes:

> ... but .queryAll() returns strings for UnicodeCols, so you

This sounds like a bug to me...

--
Jorge Godoy <go...@ie...>

"Quidquid latine dictum sit, altum sonatur."
- Qualquer coisa dita em latim soa profundo.
- Anything said in Latin sounds smart.
From: Mike O. <slu...@gm...> - 2006-02-11 02:52:37
On 2/10/06, Jorge Godoy <go...@ie...> wrote:
> Mike Orr <slu...@gm...> writes:
>
> > ... but .queryAll() returns strings for UnicodeCols, so you
>
> This sounds like a bug to me...

It's just a lower level.  connection.queryAll(sql) is essentially just
a wrapper around DBAPI's "cursor.execute(sql); return
cursor.fetchall()".  It doesn't parse the SQL to see what the columns
are, much less know what a UnicodeCol is.

--
Mike Orr <slu...@gm...> (ms...@oz... address is semi-reliable)
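[Editor's note: a minimal sketch of the behaviour described above, using stdlib sqlite3. A queryAll-style helper is just execute-plus-fetchall and returns whatever tuples the driver produces, with no awareness of column types; the helper name and toy table are invented.]

```python
import sqlite3

def query_all(conn, sql):
    # Execute the SQL and return all rows -- no parsing of the
    # statement, no knowledge of what the columns' declared types are.
    cur = conn.cursor()
    cur.execute(sql)
    return cur.fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, name TEXT)")
conn.execute("INSERT INTO t VALUES (1, 'Acetone')")

rows = query_all(conn, "SELECT id, name FROM t")
print(rows)  # plain tuples, exactly as the driver returned them
```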
From: Jorge G. <go...@ie...> - 2006-02-11 03:28:30
Mike Orr <slu...@gm...> writes:

> On 2/10/06, Jorge Godoy <go...@ie...> wrote:
> > Mike Orr <slu...@gm...> writes:
> >
> > > ... but .queryAll() returns strings for UnicodeCols, so you
> >
> > This sounds like a bug to me...
>
> It's just a lower level.  connection.queryAll(sql) is essentially just
> a wrapper around DBAPI's "cursor.execute(sql); return
> cursor.fetchall()".  It doesn't parse the SQL to see what the columns
> are, much less know what a UnicodeCol is.

If your database is configured for UTF-8, .queryAll() should return
unicode instances.

--
Jorge Godoy <go...@ie...>

"Quidquid latine dictum sit, altum sonatur."
- Qualquer coisa dita em latim soa profundo.
- Anything said in Latin sounds smart.
From: Max I. <isc...@gm...> - 2006-02-11 10:15:59
> > > > ... but .queryAll() returns strings for UnicodeCols, so you
> > >
> > > This sounds like a bug to me...
> >
> > It's just a lower level.  connection.queryAll(sql) is essentially
> > just a wrapper around DBAPI's "cursor.execute(sql); return
> > cursor.fetchall()".  It doesn't parse the SQL to see what the
> > columns are, much less know what a UnicodeCol is.
>
> If your database is configured for UTF-8, .queryAll() should return
> unicode instances.

Theoretically, yes.  But IIRC not every Python DB-API driver works that
way, so if SQLObject just passes the results up it may be possible.
IMO, this is still a SQLObject bug though.
From: Mike O. <slu...@gm...> - 2006-02-11 12:31:57
On 2/11/06, Max Ischenko <isc...@gm...> wrote:
> > > > > ... but .queryAll() returns strings for UnicodeCols, so you
> > > >
> > > > This sounds like a bug to me...
> > >
> > > It's just a lower level.  connection.queryAll(sql) is essentially
> > > just a wrapper around DBAPI's "cursor.execute(sql); return
> > > cursor.fetchall()".  It doesn't parse the SQL to see what the
> > > columns are, much less know what a UnicodeCol is.
> >
> > If your database is configured for UTF-8, .queryAll() should return
> > unicode instances.
>
> Theoretically, yes.  But IIRC not every Python DB-API driver works
> that way, so if SQLObject just passes the results up it may be
> possible.  IMO, this is still a SQLObject bug though.

I saw MySQL has a command to set the charset but it looks like you'd
have to set it with a custom query.  Dunno about SQLite.  I don't care
about the others.

--
Mike Orr <slu...@gm...> (ms...@oz... address is semi-reliable)
From: Tracy R. <tra...@gm...> - 2006-02-11 23:00:35
On Feb 9, 2006, at 8:05 PM, Mike Orr wrote:
> There are 6000 chemicals with 200 fields each, so I'm not sure I want
> the entire objects in memory.  I need only the ID and name (for
> hyperlinks in TurboGears).

Hey Mike,

I'm curious about how you might be dealing with larger data sets where
you don't want every single value from the database cached in an object
somewhere.  For my domain objects, the numbers are fairly reasonable so
I just use SQLObject straight up the way it was intended to be used.
But, with my *event* tables that record various user events within the
system, the tables are 1 to 2 million rows and counting.  For these,
I'm using the _connection.queryAll and _connection.queryOne methods
within some home grown helper classes.

I'm not sure that there is a programmatic solution within SQLObject for
tables with many, many, many rows, is there?  Does anyone out there
have any experience dealing with tables of this size within SQLObject?

--T
From: Tracy R. <tra...@gm...> - 2006-02-12 16:59:48
On Feb 12, 2006, at 12:35 AM, Mike Orr wrote:
> > I'm not sure that there is a programmatic solution within SQLObject
> > for tables with many, many, many rows, is there?  Does anyone out
> > there have any experience dealing with tables of this size within
> > SQLObject?
>
> Your situation is the opposite of mine so I don't know.  I have a few
> huge records; you have a lot of small ones.  I think SQLObject holds
> weakrefs to the records and they disappear immediately when they go
> out of scope.  My problem has been, is SQLite memory efficient?
> connection.queryAll works fine so I wouldn't hesitate to use it.  You
> can use Select() and sqlbuilder and COLUMN.q to construct your SQL if
> you want.  Everything else in SQLObject seems tied to the record
> object, so I think you get entire records.

For others who might be in the same situation as me, here's how I
currently deal with processing very large tables:

    # process.py
    from itertools import chain, imap

    def step(selection, size):
        i = 0
        while 1:
            output = selection[i:i+size]
            if output.count() > 0:
                for row in output:
                    yield row
                i += size
            else:
                raise StopIteration

    def process(f, selection, stepsize=1000):
        return imap(f, chain(step(selection, stepsize)))

And, using it would look something like this:

    # processor.py
    from process import process

    def f(item):
        pass  # do the processing work here

    gen = process(f, bigtable.select())
    list(gen)  # do the processing

In this setup, there shouldn't be more than 1000 sqlobjects in memory
at any one time.  *But*, the size of the Python process grows and grows
over time as it's processing the rows.  In my test of a 3 million row
table where 'f' does nothing at all, it doesn't look like any of the
sqlobject rows are being removed from memory with the above functions.
Does anyone have any suggestions about how to 'release' those objects?
I thought the generators would efficiently take care of it.  As soon as
the local variables 'output' and 'row' are reassigned with the new
objects, the old objects should have no more references.  Is that
correct?

I'm using python2.4, fedora core 4 and MySQL 4.1 (and an older version
of sqlobject: r569)

Thanks,
--Tracy
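[Editor's note: the chunked-iteration idea above can be sketched in current Python 3 terms. A plain list stands in for the SelectResults object, the built-in map replaces itertools.imap, and a plain return replaces raise StopIteration, which PEP 479 no longer allows inside a generator; the single-argument chain call in the original was a no-op and is dropped.]

```python
def step(selection, size):
    # Yield rows one chunk at a time so that at most `size` items are
    # sliced out of `selection` at once.
    i = 0
    while True:
        output = selection[i:i + size]
        if not output:
            return  # PEP 479: don't raise StopIteration in a generator
        for row in output:
            yield row
        i += size

def process(f, selection, stepsize=1000):
    # Lazily apply f to every row of selection, chunk by chunk.
    return map(f, step(selection, stepsize))

# Demonstration on a plain list standing in for a big select result:
results = list(process(lambda x: x * 2, list(range(5)), stepsize=2))
print(results)  # [0, 2, 4, 6, 8]
```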