From: Twylite <tw...@cr...> - 2008-06-19 10:03:31
|
Hi, > OK. I was a bit strong in my rejection of discussing NULLs. But > I didn't want people demanding that 308 be rejected because it failed > to address 185. The general issue of NULL as a value (a non-value > value?) in Tcl is Out Of Scope. Agreed. In my mind this discussion is about how to retrieve and represent SQL data in a Tclish way. > It's not clear to me that the type information has a place in Tcl, > except insofar as a result is NULL. Results of any type have already > in principle been converted to strings (and yes, at least the > ODBC bridge uses pure integers, pure doubles, and so on intelligently > to avoid needless shimmering). Except for "is this NULL", it's > not clear what use a script would make of a column type. In any > case, it really isn't always known; SQLite, for instance, uses > "manifest typing", which comes close to "everything is a Tcl_Obj". Conceded. The main purpose is in fact "isnull" as you suggest. > The only thing that's really awkward is the disambiguation > of duplicate column names. But ODBC and JDBC both manage to > do that anyway (by adding strings like '#1' to the end of > duplicate names). And if the user thinks those names are > ugly, there's always the use of 'AS' in SQL. Then perhaps - if dict is the accepted approach (and I'm not saying it is ;p) - it needs to be mandatory for the DBI to ensure the uniqueness of column names in this manner? > The data returned are at least complete: given the list of > column names, you can get full information with: > > while {[$resultSet nextdict rowdict]} { > foreach column [$resultSet columns] { > if {[dict exists $rowdict $column]} { > ... it's not null, it's [dict get $rowdict $column] > } else { > ... it's null > } > } > } Yes, I _could_. I could write a helper function that does this for me. I could even suggest that such a helper function should be in the DBI. And I could suggest that the performance sucks. > So I'm convinced that we already have a solution that's > 'correct' in that it can deal with 100% of the cases. The > notation may not be ideal, but there's been a lot of time > spent quibbling about it, without coming up with a single > solution that will satisfy everyone's sense of æsthetics. > So instead of {name type value}, we really have {name isnull value} > or {name {Maybe value}}. And, well, Tcl dictionaries are an > extremely natural representation for {name {Maybe value}}. Assuming that name is unique. By "extremely natural" I take it to mean that "if {Maybe value} is Nothing then leave name out of the dict, otherwise include name -> Just value" ? Here are the three persistent problems with this approach: (1) To represent your row as an ordered list of values - which is undeniably a common use - requires a nasty block of embedded loops and dict lookups that you have kindly presented above, and which has horrible performance. So yes there IS a way to get the coveted ordered-list-of-values, I'm just saying the way isn't great. (2) If you have duplicate column names, you are screwed. This appears to be an uncommon case, and the DB could provide a work-around by providing unique suffixes to make duplicate names unique. So yes there IS a solution, I'm just saying that it isn't great. (3) The one that everyone keeps ignoring, and is to my mind the biggest problem with the dict approach: a dict-with-missing-key is, as many have pointed out, the closest thing Tcl has to a true NULL. This brings with it some of the problems of NULL handling. In particular the problem that SQL returns a list of tuples and if I try to access one by name where the value is NULL, things blow up. By "blow up" I mean "dict get $row colname" throws an error. Now you may think that is good - after all we're dealing with NULL, right? Well I'm dealing with a team of developers who have asked for columns X, Y and Z, and expect to be able to [dict get $result X] (or Y or Z) without it blowing up in their face. Unreasonable? Hardly - SQL did in fact return all those columns, even if they "contained NULL". That's how SQL works. It doesn't leave out data just because its NULL - it tells you the cell is there but undefined. Dict does not tell you the cell is there. The software engineering implication of this is that EVERY SQL statement and the code handling its results must be carefully analyzed to determine if NULLs could possibly be returned, and to do a [dict exists] to check for and handle each potential NULL. This is of course a poor approach as a database schema change could invalidate all this careful analysis in a flash. So instead we will see the following boiler-plate code everywhere (or at least we hope to, or there will be bugs): while {[$resultSet nextdict rowdict]} { if ( [dict size $rowdict] != [llength [$resultset columns]] } { # This resultset may contain unexpected NULLs, blow up sanely # otherwise we get a *ahem* useful error message like 'key "x" not known in dictionary" error "Database row contains NULLs - can't process" } In short the dict interface as it stands requires explicit handling of NULLs and does not permit the _safe_ implicit handling of NULLs which - in my experience and from comments on many on c.l.t and mailing lists - is a common use case. > One could even use [dict merge] to supply correct column-dependent > default values for the nulls, something that a putative -nullvalue > option, whether connection-wide or statement-specific, cannot do: > > set row [dict merge {counter 0 middlename NONE age UNKNOWN} $row] Yes, assuming you know the columns coming back. Much of the database code I deal with doesn't know the columns until it has a resultset. Also a [dict merge] doesn't preserve order, so you must still iterate over "$resultset columns" if order is important to you. > So the whole thing comes down to a matter of notational convenience. Absolutely. Notation to overcome that "impedance mismatch". In the many discussions on this topic three use cases have come up repeatedly: - The 80% case that require simple database operations mainly aimed at reporting, in which rows need to be available as an ordered list of values with an appropriate (preferably configurable) value in place of NULL. - The other 80% case that need to access individual values in the resultset - typically by name - to effect some form of processing, and usually don't want to be bothered by NULL (so substitution if fine). - The other other 80% case of serious database users that need to handle NULL explicitly. They say (for some value of "they") that APIs should be optimized for ease of use for the 80% case. My problem with the TDBC API, as it stands, is that it seems to be optimized for the last 80% case, at the expense of the other two. To improve the situation for the first 80% case we need a configurable nullvalue and decent performance. A helper that converts from "$resultset columns" + rowdict to rowlist with substituted NULLs _may_ suffice, but think needs to be demonstrated. To improve the situation for the second 80% case we need a dict- or array-like representation with substituted NULLs, so that elements can be accessed safely. I'm not in the third 80% case so I'm can't claim to know how to optimize it - I assume the dict-with-missing-keys approach is considered appropriate, although I would have thought that something that does not require a bunch of "dict exists" checks would be more pleasant. CAUTION: FLIGHTY THOUGHTS AHEAD Idea: dict {name {isnull value} name {isnull value} ...} which would allow 'lassign [dict get $rowdict mycol] isnull mycol_value' which would make for more readable code (checking 'if { $isnull }' is more readable/understandable than 'if { ! [dict exists $rowdict mycol] }'). Idea: dict {name {isnull bool value val} ...} which allows [dict get $rowdict $name isnull] and [dict get $rowdict $name value]. Its a nested dict so performance may well suck (would need investigation) but the code reads more clearly and the isnull/value are more directly accessible (no lassign/lindex required). Idea: since the developer must either assume (or be confident) that a column is not NULL before doing the [dict get], or check for NULL first, why not a "$resultset isnull colname" method? (I'll answer that: because you may want to take the result row out of the context of a resultset, e.g. using allrows). Why not a "$resultset nulldict varname" to return a dict name -> isnull? Or "$resultset nextrow rowdictVarname ?nulldictVarname?" - the rowdict has substituted NULL values and covers the second 80% case as well as the first ([dict values $rowdict]). There are many possible representations, and I don't think they have been properly considered or investigated because there was too much focus on NULLs in Tcl, and not enough on optimizing for all the common use cases. SUMMARY - Blah, blah, blah. - Implementations should enforce uniqueness of column names (by appending a suffix if required). - Yadda, software engineering, fubar. - You're making it hard for two common use cases that deserve ease-of-use optimization. - Silly idea, silly idea, demand, etc. Regards, Trevor |