Re: [Tcl-tdbc] Last-minute comments on TIP #308

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 422-6466

Hi,
> OK.  I was a bit strong in my rejection of discussing NULLs. But
> I didn't want people demanding that 308 be rejected because it failed
> to address 185.  The general issue of NULL as a value (a non-value
> value?) in Tcl is Out Of Scope.
Agreed.  In my mind this discussion is about how to retrieve and 
represent SQL data in a Tclish way.
> It's not clear to me that the type information has a place in Tcl,
> except insofar as a result is NULL.  Results of any type have already
> in principle been converted to strings (and yes, at least the
> ODBC bridge uses pure integers, pure doubles, and so on intelligently
> to avoid needless shimmering).  Except for "is this NULL", it's
> not clear what use a script would make of a column type.  In any
> case, it really isn't always known; SQLite, for instance, uses
> "manifest typing", which comes close to "everything is a Tcl_Obj".
Conceded.  The main purpose is in fact "isnull" as you suggest.
>  The only thing that's really awkward is the disambiguation
> of duplicate column names. But ODBC and JDBC both manage to
> do that anyway (by adding strings like '#1' to the end of
> duplicate names).  And if the user thinks those names are
> ugly, there's always the use of 'AS' in SQL.
Then perhaps - if dict is the accepted approach (and I'm not saying it 
is ;p) - it needs to be mandatory for the DBI to ensure the uniqueness 
of column names in this manner?
> The data returned are at least complete: given the list of
> column names, you can get full information with:
>
> while {[$resultSet nextdict rowdict]} {
>     foreach column [$resultSet columns] {
>         if {[dict exists $rowdict $column]} {
>             ... it's not null, it's [dict get $rowdict $column]
>         } else {
>             ... it's null
>         }
>     }
> }
Yes, I _could_.  I could write a helper function that does this for me.  
I could even suggest that such a helper function should be in the DBI.  
And I could suggest that the performance sucks. 
> So I'm convinced that we already have a solution that's
> 'correct' in that it can deal with 100% of the cases.  The
> notation may not be ideal, but there's been a lot of time
> spent quibbling about it, without coming up with a single
> solution that will satisfy everyone's sense of æsthetics.
> So instead of {name type value}, we really have {name isnull value}
> or {name {Maybe value}}.  And, well, Tcl dictionaries are an
> extremely natural representation for {name {Maybe value}}.
Assuming that name is unique.

By "extremely natural" I take it to mean that "if {Maybe value} is 
Nothing then leave name out of the dict, otherwise include name -> Just 
value" ?

Here are the three persistent problems with this approach:

(1) To represent your row as an ordered list of values - which is 
undeniably a common use - requires a nasty block of embedded loops and 
dict lookups that you have kindly presented above, and which has 
horrible performance.
So yes there IS a way to get the coveted ordered-list-of-values, I'm 
just saying the way isn't great.

(2) If you have duplicate column names, you are screwed.  This appears 
to be an uncommon case, and the DB could provide a work-around by 
providing unique suffixes to make duplicate names unique.
So yes there IS a solution, I'm just saying that it isn't great.

(3) The one that everyone keeps ignoring, and is to my mind the biggest 
problem with the dict approach: a dict-with-missing-key is, as many have 
pointed out, the closest thing Tcl has to a true NULL.  This brings with 
it some of the problems of NULL handling.  In particular the problem 
that SQL returns a list of tuples and if I try to access one by name 
where the value is NULL, things blow up.  By "blow up" I mean "dict get 
$row colname" throws an error.

Now you may think that is good - after all we're dealing with NULL, 
right?  Well I'm dealing with a team of developers who have asked for 
columns X, Y and Z, and expect to be able to [dict get $result X] (or Y 
or Z) without it blowing up in their face.  Unreasonable?  Hardly - SQL 
did in fact return all those columns, even if they "contained NULL".  
That's how SQL works.  It doesn't leave out data just because its NULL - 
it tells you the cell is there but undefined.  Dict does not tell you 
the cell is there.

The software engineering implication of this is that EVERY SQL statement 
and the code handling its results must be carefully analyzed to 
determine if NULLs could possibly be returned, and to do a [dict exists] 
to check for and handle each potential NULL.  This is of course a poor 
approach as a database schema change could invalidate all this careful 
analysis in a flash.
So instead we will see the following boiler-plate code everywhere (or at 
least we hope to, or there will be bugs):
  while {[$resultSet nextdict rowdict]} {
    if ( [dict size $rowdict] != [llength [$resultset columns]] } {
      # This resultset may contain unexpected NULLs, blow up sanely
      # otherwise we get a *ahem* useful error message like 'key "x" not 
known in dictionary"
      error "Database row contains NULLs - can't process"
    }

In short the dict interface as it stands requires explicit handling of 
NULLs and does not permit the _safe_ implicit handling of NULLs which - 
in my experience and from comments on many on c.l.t and mailing lists - 
is a common use case.
> One could even use [dict merge] to supply correct column-dependent
> default values for the nulls, something that a putative -nullvalue
> option, whether connection-wide or statement-specific, cannot do:
>
>     set row [dict merge {counter 0 middlename NONE age UNKNOWN} $row]
Yes, assuming you know the columns coming back.  Much of the database 
code I deal with doesn't know the columns until it has a resultset.
Also a [dict merge] doesn't preserve order, so you must still iterate 
over "$resultset columns" if order is important to you.

> So the whole thing comes down to a matter of notational convenience.
Absolutely.  Notation to overcome that "impedance mismatch".

In the many discussions on this topic three use cases have come up 
repeatedly:

- The 80% case that require simple database operations mainly aimed at 
reporting, in which rows need to be available as an ordered list of 
values with an appropriate (preferably configurable) value in place of NULL.
- The other 80% case that need to access individual values in the 
resultset - typically by name - to effect some form of processing, and 
usually don't want to be bothered by NULL (so substitution if fine).
- The other other 80% case of serious database users that need to handle 
NULL explicitly.

They say (for some value of "they") that APIs should be optimized for 
ease of use for the 80% case.  My problem with the TDBC API, as it 
stands, is that it seems to be optimized for the last 80% case, at the 
expense of the other two.

To improve the situation for the first 80% case we need a configurable 
nullvalue and decent performance.  A helper that converts from 
"$resultset columns" + rowdict to rowlist with substituted NULLs _may_ 
suffice, but think needs to be demonstrated.

To improve the situation for the second 80% case we need a dict- or 
array-like representation with substituted NULLs, so that elements can 
be accessed safely.

I'm not in the third 80% case so I'm can't claim to know how to optimize 
it - I assume the dict-with-missing-keys approach is considered 
appropriate, although I would have thought that something that does not 
require a bunch of "dict exists" checks would be more pleasant.

CAUTION: FLIGHTY THOUGHTS AHEAD

Idea: dict {name {isnull value} name {isnull value} ...} which would 
allow 'lassign [dict get $rowdict mycol] isnull mycol_value' which would 
make for more readable code (checking 'if { $isnull }' is more 
readable/understandable than 'if { ! [dict exists $rowdict mycol] }').

Idea: dict {name {isnull bool value val} ...} which allows [dict get 
$rowdict $name isnull] and [dict get $rowdict $name value].  Its a 
nested dict so performance may well suck (would need investigation) but 
the code reads more clearly and the isnull/value are more directly 
accessible (no lassign/lindex required).

Idea: since the developer must either assume (or be confident) that a 
column is not NULL before doing the [dict get], or check for NULL first, 
why not a "$resultset isnull colname" method?  (I'll answer that: 
because you may want to take the result row out of the context of a 
resultset, e.g. using allrows).  Why not a "$resultset nulldict varname" 
to return a dict name -> isnull?  Or "$resultset nextrow rowdictVarname 
?nulldictVarname?" - the rowdict has substituted NULL values and covers 
the second 80% case as well as the first ([dict values $rowdict]).

There are many possible representations, and I don't think they have 
been properly considered or investigated because there was too much 
focus on NULLs in Tcl, and not enough on optimizing for all the common 
use cases.

SUMMARY

- Blah, blah, blah.
- Implementations should enforce uniqueness of column names (by 
appending a suffix if required).
- Yadda, software engineering, fubar.
- You're making it hard for two common use cases that deserve 
ease-of-use optimization.
- Silly idea, silly idea, demand, etc.

Regards,
Trevor

Re: [Tcl-tdbc] Last-minute comments on TIP #308

The Tool Command Language implementation

Re: [Tcl-tdbc] Last-minute comments on TIP #308