From: <ch...@op...> - 2004-10-21 19:05:53
Ok, I've spent some good time with the interfaces the Refractions crew has thrown up, and have more thoughts and concerns than I could reasonably organize and put down in one email. So I think I'm going to mostly address the main issue I have with them.

But before I begin, something positive - I really like many of the motivations for these changes. I think our API is in need of another evolutionary leap forward, especially simplified creation, a better starting point for file-based feature sources, better grid coverage integration, integration of FeatureResults/FeatureCollection/FeatureIterator/FeatureReader, and random access and spatial index queries. I support upping fid mapping to the public API, and would actually like to take it further and extend the mapping idea to all objects, not just feature ids. I want most all of these as well, and I'd like to be sure that some sort of joins are actually working, instead of mere ideas thrown into the API that might work.

What I really don't like is this Catalog coming smashing into our API. I've taken a good bit of time to figure out why I feel this way; I went back through the catalog specs, looked at the GeoAPI interfaces, studied the new interfaces, and talked to David a bit. It basically comes down to two things:

1.) I think the catalog specs are not very good specifications.
2.) I feel the way they are being used in the interfaces isn't very close to the intention of the specs.

For the first point, how many catalog implementations do you guys know of? And compare that to how many WFS and WMS implementations you know of. The only catalog implementation I know of is Ionic's, and if you look at their catalog product it's based much more on some other XML technology. I forget what it is at the moment, and I'm off-line so I can't look it up, but it's another web services standard put out by a different organization. I've talked to people in the OGC, and most of them agree that catalogs need to improve.
I admit that this isn't a direct criticism of the catalog spec, but it is an indication of how useful other people have found it. As for my own personal opinion, the catalog specs themselves do not strike me as 'good' specifications. WFS and WMS lay out exactly what they want to do, and provide all the details to do it generically. The catalog specs either come across as incredibly abstract, talking about vague concepts, or incredibly detailed - 'this is how you implement our spec with z39.50' - which comes off to me as 'we wrote a z39.50 catalog, and now we're going to write up how it works'.

The fact that there are multiple specs is definitely a point of confusion, as there are at least 3 substantially different specifications - the abstract spec, 1.*, and 2.0 - and people seem to just select whichever one props up their point (which I could easily do to argue for pages in support of my second point, but which I will attempt to hold off doing). The specs also seem to suffer from trying to prop themselves up by referring to other OGC documents, but in a very superficial way. An example:

'The catalog entry consists of an aggregation of metadata attributes, at least one of which describes the "footprint" of the data referenced. Thus, a catalog entry meets the fundamental definition of a feature. For this reason, the Catalog Entry class realizes the Feature interface, that is, it supports all interface protocols defined on Feature. Since the catalog entries are sub-types of feature, their aggregation, the Catalog, is a sub-type of feature collection. Thus, the Catalog realizes the interface for Feature Collection.'

Now this statement is of course true, but anything with a geographic attribute is a Feature. They go on to suggest that one way to implement a robust catalog is thus to use OGC-compliant feature data stores.
This point actually lends support in defense of what I'm against in the second point - it says that you could have metadata be a feature. But it talks not at all about how one would go about doing that; they just mention that the models may be similar, which for me confuses the issue even more.

Ok, I'm going to stop talking about why I don't like the catalog spec(s), unless anyone feels it necessary to call me on some of this stuff, since I'm not going incredibly in depth, and one could easily level the criticism that I don't like it because I don't understand it. To which I could reply with a quote I recently heard: if you read a spec and don't understand it, don't worry about it - it's most likely that no one else understood it either, and so it's not going to be important. But needless to say, I think the catalog spec(s) is/are bad spec(s).

What this does beg is an examination of how closely GeoTools needs to follow OGC specs. I personally feel we have no obligation to at all. They are not paying us (at one point they were, but even then there was no prerogative to use their specs in our interfaces). We originally used the OGC specs as inspiration for our interfaces because they are very good specs. When reading them it is obvious that a lot of very smart people with a lot of experience have thought through these issues long and hard. We could bootstrap on their knowledge, and focus on implementation instead of abstract notions. And we'd gain in clarity of our interfaces, since anyone who knew the OGC specs would much more easily understand where we were coming from and going. I think this has benefited us enormously. But I feel very strongly that we should not be dogmatic in our use of OGC specs. If they put out bad specs, there's no reason to incorporate them into our interfaces, to blindly follow where they lead. This does of course beg the question of GeoAPI.
I must admit that I am less excited about GeoAPI than I initially was, mostly due to the fact that deegree seems to have dropped off the map. I saw it as a coming together of open source projects, not as having our interfaces voted on by the OGC. My feeling now is we should make use of them, but only where their interfaces are substantially better than ours, where the cost of rewriting is worth the gain. This is obviously just my opinion, and is open for discussion. I very much support Martin's work in GeoAPI, and feel good about using it for the lower-level referencing and coordinate transformation stuff that he's always worked with. The geometry stuff seems like it could be good, allowing us to plug in different implementations. But when we get into datastores and even feature models I'm more hesitant. And I'm hesitant to borrow a Catalog interface that hasn't been tested as far as I know (and is drawn from, in my opinion, bad spec(s)), and attempt to fit it to our needs.

Which leads me to my second point, but before we get there, one last thing on the use of OGC specs. I encourage us to evaluate OGC specs for how they can help us, for whether the interfaces one derives from them are useful for understanding and simplifying things. Open source has the ability to choose from the best out there, and when the best out there isn't sufficient, to work in a community to come up with a better way. We've got a lot of smart people here (who are often distracted with other priorities, understandably, since none of us are paid to work directly on GeoTools), and a lot of people who care about this project. I think we just need to come together and move our architecture forward another step - albeit a bit more slowly than the last major changes, as people do seem to have more commitments, but I think we can move it forward.

Before I move fully into point 2, one more sub-point: the new interfaces completely break backwards compatibility, and not even by just a little.
2.1 has already changed things enough that I can't plug many datastores from 2.1 into my 2.0-based GeoServer (which I'm not incredibly psyched about). This catalog change would require me to rewrite large chunks of code.

Ok, onto point 2. The use of the interfaces is not in line with the intent of the specifications. I will concede that the specs can be argued in support of their use, as the specs make all sorts of broad claims about data access. But what the Catalog spec is actually _useful_ for is implementing search and discovery of geographic resources by their metadata. And by metadata I do not mean the FeatureType. A FeatureType _is_ a form of metadata - it is data about data - but its corollary in the web world is a GML application schema, which is _not_ what catalog services search on. Catalog services search on information about the data: metadata of the type represented by FGDC or ISO 19115, metadata as detailed in Martin's metadata object, or, from the catalog spec:

'The catalog object includes metadata (information like who, what, why, when, where and how) and search engines that let users identify holdings of interest. Catalogs describe and reference content found in storage collections and in other catalogs.'

Metadata is additional data about the actual FeatureType, or rather the full FeatureCollection, made up of the FeatureType and the features. A Catalog is made to query those records. In practice it may return a WFS or WMS link, or many times it will be just a website or someone to contact to get the data. In theory I agree it should be more closely linked, but it still should just be a reference, and not the data itself (despite the spec(s)' confusing bits about how metadata can be a feature, which I think is not worth following at all). I actually am not completely against a Catalog construct in GeoTools, but I am against it being tied up in the datastores, in the actual source of the data.
What I think we are looking for right now is a common way to access grid coverages and feature sources - in OGC terms, a common way to access WFS and WCS. In the OGC world the catalog spec is _not_ the answer to this problem. It is just a way to input a bunch of search terms and get a reference to WFSes and WCSes (or other formats of data repositories), based on data about the holdings. It receives search queries and returns the records. In the OGC world WFS and WCS do not implement Catalog, and I don't think our interfaces should either. A WFS will provide a small bit of meta information - what is held in the FeatureTypeInfo construct of GeoServer - but that's just for catalog servers to crawl and refer to; it's not to be queried directly.

Using the metadata and catalog interfaces can be justified, as the spec is written so vaguely, but it really just confuses the API a lot more. DataStore no longer looks remotely like a WFS, which was its inspiration - getFeatures, transaction, getFeatureType, etc. It just refers to weird CatalogEntries that you have to dig into to start to get your features. Using a query operation and getting a QueryResult? And then I'm supposed to iterate through that to get my CatalogEntries? Cast those to FeatureTypeEntries? Then call getFeatureSource on that? Then all I get back is a FeatureIterator; if I want the bounds later I'm going to have to pass both the FeatureIterator and the FeatureSource around, or else iterate through completely whenever I want the bounds. I know I'm probably misrepresenting things, but the point is I've read the specs and worked with this stuff extensively and it still doesn't make sense to me. And maybe I'm just holding onto the old, but I don't think so - I've been all for changes in the past. And this is ignoring that this stuff is not going to be backwards compatible.

This all said, I do think there may be room somewhere in GeoTools for a Catalog construct.
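To make the complaint above concrete, here is a hypothetical sketch contrasting the two access styles. None of these types or signatures are the real proposed GeoTools interfaces - DataStore, Catalog, CatalogEntry, FeatureTypeEntry and friends are stand-ins invented purely to count the hops - and features are faked as strings.

```java
import java.util.Iterator;
import java.util.List;

/**
 * Hypothetical sketch, not real GeoTools code: it only contrasts the
 * number of hops in the two access styles discussed above. Every type
 * and method name here is a stand-in invented for illustration.
 */
public class AccessStyles {

    // Old, WFS-inspired DataStore style: one call gets you to features.
    interface DataStore {
        FeatureSource getFeatureSource(String typeName);
    }

    interface FeatureSource {
        List<String> getFeatures(); // feature contents stand in as strings
    }

    // Proposed catalog style: query, iterate, cast, then fetch.
    interface Catalog {
        Iterator<CatalogEntry> query(String terms);
    }

    interface CatalogEntry { }

    interface FeatureTypeEntry extends CatalogEntry {
        FeatureSource getFeatureSource();
    }

    static final FeatureSource ROADS = () -> List.of("road1", "road2");

    /** Old style: ask the DataStore directly. */
    static List<String> direct() {
        DataStore store = typeName -> ROADS;
        return store.getFeatureSource("roads").getFeatures();
    }

    /** Catalog style: same data, but via query-result hops and a cast. */
    static List<String> viaCatalog() {
        Catalog catalog = terms ->
            List.<CatalogEntry>of((FeatureTypeEntry) () -> ROADS).iterator();
        CatalogEntry entry = catalog.query("roads").next();
        FeatureTypeEntry typed = (FeatureTypeEntry) entry; // cast required
        return typed.getFeatureSource().getFeatures();
    }

    public static void main(String[] args) {
        System.out.println("direct:      " + direct());
        System.out.println("via catalog: " + viaCatalog());
    }
}
```

Both paths end at the same features; the catalog path just adds a query, an iteration, and an unchecked cast before you get there.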
But it should be focused on metadata, like that defined in Martin's metadata package. It should be able to archive lots of FGDC/ISO/Dublin Core/etc. metadata, and provide search functionality for it. I actually implemented a proto version of this for z39.50, using the great Lucene toolkit. It would be useful if we could have that construct in GeoTools, and have it refer to FeatureSources, to provide search and discovery functionality. But it should not be all up in our DataStores and GridCoverages. It should be the type of thing where I could easily plug servlet hooks into it and fairly seamlessly implement the Catalog 2.0 HTTP part of the spec, or z39.50 and implement that part of the spec. With DataStores and catalogs as they are proposed, that would not be the case.

So can we get rid of all the Catalog references? And allow David to rewrite his stuff without having to refer to them at all? It actually scared me when, just a couple of days ago, I found out that AbstractDataStore implements Catalog. If we do want a construct that can register and look up DataStores, I think we should use the DataRepository interface and make it suit our needs. To some extent I feel that DataRepository should maybe extend DataStore, or vice versa - that it's just a source of data, one that is more decoupled from the actual data format. I think that is the direction we want to head, and I don't think fitting into Catalog is the way to do this.

This is what Rob A is also interested in: defining FeatureTypes that are no longer coupled with the back-end format. You could map columns into sub-fields, or define new names for the columns. Basically, you tell the DataStore how you'd like to view the data - give it the instructions to make the Schema (FeatureType) to your specifications, based on a number of mappings/joins/etc. from the back-end format.
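The simplest case of that mapping idea - renaming back-end columns into the names you want in your view schema - can be sketched in a few lines. This is not Rob A's design or any existing GeoTools code; the class and method names are invented, and real sub-field mappings and joins would need much more machinery.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Hypothetical sketch of the column-mapping idea above: tell a data
 * source how you'd like to view its columns. Only renames are shown;
 * this is not an existing GeoTools API.
 */
public class SchemaMapping {

    /**
     * Project a back-end row into a view, keeping only the mapped
     * columns and exposing them under their view names.
     */
    static Map<String, Object> rename(Map<String, Object> row,
                                      Map<String, String> mapping) {
        Map<String, Object> view = new LinkedHashMap<>();
        // mapping: back-end column name -> desired view name
        mapping.forEach((backendCol, viewName) ->
            view.put(viewName, row.get(backendCol)));
        return view;
    }
}
```

So a row with a `rd_name` column could be presented as a FeatureType with a `name` attribute, without the back-end format ever changing.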
You have your DataRepository, which is your range of possibilities, and then you can define the application schemas you want to derive from it. Ok, I'm getting ahead of myself; I really should save those thoughts for future emails, as I think we've got a number of requirements, and this is going to end up more work than the effort that brought us to DataStore from DataSource. In this email I'd just like to convince everyone once and for all that a Catalog in GeoTools should be about metadata, as defined in Martin's MetaData interface, not some derived definition squeezed into it.

So the problem we do need to tackle is the 'metadata' that defines a source of features or coverages. This is not covered in any OGC spec, and that's because the OGC works in a web-based world, while what we're dealing with here is files and databases. We need a way to register and find the source of data based on parameters specified by users. We need to simplify this so it works across raster and vector representations. I personally don't have super strong feelings on this; I have yet to be fully convinced that a map is not fine (though I will listen to others for sure). Perhaps a way out is to look to JDBC for inspiration, where the URL really is just a map of key/value pairs: the URL prefix would specify the GeoTools data type, and the kvps the values needed. I don't really know; all I'm saying is I think a better solution can be reached if we don't constrain ourselves to follow some fairly random interfaces.

We have some very major problems to solve - random access, joins, complex mappings, uniting raster and vector access - and some minor ones that are worth cleaning up - FeatureCollection/reader/results/iterator confusion, the high and low api, AbstractDataStore, etc. I don't feel a Catalog interface helps us with really any of those, and its presence in the proposed interfaces only obscures the good work that might actually be done in them (which I will have more to say about in future emails).
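The JDBC-inspired idea above could look something like the following sketch: a connection string whose prefix names the data source type and whose query part carries the key/value pairs. The `geotools:` URL syntax is entirely invented for illustration; nothing in GeoTools parses strings like this.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch of a JDBC-style connection string for data
 * source lookup. The "geotools:type?key=value&..." syntax is invented
 * here for illustration and is not an existing GeoTools convention.
 */
public class ConnectionUrl {

    /** Parse e.g. "geotools:shapefile?url=file:/data/roads.shp". */
    static Map<String, String> parse(String url) {
        Map<String, String> params = new HashMap<>();
        String afterScheme = url.split(":", 2)[1];       // drop "geotools"
        String[] typeAndQuery = afterScheme.split("\\?", 2);
        params.put("dstype", typeAndQuery[0]);            // data source type
        if (typeAndQuery.length > 1) {
            for (String kvp : typeAndQuery[1].split("&")) {
                String[] kv = kvp.split("=", 2);          // key=value pair
                params.put(kv[0], kv.length > 1 ? kv[1] : "");
            }
        }
        return params;
    }
}
```

A registry could then hand `dstype` to the matching factory and pass the remaining map along as its creation parameters, much as JDBC's DriverManager dispatches on the URL prefix.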
I do admit that some sort of greater structure is needed in GeoTools, to register a number of DataSources, but I'd feel better about doing it through DataRepository than bending the definition of Catalog and Metadata.

Thoughts? What do others think about getting rid of this catalog stuff?

Best regards,

Chris

----------------------------------------------------------
This mail sent through IMP: https://webmail.limegroup.com/