Re: [Treebase-devel] [PhyloWS] PhyloWS, CQL, NeXML on TreeBASE2

Hi Hilmar, all,

thanks for your comments!

>> I notice that this departs a bit from the phylows that is proposed here.
>>  For example, the proposed phylows puts "/find/" before "/tree/", whereas
>> you have it the other way.
>
> Right, this is not in compliance with the spec. find/ comes first as it
> changes the resource from a record and its URI to a finder.

Right, switching that around is fairly trivial, so I'll do that.

> Also, find/taxon/ would imply that you are finding (and returning) taxa,
> which if I understand correctly is not the case - rather it seems you have
> one query parameter in the URI path (namely that you are searching by
> taxon?) and one in the query string. So if this is searching trees, it needs
> to be find/tree/, and if you are matching against taxon names, the query
> parameter needs to be tb.taxon.name or whatever the blessed metadata term
> for this purpose is.
>
> Third, recordSchema=tree means that you want records back in the tree
> schema. Unless you have invented that schema meanwhile, this is in all
> likelihood not what you want. Rather, the value should be nexml I suppose.
> find/tree already implies that you are finding (and returning) trees, so
> there is no point in expressing that redundantly in the query string. You
> might want to specify that you only want the tree and not also the matrix,
> but that would be a separate query parameter and should not be confounded
> with the return format.

Mmmmm... I think this warrants a little more discussion. It's probably
true that for most implementors their searches can be conveniently
decomposed into several domains (tree search/matrix search/taxon
search/etc.) and that for each domain the metaphor is that of
searching a single table where the CQL indices are that table's
columns.

Then, within each domain there is a limited number of concerns: how to
search on the provided indices and how to format the results. For
example, for a search like
http://8ball.sdsc.edu:6666/treebase-web/search/studySearch.html?query=dcterms.identifier=S2484&format=rss1&recordSchema=tree
the implementation is thus:

* there is a self-contained study searcher
* the searcher knows how predicates map onto columns in the study
table (e.g. dcterms.identifier is the same as study.id)
* the searcher knows how to unpack a study object and get the trees out
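
As a rough sketch of that arrangement (in Python rather than the actual
TreeBASE2 code; the Study class, the mapping table, the function names and
the sample data are all invented for illustration), the searcher owns one
index-to-column map and one projection step:

```python
from dataclasses import dataclass, field


@dataclass
class Study:
    """Hypothetical stand-in for a row in the study table."""
    id: str
    trees: list = field(default_factory=list)


# Assumed mapping from CQL index names to study columns;
# only dcterms.identifier -> study.id comes from the discussion above.
STUDY_INDEX_TO_COLUMN = {
    "dcterms.identifier": "id",
}


def search_studies(studies, index, value):
    """Match studies on a single index; knows only about the study table."""
    column = STUDY_INDEX_TO_COLUMN[index]
    return [s for s in studies if getattr(s, column) == value]


def project_to_trees(studies):
    """Unpack the matched studies and return the trees they contain."""
    return [tree for study in studies for tree in study.trees]


# Made-up sample data, just to exercise the two steps.
STUDIES = [
    Study("S2484", ["tree-a", "tree-b"]),
    Study("S0001", ["tree-c"]),
]

hits = search_studies(STUDIES, "dcterms.identifier", "S2484")
trees = project_to_trees(hits)
```

Neither function needs to know anything about other tables: the search
touches only study columns, and the projection only walks the study object.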

If instead we had phylows/tree/find?query=study.identifier=S2484,
the implementation would be something like:

* there is a tree searcher
* the tree searcher needs to know not just about the tree table but
also about how all other predicates map onto all other tables, and how
they join with the tree table
* the tree searcher needs to know how to traverse study objects and
where trees are inside the study object
* (and similar overlap of concerns becomes necessary if we want the
trees for a given matrix, or for a taxon, or what have you)

To me that seems like bad design. We'd lose any separation of concerns
and might end up with a lot of redundancy between searchers - and a
lot more code (and bugs) to write. I realize that I'm overloading the
"recordSchema" token (and should fix that) but some way of saying
"search THIS domain and project the results into THAT domain" seems
very, very handy - especially because CQL doesn't have a notion of
joins.
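
One way to stop overloading recordSchema would be a separate parameter for
the target domain. The sketch below is only an assumption about how that
could look: the "project" parameter name, the find/study path and the
example.org host are made up here and are not part of any PhyloWS spec.

```python
from urllib.parse import urlsplit, parse_qs


def parse_request(url):
    """Split a finder URL into search domain, CQL query, format, projection."""
    parts = urlsplit(url)
    params = parse_qs(parts.query)
    # Last path segment names the domain being searched, e.g. find/study.
    domain = [seg for seg in parts.path.split("/") if seg][-1]
    return {
        "domain": domain,
        "query": params["query"][0],
        "format": params.get("format", [None])[0],
        # Hypothetical "project" parameter: which domain to return results in.
        # Defaults to the searched domain when absent.
        "project": params.get("project", [domain])[0],
    }


request = parse_request(
    "http://example.org/phylows/find/study"
    "?query=dcterms.identifier=S2484&format=rss1&project=tree"
)
```

Here the study searcher handles the query, "format" stays purely about
serialization, and "project" says which domain the results end up in.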

Rutger

-- 
Dr. Rutger A. Vos
Department of Zoology
University of British Columbia
http://www.nexml.org
http://rutgervos.blogspot.com