Re: [Treebase-devel] ABI proposal for phyloinformatics

Hi Karen,

In a separate discussion on this topic with Bill he had the following
comments, reproduced below. This is to emphasize (as I also did in the
google doc) that my ideas for TreeBASE redevelopment are only my own
blue sky thinking. Bill favours a more gradual approach, and if that
is something that could go into an ABI proposal it is probably the
wiser option. Anyway, here are the remarks:

===============
I've been struggling a bit -- vacillating between Hilmar's phases
(phases 1 - 3 in the doc).  Some thoughts:

(1) Yes, this is mainly a document storage/retrieval system, but
nonetheless there are still some very sexy queries that can be more
easily implemented if at least some portion of the data are relational
(such as the trees) -- such as functionalities that blend TreeBASE and
ToLWeb (as desired by Karen).  And no matter what ABI says re. just
nuts/bolts/hammer stuff, sexy functionalities are still important
because at the end of the day the grant reviewers will be biologists
from outside of NSF -- so Vladimir's comments are still relevant.

(2) Phylogenetic data objects have complex components that must
interdigitate to work properly. For example, TreeBASE's ability to
verify that the sets of taxon labels in matrices and their daughter
trees match up perfectly catches errors in the great majority of all
submissions. Which is just to say that the sad truth is that people
try to deposit broken crap whenever they can get away with it --
that's just a fact of life, and it highlights the fact that Dryad is a
very poor solution for data sharing. That TreeBASE guarantees that all
of our analysis downloads can be opened in Mesquite without error is
fabulously important. (In theory, of course, well-written document
preparation software and validators can do the same thing as what
TreeBASE currently does -- but that essentially shifts the problem to
an earlier stage, such as writing a Mesquite plugin for data
submission preparation, which itself contains all validation/error
checking features, and then dumps rich NeXML for NoSQL-style storage.
But if we do invest in developing a Mesquite plug-in, we won't be
addressing how to ingest matrices and trees that exceed Mesquite's
capabilities -- e.g. genomic-scale data -- so in a way we are just
back to square one.)
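
As a rough illustration of the label check described in (2), here is a minimal Python sketch. It is not TreeBASE's actual code: the function names are hypothetical, and it assumes a simple Newick string without quoted labels or bracketed comments.

```python
import re


def newick_labels(newick):
    """Collect leaf labels from a simple Newick string.

    Assumes no quoted labels and no [comments]. A leaf label is whatever
    follows '(' or ',' up to the next ':', ',', ')' or ';'.
    """
    return {m.strip() for m in re.findall(r"[(,]\s*([^():,;]+)", newick)}


def label_mismatches(matrix_labels, newick):
    """Return (labels missing from the tree, labels missing from the matrix)."""
    tree = newick_labels(newick)
    matrix = set(matrix_labels)
    return sorted(matrix - tree), sorted(tree - matrix)
```

A submission would pass this check only if both returned lists are empty; a single misspelled taxon label shows up once in each list.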

(3) We must always be well-grounded in the ways in which biologists
actually work, not just how we would like them to work -- the software
they use, the work flows that they use, etc. We know that in their
analysis phase, they use codes and abbreviations for their taxon
labels. When it comes time to submit to TreeBASE, suddenly they
have to upgrade their data (e.g. writing taxon names in full),
and that's when NEXUS files start to break. They often use software
that produces poorly formed NEXUS. They often produce Newick trees
that are incorrectly rooted/oriented (rooting in their figures being
produced by special PAUP commands rather than the implicit order of
parentheses).  The idea that biologists will use a work-flow such that
all metadata are nicely captured from the get-go, and therefore
submission of metadata is trivially easy, is our fantasy of how we
would like them to work, not how they actually work.

(4) The MIIDI minimum metadata editor
(http://www.miidi.org:8080/orbeon/miidi-review/report?id=14) is
totally cool in that it provides the ability to mark up almost any
data package for submission/storage using tons of metadata with
controlled vocabularies, and where the extent of metadata provided can
be verified as to whether it meets minimum standards. The problem is
there is no way in hell that biologists will invest the time in this:
can you imagine taking a 1,000-taxon tree, and for each of the 1,000 OTUs
you have to click a set of nested boxes to enter the GenBank taxID number,
the museum collection code, the lat-long, etc. etc.? Ha! No f*cking
way (pardon my language).  Realistically, we have to think in terms of
both our fantasy system (like this MIIDI editor) and in terms of what
is likely to be the case for most biologists -- i.e. spreadsheets --
things where people can copy/paste from Excel, etc.
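
To show what the spreadsheet route in (4) could look like, here is a minimal Python sketch that turns a block of cells pasted from Excel (tab-separated text) into per-OTU metadata. The function name and the column headers in the docstring are hypothetical, not an existing TreeBASE or MIIDI format.

```python
import csv
import io


def parse_pasted_metadata(text, key="label"):
    """Parse tab-separated text with a header row into {label: {column: value}}.

    Assumes a header such as: label, ncbi_taxid, collection_code, lat, lon
    (hypothetical column names). Empty cells are dropped, so submitters can
    fill in only what they have.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    table = {}
    for row in reader:
        label = row.pop(key).strip()
        table[label] = {
            column: value.strip()
            for column, value in row.items()
            # Skip overflow cells (column is None) and empty or missing cells.
            if column is not None and value and value.strip()
        }
    return table
```

The result is keyed by taxon label, so it can be matched against the labels in a matrix or tree before anything is stored.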

So... for a beefed up Hilmar phase 1 approach:  (a) continue solving
bugs, but going deeper -- i.e. solve the deeper problems like the
hanging queries, excess memory use, etc., that require frequent
reboots, with the goal that the application will be stable for much
longer stretches of time, (b) fix some of our really dumb data-model
problems -- e.g. fuse the submission table with the study table,
(c) soft-type all of our metadata for all objects: matrices, trees,
nodes, etc., (d) provide alternative parsers for larger data imports,
(e) provide automated taxon intel tools for data sources other than
just uBio (e.g. GNI), (f) pre-cache serializations for all major data
objects so that mass downloads don't tax our memory and CPU, (g) bring
in the NCBI classification and/or connections with ToLWeb and provide
sexy queries for questions of generic topology, (h) integrate sequence
data with a BLAST engine for yet another sexy query option,
(i) integrate the lat/long metadata with Google Earth or Maps for yet
another sexy query option, (j) totally redo the search interface to
make it sexy and fun to use, (k) expand out the API, (l) modify the
submission system for MIAPA compliance, (m) provide a way to ingest
MIAPA-compliant NeXML or submissions, (n) export all TreeBASE data
into CouchDB as an alternative way to access/distribute the data.
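
Item (f) could be sketched roughly as follows in Python. This is only an illustration under stated assumptions: the class name, the on-disk layout and the `serialize` callback are hypothetical, and object ids are assumed to be safe to use in file names.

```python
import os
import tempfile


class SerializationCache:
    """Keep one pre-built file per (object id, format) pair on disk."""

    def __init__(self, directory):
        self.directory = directory
        os.makedirs(directory, exist_ok=True)

    def _path(self, object_id, fmt):
        return os.path.join(self.directory, f"{object_id}.{fmt}")

    def get(self, object_id, fmt, serialize):
        """Return the cached text, calling serialize(object_id) only on a miss."""
        path = self._path(object_id, fmt)
        if not os.path.exists(path):
            text = serialize(object_id)
            # Write to a temporary file first, then rename, so a reader
            # never sees a half-written serialization.
            fd, tmp = tempfile.mkstemp(dir=self.directory)
            with os.fdopen(fd, "w", encoding="utf-8") as handle:
                handle.write(text)
            os.replace(tmp, path)
        with open(path, encoding="utf-8") as handle:
            return handle.read()

    def invalidate(self, object_id, fmt):
        """Drop the cached file, e.g. after the underlying object is edited."""
        try:
            os.remove(self._path(object_id, fmt))
        except FileNotFoundError:
            pass
```

With something like this, the expensive database-to-NEXUS (or NeXML) conversion runs once per object, and mass downloads only read files.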

Now, granted, a huge problem is the service-layer bloat and the
general headache of a fat and complex codebase. Can we solve this by
putting programmers hard at work making major changes to the existing
code, or must we start from scratch? And if we start from scratch, how
do we know that we won't find ourselves back in the same situation
five years hence?  It is easier to justify starting from scratch if
we are saying that we need a whole new platform/architecture (e.g.
NoSQL) -- otherwise we don't sound so good if we have to admit that
the code that we wrote is dying under its own weight. On the other
hand, as long as we budget enough FTE programmer time into redoing it
all from scratch, we might be able to avoid admitting that we are
forced to redo from scratch (or blame all our problems on Hibernate
and argue for some other MVC framework).

So one thing I'm saying is that sticking with SQL (but caching all
data objects, and/or dumping to a JSON NoSQL server) would, I think,
solve all the major performance/functionality issues while retaining
the data integrity advantages and ability to do certain fancy queries
which are more easily done by an RDBMS.  I don't think that an RDBMS is
necessarily

An alternative is to build a Mesquite plugin that has a very rich
interface, with all the data integrity checks, and with easy
copy/paste spreadsheets for metadata, or metadata marked up directly
on tree nodes and edges, etc, etc, and then have this push rich NeXML
on to a NoSQL document storage system. Certain sexy queries
(phylogeographic queries, BLAST searching, topology searching) might
be sacrificed. And we'd be dealing with Mesquite -- which has its own
limitations, idiosyncrasies, and code-bloat, etc.

On Mon, Jun 6, 2011 at 3:04 PM, Karen Cranston
<kar...@ne...> wrote:
> There are several pitches now in the Google doc, with a fair bit of
> overlap between them. I am willing to consolidate into a single page
> and send to NSF (Reed?) and see what he has to say about the various
> components. It seems like these components are:
> 1. some level of re-engineering of TreeBASE
> 2. further development of MIAPA, with annotation tools and TreeBASE integration
> 3. use of ToLWeb as a crowd sourcing and data synthesis platform
> 4. NeXML refinement and development
>
> I don't think this one-pager needs to capture all of the ideas and
> details we currently have, but instead give a general sense of what we
> are proposing and whether all or some of these ideas are potentially
> fundable.
>
> Everyone in agreement? I will post the single page in the doc later today.
>
> Karen
>
> On Fri, Jun 3, 2011 at 3:38 PM, Arlin Stoltzfus <ar...@um...> wrote:
>> Today is the deadline for our 1-page synopsis to pitch to an NSF program
>> officer (before going further).   Currently we seem to have 3 pitches.  It
>> is time now for some energetic person to consolidate this, so that we can
>> move ahead.
>>
>> Arlin
>>
>> On May 31, 2011, at 12:19 PM, Karen Cranston wrote:
>>
>>> Tomorrow morning (Wed, June 1) looks to be good for everyone, and
>>> sooner seems better than later. I propose we talk at 9:00 am EST. I
>>> will send connection information later today.
>>>
>>> Cheers,
>>> Karen
>>>
>>> On Thu, May 26, 2011 at 3:00 PM, Karen Cranston
>>> <kar...@ne...> wrote:
>>>>
>>>> There has been some interest among various groups in an ABI proposal
>>>> for development of phyloinformatics resources. This email is an
>>>> attempt to connect those threads and move the process forward. The
>>>> conversations that have been happening up to this point are:
>>>>
>>>> 1. The Phyloinformatics Research Foundation (phylofoundation.org,
>>>> stewards of TreeBASE and ToLWeb) started a Google doc aimed at
>>>> TreeBASE
>>>> 2. MIAPA developers started a wiki page
>>>> (https://www.nescent.org/sites/evoio/NSF_ABI_2011), recognizing the
>>>> need for coordination with TreeBASE and other resources
>>>> 3. NESCent (Todd, Hilmar and myself), as the current TreeBASE host and
>>>> as a third party interested in coordinated development across
>>>> resources started a third document (now added to the already mentioned
>>>> Google doc)
>>>>
>>>> If you are interested in this discussion and do not already have
>>>> access to the Google doc entitled TreeBASE_ABI.doc, let me know and I
>>>> can grant you access. Hilmar and I made some substantial edits earlier
>>>> this morning. I point you specifically to the section at the end
>>>> entitled "An attempt to re-think all of this". Briefly, we wanted to
>>>> encourage some radical thinking and explore the idea of developing a
>>>> PhyloCommons that incorporates both TreeBASE and ToLWeb into the
>>>> proposal (as the data repository and the data sharing / dissemination
>>>> / synthesis platform, respectively).
>>>>
>>>> The ABI deadline is July 7, so we have a short period of time to pull
>>>> this together. Here is a link to a Doodle poll for an initial
>>>> teleconference.
>>>>
>>>> http://doodle.com/zf2tz7sftyk3naxy
>>>>
>>>> During this meeting, we hope to come to agreement on the broad
>>>> direction of the grant, identify possible leaders of the various
>>>> components and create a plan for getting this pulled together in time
>>>> for the deadline. Please feel free to continue the conversation on the
>>>> Google doc between now and the teleconference. If there are others who
>>>> you think should be invited, feel free to do so. Not everyone who
>>>> participates in this first phase will end up being named on the grant,
>>>> but these resources require input from a much larger group.
>>>>
>>>> Cheers,
>>>> Karen
>>>>
>>>>
>>>> --
>>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>>> Karen Cranston
>>>> Training Coordinator and Informatics Project Manager
>>>> nescent.org
>>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>>>
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "MIAPA" group.
>>> For more options, visit this group at
>>> http://groups.google.com/group/miapa-discuss?hl=en
>>
>> -------
>> Arlin Stoltzfus (ar...@um...)
>> Fellow, IBBR; Adj. Assoc. Prof., UMCP; Research Biologist, NIST
>> IBBR, 9600 Gudelsky Drive, Rockville, MD
>> tel: 240 314 6208; web: www.molevol.org
>>
>

-- 
Dr. Rutger A. Vos
School of Biological Sciences
Philip Lyle Building, Level 4
University of Reading
Reading, RG6 6BX, United Kingdom
Tel: +44 (0) 118 378 7535
http://rutgervos.blogspot.com