From: Rutger V. <R....@re...> - 2011-06-06 14:24:46
Hi Karen,

In a separate discussion on this topic with Bill he had the following comments, reproduced below. This is to emphasize (as I also did in the Google doc) that my ideas for TreeBASE redevelopment are only my own blue-sky thinking. Bill favours a more gradual approach, and if that is something that could go into an ABI proposal it is probably the wiser option. Anyway, here are the remarks:

===============

I've been struggling a bit -- vacillating between Hilmar's phases (phases 1 - 3 in the doc). Some thoughts:

(1) Yes, this is mainly a document storage/retrieval system, but nonetheless there are still some very sexy queries that can be more easily implemented if at least some portion of the data are relational (such as the trees) -- such as functionalities that blend TreeBASE and ToLWeb (as desired by Karen). And no matter what ABI says re. just nuts/bolts/hammer stuff, sexy functionalities are still important because at the end of the day the grant reviewers will be biologists from outside of NSF -- so Vladimir's comments are still relevant.

(2) Phylogenetic data objects have complex components that must interdigitate to work properly. For example, TreeBASE's ability to verify that the sets of taxon labels in matrices and their daughter trees match up perfectly catches errors in the great majority of all submissions. Which is just to say that the sad truth is that people try to deposit broken crap whenever they can get away with it -- that's just a fact of life, and it highlights the fact that Dryad is a very poor solution for data sharing. That TreeBASE guarantees that all of our analysis downloads can be opened in Mesquite without error is fabulously important.
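To illustrate the kind of label check described in point (2), here is a minimal Python sketch. It is not TreeBASE's actual implementation: the function names are hypothetical, and it assumes simple unquoted Newick labels without embedded spaces or comments.

```python
import re


def newick_leaf_labels(newick):
    """Return the set of leaf labels in a Newick string.

    Assumption: labels are unquoted. A leaf label directly follows "(" or ",";
    internal node labels follow ")" and are therefore not matched.
    """
    return set(re.findall(r"(?<=[(,])\s*([^():,;\s]+)", newick))


def label_mismatches(matrix_labels, newick):
    """Compare matrix row labels against the leaf labels of a tree."""
    matrix = set(matrix_labels)
    tree = newick_leaf_labels(newick)
    return {
        "only_in_matrix": sorted(matrix - tree),
        "only_in_tree": sorted(tree - matrix),
    }


if __name__ == "__main__":
    # An abbreviated label in the matrix no longer matches the tree.
    rows = ["Homo_sapiens", "Pan_troglodytes", "Gorilla"]
    tree = "((Homo_sapiens:0.1,Pan_troglodytes:0.2):0.3,Gorilla_gorilla);"
    print(label_mismatches(rows, tree))
```

A submission would pass this check only when both lists come back empty; a real validator would also need to handle quoted labels and NEXUS TRANSLATE tables.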
(In theory, of course, well-written document preparation software and validators can do the same thing as what TreeBASE currently does -- but that essentially shifts the problem to an earlier stage, such as writing a Mesquite plugin for data submission preparation, which itself contains all validation/error checking features, and then dumps rich NeXML for NoSQL-style storage. But if we do invest in developing a Mesquite plug-in, we won't be addressing how to ingest matrices and trees that exceed Mesquite's capabilities -- e.g. genomic-scale data -- so in a way we are just back to square one.)

(3) We must always be well-grounded in the ways in which biologists actually work, not just how we would like them to work -- the software they use, the work flows that they use, etc. We know that in their analysis phase, they use codes and abbreviations for their taxon labels. When it comes time to submit to TreeBASE, suddenly they have to upgrade their data (e.g. writing taxa in full), and that's when NEXUS files start to break. They often use software that produces poorly formed NEXUS. They often produce Newick trees that are incorrectly rooted/oriented (rooting in their figures being produced by special PAUP commands rather than the implicit order of parentheses). The idea that biologists will use a work-flow such that all metadata are nicely captured from the get-go, and therefore submission of metadata is trivially easy, is our fantasy of how we would like them to work, not how they actually work.

(4) The MIIDI minimum metadata editor (http://www.miidi.org:8080/orbeon/miidi-review/report?id=14) is totally cool in that it provides the ability to mark up almost any data package for submission/storage using tons of metadata with controlled vocabularies, and where the extent of metadata provided can be verified as to whether it meets minimum standards.
The problem is there is no way in hell that biologists will invest the time in this: can you imagine taking a 1,000-taxon tree, and for each of the 1,000 OTUs you have to click a set of nested boxes to enter the GenBank taxID number, the museum collection code, the lat-long, etc.? Ha! No f*cking way (pardon my language). Realistically, we have to think in terms of both our fantasy system (like this MIIDI editor) and in terms of what is likely to be the case for most biologists -- i.e. spreadsheets -- things where people can copy/paste from Excel, etc.

So... for a beefed-up Hilmar phase 1 approach:

(a) continue solving bugs, but going deeper -- i.e. solve the deeper problems like the hanging queries, excess memory use, etc., that require frequent reboots, with the goal that the application will be stable for much longer stretches of time,
(b) fix some of our really dumb data-model problems -- e.g. fuse the submission table with the study table,
(c) soft-type all of our metadata for all objects: matrices, trees, nodes, etc.,
(d) provide alternative parsers for larger data imports,
(e) provide automated taxon intel tools that use alternative data sources (e.g. GNI) rather than just uBio,
(f) pre-cache serializations for all major data objects so that mass downloads don't tax our memory and CPU,
(g) bring in the NCBI classification and/or connections with ToLWeb and provide sexy queries for questions of generic topology,
(h) integrate sequence data with a BLAST engine for yet another sexy query option,
(i) integrate the lat/long metadata with Google Earth or Maps for yet another sexy query option,
(j) totally redo the search interface to make it sexy and fun to use,
(k) expand out the API,
(l) modify the submission system for MIAPA compliance,
(m) provide a way to ingest MIAPA-compliant NeXML or submissions,
(n) export all TreeBASE data into CouchDB as an alternative way to access/distribute the data.
Now, granted, a huge problem is the service-layer bloat and the general headache of a fat and complex codebase. Can we solve this by putting programmers hard at work making major changes to the existing code, or must we start from scratch? And if we start from scratch, how do we know that we won't find ourselves back in the same situation five years hence? It is easier to justify starting from scratch if we are saying that we need a whole new platform/architecture (e.g. NoSQL) -- otherwise we don't sound so good if we have to admit that the code that we wrote is dying under its own weight. On the other hand, as long as we budget enough FTE programmer time into redoing it all from scratch, we might be able to avoid admitting that we are forced to redo from scratch (or blame all our problems on Hibernate and argue for some other MVC framework).

So one thing I'm saying is that sticking with SQL (but caching all data objects, and/or dumping to a JSON NoSQL server) would, I think, solve all the major performance/functionality issues while retaining the data integrity advantages and the ability to do certain fancy queries which are more easily done by an RDBMS. I don't think that an RDBMS is necessarily

An alternative is to build a Mesquite plugin that has a very rich interface, with all the data integrity checks, and with easy copy/paste spreadsheets for metadata, or metadata marked up directly on tree nodes and edges, etc., and then have this push rich NeXML on to a NoSQL document storage system. Certain sexy queries (phylogeographic queries, BLAST searching, topology searching) might be sacrificed. And we'd be dealing with Mesquite -- which has its own limitations, idiosyncrasies, code-bloat, etc.

On Mon, Jun 6, 2011 at 3:04 PM, Karen Cranston <kar...@ne...> wrote:
> There are several pitches now in the Google doc, with a fair bit of
> overlap between them. I am willing to consolidate into a single page
> and send to NSF (Reed?)
> and see what he has to say about the various
> components. It seems like these components are:
> 1. some level of re-engineering of TreeBASE
> 2. further development of MIAPA, with annotation tools and TreeBASE integration
> 3. use of ToLWeb as a crowd sourcing and data synthesis platform
> 4. NeXML refinement and development
>
> I don't think this one-pager needs to capture all of the ideas and
> details we currently have, but instead give a general sense of what we
> are proposing and if all / some of these ideas is potentially
> fundable.
>
> Everyone in agreement? I will post the single page in the doc later today.
>
> Karen
>
> On Fri, Jun 3, 2011 at 3:38 PM, Arlin Stoltzfus <ar...@um...> wrote:
>> Today is the deadline for our 1-page synopsis to pitch to an NSF program
>> officer (before going further). Currently we seem to have 3 pitches. It
>> is time now for some energetic person to consolidate this, so that we can
>> move ahead.
>>
>> Arlin
>>
>> On May 31, 2011, at 12:19 PM, Karen Cranston wrote:
>>
>>> Tomorrow morning (Wed, June 1) looks to be good for everyone, and
>>> sooner seems better than later. I propose we talk at 9:00 am EST. I
>>> will send connection information later today.
>>>
>>> Cheers,
>>> Karen
>>>
>>> On Thu, May 26, 2011 at 3:00 PM, Karen Cranston
>>> <kar...@ne...> wrote:
>>>>
>>>> There has been some interest among various groups in an ABI proposal
>>>> for development of phyloinformatics resources. This email is an
>>>> attempt to connect those threads and move the process forward. The
>>>> conversations that have been happening up to this point are:
>>>>
>>>> 1. The Phyloinformatics Research Foundation (phylofoundation.org,
>>>> stewards of TreeBASE and ToLWeb) started a Google doc aimed at
>>>> TreeBASE
>>>> 2. MIAPA developers started a wiki page
>>>> (https://www.nescent.org/sites/evoio/NSF_ABI_2011), recognizing the
>>>> need for coordination with TreeBASE and other resources
>>>> 3.
>>>> NESCent (Todd, Hilmar and myself), as the current TreeBASE host and
>>>> as a third party interested in coordinated development across
>>>> resources started a third document (now added to the already mentioned
>>>> Google doc)
>>>>
>>>> If you are interested in this discussion and do not already have
>>>> access to the Google doc entitled TreeBASE_ABI.doc, let me know and I
>>>> can grant you access. Hilmar and I made some substantial edits earlier
>>>> this morning. I point you specifically to the section at the end
>>>> entitled "An attempt to re-think all of this". Briefly, we wanted to
>>>> encourage some radical thinking and explore the idea of developing a
>>>> PhyloCommons that incorporates both TreeBASE and ToLWeb into the
>>>> proposal (as the data repository and the data sharing / dissemination
>>>> / synthesis platform, respectively).
>>>>
>>>> The ABI deadline is July 7, so we have a short period of time to pull
>>>> this together. Here is a link to a Doodle poll for an initial
>>>> teleconference.
>>>>
>>>> http://doodle.com/zf2tz7sftyk3naxy
>>>>
>>>> During this meeting, we hope to come to agreement on the broad
>>>> direction of the grant, identify possible leaders of the various
>>>> components and create a plan for getting this pulled together in time
>>>> for the deadline. Please feel free to continue the conversation on the
>>>> Google doc between now and the teleconference. If there are others who
>>>> you think should be invited, feel free to do so. Not everyone who
>>>> participates in this first phase will end up being named on the grant,
>>>> but these resources require input from a much larger group.
>>>>
>>>> Cheers,
>>>> Karen
>>>>
>>>> --
>>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>>> Karen Cranston
>>>> Training Coordinator and Informatics Project Manager
>>>> nescent.org
>>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "MIAPA" group.
>>> For more options, visit this group at
>>> http://groups.google.com/group/miapa-discuss?hl=en
>>
>> -------
>> Arlin Stoltzfus (ar...@um...)
>> Fellow, IBBR; Adj. Assoc. Prof., UMCP; Research Biologist, NIST
>> IBBR, 9600 Gudelsky Drive, Rockville, MD
>> tel: 240 314 6208; web: www.molevol.org

--
Dr. Rutger A. Vos
School of Biological Sciences
Philip Lyle Building, Level 4
University of Reading
Reading, RG6 6BX, United Kingdom
Tel: +44 (0) 118 378 7535
http://rutgervos.blogspot.com