Re: [Treebase-devel] ABI proposal for phyloinformatics

On Jun 6, 2011, at 10:24 AM, Rutger Vos wrote:

> (3) We must always be well-grounded in the ways in which biologists
> actually work, not just how we would like them to work -- the software
> they use, the work flows that they use, etc. We know that in their
> analysis phase, they use codes and abbreviations for their taxon
> labels.
. . .
> (4) The MIIDI minimum metadata editor
> (http://www.miidi.org:8080/orbeon/miidi-review/report?id=14) is
> totally cool  . . . The problem is
> there is no way in hell that biologists will invest the time in this:
> can you imagine taking a 1,000-taxon tree, and for each 1,000 OTUs you
> have to click a set of nested boxes to enter the Genbank taxID number,

I agree with the thinking here. IMHO our proposal will fare better if
we focus on solving user problems (in sexy ways, of course). The main
problem is that users need to archive (to comply with policies), but
the crap that they are poised to archive is not re-usable.
(TreeGrabber exists because most authors publish and archive pictures
of trees rather than logically encoded trees.) Archiving is going to
happen, because it's being pushed by policies, but this won't have a
huge impact on re-use until we make it easy for users to submit
re-usable data.

To break this down into manageable chunks, the biggest problems that I  
see are 1) most users need to translate their data into formats better  
suited to archiving; 2) the OTU names don't match within the user's  
own files; 3) the data objects referenced in the files do not have  
GUIDs or accessions that can be machine-processed; 4) the record does  
not have sufficient metadata annotations for potential re-users to  
judge accurately the prospects for re-use.

The TreeBASE submission process doesn't help with #1, although
Mesquite can help users load their data from other formats
into NEXUS. The TreeBASE submission process exposes problem #2 but
doesn't help the user fix it. However, matching N things with N other
things is a classic problem in comp sci called "the marriage
problem" (more generally, bipartite matching or the assignment
problem). There are many solutions. We just need to implement one
and allow the user to accept or edit the suggested matching in a nice
graphical way. If users have sequences, we can BLAST them and get
both a suggested accession and a suggested species identifier. That
solves #3 for molecular users.
Support for #4 is already part of what the MIAPA people are proposing.
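
To make the idea for #2 concrete, here is a minimal sketch of suggesting a one-to-one pairing between the OTU labels in a user's tree file and those in their matrix file. It is only an illustration, not anything TreeBASE does today: the function names, the 0.5 threshold, and the underscore/space normalization are all assumptions, and it uses a simple greedy heuristic on string similarity rather than an optimal assignment algorithm. The output is meant to be a suggestion that the user accepts or edits.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity in [0, 1], ignoring case and underscore-vs-space differences."""
    def norm(s):
        return s.lower().replace("_", " ").strip()
    return SequenceMatcher(None, norm(a), norm(b)).ratio()


def suggest_matching(tree_labels, matrix_labels, threshold=0.5):
    """Greedy one-to-one pairing (hypothetical helper, not an optimal matching).

    Scores every (tree label, matrix label) pair, then accepts pairs from
    best to worst, skipping labels that are already paired. Pairs scoring
    below `threshold` (an arbitrary assumed cutoff) are never suggested.
    """
    scored = sorted(
        ((similarity(t, m), t, m) for t in tree_labels for m in matrix_labels),
        key=lambda item: -item[0],
    )
    matched, used_tree, used_matrix = {}, set(), set()
    for score, t, m in scored:
        if score < threshold:
            break  # everything after this point scores even lower
        if t in used_tree or m in used_matrix:
            continue
        matched[t] = m
        used_tree.add(t)
        used_matrix.add(m)
    # Labels left over need the user's attention.
    unmatched = [t for t in tree_labels if t not in used_tree]
    return matched, unmatched


if __name__ == "__main__":
    tree = ["Homo_sapiens", "Pan_troglodytes", "Mus_musculus"]
    matrix = ["mus musculus", "homo sapiens", "pan troglodytes"]
    pairs, leftovers = suggest_matching(tree, matrix)
    for t, m in pairs.items():
        print(t, "<->", m)
    print("needs review:", leftovers)
```

A greedy pass is easy to explain to users but can miss the best overall pairing when labels are heavily abbreviated; a proper assignment solver could be swapped in behind the same interface.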

Arlin
-------
Arlin Stoltzfus (ar...@um...)
Fellow, IBBR; Adj. Assoc. Prof., UMCP; Research Biologist, NIST
IBBR, 9600 Gudelsky Drive, Rockville, MD
tel: 240 314 6208; web: www.molevol.org