[Treebase-devel] cool GSOC projects with Mesquite?

Hi.  Not sure if this is the right Mesquite list, because I'm trying  
to reach developers, rather than users, with some ideas for summer  
projects.  Some interest has been expressed in two Mesquite-related
GSOC (Google Summer of Code) proposals, for inclusion in the
NESCent-organized "phylosoc" package of proposals.  Google provides
summer support for work on open-source projects.

What these potential projects would need from you folks is a) a
programmer committed to serving as a mentor, and b) a compelling
write-up.

If you might be interested, read on...

One proposal is for a graphical UI to design workflow descriptions:

  http://informatics.nescent.org/wiki/Phyloinformatics_Summer_of_Code_2011#Graphical_UI_for_designing_phylo_workflow_descriptions

Annotating workflows is a stumbling block in creating reusable
phylogenetic records.  The idea here is that users would use
drag-and-drop tools to compose a phylogenetics workflow (with pipes or
flow-chart icons or whatever), and this would be converted into a set
of annotations useful for an archival record.  An example of this sort
of thing, executed with Google Web Toolkit, would be
http://exon.niaid.nih.gov/mobyleWorkflow .  Ultimately (when hooked up
to the right back-end) this could be used to create executable
workflow descriptions.  Some folks have suggested that Mesquite would
provide an ideal framework for developing this.  Vivek Gopalan (who
developed the example above) is one mentor, but we would need an
experienced Mesquite programmer to join us.

The second idea, which is not on the proposals page yet, is an  
intelligent submission tool for TreeBASE.  Mesquite is ideal for this  
because TreeBASE already recommends that users format their NEXUS  
files using Mesquite, for compatibility.

Ideally this tool would solve 2 main problems, using some intelligence  
to aid the user.  Bill Piel reports that a major stumbling block in  
submission is that users start the process with separate 1) tree and  
2) alignment (or other char matrix) with non-matching OTU names.  This  
corresponds to my own experience trying to re-use other people's  
data.  Finding an optimal name-match (the submission tool could  
propose a match to the user for manual verification) turns out to be a  
simple and well-studied problem in CS called "the marriage problem".    
The second major stumbling block is that, in order to annotate  
provenance, users need to match up (tediously) GenBank accession  
numbers and species identifiers.   In the case of sequence alignments,  
an intelligent tool could leverage NCBI services to guess the  
accession and species (i.e., BLAST it).  Given accessions, an  
intelligent tool could supply NCBI species ids (an even easier  
problem).  Initially, this tool could create a NEXUS file with a  
TreeBASE block containing the annotations (in the future, presumably  
the preferred format will be NeXML).
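
To make the name-matching idea concrete, here is a minimal sketch in
Python (not Mesquite code, and not an existing TreeBASE tool).  It
scores every tree-name/matrix-name pair by string similarity and then
pairs them off greedily, best score first.  The function names, the
0.6 threshold, and the underscore-to-space normalization are all
assumptions made for illustration.  Note that greedy pairing is only an
approximation: a truly optimal one-to-one match is the weighted
bipartite matching ("assignment") problem and would need a proper
solver.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase and treat underscores as spaces (an assumed convention)."""
    return " ".join(name.replace("_", " ").lower().split())


def similarity(a, b):
    """String similarity in [0, 1] between two normalized OTU names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def propose_matches(tree_names, matrix_names, threshold=0.6):
    """Greedily pair tree OTU names with matrix OTU names.

    Returns (proposals, unmatched_tree, unmatched_matrix), where
    proposals is a list of (tree_name, matrix_name, score) tuples
    meant to be shown to the user for manual verification.
    """
    # Score every candidate pair, best first.
    scored = sorted(
        ((similarity(t, m), t, m) for t in tree_names for m in matrix_names),
        key=lambda item: item[0],
        reverse=True,
    )
    used_tree, used_matrix, proposals = set(), set(), []
    for score, t, m in scored:
        if score < threshold:
            break  # remaining pairs are too dissimilar to propose
        if t in used_tree or m in used_matrix:
            continue  # each name may be used in only one pair
        proposals.append((t, m, round(score, 2)))
        used_tree.add(t)
        used_matrix.add(m)
    unmatched_tree = [t for t in tree_names if t not in used_tree]
    unmatched_matrix = [m for m in matrix_names if m not in used_matrix]
    return proposals, unmatched_tree, unmatched_matrix


if __name__ == "__main__":
    tree = ["Homo_sapiens", "Pan_troglodytes", "Mus_musculus"]
    matrix = ["Mus musculus", "H. sapiens", "Pan troglodytes", "Danio rerio"]
    pairs, left_tree, left_matrix = propose_matches(tree, matrix)
    for t, m, score in pairs:
        print(f"{t} <-> {m} (similarity {score})")
    print("unmatched in tree:", left_tree)
    print("unmatched in matrix:", left_matrix)
```

Names left unmatched on either side are exactly the ones a submission
tool would need to ask the user about.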

What makes the second proposal sexy is the use of intelligence to aid  
the user.   Again, Mesquite might provide an ideal development platform.

For either proposal, let me know ASAP if you are interested.  The GSOC  
project proposals need to be finished up this week.  Thanks for your  
time,

Arlin
-------
Arlin Stoltzfus (ar...@um...)
Fellow, IBBR; Adj. Assoc. Prof., UMCP; Research Biologist, NIST
IBBR, 9600 Gudelsky Drive, Rockville, MD
tel: 240 314 6208; web: www.molevol.org