Re: [Treebase-devel] ABI proposal for phyloinformatics

On Jun 6, 2011, at 10:48 AM, Arlin Stoltzfus wrote:

> The TreeBASE submission process doesn't help with #1, although Mesquite actually can help users load up their data from other formats into NEXUS.  The TB submission process exposes problem #2 but doesn't help the user to fix it.  However, matching N things with N other things is a classic problem in comp sci called "the marriage problem".   There are many solutions.  We just need to implement one and allow the user to accept or edit the suggested matching in a nice graphical way.    If users have sequences, we can BLAST them and get both a suggested accession and a suggested species identifier.  That solves #3 for molecular users.
> Support for #4 is already part of what the MIAPA people are proposing.

Just some minor commentary:

- I've written scripts that take Genbank accession numbers, extract metadata from Genbank, and format it ready for ingest by TreeBASE -- but I'm surprised at how often people submit alignments containing sequences that are still embargoed by Genbank (arg...). A lot of people just pick the default one-year embargo period, not knowing how long it will take for their article to get through the publishing system. So at the time of submission to TreeBASE, we can't take advantage of any automatic cross-walking with Genbank.

- Unfortunately, BLAST often produces false positives, so it can't be trusted to assign an accession or species identifier on its own. At best, we should use BLAST to *assist* the submitter in preparing metadata, but human eyes have to supervise this process. This also assumes that Genbank is richly annotated, and unfortunately that's not true. For example, in a sample of 21,736 Genbank records that are also found in TreeBASE, only 373 (about 1.7%) were tagged with lat/long metadata. :-(

Taken together, these two points weaken the statement that this "solves #3 for molecular users".
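
To make the lat/long point concrete, here is a minimal sketch in Python of the kind of check involved. It is not the actual scripts mentioned above. It assumes the records are already on disk as GenBank flat-file text (no network access, so embargoed records would simply be absent), and that the metadata lives in the usual /organism and /lat_lon source qualifiers. The function names and the two sample records (accessions XX000001 and XX000002) are made up for illustration.

```python
import re

# Source-feature qualifiers as they appear in a GenBank flat file.
# Assumption: each qualifier value fits on one line (real records can wrap).
ORGANISM_RE = re.compile(r'/organism="([^"]+)"')
LAT_LON_RE = re.compile(r'/lat_lon="([^"]+)"')


def split_records(flatfile_text):
    """Split concatenated flat-file text on the '//' record terminator."""
    records, current = [], []
    for line in flatfile_text.splitlines():
        if line.strip() == "//":
            if current:
                records.append("\n".join(current))
            current = []
        else:
            current.append(line)
    if any(line.strip() for line in current):
        records.append("\n".join(current))
    return records


def extract_metadata(record_text):
    """Pull accession, organism and lat/long (if present) out of one record."""
    accession = None
    for line in record_text.splitlines():
        if line.startswith("ACCESSION"):
            parts = line.split()
            if len(parts) > 1:
                accession = parts[1]
            break
    organism = ORGANISM_RE.search(record_text)
    lat_lon = LAT_LON_RE.search(record_text)
    return {
        "accession": accession,
        "organism": organism.group(1) if organism else None,
        "lat_lon": lat_lon.group(1) if lat_lon else None,
    }


def lat_lon_coverage(metadata_rows):
    """Return (records tagged with lat/long, total records)."""
    tagged = sum(1 for row in metadata_rows if row["lat_lon"])
    return tagged, len(metadata_rows)


# Made-up sample data, only to exercise the functions above.
SAMPLE = '''LOCUS       XX000001     600 bp    DNA     linear   PLN 01-JAN-2011
ACCESSION   XX000001
FEATURES             Location/Qualifiers
     source          1..600
                     /organism="Examplea prima"
                     /lat_lon="12.50 N 45.25 W"
//
LOCUS       XX000002     580 bp    DNA     linear   PLN 01-JAN-2011
ACCESSION   XX000002
FEATURES             Location/Qualifiers
     source          1..580
                     /organism="Examplea secunda"
//
'''

rows = [extract_metadata(r) for r in split_records(SAMPLE)]
tagged, total = lat_lon_coverage(rows)
print(f"{tagged} of {total} records carry lat/long metadata")
```

Anything this returns as None (or any accession missing from the input altogether) is exactly the part the submitter would still have to fill in by hand.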

bp