Re: [Treebase-devel] charsets?

Okay. I'll reply to the MIAPA discussion group, seeing as they are keen on formulating minimum requirements. 

bp

On May 21, 2011, at 6:09 PM, Richard Ree wrote:

> Hi Bill,
> 
> Thanks for the info.  Because the partition information is critical for data re-use, I hope this may eventually be required for DNA alignments.   The gene names and genbank numbers are less important, since they are more easily inferred/discovered.  But, even a non-standardized gene name provided by the author is way more informative than none at all.  And, I think it would be difficult to try and deduce gene partitions from BLAST.  More difficult than burdening submitters a bit more, anyway.
> 
> -Rick
> 
> On Sat, May 21, 2011 at 5:01 PM, William Piel <wil...@ya...> wrote:
> Hi Rick,
> 
> On May 21, 2011, at 10:53 AM, Richard Ree wrote:
> 
>> I'm trying to decompose sequence alignments from TB into component
>> gene regions.  Does TB have a policy regarding uploaded matrices
>> providing this information?  E.g. S11152 does not - 8 cpDNA regions
>> are concatenated without any charset or partition info.
> 
> To avoid over-burdening the submitter, we don't require this. Also, I don't know that gene names in charset free text are standardized enough to make this kind of metadata as valuable as it should be. Ideally everyone would supply Genbank accession numbers, and then extract gene names from there. Currently these are downloadable only as tab-separated text (e.g. use this link to download the metadata for this dataset) -- I've asked the Mesquite people how this could be expressed in Mesquite-readable NEXUS, but so far no answer from them. Current implementations of annotations for the NOTES block only allow for taxon-linked metadata:
> 
> SUTM  T = 4 N = genBankNumber S = AF284000;
> 
> and single-character-linked metadata, e.g.:
> 
> AN T = 4 C = 1  AU = TreeBASE TF = ( CM AF284000 ) TF = ( R genBankNumber );
> 
> ... but not for ranges or spans of characters, which is the logic used in TreeBASE.
> 
>> For those that do, it seems only visible in the nexus format, not
>> nexml.  Can we get charset tags in nexml?
> 
> Yes, that should be easily doable. Maybe Laurel Yohe (a new Google Summer of Code author) can implement this as an exercise to get to know TreeBASE's code.  We need a NeXML solution both for expressing CHARSET free text and for expressing the row-segment metadata like the Genbank accession number. 
> 
>> Also, for a given matrix, is the order of characters the same, if I
>> download it in nexus and nexml?  If so, I suppose it would be easy to
>> just pull out the charset info from the nexus file.
> 
> 
> Yes, if I understand what you're saying. Of course, the order of the characters is not shuffled (!!) but I don't know why you might think that could happen. 
> 
> At any rate, I'd like to come up with a script, involving BLAST, that is smart enough to take all matrix rows and figure out (1) where gene-sequences begin and end, and (2) auto-generate Genbank accession numbers for row-segment metadata. It will be challenging to avoid introducing false positives, and to get BLAST to match different sequences to different sections of the matrix-row. 
> 
> bp
> 
> 
>
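
The idea raised in the thread, pulling the charset info out of the NEXUS file and using it to split a concatenated matrix into gene regions, might look roughly like the Python sketch below. It is only an illustration under stated assumptions, not TreeBASE or Mesquite code: the function names parse_charsets and split_row are made up here, only plain "N" and "N-M" position tokens are handled, and charset names are assumed to be single unquoted words.

```python
import re

# Matches statements such as "CHARSET rbcL = 1-1400;" (case-insensitive).
# Assumption: charset names are single unquoted tokens.
CHARSET_RE = re.compile(r"charset\s+([^\s=]+)\s*=\s*([^;]+);", re.IGNORECASE)


def parse_charsets(nexus_text):
    """Return {name: [(start, end), ...]} using 1-based inclusive positions."""
    charsets = {}
    for name, spec in CHARSET_RE.findall(nexus_text):
        ranges = []
        for token in spec.split():
            # Only "N" and "N-M" tokens are handled in this sketch;
            # stride forms such as "1-900\3" are silently skipped.
            m = re.fullmatch(r"(\d+)(?:-(\d+))?", token)
            if m:
                start = int(m.group(1))
                end = int(m.group(2) or start)
                ranges.append((start, end))
        charsets[name] = ranges
    return charsets


def split_row(sequence, charsets):
    """Cut one aligned matrix row into per-charset segments."""
    return {
        name: "".join(sequence[start - 1:end] for start, end in ranges)
        for name, ranges in charsets.items()
    }
```

Since the character order is the same in the NEXUS and NeXML downloads (per the reply above), the ranges parsed from the NEXUS file could in principle be applied to rows taken from either format. For example, calling parse_charsets on the text of a SETS block and then split_row on each row of the matrix would give one sub-sequence per named region.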