Re: [Treebase-devel] charsets?

Hi Bill,

Thanks for the info. Because the partition information is critical for data
re-use, I hope this may eventually be required for DNA alignments. The
gene names and GenBank numbers are less important, since they are more
easily inferred or discovered. But even a non-standardized gene name provided
by the author is way more informative than none at all. And I think it
would be difficult to deduce gene partitions from BLAST. More
difficult than burdening submitters a bit more, anyway.
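
To make the re-use point concrete, here is a minimal sketch in Python of how a
concatenated matrix row can be split into gene regions once CHARSET statements
are present. It is only an illustration, not TreeBASE code: the function names
and the gene names in the usage example are made up, and it assumes the simple
"CHARSET name = start-end ...;" form with 1-based inclusive positions. It
deliberately rejects anything else (step notation such as "\3", "." for the
last character, references to other charsets) rather than guessing.

```python
import re

# Matches simple statements such as "CHARSET rbcL = 1-1400;" (case-insensitive).
# Assumption: unquoted names and no Mesquite-style "(Characters = ...)" qualifier.
CHARSET_RE = re.compile(r"\bcharset\s+([^\s=]+)\s*=\s*([^;]+);", re.IGNORECASE)
RANGE_RE = re.compile(r"(\d+)(?:-(\d+))?")


def parse_charsets(nexus_text):
    """Return {name: [(start, end), ...]} using 1-based inclusive positions."""
    charsets = {}
    for name, spec in CHARSET_RE.findall(nexus_text):
        # Normalise "5 - 8" to "5-8" so each range is a single token.
        tokens = re.sub(r"\s*-\s*", "-", spec).split()
        ranges = []
        for token in tokens:
            match = RANGE_RE.fullmatch(token)
            if match is None:
                # Unsupported syntax: fail loudly instead of mis-slicing.
                raise ValueError(f"unsupported charset token: {token!r}")
            start = int(match.group(1))
            end = int(match.group(2) or start)
            ranges.append((start, end))
        charsets[name] = ranges
    return charsets


def slice_row(sequence, ranges):
    """Extract the characters of one matrix row covered by the given ranges."""
    return "".join(sequence[start - 1:end] for start, end in ranges)
```

Usage, with a toy SETS block (hypothetical gene names and coordinates):

```python
nexus = """
BEGIN SETS;
    CHARSET rbcL = 1-4;
    CHARSET trnL_F = 5 - 8 10;
END;
"""
sets = parse_charsets(nexus)
gene = slice_row("ACGTTTGGAC", sets["rbcL"])
```

Because the positions refer to character order, the same ranges would apply to
any other serialization of the matrix that keeps the characters in the same
order.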

-Rick

On Sat, May 21, 2011 at 5:01 PM, William Piel <wil...@ya...> wrote:

> Hi Rick,
>
> On May 21, 2011, at 10:53 AM, Richard Ree wrote:
>
> > I'm trying to decompose sequence alignments from TB into component
> > gene regions.  Does TB have a policy regarding uploaded matrices
> > providing this information?  E.g. S11152 does not - 8 cpDNA regions
> > are concatenated without any charset or partition info.
>
>
> To avoid over-burdening the submitter, we don't require this. Also, I don't
> know that gene names in charset free text are standardized enough to make
> this kind of metadata as valuable as it should be. Ideally everyone would
> supply GenBank accession numbers, and gene names could then be extracted
> from those. Currently these are downloadable only as tab-separated text
> (e.g. use this link
> <http://treebase.org/treebase-web/search/study/rowSegmentsTSV.html?matrixid=8983>
> to download the metadata for this dataset
> <http://purl.org/phylo/treebase/phylows/study/TB2:S11551?format=nexus>)
> -- I've asked the Mesquite people how this could be expressed in
> Mesquite-readable NEXUS, but so far no answer from them. Current
> implementations of annotations for the NOTES block only allow for
> taxon-linked metadata:
>
> SUTM  T = 4 N = genBankNumber S = AF284000;
>
> and single-character-linked metadata, e.g.:
>
> AN T = 4 C = 1  AU = TreeBASE TF = ( CM AF284000 ) TF = (
> R genBankNumber );
>
> ... but not for ranges or spans of characters, which is the logic used in
> TreeBASE.
>
> > For those that do, it seems only visible in the nexus format, not
> > nexml.  Can we get charset tags in nexml?
>
>
> Yes, that should be easily doable. Maybe Laurel Yohe (a new Google Summer
> of Code<http://informatics.nescent.org/wiki/PhyloSoC:_Automated_submission_of_rich_data_to_TreeBASE> author)
> can implement this as an exercise to get to know TreeBASE's code.  We need a
> NeXML solution both for expressing CHARSET free text and for expressing the
> row-segment metadata like the GenBank accession number.
>
> > Also, for a given matrix, is the order of characters the same, if I
> > download it in nexus and nexml?  If so, I suppose it would be easy to
> > just pull out the charset info from the nexus file.
>
>
> Yes, if I understand what you're saying. Of course, the order of the
> characters is not shuffled (!!) but I don't know why you might think that
> could happen.
>
> At any rate, I'd like to come up with a script, involving BLAST, that is
> smart enough to take all matrix rows and figure out (1) where gene sequences
> begin and end, and (2) auto-generate GenBank accession numbers for
> row-segment metadata. It will be challenging to avoid introducing
> false positives, and to get BLAST to match different sequences to different
> sections of the matrix row.
>
> bp