From: Richard R. <rr...@fi...> - 2011-05-21 22:09:17
Hi Bill,

Thanks for the info. Because the partition information is critical for
data re-use, I hope this may eventually be required for DNA alignments.
The gene names and GenBank numbers are less important, since they are
more easily inferred/discovered. But even a non-standardized gene name
provided by the author is way more informative than none at all. And I
think it would be difficult to try to deduce gene partitions from BLAST.
More difficult than burdening submitters a bit more, anyway.

-Rick

On Sat, May 21, 2011 at 5:01 PM, William Piel <wil...@ya...> wrote:
> Hi Rick,
>
> On May 21, 2011, at 10:53 AM, Richard Ree wrote:
>
> > I'm trying to decompose sequence alignments from TB into component
> > gene regions. Does TB have a policy regarding uploaded matrices
> > providing this information? E.g. S11152 does not - 8 cpDNA regions
> > are concatenated without any charset or partition info.
>
> To avoid over-burdening the submitter, we don't require this. Also, I
> don't know that gene names in charset free text are standardized enough
> to make this kind of metadata as valuable as it should be. Ideally
> everyone would supply GenBank accession numbers, and then extract gene
> names from there. Currently these are downloadable only as tab-separated
> text (e.g. use this
> <http://treebase.org/treebase-web/search/study/rowSegmentsTSV.html?matrixid=8983>
> link to download the metadata for this
> <http://purl.org/phylo/treebase/phylows/study/TB2:S11551?format=nexus>
> dataset) -- I've asked the Mesquite people how this could be expressed
> in Mesquite-readable NEXUS, but so far no answer from them. Current
> implementations of annotations for the NOTES block only allow for
> taxon-linked metadata:
>
>     SUTM T = 4 N = genBankNumber S = AF284000;
>
> and single-character-linked metadata, e.g.:
>
>     AN T = 4 C = 1 AU = TreeBASE TF = ( CM AF284000 ) TF = ( R genBankNumber );
>
> ... but not for ranges or spans of characters, which is the logic used
> in TreeBASE.
>
> > For those that do, it seems only visible in the nexus format, not
> > nexml. Can we get charset tags in nexml?
>
> Yes, that should be easily doable. Maybe Laurel Yohe (a new Google
> Summer of Code
> <http://informatics.nescent.org/wiki/PhyloSoC:_Automated_submission_of_rich_data_to_TreeBASE>
> author) can implement this as an exercise to get to know TreeBASE's
> code. We need a NeXML solution both for expressing CHARSET free text and
> for expressing the row-segment metadata like the GenBank accession
> number.
>
> > Also, for a given matrix, is the order of characters the same, if I
> > download it in nexus and nexml? If so, I suppose it would be easy to
> > just pull out the charset info from the nexus file.
>
> Yes, if I understand what you're saying. Of course, the order of the
> characters is not shuffled (!!) but I don't know why you might think
> that could happen.
>
> At any rate, I'd like to come up with a script, involving BLAST, that is
> smart enough to take all matrix rows and figure out (1) where gene
> sequences begin and end, and (2) auto-generate GenBank accession numbers
> for row-segment metadata. It will be challenging to avoid introducing
> false positives, and to get BLAST to match different sequences to
> different sections of the matrix row.
>
> bp
>
> _______________________________________________
> Treebase-devel mailing list
> Tre...@li...
> https://lists.sourceforge.net/lists/listinfo/treebase-devel
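
For matrices that do carry charset info, the "pull out the charset info from the nexus file" idea discussed above could look roughly like the Python sketch below. It is only an illustration under stated assumptions: it handles plain `CHARSET name = start-end;` statements (optionally with a `\N` stride), it does not handle `.` as "last character" or references to other charsets, and the gene names, taxon names, and sequences in the example are made up rather than taken from any TreeBASE matrix. The functions `parse_charsets` and `split_alignment` are hypothetical helpers, not part of TreeBASE or any existing library.

```python
import re

# Matches e.g. "CHARSET rbcL = 1-1400;" (name may be single-quoted).
CHARSET_RE = re.compile(r"charset\s+('[^']*'|[^\s=]+)\s*=\s*([^;]+);", re.IGNORECASE)
# Matches "12", "1-500", or "3-900\3" (start, optional end, optional stride).
RANGE_RE = re.compile(r"(\d+)(?:\s*-\s*(\d+))?(?:\s*\\\s*(\d+))?")


def parse_charsets(nexus_text):
    """Return {charset name: sorted list of 0-based column indices}."""
    charsets = {}
    for name, spec in CHARSET_RE.findall(nexus_text):
        cols = set()
        for start, end, step in RANGE_RE.findall(spec):
            first = int(start)
            last = int(end) if end else first
            stride = int(step) if step else 1
            # NEXUS positions are 1-based and inclusive.
            cols.update(range(first - 1, last, stride))
        charsets[name.strip("'")] = sorted(cols)
    return charsets


def split_alignment(rows, charsets):
    """Cut each aligned row into per-charset pieces.

    rows maps taxon label -> aligned sequence string.
    """
    return {
        name: {
            taxon: "".join(seq[i] for i in cols if i < len(seq))
            for taxon, seq in rows.items()
        }
        for name, cols in charsets.items()
    }


# Toy example (invented data, assumed charset names).
example = """
BEGIN SETS;
    CHARSET rbcL = 1-4;
    CHARSET matK = 5-8;
END;
"""
rows = {"taxon_A": "ACGTTTGA", "taxon_B": "ACGATTGC"}
charsets = parse_charsets(example)
genes = split_alignment(rows, charsets)
print(genes["rbcL"]["taxon_A"], genes["matK"]["taxon_B"])
```

Since the character order is the same in the nexus and nexml downloads, the column indices recovered this way should in principle apply to either format; a matrix like S11152 with no charset statements would simply yield an empty dictionary.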