From: Rutger V. <rut...@gm...> - 2014-01-08 12:29:37
> 1544 files (43%) had errors associated with using repeated IDs. I
> think that in most of these cases the taxa with repeated IDs were just
> shallow copies of the same data.

Yes. TreeBASE studies can have multiple TaxonLabelSets (~taxa blocks) that can (and do) reference the same taxon labels. I guess the IDs that are generated from the taxon label objects should therefore become namespaced to the containing taxon label set. This will get ugly, though, when generating the OTU ID references in tree nodes and character rows.

> 2 studies failed to emit a <char> element in the matrix

From the web interface it also appears that there are NO characters in these matrices. Not sure what's up with that.

> 6 studies used lower case in a DNA or AA matrix.

This ought to be trivially fixable.

> There were a fair number of studies (372) that look like rows from the
> NEXUS file were simply exported as the text of a NeXML seq element.
> This is not allowed if there is special NEXUS-specific syntax in the
> matrix. Also augmenting the DNA or AA type with additional character
> state symbols is not allowed.

Yes, this is a huge bugbear. TreeBASE character sequences are de-normalized (i.e. just opaque strings), because proper normalization of ambiguous character states is nowhere near scalable. It is conceivable that the NeXML generator classes could parse those strings and replace the ambiguity tokens (e.g. the () and {} groups) with IUPAC symbols, but I'm guessing this will only be non-painful, performance-wise, if the generated NeXML is subsequently cached by the web application.

> The remaining 1 % of files look like more serious errors in TreeBASE
> code (empty studies or stack traces) or corrupt data in the db.

There are some studies where the web application times out (too much data), and then there are some where the data appears to be corrupted somehow. Unfortunately.
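
To make the ambiguity-token idea concrete, here is a minimal Python sketch of the kind of rewrite the generator could do on a DNA row. It is not the TreeBASE code: the names `nexus_row_to_iupac` and `IUPAC_DNA` are hypothetical, it assumes groups contain only A, C, G and T (optionally separated by commas or spaces), and it treats (..) polymorphism and {..} uncertainty identically, which loses the distinction NEXUS makes between them. It also upper-cases the row, which would cover the lower-case matrices mentioned above.

```python
import re

# IUPAC nucleotide ambiguity codes, keyed by the set of bases each one stands for.
IUPAC_DNA = {
    frozenset("AG"): "R",
    frozenset("CT"): "Y",
    frozenset("CG"): "S",
    frozenset("AT"): "W",
    frozenset("GT"): "K",
    frozenset("AC"): "M",
    frozenset("CGT"): "B",
    frozenset("AGT"): "D",
    frozenset("ACT"): "H",
    frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}

# A NEXUS ambiguity group: (..) or {..} with no nested brackets inside.
_GROUP = re.compile(r"[({]([^(){}]*)[)}]")


def nexus_row_to_iupac(row: str) -> str:
    """Replace (..) and {..} groups in a DNA row with single IUPAC symbols.

    Assumption: both bracket styles are collapsed to the same symbol.
    Raises ValueError for any group that has no IUPAC equivalent.
    """

    def replace(match: re.Match) -> str:
        # Ignore separators some files put between states, e.g. {A,G} or {A G}.
        states = frozenset(match.group(1).upper()) - set(" ,")
        if len(states) == 1 and states <= set("ACGT"):
            return next(iter(states))  # a one-state group is just that base
        try:
            return IUPAC_DNA[states]
        except KeyError:
            raise ValueError(
                f"cannot map {match.group(0)!r} to an IUPAC symbol"
            ) from None

    # Upper-case the whole row after substitution to normalize case as well.
    return _GROUP.sub(replace, row).upper()
```

For example, `nexus_row_to_iupac("ac{ag}t(CT)-?")` gives `"ACRTY-?"`. Failing loudly on unmappable groups seems safer than passing the raw NEXUS syntax through into a seq element, and the per-row cost is the reason caching the generated output would matter.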