Re: [Treebase-devel] nexml and treebase status


>         1544 files (43%) had errors associated with using repeated IDs. I
> think that in most of these cases the taxa with repeated IDs were just
> shallow copies of the same data.
>

Yes. TreeBASE studies can have multiple TaxonLabelSets (roughly, taxa
blocks) that can (and do) reference the same taxon labels. I guess the IDs
that are generated from the taxon label objects should therefore be
namespaced by the containing taxon label set. This will get ugly, though,
when generating the OTU ID references in tree nodes and character rows.
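
A minimal sketch of the namespacing idea, in Python for brevity. The ID
format and the names otu_id, set_id and label_id are hypothetical and not
taken from the TreeBASE code. The point is only that tree nodes and
character rows would have to build their OTU references through the same
function, passing in the taxon label set they belong to.

```python
def otu_id(set_id: int, label_id: int) -> str:
    """Build an OTU ID that is unique per (taxon label set, taxon label) pair.

    The "otu_s<set>_l<label>" format is an assumption made for illustration.
    """
    return f"otu_s{set_id}_l{label_id}"


# Hypothetical study: two taxon label sets that both reference taxon label 7.
label_sets = {1: [5, 7], 2: [7, 9]}

ids = [otu_id(s, label) for s, labels in label_sets.items() for label in labels]

# With namespacing, the shared label no longer produces a repeated ID.
assert len(ids) == len(set(ids))
```

The ugly part is that every tree and matrix then needs to know its taxon
label set before it can emit a reference.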

>         2 studies failed to emit a <char> element in the matrix
>

From the web interface it also appears that there are NO characters in
these matrices. Not sure what's up with that.

>         6 studies used lower case in a DNA or AA matrix.
>

This ought to be trivially fixable.

>
> There were a fair number of studies (372) that look like rows from the
> NEXUS file were simply exported as the text of a NeXML seq element.
> This is not allowed if there is special NEXUS-specific syntax in the
> matrix. Also augmenting the DNA or AA type with additional character
> state symbols is not allowed.
>

Yes, this is a huge bugbear. TreeBASE character sequences are
de-normalized (i.e. stored as opaque strings), because proper normalization
of ambiguous character states is nowhere near scalable. It is conceivable
that the nexml generator classes could parse those strings and replace the
ambiguity tokens (e.g. groups in () or {}) with IUPAC symbols, but I'm
guessing this will only be painless, performance-wise, if the generated
nexml is subsequently cached by the web application.
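
As a rough sketch of what that parsing could look like for DNA rows, in
Python for brevity: the names GROUP, to_iupac and _to_symbol are
hypothetical, and lru_cache only stands in for whatever caching the web
application would really use. Note that mapping both () and {} groups to
one IUPAC symbol discards the NEXUS distinction between polymorphism and
uncertainty.

```python
import re
from functools import lru_cache

# Standard IUPAC nucleotide ambiguity codes, keyed by the set of bases.
IUPAC_DNA = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D", frozenset("ACT"): "H",
    frozenset("ACG"): "V", frozenset("ACGT"): "N",
}

# Matches one ambiguity group such as {AG} or (C,T).
GROUP = re.compile(r"[({]([^(){}]*)[)}]")


def _to_symbol(match: re.Match) -> str:
    """Map the bases inside one group to a single symbol."""
    # Drop separators; what remains must be a set of plain bases.
    states = frozenset(match.group(1).upper()) - set(" ,")
    if len(states) == 1 and states <= set("ACGT"):
        return next(iter(states))
    try:
        return IUPAC_DNA[states]
    except KeyError:
        raise ValueError(
            f"cannot map {match.group(0)!r} to an IUPAC symbol"
        ) from None


@lru_cache(maxsize=1024)
def to_iupac(row: str) -> str:
    """Replace every ambiguity group in a DNA row with its IUPAC symbol."""
    return GROUP.sub(_to_symbol, row)


print(to_iupac("AC{AG}T(CT)-?"))  # ACRTY-?
```

Anything that is not a plain set of bases (an empty group, or extra state
symbols) raises an error rather than being silently passed through.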

> The remaining 1 % of files look like more serious errors in TreeBase
> code (empty studies or stack traces) or corrupt data in the db.
>

There are some studies where the web application times out (too much data),
and some where the data unfortunately appears to be corrupted somehow.