From: Mark H. <mth...@gm...> - 2014-01-04 18:45:21
Hi,

I'm crossposting this to the NeXML and TreeBASE lists because I think that TreeBASE could be the largest supplier of NeXML files.

In the process of working on a round-trip test for a NeXML->JSON tool for the open tree project, I've been doing some more characterization of the NeXML validity of the files in Rutger's supertreebase repo ( https://github.com/rvosa/supertreebase ). My understanding is that these were all produced via export from TreeBASE. In a couple of cases I re-downloaded the files from the TreeBASE API to verify that they were the same as the versions that Rutger committed, but I mainly used his cached versions in the git repo to avoid hammering the TreeBASE servers.

The validation was done against the version of the NeXML schema that would be generated if someone with nexml-repo admin privileges were to accept my recent pull request ( https://github.com/nexml/nexml/pull/8 ). The tool used for validation was xml-validator-1.0 (available from http://code.google.com/p/xml-validator/ ).

Results:

The bad news is that only 1652 files (about 46%) validate successfully.

The good news is that many of the errors appear to be easy to fix: they seem to be related to how the NeXML is composed rather than to any deeper problem in the data or in TreeBASE. In many cases, I think that a lax NeXML parser will tolerate these errors and interpret the data correctly. For example:

  1544 files (43%) had errors associated with repeated IDs. I think that in most of these cases the taxa with repeated IDs were just shallow copies of the same data.

  2 studies failed to emit a <char> element in the matrix.

  6 studies used lower case in a DNA or AA matrix.

  A fair number of studies (372) look like rows from the NEXUS file were simply exported as the text of a NeXML seq element. This is not allowed if there is special NEXUS-specific syntax in the matrix. Augmenting the DNA or AA type with additional character state symbols is also not allowed.
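For anyone who wants to look for the repeated-ID problem without running a full schema validator, a minimal sketch in Python follows. It is an illustration only, not the tool used for the numbers above. It assumes that every "id" attribute should be unique across the whole document; the function name find_repeated_ids and the sample fragment are hypothetical, and the fragment is a toy, not a complete or schema-valid NeXML file.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def find_repeated_ids(nexml_text):
    """Map each "id" attribute value that occurs more than once to its count.

    Assumption: ids are meant to be unique document-wide.
    """
    root = ET.fromstring(nexml_text)
    # root.iter() walks the root element and all of its descendants.
    counts = Counter(
        elem.get("id") for elem in root.iter() if elem.get("id") is not None
    )
    return {id_value: n for id_value, n in counts.items() if n > 1}


# Hypothetical toy fragment with one otu emitted twice under the same id.
SAMPLE = """<nex:nexml xmlns:nex="http://www.nexml.org/2009" version="0.9">
  <otus id="otus1">
    <otu id="otu1" label="Homo sapiens"/>
    <otu id="otu1" label="Homo sapiens"/>
    <otu id="otu2" label="Pan troglodytes"/>
  </otus>
</nex:nexml>"""

if __name__ == "__main__":
    print(find_repeated_ids(SAMPLE))  # {'otu1': 2}
```

A check like this only reports that an id is reused; it does not tell you whether the duplicated elements are shallow copies of the same data, which would still need a manual or attribute-by-attribute comparison.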

The remaining 1% of files look like more serious errors in the TreeBASE code (empty studies or stack traces) or corrupt data in the db.

Some more details can be found at: http://phylo.bio.ku.edu/public/supertreebase.txt

I submitted TreeBASE bug reports.

all the best,
Mark

--
Mark Holder
mth...@gm...
mth...@ku...
http://phylo.bio.ku.edu/mark-holder
==============================================
Department of Ecology and Evolutionary Biology
University of Kansas
6031 Haworth Hall
1200 Sunnyside Avenue
Lawrence, Kansas 66045
lab phone: 785.864.5789
fax (shared): 785.864.5860
==============================================