Re: [Treebase-devel] Uploading interpretations of published data sets

On May 14, 2012, at 11:18 AM, Michael Jones wrote:

> (Reposted from treebase-Bugs)
> I'm interested in uploading data from the repository available at
> http://www.graemetlloyd.com/matr.html. According to the site's curator,
> Dr. Graeme Lloyd, however: "1. The vast majority of the datasets are
> based on my interpretation of published datasets (usually matrices
> printed as tables in PDFs), and not the original files used by the
> author. (In theory these should be the same, but in practice...) 2.
> Similarly, all trees are from my own (parsimony) searches in TNT and
> don't necessarily reflect the settings used by the authors. 3. This
> list is a dynamic one and I am adding to it all the time." Since these
> files are interpretations of published data, I was wondering whether or
> not to upload them onto TreeBASE.

Hi Michael,

- For me, the biggest disappointment of the Graeme Lloyd data is the lack of character labels and state labels, which would be pretty critical for data reuse. But characters without labels are still better than no characters at all. In the past, we have captured character label data using OCR from the hard copy or copy/paste from the PDF. But turning free text into NEXUS can be tricky and requires some skill with TextWrangler and regular expressions. If you'd like to take a stab at acquiring these data, I'd be happy to Skype with you about how to do this with TextWrangler, MacClade, Mesquite, etc.

- While mistakes may be present in Graeme Lloyd's data, I wouldn't worry about that too much unless you find that the error rate is quite high.

- When I look at the page, I see that Graeme provides both MPTs and a consensus tree (in Phylip/Newick format). If possible, the tree submitted to TreeBASE should look like the one published in the original paper, so I would suggest the following:

  (1) Open the data matrix in Mesquite.
  (2) Do "Include File" to import the consensus tree (selecting the "Phylip" format type).
  (3) Look at the original PDF and use the tools in Mesquite (move branch, collapse clade, swap branch, etc.) to make any minor changes needed for the consensus tree to match the published tree.
  (4) Do "Store Tree" from the Tree menu in Mesquite to commit these changes (this is a very important step).
  (5) Use the List Tree tab to change the name of the tree to match its name in the publication (e.g. "Fig. 3").
  (6) Save the NEXUS file.

  If there is no opportunity to match the consensus tree with what was published, I think it is okay to upload it anyway, but name the analysis and tree something like "Unpublished result; TreeBASE Fecit" or "Unpublished result; Graeme Lloyd Fecit".

- Most important, it would be good form to contact Graeme Lloyd to praise him for his valuable efforts in assembling dino phylogenetic data, and then explain your plans to "mirror" these data in TreeBASE, again emphasizing the great value they will add to TreeBASE (on account of dinos being very rare in TreeBASE). TreeBASE submissions have a "URL" field, and you can offer to put his web site in there as a way to credit the source of the data.
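
To give a feel for the regular-expression work mentioned above, here is a rough Python sketch (Python standing in for TextWrangler's find-and-replace). It assumes the character list pasted from the PDF has one character per line in the form "1. Skull shape: round (0); elongate (1)." which is only a guess at a typical layout, and real papers will need their own patterns. The function names are made up for illustration. The second function wraps a Newick string in a named TREES block, which is roughly what the renaming step in Mesquite produces.

```python
import re

# Assumed input layout (hypothetical): "1. Skull shape: round (0); elongate (1)."
CHAR_LINE = re.compile(r"^\s*(\d+)\.\s*(.+?):\s*(.+?)\.?\s*$")
# Trailing state code such as " (0)" at the end of a state description.
STATE_CODE = re.compile(r"\s*\(\d+\)\s*$")


def nexus_quote(token):
    """Single-quote a NEXUS token, doubling any embedded single quotes."""
    return "'" + token.replace("'", "''") + "'"


def charstatelabels(text):
    """Build a CHARSTATELABELS command from pasted character descriptions."""
    entries = []
    for line in text.splitlines():
        match = CHAR_LINE.match(line)
        if not match:
            continue  # skip lines that do not fit the assumed layout
        number, name, states = match.groups()
        labels = [
            STATE_CODE.sub("", state.strip())
            for state in states.split(";")
            if state.strip()
        ]
        quoted = " ".join(nexus_quote(label) for label in labels)
        entries.append(f"    {number} {nexus_quote(name)} / {quoted}")
    return "  CHARSTATELABELS\n" + ",\n".join(entries) + "\n  ;"


def trees_block(name, newick):
    """Wrap one Newick string in a NEXUS TREES block under the given name."""
    newick = newick.strip()
    if not newick.endswith(";"):
        newick += ";"
    return f"BEGIN TREES;\n  TREE {nexus_quote(name)} = {newick}\nEND;"


if __name__ == "__main__":
    pasted = "1. Skull shape: round (0); elongate (1).\n2. Tail: absent (0); present (1)"
    print(charstatelabels(pasted))
    print(trees_block("Fig. 3", "((A,B),C)"))
```

The output of charstatelabels() would still need to be pasted inside the CHARACTERS block of the matrix file and checked by eye against the PDF, since OCR and copy/paste errors will not be caught by a pattern like this.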

regards,

Bill