Re: [Treebase-devel] harvesting

On Jan 24, 2012, at 3:12 PM, Hilmar Lapp wrote:

> and we know already that the queries for some studies will time out if you use the REST API.

That certainly was true at one time, but we have since made fixes that should have solved those problems. 

Rod Page's attempt to suck down all of TreeBASE did encounter studies that were timing out -- and he sent me a list of them. But later, when I tried to fetch them, they downloaded fine. So I think the problem was one of hitting the application in rapid fire: the cumulative load slowed overall performance, and as a result certain studies timed out on him. Hence my suggestion that Rutger purposely throttle his scripts.
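To make the throttling suggestion concrete, here is a minimal sketch of what a throttled harvester might look like. It is only an illustration, not Rutger's actual script: the function names (harvest, fetch_url), the 10-second delay and the 300-second timeout are all assumptions picked for the example, and a suitable delay would have to be found by experiment.

```python
import time
import urllib.request
from typing import Callable, Dict, Iterable, Optional


def fetch_url(url: str, timeout: float = 300.0) -> bytes:
    """Download one URL and return the raw response body.

    The generous timeout is an assumed value, chosen because large
    studies can take a while to serialize.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def harvest(
    urls: Iterable[str],
    delay: float = 10.0,  # assumed pause in seconds; tune by experiment
    fetch: Callable[[str], bytes] = fetch_url,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Optional[bytes]]:
    """Fetch URLs strictly one at a time, pausing between requests.

    A failed download is recorded as None so it can be retried in a
    later, separate pass instead of being re-requested immediately.
    """
    results: Dict[str, Optional[bytes]] = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay)  # throttle: never issue back-to-back requests
        try:
            results[url] = fetch(url)
        except OSError:
            # urllib's URLError/HTTPError and socket timeouts are all
            # OSError subclasses, so this covers network failures.
            results[url] = None
    return results
```

Calling harvest() with a list of study URLs returns a dictionary keyed by URL. Any entry that comes back as None is a candidate for a slower second pass, which would also help distinguish load-related time-outs from studies that are genuinely broken.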

Both TreeBASE's tallest dataset (~3,000 taxa) and its widest dataset (~110,000 characters) download just fine:

tallest: 

http://purl.org/phylo/treebase/phylows/study/TB2:S11686?format=nexus

widest: 

http://purl.org/phylo/treebase/phylows/study/TB2:S12064?format=nexus

And this works to get a list of all URIs.

So unless there are specific cases of corrupt data (which there probably are), or the cumulative effects of excessive web service load cause subsequent time-outs, I don't anticipate any fundamental problems. (And if the former, we'd like to hear about which ones are corrupt.) So I think this is worth the experiment, on the understanding that Rutger might need to halt what he's doing should we discover that he has a crippling effect on the service.

bp