From: Hilmar L. <hl...@ne...> - 2012-01-24 22:44:19
I do agree that having downloadable dumps of the TreeBASE content in different formats would be a good idea - in fact it was one of the deliverables of the just-declined ABI grant. So if you want to put this in place now without support, that's cool of course. The problem, though, is that contrary to the plans in the grant you wouldn't be doing this based on a NoSQL document store and SOLR index, but from the relational database, and we already know that the queries for some studies will time out if you use the REST API. So I think the best way to accomplish this would be to dump the PostgreSQL database and reload it on a different server, where you can then generate the NEXUS and NeXML dumps.

-hilmar

Sent with a tap.

On Jan 24, 2012, at 11:07 AM, William Piel <wil...@ya...> wrote:

>
> On Jan 24, 2012, at 7:53 AM, Rutger Vos wrote:
>
>> Hi all,
>>
>> I've had a request from one of Enrico Pontelli's students for a complete dump in NeXML of TreeBASE. I would like to have one as well for my own purposes. Because we now have caching, this may not be as big a problem as previously, though most studies will not yet have been serialized to NeXML since the start of caching, so we still need to be careful. On the plus side: once we've done this we will have all of them in cache, so all subsequent requests should be snappier. Can we come up with a reasonable waiting time between requests so we don't kill the server? Is there a quiet time during which this can best be done? Do tb-stage or tb-dev also have caches?
>>
>> Rutger
>
> I think this is a good idea, given that it will build up a war chest of cached data. (In fact, maybe we should first extend the expiry date on the cache so that it lasts longer?) Perhaps it will also catch datasets that are problematic.
>
> Google Analytics shows that activity is lowest on the weekend -- no surprise there. But maybe it would be better to do it during the week so that it's easy to intervene if the application gets locked up.
> Also, it might make sense to throttle the download process intentionally (e.g. by interspersing requests with the "sleep" function in Perl) so that the application has ample time for garbage collection, etc., and so as not to impact the system too much. Finally, even if you're not after NEXUS, maybe it would help to download NEXUS as well, as the NEXUS cache is also valuable to build up.
>
> bp
>
> _______________________________________________
> Treebase-devel mailing list
> Tre...@li...
> https://lists.sourceforge.net/lists/listinfo/treebase-devel
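
[For reference, the kind of throttled crawl described above could be sketched roughly as follows in Python. The study ID list, the 30-second delay, the timeout, and the output filenames are all assumptions; the purl.org PhyloWS URL pattern follows TreeBASE's published scheme but should be double-checked against the TreeBASE documentation before running.]

```python
# Hypothetical throttled-dump sketch -- not an official TreeBASE tool.
import time
import urllib.request

DELAY_SECONDS = 30  # assumed "reasonable waiting time" between requests

def study_url(study_id, fmt="nexml"):
    """Build the PhyloWS URL for one study in the requested serialization."""
    return ("http://purl.org/phylo/treebase/phylows/study/"
            "TB2:S%d?format=%s" % (study_id, fmt))

def dump_studies(study_ids, delay=DELAY_SECONDS):
    """Fetch each study's NeXML, sleeping between requests to throttle load."""
    for sid in study_ids:
        try:
            with urllib.request.urlopen(study_url(sid), timeout=600) as resp:
                data = resp.read()
            with open("S%d.xml" % sid, "wb") as fh:
                fh.write(data)
        except Exception as exc:  # a timed-out study is logged, not fatal
            print("S%d failed: %s" % (sid, exc))
        time.sleep(delay)  # give the app room for garbage collection, etc.
```

[The same loop could fetch `format=nexus` on a second pass to warm the NEXUS cache as well.]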