From: Nilesh C. <ni...@ni...> - 2014-07-29 21:55:29
Dear all,

We are happy to announce an early beta version of Distributed DBpedia Extraction with Hadoop/Spark. Things are still rough, but we want beta testers to report their experience - and extraction time, of course. :)

https://github.com/dbpedia/distributed-extraction-framework

Read ahead if you are interested
================================

Right now we only support extraction, which means that you need to download the dumps with the existing method (distributed downloading is our next step). Setting up the framework and performing a distributed extraction is fairly easy; we have outlined all the details in the README and added a script for quickly firing up a Spark+HDFS cluster on Google Compute Engine.

For a single language, the whole extraction job (including redirects) is executed in parallel. If you add multiple languages, all jobs are submitted to Spark and, depending on Spark's configured scheduling mode, scheduled over the cluster either in FIFO (the default) or FAIR order.

We ran some tests on a small 3-node cluster: 1 master (2 cores, 7.5 GB RAM - GCE n1-standard-2) and 2 slaves (4 cores, 15 GB RAM each - GCE n1-standard-4), with 4 workers on each slave. Using the English Wikipedia, the distributed framework took a total of 3 hours 21 minutes to finish extraction (including the pre-extraction redirects computation). We'll add more tests and benchmarks to the GitHub wiki pages very soon.

Any feedback is more than welcome. We keep track of our future tasks and bugs on GitHub:
https://github.com/dbpedia/distributed-extraction-framework/issues

Cheers,
Nilesh, Sang & Dimitris

Acknowledgements: This project is sponsored by the Google Summer of Code program.
https://www.google-melange.com/gsoc/project/details/google/gsoc2014/nileshc/5841554954518528

You can also email me at co...@ni... or visit my website <http://nileshc.com/>
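P.S. For anyone experimenting with multi-language runs: the FIFO-vs-FAIR behaviour mentioned above is controlled by standard Spark configuration, not by our framework itself. A minimal sketch (the pool name and file path here are illustrative, not something our framework requires):

```
# spark-defaults.conf - switch from the default FIFO scheduler to fair scheduling,
# so jobs for several languages share cluster resources instead of queueing
spark.scheduler.mode             FAIR

# Optional: define named pools (weights, minimum shares) in an allocation file
spark.scheduler.allocation.file  /path/to/fairscheduler.xml
```

```
<!-- fairscheduler.xml - an example pool definition; "extraction" is a made-up pool name -->
<allocations>
  <pool name="extraction">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
    <minShare>2</minShare>
  </pool>
</allocations>
```

With FIFO (no configuration needed), the first submitted language's jobs get priority over the whole cluster; with FAIR, concurrently submitted jobs receive roughly equal shares of tasks.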