Can somebody share a valid couple of
To test installation before building a new csv summary (long process).
Thank you in advance.
Do you have 20110722.xml? If you do, please share it.
I have it, but it will take some time to upload.
You have 20110722.xml? Can you send it to me?
Can you put it on FTP or share it as a big zipped file?
Yup. I need it as well. Can you share it out?
I have finally uploaded it.
the second file
Thank you so much for sharing!!! Is there no problem using a Wikipedia dump from 2008 with a CSV summary from 2011?
Sorry about this, I uploaded the old file for Wikipedia Miner 1. You should download the Wikipedia dump that matches the CSV summary. You can find it at the link below, under the name enwiki-20110722-pages-articles.xml.
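For anyone who wants a quick sanity check before starting a long run: the point above is that the dump and the CSV summary have to come from the same snapshot. Below is a minimal sketch that compares the 8-digit date found in the two file names. The helper names and the summary file name in the example are hypothetical; only the enwiki-YYYYMMDD-pages-articles.xml pattern comes from this thread.

```python
import re
from typing import Optional

# An 8-digit run (YYYYMMDD) that is not part of a longer number.
_DATE = re.compile(r"(?<!\d)(\d{8})(?!\d)")


def snapshot_date(name: str) -> Optional[str]:
    """Return the first YYYYMMDD-looking token in a file name, or None."""
    match = _DATE.search(name)
    return match.group(1) if match else None


def same_snapshot(dump_name: str, summary_name: str) -> bool:
    """True only if both names carry a date and the dates are identical."""
    dump = snapshot_date(dump_name)
    summary = snapshot_date(summary_name)
    return dump is not None and dump == summary


if __name__ == "__main__":
    # The summary archive name here is an assumption, used for illustration only.
    print(same_snapshot("enwiki-20110722-pages-articles.xml", "en_20110722.tar.gz"))
```

This only compares names; it cannot tell whether the contents of the two files really belong together.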
This is exactly the key point!!
enwiki-20110722-pages-articles.xml is no longer available!!!
So we need to either rebuild the CSV file or find a valid matching pair of CSV summary and dump.
Could anyone provide one?
Did you get the enwiki-20110722-pages-articles.xml.bz2 file by any chance? If so, can you kindly share :)?
I also need a dump with its corresponding CSV summary.
It seems that I'm not the only one, and it would be very kind of someone to share a valid matching pair.
I finally managed to extract the CSV summaries of a recent Wikipedia dump (I don't remember the exact date of the dump…).
If someone needs it, I can upload it to an FTP server or an online service of your choice (9GB for the dump and 5.8GB (uncompressed) for the summaries).
Hi Muonique, that would be awesome if you could share the summaries! I have sent you some info via SF message and I can host publicly after receiving it.
No problem sharing, but I didn't receive your message.
Which Wikipedia dump did you extract the CSV summaries from? I don't mean the exact date, but is it a 2013 dump?
Also, what hardware resources did you need for extraction? I'm trying to get an idea of how long it would take to process any of the recent Wikipedia dumps and how big a Hadoop cluster I will need for this, what memory size for each node etc.
Thanks in advance.
I extracted the latest dump available in April 2013.
It took about 2 days on a single node (8-core Xeon processor) and a few hours on 30 nodes (4-core processors). Sending the data to each node and the reduce phase were the main bottlenecks on the grid.
It would be really great if you could share the files. I am working on a local test case that I have to present to people, and I am using the wiki dump and the CSV summary dump for it. However, I am not able to get the XML dump for either of the CSV dumps available here (http://heanet.dl.sourceforge.net/project/wikipedia-miner/data/)
It would be really cool if we made the effort to get it uploaded to SourceForge, as it would solve the problem for a lot of people around here. Please reply soon, as it's quite urgent.
I really need a CSV summary for a recent dump; if you have it, can you please share it with me? Or the enwiki-20110722-pages-articles.xml file: I couldn't find it anywhere, and I really need this...
I am trying to make the system work, but the embarrassing parts are:
Are any dumps and their corresponding summaries available for sharing?