I am using the script extractWikipedia.pl to process the enwiki-20100130-pages-articles.xml.bz2 Wikipedia dump.
I am running Perl 5.10 on 64-bit Linux with 8 GB of RAM, but I still run into memory problems.
While extracting the core summaries from the dump file, the script used up nearly all of the memory (about 7.4 GB) and seemed to stall at 34.1%.
I changed the number of passes to 8, but it didn't help (a rough sketch of what I did is below).
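For reference, this is roughly what I am doing; the invocation and the $passes setting shown here are just how things look in my copy of the script, so they may differ in other versions:

    # how I run the extraction:
    #   ./extractWikipedia.pl <path to the dump data>
    # and the change I made inside extractWikipedia.pl:
    my $passes = 8 ;   # raised from the default, hoping fewer articles are held in memory per pass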
Sorry, a lot of people are having this problem. It is most likely caused by a regular expression that happily worked over gigabytes of text in earlier dumps but has started failing on the more recent Wikipedia dumps. There is a thread about a new version of the toolkit that does the extraction in Java instead of Perl (so it avoids this bug) here.
Thank you for your reply, I'll try it.
Do I have to install svn to download the new version of the toolkit?
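(If so, I assume getting it would be something along these lines; the repository URL below is just a placeholder, since I don't know the real one:)

    svn checkout https://<project-repository-url>/trunk toolkit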