out of memory problem

  • Tassadar

    Tassadar - 2010-10-21

    Hi all,
    I used the script extractWikipedia.pl to process the enwiki-20100130-pages-articles.xml.bz2 version of the Wikipedia data.
    I'm running Perl 5.10 on 64-bit Linux with 8 GB of RAM, but I still seem to run into memory problems.
    While extracting core summaries from the dump file, the script used up nearly all the memory (about 7.4 GB) and appeared to stall at 34.1%.
    I changed the number of passes to 8, but that didn't help.
    Any ideas?

  • David Milne

    David Milne - 2010-10-21


    Sorry, a lot of people are having this problem. It is likely caused by some regular expression that previously worked happily over gigabytes of text but has started failing on the later Wikipedia dumps. There is a thread here about a new version of the toolkit that does the extraction in Java instead of Perl (and so avoids this bug).

    - Dave

  • Tassadar

    Tassadar - 2010-10-21

    Thank you for your reply, I'll try it.
    Do I have to install SVN to download the new version of the toolkit?
