Getting started

Tomaz Solc

Installation

  • Download and unpack the Wikiprep distribution

    $ git clone http://code.zemanta.com/tsolc/git/wikiprep
    $ cd wikiprep
    
  • See the included README file for the list of required Perl packages. Install them either through your distribution's package manager or by downloading them directly from CPAN. Also check the hardware requirements listed there.

    $ less README
    
  • Compile the splitwiki utility

    $ cd tools/splitwiki
    $ make
    $ cd ../..
    
  • To make sure everything works correctly on your machine, run the unit tests by running make from the top-level directory.

    $ make
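
Before running the unit tests it can help to confirm that the Perl packages named in the README are actually installed. The following is a minimal sketch, not part of Wikiprep: check_perl_modules is a hypothetical helper, and the module names in the usage comment are placeholders, so substitute the list from the README.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with Wikiprep): report which of the
# given Perl modules can be loaded by the perl found on PATH.
# Returns non-zero if at least one module is missing.
check_perl_modules() {
    missing=0
    for mod in "$@"; do
        # "perl -MModule -e 1" exits non-zero when the module cannot be loaded.
        if perl -M"$mod" -e 1 >/dev/null 2>&1; then
            echo "ok       $mod"
        else
            echo "MISSING  $mod"
            missing=1
        fi
    done
    return "$missing"
}

# Usage (placeholder module names; take the real list from README):
#   check_perl_modules Getopt::Long File::Basename
```

Any module reported as MISSING can then be installed through the package manager or from CPAN, as described above.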
    

Processing a Wikipedia dump

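  • Download the latest English Wikipedia pages-articles.xml dump from Wikimedia. The commands below use the 20090920 dump as an example; substitute the date of the dump you download.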

    $ wget http://download.wikimedia.org/enwiki/20090920/enwiki-20090920-pages-articles.xml.bz2
    
  • Split the dump for parallel processing. For best results, set the first parameter of splitwiki (4 in the example below) to the number of physical processors in your computer.

    $ mkdir work
    $ bzcat enwiki-20090920-pages-articles.xml.bz2 | tools/splitwiki/splitwiki 4 work/enwiki-20090920-pages-articles.xml
    
  • Run Wikiprep. Be prepared to wait a day or two for it to finish, depending on your machine.

    $ perl wikiprep.pl -parallel -format composite -compress -f work/enwiki-20090920-pages-articles.xml.0000.gz
    
  • Wikiprep makes two passes through the dump, named prescan and transform. After the prescan is finished, you will see the number of articles, templates and redirects that were recognized in the dump, and a more detailed progress counter will start ticking.

  • After Wikiprep has finished, you should have a number of files in the work/ directory. These files contain the results of processing the dump. Refer to the [File formats] page for details on what they contain.
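
The split step above hard-codes 4 as the number of chunks. As a rough sketch, the count can be derived from the machine instead. Both functions below are hypothetical helpers, not part of Wikiprep. The sketch assumes lscpu (util-linux) is available for counting physical cores and otherwise falls back to the logical CPU count reported by getconf, which may include hyperthreads.

```shell
#!/bin/sh
# Hypothetical helper: print the number of physical cores.
# Falls back to the logical CPU count, and finally to 1.
physical_cores() {
    n=""
    if command -v lscpu >/dev/null 2>&1; then
        # Each unique (core, socket) pair is one physical core;
        # lines starting with '#' are lscpu's header comments.
        n=$(lscpu -p=Core,Socket 2>/dev/null | sed '/^#/d' | sort -u | wc -l | tr -d ' ')
    fi
    if [ -z "$n" ] || [ "$n" -lt 1 ]; then
        n=$(getconf _NPROCESSORS_ONLN 2>/dev/null) || n=1
    fi
    echo "$n"
}

# Hypothetical helper: split a bzip2-compressed dump into one chunk per core.
# $1 = path to the .xml.bz2 dump, $2 = output directory.
# Must be run from the top of the Wikiprep source tree, after splitwiki is built.
split_dump() {
    dump=$1
    outdir=$2
    mkdir -p "$outdir"
    bzcat "$dump" | tools/splitwiki/splitwiki "$(physical_cores)" \
        "$outdir/$(basename "$dump" .bz2)"
}

# Usage:
#   split_dump enwiki-20090920-pages-articles.xml.bz2 work
```

On a four-core machine the usage line runs the same splitwiki command as shown in the split step.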

