Download and unpack the Wikiprep distribution
$ git clone http://code.zemanta.com/tsolc/git/wikiprep
$ cd wikiprep
See the included README file for a list of required Perl packages. Install those packages either through your distribution's package manager or by downloading them directly from CPAN. Also have a look at the hardware requirements.
$ less README
Compile the splitwiki utility
$ cd tools/splitwiki
$ make
$ cd ../..
To make sure everything works correctly on your machine, run the unit tests by invoking make from the top directory.
$ make
Download the latest English Wikipedia pages-articles.xml dump from Wikimedia
$ wget http://download.wikimedia.org/enwiki/20090920/enwiki-20090920-pages-articles.xml.bz2
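A download of this size can be truncated or corrupted in transit, so it may be worth testing the archive before splitting it. The following is a minimal sketch, assuming the bzip2 command-line tool is installed (its -t flag tests archive integrity without writing any output). The name check_dump is a hypothetical helper, not part of Wikiprep.

```shell
# Hypothetical helper: test a .bz2 archive for corruption.
# bzip2 -t decompresses without writing output and exits non-zero on damage.
check_dump() {
    if bzip2 -t "$1" 2>/dev/null; then
        echo "OK: $1"
    else
        echo "CORRUPT or unreadable: $1" >&2
        return 1
    fi
}
```

It could be called as check_dump enwiki-20090920-pages-articles.xml.bz2. Expect the test to take a while on a full dump, since the whole archive has to be decompressed once.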
Split the dump for parallel processing. For best results, set the first parameter of splitwiki to be equal to the number of physical processors on your computer.
$ mkdir work
$ bzcat enwiki-20090920-pages-articles.xml.bz2 | tools/splitwiki/splitwiki 4 work/enwiki-20090920-pages-articles.xml
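The split count can also be looked up on the machine instead of being hard-coded. The following is a minimal sketch, assuming nproc (GNU coreutils) or getconf _NPROCESSORS_ONLN is available. The name split_count is a hypothetical helper, not part of Wikiprep. Both tools report logical processors, so on machines with hyperthreading the result may be higher than the number of physical processors recommended above.

```shell
# Hypothetical helper: suggest a value for the first splitwiki parameter.
# Reports logical CPUs; falls back to 1 if neither tool is available.
split_count() {
    if command -v nproc >/dev/null 2>&1; then
        nproc
    else
        getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1
    fi
}

N=$(split_count)
echo "Suggested splitwiki parameter: $N"
```

The value in $N can then be passed to splitwiki in place of the literal 4 used above.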
Run Wikiprep and hope for the best. Be prepared to wait for a day or two to finish, depending on your machine.
$ perl wikiprep.pl -parallel -format composite -compress -f work/enwiki-20090920-pages-articles.xml.0000.gz
Wikiprep will make two passes through the dump, named prescan and transform. After prescan is finished, you will see the number of articles, templates and redirects that were recognized in the dump, and a more detailed progress counter will start ticking.
After Wikiprep has finished, you should have a number of files in the work/ directory. These files contain the results of processing the dump. Refer to the [File formats] page for details on what they contain.