Main Page

What is Wikiprep

Wikiprep is short for Wikipedia preprocessor and information extractor.

It is a Perl script that parses MediaWiki data dumps in XML format and extracts useful information from them. It implements a subset of MediaWiki syntax (template inclusion with parameters, internal and external links, headings, redirects, and so on). The output consists of several files: some in a simple, line-oriented format and some in XML. One of the files also contains the processed Wikipedia pages in a simple HTML-like syntax.
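To give a rough feel for this kind of processing, here is a minimal Python sketch. It is not Wikiprep's code and does not reproduce its output formats. It reads a tiny hand-written stand-in for a MediaWiki XML dump, detects redirects, pulls out internal link targets, and prints one tab-separated line per page. The sample data, the function names and the line layout are all assumptions made for illustration. Real dumps also carry an XML namespace and many more elements, and real wikitext has far more corner cases than these two regular expressions handle.

```python
import re
import xml.etree.ElementTree as ET

# Hand-written stand-in for a dump (assumption: real dumps add an XML
# namespace and many more elements per page).
SAMPLE_DUMP = """<mediawiki>
  <page>
    <title>Perl</title>
    <revision>
      <text>'''Perl''' is a [[programming language|language]] created by [[Larry Wall]].</text>
    </revision>
  </page>
  <page>
    <title>Perl language</title>
    <revision>
      <text>#REDIRECT [[Perl]]</text>
    </revision>
  </page>
</mediawiki>"""

# [[Target]], [[Target|label]] or [[Target#section|label]]; only Target is captured.
LINK_RE = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")
# A redirect page starts with "#REDIRECT [[Target]]".
REDIRECT_RE = re.compile(r"^\s*#REDIRECT\s*\[\[([^\[\]|#]+)", re.IGNORECASE)


def extract_pages(xml_text):
    """Return one dict per page with its title, redirect target and internal links."""
    root = ET.fromstring(xml_text)
    pages = []
    for page in root.iter("page"):
        title = page.findtext("title")
        text = page.findtext("revision/text") or ""
        redirect = REDIRECT_RE.match(text)
        if redirect:
            pages.append({"title": title, "redirect": redirect.group(1).strip(), "links": []})
        else:
            links = [m.group(1).strip() for m in LINK_RE.finditer(text)]
            pages.append({"title": title, "redirect": None, "links": links})
    return pages


def to_lines(pages):
    """Render pages in a made-up line-oriented format: title, record kind, payload."""
    lines = []
    for p in pages:
        if p["redirect"]:
            lines.append(f"{p['title']}\tREDIRECT\t{p['redirect']}")
        else:
            lines.append(f"{p['title']}\tLINKS\t{', '.join(p['links'])}")
    return lines


if __name__ == "__main__":
    for line in to_lines(extract_pages(SAMPLE_DUMP)):
        print(line)
```

A downstream tool can then work from lines like these without knowing anything about wikitext.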

The goal of Wikiprep is to convert Wikipedia data dumps into a format that other tools can process easily. Those tools then do not need to know all the quirks and odd corners of MediaWiki syntax.

Wikiprep was initially developed by Evgeniy Gabrilovich.

Available versions

There are two distinct versions available:

Chris Jordan's version

As of March 2009, this version cannot process the latest Wikipedia dumps provided by Wikimedia.

This is the original Wikiprep code with some minor modifications.

Chris maintains the CVS repository on SourceForge.net. You can get his code by following these instructions.

Zemanta's Wikiprep

This is the version of Wikiprep that Zemanta uses to extract semantic information from Wikipedia. It is based on the original Wikiprep, but it is heavily modified and extracts different information from the dumps than Chris' version.

This version is currently maintained by Tomaž Šolc and is kept up to date to support the latest Wikipedia dumps. Problems are usually resolved within a week of a new English Wikipedia dump becoming available.

You can get this version of Wikiprep from a git repository:

http://code.zemanta.com/tsolc/git/wikiprep

The simplest way is to use a command like this:

$ git clone http://code.zemanta.com/tsolc/git/wikiprep

Refer to the README file for further instructions.

Documentation

Mailing list

We have a mailing list for announcements and general discussion. This is the best place to ask questions or send bug reports and feature requests.