File formats

Tomaz Solc

Introduction

This section provides the background needed to understand the file format descriptions.

Page identifiers

Each page on Wikipedia (articles, templates, redirects, categories, etc. are all pages in this regard) has two identifiers that can be used to uniquely identify it: page title and page ID.

A page title is a Unicode string. The mapping from page titles to pages is many-to-one and relatively complicated (i.e. multiple page titles can refer to the same page). For example, "Arnold Schwarzenegger", "arnold Schwarzenegger", "Arnold_Schwarzenegger" and "Arnold__Schwarzenegger" can all be used to refer to the same page.

To remove this complication, Wikiprep uses normalized page titles. A page has one and only one normalized page title (for all the gory details about title normalization, see the Wikiprep::Namespace::normalizeNamespaceTitle() function). In the following text, wherever a "page title" is mentioned, a normalized page title is meant.
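The idea of collapsing several spellings into one normalized title can be illustrated with a minimal Python sketch. It is not the logic of Wikiprep::Namespace::normalizeNamespaceTitle(); the function name normalize_title and its three rules (underscores treated as spaces, whitespace runs collapsed, first letter uppercased) are assumptions chosen to cover the example variants above, and namespace prefixes are ignored entirely.

```python
import re


def normalize_title(raw: str) -> str:
    """Simplified, hypothetical title normalization (not Wikiprep's exact rules)."""
    # Assumption: underscores and spaces are interchangeable in titles.
    title = raw.replace("_", " ")
    # Collapse runs of whitespace and trim both ends.
    title = re.sub(r"\s+", " ", title).strip()
    # Assumption: the first character is case-insensitive; store it uppercased.
    return title[:1].upper() + title[1:]


variants = [
    "Arnold Schwarzenegger",
    "arnold Schwarzenegger",
    "Arnold_Schwarzenegger",
    "Arnold__Schwarzenegger",
]

# All variants collapse to a single normalized title.
normalized = {normalize_title(v) for v in variants}
print(normalized)
```

Under these assumed rules every variant yields the same string, which is the property that lets a normalized title serve as a unique identifier.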

Page ID is simply an integer, unique to each page. This is the preferred way of identifying pages in Wikiprep as it is more efficient. Article categories, interwiki links and similar are all given in the form of page IDs.

Wikiprep deals with 3 different types of links.

The simplest is the internal or intrawiki link. This is the most common type of link and is created using the [[Foo]] syntax. The destination page may or may not exist (there are a lot of such dangling links on Wikipedia).

The second type is the interwiki link, created using the [[File:Foo.png]] or [[WoW:Foo]] syntax. Such links point to pages in another namespace or on another wiki altogether. Images are included in articles via interwiki links to the File: namespace, and the link anchor provides the image caption.

The third type is the external link. This link can point to any website. It is created explicitly using the [http://example.com foo] syntax or implicitly when a URL-like pattern is found in the text.
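The three link types can be told apart by their syntax. The following Python sketch shows one way to do that. It is an illustration only and does not reproduce Wikiprep's parser: the regular expressions, the classify_links function, and the KNOWN_PREFIXES set (limited here to the two prefixes from the examples above) are all assumptions, and real wikitext has many cases this ignores, such as nested links, templates, or punctuation directly after a URL.

```python
import re

# Hypothetical prefix list; a real wiki has many more namespaces and interwiki prefixes.
KNOWN_PREFIXES = {"file", "wow"}

# [[Target]] or [[Target|anchor text]]
WIKI_LINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")
# [http://example.com anchor text], but not a double-bracketed wiki link
BRACKETED_URL = re.compile(r"(?<!\[)\[(https?://[^\s\]]+)(?:\s+([^\]]*))?\](?!\])")
# URL-like pattern appearing directly in the text
BARE_URL = re.compile(r"https?://[^\s\[\]]+")


def classify_links(text):
    """Return (kind, target) pairs for the links found in text."""
    links = []
    for m in WIKI_LINK.finditer(text):
        target = m.group(1).strip()
        prefix, sep, _ = target.partition(":")
        is_prefixed = bool(sep) and prefix.strip().lower() in KNOWN_PREFIXES
        links.append(("interwiki" if is_prefixed else "internal", target))
    for m in BRACKETED_URL.finditer(text):
        links.append(("external", m.group(1)))
    # Blank out explicit external links so their URLs are not counted twice.
    remainder = BRACKETED_URL.sub(" ", text)
    for m in BARE_URL.finditer(remainder):
        links.append(("external", m.group(0)))
    return links


sample = (
    "See [[Foo]] and [[File:Foo.png|a caption]], [[WoW:Foo]], "
    "[http://example.com foo] or http://example.org/page for details."
)
links = classify_links(sample)
for kind, target in links:
    print(kind, target)
```

Matching prefixes against an explicit set, rather than treating any colon as a namespace separator, keeps ordinary titles that contain a colon classified as internal links.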

Composite format

This is the format of the files used by Zemanta's Wikiprep when used with the -format composite option.

| File name suffix | Short description | More |
|---|---|---|
| anchor_text | Extracted anchor texts for internal Wikipedia links and images. Contains source and target page ID, anchor location in page text, and the actual anchor text. | [Anchor text file] |
| disambig | Parsed disambiguation pages. Contains disambiguation page ID - list of disambiguated pages pairs. | [Disambiguation file] |
| gum.xml | Preprocessed content of pages in a simple XML format. For each page the following information is given: internal links and links to images, external links, categories, normalized title and preprocessed content. Content of the page is given in plain text with section headers and internal links marked up with XML. | [Gum file] |
| tmpl.xml | Mapping between template titles and page IDs in an XML format. | [Templates file] |

Legacy format

This is the format of the files used by Chris' Wikiprep and Zemanta's Wikiprep when used with the -format legacy option.

| File name suffix | Short description | More |
|---|---|---|
| anchor_text | Extracted anchor texts for internal Wikipedia links and images. Contains source and target page ID, anchor location in page text, and the actual anchor text. | [Anchor text file] |
| cat_hier | Wikipedia category hierarchy. Contains parent - descendant category pairs. | [Category hierarchy file] |
| disambig | Parsed disambiguation pages. Contains disambiguation page ID - list of disambiguated pages pairs. | [Disambiguation file] |
| external_anchor | Extracted anchor texts for external links. Contains source page ID, destination URL and the actual anchor text. | [External links file] |
| hgw.xml | Preprocessed content of pages in a simple XML format. For each page the following information is given: internal links and links to images, external links, categories, normalized title and preprocessed content. Content of the page is given in plain text with section headers and internal links marked up with XML. | [Hogwarts file] |
| local.xml | Information about images linked from pages. Contains image ID - image file name pairs. | [Local pages file] |
| redir.xml | Information about redirects. Contains source page ID - destination page ID pairs. | [Redirects file] |
| related_links | Information about related links. | [Related pages file] |
| stat.categories | Category statistics. | [Statistics files] |
| stat.inlinks | Internal link statistics. | [Statistics files] |
| templates | Information about template transclusions. Contains source page ID and template parameters for all template transclusions and an index of recognized templates. | [Templates directory] |
