Gum file

Tomaz Solc

This file holds most of the information that Wikiprep extracts from the dump. The file format is similar to that of the original pages-articles XML dump.

<gum>
  <page id="39"

Each <page> tag describes one Wikipedia article. The id attribute gives the page ID.

        timestamp="2009-09-12T10:02:47Z"

Time of the last edit (makes it possible to locate this exact version of the page on the live Wikipedia).

        orglength="20740"

Original length of the article text (including the original MediaWiki syntax).

        newlength="16794"

Length of the article text after Wikiprep processing.

        stub="0"

Whether this article was recognized as a stub.

        disambig="0"

Whether this article was recognized as a disambiguation page.

        category="0"

Whether this article was recognized as a category.

        image="0">

Whether this article was recognized as an image.
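The attributes described above can be read into typed values with a few lines of Python. This is a minimal sketch, not part of Wikiprep: the function name parse_page_attributes is made up for illustration, and the code assumes that the four flags are always written as "0" or "1" (only "0" appears in the example).

```python
import xml.etree.ElementTree as ET

# Self-closing sample that reuses the attribute values from the example above.
SAMPLE_PAGE = (
    '<page id="39" timestamp="2009-09-12T10:02:47Z" orglength="20740" '
    'newlength="16794" stub="0" disambig="0" category="0" image="0"/>'
)

FLAG_ATTRS = ("stub", "disambig", "category", "image")


def parse_page_attributes(page):
    """Return the <page> attributes as a dict of typed Python values."""
    info = {
        "id": int(page.get("id")),
        "timestamp": page.get("timestamp"),  # kept as the raw string
        "orglength": int(page.get("orglength")),
        "newlength": int(page.get("newlength")),
    }
    # Assumption: a flag is set when its value is exactly "1".
    for name in FLAG_ATTRS:
        info[name] = page.get(name) == "1"
    return info


page_info = parse_page_attributes(ET.fromstring(SAMPLE_PAGE))
```

With the sample values, page_info["id"] is the integer 39 and all four flags are False.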

  <title>Albedo</title>

Normalized page title.

  <categories>8375611 798401</categories>

Space-separated list of page IDs of Wikipedia categories this article belongs to.

  <links>26751 41644 679294 9426 51331</links>

Space-separated list of all page IDs this article contains links to.

  <related>5777979 407233 224310</related>

Space-separated list of related page IDs. Related articles are recognized by a relatively complicated heuristic algorithm that looks for sections such as "See also" and "Further reading". Pages in this list usually have a stronger semantic connection to the article than other linked pages do.
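The <categories>, <links> and <related> tags share the same space-separated format, so one helper can read all three. The sketch below is illustrative only; id_list is a hypothetical name, and the sample repeats the values shown above.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<page id="39">
  <categories>8375611 798401</categories>
  <links>26751 41644 679294 9426 51331</links>
  <related>5777979 407233 224310</related>
</page>"""


def id_list(page, tag):
    """Return the page IDs listed in the child element `tag` as integers."""
    # findtext() returns the default when the element is missing;
    # `or ""` is an extra guard for an element without text.
    text = page.findtext(tag, default="") or ""
    return [int(token) for token in text.split()]


page = ET.fromstring(SAMPLE)
categories = id_list(page, "categories")
links = id_list(page, "links")
related = id_list(page, "related")
```

A missing or empty tag yields an empty list, so callers do not need a separate check.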

  <external>
    <link url="http://eetd.lbl.gov/HeatIsland/Pavements/Albedo/">Pavement Albedo</link>
  </external>

The <external> tag contains a list of all external URLs that have been extracted from the article text.

Each <link> tag corresponds to one external link (either explicit, through [http://...] markup, or implicit) and contains the target URL and the link anchor (if any).

  <interwiki>
    <link namespace="File" loc="0" title="Albedo-e hg.svg">Percentage of diffusely reflected sun light</link>
  </interwiki>

The <interwiki> tag contains a list of all interwiki links that appear in the article. For the purpose of this list, links to images are considered interwiki links.

Each <link> tag corresponds to one interwiki link. It specifies the namespace of the link, its location (measured in characters from the beginning of the article text), the title of the destination article and the link anchor (or, in the case of image links, the image caption).
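As an illustration of the interwiki list, the following sketch collects the links into named tuples and selects the image links by namespace. The InterwikiLink type and the interwiki_links function are made-up names, and the defaults used for missing attributes are an assumption rather than documented behaviour.

```python
import xml.etree.ElementTree as ET
from typing import NamedTuple


class InterwikiLink(NamedTuple):
    namespace: str
    loc: int      # character offset from the start of the article text
    title: str
    anchor: str   # link anchor, or the caption for image links


SAMPLE = """<page id="39">
  <interwiki>
    <link namespace="File" loc="0" title="Albedo-e hg.svg">Percentage of diffusely reflected sun light</link>
  </interwiki>
</page>"""


def interwiki_links(page):
    """Return every <link> under <interwiki> as an InterwikiLink."""
    result = []
    for link in page.findall("./interwiki/link"):
        result.append(InterwikiLink(
            namespace=link.get("namespace", ""),
            loc=int(link.get("loc", "0")),  # assumed default when absent
            title=link.get("title", ""),
            anchor=link.text or "",
        ))
    return result


page = ET.fromstring(SAMPLE)
images = [link for link in interwiki_links(page) if link.namespace == "File"]
```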

  <templates>

The <templates> tag contains information about all templates that were included in the page.

    <template id="693634">

Each <template> tag corresponds to one template that was visited when parsing this page. The id attribute gives the page ID of the template.

      <incl>
        <param name="1">Further information: [[Cloudalbedo]]</param>
      </incl>
    </template>

Each <incl> tag corresponds to one invocation of the template. The parameters of the invocation are given in <param> tags, each of which gives the name and value of one parameter. Note that parameter values are not parsed separately, so they may contain unexpanded invocations of other templates, parser functions or unparsed internal links.
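One way to work with the template data is to build a mapping from template page ID to a list of invocations, each invocation being a dictionary of parameters. This is a sketch under assumptions: template_invocations is a hypothetical helper, and the code merges repeated <template> tags with the same id in case they occur, which the format description does not state.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<page id="39">
  <templates>
    <template id="693634">
      <incl>
        <param name="1">Further information: [[Cloudalbedo]]</param>
      </incl>
    </template>
  </templates>
</page>"""


def template_invocations(page):
    """Map template page ID -> list of {parameter name: raw value} dicts."""
    result = {}
    for template in page.findall("./templates/template"):
        calls = result.setdefault(int(template.get("id")), [])
        for incl in template.findall("incl"):
            # Values are kept verbatim; they may still contain wiki markup.
            calls.append({p.get("name"): p.text or "" for p in incl.findall("param")})
    return result


invocations = template_invocations(ET.fromstring(SAMPLE))
```

Because the values are unparsed, a parameter such as "Further information: [[Cloudalbedo]]" still carries the internal link markup.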

  </templates>
  <text>

The <text> tag contains the content of the article in plain text. Tables, math sections, external links, boilerplate text and similar parts are removed.

Several XML tags can appear within the text.

    <a id="93085">Johann Heinrich Lambert</a>

Marks the exact location of an internal link, giving the destination page ID and the link anchor.

    <w namespace="File" title="Albedo-e hg.svg">Percentage of diffusely reflected sun light</w>

Marks the exact location of an interwiki link, giving the namespace and title of the destination page and the link anchor.

    <h1>Terrestrial albedo</h1>

Headings are marked with <h> tags of several levels (<h1>, <h2>, ...), as in HTML.

Note that since tables are removed, links that appear in infoboxes will not be visible in the marked-up text. Those links can only be found in the lists in the <links> and <interwiki> tags and in the [Anchor text file].
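The difference between the <links> list and the links marked in the text can be computed directly. The sketch below strips the inline tags to get plain text and lists the page IDs that appear in <links> but not as <a> tags in the text, which, per the note above, would include links from removed tables and infoboxes. The sample text and both function names are made up for illustration; only the <a> and <h1> elements are taken from the example.

```python
import xml.etree.ElementTree as ET

# Made-up sample: the sentence around the <a> tag is a placeholder.
SAMPLE = """<page id="39">
  <links>26751 93085</links>
  <text>See <a id="93085">Johann Heinrich Lambert</a>.
<h1>Terrestrial albedo</h1>
</text>
</page>"""


def plain_text(page):
    """Return the article text with all inline tags stripped."""
    text = page.find("text")
    return "" if text is None else "".join(text.itertext())


def links_outside_text(page):
    """Return IDs listed in <links> that are not marked with <a> in <text>."""
    listed = {int(tok) for tok in (page.findtext("links", default="") or "").split()}
    text = page.find("text")
    marked = set() if text is None else {int(a.get("id")) for a in text.iter("a")}
    return sorted(listed - marked)


page = ET.fromstring(SAMPLE)
body = plain_text(page)
hidden = links_outside_text(page)
```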

    </text>
  </page>
</gum>
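Since the gum file covers the whole dump, it is likely too large to load at once. A streaming parser can handle one <page> at a time instead. The following is a minimal sketch using iterparse from the Python standard library; iter_titles is a hypothetical name and the second page in the sample is a placeholder, not real data.

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for a gum file; the second page is a made-up placeholder.
SAMPLE = b"""<gum>
  <page id="39"><title>Albedo</title></page>
  <page id="0"><title>Placeholder page</title></page>
</gum>"""


def iter_titles(source):
    """Yield (page ID, title) for each <page>, one page at a time."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "page":
            yield int(elem.get("id")), elem.findtext("title", default="")
            # Release the children of the finished page; the emptied
            # <page> element itself stays attached to the root.
            elem.clear()


titles = list(iter_titles(io.BytesIO(SAMPLE)))
```

For a real file, an opened binary file object or a file name can be passed in place of the BytesIO object.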

Related

Wiki: Anchor text file
Wiki: Templates file