problem with new dump file

  • Anonymous - 2010-09-01

    I downloaded the newest version of the dump file: enwiki-20100817-pages-articles.xml.bz2.
    I installed the Parse::MediaWikiDump 1.0.4 perl module (on linux).
    I modified the code: replaced every '$pages->page' with '$pages->next'.
    But I got an error: "Could not locate root category. Please configure the script properly. …" (line: 913 in the code)
    I tried to debug, and in the page.csv there is no record with type 2. In an older page.csv file I found e.g. 690070,"Futurama",2
    So the problem is that extractWikipediaData doesn't find any page of category type (number 2), only 1, 3 or 4.

    Any ideas?
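    A quick way to check which page types actually occur is to count the last column of page.csv. This is just a diagnostic sketch (in Python, assuming the row format shown above, id,"title",type), not part of the toolkit:

```python
import csv
from collections import Counter

def count_page_types(path):
    """Count how often each page type (the last column) occurs in page.csv.

    Assumes rows of the form: id,"title",type  (e.g. 690070,"Futurama",2).
    """
    counts = Counter()
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            if len(row) >= 3:
                counts[row[-1]] += 1
    return counts

# Example: print the distribution; for a healthy extraction,
# type 2 (categories) should be present.
# print(count_page_types("page.csv"))
```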

  • Viktor Seifert

    Viktor Seifert - 2010-09-01

    Seems that the Wikipedia root category has changed.
    Try setting the variable '$root_category' in the script to 'Contents'.

    Greetings, Viktor

  • Anonymous - 2010-09-07

    Viktor, thanks for your reply.
    I think I know what the problem is. I have compared the new page.csv file with one of the older page.csv files.
    In the new page.csv file, the titles of categories start with "Category:" and the type is 1, not 2.
    e.g. "Contents"
    in the old page.csv file:
    in the new page.csv file:

    When the script is running it does not find any category, so the categorylink.csv file will be empty.
    How can I modify the script? What can be the solution?


  • Viktor Seifert

    Viktor Seifert - 2010-09-07

    Just open the script in a text editor (like Notepad or SciTE).
    Then search for $root_category. It's a variable that contains the name of the root category.
    Change the line from

    my $root_category = "Fundamental" ; # for enwiki

    to

    my $root_category = "Contents" ; # for enwiki

    Cheers, Viktor

  • Anonymous - 2010-09-17

    I have tried what you advised (changing the root_category), but as I wrote before, my problem is that the generated page.csv file itself is wrong.
    Let's look at an example term: Futurama
    In an older generated page.csv file you can see this:
    But in the newest page.csv file I found this:

    So I think the script cannot recognize the categories with the "Category:" prefix.
    Any idea?


  • Roderick Sprattling

    eReSZ, in the script I find the subroutine readPageSummaryFromCsv(), which contains the lines:

    if ($page_type == 2) {
        $pages_ns14{$page_title} = $page_id ;
    } else {
        $pages_ns0{$page_title} = $page_id ;
    }
    So as a first hack you might try testing for $page_type == 1 instead, so that the root page_type and root page_title line up. There may be other places in the code where the type constants and/or the title need updating.
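    For illustration only, here is that split re-rendered in Python, with the category type as a parameter so you can try 1 instead of 2. The names pages_ns14/pages_ns0 follow the Perl snippet; this is a sketch of the same logic, not the actual subroutine:

```python
import csv

def read_page_summary(path, category_type="2"):
    """Split page.csv rows into category pages (namespace 14) and other
    pages (namespace 0), mirroring the Perl snippet above.

    category_type is the value of the type column that marks a category;
    the hack amounts to passing "1" here instead of the default "2".
    """
    pages_ns14 = {}  # category title -> page id
    pages_ns0 = {}   # article title  -> page id
    with open(path, newline="", encoding="utf-8") as f:
        for page_id, page_title, page_type in csv.reader(f):
            if page_type == category_type:
                pages_ns14[page_title] = page_id
            else:
                pages_ns0[page_title] = page_id
    return pages_ns14, pages_ns0
```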

  • Anonymous - 2010-10-19

    Hi Roderick, thanks for your reply.
    Finally I found the problem.
    I changed the namespace regexp to this:

    while (defined (my $line = <DUMP>)) {
        $line =~ s/\s//g ;  # strip all whitespace, so the tags below match without spaces
        if ($line =~ m/<\/namespaces>/i) {
            last ;
        }
        if ($line =~ m/<namespacekey=\"([-]*\d+)\"case=\"first-letter\">(.*)<\/namespace>/i) {
            $namespaces{lc($2)} = $1 ;
            print "namespace: ".$2." key: ".$1."\n" ;
        }
        if ($line =~ m/<namespacekey=\"(\d+)\"case=\"first-letter\"\/>/i) {
            $namespaces{""} = $1 ;
            print "empty namespace key: ".$1."\n" ;
        }
    }

    And Viktor was right: the root_category changed to "Contents".
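    To sanity-check the namespace parsing against your own dump, here is a Python sketch of the same loop (whitespace stripped first, then the collapsed tags matched, as in the Perl above). The exact tag layout of the dump header is an assumption taken from the regexes in the post:

```python
import re

def parse_namespaces(lines):
    """Collect namespace keys from the <namespaces> header of a dump.

    Mirrors the Perl loop above: all whitespace is stripped first, so the
    patterns match collapsed tags like
    <namespacekey="14"case="first-letter">Category</namespace>.
    """
    namespaces = {}
    named = re.compile(
        r'<namespacekey="(-?\d+)"case="first-letter">(.*)</namespace>', re.I)
    empty = re.compile(r'<namespacekey="(\d+)"case="first-letter"/>', re.I)
    for line in lines:
        line = re.sub(r"\s", "", line)  # same as $line =~ s/\s//g
        if re.search(r"</namespaces>", line, re.I):
            break  # end of the namespace block
        m = named.search(line)
        if m:
            namespaces[m.group(2).lower()] = m.group(1)
            continue
        m = empty.search(line)
        if m:
            namespaces[""] = m.group(1)  # the main (article) namespace
    return namespaces
```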

  • Utkarsh Dubey

    Utkarsh Dubey - 2011-05-04

    Hi All,

    I downloaded the latest enwikisource-20110430-pages-articles.xml.bz2.
    It works fine until it reaches the state "summarizing generality".
    I followed all the steps mentioned above, but I am still getting the error message
    "Could not locate root category. Please configure the script properly. …"
    I have no clue what to do…
    Any help will be deeply appreciated.


  • brkgrnr

    brkgrnr - 2011-09-05

    Hi all,
    I would like to ask a question about root_category.
    Where do I find out which value this variable needs for the extractWikipediaData script to work?

    PS: I am working on the tr wikipedia

    Best regards,

