problem with new dump file

Anonymous
2010-09-01
2013-05-30

  • Anonymous
    2010-09-01

    Hi,
    I downloaded the newest version of the dump file: enwiki-20100817-pages-articles.xml.bz2.
    I installed the Parse::MediaWikiDump 1.0.4 Perl module (on Linux).
    I modified extractWikipediaData.pl, replacing every '$pages->page' with '$pages->next'.
    But I got an error: "Could not locate root category. Please configure the script properly. …" (line 913 in the code)
    I tried to debug it, and in page.csv there is no record of type 2. In the older page.csv file I found, e.g., 690070,"Futurama",2
    So the problem is that extractWikipediaData.pl doesn't find any page of the category type (number 2), only types 1, 3 or 4.

    Any ideas?
    Thanks,
    eReSZ

     
  • Viktor Seifert
    2010-09-01

    It seems that the Wikipedia root category has changed.
    Try setting the variable '$root_category' in extractWikipediaData.pl to 'Contents'.

    Greetings, Viktor

     

  • Anonymous
    2010-09-07

    Viktor, thanks for your reply.
    I think I know what the problem is. I compared the new page.csv file with one of the older page.csv files,
    and I saw that in the new page.csv file the category titles start with "Category:" and the type is 1, not 2.
    e.g. "Contents"
    in the old page.csv file:
    14105005,"Contents",2
    in the new page.csv file:
    14105005,"Category:Contents",1

    When extractWikipediaData.pl runs, it does not find any categories, so the categorylink.csv file ends up empty.
    How can I modify extractWikipediaData.pl? What could the solution be?

    Thanks,
    eReSZ
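
    One quick way to confirm this observation is to tally the type column of page.csv. The following is only a minimal sketch, assuming every line has the id,"title",type layout shown in the examples above; the helper name count_page_types is hypothetical and is not part of extractWikipediaData.pl.

    ```perl
    use strict;
    use warnings;

    # Tally the page types found in page.csv-style lines.
    # Assumes the id,"title",type layout from the examples above.
    sub count_page_types {
        my @lines = @_;
        my %count;
        for my $line (@lines) {
            # Greedy ".*" lets a title contain commas or quotes;
            # lines that do not fit the layout are skipped.
            if ($line =~ /^(\d+),"(.*)",(\d+)\s*$/) {
                $count{$3}++;
            }
        }
        return %count;
    }

    my %count = count_page_types(
        '14105005,"Category:Contents",1',
        '690070,"Category:Futurama",1',
    );
    print "type $_: $count{$_}\n" for sort { $a <=> $b } keys %count;
    ```

    If the tally contains no entry for type 2, no page was recognized as a category.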

     
  • Viktor Seifert
    2010-09-07

    Just open extractWikipediaData.pl in a text editor (like Notepad or SciTE).
    Then search for $root_category. It's the variable that contains the name of the root category.
    Change the line from

    my $root_category = "Fundamental" ; # for enwiki
    

    to

    my $root_category = "Contents" ; # for enwiki
    

    Cheers, Viktor

     

  • Anonymous
    2010-09-17

    I tried what you advised (changing root_category), but as I wrote before, my problem is that the generated page.csv file is wrong.
    Take an example term: Futurama
    In an older generated page.csv file you can see this:
    690070,"Futurama",2
    But in the newest page.csv file I found this:
    690070,"Category:Futurama",1

    So I think extractWikipediaData.pl cannot recognize categories with the "Category:" prefix.
    Any ideas?

    Thanks,
    eReSZ

     
  • eReSZ, in extractWikipediaData.pl I found the subroutine readPageSummaryFromCsv(), which contains these lines:

    if ($page_type == 2) {
        $pages_ns14{$page_title} = $page_id ;
    } else {
        $pages_ns0{$page_title} = $page_id ;
    }

    So, as a first hack, you might try testing for $page_type == 1 instead, to align the root page_type with the root page_title. There may be other places in the code where type constants and/or titles need updating.

     

  • Anonymous
    2010-10-19

    Hi roderick, thanks for your reply.
    I finally found the problem.
    I changed the namespace regexps to this:

    while (defined (my $line = <DUMP>)) {
        $line =~ s/\s//g ;  # strip all whitespace, so "namespace key" becomes "namespacekey"
        if ($line =~ m/<\/namespaces>/i) {
            last ;  # end of the namespace list
        }

        # named namespace: remember its key under the lowercased name
        if ($line =~ m/<namespacekey=\"([-]*\d+)\"case=\"first-letter\">(.*)<\/namespace>/i) {
            $namespaces{lc($2)} = $1 ;
            print "namespace: ".$2." key:".$1."\n";
        }

        # self-closing element: the unnamed (main) namespace
        if ($line =~ m/<namespacekey=\"(\d+)\"case=\"first-letter\"\/>/i) {
            $namespaces{""} = $1 ;
            print "empty namespace key:".$1."\n";
        }
    }
    

    And Viktor was right: the root category has changed to "Contents".
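
    The patterns above hard-code the case="first-letter" attribute, so they would presumably stop matching again if a later dump changes or reorders the attributes. As a hedged alternative, the sketch below accepts any attributes after key. The function name parse_namespace_line is hypothetical, and the sample lines are modeled on the patterns above rather than copied from a real dump.

    ```perl
    use strict;
    use warnings;

    # Extract (name, key) from one <namespace> line of the dump header.
    # Returns an empty list when the line is not a namespace element.
    sub parse_namespace_line {
        my ($line) = @_;
        # Named namespace, e.g. <namespace key="14" ...>Category</namespace>
        if ($line =~ m{<namespace\s+key="(-?\d+)"[^>]*>([^<]+)</namespace>}i) {
            return (lc $2, $1);
        }
        # Self-closing element without a name: <namespace key="0" ... />
        if ($line =~ m{<namespace\s+key="(-?\d+)"[^>]*/>}i) {
            return ('', $1);
        }
        return;
    }

    my %namespaces;
    for my $line (
        '<namespace key="-2" case="first-letter">Media</namespace>',
        '<namespace key="0" case="first-letter" />',
        '<namespace key="14" case="first-letter">Category</namespace>',
    ) {
        # The list assignment counts returned elements, so a
        # non-matching line (empty list) is skipped.
        my ($name, $key) = parse_namespace_line($line) or next;
        $namespaces{$name} = $key;
    }
    print "'$_' => $namespaces{$_}\n" for sort keys %namespaces;
    ```

    Unlike the loop above, this sketch does not strip whitespace first, so a name such as "User talk" keeps its space. Adjust that if the rest of the script expects names without spaces.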

     
  • Utkarsh Dubey
    2011-05-04

    Hi All,

    I downloaded the latest enwikisource-20110430-pages-articles.xml.bz2.
    It works fine until it reaches the "summarizing generality" stage.
    I followed all the steps mentioned above, but I still get the error message
    "Could not locate root category. Please configure the script properly. …"
    I have no clue what to do…
    Any help will be deeply appreciated.

    Thanks
    dutkarsh

     
  • brkgrnr
    2011-09-05

    Hi all,
    I would like to ask a question about root_category:
    where can I find out which value this variable needs for the extractWikipediaData script to work?

    PS: I am working on the Turkish (tr) Wikipedia.

    Best regards,
    B.Görener.