enwiki: "Could not identify root category"

Help
2011-07-25
2013-05-30
  • Hi, I've checked out and compiled the hadoopAndBerkeleyDb branch, and I'm trying to run the DumpExtractor in a standalone Hadoop configuration.

    My languages.xml is almost identical to what's described here: http://sourceforge.net/apps/mediawiki/wikipedia-miner/index.php?title=Language_configuration_for_extraction (it's actually the 'stock' one from the SVN branch).

    The processing goes well, but at one point I hit:

    11/07/22 18:37:09 INFO mapred.LocalJobRunner: reduce > reduce
    11/07/22 18:37:09 INFO mapred.LocalJobRunner: reduce > reduce
    11/07/22 18:37:09 INFO mapred.Task: Task 'attempt_local_0001_r_000000_0' done.
    11/07/22 18:37:09 INFO mapred.JobClient:  map 100% reduce 100%
    11/07/22 18:37:14 INFO mapred.JobClient: Job complete: job_local_0001
    11/07/22 18:37:15 INFO mapred.JobClient: Counters: 21
    11/07/22 18:37:15 INFO mapred.JobClient:   File Input Format Counters
    11/07/22 18:37:15 INFO mapred.JobClient:     Bytes Read=27795898617
    11/07/22 18:37:15 INFO mapred.JobClient:   File Output Format Counters
    11/07/22 18:37:15 INFO mapred.JobClient:     Bytes Written=279573127
    11/07/22 18:37:15 INFO mapred.JobClient:   org.wikipedia.miner.extraction.PageStep$Counter
    11/07/22 18:37:15 INFO mapred.JobClient:     articleCount=2983559
    11/07/22 18:37:15 INFO mapred.JobClient:     categoryCount=597401
    11/07/22 18:37:15 INFO mapred.JobClient:     redirectCount=4121634
    11/07/22 18:37:15 INFO mapred.JobClient:     disambiguationCount=127488
    11/07/22 18:37:15 INFO mapred.JobClient:   FileSystemCounters
    11/07/22 18:37:15 INFO mapred.JobClient:     FILE_BYTES_READ=11478330350550
    11/07/22 18:37:15 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=192578573235
    11/07/22 18:37:15 INFO mapred.JobClient:   Map-Reduce Framework
    11/07/22 18:37:15 INFO mapred.JobClient:     Map output materialized bytes=225101297
    11/07/22 18:37:15 INFO mapred.JobClient:     Map input records=9475180
    11/07/22 18:37:15 INFO mapred.JobClient:     Reduce shuffle bytes=0
    11/07/22 18:37:15 INFO mapred.JobClient:     Spilled Records=30912980
    11/07/22 18:37:15 INFO mapred.JobClient:     Map output bytes=209435773
    11/07/22 18:37:15 INFO mapred.JobClient:     Map input bytes=27571918745
    11/07/22 18:37:15 INFO mapred.JobClient:     SPLIT_RAW_BYTES=110970
    11/07/22 18:37:15 INFO mapred.JobClient:     Combine input records=0
    11/07/22 18:37:15 INFO mapred.JobClient:     Reduce input records=7830082
    11/07/22 18:37:15 INFO mapred.JobClient:     Reduce input groups=7830082
    11/07/22 18:37:15 INFO mapred.JobClient:     Combine output records=0
    11/07/22 18:37:15 INFO mapred.JobClient:     Reduce output records=7830082
    11/07/22 18:37:15 INFO mapred.JobClient:     Map output records=7830082
    Exception in thread "main" java.lang.Exception: Could not identify root category
            at org.wikipedia.miner.extraction.PageStep.updateStats(PageStep.java:87)
            at org.wikipedia.miner.extraction.DumpExtractor.run(DumpExtractor.java:247)
            at org.wikipedia.miner.extraction.DumpExtractor.main(DumpExtractor.java:94)
            at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
            at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
            at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
            at java.lang.reflect.Method.invoke(Method.java:597)
            at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
    

    Any ideas? I'm using the latest full Wikipedia enwiki XML dump.

    Thanks!

    Jp

  • Jason
    2012-06-07

    Same thing here…


  • Anonymous
    2012-09-20

    I'm facing the same problem here. Can anyone help me? Please…

  • Jason
    2012-09-23

    This is probably wrong for reasons I don't care to understand, but it basically works for me:

      <Language code="en" name="English" localName="English">
        <RootCategory>Fundamental categories</RootCategory>
        <DisambiguationCategory>Disambiguation_pages</DisambiguationCategory>
        <DisambiguationTemplate>Disambiguation_pages</DisambiguationTemplate>
        <DisambiguationTemplate>All_disambiguation_pages</DisambiguationTemplate>
        <DisambiguationTemplate>disambig</DisambiguationTemplate>
        <DisambiguationTemplate>Place_name_disambiguation_pages</DisambiguationTemplate>
        <DisambiguationTemplate>Human_name_disambiguation_pages</DisambiguationTemplate>
        <DisambiguationTemplate>Hospital_disambiguation_pages</DisambiguationTemplate>
        <DisambiguationTemplate>mathdab</DisambiguationTemplate>
        <DisambiguationTemplate>mountianindex</DisambiguationTemplate>
        <DisambiguationTemplate>Lists_of_ambiguous_numbers</DisambiguationTemplate>
        <DisambiguationTemplate>roaddis</DisambiguationTemplate>
        <DisambiguationTemplate>Educational_institution_disambiguation</DisambiguationTemplate>
        <DisambiguationTemplate>shipindex</DisambiguationTemplate>
        <DisambiguationTemplate>Places_of_worship_disambiguation_pages</DisambiguationTemplate>
        <DisambiguationTemplate>SIA</DisambiguationTemplate>
        <RedirectIdentifier>REDIRECT</RedirectIdentifier>
      </Language>
    
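    Since the error means the extractor couldn't find a page matching the configured &lt;RootCategory&gt; in the dump, one way to debug is to scan the dump yourself and see which category titles it actually contains, then make sure your languages.xml matches one of them. Below is a hedged, minimal sketch of that idea (the helper name and the inline sample XML are illustrative, not part of wikipedia-miner; a real pages-articles dump would need streaming parsing, e.g. ET.iterparse, rather than fromstring):

    ```python
    # Sketch: list titles in the Category namespace from MediaWiki dump XML,
    # to check what to put in <RootCategory>. Illustrative only.
    import xml.etree.ElementTree as ET

    def category_titles(xml_text, limit=10):
        """Return up to `limit` page titles found in the Category namespace."""
        root = ET.fromstring(xml_text)
        titles = []
        for title in root.iter("title"):
            if title.text and title.text.startswith("Category:"):
                titles.append(title.text[len("Category:"):])
                if len(titles) >= limit:
                    break
        return titles

    # Tiny stand-in for a real pages-articles dump (which has a namespace
    # declaration and is far too large to load with fromstring).
    sample = """<mediawiki>
      <page><title>Category:Fundamental categories</title></page>
      <page><title>Some article</title></page>
      <page><title>Category:Disambiguation pages</title></page>
    </mediawiki>"""

    print(category_titles(sample))
    ```

    If the title you configured (e.g. "Fundamental categories") never appears among the dump's category titles, the extractor has nothing to match and fails exactly as in the stack trace above.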
  • Duygu
    2012-09-23

    I have the same problem, and Jason's modification to languages.xml did not work for me. Any ideas? Btw, has anyone managed the dump extraction with a single-node Hadoop setup?