Extracting Portuguese Dump

Help
Jairo
2012-03-20
2013-06-25
  • Jairo
    Jairo
    2012-03-20

    Hi all,
    I've tried extracting the Portuguese dump with Toolkit 1.2 on a pseudo-distributed Hadoop cluster, but it fails at the categoryParent step.
    I don't know Hadoop well; the error output is as follows:

    12/03/19 19:42:36 INFO extraction.DumpExtractor: Starting categoryParent step
    12/03/19 19:42:36 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
    12/03/19 19:42:36 INFO mapred.JobClient: Cleaning up the staging area hdfs://localhost:9000/tmp/hadoop-notedlinks/mapred/staging/notedlinks/.staging/job_201203191744_0007
    Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input Pattern hdfs://localhost:9000/user/notedlinks/output/tempLabelSense/tempCategoryParent* matches 0 files
            at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:200)
            at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:211)
            at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:929)
            at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:921)
            at org.apache.hadoop.mapred.JobClient.access$500(JobClient.java:170)
            at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:838)
            at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:791)
            at java.security.AccessController.doPrivileged(Native Method)
            at javax.security.auth.Subject.doAs(Subject.java:416)
            at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
            at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:791)
            at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:765)
            at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1200)
            at org.wikipedia.miner.extraction.CategoryLinkSummaryStep.run(Unknown Source)
            at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
            at org.wikipedia.miner.extraction.DumpExtractor.run(Unknown Source)
            at org.wikipedia.miner.extraction.DumpExtractor.main(Unknown Source)
            at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
            at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
            at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
            at java.lang.reflect.Method.invoke(Method.java:616)
            at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

    I would appreciate your help.
    Thank you!

    Jairo

     
  • Hi,

    I had the same problem.
    Have you already figured out how to solve it?
    It seems the files were not created, but I cannot imagine why.
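    If it helps narrow this down: the "matches 0 files" error just means nothing matched the tempCategoryParent* glob, so the earlier step apparently never wrote those files. Below is a quick check (only a sketch; it assumes Hadoop is on the classpath and that your output directory matches the one in the log) that lists from Java whatever actually exists under output/tempLabelSense:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Diagnostic sketch: uses the default FileSystem from the Hadoop
    // configuration on the classpath and the same glob the failing job uses.
    public class CheckCategoryParentFiles {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            // Relative paths resolve against /user/<you> in HDFS; adjust if
            // your output directory differs from the one in the log above.
            FileStatus[] matches =
                    fs.globStatus(new Path("output/tempLabelSense/tempCategoryParent*"));

            if (matches == null || matches.length == 0) {
                System.out.println("No tempCategoryParent files found.");
            } else {
                for (FileStatus f : matches) {
                    System.out.println(f.getPath() + "  (" + f.getLen() + " bytes)");
                }
            }
        }
    }

    Running "hadoop fs -ls output/tempLabelSense" from the command line should show the same thing.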

     
  • damiano
    damiano
    2012-11-07

    Hi,
    Same problem here. Perhaps it's related to the language configuration in languages.xml?
    I'm attaching my configuration for Portuguese:

    <Language code="pt" name="Portuguese" localName="Português">
                   <RootCategory>Fundamental</RootCategory>
                   <DisambiguationCategory>desambiguação</DisambiguationCategory>
                   <DisambiguationTemplate>desambiguação</DisambiguationTemplate>
                   <RedirectIdentifier>REDIRECT</RedirectIdentifier>
                   <RedirectIdentifier>REDIRECIONAMENTO</RedirectIdentifier>
           </Language>
    

    By the way, I'm also going to use Spanish and Italian, with the following configurations:

    <Language code="it" name="Italian" localName="Italiano">
                   <RootCategory>Enciclopedia</RootCategory>
                   <DisambiguationCategory>disambigua</DisambiguationCategory>
                   <DisambiguationTemplate>disambigua</DisambiguationTemplate>
                    <RedirectIdentifier>REDIRECT</RedirectIdentifier>
                    <RedirectIdentifier>RINVIA</RedirectIdentifier>               
                   
           </Language>
    

         
       

       <Language code="es" name="Spanish" localName="Español">
                 
                    <RootCategory>Categorías</RootCategory>
                   
                    <DisambiguationCategory>desambiguación</DisambiguationCategory>
        
                    <DisambiguationTemplate>desambiguación</DisambiguationTemplate>
        
                    <RedirectIdentifier>REDIRECT</RedirectIdentifier>
                    <RedirectIdentifier>REDIRECCIÓN</RedirectIdentifier>             
                          
           </Language>
    

    Any help is appreciated!
    Damiano.

     
    • yubo
      yubo
      2013-06-25

      Hi,
      I have run into many problems using wikipedia-miner 1.2. Could you give me an introduction to applying it to Wikipedia (e.g. how to convert the dump to CSV, and how to build the database from the CSV files)?
      My email is: yubo.chen@nlpr.ia.ac.cn
      I look forward to your reply.
      Thanks

       
  • damiano
    damiano
    2012-11-14

    Hi again,
    I could successfully process the Spanish and Italian Wikipedia dumps with the configurations above, but I still have problems with the Portuguese dump…

    Regards,
    Damiano.

     
  • Felipe Hummel
    Felipe Hummel
    2013-02-28

    Was anyone able to figure this out? I got the same error while trying to extract the latest ptwiki (Portuguese) dump; the full trace is in my next post.

    
     
  • Felipe Hummel
    Felipe Hummel
    2013-02-28

    Pasting the error correctly this time:

    pageLink step completed in 00:09:03
    13/02/28 00:49:36 INFO extraction.DumpExtractor: Starting categoryParent step
    13/02/28 00:49:36 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
    13/02/28 00:49:36 INFO mapred.JobClient: Cleaning up the staging area hdfs://localhost:9000/tmp/hadoop-root/mapred/staging/root/.staging/job_201302272046_0014
    13/02/28 00:49:36 ERROR security.UserGroupInformation: PriviledgedActionException as:root cause:org.apache.hadoop.mapred.InvalidInputException: Input Pattern hdfs://localhost:9000/user/root/output/tempLabelSense/tempCategoryParent* matches 0 files
    Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input Pattern hdfs://localhost:9000/user/root/output/tempLabelSense/tempCategoryParent* matches 0 files
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
        at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:989)
        at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:981)
        at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:174)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:897)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:416)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:824)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1261)
        at org.wikipedia.miner.extraction.CategoryLinkSummaryStep.run(Unknown Source)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.wikipedia.miner.extraction.DumpExtractor.run(Unknown Source)
        at org.wikipedia.miner.extraction.DumpExtractor.main(Unknown Source)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:616)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
    
     
  • Felipe Hummel
    Felipe Hummel
    2013-03-01

    For anyone who gets the same error:

    The problem is in DumpLinkParser.java, at Line #26:

    tmp.append(namespace) ;
    

    You need to replace it with this:

    tmp.append(namespace.replaceAll("\\(", "\\\\(").replaceAll("\\)", "\\\\)")) ;
    

    The reason is that in the Portuguese Wikipedia two of the namespaces, Usuário(a) and Usuário(a) Discussão, contain parentheses. Since the namespaces are inserted into a regex, the "()" is treated as a capturing group rather than literal text, and that messes up the extraction logic at Line #59 (of DumpLinkParser.java).

    The regex would be something like this:

    (Especial||Predefinição|Livro|Predefinição Discussão|Anexo Discussão|Portal|Usuário(a)|Wikipédia|Ficheiro|Portal Discussão|Ajuda|Ficheiro Discussão|Categoria|Wikipédia Discussão|Discussão|Categoria Discussão|MediaWiki Discussão|Livro Discussão|Anexo|Ajuda Discussão|Multimédia|MediaWiki|Usuário(a) Discussão)\:(.*)

    The new code just escapes any parentheses.

    If anyone is still maintaining the code, please include this (or another more robust regex escaping solution) in the next release.
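
    In case it helps, a cleaner way to do the same escaping is to quote the whole namespace with java.util.regex.Pattern.quote instead of replacing characters one by one (just a sketch, I have not run it inside the toolkit):

    tmp.append(Pattern.quote(namespace)) ;

    And here is a small standalone demo of what goes wrong, using the Usuário(a) namespace from the regex above:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Demo (not the toolkit's code): a pattern shaped like (ns1|ns2|...)\:(.*)
    // fails on a namespace containing "()" unless the namespace is quoted,
    // because the parentheses become a capturing group instead of literals.
    public class NamespaceEscapeDemo {

        private static Pattern build(String namespace, boolean quoted) {
            String ns = quoted ? Pattern.quote(namespace) : namespace;
            return Pattern.compile("(" + ns + "|Categoria)\\:(.*)");
        }

        public static void main(String[] args) {
            String link = "Usuário(a):Exemplo";

            Matcher broken = build("Usuário(a)", false).matcher(link);
            System.out.println("unescaped matches: " + broken.matches()); // false

            Matcher quoted = build("Usuário(a)", true).matcher(link);
            System.out.println("quoted matches:    " + quoted.matches()); // true
            System.out.println("page title:        " + quoted.group(2));  // Exemplo
        }
    }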

    Felipe Hummel