Extracting Portuguese Dump

Forum: Help
Creator: Jairo
Created: 2012-03-20
Last updated: 2013-06-25
  • Jairo

    Jairo - 2012-03-20

    Hi all,
    I've tried extracting the Portuguese dump with Toolkit 1.2 on a pseudo-distributed Hadoop cluster, but it fails at the categoryParent step.
    I don't know Hadoop well; the error output is as follows:

    12/03/19 19:42:36 INFO extraction.DumpExtractor: Starting categoryParent step
    12/03/19 19:42:36 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
    12/03/19 19:42:36 INFO mapred.JobClient: Cleaning up the staging area hdfs://localhost:9000/tmp/hadoop-notedlinks/mapred/staging/notedlinks/.staging/job_201203191744_0007
    Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input Pattern hdfs://localhost:9000/user/notedlinks/output/tempLabelSense/tempCategoryParent* matches 0 files
            at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:200)
            at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:211)
            at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:929)
            at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:921)
            at org.apache.hadoop.mapred.JobClient.access$500(JobClient.java:170)
            at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:838)
            at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:791)
            at java.security.AccessController.doPrivileged(Native Method)
            at javax.security.auth.Subject.doAs(Subject.java:416)
            at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
            at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:791)
            at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:765)
            at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1200)
            at org.wikipedia.miner.extraction.CategoryLinkSummaryStep.run(Unknown Source)
            at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
            at org.wikipedia.miner.extraction.DumpExtractor.run(Unknown Source)
            at org.wikipedia.miner.extraction.DumpExtractor.main(Unknown Source)
            at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
            at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
            at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
            at java.lang.reflect.Method.invoke(Method.java:616)
            at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

    I would appreciate your help.
    Thank you!

    Jairo

     
  • Klessius Renato Berlt

    Hi,

    I had the same problem.
    Have you already figured out how to solve it?
    It seems the files were not created, but I cannot imagine why.

     
  • damiano

    damiano - 2012-11-07

    Hi,
    Same problem here. Perhaps it's related to the language configuration in languages.xml?
    I'm attaching my configuration for Portuguese:

    <Language code="pt" name="Portuguese" localName="Português">
                   <RootCategory>Fundamental</RootCategory>
                   <DisambiguationCategory>desambiguação</DisambiguationCategory>
                   <DisambiguationTemplate>desambiguação</DisambiguationTemplate>
                   <RedirectIdentifier>REDIRECT</RedirectIdentifier>
                   <RedirectIdentifier>REDIRECIONAMENTO</RedirectIdentifier>
           </Language>
    

    By the way, I'm also going to use Spanish and Italian, with the following configurations:

    <Language code="it" name="Italian" localName="Italiano">
                   <RootCategory>Enciclopedia</RootCategory>
                   <DisambiguationCategory>disambigua</DisambiguationCategory>
                   <DisambiguationTemplate>disambigua</DisambiguationTemplate>
                    <RedirectIdentifier>REDIRECT</RedirectIdentifier>
                    <RedirectIdentifier>RINVIA</RedirectIdentifier>               
                   
           </Language>
    

         
       

       <Language code="es" name="Spanish" localName="Español">
                 
                    <RootCategory>Categorías</RootCategory>
                   
                    <DisambiguationCategory>desambiguación</DisambiguationCategory>
        
                    <DisambiguationTemplate>desambiguación</DisambiguationTemplate>
        
                    <RedirectIdentifier>REDIRECT</RedirectIdentifier>
                    <RedirectIdentifier>REDIRECCIÓN</RedirectIdentifier>             
                          
           </Language>
    

    Any help is appreciated!
    Damiano.

     
    • yubo

      yubo - 2013-06-25

      Hi,
      I have run into many problems using wikipedia-miner 1.2. Could you give me an introduction to working with the Wikipedia data (e.g. how to convert the dump to CSV files, and how to build the database from those CSV files)?
      My email is: yubo.chen@nlpr.ia.ac.cn
      I look forward to your reply.
      Thanks.

       
  • damiano

    damiano - 2012-11-14

    Hi again,
    I could successfully process the Spanish and Italian Wikipedia dumps with the configurations above, but I still have problems with the Portuguese dump…

    Regards,
    Damiano.

     
  • Felipe Hummel

    Felipe Hummel - 2013-02-28

    Was anyone able to figure this out? I got the same error while trying to extract the latest ptwiki (Portuguese) dump.

    
     
  • Felipe Hummel

    Felipe Hummel - 2013-02-28

    Correctly pasting the error now:

    pageLink step completed in 00:09:03
    13/02/28 00:49:36 INFO extraction.DumpExtractor: Starting categoryParent step
    13/02/28 00:49:36 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
    13/02/28 00:49:36 INFO mapred.JobClient: Cleaning up the staging area hdfs://localhost:9000/tmp/hadoop-root/mapred/staging/root/.staging/job_201302272046_0014
    13/02/28 00:49:36 ERROR security.UserGroupInformation: PriviledgedActionException as:root cause:org.apache.hadoop.mapred.InvalidInputException: Input Pattern hdfs://localhost:9000/user/root/output/tempLabelSense/tempCategoryParent* matches 0 files
    Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input Pattern hdfs://localhost:9000/user/root/output/tempLabelSense/tempCategoryParent* matches 0 files
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
        at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:989)
        at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:981)
        at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:174)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:897)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:416)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:824)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1261)
        at org.wikipedia.miner.extraction.CategoryLinkSummaryStep.run(Unknown Source)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.wikipedia.miner.extraction.DumpExtractor.run(Unknown Source)
        at org.wikipedia.miner.extraction.DumpExtractor.main(Unknown Source)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:616)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
    
     
  • Felipe Hummel

    Felipe Hummel - 2013-03-01

    For anyone who gets the same error:

    The problem is in DumpLinkParser.java, at line 26:

    tmp.append(namespace) ;
    

    You need to replace it with this:

    tmp.append(namespace.replaceAll("\\(", "\\\\(").replaceAll("\\)", "\\\\)")) ;
    

    The reason is that in the Portuguese Wikipedia, two of the namespaces ("Usuário(a)" and "Usuário(a) Discussão") contain parentheses. Because the namespaces are inserted into a regex, the "()" is treated as a capturing group, and this messes up the extraction logic at line 59 of DumpLinkParser.java.

    The regex would be something like this:

    (Especial||Predefinição|Livro|Predefinição Discussão|Anexo Discussão|Portal|Usuário(a)|Wikipédia|Ficheiro|Portal Discussão|Ajuda|Ficheiro Discussão|Categoria|Wikipédia Discussão|Discussão|Categoria Discussão|MediaWiki Discussão|Livro Discussão|Anexo|Ajuda Discussão|Multimédia|MediaWiki|Usuário(a) Discussão)\:(.*)

    The new code just escapes any parentheses.
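
    To see the group-numbering problem in isolation, here is a small standalone sketch with a shortened, made-up namespace list (this is not the toolkit's actual code, just an illustration of the effect):

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class GroupShiftDemo {
        public static void main(String[] args) {
            String link = "Categoria:Ciência";

            // Unescaped: the "(a)" in "Usuário(a)" becomes an extra capturing group,
            // so the page title is no longer in group 2.
            Matcher bad = Pattern.compile("(Categoria|Usuário(a))\\:(.*)").matcher(link);
            if (bad.matches()) {
                System.out.println(bad.group(2)); // null  (the unintended inner group)
                System.out.println(bad.group(3)); // "Ciência" (shifted to group 3)
            }

            // Escaped: the groups are where the parsing code expects them.
            Matcher good = Pattern.compile("(Categoria|Usuário\\(a\\))\\:(.*)").matcher(link);
            if (good.matches()) {
                System.out.println(good.group(1)); // "Categoria"
                System.out.println(good.group(2)); // "Ciência"
            }
        }
    }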

    If anyone is still maintaining the code, please include this (or another more robust regex escaping solution) in the next release.
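
    For what it's worth, a more robust escaping could use java.util.regex.Pattern.quote() instead of replacing each metacharacter by hand. A minimal sketch, where the namespace array and the loop are only illustrative and not the toolkit's actual code:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class QuotedNamespaceDemo {
        public static void main(String[] args) {
            // Illustrative namespace list; the real one is read from the dump.
            String[] namespaces = {"Categoria", "Usuário(a)", "Wikipédia"};

            StringBuilder tmp = new StringBuilder("(");
            for (int i = 0; i < namespaces.length; i++) {
                if (i > 0) tmp.append('|');
                // Pattern.quote() escapes every regex metacharacter, not just "()".
                tmp.append(Pattern.quote(namespaces[i]));
            }
            tmp.append(")\\:(.*)");

            Matcher m = Pattern.compile(tmp.toString()).matcher("Usuário(a):Fulano");
            if (m.matches()) {
                System.out.println(m.group(1)); // "Usuário(a)"
                System.out.println(m.group(2)); // "Fulano"
            }
        }
    }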

    Felipe Hummel

     
