

Arabic Wikipedia dumps

selcuk
2012-02-10
2013-05-30
  • selcuk
    2012-02-10

    Hi,

    While trying to process the Arabic Wikipedia dumps, I encountered the exception below during the redirect step. Any idea how to solve it? I'm using a single-node Hadoop (0.20.203.0) cluster on a large Amazon instance running Ubuntu 10.04.

    Thanks,
    sk



    12/02/10 12:25:27 INFO mapred.JobClient:     Map output records=411649
    page step completed in 00:05:18
    12/02/10 12:25:31 INFO extraction.DumpExtractor: Starting redirect step
    12/02/10 12:25:31 INFO extraction.RedirectStep: Cached page file file:/home/ubuntu/hadoop-0.20.203.0/output/tempPage/tempPage-00000
    12/02/10 12:25:31 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
    12/02/10 12:25:31 WARN mapred.JobClient: No job jar file set.  User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
    12/02/10 12:25:31 INFO mapred.JobClient: Cleaning up the staging area file:/tmp/hadoop-ubuntu/mapred/staging/ubuntu1809552216/.staging/job_local_0002
    Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input Pattern file:/home/ubuntu/hadoop-0.20.203.0/output/tempPage/tempRedirect* matches 0 files
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:200)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:211)
    at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:929)
    at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:921)
    at org.apache.hadoop.mapred.JobClient.access$500(JobClient.java:170)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:838)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:791)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:791)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:765)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1200)
    at org.wikipedia.miner.extraction.RedirectStep.run(Unknown Source)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.wikipedia.miner.extraction.DumpExtractor.run(Unknown Source)
    at org.wikipedia.miner.extraction.DumpExtractor.main(Unknown Source)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

     
  • David Milne
    2012-02-26

    Can you attach the language configuration file you are using? Whatever you have specified for detecting redirects doesn't seem to be working.

    - Dave

     

  • Anonymous
    2012-03-19

    Hi all,

    I'm trying to generate the CSV files for Arabic, but I couldn't find the Arabic sentence model files. How can I generate the model files, or where can I get them?

    Here is my language configuration file:

    <Language code="ar" name="Arabic" localName="العربية">

    <RootCategory>التصنيف الرئيسي</RootCategory>

    <DisambiguationCategory>قوالب_توضيح</DisambiguationCategory>

    <DisambiguationTemplate>توضيح</DisambiguationTemplate>
    <DisambiguationTemplate>صفحة توضيح</DisambiguationTemplate>
    <DisambiguationTemplate>Disambig</DisambiguationTemplate>

    <RedirectIdentifier>تحويل</RedirectIdentifier>

    </Language>
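For context on what the RedirectIdentifier controls: redirect pages in the dump start with a line like #تحويل [[Target]] (Arabic dumps also accept the English #REDIRECT magic word). A minimal sketch of how such lines might be matched and the target extracted; the pattern and method names here are illustrative, not the library's actual code:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RedirectCheck {

    // Matches lines like "#تحويل [[Target]]" or "#REDIRECT [[Target]]".
    // "تحويل" is the <RedirectIdentifier> from the config above; REDIRECT
    // is the English magic word MediaWiki accepts on every wiki.
    static final Pattern REDIRECT = Pattern.compile(
            "#(تحويل|REDIRECT)[:\\s]*\\[\\[(.*?)\\]\\]",
            Pattern.CASE_INSENSITIVE);

    // Returns the redirect target, or null if the markup is not a redirect.
    static String redirectTarget(String markup) {
        Matcher m = REDIRECT.matcher(markup);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        System.out.println(redirectTarget("#تحويل [[القاهرة]]"));  // القاهرة
        System.out.println(redirectTarget("#REDIRECT [[Cairo]]")); // Cairo
    }
}
```

If the identifier in the config does not match what the dump actually uses, the redirect step produces no tempRedirect* output, which would explain the "matches 0 files" exception above.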

     
  • Jason
    2012-06-11

    Does the redirect identifier work for this configuration? 

     
  • adams
    2012-06-17

    One thing I found when running on Amazon was that you have to specify UTF-8 input/output encoding _everywhere_ in the DumpExtractor code, or it would not read/write valid Arabic characters. For example:

    BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(dfs.create(new Path(finalDir + "/" + filePrefix + ".csv")), "UTF-8")) ;
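The same applies on the read side. A self-contained sketch of the round trip (plain java.io byte streams standing in for the Hadoop FileSystem streams in the line above), showing that the Arabic text only survives when both the writer and the reader name UTF-8 explicitly instead of relying on the platform default charset:

```java
import java.io.*;

public class Utf8RoundTrip {

    // Write a string through an explicit UTF-8 writer, then read it back
    // through an explicit UTF-8 reader; a byte array stands in for the file.
    static String roundTrip(String text) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        BufferedWriter writer = new BufferedWriter(
                new OutputStreamWriter(bytes, "UTF-8"));
        writer.write(text);
        writer.close();

        BufferedReader reader = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(bytes.toByteArray()), "UTF-8"));
        String back = reader.readLine();
        reader.close();
        return back;
    }

    public static void main(String[] args) throws IOException {
        // Arabic survives only because both sides agree on the encoding.
        System.out.println(roundTrip("تحويل [[القاهرة]]"));
    }
}
```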
    
     
  • Alaa Alahmadi
    2013-02-20

    I would like to know if anyone has succeeded in extracting the CSV summaries for the Arabic dump files. If yes, can I have them?

    I am trying to extract them myself, but I get this error:

    Exception in thread "main" java.lang.IllegalArgumentException: Please specify a xml dump of wikipedia, a language.xml config file, a language code, an openNLP sentence detection model, and a writable output directory
    at org.wikipedia.miner.extraction.DumpExtractor.configure(Unknown Source)
    at org.wikipedia.miner.extraction.DumpExtractor.<init>(Unknown Source)
    at org.wikipedia.miner.extraction.DumpExtractor.main(Unknown Source)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:186)

    This happens even though I uploaded all of the files to the input directory.

    I also did not find an OpenNLP sentence detection model for Arabic, so I left that argument empty.
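The exception message itself lists the five arguments DumpExtractor expects, and leaving the sentence-model argument empty would trigger exactly this check. A hypothetical invocation for orientation only; every file name below is a placeholder, not a path from this thread:

```shell
# Placeholder jar and file names; substitute your own paths.
hadoop jar wikipedia-miner-hadoop.jar \
    arwiki-pages-articles.xml \
    languages.xml \
    ar \
    ar-sent.bin \
    output/
```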

    Could anyone help me?

    Alaa