Error when starting dump processing

Enrico
Created: 2012-04-16
Updated: 2013-05-30
  • Enrico
    2012-04-16

    Hi guys,
    I've been trying to process an Italian Wikipedia dump, but I get an error as soon as the process starts. Here's my setup, followed by the error I get:

    Windows 7 - 32 bit (3GB of RAM)
    Intel Core2 Duo P8600 @2.40GHz
    Wikipedia-Miner Toolkit, version 1.2
    Hadoop version 0.20.2, running under Cygwin

    12/04/16 09:48:42 INFO extraction.DumpExtractor: Extracting site info
    12/04/16 09:48:42 INFO extraction.DumpExtractor: Starting page step
    12/04/16 09:48:43 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
    12/04/16 09:48:43 INFO mapred.FileInputFormat: Total input paths to process : 1
    12/04/16 09:48:45 INFO mapred.JobClient: Running job: job_201204160942_0001
    12/04/16 09:48:46 INFO mapred.JobClient:  map 0% reduce 0%
    12/04/16 09:48:58 INFO mapred.JobClient: Task Id : attempt_201204160942_0001_m_000092_0, Status : FAILED
    java.io.IOException: Task process exit with nonzero status of 1.
            at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)
    12/04/16 09:48:58 WARN mapred.JobClient: Error reading task outputhttp://192.168.1.2:50060/tasklog?plaintext=true&taskid=attempt_201204160942_0001_m_000092_0&filter=stdout
    12/04/16 09:48:58 WARN mapred.JobClient: Error reading task outputhttp://192.168.1.2:50060/tasklog?plaintext=true&taskid=attempt_201204160942_0001_m_000092_0&filter=stderr
    12/04/16 09:49:04 INFO mapred.JobClient: Task Id : attempt_201204160942_0001_m_000092_1, Status : FAILED
    java.io.IOException: Task process exit with nonzero status of 1.
            at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)
    
    12/04/16 09:49:40 INFO mapred.JobClient: Job complete: job_201204160942_0001
    12/04/16 09:49:40 INFO mapred.JobClient: Counters: 0
    Exception in thread "main" java.io.IOException: Job failed!
            at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
            at org.wikipedia.miner.extraction.PageStep.run(Unknown Source)
            at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
            at org.wikipedia.miner.extraction.DumpExtractor.run(Unknown Source)
            at org.wikipedia.miner.extraction.DumpExtractor.main(Unknown Source)
            at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
            at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
            at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
            at java.lang.reflect.Method.invoke(Method.java:597)
            at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
    

    So I opened the logs of the failed attempts (under logs/userlogs/attempt_*), and here's what I get in stderr and stdout:

    Could not create the Java virtual machine.
    Error occurred during initialization of VM
    Could not reserve enough space for object heap
    

    It seems rather clear that the heap is the problem. After some Googling, here's what I came across and tried:

    1. Modify conf/hadoop-env.sh and set HADOOP_HEAPSIZE -> I tried different values between 200 and 1000, but nothing changed (Hadoop doesn't even start if I set a value higher than 1000)
    2. Modify conf/mapred-site.xml and set the property mapred.child.java.opts -> I tried different values between -Xmx128m and -Xmx1024m, and again nothing changed (see the snippets below for what these edits looked like)
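
    For reference, the edits looked roughly like this (512 is just one of the values I tried):

    # conf/hadoop-env.sh: maximum heap, in MB, for the Hadoop daemons
    export HADOOP_HEAPSIZE=512

    <!-- conf/mapred-site.xml: max heap for each spawned map/reduce task JVM -->
    <property>
      <name>mapred.child.java.opts</name>
      <value>-Xmx512m</value>
    </property>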

    Is it possible that I need to free some space on the drive? The dump is 5.67 GB and I have 17.8 GB of free space; shouldn't that be enough?
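
    (From what I've been reading, that "Could not reserve enough space for object heap" message is about the JVM failing to reserve contiguous virtual address space, not about disk space, which would also explain why it refuses values above 1000 on a 32-bit machine. A quick way to check, outside of Hadoop entirely, is to ask the JVM for a big heap directly:)

    # If this fails with the same message, the limit is the 32-bit JVM
    # itself, not Hadoop's configuration
    java -Xmx1200m -version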

    I really don't know what to do... I hope you guys can give me some help! :)
    Cheers!

    PS: I first tried Hadoop version 1.0.1, but it seems to freeze right after starting the job (the last messages are "Running job: job_201204161001_0001" and "map 0% reduce 0%", and then nothing; the JobTracker doesn't even show a running job), so I downgraded to version 0.20.2, which I read somewhere (I don't remember where) is a better choice for processing big files.

     
  • Enrico
    2012-04-17

    Never mind, I got it to work!

    I used a different machine (64-bit rather than 32-bit, no other big difference), the same Hadoop version, and the same Cygwin configuration. The only things I did differently were:
    - not tweaking the file wikipedia-template.xml in the Wikipedia Miner configs before building wikipedia-miner-hadoop.jar (though I didn't touch anything that would affect memory usage)
    - running the sshd server on Cygwin with "cygrunsrv --start sshd" rather than just "ssh localhost" (see the sketch after this list)
    - using JDK 7u3 rather than 6u31
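
    For completeness, the sshd part looked roughly like this (a minimal sketch; it assumes the sshd service was created with ssh-host-config using the default answers):

    # One-time setup: create the sshd Windows service
    ssh-host-config -y

    # Start sshd as a service instead of an interactive "ssh localhost"
    cygrunsrv --start sshd

    # Quick check that the daemon answers before launching Hadoop
    ssh localhost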

    I haven't had the chance to try this on my machine yet, so I don't know what really made the difference... I'll post an update if I manage to make it work there!