On a dataset with ~40k entries, a custom MapReduce job has 5 of 14 map tasks killed with:
java.io.IOException: pipe child exception
Caused by: java.net.SocketException: Broken pipe
at java.net.SocketOutputStream.socketWrite0(Native Method)
The job completes without error if I use only the first 100 entries. The data is tab-separated text values; some fields are JSON-serialized.
I'm running it much like the example on the website, via hadoop pipes:
hadoop pipes -conf config.xml -input /user/hdfs/allinput -output /user/hdfs/out
The config file has the same properties as specified on the 'running pydoop applications' page: the job name, the executable, and both hadoop.pipes.java.recordreader and hadoop.pipes.java.recordwriter.
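For reference, the relevant part of config.xml looks roughly like this (the job name and executable path below are placeholders, not my real values):

```xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapred.job.name</name>
    <value>my_pydoop_job</value>
  </property>
  <property>
    <!-- HDFS path of the uploaded script (placeholder) -->
    <name>hadoop.pipes.executable</name>
    <value>/user/hdfs/bin/my_script</value>
  </property>
  <property>
    <name>hadoop.pipes.java.recordreader</name>
    <value>true</value>
  </property>
  <property>
    <name>hadoop.pipes.java.recordwriter</name>
    <value>true</value>
  </property>
</configuration>
```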
I've read elsewhere that this can be caused by an inappropriate record reader, but it's not clear which options work well with pipes.
I'm slowly trying to determine whether a particular input record is triggering the exception, but this is tedious and it's not clear it will bear fruit.
Does anyone have experience with this, or suggestions on how to debug it further or faster?
Try checking the task logs. Do they say anything helpful?
Usually we see this sort of generic error when the Python process crashes. Since it only happens to some of your map tasks, I suspect a specific type of input is crashing your program.
Try wrapping your map function in a try block and catch a generic Exception; print the input line and the exception on error. That may give you a clue. Why do you suspect the record reader? Which record reader are you using? What values did you set for the hadoop.pipes.java.recordreader and hadoop.pipes.java.recordwriter properties?
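Something along these lines. This is a plain-Python sketch of the pattern (the tab-splitting is just a stand-in for your real map logic; in your mapper the line would come from context.getInputValue()):

```python
import sys

def safe_map(line):
    # Stand-in for the real map logic on a tab-separated record;
    # in a pydoop.pipes mapper the line would come from context.getInputValue().
    try:
        key, value = line.split("\t", 1)
        return key, value
    except Exception as e:
        # Report the offending line instead of letting the task die,
        # so it shows up in the task's stderr logs.
        sys.stderr.write("BAD RECORD %r: %s\n" % (line, e))
        return None
```

Any record that comes back as None was logged with the exact exception it raised, which should narrow down the problematic input quickly.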
Thanks for the response!
Nothing interesting in the task logs. Here's an example:
I tried wrapping the map method, but nothing was triggered. I'm using 'true' as the value for both hadoop.pipes.java.recordreader and hadoop.pipes.java.recordwriter. Are there more appropriate values?
Also, I forgot to mention: I'm running CDH4 and Pydoop 0.7 from the Debian packages on the website.
Somewhat off-topic, but the 'how to debug' guide on the Hadoop wiki mentions setting the keep.failed.task.files config property to get at the working files of failed tasks; however, this doesn't seem to indicate what data was given to the failed task.
How does one figure out which subset of the data was given to a particular task?
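One way to at least see which input split a task received is to log the legacy split properties from the job conf when the mapper starts up. A sketch, assuming the classic MRv1 properties map.input.file, map.input.start and map.input.length are set for the task (describe_split is just an illustrative helper name):

```python
def describe_split(conf):
    # conf is any mapping holding the task's job configuration;
    # in a pydoop.pipes mapper you would use context.getJobConf() instead.
    return "file=%s offset=%s length=%s" % (
        conf.get("map.input.file"),
        conf.get("map.input.start"),
        conf.get("map.input.length"),
    )

# Example with a plain dict standing in for the job conf:
print(describe_split({
    "map.input.file": "hdfs://namenode/user/hdfs/allinput/part-00000",
    "map.input.start": "0",
    "map.input.length": "67108864",
}))
```

Printed to stderr from the mapper constructor, this ties each task attempt to a concrete file and byte range, which you can then extract and replay locally.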
You should look at the "stderr logs" section of the task logs page to see what happened to your program. The "syslog logs" section shows what happened on the Java side: in this case, it simply complains about an error that occurred in the child (your program) and bails out.
Here is an example where I intentionally added a blank line to the input of a program that expects every line to split into at least one element (the ipcount example from the Pydoop distribution):
Traceback (most recent call last):
  File "/var/lib/hadoop-hdfs/cache/mapred/mapred/local/taskTracker/distcache/-1151978452821792310_-1974856589_343255098/localhost/user/simleo/pydoop_ad09787a4ed847f99cd9d94211601efd/pydoop_2cf83e5d9c2644168fe79ffd32a33410", line 63, in <module>
    pp.runTask(pp.Factory(Mapper, Reducer, combiner_class=Reducer))
  File "/home/simleo/.local/lib/python2.7/site-packages/pydoop/pipes.py", line 263, in runTask
  File "/var/lib/hadoop-hdfs/cache/mapred/mapred/local/taskTracker/distcache/-1151978452821792310_-1974856589_343255098/localhost/user/simleo/pydoop_ad09787a4ed847f99cd9d94211601efd/pydoop_2cf83e5d9c2644168fe79ffd32a33410", line 46, in map
    ip = context.getInputValue().split(None, 1)[0]
IndexError: list index out of range
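You can reproduce that failure mode in isolation: splitting a blank line on whitespace returns an empty list, so taking element 0 raises IndexError. A minimal sketch:

```python
def first_field(line):
    # Mirrors the ipcount-style parsing: take the first
    # whitespace-separated token of the input line.
    return line.split(None, 1)[0]

print(first_field("10.0.0.1 GET /index.html"))  # first token of a good line

try:
    first_field("")  # a blank line has no tokens
except IndexError:
    print("blank line -> IndexError, which kills the map task")
```

Inside a pipes task, that uncaught exception terminates the Python child, and the Java side only sees the resulting broken pipe.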
Sadly, the stderr logs are blank for all the failed tasks.
Any idea how to identify which data gets split to the failed tasks? (I realize this isn't really a Pydoop-specific question, but it's proving frustrating to Google an answer for.)
I narrowed it down to an instance of the data that causes the exception. I instrumented the code (with print statements to stdout) in more detail and, as far as I can tell, none of the Python code runs (neither the map function nor the constructor of the pydoop.pipes.Mapper subclass).
The traceback from the Java exception hints at something to do with the DataInputStream. Is there anything I can do to determine whether it's a Pydoop or a Hadoop issue?
2013-01-08 00:41:42,665 ERROR org.apache.hadoop.mapred.pipes.BinaryProtocol: java.io.EOFException
Can you send us your program together with a sample of the data you're processing? The sample should include the section that triggers the exception plus some records that are processed correctly.