I am trying to integrate Cassandra with Hadoop and take advantage of the
data-locality integration between Cassandra and Hadoop. All the examples
online seem to be written in Java, and I would like to write my MapReduce
programs in Python. After researching, I think Pydoop gives enough exposure
to the lower-level Hadoop APIs to possibly bypass writing it in Java, but
I'm not sure. Does anyone have experience using Python to write a MapReduce
program in Hadoop that leverages data locality? Another option would be to
force all my nodes to run a specific Python script with no input data and
no output data. I have a Python script that can determine which data is
local to the Cassandra node, but I haven't been able to find a way to force
every node to run that script with no input/output data. It seems really
hard to break the Hadoop paradigm of input and output data needing to be
defined.
Any help will be appreciated.
Well, any MapReduce program written in Pydoop should automatically leverage the data-locality features provided by Hadoop itself. Data locality comes from the Hadoop scheduler -- the component that decides where to run tasks -- combined with the InputFormat implementation, which reports the location of each input split's data.
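To illustrate the shape of such a job: with Pydoop you subclass `pydoop.mapreduce.api.Mapper` and `Reducer` and launch with `pydoop submit`, but the map/reduce logic itself is plain Python. The sketch below uses standalone functions plus a toy in-process driver that mimics the shuffle phase, so it runs without a cluster -- the driver is purely for demonstration, not Pydoop API:

```python
from collections import defaultdict

# Map and reduce logic as you would write it for Pydoop. In a real job
# these bodies live in subclasses of pydoop.mapreduce.api.Mapper/Reducer
# and receive a context object instead of explicit arguments.
def map_fn(key, value, emit):
    # value is one line of input text; emit (word, 1) for each word
    for word in value.split():
        emit(word, 1)

def reduce_fn(key, values, emit):
    # sum the counts collected for each word
    emit(key, sum(values))

def local_mapreduce(lines):
    """Toy single-process driver mimicking the shuffle phase.
    (Illustration only -- Hadoop does this across the cluster.)"""
    shuffled = defaultdict(list)
    for offset, line in enumerate(lines):
        map_fn(offset, line, lambda k, v: shuffled[k].append(v))
    result = {}
    for key, values in sorted(shuffled.items()):
        reduce_fn(key, values, lambda k, v: result.update({k: v}))
    return result

print(local_mapreduce(["a b a", "b a"]))  # {'a': 3, 'b': 2}
```

When this runs as a real Pydoop job, the scheduler tries to start each map task on a node holding the split it will read, which is exactly the data-locality behavior described above.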
Regarding your last comment, I think we can agree that a job with no input or output would be kind of useless. What you probably want is a way to represent your input source or output destination as a Hadoop InputFormat/OutputFormat, which would let your source/sink fit into the framework, participate in data-local task scheduling, and allow you to write your map and reduce functions as Pydoop functions.
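The key idea is that an InputFormat's splits each carry a list of hosts holding a local replica, and the scheduler prefers those hosts when assigning tasks. A minimal sketch of that idea (the `Split` class and `pick_host` function are hypothetical models for illustration, not Hadoop API):

```python
from dataclasses import dataclass

# Toy model of what an InputFormat's getSplits() provides: each split
# names the hosts that hold a local replica of its data.
@dataclass
class Split:
    split_id: int
    hosts: tuple  # hosts holding a local copy of this split's data

def pick_host(split, free_hosts):
    """Mimic the scheduler's preference: run the task on a host that
    already has the data if one is free, otherwise fall back to any."""
    for host in split.hosts:
        if host in free_hosts:
            return host  # data-local assignment
    return next(iter(free_hosts))  # non-local fallback

splits = [Split(0, ("node1", "node2")), Split(1, ("node3",))]
print(pick_host(splits[0], {"node2", "node3"}))  # node2 (data-local)
print(pick_host(splits[1], {"node1"}))           # node1 (fallback)
```

A Cassandra-aware InputFormat does the same thing: it maps token ranges to replica hosts, so tasks land on nodes that own the data.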
I don't see a clear way to leverage Cassandra's InputFormat (and related) classes via Python, or Pydoop for that matter. In Java there are several examples of importing those classes and using them to override Hadoop's defaults, which achieves the desired result, but in Python there doesn't seem to be an equivalent mechanism. Even if there were a way to do it, how would Python read the data in?
I agree that in general no input and no output wouldn't make sense, but if I am running Cassandra on every node and a TaskTracker on every node, then I may just want Hadoop to distribute the job and let the script pick up its local data by inspecting Cassandra's token ranges to see which data from a specific column family is local. I'd rather not go that route, but if the choice is between Java and Python, it becomes a possible option lol.
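One way to approximate that fallback without abandoning Hadoop entirely is a map-only job (zero reducers) fed a dummy input file with one line per node; each map task ignores its input and works on whatever token ranges the local node owns. A sketch under those assumptions -- `local_token_ranges` and `process_partition` are hypothetical stubs standing in for real Cassandra driver calls:

```python
# Map-only sketch of the "no real input" approach: give Hadoop a dummy
# input file, set reducers to 0, and have each map task consult
# Cassandra for the token ranges owned by the node it landed on.

def local_token_ranges(node):
    # Hypothetical stub: in reality, query the cluster's token map for
    # the ranges owned by the local node.
    return [(0, 100), (200, 300)] if node == "node1" else [(100, 200)]

def process_partition(token_range):
    # Hypothetical stub: in reality, read and process the rows that
    # fall in this token range from the local Cassandra replica.
    return "processed %d-%d" % token_range

def mapper(lines, node):
    """Body of each streaming/Pydoop map task; 'lines' is dummy input."""
    out = []
    for _ in lines:  # ignore the dummy input content
        for rng in local_token_ranges(node):
            out.append(process_partition(rng))
    return out

print(mapper(["dummy"], "node1"))  # ['processed 0-100', 'processed 200-300']
```

The caveat is that Hadoop doesn't guarantee one map task per node: placement is still up to the scheduler, so this trick only approximates "run my script everywhere" and can double-process or skip nodes unless you guard against it.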