Use of jc_configure or alternative approach

  • Anonymous

    Anonymous - 2012-10-31

    Hello friends:

    I hope all is well.

    I was looking back at the code in the provided examples again,
    and have a question.

    In the Writer() class, I see 3 lines of code that use jc_configure (and its variants)
    to create 3 object attributes: "self.part", "self.outdir", and "self.sep".

    (1) I was wondering if the 5 lines of code that I wrote after that (commented
    out) are equivalent to the 3 lines above them, and if they can be safely
    employed as a substitute?

    (2) And if so, are there any (non-stylistic) pros/cons/gotchas either way to
    doing so? I ask this because it is sometimes difficult to track down subtle
    issues in Hadoop jobs.

    Thank you (see below).

    class MyRecordWriter(pydoop.pipes.RecordWriter):
      def __init__(self, context):
        super(MyRecordWriter, self).__init__()
        self.logger = logging.getLogger("Writer")
        jc = context.getJobConf()
        jc_configure_int(self, jc, "mapred.task.partition", "part")                         # creates self.part attribute
        jc_configure(self, jc, "", "outdir")                         # creates self.outdir attribute
        jc_configure(self, jc, "mapred.textoutputformat.separator", "sep", "\t") # creates self.sep attribute
        # self.part   = jc.getInt("mapred.task.partition")
        # self.outdir = jc.get("")
        # self.sep    = "\t"
        # if jc.hasKey("mapred.textoutputformat.separator"):
        #   self.sep = jc.get("mapred.textoutputformat.separator")
        self.outfn = "%s/part-%05d" % (self.outdir, self.part)  # /path/to/output/dir/part-nnnnn
        self.file =, "w")
        [ ... ]
  • Simone Leo

    Simone Leo - 2012-11-05


    Yes, your code is equivalent: it uses the lower-level API provided by the boost extension. In the utils module we tried to provide a few convenience functions to reduce redundancy when interacting with the jc object, but they're not particularly elegant. A much better solution would be to add some boost code to expose a jc object that behaves as much as possible like a standard Python dict.

    As usual, patches are welcome!
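
    To make the dict-like idea concrete, here is a minimal pure-Python sketch (not boost code, and not part of Pydoop). It assumes only the three jc methods used earlier in this thread: hasKey, get and getInt. FakeJobConf, JobConfDict and get_int are hypothetical names invented for illustration; FakeJobConf merely stands in for the real jc object so the snippet runs outside Hadoop.

```python
class FakeJobConf(object):
    """Hypothetical stand-in for the jc object (mimics hasKey/get/getInt only)."""

    def __init__(self, props):
        self._props = dict(props)

    def hasKey(self, key):
        return key in self._props

    def get(self, key):
        return self._props[key]

    def getInt(self, key):
        return int(self._props[key])


class JobConfDict(object):
    """Hypothetical read-only, dict-like view over a jc object."""

    def __init__(self, jc):
        self._jc = jc

    def __contains__(self, key):
        return self._jc.hasKey(key)

    def __getitem__(self, key):
        # Behave like a dict: a missing key raises KeyError.
        if not self._jc.hasKey(key):
            raise KeyError(key)
        return self._jc.get(key)

    def get(self, key, default=None):
        # Like dict.get: fall back to the default when the key is absent.
        return self._jc.get(key) if self._jc.hasKey(key) else default

    def get_int(self, key, default=None):
        # Integer variant of get(); assumed helper, not a dict method.
        return self._jc.getInt(key) if self._jc.hasKey(key) else default


jc = FakeJobConf({"mapred.task.partition": "3"})
conf = JobConfDict(jc)
part = conf.get_int("mapred.task.partition")
sep = conf.get("mapred.textoutputformat.separator", "\t")
```

    With a wrapper like this, the hasKey check and the default value collapse into a single call, which is the redundancy the convenience functions were meant to remove.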

  • Anonymous

    Anonymous - 2012-11-08

    Thank you Simone. I'm still working on my Pydoop chops (and other areas, too). =:).
    But when the voluminous book reading, trial/error, and self note-taking subsides, I'd like to.

  • Simone Leo

    Simone Leo - 2012-11-19

    Update: Luca has done some work on this for our latest release:

    For now, the only usage example is in the pydoop script application.

  • Anonymous

    Anonymous - 2012-11-20

    Simone, thank you for this update. Appreciated. I will have a look at the documentation.

  • Anonymous

    Anonymous - 2012-11-28

    Simone, Luca…

    Thank you for the proactive update on this here.

    I just upgraded to CDH4 Hadoop 2.0.0-cdh4.1.2 (MRv1, not YARN) and correspondingly, downloaded and compiled
    Pydoop v0.7-rc3 against that platform.

    I didn't know why all of my (previously working) Pydoop programs would fail with a permissions-related
    error when they attempted to write results to HDFS on the Reduce side. Then I remembered the above
    follow-up, which led me to solving the issue.

    For others who might stumble into this, a slight modification to your Pydoop programs when transitioning
    as I did above, is necessary:

    BEFORE ….

    self.file =, "w")

    AFTER ….

    self.hdfs_user = jc.get("pydoop.hdfs.user")
          # Create a Key/Value pair for this property in jobConf.xml or programmatically.
          # The Value is typically your Unix user name.
          # The Key name is arbitrary (i.e. not Pydoop specific), but we use "pydoop.hdfs.user"
          # here to be consistent with their provided examples.
    self.file =, "w", user=self.hdfs_user)