Specifying InputFormat and OutputFormat ...

Help
Anonymous
2012-11-16
2013-03-15
  • Anonymous - 2012-11-16

    Hello friends:

    I hadn't noticed the "Using the Hadoop SequenceFile Format" section of the documentation
    until last night. So before that I always wondered if there was a Pydoop mechanism to - underneath
    the covers - employ the various InputFormat / OutputFormat formats offered natively in the
    Hadoop distribution. For example, these formats:

        TextInputFormat
         KeyValueTextInputFormat
         NLineInputFormat
        (etc. InputFormats)
         TextOutputFormat
          NullOutputFormat
          (etc. OutputFormats)
    

    There can be others too. For example, XMLInputFormat from the Apache Mahout project.

    My question is, can the below properties be used generally to invoke use of these
    other formats, too? That is, if I specify the correct path to them in <value> … </value>,
    will Pydoop honor and use them?

    <property>
      <name>mapred.output.format.class</name>
      <value>org.apache.hadoop.mapred.SequenceFileOutputFormat</value>
    </property>
    <property>
      <name>mapred.input.format.class</name>
      <value>org.apache.hadoop.mapred.SequenceFileInputFormat</value>
    </property>
    

    Finally, when unspecified, I assume Pydoop uses TextInputFormat and TextOutputFormat
    by default. Is this correct?

    Thank you in advance. :)
    Noel

     
  • Simone Leo

    Simone Leo - 2012-11-19

    Hello

    Yes, you can use any input/output format you want, even custom ones. This is not a feature of Pydoop: the mapred.{input,output}.format.class is a Hadoop property that works with Hadoop Pipes and thus with Pydoop. The following section of the docs shows how to write your own InputFormat and use it with pydoop:

    http://pydoop.sourceforge.net/docs/examples/input_format.html

    Finally, when unspecified, I assume Pydoop uses TextInputFormat and TextOutputFormat
    by default. Is this correct?

    Yes. Again, this is not handled by Pydoop: it's the default Hadoop behavior.

     
  • Anonymous - 2012-11-20

    Simone, thank you. This is becoming clearer and clearer now.

    BTW: Today I was going to work on a cookbook on using Pydoop on Amazon's EMR (and post it on StackExchange); but as far as I can tell, EMR does not support Hadoop Pipes -  Well, at least that is the impression I get from the Web UI interface to EMR. Perhaps a programmatic interface to EMR support pipes (but I haven't looked). And opinion on this?

    I know it can be done in EC2 linux machines, but I was wondering about EMR.

    Regards & Thanks

     
  • Simone Leo

    Simone Leo - 2012-11-23

    Unfortunately we currently have no experience using Pydoop with the Amazon Web Services. Any feedback on this would be more than welcome.

    Simone

     

Log in to post a comment.

Get latest updates about Open Source Projects, Conferences and News.

Sign up for the SourceForge newsletter:





No, thanks