pydoop0.8.1_CDH4.2.0_Ubuntu12.04LTS_hdfs_trouble

Alexey
2013-03-01
2013-03-11
  • Alexey
    2013-03-01

    Hello,

    I am installing pydoop-0.8.1 on Ubuntu 12.04 with Hadoop 2.0.0-cdh4.2.0.
    Hadoop was installed with the Cloudera Manager.

    python setup.py build

    It cannot locate the CLASSPATH:

    error: Error compiling java component. Command: javac -classpath -d 'build/temp.linux-x86_64-2.7/pipes-2.0.0-cdh4.2.0' src/it/crs4/pydoop/NoSeparatorTextOutputFormat.java

    In order to proceed further I had to add the CLASSPATH manually:

    export CLASSPATH=/usr/share/cmf/lib/*:/usr/share/cmf/lib/cdh4/*

    and remove the "-classpath" option from "setup.py":

    489,490c489
    < # compile_cmd = "javac -classpath %s" % jlib.classpath
    < compile_cmd = "javac"# -classpath %s" % jlib.classpath
    ---
    > compile_cmd = "javac -classpath %s" % jlib.classpath

    After that the build and install goes ok.
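
    (As a quick sanity check that the wildcard export above actually matches something, one can glob over the same two CM directories; the paths below are simply the ones from my export, so adjust them if your layout differs:)

    import glob
    # Directories taken from the manual CLASSPATH export above (CM layout on this box).
    for d in ("/usr/share/cmf/lib", "/usr/share/cmf/lib/cdh4"):
        print "%s: %d jars" % (d, len(glob.glob(d + "/*.jar")))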

    I can run the tests provided with Pydoop, but not the ones dealing with HDFS!

    Even when I simply ask Pydoop to show the contents of an HDFS directory, it fails:

    print hdfs.ls("hdfs:///user/alexey")
    hdfsBuilderConnect(forceNewInstance=0, nn=default, port=0, kerbTicketCachePath=(NULL), userName=(NULL)) error:
    (unable to get stack trace for java.lang.NoClassDefFoundError exception: ExceptionUtils::getStackTrace error.)
    Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    File "/home/vagrant/.local/lib/python2.7/site-packages/pydoop/hdfs/__init__.py", line 272, in ls
    dir_list = lsl(hdfs_path, user)
    File "/home/vagrant/.local/lib/python2.7/site-packages/pydoop/hdfs/__init__.py", line 258, in lsl
    fs = hdfs(host, port, user)
    File "/home/vagrant/.local/lib/python2.7/site-packages/pydoop/hdfs/fs.py", line 131, in __init__
    h, p, u, fs = _get_connection_info(host, port, user)
    File "/home/vagrant/.local/lib/python2.7/site-packages/pydoop/hdfs/fs.py", line 60, in _get_connection_info
    fs = hdfs_ext.hdfs_fs(host, port, user)
    IOError: Cannot connect to default

    The deployed Hadoop seems to work, since I can run tasks on it (word count, etc.).
    I can create HDFS directories, list their contents, and so on.

    Any suggestions?

    Regards, Alexey

     
  • Luca Pireddu
    2013-03-01

    Hi Alexey,

    Regarding your issues with building for CDH 4.2: a few days ago I saw some new commits in Simone Leo's logs relating to this Hadoop version, so I think he's working on it. He's away these days, but I'm sure you'll have some feedback as soon as he returns.

    Regarding the problem you're having at runtime, when exactly is it failing? It seems that:

    • you can run all pydoop tests, even the ones for the pydoop.hdfs module;
    • you can run pydoop jobs on a Hadoop cluster;
    • if you open a python shell, load pydoop.hdfs and try to list a directory, you get the error you pasted.

    Is that correct?

    What if you specify the name of the server in your URI; i.e., try

    print hdfs.ls("hdfs://localhost:9000/user/alexey")
    

    for instance (obviously substituting your real HDFS namenode address for localhost:9000 :-) ).

    And does

    hadoop dfs -ls hdfs://localhost:9000/user/alexey
    

    work properly?

    Let's figure out whether this issue is related to missing environment settings, pydoop not using the environment properly, or something else entirely.
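
    As a first step, you could print the relevant variables from the same Python process in which the call fails. Nothing Pydoop-specific here, just the variables Hadoop tools usually rely on (adjust the list as needed):

    import os
    # Show the environment as seen by the failing Python process;
    # empty values may point to a missing environment setup.
    for name in ("HADOOP_HOME", "HADOOP_CONF_DIR", "JAVA_HOME", "CLASSPATH"):
        print "%s=%r" % (name, os.environ.get(name))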

    Kind regards,

    Luca

     
  • Alexey
    2013-03-01

    Hello Luca,

    Thank you for your reply.

    Regarding the problem you're having at runtime, when exactly is it failing?

    It fails when I try to interact with HDFS (listing directories, creating files).

    you can run all pydoop tests, even the ones for the pydoop.hdfs module;

    I can run all_tests_pipes, test_hadut, but not all_tests_hdfs.

    you can run pydoop jobs on a Hadoop cluster;

    No, I cannot.

    if you open a python shell, load pydoop.hdfs and try to list a directory, you get the error you pasted.

    Yes I do.

    What if you specify the name of the server in your URI; i.e., try
    print hdfs.ls("hdfs://localhost:9000/user/alexey")
    for instance (obviously substituting your real HDFS namenode address for localhost:9000 :-) ).

    It does not work:

    print hdfs.ls("hdfs://master.local.vm:8020/user/alexey")
    loadFileSystems error:
    (unable to get stack trace for java.lang.NoClassDefFoundError exception: ExceptionUtils::getStackTrace error.)
    hdfsBuilderConnect(forceNewInstance=0, nn=master.local.vm, port=8020, kerbTicketCachePath=(NULL), userName=(NULL)) error:
    (unable to get stack trace for java.lang.NoClassDefFoundError exception: ExceptionUtils::getStackTrace error.)
    Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    File "/home/vagrant/.local/lib/python2.7/site-packages/pydoop/hdfs/__init__.py", line 272, in ls
    dir_list = lsl(hdfs_path, user)
    File "/home/vagrant/.local/lib/python2.7/site-packages/pydoop/hdfs/__init__.py", line 258, in lsl
    fs = hdfs(host, port, user)
    File "/home/vagrant/.local/lib/python2.7/site-packages/pydoop/hdfs/fs.py", line 131, in __init__
    h, p, u, fs = _get_connection_info(host, port, user)
    File "/home/vagrant/.local/lib/python2.7/site-packages/pydoop/hdfs/fs.py", line 60, in _get_connection_info
    fs = hdfs_ext.hdfs_fs(host, port, user)
    IOError: Cannot connect to master.local.vm

    And does
    hadoop dfs -ls hdfs://localhost:9000/user/alexey
    work properly?

    Yes it does.

    hadoop dfs -ls hdfs://master.local.vm:8020/user/alexey
    DEPRECATED: Use of this script to execute hdfs command is deprecated.
    Instead use the hdfs command for it.
    Found 4 items
    drwx------ - alexey hadoopusers 0 2013-02-27 12:11 hdfs://master.local.vm:8020/user/alexey/.Trash
    drwx------ - alexey hadoopusers 0 2013-02-28 07:51 hdfs://master.local.vm:8020/user/alexey/.staging
    drwxr-xr-x - alexey hadoopusers 0 2013-02-27 11:46 hdfs://master.local.vm:8020/user/alexey/texts
    drwxr-xr-x - alexey hadoopusers 0 2013-02-28 07:51 hdfs://master.local.vm:8020/user/alexey/texts_out

    One question though:

    Is it necessary to have pydoop installed on all the nodes in order to be able to run HDFS tests?

    Regards, Alexey

     
  • Simone Leo
    2013-03-04

    Hi Alexey,

    I'm preparing a new release with explicit support for CDH 4.2.0, but if you compiled and installed from source you should be fine with the current Pydoop version.

    Is it necessary to have pydoop installed on all the nodes in order to be able to run HDFS tests?

    No. To interact with a remote HDFS, you only need to install Pydoop on the client. What happens if you try to connect to the local fs, e.g.:

    import pydoop.hdfs as hdfs
    hdfs.ls("file:/tmp")
    

    Currently, our best guess is that something is wrong with the classpath. Can you show us the output of:

    import pydoop
    print pydoop.hadoop_classpath()
    
     
  • Alexey
    2013-03-04

    Hello Simone,

    Apparently something is wrong with the CLASSPATH:

    print pydoop.hadoop_classpath()

    prints nothing.

    Python 2.7.3 (default, Aug 1 2012, 05:14:39)
    [GCC 4.6.3] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import pydoop
    >>> print pydoop.hadoop_classpath()

    >>>

    So it looks like exporting the CLASSPATH during installation makes the installation complete without errors, but Pydoop struggles to find the classpath later at runtime.

    Running
    hdfs.ls("file:/tmp") gives:

    >>> import pydoop.hdfs as hdfs
    >>> hdfs.ls("file:/tmp")
    loadFileSystems error:
    (unable to get stack trace for java.lang.NoClassDefFoundError exception: ExceptionUtils::getStackTrace error.)
    hdfsBuilderConnect(forceNewInstance=0, nn=(NULL), port=0, kerbTicketCachePath=(NULL), userName=vagrant) error:
    (unable to get stack trace for java.lang.NoClassDefFoundError exception: ExceptionUtils::getStackTrace error.)
    Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    File "/usr/local/lib/python2.7/dist-packages/pydoop/hdfs/__init__.py", line 272, in ls
    dir_list = lsl(hdfs_path, user)
    File "/usr/local/lib/python2.7/dist-packages/pydoop/hdfs/__init__.py", line 258, in lsl
    fs = hdfs(host, port, user)
    File "/usr/local/lib/python2.7/dist-packages/pydoop/hdfs/fs.py", line 131, in __init__
    h, p, u, fs = _get_connection_info(host, port, user)
    File "/usr/local/lib/python2.7/dist-packages/pydoop/hdfs/fs.py", line 60, in _get_connection_info
    fs = hdfs_ext.hdfs_fs(host, port, user)
    IOError: Cannot connect to
    >>>
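
    (One stop-gap I am considering is to expand the same CM jar directories into the CLASSPATH environment variable before importing pydoop.hdfs. I have not verified whether the underlying libhdfs/Pydoop actually honours a CLASSPATH set this way, so treat this as a sketch only:)

    import glob, os
    # Build an explicit jar list from the CM directories used in the earlier export
    # (explicit paths, since wildcards may not be expanded by the JVM loader).
    jars = (glob.glob("/usr/share/cmf/lib/*.jar") +
            glob.glob("/usr/share/cmf/lib/cdh4/*.jar"))
    os.environ["CLASSPATH"] = ":".join(jars)
    import pydoop.hdfs as hdfs  # import only after setting CLASSPATH
    print hdfs.ls("file:/tmp")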

    Regards, Alexey

     
  • Simone Leo
    2013-03-04

    Hello again.

    Pydoop builds the Hadoop classpath according to the installed version of Hadoop. For a CDH installation, Pydoop should:

    1. Identify the Hadoop home as /usr/lib/hadoop
    2. Identify the hadoop executable as /usr/bin/hadoop
    3. Call /usr/bin/hadoop and parse its output to get the Hadoop version: in this case it should be 2.0.0-cdh4.2.0
    4. Look into the following places for Hadoop jars (roughly sketched in the snippet after this list):
      /usr/lib/hadoop
      /usr/lib/hadoop-0.20-mapreduce
      /usr/lib/hadoop/client
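
    Roughly, the jar lookup in step 4 boils down to something like the following (a simplified sketch with the CDH paths hardcoded; the real code in hadoop_utils.py handles other layouts too):

    import os, glob
    # Simplified version of the jar lookup for a packaged CDH4 install.
    hadoop_home = "/usr/lib/hadoop"
    mr1_home = "/usr/lib/hadoop-0.20-mapreduce"
    jars = (glob.glob(os.path.join(hadoop_home, "client", "*.jar")) +
            glob.glob(os.path.join(hadoop_home, "hadoop-annotations*.jar")) +
            glob.glob(os.path.join(mr1_home, "hadoop*.jar")))
    print ":".join(jars)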

    Can you try to see where this process gets stuck? On my laptop:

    In [6]: import pydoop
    
    In [7]: pydoop.hadoop_home()
    Out[7]: '/usr/lib/hadoop'
    
    In [8]: pydoop.hadoop_exec()
    Out[8]: '/usr/bin/hadoop'
    
    In [9]: pydoop.hadoop_version()
    Out[9]: '2.0.0-cdh4.2.0'
    
    In [10]: for p in pydoop.hadoop_classpath().split(":"): print p
    /usr/lib/hadoop/client/avro-1.7.3.jar
    /usr/lib/hadoop/client/slf4j-api-1.6.1.jar
    /usr/lib/hadoop/client/hadoop-yarn-common-2.0.0-cdh4.2.0.jar
    /usr/lib/hadoop/client/jsr305-1.3.9.jar
    /usr/lib/hadoop/client/jackson-xc-1.8.8.jar
    /usr/lib/hadoop/client/jaxb-impl-2.2.3-1.jar
    /usr/lib/hadoop/client/hadoop-yarn-api-2.0.0-cdh4.2.0.jar
    /usr/lib/hadoop/client/hadoop-mapreduce-client-shuffle-2.0.0-cdh4.2.0.jar
    /usr/lib/hadoop/client/hadoop-yarn-client-2.0.0-cdh4.2.0.jar
    /usr/lib/hadoop/client/hadoop-mapreduce-client-common-2.0.0-cdh4.2.0.jar
    /usr/lib/hadoop/client/xmlenc-0.52.jar
    /usr/lib/hadoop/client/commons-lang-2.5.jar
    /usr/lib/hadoop/client/commons-net-3.1.jar
    /usr/lib/hadoop/client/jackson-mapper-asl-1.8.8.jar
    /usr/lib/hadoop/client/jaxb-api-2.2.2.jar
    /usr/lib/hadoop/client/commons-configuration-1.6.jar
    /usr/lib/hadoop/client/commons-beanutils-1.7.0.jar
    /usr/lib/hadoop/client/snappy-java-1.0.4.1.jar
    /usr/lib/hadoop/client/netty-3.2.4.Final.jar
    /usr/lib/hadoop/client/jline-0.9.94.jar
    /usr/lib/hadoop/client/jackson-core-asl-1.8.8.jar
    /usr/lib/hadoop/client/jersey-core-1.8.jar
    /usr/lib/hadoop/client/commons-logging-1.1.1.jar
    /usr/lib/hadoop/client/hadoop-auth-2.0.0-cdh4.2.0.jar
    /usr/lib/hadoop/client/junit-4.8.2.jar
    /usr/lib/hadoop/client/stax-api-1.0.1.jar
    /usr/lib/hadoop/client/mockito-all-1.8.5.jar
    /usr/lib/hadoop/client/zookeeper-3.4.5-cdh4.2.0.jar
    /usr/lib/hadoop/client/protobuf-java-2.4.0a.jar
    /usr/lib/hadoop/client/commons-codec-1.4.jar
    /usr/lib/hadoop/client/jsch-0.1.42.jar
    /usr/lib/hadoop/client/paranamer-2.3.jar
    /usr/lib/hadoop/client/jettison-1.1.jar
    /usr/lib/hadoop/client/hadoop-common-2.0.0-cdh4.2.0.jar
    /usr/lib/hadoop/client/hadoop-yarn-server-common-2.0.0-cdh4.2.0.jar
    /usr/lib/hadoop/client/slf4j-log4j12-1.6.1.jar
    /usr/lib/hadoop/client/commons-beanutils-core-1.8.0.jar
    /usr/lib/hadoop/client/jersey-json-1.8.jar
    /usr/lib/hadoop/client/commons-el-1.0.jar
    /usr/lib/hadoop/client/guava-11.0.2.jar
    /usr/lib/hadoop/client/log4j-1.2.17.jar
    /usr/lib/hadoop/client/hadoop-mapreduce-client-app-2.0.0-cdh4.2.0.jar
    /usr/lib/hadoop/client/activation-1.1.jar
    /usr/lib/hadoop/client/jersey-server-1.8.jar
    /usr/lib/hadoop/client/commons-cli-1.2.jar
    /usr/lib/hadoop/client/hadoop-hdfs-2.0.0-cdh4.2.0.jar
    /usr/lib/hadoop/client/commons-io-2.1.jar
    /usr/lib/hadoop/client/commons-collections-3.2.1.jar
    /usr/lib/hadoop/client/asm-3.2.jar
    /usr/lib/hadoop/client/jackson-jaxrs-1.8.8.jar
    /usr/lib/hadoop/client/commons-math-2.1.jar
    /usr/lib/hadoop/client/hadoop-mapreduce-client-core-2.0.0-cdh4.2.0.jar
    /usr/lib/hadoop/client/hadoop-mapreduce-client-jobclient-2.0.0-cdh4.2.0.jar
    /usr/lib/hadoop/client/commons-digester-1.8.jar
    /usr/lib/hadoop/hadoop-annotations-2.0.0-cdh4.2.0.jar
    /usr/lib/hadoop/hadoop-annotations.jar
    /usr/lib/hadoop-0.20-mapreduce/hadoop-ant-2.0.0-mr1-cdh4.2.0.jar
    /usr/lib/hadoop-0.20-mapreduce/hadoop-test-2.0.0-mr1-cdh4.2.0.jar
    /usr/lib/hadoop-0.20-mapreduce/hadoop-core-2.0.0-mr1-cdh4.2.0.jar
    /usr/lib/hadoop-0.20-mapreduce/hadoop-ant.jar
    /usr/lib/hadoop-0.20-mapreduce/hadoop-core.jar
    /usr/lib/hadoop-0.20-mapreduce/hadoop-examples-2.0.0-mr1-cdh4.2.0.jar
    /usr/lib/hadoop-0.20-mapreduce/hadoop-2.0.0-mr1-cdh4.2.0-test.jar
    /usr/lib/hadoop-0.20-mapreduce/hadoop-tools.jar
    /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar
    /usr/lib/hadoop-0.20-mapreduce/hadoop-tools-2.0.0-mr1-cdh4.2.0.jar
    /usr/lib/hadoop-0.20-mapreduce/hadoop-2.0.0-mr1-cdh4.2.0-examples.jar
    /usr/lib/hadoop-0.20-mapreduce/hadoop-2.0.0-mr1-cdh4.2.0-tools.jar
    /usr/lib/hadoop-0.20-mapreduce/hadoop-2.0.0-mr1-cdh4.2.0-ant.jar
    /usr/lib/hadoop-0.20-mapreduce/hadoop-2.0.0-mr1-cdh4.2.0-core.jar
    /usr/lib/hadoop-0.20-mapreduce/hadoop-test.jar
    

    Regards,

    Simone

     
    • Alexey
      2013-03-06

      Hello Simone,

      Thank you for your answer.

      Please note that in the case described, the Cloudera Manager (CM) was used to install Hadoop.

      Apparently CM uses significantly different paths than the package installation you are referring to.

      1)

      >>> pydoop.hadoop_home()
      '/usr/share/cmf'
      

      2)

      >>> pydoop.hadoop_exec()
      '/usr/bin/hadoop'
      

      3)

      >>> pydoop.hadoop_version()
      '2.0.0-cdh4.2.0'
      

      Now to the CLASSPATH.

      At the moment Pydoop fails to find the CLASSPATH for the CM installation.

      Possible solutions:

      1)
      a)
      find the location of the jar archives manually and export them, i.e.

      $ export CLASSPATH=/usr/share/cmf/lib/*:/usr/share/cmf/lib/cdh4/*:
      

      b) remove the line containing "-classpath":

      $ diff setup.py setup.py~
      489,490c489,490
      < #    compile_cmd = "javac -classpath %s" % jlib.classpath
      <     compile_cmd = "javac "#-classpath %s" % jlib.classpath
      ---
      >     compile_cmd = "javac -classpath %s" % jlib.classpath
      > #    compile_cmd = "javac "#-classpath %s" % jlib.classpath
      

      In that case the installation completes without errors, but the installed package does not work; see the previous messages.

      Why does it not work at runtime, if the installation went well???

      2)
      a) find the locations of the jars and set them manually, as in:

       $ diff pydoop/hadoop_utils.py pydoop/hadoop_utils.py~
       300,304c300,302
       < #    glob.glob(os.path.join(hadoop_home, 'client', '*.jar')) +
       < #    glob.glob(os.path.join(hadoop_home, 'hadoop-annotations*.jar')) +
       < #    glob.glob(os.path.join(mr1_home, 'hadoop*.jar'))
       <      glob.glob(os.path.join('/usr/share/cmf/lib','*.jar'))+
       <      glob.glob(os.path.join('/usr/share/cmf/lib/cdh4', '*.jar'))
       ---
       >    glob.glob(os.path.join(hadoop_home, 'client', '*.jar')) +
       >    glob.glob(os.path.join(hadoop_home, 'hadoop-annotations*.jar')) +
       >    glob.glob(os.path.join(mr1_home, 'hadoop*.jar'))
      

      b) restore the "-classpath" option removed in 1b)

      In this case Pydoop accepts the proposed jars and installs well.

      Almost all HDFS tests are successful as well.
      Almost, because it fails at:

      append (test_hdfs_fs.TestHDFS) ... [                     Thread-77] DFSClient                      WARN  DataStreamer Exception java.io.IOException: Failed to add a datanode.  User may turn off this feature by setting dfs.client.block.write.replace-datanode-on-failure.policy in configuration, where the current policy is DEFAULT.
      

      But nevertheless it is now possible to list and create directories on HDFS using Pydoop.

      P.S. For CDH 4.1.3 installed from packages, Pydoop is able to find the CLASSPATH.
      But apparently not all of the needed jars are captured by Pydoop,
      so the default settings also did not work for me.

      To solve it, the same manual CLASSPATH provisioning can be used:

      $ diff pydoop/hadoop_utils.py pydoop/hadoop_utils.py~
      302,308c302,304
      < #          glob.glob(os.path.join(hadoop_home, 'client', '*.jar')) +
      <     glob.glob(os.path.join('/usr/lib/hadoop' , '*jar')) +
      <     glob.glob(os.path.join('/usr/lib/hadoop-hdfs' , '*jar')) +
      <     glob.glob(os.path.join('/usr/lib/hadoop-0.20-mapreduce/lib' , '*jar')) +
      <     glob.glob(os.path.join('/usr/lib/hadoop-0.20-mapreduce' , '*jar')) 
      < #   glob.glob(os.path.join(hadoop_home, 'hadoop-annotations*.jar')) +
      < #   glob.glob(os.path.join(mr1_home, 'hadoop*.jar'))
      ---
      >     glob.glob(os.path.join(hadoop_home, 'client', '*.jar')) +
      >     glob.glob(os.path.join(hadoop_home, 'hadoop-annotations*.jar')) +
      >     glob.glob(os.path.join(mr1_home, 'hadoop*.jar'))
      
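      A more general form of these two manual patches would be to glob over a configurable list of candidate jar directories; an untested sketch, with the directory list taken from the cases above:

      import os, glob

      # Candidate jar directories seen in the CM and package installations above;
      # extend the list for other layouts.
      CANDIDATE_DIRS = [
          "/usr/share/cmf/lib",
          "/usr/share/cmf/lib/cdh4",
          "/usr/lib/hadoop",
          "/usr/lib/hadoop-hdfs",
          "/usr/lib/hadoop-0.20-mapreduce",
          "/usr/lib/hadoop-0.20-mapreduce/lib",
      ]

      def build_classpath(dirs=CANDIDATE_DIRS):
          # Collect every jar found in the candidate directories into one classpath string.
          jars = []
          for d in dirs:
              jars.extend(glob.glob(os.path.join(d, "*.jar")))
          return ":".join(jars)

      print build_classpath()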

      Regards, Alexey

       
  • Simone Leo
    2013-03-08

    Hello again,

    I've just updated the code to add support for CDH 4.2.0 installed via Cloudera Manager with parcels. It's in the current develop head.

    I'm getting that weird error on the append test as well; I need to investigate further.

    Thank you for your feedback!

    Simone

     
  • Simone Leo
    2013-03-11

    The append bug has been fixed (again, in the git develop branch). As described in HDFS-3091, the issue is related to the ReplaceDatanodeOnFailure policy being active on a cluster whose number of nodes is less than or equal to half the replication factor. Cloudera Manager installs a single-node cluster with a cluster-wide replication factor of 3, while installation from packages with the pseudo-distributed configuration sets replication to 1. To fix the problem in Pydoop, I simply changed the replication factor of the specific file used for that test to 1.
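
    For reference, the change boils down to writing the test file with replication set to 1. With the high-level hdfs module that looks roughly like the sketch below (double-check the replication keyword against your Pydoop version; the file name here is just an example):

    import pydoop.hdfs as hdfs

    # Write the file with replication=1 so that a single-node cluster with a
    # cluster-wide replication factor of 3 does not trigger the
    # ReplaceDatanodeOnFailure policy on append (see HDFS-3091).
    f = hdfs.open("append_test_file", "w", replication=1)
    f.write("first chunk\n")
    f.close()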