Some older CDH versions use a + to delimit external and internal versions of Hadoop code, e.g. 0.20.2+320.
This change supports handling that plus signature in pydoop.hadoop_utils.
The patch is available on GitHub and will be pasted separately below.
Author: Jeremy G. Kahn <firstname.lastname@example.org>
Date: Tue Apr 16 14:28:31 2013 -0700
Allow hadoop version to use a plus
some older CDH versions use a + to delimit 'public' vs private versions
of Hadoop code, e.g. 0.20.2+320.
This change supports handling that signature in pydoop.hadoop_utils
diff --git a/pydoop/hadoop_utils.py b/pydoop/hadoop_utils.py
index 3d70495..a63cc0f 100644
@@ -63,7 +63,7 @@ class HadoopVersion(object):
def __init__(self, version_str):
self.__str = version_str
- version = self.__str.split("-", 1)
+ version = re.split(r"[-+]", self.__str, maxsplit=1)
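The effect of the one-line change above can be sketched as follows; `split_version` is a hypothetical stand-alone name for what happens inside `HadoopVersion.__init__`:

```python
import re

def split_version(version_str):
    # Old behavior: only "-" delimited the extra tag, so "0.20.2+320"
    # was treated as one opaque version string with no extra part.
    # New behavior: either "-" or "+" delimits the extra tag.
    return re.split(r"[-+]", version_str, maxsplit=1)

print(split_version("0.20.2-cdh3u0"))  # ['0.20.2', 'cdh3u0']
print(split_version("0.20.2+320"))     # ['0.20.2', '320']
print(split_version("1.0.4"))          # ['1.0.4']
```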
Although we do not explicitly support older versions, adding this won't hurt. Thanks for the patch!
Thanks. I know y'all don't explicitly support versions that old, but I'm working with a legacy Hadoop cluster that I can't upgrade (yet). With this patch in place, I find that I can get Pydoop working - without it, I am immediately blocked by a version assert from pydoop at compile time.
Do you have any plans to integrate this change into develop (or master)? I would like to avoid maintaining my own local copy of this patch, and I do not believe that it interferes with any other pydoop development.
I had to add a few more lines to keep version objects consistent with the other ones. The "320" part in "0.20.2+320", for instance, is mapped to "(3, -1, 320)" (it's CDH3 and it's older than "cdh3u0" which is mapped to "(3, 0)"). I also updated the corresponding unit test.
I chose not to add support for "0.20.2-CDH3B4" for now. I hope no one needs that :)
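The ordering described above can be illustrated with plain tuple comparison; the mapping below is a hand-written sketch of the values mentioned in this thread, not the actual pydoop table:

```python
# "0.20.2+320" maps to (3, -1, 320) so that it sorts before
# "cdh3u0" -> (3, 0): Python compares tuples element by element,
# and -1 < 0 puts the plus-style release first.
releases = {
    "0.20.2+320": (3, -1, 320),
    "0.20.2-cdh3u0": (3, 0),
    "0.20.2-cdh3u1": (3, 1),
}
ordered = sorted(releases, key=releases.get)
print(ordered)  # ['0.20.2+320', '0.20.2-cdh3u0', '0.20.2-cdh3u1']
```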
These additional changes (using a negative number in the version tuple) break the generated C code: I get pages of errors that look like this:
src/_pipes_0_20_2_cdh_3_-1_320_main.cpp:8:1: error: invalid suffix "_320" on integer constant
I understand the importance of keeping the version objects in line, but it's also important that the versions compile. Can we cope with a -1 minor version in a way that doesn't break the C code?
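One hypothetical way to keep such tags lexically valid as C identifiers is to spell the sign out instead of emitting a literal minus; `mangle` is an illustrative helper, not pydoop's actual fix:

```python
def mangle(part):
    # Joining (3, -1, 320) naively yields "..._3_-1_320_main", which C
    # lexes as the integer constant 1 with an invalid "_320" suffix.
    # Spelling a negative component as e.g. "m1" keeps the generated
    # name a single valid identifier.
    return "m%d" % -part if part < 0 else "%d" % part

tag = "_".join(mangle(p) for p in (3, -1, 320))
print("_pipes_0_20_2_cdh_%s_main" % tag)  # _pipes_0_20_2_cdh_3_m1_320_main
```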
Fixed (hopefully) in commit 5164a98 (as I'm writing this, the web-based repository browser does not yet show this commit, but it's there if you checkout the develop branch via git).
I tried to address all possible cases this time, while trying to leave things as consistent as possible. Thanks for your feedback, and please let me know if the current version works for you.
This seems to work for me. Thanks for revising the version numbers.
Any plans to issue a formal release with these changes? I'd like to manage these packages with an artifact repository and that requires a known version number.
Hmm, no, this patch doesn't seem to work for me. I can't figure out why, but when I compile against develop, the mappers fail with errors saying HADOOP_CONF_DIR cannot be found. When I compile against my original patch, it does find the appropriate pieces and functions.
I'm still not happy with the revisions to this patch. I hope to abandon them all, of course, when my group upgrades to CDH4, but in the meantime I'm sticking to my revised patch until I can isolate what change broke my pydoop. :(
Have you tried removing the old installation completely before installing the new version? You can also force the correct value of HADOOP_CONF_DIR by setting the environment variable explicitly.
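Forcing the variable can be done before importing pydoop, so its detection logic never has to derive the directory from HADOOP_HOME. The paths below are hypothetical examples for a typical CDH layout:

```python
import os

# Both paths are illustrative; substitute your cluster's actual layout.
os.environ["HADOOP_HOME"] = "/usr/lib/hadoop"
os.environ["HADOOP_CONF_DIR"] = os.path.join(
    os.environ["HADOOP_HOME"], "conf")

print(os.environ["HADOOP_CONF_DIR"])  # /usr/lib/hadoop/conf
```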
I have tried both of those things already. I think the version munging creates some piece of generated C code with a different name than the engine expects; setting HADOOP_CONF_DIR should only be required when it cannot be computed from HADOOP_HOME, and my previous compilation already sets HADOOP_HOME.
We'll probably have to complete our adoption of CDH4 just to avoid this headache in the future; in the mean time I think we're going to have to stick with my homebrew patch.