Hadoop and pydoop noob question about list comprehensions

  • Jason

    Jason - 2013-08-11

    Given that Pig [Latin] is essentially Python list comprehensions, I was looking for a Python replacement for Pig.

    Users = load 'users' as (name, age, ipaddr);
    Clicks = load 'clicks' as (user, url, value);
    ValuableClicks = filter Clicks by value > 0;
    UserClicks = join Users by name, ValuableClicks by user;
    Geoinfo = load 'geoinfo' as (ipaddr, dma);
    UserGeo = join UserClicks by ipaddr, Geoinfo by ipaddr;
    ByDMA = group UserGeo by dma;
    ValuableClicksPerDMA = foreach ByDMA generate group, COUNT(UserGeo);
    store ValuableClicksPerDMA into 'ValuableClicksPerDMA';

    Python (non-hadoop):
    users = some_load_function('users')
    clicks = some_load_function('clicks')
    valuableclicks = [x for x in clicks if x['value'] >0]
    userclicks = some_join_function(valuableclicks, users, {'user':'name'})
    geoinfo = some_load_function('geoinfo')
    usergeo = some_join_function(userclicks, geoinfo, {'ipaddr':'ipaddr'})
    bydma = group_function(usergeo, ['dma'])
    valuableclicksperdma = [{dma: len(rows)} for dma, rows in bydma.items()]  # bydma maps each dma to its rows
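
    For reference, the non-Hadoop version above can be made runnable with a few lines of plain Python. This is only a sketch: join() and group() are hypothetical stand-ins for some_join_function and group_function, the sample rows are made up, and everything runs in memory on one machine.

```python
from collections import defaultdict

def join(left, right, on):
    """Inner join of two lists of dicts; `on` maps one left key to one right key."""
    (left_key, right_key), = on.items()
    index = defaultdict(list)
    for row in right:
        index[row[right_key]].append(row)
    # Merge each left row with every right row that shares the join value.
    return [{**r, **l} for l in left for r in index.get(l[left_key], [])]

def group(rows, key):
    """Group rows into a dict of {key value: [rows]}."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return dict(groups)

# Made-up sample data standing in for the 'users', 'clicks' and 'geoinfo' files.
users = [
    {"name": "ann", "age": 30, "ipaddr": "1.1.1.1"},
    {"name": "bob", "age": 25, "ipaddr": "2.2.2.2"},
]
clicks = [
    {"user": "ann", "url": "/a", "value": 1},
    {"user": "ann", "url": "/b", "value": 0},
    {"user": "bob", "url": "/c", "value": 2},
    {"user": "ann", "url": "/d", "value": 3},
]
geoinfo = [
    {"ipaddr": "1.1.1.1", "dma": "501"},
    {"ipaddr": "2.2.2.2", "dma": "803"},
]

valuableclicks = [c for c in clicks if c["value"] > 0]
userclicks = join(valuableclicks, users, {"user": "name"})
usergeo = join(userclicks, geoinfo, {"ipaddr": "ipaddr"})
bydma = group(usergeo, "dma")
valuableclicksperdma = {dma: len(rows) for dma, rows in bydma.items()}
print(valuableclicksperdma)
```

    With the sample rows this counts two valuable clicks for DMA 501 and one for DMA 803.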

    Python (fantasy hadooped):
    import pydoop
    cluster = pydoop.cluster("myhadoopinstance")
    cluster.copy('~user/users', 'user')
    cluster.copy('~users/clicks', 'clicks')
    users = cluster.load('users', ['name','age','ipaddr']) # handle to object
    clicks = cluster.load('clicks', ['user', 'url','value']) # handle to object
    valuableclicks = cluster.filter(lambda x: x['value'] > 0, clicks) # clicks is an iterable object, filter is python's filter()
    userclicks = cluster.join(valuableclicks, users, {'user':'name'})
    geoinfo = cluster.load('geoinfo')
    usergeo = cluster.join(userclicks, geoinfo, {'ipaddr':'ipaddr'})
    bydma = cluster.group(usergeo, ['dma'])
    valuableclicksperdma = [{dma: len(rows)} for dma, rows in bydma.items()]  # bydma maps each dma to its rows

    I guess my example isn't all that exciting in retrospect. But here is what I wanted:
    1. Something that looks like python
    2. Something that runs as much on the cluster as possible
    3. Something that uses python's map, filter and reduce functions, but on the cluster.

    So far all I have seen is writing the map and reduce parts in a file that is sent to the cluster. I think that is short-sighted. If at all possible, cluster.map() and cluster.filter() would translate the Python functions and ship them to the cluster, rather than making me save a file, transfer it and run it. I think Python is so close to being able to do what Pig does, but in a way that is completely familiar to Python peeps.
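
    One way to picture the cluster.map() / cluster.filter() idea is a dataset object that records operations instead of running them, so that a backend could translate the recorded plan later. The sketch below is purely illustrative: LazyDataset, plan() and collect() are made-up names, not part of pydoop, and collect() just runs the plan locally with Python's built-in map() and filter(). A real backend would also have to serialize the functions and ship them to the worker nodes, which this does not attempt.

```python
class LazyDataset:
    """Records map/filter steps without executing them (hypothetical, not a pydoop class)."""

    def __init__(self, source, steps=()):
        self.source = source
        self.steps = tuple(steps)

    def map(self, fn):
        # Return a new dataset with one more recorded step; nothing runs yet.
        return LazyDataset(self.source, self.steps + (("map", fn),))

    def filter(self, fn):
        return LazyDataset(self.source, self.steps + (("filter", fn),))

    def plan(self):
        """Describe the recorded steps; a cluster backend could translate these."""
        return [f"{kind}({fn.__name__})" for kind, fn in self.steps]

    def collect(self):
        """Run the recorded steps locally, in order."""
        rows = iter(self.source)
        for kind, fn in self.steps:
            rows = map(fn, rows) if kind == "map" else filter(fn, rows)
        return list(rows)


def is_valuable(click):
    return click["value"] > 0

def url_of(click):
    return click["url"]

clicks = LazyDataset([
    {"user": "ann", "url": "/a", "value": 1},
    {"user": "ann", "url": "/b", "value": 0},
    {"user": "bob", "url": "/c", "value": 2},
])
valuable_urls = clicks.filter(is_valuable).map(url_of)
print(valuable_urls.plan())
print(valuable_urls.collect())
```

    Nothing is evaluated until collect() is called, which is the point where a cluster-backed version would hand the plan to Hadoop instead.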

  • Luca Pireddu

    Luca Pireddu - 2013-08-13

    That sort of tool would be really cool, but there would be a lot of work involved in implementing that sort of functionality. Maybe the easiest way to do it would be to generate Pig Latin with some Python UDFs created on-the-fly from the user script. In any case, at the moment I can't see us finding sufficient time to work on this kind of high-level feature, but we'd be happy to support you in such an endeavour ;-)
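
    As a rough illustration of the "generate Pig Latin" route, a small Python helper could collect Pig Latin statements as strings and render a script from them. PigScript and its methods are hypothetical names invented for this sketch; it only builds text, does not run Pig, and leaves out the harder part of turning Python functions into UDFs.

```python
class PigScript:
    """Collects Pig Latin statements as plain strings (hypothetical helper)."""

    def __init__(self):
        self.statements = []

    def load(self, alias, path, fields):
        self.statements.append(f"{alias} = load '{path}' as ({', '.join(fields)});")
        return alias

    def filter(self, alias, source, condition):
        self.statements.append(f"{alias} = filter {source} by {condition};")
        return alias

    def join(self, alias, left, left_key, right, right_key):
        self.statements.append(
            f"{alias} = join {left} by {left_key}, {right} by {right_key};"
        )
        return alias

    def render(self):
        """Return the whole script, one statement per line."""
        return "\n".join(self.statements)


script = PigScript()
users = script.load("Users", "users", ["name", "age", "ipaddr"])
clicks = script.load("Clicks", "clicks", ["user", "url", "value"])
valuable = script.filter("ValuableClicks", clicks, "value > 0")
script.join("UserClicks", users, "name", valuable, "user")
print(script.render())
```

    The rendered text matches the first four statements of the Pig script in the original post; submitting it to a cluster would be a separate step.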



