0.) Our source files are liberally licensed under the Apache License, with
portions copyright the PostgreSQL Global Development Group.
1.) Download and build the Thrift components
Hadoop FDW requires Thrift v0.9.1 (http://thrift.apache.org/)
Make sure you have reviewed and installed the Thrift build prerequisites:
http://thrift.apache.org/docs/install/
Extract the downloaded Thrift tarball in your home directory and set the
THRIFT_HOME environment variable
$ cd $HOME
$ tar -xvf thrift-0.9.1.tar.gz
$ export THRIFT_HOME=$HOME/thrift-0.9.1
$ cd $THRIFT_HOME
Set the compiler flag needed for building Thrift and fb303
$ export CXXFLAGS="-fPIC"
Configure thrift for compiling on your platform
$ ./configure --prefix=$THRIFT_HOME --without-csharp --without-java --without-erlang --without-python --without-perl --without-php --without-ruby --without-haskell --without-go
Build Thrift
$ make
$ make install
Build fb303
$ cd contrib/fb303
$ ./bootstrap.sh
$ export CPPFLAGS="-I$THRIFT_HOME/include -fPIC"
$ ./configure --prefix=$THRIFT_HOME --with-thriftpath=$THRIFT_HOME
$ make
$ sudo make install
2.) Download and build PostgreSQL 9.3.5 from source at
http://www.postgresql.org/ftp/source/
Extract the downloaded PostgreSQL tarball in your home directory
$ cd $HOME
$ tar -xvf postgresql-9.3.5.tar.gz
$ cd ~/postgresql-9.3.5
$ ./configure --prefix=$PWD --enable-depend --enable-cassert --enable-debug
$ make
$ make install
Put the bin directory of this PostgreSQL build at the front of your PATH and
verify the version
$ export PATH=$PWD/bin:$PATH
$ psql --version
psql (PostgreSQL) 9.3.5
3.) Get the latest version of the Hadoop FDW source into a directory named hadoop_fdw
under the contrib directory of the PostgreSQL source tree. You will need the
THRIFT_HOME, CPPFLAGS and CXXFLAGS environment variables set as per the previous
steps. Note that you'll need your own "userid" for pulling from bitbucket.org.
$ cd contrib
$ git clone https://userid@bitbucket.org/openscg/hadoop_fdw.git
$ make clean-hive-client
$ make hive_client
$ make clean-hbase-client
$ make hbase_client
$ make
$ make install
$ cd $THRIFT_HOME/lib
$ mv libthrift-0.9.1.so ~/postgresql-9.3.5/lib/
4.) Get a copy of Hadoop, Hive & HBase running together easily via the BigSQL bundle
$ tar -xvf bigsql-5.1-linux64.tar.bz2
$ cd bigsql-5.1
$ ./bigsql start
You can now create the "example.customer_history" table in Hive
$ . ./setENV.sh
$ cd examples/
$ ./createDelimFiles.sh
$ ./createHdfsDirectories.sh
$ hive -v -f createHiveTables.sql
$ hive -v -f loadHiveTables.sql
5.) You are now ready to use this PostgreSQL Foreign Data Wrapper to connect to
Hadoop. First, initialize and start postgresql on port 5433 so it won't
conflict with BigSQL's metastore PG instance that runs on port 5432.
$ cd ../../bin
$ ./initdb -D ../data
$ ./pg_ctl start -D ../data -o "-p 5433" -l logfile
Next, run psql and set up the FDW
$ ./psql -U user -p 5433 postgres
postgres=# CREATE EXTENSION hadoop_fdw;
postgres=#
CREATE SERVER hadoop_server FOREIGN DATA WRAPPER hadoop_fdw
OPTIONS (address '127.0.0.1', port '10000');
postgres=# CREATE USER MAPPING FOR PUBLIC SERVER hadoop_server;
Then create & query the foreign table
postgres=#
CREATE FOREIGN TABLE customer_history (
hist_id INT,
h_c_id INT,
h_c_d_id INT,
h_c_w_id INT,
h_d_id INT,
h_w_id INT,
h_date TIMESTAMP,
h_amount DECIMAL,
h_data TEXT )
SERVER hadoop_server
OPTIONS (table 'example.customer_history');
postgres=# SELECT hist_id, h_date, h_amount FROM customer_history LIMIT 3;
hist_id | h_date | h_amount
---------+-------------------------+----------
1 | 2013-08-13 03:56:56.522 | 10
2 | 2013-08-13 03:56:56.534 | 10
3 | 2013-08-13 03:56:56.534 | 10
(3 rows)
6.) By default, hadoop_fdw pushes WHERE clauses down to Hive only if the number
of rows in the relation is greater than or equal to the GUC
hadoop_fdw.hadoop_rowscanthreshold (10000 by default). You can set
hadoop_fdw.hadoop_rowscanthreshold to your desired value.
postgres=> set hadoop_fdw.hadoop_rowscanthreshold to 100;
SET
You must first run ANALYZE on the table to collect table statistics,
including the number of rows in the table.
postgres=> analyze customer_history;
ANALYZE
Note: Currently, Hive does not report valid row-count statistics back; the
count always comes out as zero. The workaround is to set the GUC
hadoop_fdw.hadoop_rowscanthreshold to 0 in order to instruct hadoop_fdw to
push WHERE clauses down to Hive.
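For example, a minimal session applying this workaround (reusing the
customer_history foreign table from step 5; the WHERE clause is just an
illustrative filter, not part of the original docs):
postgres=> set hadoop_fdw.hadoop_rowscanthreshold to 0;
SET
postgres=> SELECT hist_id, h_date, h_amount FROM customer_history
           WHERE h_amount >= 10;
With the threshold at 0, the WHERE clause is shipped to Hive instead of being
evaluated locally in PostgreSQL.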
7.) Hadoop_fdw supports the use of functions in WHERE clauses. Before pushing a
function down to Hive, hadoop_fdw verifies that it is remote-shippable by
mapping it against its list of supported functions. Some functions have
entirely different names or calling conventions in Hive, so hadoop_fdw
translates those before pushing them down.
Here is a detailed listing of the functions supported by hadoop_fdw (an
example query follows the list):
Aggregate Functions:
avg
corr
count
covar_pop
covar_samp
max
min
stddev_pop
stddev_samp
sum
var_pop
var_samp
variance
Array Functions:
unnest
Date/Time Functions:
date_mii
date_part
date_pli
to_timestamp
Mathematical Functions:
abs
ceil
ceiling
degrees
exp
floor
ln
log
mod
pi
power
radians
random
round
sign
sqrt
String Functions:
ascii
concat
concat_ws
length
lower
lpad
ltrim
repeat
reverse
rpad
rtrim
strpos
substr
to_hex
translate
trim
upper
Trigonometric Functions:
acos
asin
atan
cos
sin
tan
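For example, the following hypothetical query (against the customer_history
foreign table from step 5, not taken from the original docs) uses only
functions from the list above, so hadoop_fdw can map it and push the WHERE
clause down to Hive:
postgres=# SELECT hist_id, h_date, h_amount
           FROM customer_history
           WHERE abs(h_amount) >= 10
             AND date_part('year', h_date) = 2013;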