0.) Our source files are liberally licensed under the Apache License, with
portions copyright the PostgreSQL Global Development Group.
1.) Download and build the Thrift components
Hadoop FDW requires Thrift v0.9.1 (http://thrift.apache.org/)
Make sure you have reviewed and installed the Thrift build prerequisites:
http://thrift.apache.org/docs/install/
Extract the downloaded Thrift tarball in your home directory and set the
THRIFT_HOME environment variable
$ cd $HOME
$ tar -xvf thrift-0.9.1.tar.gz
$ export THRIFT_HOME=$HOME/thrift-0.9.1
$ cd $THRIFT_HOME
Set the compiler flag needed for building Thrift and fb303
$ export CXXFLAGS="-fPIC"
Configure thrift for compiling on your platform
$ ./configure --prefix=$THRIFT_HOME --without-csharp --without-java --without-erlang --without-python --without-perl --without-php --without-ruby --without-haskell --without-go
Build Thrift
$ make
$ make install
Build fb303
$ cd contrib/fb303
$ ./bootstrap.sh
$ export CPPFLAGS="-I$THRIFT_HOME/include -fPIC"
$ ./configure --prefix=$THRIFT_HOME --with-thriftpath=$THRIFT_HOME
$ make
$ sudo make install
2.) Download and build PostgreSQL 9.3.5 from source at
http://www.postgresql.org/ftp/source/
Extract the downloaded PostgreSQL tarball in your home directory
$ cd $HOME
$ tar -xvf postgresql-9.3.5.tar.gz
$ cd ~/postgresql-9.3.5
$ ./configure --prefix=$PWD --enable-depend --enable-cassert --enable-debug
$ make
$ make install
Put the bin directory of this PostgreSQL build at the front of your PATH and
verify the version
$ export PATH=$PWD/bin:$PATH
$ psql --version
psql (PostgreSQL) 9.3.5
3.) Get the latest version of the Hadoop FDW source into a directory named hadoop_fdw
under the contrib directory of the PostgreSQL source tree. You will need the
THRIFT_HOME, CPPFLAGS and CXXFLAGS environment variables set as per the previous
steps. Note that you'll need your own "userid" for pulling from bitbucket.org.
$ cd contrib
$ git clone https://userid@bitbucket.org/openscg/hadoop_fdw.git
$ make clean-hive-client
$ make hive_client
$ make clean-hbase-client
$ make hbase_client
$ make
$ make install
$ cd $THRIFT_HOME/lib
$ mv libthrift-0.9.1.so ~/postgresql-9.3.5/lib/
4.) Get a copy of Hadoop, Hive & HBase running together easily via the BigSQL bundle
$ tar -xvf bigsql-5.1-linux64.tar.bz2
$ cd bigsql-5.1
$ ./bigsql start
You can now create the "example.customer_history" table in Hive
$ . ./setENV.sh
$ cd examples/
$ ./createDelimFiles.sh
$ ./createHdfsDirectories.sh
$ hive -v -f createHiveTables.sql
$ hive -v -f loadHiveTables.sql
5.) You are now ready to use this PostgreSQL Foreign Data Wrapper to connect to
Hadoop. First, initialize and start postgresql on port 5433 so it won't
conflict with BigSQL's metastore PG instance that runs on port 5432.
$ cd ../../bin
$ ./initdb -D ../data
$ ./pg_ctl start -D ../data -o "-p 5433" -l logfile
Next, run psql and set up the FDW
$ ./psql -U user -p 5433 postgres
postgres=# CREATE EXTENSION hadoop_fdw;
postgres=#
CREATE SERVER hadoop_server FOREIGN DATA WRAPPER hadoop_fdw
OPTIONS (address '127.0.0.1', port '10000');
postgres=# CREATE USER MAPPING FOR PUBLIC SERVER hadoop_server;
Then create & query the foreign table
postgres=#
CREATE FOREIGN TABLE customer_history (
hist_id INT,
h_c_id INT,
h_c_d_id INT,
h_c_w_id INT,
h_d_id INT,
h_w_id INT,
h_date TIMESTAMP,
h_amount DECIMAL,
h_data TEXT )
SERVER hadoop_server
OPTIONS (table 'example.customer_history');
postgres=# SELECT hist_id, h_date, h_amount FROM customer_history LIMIT 3;
hist_id | h_date | h_amount
---------+-------------------------+----------
1 | 2013-08-13 03:56:56.522 | 10
2 | 2013-08-13 03:56:56.534 | 10
3 | 2013-08-13 03:56:56.534 | 10
(3 rows)
6.) By default, hadoop_fdw pushes WHERE clauses down to Hive only if the number
of rows in the relation is greater than or equal to the GUC
hadoop_fdw.hadoop_rowscanthreshold (10000 by default). You can set
hadoop_fdw.hadoop_rowscanthreshold to your desired value.
postgres=> set hadoop_fdw.hadoop_rowscanthreshold to 100;
SET
You must first run ANALYZE on the table to collect table statistics,
including the number of rows in the table.
postgres=> analyze customer_history;
ANALYZE
Note: Currently, Hive does not report valid row-count statistics back; the
count always comes out as zero. The workaround is to set the GUC
hadoop_fdw.hadoop_rowscanthreshold to 0 in order to instruct hadoop_fdw to
push WHERE clauses down to Hive.
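For example, a minimal session applying this workaround (reusing the
customer_history foreign table from step 5; the WHERE clause is just an
illustrative filter, not part of the original docs):
postgres=> set hadoop_fdw.hadoop_rowscanthreshold to 0;
SET
postgres=> SELECT hist_id, h_date, h_amount FROM customer_history
           WHERE h_amount >= 10;
With the threshold at 0, the WHERE clause is shipped to Hive instead of being
evaluated locally in PostgreSQL.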
7.) Hadoop_fdw supports the use of functions in WHERE clauses. Before pushing a
function down to Hive, hadoop_fdw verifies that it is remote-shippable by
mapping it against its list of supported functions. Some functions have
entirely different names or calling conventions in Hive, so hadoop_fdw
translates those before pushing them down.
Here is a detailed listing of the functions supported by hadoop_fdw (an
example query follows the list):
Aggregate Functions:
avg
corr
count
covar_pop
covar_samp
max
min
stddev_pop
stddev_samp
sum
var_pop
var_samp
variance
Array Functions:
unnest
Date/Time Functions:
date_mii
date_part
date_pli
to_timestamp
Mathematical Functions:
abs
ceil
ceiling
degrees
exp
floor
ln
log
mod
pi
power
radians
random
round
sign
sqrt
String Functions:
ascii
concat
concat_ws
length
lower
lpad
ltrim
repeat
reverse
rpad
rtrim
strpos
substr
to_hex
translate
trim
upper
Trigonometric Functions:
acos
asin
atan
cos
sin
tan
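For example, the following hypothetical query (against the customer_history
foreign table from step 5, not taken from the original docs) uses only
functions from the list above, so hadoop_fdw can map it and push the WHERE
clause down to Hive:
postgres=# SELECT hist_id, h_date, h_amount
           FROM customer_history
           WHERE abs(h_amount) >= 10
             AND date_part('year', h_date) = 2013;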