This document lists some of the work in the HadoopDB project that we wish to share with the open-source community. The description here is intentionally not very detailed in order not to limit creativity. We invite contributors to discuss specific issues, ideas and designs with the HadoopDB team.
1) Catalog
HadoopDB's major architectural contributions include providing a new data storage layer complementary to HDFS and creating a system that resembles a shared-nothing parallel database. An integral part of such a DBMS is a global catalog. HadoopDB's initial catalog is rudimentary both in the scope of the meta-information maintained and in its implementation. Currently, the catalog specifies, for each dataset, the locations of its database chunks and their replicas, along with connection parameters. The catalog is stored as an XML file in HDFS and is accessed by the JobTracker when scheduling a job.
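To make the current format concrete, a catalog entry of this kind might look roughly as follows. This is only a sketch: all element and attribute names (and the connection details) are hypothetical, not the actual HadoopDB schema.

```xml
<!-- Hypothetical sketch of a catalog entry; not the actual HadoopDB schema. -->
<catalog>
  <dataset name="lineitem">
    <chunk id="0">
      <replica node="slave1"
               url="jdbc:postgresql://slave1:5432/lineitem_chunk0"
               driver="org.postgresql.Driver"
               user="hadoopdb" password="secret"/>
      <replica node="slave2"
               url="jdbc:postgresql://slave2:5432/lineitem_chunk0"
               driver="org.postgresql.Driver"
               user="hadoopdb" password="secret"/>
    </chunk>
  </dataset>
</catalog>
```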
We want to make the catalog more comprehensive and robust.
Additional information worth storing in the catalog includes:
- table definitions such as attributes, datatypes, primary/foreign key(s)
- dataset / partitioning information such as collocated tables, partitioning key(s), clustering keys (sort order)
- statistics such as table size estimates, row counts and the cardinality of each column
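As a sketch, the extra metadata listed above might be recorded alongside each dataset entry like this (again, every element and attribute name here is hypothetical, not a proposal for the final schema):

```xml
<!-- Hypothetical extension of a catalog entry with schema and statistics. -->
<table name="lineitem" rows="6000000" sizeBytes="759000000">
  <column name="l_orderkey" type="bigint" cardinality="1500000"/>
  <column name="l_shipdate" type="date" cardinality="2526"/>
  <primaryKey columns="l_orderkey"/>
  <partitioning key="l_orderkey" collocatedWith="orders"/>
  <clustering key="l_shipdate"/>
</table>
```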
The above information will be very useful for future optimizations in job planning and execution.
The catalog ought to keep its table definitions in sync with the other components of HadoopDB, including Hive and the single-node DBMS servers. In particular, HadoopDB users should interact only with the catalog; all relevant information should be propagated to the other components automatically.
In terms of implementation, we propose an HDFS-like infrastructure (a DBNameNode on the master node and DBDataNodes on the slaves).
Runtime information on each database chunk might also be helpful when deciding which replica should be chosen for a given task.
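One way such runtime information could be used is sketched below. This is illustrative only, not HadoopDB code: it assumes each node reports a simple load figure (here, the number of running queries, a made-up metric) and picks the least-loaded replica of a chunk.

```java
import java.util.Comparator;
import java.util.List;

/**
 * Illustrative sketch (not HadoopDB code): given runtime load figures for
 * the replicas of a database chunk, pick the least-loaded one for a task.
 */
public class ReplicaChooser {

    /** Minimal replica descriptor; the field names are hypothetical. */
    public static final class Replica {
        public final String node;
        public final int runningQueries;   // runtime info reported by the node

        public Replica(String node, int runningQueries) {
            this.node = node;
            this.runningQueries = runningQueries;
        }
    }

    /** Choose the replica with the fewest running queries. */
    public static Replica choose(List<Replica> replicas) {
        return replicas.stream()
                .min(Comparator.comparingInt(r -> r.runningQueries))
                .orElseThrow(() -> new IllegalArgumentException("no replicas"));
    }

    public static void main(String[] args) {
        List<Replica> rs = List.of(
                new Replica("slave1", 3),
                new Replica("slave2", 1),
                new Replica("slave3", 2));
        System.out.println(choose(rs).node);  // least-loaded replica
    }
}
```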
2) Data placement
Some other features present in HDFS that would be welcome in HadoopDB include rack-aware data placement and restoring lost replicas.
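HDFS's default placement policy can serve as a model: the first replica goes on the writer's node, the second on a node in a different rack, and the third on another node in that second rack. The sketch below illustrates that policy; it is not HadoopDB code, and the classes and field names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch of HDFS-style rack-aware replica placement (hypothetical, not
 * HadoopDB code): first replica on the writer's node, second on a node in
 * a different rack, third on another node in that same remote rack.
 */
public class RackAwarePlacement {

    public static final class Node {
        public final String name;
        public final String rack;

        public Node(String name, String rack) {
            this.name = name;
            this.rack = rack;
        }
    }

    public static List<Node> place(Node writer, List<Node> cluster) {
        List<Node> replicas = new ArrayList<>();
        replicas.add(writer);                                 // 1st: local node
        Node second = cluster.stream()
                .filter(n -> !n.rack.equals(writer.rack))
                .findFirst().orElseThrow();                   // 2nd: remote rack
        replicas.add(second);
        Node third = cluster.stream()
                .filter(n -> n.rack.equals(second.rack) && n != second)
                .findFirst().orElse(second);                  // 3rd: same remote rack
        if (third != second) {
            replicas.add(third);
        }
        return replicas;
    }
}
```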
3) Data Loader
As described in our Quick Start Guide, a data load requires several manual steps. We believe much of this process could be automated.
4) SMS Planner
HadoopDB's SQL interface relies on the Hive project, extended with our SMS Planner, which pushes as much query processing as possible into the single-node databases as SQL queries. See http://issues.apache.org/jira/browse/HIVE-721 for details of our initial proposal on the patch.
The SMS Planner in the current implementation does not support the full SQL syntax. The work here would involve recognizing more SQL functions, operators, aggregates, subqueries and joins in order to push them into the database servers.
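Conceptually, a pushdown rule collapses the leaf of a Hive plan (scan, filter, partial aggregate) into a single SQL query that each node's DBMS executes locally. The sketch below illustrates the idea as simple string construction; the actual SMS Planner rewrites Hive operator trees, not strings, and this method is purely hypothetical.

```java
/**
 * Illustrative sketch: collapse a scan + filter + partial aggregate into
 * one SQL query to be run by each single-node database. Not the actual
 * SMS Planner, which rewrites Hive operator trees rather than strings.
 */
public class PushdownSketch {

    /** Build the per-node SQL for SELECT key, agg(...) ... GROUP BY key. */
    public static String pushAggregate(String table, String filter,
                                       String groupKey, String aggExpr) {
        return "SELECT " + groupKey + ", " + aggExpr
                + " FROM " + table
                + (filter == null ? "" : " WHERE " + filter)
                + " GROUP BY " + groupKey;
    }

    public static void main(String[] args) {
        // A filtered aggregate that can run entirely inside each node's DBMS.
        System.out.println(pushAggregate(
                "lineitem", "l_shipdate < '1998-09-01'",
                "l_returnflag", "SUM(l_quantity)"));
    }
}
```

The point of the pushdown is that filtering and partial aggregation happen inside the databases, so only small per-group partial results flow into the MapReduce layer for the final merge.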
If you find any of these topics interesting, would like to ask any questions and/or share some ideas, please contact us.
Hi, I am working on a project where I want to move a Postgres database onto a Hadoop HDFS infrastructure. I am thinking of using HadoopDB for this purpose. My current application uses Hibernate to insert/update/select/delete records in the database. I have these questions:
1- Should we create the same database schema on all data nodes in the cluster?
2- How does the data get dispatched over the data nodes' databases? Is the data duplicated?
3- What kinds of changes would I need to make to my current application to talk to HadoopDB, given that the server is the same (Postgres) and the database schema also remains the same?
Thanks in advance
Your application seems to be a transactional system, not an analytical one. HadoopDB was designed for analytics, so we do not support single inserts/updates/deletes, only batched loads. Please refer to the architecture described in our paper (www.hadoopdb.net). Hive offers a JDBC connector that would allow your application to talk to HadoopDB.
Yes, my application is a transactional system, and I am expecting a huge volume of data in the future. I am exploring the possibility of using HadoopDB for this purpose. Since Hive offers a JDBC connector to HadoopDB, I am wondering what would prevent using HadoopDB in transactional systems.
Also, can you tell me how the data is dispatched over the HadoopDB databases on the data nodes? Is it duplicated, or is there some logic that spreads the data over the databases?
Thank you for your help
Sorry for the late response. HadoopDB wasn't designed for transactional workloads. With some tweaking it might be possible to adapt it to serve short requests better, but then the whole Hadoop infrastructure seems irrelevant. The only thing left would be a shared database with a central node that knows where to direct each request. Btw. you may want to consider HBase or Cassandra if you need a highly scalable datastore featuring CRUD operations.
We used replication akin to Hadoop's. Please check out our paper for more details: http://db.cs.yale.edu/hadoopdb/hadoopdb.pdf
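To sketch how data gets dispatched (this is an illustration, not HadoopDB's loader code): during a batch load, each row can be assigned to a database chunk by hashing its partitioning key, and each chunk is then replicated on several nodes, much like HDFS blocks. The replica assignment scheme below is a made-up example.

```java
import java.util.Arrays;

/**
 * Illustrative sketch (not HadoopDB code) of hash partitioning rows into
 * database chunks and replicating each chunk across nodes.
 */
public class HashPartitionSketch {

    /** Chunk for a row: hash of the partitioning key modulo chunk count. */
    public static int chunkFor(String partitionKey, int numChunks) {
        return Math.floorMod(partitionKey.hashCode(), numChunks);
    }

    /** Nodes holding the replicas of a chunk (consecutive nodes, wrapping). */
    public static int[] replicaNodes(int chunk, int numNodes, int replication) {
        int[] nodes = new int[replication];
        for (int i = 0; i < replication; i++) {
            nodes[i] = (chunk + i) % numNodes;
        }
        return nodes;
    }

    public static void main(String[] args) {
        int chunk = chunkFor("orderkey-42", 8);
        System.out.println("chunk " + chunk + " on nodes "
                + Arrays.toString(replicaNodes(chunk, 5, 2)));
    }
}
```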
A small correction to the above post: I meant "sharded" not "shared".
We currently have one running project that uses an RDBMS (with lots of tables and stored procedures for manipulating data). The current flow is: the data access layer calls stored procedures, which insert/delete/update or fetch data from the RDBMS (please note that these stored procedures do not do any bulk processing). I just want to know whether we can use HadoopDB for our purpose, and if so, how can we use HadoopDB with our existing RDBMS?
Your application is essentially doing transactional work. HadoopDB is a poor fit for such a use case. We excel at data-warehousing-style workloads, i.e. batch loads, large scans and aggregations over append-only data.
There are other Hadoop-based systems that might be helpful in your case, e.g. HBase.