CloudBase / Feature Requests / #15 Specyfying data files from select statement

Specyfying data files from select statement

Status: Beta

Brought to you by: ldagum, pdagum, tsingh, yanboru

#15 Specyfying data files from select statement

Status: open

Owner: nobody

Labels: None

Priority: 5

Updated: 2009-09-17

Created: 2009-09-17

Creator: YoungWoo Kim

Private: No

As a user, I want to specify some data files when select data from table. like below, I want to select data from specified files in directory not entire table(data directory):

I have a table 'test_table9' and the table has data files in data directroy (e.g. /user/hadoop/cloudbase/data/TEST_TABLE9/)
Files in that directory:
data1.txt
data2.txt
data3.txt

SELECT * FROM test_table9 DATAFILE ('data1.txt', 'data2.txt') WHERE col1='...';

Discussion

Tarandeep Singh - 2009-09-17

In the next release (2.0), I have added support for partitioning in CloudBase. When you create a table, you can specify partition and sub partition clause-

CREATE TABLE tablename
( c1 int, c2 datetime, c3 varchar ...)
PARTITION BY RANGE | HASH | LIST
SUBPARTITION BY RANGE | HASH | LIST

The syntax is similar to MySql.

Although, what you have asked is different than partitioning, but was wondering, if the partitioning scheme would help?

Further, the data won't be stored in the flat files (as is the current scheme), the data will be stored in binary form and compressed. I have done tests and have got lot of speed up. And due to partitioning / subpartitioning the data won't be in the same files as you specified during data load. So it will be hard to find the right files that you want to query on.

The syntax for data load will be -

LOAD DATA inpath [ HDFS | LOCAL ] 'path/to/data/dir' INTO tablename
[ LINES separated BY 'sep']
[ FIELDS separated BY 'sep' OPTIONALLY ENCLOSED BY 'char' ESCAPE BY 'char' ]

Again the syntax is similar to MySql.

The HDFS option would mean data is already in HDFS, so no need to copy to HDFS. The LOCAL option would mean data is on the host from where client has made the connection. The absence of HDFS and LOCAL would mean the data is present on the machine where CloudBase is running.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

YoungWoo Kim - 2009-09-18

Sounds cool!
Yes, I mean It was a sort of partition pruning.

New partitioning feature will support truncate individual partition or subpartition, not entire table? If yes, It would be very useful.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Tarandeep Singh - 2009-09-18

You can use partition/subpartition names where ever you can use tables. For example - select * from table.partition_name or select * from table.partition_name.subpartition_name

Similarly, Truncate table.partition_name and Truncate table.partition_name.subpartition_name

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Specyfying data files from select statement

Group

Searches

Help

#15 Specyfying data files from select statement

Discussion