Cuelake Code

Use SQL to build ELT pipelines on a data lakehouse

Brought to you by: vikrantd

File	Date	Author	Commit
.github	2021-06-14	Vikrant Dubey	[f0255b] Update pr_checks.yml
api	2021-06-15	Vikrant Dubey	[c2721a] Refactoring notebook jobs services
docs	2021-06-15	Vikrant Dubey	[599eca] Update index.md
ui	2021-06-11	PraveenCuebook	[05c820] scheduleTestCases removed dead code from schedules
zeppelinConf	2021-06-08	Prabhat Dubey	[3a77ed] Update interpreter.json
.gitallowed	2021-05-26	vincue	[b241a6] Adding pushed key to gitallowed
.gitignore	2021-05-19	vincue	[cb0da0] Gitignored package-lock.json
CODE_OF_CONDUCT.md	2021-04-13	Sachin Bansal	[f289ed] Create CODE_OF_CONDUCT.md
Dockerfile	2021-05-06	Vikrant	[86608f] Fixing build issues
LICENSE	2021-04-12	vikrantcue	[f1c617] Create LICENSE
README.md	2021-06-15	Vikrant Dubey	[66245d] Update README.md
cuelake.yaml	2021-06-09	Prabhat Dubey	[510f0a] Adding pods access in default role
nginx.conf	2021-06-03	Vikrant Dubey	[790fb8] Spark UI fixes

Read Me

With CueLake, you can use SQL to build ELT (Extract, Load, Transform) pipelines on a data lakehouse.

You write Spark SQL statements in Zeppelin notebooks. You then schedule these notebooks using workflows (DAGs).

To extract and load incremental data, you write simple select statements. CueLake executes these statements against your databases and then merges incremental data into your data lakehouse (powered by Apache Iceberg).
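The merge step can be pictured with a small sketch. The Python helper below is hypothetical and is not CueLake's own code: it only assembles the kind of Spark SQL `MERGE INTO` statement that Iceberg supports, assuming a target table, a select statement that returns the incremental rows, and a list of key columns. The table and column names are made up for illustration.

```python
def build_merge_statement(target_table, source_select, key_columns):
    """Return a Spark SQL MERGE INTO statement that upserts incremental rows.

    All names here are illustrative assumptions; CueLake's internal
    implementation may differ.
    """
    if not key_columns:
        raise ValueError("at least one key column is required")
    # Match rows in the target (t) and the incremental source (s) on every key column.
    on_clause = " AND ".join(f"t.{col} = s.{col}" for col in key_columns)
    return (
        f"MERGE INTO {target_table} t\n"
        f"USING ({source_select}) s\n"
        f"ON {on_clause}\n"
        "WHEN MATCHED THEN UPDATE SET *\n"
        "WHEN NOT MATCHED THEN INSERT *"
    )


if __name__ == "__main__":
    # Hypothetical incremental extract: rows changed since the last run.
    statement = build_merge_statement(
        target_table="lakehouse.db.orders",
        source_select="SELECT * FROM staged_orders WHERE updated_at > '2021-06-01'",
        key_columns=["order_id"],
    )
    print(statement)
```

Rows whose keys already exist in the target are updated, and all other rows are inserted, which is what makes the load an upsert rather than an append.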

To transform data, you write SQL statements to create views and tables in your data lakehouse.

CueLake uses Celery as the executor and celery-beat as the scheduler. Celery jobs trigger Zeppelin notebooks. Zeppelin auto-starts and stops the Spark cluster for every scheduled run of notebooks.
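As a rough illustration of how a job could trigger a notebook, the sketch below posts to Zeppelin's REST endpoint for running all paragraphs of a note (`/api/notebook/job/{noteId}`). It is a minimal sketch, not CueLake's actual task code: the function names, the base URL, and the note ID are assumptions, and in CueLake the call would be made from inside a Celery task rather than directly.

```python
import json
import urllib.request


def notebook_run_url(zeppelin_base_url, note_id):
    """Build the Zeppelin REST endpoint that runs all paragraphs of a note."""
    return f"{zeppelin_base_url.rstrip('/')}/api/notebook/job/{note_id}"


def run_notebook(zeppelin_base_url, note_id, timeout=30):
    """POST to Zeppelin to start the notebook and return the decoded JSON reply.

    The base URL and note ID are placeholders supplied by the caller.
    """
    request = urllib.request.Request(
        notebook_run_url(zeppelin_base_url, note_id), data=b"", method="POST"
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

A scheduler such as celery-beat would then only need to enqueue a task that calls `run_notebook` with the right note ID at the scheduled time.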

To learn why we are building CueLake, read our viewpoint.

Getting started

CueLake is installed on Kubernetes with kubectl. Create a namespace and then install using the cuelake.yaml file. Creating a namespace is optional: you can also install in the default namespace or in any existing namespace.

In the commands below, we use cuelake as the namespace.

kubectl create namespace cuelake
kubectl apply -f https://raw.githubusercontent.com/cuebook/cuelake/main/cuelake.yaml -n cuelake
kubectl port-forward services/lakehouse 8080:80 -n cuelake

Now visit http://localhost:8080 in your browser.

If you don’t want to use Kubernetes and instead want to try it out on your local machine first, we’ll soon have a docker-compose version. Let us know if you’d want that sooner.

Features

Upsert incremental data. CueLake uses Iceberg’s MERGE INTO query to automatically merge incremental data.
Create views in the data lakehouse. CueLake enables you to create views over Iceberg tables.
Create DAGs. Group notebooks into workflows and create DAGs of these workflows.
Elastically scale cloud infrastructure. CueLake uses Zeppelin to automatically create and delete the Kubernetes resources required to run data pipelines.
Built-in scheduler to schedule your pipelines.
Automated maintenance of Iceberg tables. CueLake expires snapshots, removes old metadata and orphan files, and compacts data files.
Monitoring. Get Slack alerts when a pipeline fails. CueLake maintains detailed logs.
Versioning in GitHub. Commit and maintain versions of your Zeppelin notebooks in GitHub.
Data security. Your data always stays within your cloud account.
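To make the DAG feature above more concrete, here is a minimal sketch of how a set of dependent workflows can be put into a valid execution order. It uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9+) and is not taken from CueLake's code; the workflow names and the `execution_order` helper are hypothetical.

```python
from graphlib import TopologicalSorter


def execution_order(dependencies):
    """Return workflow names in an order that respects their dependencies.

    `dependencies` maps each workflow to the set of workflows that must
    finish before it starts.
    """
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    # Hypothetical DAG: two extract workflows feed a transform workflow,
    # which in turn feeds a final refresh workflow.
    dag = {
        "extract_orders": set(),
        "extract_customers": set(),
        "build_sales_view": {"extract_orders", "extract_customers"},
        "refresh_dashboard_tables": {"build_sales_view"},
    }
    print(execution_order(dag))
```

If the dependencies contain a cycle, `TopologicalSorter` raises `graphlib.CycleError`, which is a convenient way to reject an invalid DAG before anything runs.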

Current Limitations

Supports AWS S3 as a destination. Support for ADLS and GCS is on the roadmap.
Uses Apache Iceberg as an open table format. Delta support is on the roadmap.
Uses Celery for scheduling jobs. Support for Airflow is on the roadmap.

Support

For general help using CueLake, read the documentation or go to GitHub Discussions.

To report a bug or request a feature, open an issue.

Community

Join our CueLake Discord server and ask the developers your questions directly.

Contributing

We'd love contributions to CueLake. Before you contribute, please first discuss the change you wish to make via an issue or a discussion. Contributors are expected to adhere to our code of conduct.
