This is an offshoot of the open source data quality (osDQ) project: https://sourceforge.net/projects/dataquality/

This subproject creates an Apache Spark based data pipeline in which a JSON metadata file drives data processing, data pipeline, data quality, data preparation, and data modeling features for big data. It uses the Java API of Apache Spark and can also run in local mode.

Get JSON examples at https://github.com/arrahtech/osdq-spark

How to run

Unzip the downloaded zip file, then run one of the commands below from the unzipped directory.

Windows : java -cp .\lib\*;osdq-spark-0.0.1.jar org.arrah.framework.spark.run.TransformRunner -c .\example\samplerun.json

Mac / UNIX : java -cp ./lib/*:./osdq-spark-0.0.1.jar org.arrah.framework.spark.run.TransformRunner -c ./example/samplerun.json

Windows users need a Hadoop distribution unzipped on a local drive and the HADOOP_HOME environment variable set to that directory. Also copy winutils.exe into HADOOP_HOME\bin.
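
The Windows requirement above can be checked before launching the runner. The following is a minimal sketch only: the class HadoopHomeCheck and the method findProblem are hypothetical names and are not part of osdq-spark. It assumes the only things worth verifying are that HADOOP_HOME is set and that bin\winutils.exe exists beneath it.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

// Hypothetical pre-flight helper; not part of osdq-spark.
class HadoopHomeCheck {

    // Returns an empty Optional when the layout looks usable,
    // otherwise a short description of the first problem found.
    static Optional<String> findProblem(String hadoopHome) {
        if (hadoopHome == null || hadoopHome.trim().isEmpty()) {
            return Optional.of("HADOOP_HOME is not set");
        }
        // Expected location of winutils.exe: HADOOP_HOME/bin/winutils.exe
        Path winutils = Paths.get(hadoopHome, "bin", "winutils.exe");
        if (!Files.isRegularFile(winutils)) {
            return Optional.of("winutils.exe not found at " + winutils);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Reads the real environment variable when run directly.
        Optional<String> problem = findProblem(System.getenv("HADOOP_HOME"));
        System.out.println(problem.orElse("HADOOP_HOME looks usable"));
    }
}
```

Run it once on the Windows machine before starting TransformRunner; any message other than "HADOOP_HOME looks usable" points at the step that still needs attention.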

Features

  • Build data pipelines using Join, Filter, Aggregate, and Case statements
  • Data quality operations - replace, drop, join
  • Data profiling, including column-based profiling
  • Fuzzy join - cosine distance and others
  • Classification and sampling - random forest, multi-class neural network
  • Data normalization - z-score, standard deviation, ratio score
  • Sampling - random, stratified, key-based
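
To illustrate what the z-score normalization feature listed above computes, here is a minimal plain-Java sketch. It is not the project's implementation and does not use Spark. The class ZScoreSketch and the method zScores are hypothetical names, and the sketch assumes the population standard deviation (dividing by n); the project may use the sample standard deviation instead.

```java
import java.util.Arrays;

// Hypothetical illustration of z-score normalization; not osdq-spark code.
class ZScoreSketch {

    // Maps each value v to (v - mean) / std, using the population
    // standard deviation (divide by n). This choice is an assumption.
    static double[] zScores(double[] values) {
        if (values.length == 0) {
            return new double[0];
        }
        double mean = Arrays.stream(values).average().orElse(0.0);
        double variance = Arrays.stream(values)
                .map(v -> (v - mean) * (v - mean))
                .sum() / values.length;
        double std = Math.sqrt(variance);

        double[] out = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            // A constant column has std == 0; emit 0.0 rather than NaN.
            out[i] = (std == 0.0) ? 0.0 : (values[i] - mean) / std;
        }
        return out;
    }

    public static void main(String[] args) {
        double[] column = {2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0};
        // mean = 5.0, population std = 2.0
        System.out.println(Arrays.toString(zScores(column)));
        // prints [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
    }
}
```

After normalization the column has mean 0 and standard deviation 1, which puts columns measured in different units on a comparable scale before modeling.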

License

GNU General Public License version 3.0 (GPLv3)

Additional Project Details

Intended Audience

Architects, Information Technology, Other Audience

User Interface

Console/Terminal

Programming Language

Java, Scala

Related Categories

Java Data Warehousing Software, Java Business Intelligence Software, Java ETL Tool, Java Data Pipeline Tool, Java Data Quality Tool, Scala Data Warehousing Software, Scala Business Intelligence Software, Scala ETL Tool, Scala Data Pipeline Tool, Scala Data Quality Tool

Registered

2016-06-17