This is an offshoot of the open source data quality (osDQ) project: https://sourceforge.net/projects/dataquality/

This subproject builds an Apache Spark based data pipeline in which a JSON metadata file drives data processing, data pipeline, data quality, data preparation, and data modeling features for big data. It uses the Java API of Apache Spark and can also run in local mode.

JSON examples are available at https://github.com/arrahtech/osdq-spark

How to run

Unzip the zip file, then run the command for your platform from the unzipped directory.

Windows
java -cp .\lib\*;osdq-spark-0.0.1.jar org.arrah.framework.spark.run.TransformRunner -c .\example\samplerun.json

Mac / UNIX
java -cp ./lib/*:./osdq-spark-0.0.1.jar org.arrah.framework.spark.run.TransformRunner -c ./example/samplerun.json

Windows users also need a Hadoop distribution unzipped on a local drive and HADOOP_HOME set to that location. Also copy winutils.exe from here into HADOOP_HOME\bin
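The two commands above differ only in the classpath separator (; on Windows, : on Mac / UNIX) and the file separator. As a rough sketch, the launch command could be assembled for the current platform as shown here. The class name LaunchCommandSketch and the method buildCommand are hypothetical and are not part of osdq-spark; only the main class, jar name, and config path are taken from the commands above. The sketch writes the jar as .\osdq-spark-0.0.1.jar on Windows, which assumes the jar sits in the current directory, as the original command implies.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of osdq-spark: assembles the launch command
// shown above using the separators of the platform it runs on.
final class LaunchCommandSketch {

    // Values copied from the run commands in this document.
    static final String MAIN_CLASS = "org.arrah.framework.spark.run.TransformRunner";
    static final String JAR_NAME = "osdq-spark-0.0.1.jar";

    // pathSep is ";" on Windows and ":" on Mac / UNIX; fileSep is "\" or "/".
    static List<String> buildCommand(String pathSep, String fileSep) {
        String classpath = "." + fileSep + "lib" + fileSep + "*"
                + pathSep + "." + fileSep + JAR_NAME;
        String config = "." + fileSep + "example" + fileSep + "samplerun.json";

        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-cp");
        cmd.add(classpath);
        cmd.add(MAIN_CLASS);
        cmd.add("-c");
        cmd.add(config);
        return cmd;
    }

    public static void main(String[] args) {
        // Print the command for the current platform.
        List<String> cmd = buildCommand(File.pathSeparator, File.separator);
        System.out.println(String.join(" ", cmd));

        // On Windows, warn if HADOOP_HOME is missing, since the note above requires it.
        boolean windows = System.getProperty("os.name").toLowerCase().startsWith("windows");
        if (windows && System.getenv("HADOOP_HOME") == null) {
            System.err.println("HADOOP_HOME is not set; see the Windows note above.");
        }
    }
}
```

The printed command can be pasted into a terminal, or the returned list can be passed to java.lang.ProcessBuilder to start the runner from another Java program.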

Features

  • Create data pipelines using Join, Filter, Aggregate, and Case statements
  • Data quality - replace, drop, join
  • Data profiling, column-based profiling
  • Fuzzy join - cosine distance and others
  • Classification and sampling - random forest, multi-class neural network
  • Data normalization - z-score, standard deviation, ratio score
  • Sampling - random, stratified, key based
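To make two of the listed features concrete, here is a minimal plain-Java sketch of z-score normalization and cosine distance. It only illustrates the underlying formulas and is not the osdq-spark implementation, which runs on Spark. The class FeatureMathSketch and its methods are hypothetical, the z-score uses the population standard deviation (an assumption), and the cosine distance is computed on numeric vectors, whereas a fuzzy join would first have to turn the compared values into such vectors.

```java
// Hypothetical illustration, not osdq-spark code: the formulas behind
// z-score normalization and cosine distance on plain double arrays.
final class FeatureMathSketch {

    // Returns (x - mean) / stdDev for each value, using the population
    // standard deviation. A constant column maps to all zeros.
    static double[] zScore(double[] values) {
        int n = values.length;
        double[] result = new double[n];
        if (n == 0) {
            return result;
        }
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        double mean = sum / n;

        double squaredDiffs = 0.0;
        for (double v : values) {
            squaredDiffs += (v - mean) * (v - mean);
        }
        double stdDev = Math.sqrt(squaredDiffs / n);

        for (int i = 0; i < n; i++) {
            result[i] = (stdDev == 0.0) ? 0.0 : (values[i] - mean) / stdDev;
        }
        return result;
    }

    // Returns 1 - cos(angle between a and b): 0 for vectors pointing the
    // same way, 1 for orthogonal vectors.
    static double cosineDistance(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same length");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            throw new IllegalArgumentException("Cosine distance is undefined for a zero vector");
        }
        return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[] z = zScore(new double[] {1.0, 2.0, 3.0});
        System.out.println(java.util.Arrays.toString(z));
        System.out.println(cosineDistance(new double[] {1.0, 0.0}, new double[] {0.0, 1.0}));
    }
}
```

In a fuzzy join, a pair of records would typically be treated as a match when the distance falls below a chosen threshold; the threshold and the way values are vectorized are design choices this sketch leaves open.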

License

GNU General Public License version 3.0 (GPLv3)

Additional Project Details

Intended Audience

Information Technology, Other Audience, Architects

User Interface

Console/Terminal

Programming Language

Java, Scala

Related Categories

Java Data Warehousing Software, Java Business Intelligence Software, Java ETL Tool, Java Data Pipeline Tool, Java Data Quality Tool, Scala Data Warehousing Software, Scala Business Intelligence Software, Scala ETL Tool, Scala Data Pipeline Tool, Scala Data Quality Tool

Registered

2016-06-17