We're excited to announce the launch of a new open-source product, data-diff that makes comparing datasets across databases fast at any scale. data-diff automates data quality checks for data replication and migration. In modern data platforms, data is constantly moving between systems, and at the modern data volume and complexity, systems go out of sync all the time. Until now, there has not been any tooling to ensure that when the data is correctly copied. Replicating data at scale, across hundreds of tables, with low latency and at a reasonable infrastructure cost is a hard problem, and most data teams we’ve talked to, have faced data quality issues in their replication processes. The hard truth is that the quality of the replication is the quality of the data. Since copying entire datasets in batch is often infeasible at the modern data scale, businesses rely on the Change Data Capture (CDC) approach of replicating data using a continuous stream of updates.

Features

  • Find mismatches across databases
  • Outputs diff of rows in detail
  • Simple CLI/API to create monitoring and alerts
  • Verify 25M+ rows in <10s, and 1B+ rows in ~5min
  • Verifies across many different databases
  • Works for tables with 10s of billions of rows

Project Samples

Project Activity

See All Activity >

License

MIT License

Follow data-diff

data-diff Web Site

Other Useful Business Software
Gen AI apps are built with MongoDB Atlas Icon
Gen AI apps are built with MongoDB Atlas

Build gen AI apps with an all-in-one modern database: MongoDB Atlas

MongoDB Atlas provides built-in vector search and a flexible document model so developers can build, scale, and run gen AI apps without stitching together multiple databases. From LLM integration to semantic search, Atlas simplifies your AI architecture—and it’s free to get started.
Start Free
Rate This Project
Login To Rate This Project

User Reviews

Be the first to post a review of data-diff!

Additional Project Details

Programming Language

Python

Related Categories

Python Database Software, Python Data Quality Tool

Registered

2022-07-23