Data Quality Tools for Linux

Browse free open source Data Quality tools and projects for Linux below. Use the toggles on the left to filter open source Data Quality tools by OS, license, language, programming language, and project status.

  • 1
    iTop - IT Service Management & CMDB

    An easy, extensible web based IT service management platform

    Whether you’re an infrastructure manager handling complex systems, a service support leader striving for customer satisfaction, or a decision-maker focused on ROI and compliance, iTop adapts to your processes to simplify your tasks, streamline operations, and enhance service quality. iTop (IT Operations Portal) by Combodo is an all-in-one, open-source ITSM platform designed to streamline IT operations. iTop offers a highly customizable, low-code Configuration Management Database (CMDB), along with advanced tools for handling requests, incidents, problems, changes, and service management. iTop is ITIL-compliant, making it ideal for organizations looking for standardized and scalable IT processes. Trusted by organizations worldwide, iTop provides a flexible, extensible solution. The platform’s source code is openly available on GitHub [https://github.com/Combodo/iTop].
    Downloads: 1,101 This Week
  • 2
    TTA Lossless Audio Codec
    Lossless compressor for multichannel 8, 16 and 24-bit audio data, with support for password protection. Being 'lossless' means that no data or quality is lost in compression: when uncompressed, the data will be identical to the original.
    Downloads: 150 This Week
  • 3
    CSV Lint

    CSV Lint plug-in for Notepad++ for syntax highlighting

    CSV Lint plug-in for Notepad++ for syntax highlighting, CSV validation, automatic column and datatype detection, fixed-width datasets, changing datetime formats and decimal separators, sorting data, counting unique values, converting to XML, JSON, SQL, etc. A plugin for data cleaning and working with messy data files. Use CSV Lint for metadata discovery, technical data validation, and reformatting of tabular data files. It is not meant to be a replacement for spreadsheet programs like Excel or SPSS; rather, it is a quality control tool to examine, verify, or polish up a dataset before further processing.
    Downloads: 21 This Week
  • 4
    Arize Phoenix

    Uncover insights, surface problems, monitor, and fine tune your LLM

    Phoenix provides ML insights at lightning speed with zero-config observability for model drift, performance, and data quality. Phoenix is an open source ML observability library designed for the notebook. The toolset is designed to ingest model inference data for LLM, CV, NLP, and tabular datasets. It allows data scientists to quickly visualize their model data, monitor performance, track down issues and insights, and easily export data to drive improvements. Deep learning models (CV, LLM, and generative) are an amazing technology that will power many future ML use cases. A large set of these technologies are being deployed into businesses (the real world) in what we consider a production setting. A minimal notebook sketch follows below.
    Downloads: 13 This Week
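    The listing itself contains no code; the following is a minimal, hedged sketch of launching Phoenix from a notebook. It assumes the arize-phoenix package is installed; the DataFrame, its column names, and the px.Dataset class follow older Phoenix releases (newer releases rename some of these, e.g. Dataset to Inferences), so treat the exact names as assumptions.

        import pandas as pd
        import phoenix as px

        # Hypothetical inference log; column names are illustrative only.
        inferences = pd.DataFrame(
            {"prediction": ["cat", "dog", "cat"], "actual": ["cat", "cat", "cat"]}
        )

        # Tell Phoenix which columns hold predictions and ground truth.
        schema = px.Schema(
            prediction_label_column_name="prediction",
            actual_label_column_name="actual",
        )
        dataset = px.Dataset(dataframe=inferences, schema=schema, name="production")

        # Launch the Phoenix app to explore drift, performance, and data quality.
        session = px.launch_app(dataset)
        print(session.url)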
  • 5
    Diffgram

    Training data (data labeling, annotation, workflow) for all data types

    From ingesting data to exploring it, annotating it, and managing workflows, Diffgram is a single application that will improve your data labeling and bring all aspects of training data under a single roof. Diffgram is the world's first truly open source training data platform focused on giving its users an unlimited experience. It aims to reduce your data labeling bills and increase your training data quality. Training data is the art of supervising machines through data; this includes the activity of annotation, which produces structured data ready to be consumed by a machine learning model. Annotation is required because raw media is considered unstructured and not usable without it. That is why training data is required for many modern machine learning use cases, including computer vision, natural language processing, and speech recognition.
    Downloads: 4 This Week
  • 6
    Dagster

    An orchestration platform for the development, production, and observation of data assets

    Dagster is an orchestration platform for the development, production, and observation of data assets. Dagster as a productivity platform: With Dagster, you can focus on running tasks, or you can identify the key assets you need to create using a declarative approach. Embrace CI/CD best practices from the get-go: build reusable components, spot data quality issues, and flag bugs early. Dagster as a robust orchestration engine: Put your pipelines into production with a robust multi-tenant, multi-tool engine that scales technically and organizationally. Dagster as a unified control plane: The 'single pane of glass' data teams love to use. Rein in the chaos and maintain control over your data as the complexity scales. Centralize your metadata in one tool with built-in observability, diagnostics, cataloging, and lineage. Spot any issues and identify performance improvement opportunities. A minimal asset-definition sketch follows below.
    Downloads: 2 This Week
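    As a hedged illustration of the declarative, asset-based approach described above (not taken from the listing), here is a minimal sketch of two software-defined assets with a simple data quality guard; the asset names and sample data are assumptions.

        import pandas as pd
        from dagster import Definitions, asset

        @asset
        def raw_orders() -> pd.DataFrame:
            # Stand-in for an ingestion step (e.g. reading from a warehouse).
            return pd.DataFrame({"order_id": [1, 2, 2], "amount": [10.0, 5.5, 5.5]})

        @asset
        def validated_orders(raw_orders: pd.DataFrame) -> pd.DataFrame:
            # Spot data quality issues early, as the description suggests.
            if raw_orders["amount"].lt(0).any():
                raise ValueError("Negative order amounts found")
            return raw_orders.drop_duplicates(subset="order_id")

        defs = Definitions(assets=[raw_orders, validated_orders])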
  • 7
    DataCleaner

    Data quality analysis, profiling, cleansing, duplicate detection +more

    DataCleaner is a data quality analysis application and a solution platform for DQ solutions. Its core is a strong data profiling engine, which is extensible and thereby adds data cleansing, transformations, enrichment, deduplication, matching and merging. Website: http://datacleaner.github.io
    Downloads: 9 This Week
  • 8
    Open Source Data Quality and Profiling

    World's first open source data quality & data preparation project

    This project is dedicated to open source data quality and data preparation solutions. Data quality includes profiling, filtering, governance, similarity checks, data enrichment and alteration, real-time alerting, basket analysis, bubble-chart warehouse validation, single customer view, etc., as defined by strategy. The project is developing a high-performance integrated data management platform that seamlessly handles data integration, data profiling, data quality, data preparation, dummy data creation, metadata discovery, anomaly discovery, data cleansing, reporting, and analytics. It also has Hadoop (big data) support to move files to/from a Hadoop grid and to create, load, and profile Hive tables. This project is also known as "Aggregate Profiler". A RESTful API for this project is being built (beta version) at https://sourceforge.net/projects/restful-api-for-osdq/ and an Apache Spark based data quality module is being built at https://sourceforge.net/projects/apache-spark-osdq/
    Downloads: 6 This Week
  • 9
    Cleanlab

    The standard data-centric AI package for data quality and ML

    cleanlab helps you clean data and labels by automatically detecting issues in an ML dataset. To facilitate machine learning with messy, real-world data, this data-centric AI package uses your existing models to estimate dataset problems that can be fixed to train even better models. cleanlab cleans your data's labels via state-of-the-art confident learning algorithms, published in this paper and blog. See some of the datasets cleaned with cleanlab at labelerrors.com. This package helps you find label issues and other data issues, so you can train reliable ML models. All features of cleanlab work with any dataset and any model. Yes, any model: PyTorch, TensorFlow, Keras, JAX, Hugging Face, OpenAI, XGBoost, scikit-learn, etc. If you use a scikit-learn-compatible classifier, all cleanlab methods work out of the box. A minimal label-issue sketch follows below.
    Downloads: 1 This Week
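    As a hedged sketch (not from the listing) of the scikit-learn workflow mentioned above, the snippet below uses out-of-sample predicted probabilities to rank likely label errors on a synthetic dataset; the dataset and model choice are assumptions.

        from sklearn.datasets import make_classification
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import cross_val_predict
        from cleanlab.filter import find_label_issues

        # Synthetic stand-in for your (possibly noisily labeled) dataset.
        X, labels = make_classification(
            n_samples=500, n_classes=3, n_informative=5, random_state=0
        )

        # Out-of-sample predicted probabilities from any sklearn-compatible model.
        clf = LogisticRegression(max_iter=1000)
        pred_probs = cross_val_predict(clf, X, labels, cv=5, method="predict_proba")

        # Indices of examples whose labels look wrong, ranked by severity.
        issue_indices = find_label_issues(
            labels=labels,
            pred_probs=pred_probs,
            return_indices_ranked_by="self_confidence",
        )
        print(issue_indices[:10])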
  • 10
    Pandas Profiling

    Create HTML profiling reports from pandas DataFrame objects

    pandas-profiling generates profile reports from a pandas DataFrame. The pandas df.describe() function is handy yet a little basic for exploratory data analysis. pandas-profiling extends pandas DataFrame with df.profile_report(), which automatically generates a standardized univariate and multivariate report for data understanding. High correlation warnings, based on different correlation metrics (Spearman, Pearson, Kendall, Cramér's V, Phik). Most common categories (uppercase, lowercase, separator), scripts (Latin, Cyrillic) and blocks (ASCII, Cyrillic). File sizes, creation dates, dimensions, indication of truncated images and existence of EXIF metadata. Mostly global details about the dataset (number of records, number of variables, overall missingness and duplicates, memory footprint). Comprehensive and automatic list of potential data quality issues (high correlation, skewness, uniformity, zeros, missing values, constant values, among others). A minimal usage sketch follows below.
    Downloads: 1 This Week
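    A minimal sketch of generating a report, assuming the pandas-profiling package (now continued as ydata-profiling) is installed; the DataFrame below is illustrative.

        import numpy as np
        import pandas as pd
        from pandas_profiling import ProfileReport

        # Illustrative dataset with numeric and categorical columns.
        rng = np.random.default_rng(0)
        df = pd.DataFrame(
            {
                "age": rng.integers(18, 90, size=200),
                "income": rng.normal(50_000, 12_000, size=200),
                "segment": rng.choice(["a", "b", "c"], size=200),
            }
        )

        profile = ProfileReport(df, title="Example profiling report")
        profile.to_file("report.html")  # or profile.to_notebook_iframe() in Jupyter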
  • 11
    lakeFS

    lakeFS - Git-like capabilities for your object storage

    Increase data quality and reduce the painful cost of errors. Data engineering best practices using Git-like operations on data. lakeFS is an open-source data version control system for data lakes. It enables zero-copy dev/test isolated environments, continuous quality validation, atomic rollback on bad data, reproducibility, and more. Data is dynamic; it changes over time. Dealing with that without a data version control system is error-prone and labor-intensive. With lakeFS, your data lake is version controlled and you can easily time-travel between consistent snapshots of the lake. Easier ETL testing: test your ETLs on top of production data, in isolation, without copying anything. Safely experiment and test on full production data. Easily collaborate on production data with your team. Automate data quality checks within data pipelines.
    Downloads: 1 This Week
  • 12
    CloverDX

    Design, automate, operate and publish data pipelines at scale

    Please visit www.cloverdx.com for the latest product versions. Data integration platform; can be used to transform/map/manipulate data in batch and near-realtime modes. Supports various input/output formats (CSV, FIXLEN, Excel, XML, JSON, Parquet, Avro, EDI/X12, HL7, COBOL, LOTUS, etc.). Connects to RDBMS/JMS/Kafka/SOAP/REST/LDAP/S3/HTTP/FTP/ZIP/TAR. CloverDX offers 100+ specialized components which can be further extended by the creation of "macros" (subgraphs) and libraries, shareable with 3rd parties. Simple data manipulation jobs can be created visually. More complex business logic can be implemented using Clover's domain-specific language CTL, in Java, or in languages like Python or JavaScript. Through its DataServices functionality, it allows you to quickly turn data pipelines into REST API endpoints. The platform allows you to easily scale your data jobs across multiple cores or nodes/machines. Supports Docker/Kubernetes deployments and offers AWS/Azure images in their respective marketplaces.
    Downloads: 3 This Week
  • 13
    MentDB Projects

    Generalized Interoperability and Strong AI

    MentDB is an open-source platform driving research into next-generation AI and universal data exchange. Our architecture is built around the revolutionary Mentalese Query Language (MQL). MentDB Weak (Generalized Interoperability): A unified data layer enabling seamless data exchange and application integration (SOA, ETL, Data Quality). We eliminate data silos through a single, generalized data language. MentDB Strong (Strong AI / AGI): The framework for exploring and building Machine Consciousness, free will, and advanced ethical reasoning systems. Based on new-generation AI algorithms.
    Downloads: 6 This Week
  • 14

    BMDExpress Data Viewer

    A Visualization Tool to Analyze BMDExpress Datasets

    Regulatory agencies increasingly apply benchmark dose (BMD) modeling to determine points of departure for risk assessment. BMDExpress applies BMD modeling to transcriptomic datasets to identify transcriptional BMDs. However, graphing and analytical capabilities within BMDExpress are limited, and the analysis of output files is challenging. We developed a web-based application, BMDExpress Data Viewer, for visualizing and graphing BMDExpress output files. BMDExpress Data Viewer is a useful tool to visualize, explore and analyze BMDExpress output files. Visualizing the data in this manner enables rapid assessment of data quality, model fit, doses of peak activity, most sensitive pathway perturbations and other metrics that will be useful in applying toxicogenomics in risk assessment. Tool Link: http://apps.sciome.com:8082/BMDX_Viewer/ Publication Link: http://onlinelibrary.wiley.com/doi/10.1002/jat.3265/abstract
    Downloads: 1 This Week
  • 15
    EPRI Open PQ Dashboard

    Demos new techniques for extracting information from PQ data files

    Open PQ Dashboard version 1.0 provides visual displays to quickly convey the status and location of power quality (PQ) anomalies throughout the electrical power system. Summary displays start with a choice of a geospatial map view or an annunciator panel, both with unique across-the-room visualizations fit for a PQ operations center. Drill-downs are in place for various statistics and guide users all the way down to the waveform level. This version consists of a few proof-of-concept applications that apply event severity and trend values to heatmap displays, giving PQ engineers a wide-area view of PQ status for quick interpretation. Data quality reporting has been added so users can quickly see when meters are providing incomplete or invalid data. This dashboard currently accepts power quality data from the COMTRADE and PQDIF standard file formats. Other proprietary software interfaces have been added. See the installation manual for more details.
    Downloads: 1 This Week
  • 16
    Apache Airflow Provider

    Great Expectations Airflow operator

    Due to the apply_default decorator removal, this version of the provider requires Airflow 2.1.0+. If your Airflow version is below 2.1.0 and you want to install this provider version, first upgrade Airflow to at least version 2.1.0. Otherwise, your Airflow package version will be upgraded automatically, and you will have to manually run airflow upgrade db to complete the migration. This operator currently works with the Great Expectations V3 Batch Request API only. If you would like to use the operator in conjunction with the V2 Batch Kwargs API, you must use a version below 0.1.0. This operator uses Great Expectations Checkpoints instead of the former ValidationOperators. Because of the above, this operator requires Great Expectations >=v0.13.9, which is pinned in the requirements.txt starting with release 0.0.5. A minimal DAG sketch follows below.
    Downloads: 0 This Week
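    A minimal, hedged sketch of wiring the operator into a DAG; the data context path and checkpoint name below are assumptions to adapt to your own Great Expectations project.

        from datetime import datetime

        from airflow import DAG
        from great_expectations_provider.operators.great_expectations import (
            GreatExpectationsOperator,
        )

        with DAG(
            dag_id="validate_orders",
            start_date=datetime(2024, 1, 1),
            schedule_interval=None,
            catchup=False,
        ) as dag:
            # Runs a V3-style Checkpoint against the configured data context.
            validate = GreatExpectationsOperator(
                task_id="validate_orders_checkpoint",
                data_context_root_dir="/opt/airflow/great_expectations",  # assumed path
                checkpoint_name="orders_checkpoint",  # assumed checkpoint name
            )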
  • 17

    Arthropod Easy Capture

    An arthropod specific, specimen level data capture application

    Arthropod Easy Capture (AEC) is an arthropod-specific, open-source solution for handling specimen-level host interactions. Developed in conjunction with the Plant Bugs Planetary Biodiversity, Tri-Trophic Interactions TCN, and Bee Database projects, AEC is designed for rapid and accurate data capture, utilizing controlled vocabularies to maintain data quality. The application is web-based, allowing for collaboration from multiple partners to a centralized, easily maintainable database. The AEC community, as part of the Advancing Digitization of Biodiversity Collections (ADBC) program, is already tasked to develop the appropriate web service for sharing data with the iDigBio data portal. This includes the application of Globally Unique Identifiers (GUIDs) for specimens, and mapping relevant data fields to DarwinCore. iDigBio is the aggregator for all of the ADBC Thematic Collection Network projects, including trophic-level interaction data from the Tri-Trophic Interactions TCN.
    Downloads: 0 This Week
  • 18
    Web-GUI for a benchmarking database which is based on the EFQM Framework for Corporate Data Quality Management.
    Downloads: 0 This Week
  • 19
    COBOL Data Definitions
    Parse, analyze and -- most importantly -- use COBOL data definitions. This gives you access to COBOL data from Python programs. Write data analyzers, one-time data conversion utilities and Python programs that are part of COBOL systems. Really.
    Downloads: 0 This Week
  • 20
    CleanVision

    Automatically find issues in image datasets

    CleanVision automatically detects potential issues in image datasets, like images that are blurry, under/over-exposed, (near) duplicates, etc. This data-centric AI package is a quick first step for any computer vision project to find problems in the dataset which you want to address before applying machine learning. CleanVision is super simple: run the same couple of lines of Python code to audit any image dataset! The quality of machine learning models hinges on the quality of the data used to train them, but it is hard to manually identify all of the low-quality data in a big dataset. CleanVision helps you automatically identify common types of data issues lurking in image datasets. This package currently detects issues in the raw images themselves, making it a useful tool for any computer vision task, such as classification, segmentation, object detection, pose estimation, keypoint detection, generative modeling, etc. Those couple of lines are sketched below.
    Downloads: 0 This Week
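    A minimal sketch, assuming the cleanvision package is installed and that a local folder of images exists at the (illustrative) path below.

        from cleanvision import Imagelab

        # Point Imagelab at a folder of images (path is hypothetical).
        imagelab = Imagelab(data_path="./image_dataset")

        imagelab.find_issues()  # scan for blur, over/under-exposure, duplicates, ...
        imagelab.report()       # summarize the detected issues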
  • 21
    DataQualityDashboard

    A tool to help improve data quality standards in data science

    The goal of the Data Quality Dashboard (DQD) project is to design and develop an open-source tool to expose and evaluate observational data quality. This package will run a series of data quality checks against an OMOP CDM instance (currently supports v5.4, v5.3 and v5.2). It systematically runs the checks, evaluates them against a pre-specified threshold, and then communicates what was done in a transparent and easily understandable way. The quality checks were organized according to the Kahn framework, which uses a system of categories and contexts that represent strategies for assessing data quality. Using this framework, the Data Quality Dashboard takes a systematic approach to running data quality checks. Instead of writing thousands of individual checks, we use "data quality check types". These "check types" are more general, parameterized data quality checks into which OMOP tables, fields, and concepts can be substituted to represent a singular data quality idea.
    Downloads: 0 This Week
  • 22
    ETS Offers iClassicMDM - MDM Software

    iClassicMDM offered by ETS is a Master Data Platform for all.

    We are living in an age where homes are becoming offices, every office needs better data management tools, and data management issues are spiraling. Our passion is to offer affordable data management tools to individuals and enterprises of all sizes. iClassicMDM is a Master Data Management application that can run on a desktop, a web server, or in containers. It allows customers to create data models as per their business needs and expose them for collaboration through a Data Stewardship user interface and a RESTful API to exchange data across the ecosystem. It has a built-in Data Modeler, Databases, Data Quality (Cleanse & Match), Data Flow studio, and Data Store to accelerate turnaround time. Our customers can use the product for evaluation and, once they are satisfied, they can reach out to us for pricing before go-live. We have been recognized by Gartner since 2016. Contact us at info@etsondemand.com or visit www.etsondemand.com to download the software.
    Downloads: 0 This Week
  • 23

    EZStacking

    EZStacking is a Jupyter notebook generator for machine learning

    EZStacking is a Jupyter notebook generator for supervised learning problems, using Scikit-Learn pipelines and stacked generalization. EZStacking handles classification and regression problems for structured data. It can also be viewed as a development tool, because a notebook generated with EZStacking contains: an exploratory data analysis (EDA) used to assess data quality; a modelling step producing a reduced-size stacked estimator; and a server returning a prediction, a measure of the quality of the input data, and the execution time.
    Downloads: 0 This Week
  • 24
    Encord Active

    The toolkit to test, validate, and evaluate your models and surface the most valuable data

    Encord Active is an open-source toolkit to test, validate, and evaluate your models and to surface, curate, and prioritize the most valuable data for labeling to supercharge model performance. Encord Active has been designed as an all-in-one open source toolkit for improving your data quality and model performance. Use the intuitive UI to explore your data or access all the functionality programmatically. Discover errors, outliers, and edge cases within your data, all in one open source toolkit. Get a high-level overview of your data distribution, explore it by customizable quality metrics, and discover any anomalies. Use powerful similarity search to find more examples of edge cases or outliers.
    Downloads: 0 This Week
  • 25
    Feathr

    A scalable, unified data and AI engineering platform for enterprise

    Feathr is a data and AI engineering platform that has been widely used in production at LinkedIn for many years and was open sourced in 2022. It is currently a project under the LF AI & Data Foundation. Define data and feature transformations based on raw data sources (batch and streaming) using Pythonic APIs. Register transformations by name and get transformed data (features) for various use cases, including AI modeling, compliance, go-to-market, and more. Share transformations and data (features) across teams and the company. Feathr is particularly useful in AI modeling, where it automatically computes your feature transformations and joins them to your training data using point-in-time-correct semantics to avoid data leakage, and it supports materializing and deploying your features for use online in production.
    Downloads: 0 This Week