Showing 62 open source projects for "talend data quality"

  • 1
    Data-Juicer

    Data processing for and with foundation models

    Data-Juicer is an open-source data processing and augmentation framework designed to enhance the quality and diversity of datasets for machine learning tasks. It includes a modular pipeline for scalable data transformation.
    Downloads: 0 This Week
  • 2
    FiftyOne

    The open-source tool for building high-quality datasets

    The open-source tool for building high-quality datasets and computer vision models. Nothing hinders the success of machine learning systems more than poor-quality data. And without the right tools, improving a model can be time-consuming and inefficient. FiftyOne supercharges your machine learning workflows by enabling you to visualize datasets and interpret models faster and more effectively.
    Downloads: 3 This Week
  • 3
    Synthetic Data Vault (SDV)

    Synthetic Data Generation for tabular, relational and time series data

    The Synthetic Data Vault (SDV) is a Synthetic Data Generation ecosystem of libraries that allows users to easily learn single-table, multi-table and time series datasets and later generate new synthetic data that has the same format and statistical properties as the original dataset. Synthetic data can then be used to supplement, augment and in some cases replace real data when training Machine Learning models. Additionally, it enables the testing of Machine Learning or other data dependent...
    Downloads: 0 This Week
  • 4
    Arize Phoenix

    Uncover insights, surface problems, monitor, and fine-tune your LLM

    Phoenix provides ML insights at lightning speed with zero-config observability for model drift, performance, and data quality. Phoenix is an Open Source ML Observability library designed for the Notebook. The toolset is designed to ingest model inference data for LLMs, CV, NLP and tabular datasets. It allows Data Scientists to quickly visualize their model data, monitor performance, track down issues & insights, and easily export to improve. Deep Learning Models (CV, LLM, and Generative) are an amazing technology that will power many future ML use cases. ...
    Downloads: 2 This Week
  • 5
    Pandas Profiling

    Create HTML profiling reports from pandas DataFrame objects

    ...Mostly global details about the dataset (number of records, number of variables, overall missingness and duplicates, memory footprint). Comprehensive and automatic list of potential data quality issues (high correlation, skewness, uniformity, zeros, missing values, constant values, among others).
    Downloads: 2 This Week
  • 6
    Diffgram

    Training data (data labeling, annotation, workflow) for all data types

    From ingesting data to exploring it, annotating it, and managing workflows. Diffgram is a single application that will improve your data labeling and bring all aspects of training data under a single roof. Diffgram is the world’s first truly open source training data platform that focuses on giving its users an unlimited experience. It aims to reduce your data labeling bills and increase the quality of your training data.
    Downloads: 1 This Week
  • 7
    Wan2.2

    Wan2.2: Open and Advanced Large-Scale Video Generative Model

    Wan2.2 is a major upgrade to the Wan series of open and advanced large-scale video generative models, incorporating cutting-edge innovations to boost video generation quality and efficiency. It introduces a Mixture-of-Experts (MoE) architecture that splits the denoising process across specialized expert models, increasing total model capacity without raising computational costs. Wan2.2 integrates meticulously curated cinematic aesthetic data, enabling precise control over lighting, composition, color tone, and more, for high-quality, customizable video styles. ...
    Downloads: 173 This Week
  • 8
    Cleanlab

    The standard data-centric AI package for data quality and ML

    cleanlab helps you clean data and labels by automatically detecting issues in an ML dataset. To facilitate machine learning with messy, real-world data, this data-centric AI package uses your existing models to estimate dataset problems that can be fixed to train even better models. cleanlab cleans your data's labels via state-of-the-art confident learning algorithms, published in this paper and blog. See some of the datasets cleaned with cleanlab at labelerrors.com. This package helps you...
    Downloads: 0 This Week
  • 9
    SDGym

    Benchmarking synthetic data generation methods

    ...You can also customize the process to include your own work. Select any of the publicly available datasets from the SDV project, or input your own data. Choose from any of the SDV synthesizers and baselines, or write your own custom machine learning model. In addition to performance and memory usage, you can also measure synthetic data quality and privacy through a variety of metrics. Install SDGym using pip or conda. We recommend using a virtual environment to avoid conflicts with other software on your device.
    Downloads: 0 This Week
  • 10
    LuxTTS

    A high-quality rapid TTS voice cloning model

    LuxTTS is an open-source text-to-speech (TTS) system focused on delivering high-quality, rapid voice synthesis and voice cloning that runs extremely fast and efficiently on consumer hardware. It implements a lightweight architecture based on ZipVoice and optimized sampling techniques so that it can generate speech at speeds up to roughly 150 times real-time on a single GPU and faster than real-time on CPU, all while producing audio at high fidelity with 48 kHz quality. ...
    Downloads: 19 This Week
  • 11
    fastdup

    An unsupervised and free tool for image and video dataset analysis

    fastdup is a powerful free tool designed to rapidly extract valuable insights from your image & video datasets. It helps you improve the quality of your dataset's images and labels and reduce your data operations costs at an unparalleled scale.
    Downloads: 0 This Week
  • 12
    Giskard

    Collaborative & Open-Source Quality Assurance for all AI models

    Giskard is an open-source testing framework dedicated to ML models, from tabular models to LLMs. Testing Machine Learning applications can be tedious. Since ML models depend on data, testing scenarios depend on the domain specificities and are often infinite. At Giskard, we believe that Machine Learning needs its own testing framework. Created by ML engineers for ML engineers, Giskard enables you to scan your model to find...
    Downloads: 3 This Week
  • 13
    InfiniteYou

    Flexible Photo Recrafting While Preserving Your Identity

    ...Using an architecture built around diffusion transformers (DiTs), InfiniteYou introduces a component called InfuseNet that injects identity features derived from reference images into the generation process — via residual connections — so that the output matches the person’s identity closely, without sacrificing visual quality or text-image alignment. The team uses a multi-stage training strategy with synthetic multi-sample data per identity to fine-tune for both identity consistency and aesthetic quality. Compared to prior methods, InfiniteYou significantly improves on identity similarity, text-prompt adherence, overall image quality, and avoids common problems such as face copy-pasting artifacts.
    Downloads: 0 This Week
  • 14
    DataProfiler

    Extract schema, statistics and entities from datasets

    DataProfiler is an AI-powered tool for automatic data analysis and profiling, designed to detect patterns, anomalies, and schema inconsistencies in structured and unstructured datasets. The DataProfiler is a Python library designed to make data analysis, monitoring, and sensitive data detection easy. Loading data: with a single command, the library automatically formats & loads files into a DataFrame. Profiling the data: the library identifies the schema, statistics, entities (PII / NPI), and...
    Downloads: 0 This Week
  • 15
    Toloka-Kit

    Toloka-Kit is a Python library for working with Toloka API

    ...There’s no need to validate JSON files and work with them directly. Support for both synchronous and asynchronous (via async/await) execution. Streaming support: build complex pipelines that send and receive data in real time. For example, you can pass data between two related projects: one for data labeling, and another for its validation. AutoQuality feature, which automatically finds the best-fitting quality control rules for your project.
    Downloads: 0 This Week
  • 16
    SAM 3

    Code for running inference and finetuning with SAM 3 model

    SAM 3 (Segment Anything Model 3) is a unified foundation model for promptable segmentation in both images and videos, capable of detecting, segmenting, and tracking objects. It accepts both text prompts (open-vocabulary concepts like “red car” or “goalkeeper in white”) and visual prompts (points, boxes, masks) and returns high-quality masks, boxes, and scores for the requested concepts. Compared with SAM 2, SAM 3 introduces the ability to exhaustively segment all instances of an open-vocabulary concept specified by a short phrase or exemplars, scaling to a vastly larger set of categories than traditional closed-set models. This capability is grounded in a new data engine that automatically annotated over four million unique concepts, producing a massive open-vocabulary segmentation dataset and enabling the model to achieve 75–80% of human performance on the SA-CO benchmark, which itself spans 270K unique concepts.
    Downloads: 75 This Week
  • 17
    Improved Diffusion

    Release for Improved Denoising Diffusion Probabilistic Models

    improved-diffusion is an open source implementation of diffusion probabilistic models created by OpenAI. These models, also known as score-based generative models, are a class of generative models that have shown strong performance in producing high-quality synthetic data such as images. The repository provides code for training and sampling diffusion models with improved techniques that enhance stability, efficiency, and output fidelity. It includes scripts for setting up training runs, generating samples, and reproducing results from OpenAI’s research on diffusion-based generation. ...
    Downloads: 3 This Week
  • 18
    HY-Motion 1.0

    HY-Motion model for 3D character animation generation

    ...The training strategy for the HY-Motion series includes extensive pre-training on thousands of hours of varied motion data, fine-tuning on curated high-quality datasets, and reinforcement learning with human feedback, which improves both the plausibility and adaptability of generated motion sequences.
    Downloads: 8 This Week
  • 19
    refinery

    Open-source choice to scale, assess and maintain natural language data

    The data scientist's open-source choice to scale, assess and maintain natural language data. Treat training data like a software artifact. You are one of the people we've built refinery for. refinery helps you to build better NLP models in a data-centric approach. Semi-automate your labeling, find low-quality subsets in your training data, and monitor your data in one place. refinery doesn't get rid of manual labeling, but it makes sure that your valuable time is spent well. ...
    Downloads: 0 This Week
  • 20
    CO3D (Common Objects in 3D)

    Tooling for the Common Objects In 3D dataset

    CO3Dv2 (Common Objects in 3D, version 2) is a large-scale 3D computer vision dataset and toolkit from Facebook Research designed for training and evaluating category-level 3D reconstruction methods using real-world data. It builds upon the original CO3Dv1 dataset, expanding both scale and quality—featuring 2× more sequences and 4× more frames, with improved image fidelity, more accurate segmentation masks, and enhanced annotations for object-centric 3D reconstruction. CO3Dv2 enables research in multi-view 3D reconstruction, novel view synthesis, and geometry-aware representation learning. ...
    Downloads: 1 This Week
  • 21
    DeepVariant

    DeepVariant is an analysis pipeline that uses a deep neural network

    ...DeepTrio extends DeepVariant's functionality, allowing it to utilize the power of neural networks to predict genomic variants in trios or duos. See this page for more details and instructions on how to run DeepTrio. Out-of-the-box use for PCR-positive samples and low-quality sequencing runs, and easy adjustments for different sequencing technologies and non-human species.
    Downloads: 0 This Week
  • 22
    Matrix

    Multi-Agent daTa geneRation Infra and eXperimentation framework

    Matrix is a distributed, large-scale engine for multi-agent synthetic data generation and experiments: it provides the infrastructure to run thousands of “agentic” workflows concurrently (e.g. multiple LLMs interacting, reasoning, generating content, data-processing pipelines) by leveraging distributed computing (like Ray + cluster management). The idea is to treat data generation as a “data-to-data” transformation: each input item defines a task, and the runtime orchestrates asynchronous,...
    Downloads: 0 This Week
  • 23
    NeMo Curator

    Scalable data preprocessing and curation toolkit for LLMs

    NeMo Curator is a Python library specifically designed for fast and scalable dataset preparation and curation for large language model (LLM) use-cases such as foundation model pretraining, domain-adaptive pretraining (DAPT), supervised fine-tuning (SFT) and parameter-efficient fine-tuning (PEFT). It greatly accelerates data curation by leveraging GPUs with Dask and RAPIDS, resulting in significant time savings. The library provides a customizable and modular interface, simplifying pipeline expansion and accelerating model convergence through the preparation of high-quality tokens. At the core of the NeMo Curator is the DocumentDataset, which serves as the main dataset class. ...
    Downloads: 0 This Week
  • 24
    Ling-V2

    Ling-V2 is a MoE LLM provided and open-sourced by InclusionAI

    ...It introduces highly sparse architectures where only a fraction of the model’s parameters are activated per input token, enabling models like Ling-mini-2.0 to achieve reasoning and instruction-following capabilities on par with much larger dense models while remaining significantly more computationally efficient. Trained on more than 20 trillion tokens of high-quality data and enhanced through multi-stage supervised fine-tuning and reinforcement learning, Ling-V2’s models demonstrate strong general reasoning, mathematical problem-solving, coding understanding, and knowledge-intensive task performance.
    Downloads: 1 This Week
  • 25
    DINOv3

    Reference PyTorch implementation and models for DINOv3

    DINOv3 is the third-generation iteration of Meta’s self-supervised visual representation learning framework, building upon the ideas from DINO and DINOv2. It continues the paradigm of learning strong image representations without labels using teacher–student distillation, but introduces a simplified and more scalable training recipe that performs well across datasets and architectures. DINOv3 removes the need for complex augmentations or momentum encoders, streamlining the pipeline while...
    Downloads: 9 This Week