Showing 888 open source projects for "data quality"

  • 1
    data-diff

    Efficiently diff rows across two different databases

    ...Replicating data at scale, across hundreds of tables, with low latency and at a reasonable infrastructure cost is a hard problem, and most data teams we’ve talked to have faced data quality issues in their replication processes. The hard truth is that the quality of the replication is the quality of the data. Since copying entire datasets in batch is often infeasible at modern data scale, businesses rely on the Change Data Capture (CDC) approach of replicating data using a continuous stream of updates.
    Downloads: 0 This Week
  • 2
    Easy DataSet

    A powerful tool for creating datasets for LLM fine-tuning

    ...The system includes automated question-generation capabilities, hierarchical label trees, and answer generation pipelines that use LLM APIs to produce coherent paired data with customizable templates. Beyond dataset creation, Easy-dataset also provides a built-in evaluation system with model testing and blind-test features, helping teams validate model performance using curated test sets.
    Downloads: 1 This Week
  • 3
    Inbucket

    Disposable webmail server (similar to Mailinator) with built-in SMTP

    Inbucket is an email testing application; it will accept messages from any email address and make them available to view via a web interface. When you need to test your webapp's outbound emails with Mailinator but are stuck behind a firewall, Inbucket provides the solution. It allows you to keep your new application development secret until it's time to release it. Inbucket is ideal for validating that emails go out as part of your integration test suite, sending links to coworkers to...
    Downloads: 1 This Week
  • 4
    ispc

    Intel SPMD Program Compiler

    ispc is a compiler for a variant of the C programming language, with extensions for single program, multiple data (SPMD) programming. Under the SPMD model, the programmer writes a program that generally appears to be a regular serial program, though the execution model is actually that a number of program instances execute in parallel on the hardware. ispc compiles a C-based SPMD programming language to run on the SIMD units of CPUs and GPUs; it frequently provides a 3x or more speedup on...
    Downloads: 0 This Week
  • 5
    The AI Scientist-v2

    Workshop-Level Automated Scientific Discovery via Agentic Tree Search

    AI-Scientist-v2 is an advanced autonomous research system designed to perform end-to-end scientific discovery using large language models and agent-based orchestration. The platform is capable of generating original research ideas, designing and executing experiments, analyzing and visualizing results, and producing full academic papers without direct human intervention. It introduces a generalized framework that removes reliance on predefined templates, enabling broader applicability across...
    Downloads: 0 This Week
  • 6
    AI_Tutorial

    A selection of learning materials, search, recommendation, advertising

    AI_Tutorial is a large curated repository that aggregates high-quality learning resources related to artificial intelligence, machine learning, deep learning, natural language processing, and data engineering. The project functions as a centralized knowledge base designed to help engineers and researchers discover tutorials, technical articles, algorithm explanations, and architecture discussions from across the AI ecosystem.
    Downloads: 0 This Week
  • 7
    rag-search

    RAG Search API

    ...It is built to be easily deployable, requiring only environment configuration and dependency installation to run a functional RAG service. The system supports configurable filtering, scoring thresholds, and reranking options, allowing developers to fine-tune retrieval quality. Its architecture is modular, separating handlers, services, and utilities to support customization and extension. Overall, rag-search serves as a practical starter backend for teams building AI search or question-answering applications on their own data.
    Downloads: 0 This Week
  • 8
    GPT Crawler

    Crawl a site to generate knowledge files to create your own custom GPT

    GPT Crawler is an open-source tool designed to automatically crawl websites and generate structured knowledge that can be used to build AI assistants and retrieval systems. It focuses on extracting high-quality textual content from web pages and preparing it in formats suitable for embedding, indexing, or fine-tuning workflows. The project is especially useful for teams that want to turn documentation sites or knowledge bases into conversational AI backends without building custom scrapers from scratch. It includes configurable crawling logic, content filtering, and output pipelines that streamline the process of preparing data for large language models. ...
    Downloads: 0 This Week
  • 9
    AI Researcher

    An autonomous AI researcher

    ...Each agent operates with clear roles — such as researcher, analyst, and summarizer — and they communicate through a task-management interface that ensures progress tracking and iterative refinement. The system emphasizes modularity, so teams can swap in new reasoning modules, data retrieval strategies, or domain knowledge bases depending on the research topic. Through self-supervised feedback loops, agents adjust their strategies based on prior outcomes, improving both the quality and relevance of results over time.
    Downloads: 0 This Week
  • 10
    textarea.my

    A minimalist text editor that lives in the URL

    ...This design makes it ideal for quick drafts, short markdown notes, and lightweight sharing where you want “send a link” simplicity without exporting files. It supports markdown-friendly workflows and includes small quality-of-life behaviors like using a top-level markdown title to set the page title for a cleaner browser tab. It also leans into hackable customization by letting you change the look using CSS via DevTools, and those style tweaks can persist with the document. Beyond the URL, it also stores data in localStorage, providing a second layer of persistence for convenience.
    Downloads: 0 This Week
  • 11
    OPENRNDR

    Kotlin library for creative coding, real-time and interactive graphics

    ...OPENRNDR provides simple, reusable utilities with which creative coders can build robust, fast, and reliable (interactive) applications for prototyping as well as building production-quality software. With ORML you can easily connect to a number of widely used Machine Learning models, such as Facemesh, Posenet, and Stylegan. You can use OPENRNDR to visualize the data coming from these models in order to create compelling (interactive) experiences. The ORML library includes both models and interface code to make the use of those models simple. ...
    Downloads: 0 This Week
  • 12
    Z-BlogPHP

    Z-BlogPHP blog program

    Z-BlogPHP is a blog program provided by the Z-Blog community, committed to providing an excellent blog-writing experience to users in China. The first edition was released in 2005, giving it a history of 18 years. It is one of the few open-source CMS systems in China that continue to receive updates. Our goal is to immerse users in writing and recording life without having to attend to cumbersome settings, so that they can focus on creation. For users, it is simple and easy to...
    Downloads: 0 This Week
  • 13
    Trafilatura

    Python & command-line tool to gather text on the Web

    ...It aims to stay handy and modular: no database is required, and the output can be converted to various commonly used formats. Going from raw HTML to essential parts can alleviate many problems related to text quality, first by avoiding the noise caused by recurring elements (headers, footers, links/blogroll, etc.) and second by including information such as author and date in order to make sense of the data. The extractor tries to strike a balance between limiting noise (precision) and including all valid parts (recall). It also has to be robust and reasonably fast, since it runs in production on millions of documents.
    Downloads: 0 This Week
  • 14
    Orpheus TTS

    Towards Human-Sounding Speech

    Orpheus TTS is a state-of-the-art open-source text-to-speech system built on a Llama-3B backbone, treating speech synthesis as a large language model problem instead of a traditional TTS pipeline. It is designed to produce human-like speech with natural intonation, emotion, and rhythm, targeting quality comparable to or better than many closed-source systems. The project ships both pretrained and finetuned English models, as well as a family of multilingual models released as a research preview, and includes data-processing scripts so users can train or finetune their own variants. Inference is provided through a Python package that uses vLLM under the hood for high-throughput, low-latency generation, including streaming examples that show how to generate audio chunks in real time. ...
    Downloads: 3 This Week
  • 15
    DocStrange

    Extract and convert data from any document: images, PDFs, Word docs

    DocStrange is an open-source document understanding and extraction library designed to convert complex files into structured, LLM-ready outputs such as Markdown, JSON, CSV, and HTML. Developed by Nanonets, the project combines OCR, layout detection, table understanding, and structured extraction into one end-to-end pipeline, which reduces the need to stitch together multiple separate services. It is built for developers who need high-quality parsing from scans, photos, PDFs, office files,...
    Downloads: 1 This Week
  • 16
    Pedalboard

    A Python library for audio

    ...It supports the most popular audio file formats and a number of common audio effects out of the box and also allows the use of VST3® and Audio Unit formats for loading third-party software instruments and effects. pedalboard was built by Spotify’s Audio Intelligence Lab to enable using studio-quality audio effects from within Python and TensorFlow. Internally at Spotify, pedalboard is used for data augmentation to improve machine learning models and to help power features like Spotify’s AI DJ and AI Voice Translation. pedalboard also helps in the process of content creation, making it possible to add effects to audio without using a Digital Audio Workstation.
    Downloads: 1 This Week
  • 17
    Vane

    Vane is an AI-powered answering engine

    Vane is a privacy-focused AI-powered answering engine that combines web search, AI reasoning, and multiple language model providers into a locally controlled search experience. The platform supports both local LLMs through Ollama and cloud providers such as OpenAI, Claude, Gemini, and Groq, giving users flexibility in how queries are processed. It integrates web search through SearxNG while also supporting discussions, academic sources, image search, and video search to generate...
    Downloads: 0 This Week
  • 18
    Prometheus-Eval

    Evaluate your LLM's response with Prometheus and GPT4

    ...It also provides training data and utilities for fine-tuning evaluator models so they can assess outputs according to custom scoring rubrics such as helpfulness, accuracy, or style.
    Downloads: 0 This Week
  • 19
    Magicoder

    Empowering Code Generation with OSS-Instruct

    Magicoder is an open-source family of large language models designed specifically for code generation and software development tasks. The project focuses on improving the quality and diversity of code generation by training models with a novel dataset construction approach known as OSS-Instruct. This technique uses open-source code repositories as a foundation for generating more realistic and diverse instruction datasets for training language models. By grounding training data in real open-source examples, Magicoder aims to reduce bias and improve the reliability of code generation results compared to models trained solely on synthetic instructions. ...
    Downloads: 0 This Week
  • 20
    TimesFM

    Pretrained time-series foundation model developed by Google Research

    ...The repository also documents how model versions evolved, with newer variants focusing on efficiency and longer context windows while maintaining forecasting quality.
    Downloads: 0 This Week
  • 21
    BISHENG

    BISHENG is an open LLM DevOps platform for next-generation apps

    BISHENG is an open LLM application DevOps platform, focusing on enterprise scenarios. It has been used by a large number of industry-leading organizations and Fortune 500 companies. "Bi Sheng" was the inventor of movable type printing, which played a vital role in promoting the transmission of human knowledge. We hope that BISHENG can also provide strong support for the widespread implementation of intelligent applications. Everyone is welcome to participate.
    Downloads: 0 This Week
  • 22
    Timber themes

    Create WordPress themes with OOP code and the Twig template engine

    Timber helps you create fully-customized WordPress themes faster with more sustainable code. With Timber, you write your HTML using the Twig Template Engine separate from your PHP files. This cleans up your theme code so, for example, your PHP file can focus on being the data/logic, while your Twig file can focus 100% on the HTML and display. Once Timber is installed and activated in your plugins directory, it gives any WordPress theme the ability to take advantage of the power of Twig and...
    Downloads: 1 This Week
  • 23
    Dozzle

    Realtime log viewer for containers. Supports Docker, Swarm and K8s

    ...Instead of indexing or storing logs, it connects to your container runtime and streams live output so you can diagnose issues as they happen. The interface includes practical quality-of-life features like fuzzy searching for containers, regex log search, split-screen viewing for multiple logs, and live stats such as CPU and memory usage. It supports more advanced analysis through an in-browser SQL query engine for querying logs, which helps when you need structured filtering without exporting data elsewhere. ...
    Downloads: 0 This Week
  • 24
    LLM Datasets

    Curated list of datasets and tools for post-training

    ...The repository aims to make datasets easy to inspect and transform, with scripts for downloading, deduping, cleaning, and converting to formats like JSONL that slot into training pipelines. It highlights instruction-tuning and conversation-style corpora while also pointing to code, math, or domain-specific sets for targeted capabilities. Quality is a recurring theme: examples and utilities help filter low-value samples, enforce length limits, and split train/validation consistently so results are comparable. Licensing and provenance are surfaced to encourage compliant usage and to guide dataset selection in commercial settings. For practitioners, the repo is a practical “starting pantry” that accelerates experimentation and helps keep data wrangling from dominating the project timeline.
    Downloads: 0 This Week
  • 25
    LLM Foundry

    LLM training code for MosaicML foundation models

    Introducing MPT-7B, the first entry in our MosaicML Foundation Series. MPT-7B is a transformer trained from scratch on 1T tokens of text and code. It is open source, available for commercial use, and matches the quality of LLaMA-7B. MPT-7B was trained on the MosaicML platform in 9.5 days with zero human intervention at a cost of ~$200k. Large language models (LLMs) are changing the world, but for those outside well-resourced industry labs, it can be extremely difficult to train and deploy...
    Downloads: 0 This Week