Showing 28 open source projects for "data quality"

View related business solutions
  • Build Agents and Models on One Platform Icon
    Build Agents and Models on One Platform

    Everything you need to build production-ready agents and models. Access 200+ Google and third-party AI models and tools.

    Gemini Enterprise Agent Platform is Google Cloud's comprehensive platform for developers to build, scale, govern, and optimize agents and models. Choose from Google's most advanced models and third-party models like Anthropic's Claude Model Family.
    Try It Free
  • Our Free Plans just got better! | Auth0 Icon
    Our Free Plans just got better! | Auth0

    With up to 25k MAUs and unlimited Okta connections, our Free Plan lets you focus on what you do best—building great apps.

    You asked, we delivered! Auth0 is excited to expand our Free and Paid plans to include more options so you can focus on building, deploying, and scaling applications without having to worry about your security. Auth0 now, thank yourself later.
    Try free now
  • 1
    Synthetic Data Generator

    Synthetic Data Generator

    SDG is a specialized framework

    Synthetic Data Generator is an open-source framework designed to generate high-quality synthetic tabular datasets that replicate the statistical characteristics of real data while avoiding privacy risks. The platform enables developers and data scientists to create artificial datasets that preserve important relationships between variables without containing sensitive personal information.
    Downloads: 1 This Week
    Last Update:
    See Project
  • 2
    Bespoke Curator

    Bespoke Curator

    Synthetic data curation for post-training and data extraction

    ...Curator includes tools for monitoring data generation processes and managing dataset quality while large batches of examples are being created. The framework also integrates with multiple inference systems and APIs, allowing users to generate data using different model providers or open-source inference engines.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 3
    Tauric TradingAgents

    Tauric TradingAgents

    Multi-Agents LLM Financial Trading Framework

    Tauric TradingAgents is a multi-agent AI framework designed for financial analysis, strategy generation, and automated trading workflows. It coordinates multiple specialized agents that collaborate on tasks such as data analysis, signal generation, and risk evaluation. The system enables complex reasoning by distributing responsibilities across agents, improving decision-making quality. It supports integration with market data sources and trading environments for real-world application. The architecture is modular, allowing developers to extend or customize agent behaviors. ...
    Downloads: 2 This Week
    Last Update:
    See Project
  • 4
    DeepSearcher

    DeepSearcher

    Open Source Deep Research Alternative to Reason and Search

    DeepSearcher is an open-source “deep research” style system that combines retrieval with evaluation and reasoning to answer complex questions using private or enterprise data. It is designed around the idea that high-quality answers require more than top-k retrieval, so it orchestrates multi-step search, evidence collection, and synthesis into a comprehensive response. The project integrates with vector databases (including Milvus and related options) so organizations can index internal documents and query them with semantic retrieval. ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • Build Securely on AWS with Proven Frameworks Icon
    Build Securely on AWS with Proven Frameworks

    Lay a foundation for success with Tested Reference Architectures developed by Fortinet’s experts. Learn more in this white paper.

    Moving to the cloud brings new challenges. How can you manage a larger attack surface while ensuring great network performance? Turn to Fortinet’s Tested Reference Architectures, blueprints for designing and securing cloud environments built by cybersecurity experts. Learn more and explore use cases in this white paper.
    Download Now
  • 5
    Learn AI Engineering

    Learn AI Engineering

    Learn AI and LLMs from scratch using free resources

    ...The curation recognizes modern AI realities, including data pipelines, evaluation, prompt engineering, retrieval-augmented generation, and cost/performance trade-offs. It’s equally useful for refreshers—dipping into a specific module before a project—as it is for a full, self-directed curriculum. By centralizing the best references in one place, the repo reduces the overhead of finding, filtering, and sequencing resources, letting you focus on learning and building.
    Downloads: 1 This Week
    Last Update:
    See Project
  • 6
    LangWatch

    LangWatch

    The platform for LLM evaluations and AI agent testing

    ...The platform provides tools for tracking model interactions, analyzing prompt behavior, and identifying issues such as hallucinations, latency problems, or unexpected responses. By collecting telemetry data from AI applications, LangWatch allows developers to understand how their systems perform in real-world usage scenarios. The platform includes dashboards that visualize model behavior, enabling teams to monitor trends in response quality and reliability over time. It also provides evaluation tools that allow developers to test prompts and compare outputs across different models or configurations. ...
    Downloads: 1 This Week
    Last Update:
    See Project
  • 7
    DocETL

    DocETL

    A system for agentic LLM-powered data processing and ETL

    DocETL is an open-source system designed to build and execute data processing pipelines powered by large language models, particularly for analyzing complex collections of documents and unstructured datasets. The platform allows developers and researchers to construct structured workflows that extract, transform, and organize information from sources such as reports, transcripts, legal documents, and other text-heavy data. Instead of relying on single prompts or ad-hoc scripts, DocETL...
    Downloads: 1 This Week
    Last Update:
    See Project
  • 8
    NeMo Curator

    NeMo Curator

    Scalable data pre processing and curation toolkit for LLMs

    NeMo Curator is a Python library specifically designed for fast and scalable dataset preparation and curation for large language model (LLM) use-cases such as foundation model pretraining, domain-adaptive pretraining (DAPT), supervised fine-tuning (SFT) and paramter-efficient fine-tuning (PEFT). It greatly accelerates data curation by leveraging GPUs with Dask and RAPIDS, resulting in significant time savings. The library provides a customizable and modular interface, simplifying pipeline expansion and accelerating model convergence through the preparation of high-quality tokens. At the core of the NeMo Curator is the DocumentDataset which serves as the the main dataset class. ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 9
    rtk

    rtk

    CLI proxy that reduces LLM token consumption

    ...RTK intercepts these command outputs and compresses them into concise summaries before sending them to the language model. This process helps maintain important information while removing redundant data such as boilerplate logs, long directory listings, or repetitive test outputs. By minimizing the amount of noise sent to the AI model, the tool improves reasoning quality and allows longer development sessions within the same context window. The system is implemented as a lightweight Rust binary that runs locally and integrates easily with common AI coding environments.
    Downloads: 23 This Week
    Last Update:
    See Project
  • Enterprise-grade ITSM, for every business Icon
    Enterprise-grade ITSM, for every business

    Give your IT, operations, and business teams the ability to deliver exceptional services—without the complexity.

    Freshservice is an intuitive, AI-powered platform that helps IT, operations, and business teams deliver exceptional service without the usual complexity. Automate repetitive tasks, resolve issues faster, and provide seamless support across the organization. From managing incidents and assets to driving smarter decisions, Freshservice makes it easy to stay efficient and scale with confidence.
    Try it Free
  • 10
    Agents 2.0

    Agents 2.0

    An Open-source Framework for Data-centric Language Agents

    ...During training, the system performs a forward execution where the agent completes a task and records the trajectory of prompts, outputs, and tool usage. A prompt-based loss function is then applied to evaluate the quality of the outcome, generating language-based gradients that guide improvements to the agent pipeline.
    Downloads: 1 This Week
    Last Update:
    See Project
  • 11
    kg-gen

    kg-gen

    Knowledge Graph Generation from Any Text

    kg-gen is an open-source framework developed by the STAIR Lab that automatically generates knowledge graphs from unstructured text using large language models. The system is designed to transform plain text sources such as documents, articles, or conversation transcripts into structured graphs composed of entities and relationships. Instead of relying on traditional rule-based extraction techniques, KG-Gen uses language models to identify entities and their relationships, producing...
    Downloads: 1 This Week
    Last Update:
    See Project
  • 12
    RecAI

    RecAI

    Bridging LLM and Recommender System

    RecAI is an open-source research platform developed by Microsoft to explore how large language models can be integrated into modern recommender systems. Traditional recommender systems rely on structured behavioral data such as user interactions and item embeddings, while large language models excel at understanding language and reasoning about user preferences. RecAI aims to bridge these two domains by creating architectures and training methods that allow LLMs to function as intelligent...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 13
    PaperBanana

    PaperBanana

    Extension of Google Research’s PaperBanana

    PaperBanana is an open-source agentic framework designed to automatically generate publication-quality academic diagrams and statistical plots directly from text descriptions. The project focuses on helping researchers, educators, and data scientists transform conceptual descriptions of figures into structured visual outputs suitable for research papers, presentations, and technical reports. Instead of manually designing charts or diagrams using traditional visualization tools, users can describe the desired figure in natural language and allow the system to generate the visual representation automatically. ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 14
    WebGLM

    WebGLM

    An Efficient Web-enhanced Question Answering System

    ...The system is based on the General Language Model architecture and was designed to enable language models to interact directly with web information during the question-answering process. Instead of relying solely on knowledge stored in the model’s training data, the system retrieves relevant web content and integrates it into the reasoning process. WebGLM introduces several components that coordinate this process, including a retrieval module that selects relevant web documents, a generator that produces answers, and a scoring system that evaluates the quality of generated responses. The architecture aims to improve the reliability and usefulness of AI systems that answer questions about current or external knowledge sources.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 15
    LLM Course

    LLM Course

    Course to get into Large Language Models (LLMs)

    ...The materials also cover inference optimization and quantization to make serving LLMs feasible on commodity GPUs or even CPUs, which is crucial for side projects and startups. Evaluation is treated as a first-class topic, with examples of automatic and human-in-the-loop methods to catch regressions and verify quality beyond simple loss values. By the end, students have a mental model and a practical toolkit for iterating on datasets, training configs, etc.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 16
    MedicalGPT

    MedicalGPT

    MedicalGPT: Training Your Own Medical GPT Model with ChatGPT Training

    MedicalGPT training medical GPT model with ChatGPT training pipeline, implementation of Pretraining, Supervised Finetuning, Reward Modeling and Reinforcement Learning. MedicalGPT trains large medical models, including secondary pre-training, supervised fine-tuning, reward modeling, and reinforcement learning training.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 17
    Easy DataSet

    Easy DataSet

    A powerful tool for creating datasets for LLM fine-tuning

    ...The system includes automated question-generation capabilities, hierarchical label trees, and answer generation pipelines that use LLM APIs to produce coherent paired data with customizable templates. Beyond dataset creation, Easy-dataset also provides a built-in evaluation system with model testing and blind-test features, helping teams validate model performance using curated test sets.
    Downloads: 1 This Week
    Last Update:
    See Project
  • 18
    DocStrange

    DocStrange

    Extract and convert data from any document, images, pdfs, word doc

    DocStrange is an open-source document understanding and extraction library designed to convert complex files into structured, LLM-ready outputs such as Markdown, JSON, CSV, and HTML. Developed by Nanonets, the project combines OCR, layout detection, table understanding, and structured extraction into one end-to-end pipeline, which reduces the need to stitch together multiple separate services. It is built for developers who need high-quality parsing from scans, photos, PDFs, office files,...
    Downloads: 1 This Week
    Last Update:
    See Project
  • 19
    Prometheus-Eval

    Prometheus-Eval

    Evaluate your LLM's response with Prometheus and GPT4

    ...It also provides training data and utilities for fine-tuning evaluator models so they can assess outputs according to custom scoring rubrics such as helpfulness, accuracy, or style.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 20
    Magicoder

    Magicoder

    Empowering Code Generation with OSS-Instruct

    Magicoder is an open-source family of large language models designed specifically for code generation and software development tasks. The project focuses on improving the quality and diversity of code generation by training models with a novel dataset construction approach known as OSS-Instruct. This technique uses open-source code repositories as a foundation for generating more realistic and diverse instruction datasets for training language models. By grounding training data in real open-source examples, Magicoder aims to reduce bias and improve the reliability of code generation results compared to models trained solely on synthetic instructions. ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 21
    BISHENG

    BISHENG

    BISHENG is an open LLM devops platform for next generation apps

    BISHENG is an open LLM application DevOps platform, focusing on enterprise scenarios. It has been used by a large number of industry-leading organizations and Fortune 500 companies. "Bi Sheng" was the inventor of movable type printing, which played a vital role in promoting the transmission of human knowledge. We hope that BISHENG can also provide strong support for the widespread implementation of intelligent applications. Everyone is welcome to participate.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 22
    LLM Datasets

    LLM Datasets

    Curated list of datasets and tools for post-training

    ...The repository aims to make datasets easy to inspect and transform, with scripts for downloading, deduping, cleaning, and converting to formats like JSONL that slot into training pipelines. It highlights instruction-tuning and conversation-style corpora while also pointing to code, math, or domain-specific sets for targeted capabilities. Quality is a recurring theme: examples and utilities help filter low-value samples, enforce length limits, and split train/validation consistently so results are comparable. Licensing and provenance are surfaced to encourage compliant usage and to guide dataset selection in commercial settings. For practitioners, the repo is a practical “starting pantry” that accelerates experimentation and helps keep data wrangling from dominating the project timeline.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 23
    LLM Foundry

    LLM Foundry

    LLM training code for MosaicML foundation models

    Introducing MPT-7B, the first entry in our MosaicML Foundation Series. MPT-7B is a transformer trained from scratch on 1T tokens of text and code. It is open source, available for commercial use, and matches the quality of LLaMA-7B. MPT-7B was trained on the MosaicML platform in 9.5 days with zero human intervention at a cost of ~$200k. Large language models (LLMs) are changing the world, but for those outside well-resourced industry labs, it can be extremely difficult to train and deploy...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 24
    LangChain Extract

    LangChain Extract

    Did you say you like data?

    ...Developers can create reusable “extractors” that define what type of information should be pulled from a document, along with example prompts that improve extraction quality through in-context learning.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 25
    axflow

    axflow

    The TypeScript framework for AI development

    ...Its core SDK enables developers to integrate language model capabilities into web applications while maintaining strong modular design principles. Additional components support data ingestion, evaluation, and model interaction workflows that are commonly required when building production AI systems. For example, the framework includes modules for connecting application data to language models, evaluating the quality of model outputs, and building streaming user interfaces. Because each component can be used independently, developers can adopt Axflow incrementally rather than committing to a monolithic framework. ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • Previous
  • You're on page 1
  • 2
  • Next
Auth0 Logo