Showing 37 open source projects for "data processing"

  • 1
    Synthetic Data Generator

    SDG is a specialized framework

    ...It also includes a data processing module capable of handling different data types, preprocessing columns, managing missing values, and converting formats automatically before model training.
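The preprocessing steps mentioned above (handling mixed data types, imputing missing values, normalizing formats) can be sketched in plain Python. This is a minimal illustration of the idea, not SDG's actual API; `preprocess_column` is a hypothetical helper.

```python
# Toy sketch of column preprocessing before model training:
# coerce numeric-looking values to floats, mean-impute missing entries,
# and leave genuinely non-numeric columns untouched.
# `preprocess_column` is a hypothetical name, not part of SDG.

def preprocess_column(values):
    """Coerce a column to floats where possible and impute missing entries."""
    numeric = []
    for v in values:
        if v is None or v == "":
            numeric.append(None)          # mark missing for imputation
        else:
            try:
                numeric.append(float(v))  # normalize "3" / 3 / "3.0" to float
            except (TypeError, ValueError):
                return list(values)       # non-numeric column: leave as-is
    present = [v for v in numeric if v is not None]
    mean = sum(present) / len(present) if present else 0.0
    return [mean if v is None else v for v in numeric]
```

For example, `preprocess_column(["1", None, "3"])` fills the gap with the column mean, while a text column passes through unchanged.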
    Downloads: 13 This Week
  • 2
    LOTUS

    AI-Powered Data Processing: Use LOTUS to process all of your datasets

    LOTUS is an open-source framework and query engine designed to enable efficient processing of structured and unstructured datasets using large language models. The system provides a declarative programming model that allows developers to express complex AI data operations using high-level commands rather than manually orchestrating model calls. It offers a Python interface with a Pandas-like API, making it familiar for data scientists and engineers already working with data analysis libraries. ...
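The declarative model described above can be illustrated with a toy semantic filter: a high-level operator keeps rows that a model judges to match a natural-language instruction. Here a plain Python predicate stands in for the LLM call, and `sem_filter` only mimics the style of LOTUS's operators, not their real signatures.

```python
# Toy stand-in for a declarative semantic operator: filter rows by a
# natural-language instruction, delegating the judgment to a callable
# (a real system would call an LLM here). Names are illustrative only.

def sem_filter(rows, instruction, judge):
    """Keep rows the judge deems matching the instruction."""
    return [row for row in rows if judge(instruction, row)]

papers = [
    {"title": "Attention Is All You Need"},
    {"title": "A Survey of Crop Rotation"},
]

# Stand-in judge: a substring check instead of a real model call.
def keep_ai(instruction, row):
    return "Attention" in row["title"]
```

The point of the pattern is that callers express intent ("keep papers about AI") while the engine decides how and when models are invoked.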
    Downloads: 7 This Week
  • 3
    DocETL

    A system for agentic LLM-powered data processing and ETL

    DocETL is an open-source system designed to build and execute data processing pipelines powered by large language models, particularly for analyzing complex collections of documents and unstructured datasets. The platform allows developers and researchers to construct structured workflows that extract, transform, and organize information from sources such as reports, transcripts, legal documents, and other text-heavy data.
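The extract-transform-organize workflow described above can be sketched as stages composed over a list of document records. This is a generic pipeline sketch under assumed stage names (`extract`, `summarize`), not DocETL's actual operator set.

```python
# Sketch of a map-style document pipeline: each stage transforms every
# record, and stages compose in order. Stage names are illustrative.

def run_pipeline(docs, stages):
    for stage in stages:
        docs = [stage(d) for d in docs]
    return docs

def extract(doc):
    return {**doc, "words": doc["text"].split()}

def summarize(doc):
    return {**doc, "n_words": len(doc["words"])}

docs = [{"text": "quarterly report on revenue"}]
result = run_pipeline(docs, [extract, summarize])
```

Each stage stays small and testable, which is what makes such pipelines easy to rearrange for different document collections.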
    Downloads: 6 This Week
  • 4
    Unstructured.IO

    Open source libraries and APIs to build custom preprocessing pipelines

    The unstructured library provides open-source components for ingesting and pre-processing images and text documents, such as PDFs, HTML, Word docs, and many more. The use cases of unstructured revolve around streamlining and optimizing the data processing workflow for LLMs. unstructured's modular bricks and connectors form a cohesive system that simplifies data ingestion and pre-processing, making it adaptable to different platforms and efficient at transforming unstructured data into structured outputs.
    Downloads: 2 This Week
  • 5
    Sparrow

    Structured data extraction and instruction calling with ML, LLM

    ...The architecture is modular, allowing developers to build customizable processing pipelines that integrate with external tools and data extraction frameworks. Sparrow also includes workflow orchestration tools that allow multiple extraction tasks to be combined into automated pipelines for large-scale document processing.
    Downloads: 6 This Week
  • 6
    DeepBI

    LLM based data scientist, AI native data application

    DeepBI is an AI-native data analysis platform. DeepBI leverages the power of large language models to explore, query, visualize, and share data from any data source. Users can use DeepBI to gain data insight and make data-driven decisions.
    Downloads: 8 This Week
  • 7
    E2M

    E2M converts various file types (doc, docx, epub, html, htm, url

    E2M is a SourceForge mirror of the e2m open-source project, which focuses on providing tools or services designed to convert or process content between different formats or systems. Projects with similar naming conventions typically emphasize automation workflows where input data from one environment is transformed into another representation or output structure. The mirrored repository allows users to access the project’s codebase independently from its original hosting platform while preserving the development history and release artifacts. Systems like e2m often serve as middleware components that connect different software systems or facilitate data processing pipelines. ...
    Downloads: 0 This Week
  • 8
    NVIDIA NeMo

    Toolkit for conversational AI

    NVIDIA NeMo, part of the NVIDIA AI platform, is a toolkit for building new state-of-the-art conversational AI models. NeMo has separate collections for Automatic Speech Recognition (ASR), Natural Language Processing (NLP), and Text-to-Speech (TTS) models. Each collection consists of prebuilt modules that include everything needed to train on your data. Every module can easily be customized, extended, and composed to create new conversational AI model architectures. Conversational AI architectures are typically large and require a lot of data and compute for training. ...
    Downloads: 3 This Week
  • 9
    Pluely

    The Open Source Alternative to Cluely

    ...The system focuses on orchestrating tasks performed by large language models and other AI components, allowing developers to define structured workflows where models interact with tools, APIs, and external systems. By providing a modular architecture for building AI pipelines, the platform enables developers to connect multiple processing steps such as data retrieval, prompt execution, analysis, and response generation. The project emphasizes flexibility, allowing developers to extend the platform with custom integrations and automation logic. This makes the framework suitable for building intelligent assistants, automated business workflows, and data-processing pipelines that rely on generative AI capabilities.
    Downloads: 13 This Week
  • 10
    MegaParse

    File Parser optimised for LLM Ingestion with no loss

    MegaParse is a file parser optimized for Large Language Model (LLM) ingestion, ensuring no loss of information. It efficiently parses various document formats, such as PDFs, DOCX, and PPTX, converting them into formats ideal for processing by LLMs. This tool is essential for applications that require accurate and comprehensive data extraction from diverse document types.
    Downloads: 1 This Week
  • 11
    WeClone

    One-stop solution for creating your digital avatar from chat history

    WeClone is an open source AI project designed to replicate a person’s conversational style and personality by training models on chat history data. The system analyzes message patterns, linguistic style, and contextual behavior in order to generate responses that resemble the original user’s communication style. It is intended primarily as an experimental exploration of digital personality modeling and conversational AI personalization. By processing large volumes of conversation data, WeClone can build a profile of an individual’s writing tone, vocabulary preferences, and conversational tendencies. ...
    Downloads: 6 This Week
  • 12
    Bespoke Curator

    Synthetic data curation for post-training and data extraction

    Curator is an open-source Python library designed to build synthetic data pipelines for training and evaluating machine learning models, particularly large language models. The system helps developers generate, transform, and curate high-quality datasets by combining automated generation with structured validation and filtering. It supports workflows where models are used to produce synthetic examples that can later be refined into reliable training datasets for reasoning, question...
    Downloads: 5 This Week
  • 13
    chatd

    Chat with your documents using local AI

    chatd is an open-source desktop application that allows users to interact with their documents through a locally running large language model. The software focuses on privacy and security by ensuring that all document processing and inference occur entirely on the user’s computer without sending data to external cloud services. It includes a built-in integration with the Ollama runtime, which provides a cross-platform environment for running large language models locally. The application typically runs models such as Mistral-7B and allows users to load and analyze documents while asking questions in natural language. ...
    Downloads: 9 This Week
  • 14
    Lemon AI

    Full-stack Open-source Self-Evolving General AI Agent

    LemonAI is an open-source full-stack framework for building autonomous AI agents capable of performing complex tasks such as research, programming, data analysis, and document processing. The platform is designed to run primarily on local infrastructure, providing a privacy-focused alternative to cloud-dependent agent platforms. It integrates with local large language models through tools such as Ollama, vLLM, and other model runtimes while also allowing optional connections to external cloud models. ...
    Downloads: 6 This Week
  • 15
    text-extract-api

    Document (PDF, Word, PPTX ...) extraction and parse API

    ...Instead of requiring developers to integrate multiple document parsing libraries individually, the system centralizes text extraction capabilities into a unified API that standardizes the output. The platform supports automated processing pipelines that detect file types and apply the appropriate extraction method to obtain the most accurate text representation possible. It can be integrated into document analysis systems, knowledge retrieval tools, and AI pipelines that rely on clean textual data. The architecture is designed to be lightweight and easily deployable, making it suitable for both local installations and cloud environments.
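The type-detection-and-dispatch behavior described above can be sketched as a lookup from file extension to extraction routine with a default fallback. The handler names and outputs below are illustrative placeholders, not text-extract-api's actual interface.

```python
# Sketch of extension-based dispatch: choose an extraction routine by
# file type, falling back to plain-text handling for unknown types.
# Handler names and their string outputs are made up for illustration.
import os

def extract_plain(path):
    return f"plain:{path}"

def extract_pdf(path):
    return f"pdf:{path}"

HANDLERS = {".txt": extract_plain, ".md": extract_plain, ".pdf": extract_pdf}

def extract_text(path):
    ext = os.path.splitext(path)[1].lower()
    handler = HANDLERS.get(ext, extract_plain)  # default: treat as plain text
    return handler(path)
```

Centralizing the dispatch table is what lets such an API present one entry point over many underlying parsing libraries.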
    Downloads: 4 This Week
  • 16
    dataline

    AI data analysis and visualization on CSV, Postgres, MySQL, Snowflake

    ...Once connected, users can generate tables, charts, and reports automatically based on queries produced by the AI engine. The platform is designed with a privacy-first architecture that stores data locally on the user’s device rather than sending it to external cloud services by default. It can also hide sensitive data from language models during processing, ensuring that only necessary metadata is used for query generation.
    Downloads: 2 This Week
  • 17
    TaxHacker

    Self-hosted AI accounting app. LLM analyzer for receipts

    TaxHacker is an open-source, self-hosted accounting application that uses artificial intelligence to automate financial record management for freelancers, independent developers, and small businesses. The system is designed to simplify bookkeeping by automatically processing financial documents such as receipts, invoices, and transaction records. It integrates large language models to analyze these documents, extract relevant financial information, and categorize expenses or income based on configurable rules. Users can deploy the application on their own infrastructure, ensuring that financial data remains private and under their control rather than being processed by external services. ...
    Downloads: 4 This Week
  • 18
    GLM-5

    From Vibe Coding to Agentic Engineering

    GLM-5 is a next-generation open-source large language model (LLM) developed by the Z.ai team under the zai-org organization that pushes the boundaries of reasoning, coding, and long-horizon agentic intelligence. Building on earlier GLM-series models, GLM-5 dramatically scales the parameter count (to roughly 744 billion) and expands pre-training data to significantly improve performance on complex tasks such as multi-step reasoning, software engineering workflows, and agent orchestration compared to predecessors like GLM-4.5. It incorporates innovations like DeepSeek Sparse Attention (DSA) to preserve massive context windows while reducing deployment costs and supporting long-context processing, which is crucial for detailed plans and agent tasks.
    Downloads: 182 This Week
  • 19
    Flock

    Flock is a workflow-based low-code platform for building chatbots

    Flock is a workflow-based low-code platform designed for building AI applications such as chatbots, retrieval-augmented generation systems, and multi-agent workflows. The platform uses a visual workflow architecture where different nodes represent processing steps such as input processing, model inference, retrieval operations, and tool execution. Developers can connect these nodes to create complex pipelines that orchestrate multiple language models and external services. Built on...
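The node-based architecture described above can be sketched as a linear workflow where each node receives a state dictionary and returns an updated one. The node names (`input_node`, `inference_node`) and the echo response are illustrative stand-ins, not Flock's actual node types.

```python
# Sketch of a linear node workflow: each node transforms a shared state
# dict, and the runner threads the state through the nodes in order.
# Node names and behavior are made up for illustration.

def run_workflow(state, nodes):
    for name, node in nodes:
        state = node(dict(state))  # copy so nodes stay side-effect free
    return state

def input_node(state):
    state["query"] = state["raw"].strip().lower()
    return state

def inference_node(state):
    state["answer"] = f"echo:{state['query']}"  # stand-in for a model call
    return state

nodes = [("input", input_node), ("inference", inference_node)]
final = run_workflow({"raw": "  Hello "}, nodes)
```

A visual editor over such a graph amounts to choosing which nodes appear and how their outputs feed the next node's input.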
    Downloads: 3 This Week
  • 20
    DATAGEN

    AI-driven multi-agent research assistant automating hypothesis

    ...The project integrates several modern AI frameworks including LangChain, LangGraph, and large language models to manage reasoning and data processing tasks. Through this architecture, the system can combine structured data analysis with natural language reasoning to generate insights and research outputs. The platform is designed for researchers, analysts, and developers who want to accelerate data exploration and automate parts of the research lifecycle.
    Downloads: 0 This Week
  • 21
    OpenPlanter

    Language-model investigation agent with a terminal UI

    OpenPlanter is an open-source Python project focused on building an intelligent automated planting or gardening system powered by software control and data processing. The repository is designed to help developers and hobbyists create programmable plant management workflows that can monitor, schedule, and optimize growing conditions. It emphasizes automation and extensibility, allowing integration with sensors, environmental data, and control logic for smart cultivation setups. The system is structured to support experimentation and customization, making it suitable for both research and DIY agriculture projects. ...
    Downloads: 3 This Week
  • 22
    NLP-Knowledge-Graph

    Research and application of technologies such as natural language processing

    NLP-Knowledge-Graph is an open educational repository that collects resources, research materials, and tutorials focused on the intersection of natural language processing and knowledge graph technologies. The project aims to help researchers and developers understand how structured knowledge representations can enhance language processing systems. It includes curated materials covering key topics such as knowledge graph construction, entity recognition, relation extraction, graph embeddings, and semantic reasoning. ...
    Downloads: 0 This Week
  • 23
    DocStrange

    Extract and convert data from any document, images, pdfs, word doc

    DocStrange is an open-source document understanding and extraction library designed to convert complex files into structured, LLM-ready outputs such as Markdown, JSON, CSV, and HTML. Developed by Nanonets, the project combines OCR, layout detection, table understanding, and structured extraction into one end-to-end pipeline, which reduces the need to stitch together multiple separate services. It is built for developers who need high-quality parsing from scans, photos, PDFs, office files,...
    Downloads: 1 This Week
  • 24
    AI Powered Knowledge Graph Generator

    AI-Powered Knowledge Graph is an open-source project focused on building knowledge graph systems that integrate artificial intelligence and machine learning to represent complex relationships between data entities. Knowledge graphs organize information as networks of nodes and relationships, allowing applications to analyze connections between concepts, datasets, or real-world entities. By incorporating AI techniques such as natural language processing and semantic reasoning, the project enables systems to automatically extract relationships and insights from large volumes of data.
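The node-and-relationship model described above can be sketched as a set of subject-predicate-object triples with a simple neighbor query. The entity and relation names below are invented examples, not the project's schema.

```python
# Sketch of a knowledge graph as subject-predicate-object triples,
# with a query for the objects reachable via a given relation.
# Entities and relations here are illustrative only.

class KnowledgeGraph:
    def __init__(self):
        self.triples = set()

    def add(self, subject, relation, obj):
        self.triples.add((subject, relation, obj))

    def objects(self, subject, relation):
        """All objects linked to `subject` by `relation`."""
        return {o for s, r, o in self.triples if s == subject and r == relation}

kg = KnowledgeGraph()
kg.add("Ada Lovelace", "field", "mathematics")
kg.add("Ada Lovelace", "collaborated_with", "Charles Babbage")
```

An AI-powered generator would populate `add` calls automatically from extracted text instead of by hand.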
    Downloads: 0 This Week
  • 25
    NeMo Curator

    Scalable data pre-processing and curation toolkit for LLMs

    NeMo Curator is a Python library specifically designed for fast and scalable dataset preparation and curation for large language model (LLM) use cases such as foundation model pretraining, domain-adaptive pretraining (DAPT), supervised fine-tuning (SFT) and parameter-efficient fine-tuning (PEFT). It greatly accelerates data curation by leveraging GPUs with Dask and RAPIDS, resulting in significant time savings. The library provides a customizable and modular interface, simplifying pipeline...
    Downloads: 0 This Week