data processing free download

Synthetic Data Generator

SDG is a specialized framework

...It also includes a data processing module capable of handling different data types, preprocessing columns, managing missing values, and converting formats automatically before model training.

Downloads: 14 This Week

Last Update: 2026-03-06

See Project

LOTUS

AI-Powered Data Processing: Use LOTUS to process all of your datasets

LOTUS is an open-source framework and query engine designed to enable efficient processing of structured and unstructured datasets using large language models. The system provides a declarative programming model that allows developers to express complex AI data operations using high-level commands rather than manually orchestrating model calls. It offers a Python interface with a Pandas-like API, making it familiar for data scientists and engineers already working with data analysis libraries. ...

Downloads: 6 This Week

Last Update: 2026-03-06

See Project

Unstructured.IO

Open source libraries and APIs to build custom preprocessing pipelines

The unstructured library provides open-source components for ingesting and pre-processing images and text documents, such as PDFs, HTML, Word docs, and many more. The use cases of unstructured revolve around streamlining and optimizing the data processing workflow for LLMs. unstructured modular bricks and connectors form a cohesive system that simplifies data ingestion and pre-processing, making it adaptable to different platforms and is efficient in transforming unstructured data into structured outputs.

Downloads: 4 This Week

Last Update: 2 days ago

See Project

DocETL

A system for agentic LLM-powered data processing and ETL

DocETL is an open-source system designed to build and execute data processing pipelines powered by large language models, particularly for analyzing complex collections of documents and unstructured datasets. The platform allows developers and researchers to construct structured workflows that extract, transform, and organize information from sources such as reports, transcripts, legal documents, and other text-heavy data.

Downloads: 6 This Week

Last Update: 2026-03-05

See Project

DeepBI

LLM based data scientist, AI native data application

DeepBI is an AI-native data analysis platform. DeepBI leverages the power of large language models to explore, query, visualize, and share data from any data source. Users can use DeepBI to gain data insight and make data-driven decisions.

Downloads: 9 This Week

Last Update: 2024-11-30

See Project

Sparrow

Structured data extraction and instruction calling with ML, LLM

...The architecture is modular, allowing developers to build customizable processing pipelines that integrate with external tools and data extraction frameworks. Sparrow also includes workflow orchestration tools that allow multiple extraction tasks to be combined into automated pipelines for large-scale document processing.

Downloads: 6 This Week

Last Update: 2026-03-04

See Project

E2M

E2M converts various file types (doc, docx, epub, html, htm, url

E2M is a SourceForge mirror of the e2m open-source project, which focuses on providing tools or services designed to convert or process content between different formats or systems. Projects with similar naming conventions typically emphasize automation workflows where input data from one environment is transformed into another representation or output structure. The mirrored repository allows users to access the project’s codebase independently from its original hosting platform while preserving the development history and release artifacts. Systems like e2m often serve as middleware components that connect different software systems or facilitate data processing pipelines. ...

Downloads: 0 This Week

Last Update: 2026-03-09

See Project

NVIDIA NeMo

Toolkit for conversational AI

NVIDIA NeMo, part of the NVIDIA AI platform, is a toolkit for building new state-of-the-art conversational AI models. NeMo has separate collections for Automatic Speech Recognition (ASR), Natural Language Processing (NLP), and Text-to-Speech (TTS) models. Each collection consists of prebuilt modules that include everything needed to train on your data. Every module can easily be customized, extended, and composed to create new conversational AI model architectures. Conversational AI architectures are typically large and require a lot of data and compute for training. ...

Downloads: 3 This Week

Last Update: 2026-03-23

See Project

MegaParse

File Parser optimised for LLM Ingestion with no loss

MegaParse is a file parser optimized for Large Language Model (LLM) ingestion, ensuring no loss of information. It efficiently parses various document formats, such as PDFs, DOCX, and PPTX, converting them into formats ideal for processing by LLMs. This tool is essential for applications that require accurate and comprehensive data extraction from diverse document types.

Downloads: 1 This Week

Last Update: 2025-02-14

See Project

WeClone

One-stop solution for creating your digital avatar from chat history

WeClone is an open source AI project designed to replicate a person’s conversational style and personality by training models on chat history data. The system analyzes message patterns, linguistic style, and contextual behavior in order to generate responses that resemble the original user’s communication style. It is intended primarily as an experimental exploration of digital personality modeling and conversational AI personalization. By processing large volumes of conversation data, WeClone can build a profile of an individual’s writing tone, vocabulary preferences, and conversational tendencies. ...

Downloads: 7 This Week

Last Update: 2026-03-04

See Project

Bespoke Curator

Synthetic data curation for post-training and data extraction

Curator is an open-source Python library designed to build synthetic data pipelines for training and evaluating machine learning models, particularly large language models. The system helps developers generate, transform, and curate high-quality datasets by combining automated generation with structured validation and filtering. It supports workflows where models are used to produce synthetic examples that can later be refined into reliable training datasets for reasoning, question...

Downloads: 6 This Week

Last Update: 2026-03-14

See Project

chatd

Chat with your documents using local AI

chatd is an open-source desktop application that allows users to interact with their documents through a locally running large language model. The software focuses on privacy and security by ensuring that all document processing and inference occur entirely on the user’s computer without sending data to external cloud services. It includes a built-in integration with the Ollama runtime, which provides a cross-platform environment for running large language models locally. The application typically runs models such as Mistral-7B and allows users to load and analyze documents while asking questions in natural language. ...

Downloads: 8 This Week

Last Update: 2026-03-09

See Project

text-extract-api

Document (PDF, Word, PPTX ...) extraction and parse API

...Instead of requiring developers to integrate multiple document parsing libraries individually, the system centralizes text extraction capabilities into a unified API that standardizes the output. The platform supports automated processing pipelines that detect file types and apply the appropriate extraction method to obtain the most accurate text representation possible. It can be integrated into document analysis systems, knowledge retrieval tools, and AI pipelines that rely on clean textual data. The architecture is designed to be lightweight and easily deployable, making it suitable for both local installations and cloud environments.

Downloads: 4 This Week

Last Update: 2026-03-05

See Project

OpenPlanter

Language-model investigation agent with a terminal UI

OpenPlanter is an open-source Python project focused on building an intelligent automated planting or gardening system powered by software control and data processing. The repository is designed to help developers and hobbyists create programmable plant management workflows that can monitor, schedule, and optimize growing conditions. It emphasizes automation and extensibility, allowing integration with sensors, environmental data, and control logic for smart cultivation setups. The system is structured to support experimentation and customization, making it suitable for both research and DIY agriculture projects. ...

Downloads: 4 This Week

Last Update: 2026-03-06

See Project

DATAGEN

AI-driven multi-agent research assistant automating hypothesis

...The project integrates several modern AI frameworks including LangChain, LangGraph, and large language models to manage reasoning and data processing tasks. Through this architecture, the system can combine structured data analysis with natural language reasoning to generate insights and research outputs. The platform is designed for researchers, analysts, and developers who want to accelerate data exploration and automate parts of the research lifecycle.

Downloads: 0 This Week

Last Update: 2026-03-06

See Project

AI Powered Knowledge Graph Generator

AI-Powered Knowledge Graph is an open-source project focused on building knowledge graph systems that integrate artificial intelligence and machine learning to represent complex relationships between data entities. Knowledge graphs organize information as networks of nodes and relationships, allowing applications to analyze connections between concepts, datasets, or real-world entities. By incorporating AI techniques such as natural language processing and semantic reasoning, the project enables systems to automatically extract relationships and insights from large volumes of data.

Downloads: 1 This Week

Last Update: 2026-03-06

See Project

NeMo Curator

Scalable data pre processing and curation toolkit for LLMs

NeMo Curator is a Python library specifically designed for fast and scalable dataset preparation and curation for large language model (LLM) use-cases such as foundation model pretraining, domain-adaptive pretraining (DAPT), supervised fine-tuning (SFT) and paramter-efficient fine-tuning (PEFT). It greatly accelerates data curation by leveraging GPUs with Dask and RAPIDS, resulting in significant time savings. The library provides a customizable and modular interface, simplifying pipeline...

Downloads: 0 This Week

Last Update: 2026-02-23

See Project

NVIDIA Generative AI Examples

Generative AI reference workflows

...The repository includes examples covering topics such as retrieval-augmented generation pipelines, agent-based workflows, and multimodal AI applications that combine text, vision, and data processing. Many of the examples show how to deploy AI services using containerized environments, GPU acceleration, and microservices that can scale across modern infrastructure. Developers can explore sample chatbot applications, document question-answering systems, and knowledge-base pipelines that illustrate how generative AI can interact with external data sources.

Downloads: 8 This Week

Last Update: 2026-03-05

See Project

Local File Organizer

An AI-powered file management tool that ensures privacy

Local-File-Organizer is an AI-powered file management system designed to automatically analyze, categorize, and reorganize files stored on a user’s local machine. The project focuses on privacy-first file organization by performing all processing locally rather than sending data to external cloud services. It uses language and vision models to understand the contents of documents, images, and other file types so that files can be grouped intelligently according to their meaning or context. The system scans directories, extracts relevant information from files, and restructures folder hierarchies to make content easier to locate and manage. ...

Downloads: 0 This Week

Last Update: 2026-03-05

See Project

llms-from-scratch-cn

Build a large language model from 0 only with Python foundation

...Through a collection of notebooks, code examples, and translated learning materials, users can explore how to implement components such as multi-head attention, data loaders, and training pipelines using Python and PyTorch.

Downloads: 0 This Week

Last Update: 2026-03-26

See Project

Swirl

Swirl queries any number of data sources with APIs

Swirl queries any number of data sources with APIs and uses spaCy and NLTK to re-rank the unified results without extracting and indexing anything! Includes zero-code configs for Apache Solr, ChatGPT, Elastic Search, OpenSearch, PostgreSQL, Google BigQuery, RequestsGet, Google PSE, NLResearch.com, Miro & more! SWIRL adapts and distributes queries to anything with a search API - search engines, databases, noSQL engines, cloud/SaaS services etc - and uses AI (Large Language Models) to re-rank...

Downloads: 0 This Week

Last Update: 2025-12-11

See Project

xLSTM

Neural Network architecture based on ideas of the original LSTM

xLSTM is an open-source machine learning architecture that reimagines the classic Long Short-Term Memory (LSTM) network for modern large-scale language modeling and sequence processing tasks. The project introduces a new recurrent neural network design that incorporates exponential gating mechanisms and enhanced memory structures to overcome limitations of traditional LSTM models. By introducing innovations such as matrix-based memory and improved normalization techniques, xLSTM improves the ability of recurrent networks to capture long-range dependencies in sequential data.

Downloads: 1 This Week

Last Update: 2026-03-06

See Project

kg-gen

Knowledge Graph Generation from Any Text

kg-gen is an open-source framework developed by the STAIR Lab that automatically generates knowledge graphs from unstructured text using large language models. The system is designed to transform plain text sources such as documents, articles, or conversation transcripts into structured graphs composed of entities and relationships. Instead of relying on traditional rule-based extraction techniques, KG-Gen uses language models to identify entities and their relationships, producing...

Downloads: 0 This Week

Last Update: 2026-03-09

See Project

Chinese-LLaMA-Alpaca 2

Chinese LLaMA-2 & Alpaca-2 Large Model Phase II Project

This project is developed based on the commercially available large model Llama-2 released by Meta. It is the second phase of the Chinese LLaMA&Alpaca large model project. The Chinese LLaMA-2 base model and the Alpaca-2 instruction fine-tuning large model are open-sourced. These models expand and optimize the Chinese vocabulary on the basis of the original Llama-2, use large-scale Chinese data for incremental pre-training, and further improve the basic semantics and command understanding of...

Downloads: 0 This Week

Last Update: 2024-01-23

See Project

towhee

Framework that is dedicated to making neural data processing

...Towhee provides out-of-the-box integration with your favorite libraries, tools, and frameworks, making development quick and easy. Towhee includes a pythonic method-chaining API for describing custom data processing pipelines. We also support schemas, making processing unstructured data as easy as handling tabular data.

Downloads: 1 This Week

Last Update: 2023-12-05

See Project

Search Results for "data processing"

Showing 27 open source projects for "data processing"

Synthetic Data Generator

LOTUS

Unstructured.IO

DocETL

DeepBI

Sparrow

E2M

NVIDIA NeMo

MegaParse

WeClone

Bespoke Curator

chatd

text-extract-api

OpenPlanter

DATAGEN

AI Powered Knowledge Graph Generator

NeMo Curator

NVIDIA Generative AI Examples

Local File Organizer

llms-from-scratch-cn

Swirl

xLSTM

kg-gen

Chinese-LLaMA-Alpaca 2

towhee

Search Results for "data processing"

Showing 27 open source projects for "data processing"

Synthetic Data Generator

LOTUS

Unstructured.IO

DocETL

DeepBI

Sparrow

E2M

NVIDIA NeMo

MegaParse

WeClone

Bespoke Curator

chatd

text-extract-api

OpenPlanter

DATAGEN

AI Powered Knowledge Graph Generator

NeMo Curator

NVIDIA Generative AI Examples

Local File Organizer

llms-from-scratch-cn

Swirl

xLSTM

kg-gen

Chinese-LLaMA-Alpaca 2

towhee

Related Searches

Related Categories