data processing free download

Data-Juicer

Data processing for and with foundation models

Data-Juicer is an open-source data processing and augmentation framework designed to enhance the quality and diversity of datasets for machine learning tasks. It includes a modular pipeline for scalable data transformation.

Downloads: 4 This Week

Last Update: 2026-03-17

See Project

ExtractThinker

ExtractThinker is a Document Intelligence library for LLMs

ExtractThinker is a tool designed to facilitate the extraction and analysis of information from various data sources, aiding in data processing and knowledge discovery.

Downloads: 8 This Week

Last Update: 2025-06-09

See Project

DOLMA

Data and tools for generating and inspecting OLMo pre-training data

DOLMA is a dataset and toolkit for generating, curating, and inspecting the large-scale text corpora used to pre-train the OLMo language models.

Downloads: 10 This Week

Last Update: 2025-06-25

See Project

Diffgram

Training data (data labeling, annotation, workflow) for all data types

...Training Data is the art of supervising machines through data. This includes annotation, which produces structured data ready to be consumed by a machine learning model. Annotation is required because raw media is unstructured and not usable without it. That's why training data is required for many modern machine learning use cases, including computer vision, natural language processing, and speech recognition.

Downloads: 8 This Week

Last Update: 2024-10-14

See Project

Awesome Fraud Detection Research Papers

A curated list of data mining papers about fraud detection

A curated list of data mining papers about fraud detection from several conferences.

Downloads: 0 This Week

Last Update: 2026-01-05

See Project

DataProfiler

Extract schema, statistics and entities from datasets

DataProfiler is a Python library designed to make data analysis, monitoring, and sensitive data detection easy. It detects patterns, anomalies, and schema inconsistencies in structured and unstructured datasets. When loading data, a single command automatically formats and loads files into a DataFrame. When profiling the data, the library identifies the schema, statistics, entities (PII / NPI), and...

Downloads: 5 This Week

Last Update: 2025-07-30

See Project

spaCy

Industrial-strength Natural Language Processing (NLP)

spaCy is a library built on the very latest research for advanced Natural Language Processing (NLP) in Python and Cython. Since its inception it has been designed for real-world applications: building real products and gathering real insights. It comes with pretrained statistical models and word vectors, convolutional neural network models, easy deep learning integration and so much more. spaCy is the fastest syntactic parser in the world according to independent benchmarks, with...

Downloads: 116 This Week

Last Update: 2026-03-29

See Project

DataDreamer

DataDreamer: Prompt. Generate Synthetic Data. Train & Align Models

DataDreamer is a tool designed to assist in the generation and manipulation of synthetic data for various applications, including testing and machine learning.

Downloads: 0 This Week

Last Update: 2025-02-02

See Project

Datasets

Hub of ready-to-use datasets for ML models

Datasets is a library for easily accessing and sharing datasets, and evaluation metrics for Natural Language Processing (NLP), computer vision, and audio tasks. Load a dataset in a single line of code, and use our powerful data processing methods to quickly get your dataset ready for training in a deep learning model. Backed by the Apache Arrow format, process large datasets with zero-copy reads without any memory constraints for optimal speed and efficiency. ...

Downloads: 6 This Week

Last Update: 2026-03-23

See Project

Classical Language Toolkit (CLTK)

The Classical Language Toolkit

The Classical Language Toolkit (CLTK) is a Python library offering natural language processing support for classical languages, including Latin, Greek, and others.

Downloads: 6 This Week

Last Update: 2025-05-04

See Project

SetFit

Efficient few-shot learning with Sentence Transformers

SetFit is an efficient and prompt-free framework for few-shot fine-tuning of Sentence Transformers. It achieves high accuracy with little labeled data. For instance, with only 8 labeled examples per class on the Customer Reviews sentiment dataset, SetFit is competitive with fine-tuning RoBERTa Large on the full training set of 3k examples.

Downloads: 8 This Week

Last Update: 2025-08-05

See Project

Superlinked

Superlinked is a Python framework for AI Engineers

Superlinked is a Python framework designed for AI engineers to build high-performance search and recommendation applications that combine structured and unstructured data.

Downloads: 0 This Week

Last Update: 2025-10-22

See Project

LightAutoML

Fast and customizable framework for automatic ML model creation

LightAutoML is an automated machine learning (AutoML) framework optimized for efficient model training and hyperparameter tuning, focusing on both tabular and text data.

Downloads: 0 This Week

Last Update: 2025-12-04

See Project

deepdoctection

A Repo For Document AI

...For more specific text processing tasks use one of the many other great NLP libraries.

Downloads: 2 This Week

Last Update: 6 days ago

See Project

MindNLP

Easy-to-use and high-performance NLP and LLM framework

MindNLP is a natural language processing library built on the MindSpore framework, providing tools and models for various NLP tasks.

Downloads: 0 This Week

Last Update: 2025-11-05

See Project

PaddleNLP

Easy-to-use and powerful NLP library with Awesome model zoo

...Provides rich, industry-grade preset task capabilities through Taskflow, along with APIs that cover the whole text-processing workflow: a Dataset API that supports loading a wide range of Chinese datasets, a Data API for flexible and efficient data preprocessing, an Embedding API with 60+ preset pre-trained word vectors, a Transformer API providing 100+ pre-trained models, and more. Together these can greatly improve the efficiency of NLP task modeling.

Downloads: 3 This Week

Last Update: 2025-05-21

See Project

NVIDIA NeMo

Toolkit for conversational AI

NVIDIA NeMo, part of the NVIDIA AI platform, is a toolkit for building new state-of-the-art conversational AI models. NeMo has separate collections for Automatic Speech Recognition (ASR), Natural Language Processing (NLP), and Text-to-Speech (TTS) models. Each collection consists of prebuilt modules that include everything needed to train on your data. Every module can easily be customized, extended, and composed to create new conversational AI model architectures. Conversational AI architectures are typically large and require a lot of data and compute for training. ...

Downloads: 3 This Week

Last Update: 2026-03-23

See Project

txtai

Build AI-powered semantic search applications

txtai executes machine-learning workflows to transform data and build AI-powered semantic search applications. Traditional search systems use keywords to find data. Semantic search applications have an understanding of natural language and identify results that have the same meaning, not necessarily the same keywords. Backed by state-of-the-art machine learning models, data is transformed into vector representations for search (also known as embeddings). Innovation is happening at a rapid...

Downloads: 8 This Week

Last Update: 2026-03-17

See Project
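
The txtai description above contrasts keyword search with semantic search over vector representations. A minimal sketch of that idea follows. It does not use txtai itself: the three-dimensional vectors are hand-made, hypothetical stand-ins for the embeddings a trained model would produce, and the names `DOCS`, `QUERY`, `cosine`, and `search` are assumptions made for illustration only.

```python
import math

# Hypothetical, hand-made 3-dimensional "embeddings". A real system would
# obtain these vectors from a trained model, not write them by hand.
DOCS = {
    "The car would not start this morning": [0.9, 0.1, 0.0],
    "Fresh bread recipes for beginners": [0.0, 0.2, 0.9],
    "Stock markets fell sharply today": [0.1, 0.9, 0.1],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vector, docs, limit=1):
    """Return the `limit` document texts whose vectors are closest to the query."""
    ranked = sorted(
        docs,
        key=lambda text: cosine(query_vector, docs[text]),
        reverse=True,
    )
    return ranked[:limit]


# Assumed embedding for the query "my automobile has engine trouble".
# The query shares no keywords with the best match; only the vectors are close.
QUERY = [0.8, 0.2, 0.1]

top = search(QUERY, DOCS)
```

Ranking by vector similarity rather than by shared words is what lets a query about an "automobile" match a document about a "car" in this toy setup.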

BettaFish

Public opinion analysis system

...Unlike simpler analytics tools, BettaFish employs agent collaboration and a “forum” style internal mechanism to combine diverse model outputs, making the analysis richer and more robust. It also integrates multimodal processing, enabling it to parse images and video alongside text.

Downloads: 1 This Week

Last Update: 2026-02-17

See Project

Haystack

Haystack is an open source NLP framework to interact with your data

Apply the latest NLP technology to your own data with the use of Haystack's pipeline architecture. Implement production-ready semantic search, question answering, summarization and document ranking for a wide range of NLP applications. Evaluate components and fine-tune models. Ask questions in natural language and find granular answers in your documents using the latest QA models with the help of Haystack pipelines. Perform semantic search and retrieve ranked documents according to meaning,...

Downloads: 14 This Week

Last Update: 2026-04-01

See Project

Lingua-Py

The most accurate natural language detection library for Python

Its task is simple: It tells you which language some text is written in. This is very useful as a preprocessing step for linguistic data in natural language processing applications such as text classification and spell checking. Other use cases, for instance, might include routing e-mails to the right geographically located customer service department, based on the e-mails' languages. Language detection is often done as part of large machine learning frameworks or natural language processing applications. ...

Downloads: 0 This Week

Last Update: 2026-03-09

See Project

Open Interpreter

A natural language interface for computers

Open Interpreter is an open-source tool that provides a natural-language interface for interacting with your computer. It lets large language models (LLMs) run code locally (Python, JavaScript, shell, etc.), enabling you to ask your computer to do tasks like data analysis, file manipulation, browsing, etc. in human terms (“chat with your computer”), with safeguards. Runs locally or via configured remote LLM servers/inference backends, giving flexibility to use models you trust or have...

Downloads: 18 This Week

Last Update: 2025-09-12

See Project

Milvus Bootcamp

Dealing with all unstructured data, such as reverse image search

Milvus Bootcamp is a collection of tutorials, examples, and best practices for using Milvus, an open-source vector database designed for AI-powered similarity search and retrieval applications.

Downloads: 0 This Week

Last Update: 2025-05-22

See Project

Stanza

Stanford NLP Python library for many human languages

Stanza is a collection of accurate and efficient tools for the linguistic analysis of many human languages. From raw text through syntactic analysis and entity recognition, Stanza brings state-of-the-art NLP models to languages of your choosing. Stanza is a Python natural language analysis package. It contains tools, which can be used in a pipeline, to convert a string containing human language text into lists of sentences and words, to generate base forms of those words, their parts of...

Downloads: 6 This Week

Last Update: 2026-02-26

See Project

torchtext

Data loaders and abstractions for text and NLP

We recommend Anaconda as a Python package management system. Please refer to pytorch.org for the details of PyTorch installation. LTS versions are distributed through a different channel than the other versioned releases. Alternatively, you might want to use the Moses tokenizer port in SacreMoses (split from NLTK). You have to install SacreMoses. To build torchtext from source, you need git, CMake, and a C++11 compiler such as g++. When building from source, make sure that you have the same C++...

Downloads: 0 This Week

Last Update: 2024-04-16

See Project

Search Results for "data processing"

Showing 53 open source projects for "data processing"
