data.6bin free download

Data-Juicer

Data processing for and with foundation models

Data-Juicer is an open-source data processing and augmentation framework designed to enhance the quality and diversity of datasets for machine learning tasks. It includes a modular pipeline for scalable data transformation.

Downloads: 0 This Week

Last Update: 2026-03-17

Diffgram

Training data (data labeling, annotation, workflow) for all data types

From ingesting data to exploring it, annotating it, and managing workflows, Diffgram is a single application that improves your data labeling and brings all aspects of training data under one roof. Diffgram describes itself as the world’s first truly open source training data platform, focused on giving its users an unlimited experience. It aims to reduce your data labeling bills and increase your training data quality.

Downloads: 2 This Week

Last Update: 2024-10-14

ExtractThinker

ExtractThinker is a Document Intelligence library for LLMs

ExtractThinker is a tool designed to facilitate the extraction and analysis of information from various data sources, aiding in data processing and knowledge discovery.

Downloads: 2 This Week

Last Update: 2025-06-09

DataProfiler

Extract schema, statistics and entities from datasets

DataProfiler is a Python library designed to make data analysis, monitoring, and sensitive data detection easy. It is an AI-powered tool for automatic data analysis and profiling that detects patterns, anomalies, and schema inconsistencies in structured and unstructured datasets. With a single command, the library automatically formats and loads files into a DataFrame.

Downloads: 1 This Week

Last Update: 2025-07-30

Awesome Fraud Detection Research Papers

A curated list of data mining papers about fraud detection

A curated list of data mining papers about fraud detection from several conferences.

Downloads: 0 This Week

Last Update: 2026-01-05

DataDreamer

DataDreamer: Prompt. Generate Synthetic Data. Train & Align Models

DataDreamer is a tool designed to assist in the generation and manipulation of synthetic data for various applications, including testing and machine learning.

Downloads: 0 This Week

Last Update: 2025-02-02

DOLMA

Data and tools for generating and inspecting OLMo pre-training data

DOLMA is a framework for efficiently managing the large-scale datasets used to train and fine-tune language models. It provides data and tools for generating and inspecting OLMo pre-training data.

Downloads: 2 This Week

Last Update: 2025-06-25

txtai

Build AI-powered semantic search applications

txtai executes machine-learning workflows to transform data and build AI-powered semantic search applications. Traditional search systems use keywords to find data. Semantic search applications have an understanding of natural language and identify results that have the same meaning, not necessarily the same keywords. Backed by state-of-the-art machine learning models, data is transformed into vector representations for search (also known as embeddings).
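
The idea of matching by meaning rather than by keywords can be illustrated with a minimal sketch. This is not txtai's API: the three-dimensional vectors below are hand-made stand-ins for embeddings that a real model would produce, and the names `DOCS`, `cosine`, and `search` are hypothetical. The sketch only shows how ranking by vector similarity can surface a result that shares no keywords with the query.

```python
import math

# Hypothetical, hand-made "embeddings"; a real system would get these from a model.
DOCS = {
    "feline sleeping on the sofa": [0.9, 0.1, 0.0],
    "stock market falls sharply": [0.0, 0.2, 0.9],
    "puppy playing in the garden": [0.6, 0.7, 0.1],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query_vec, docs=DOCS, limit=1):
    """Return the `limit` documents whose vectors are closest to the query."""
    ranked = sorted(docs.items(), key=lambda kv: cosine(query_vec, kv[1]), reverse=True)
    return [text for text, _ in ranked[:limit]]


# Assumed embedding of the query "cat napping on the couch":
# it shares no words with any document, only a nearby vector.
query_vec = [0.8, 0.2, 0.0]
best = search(query_vec)
```

Here `best` holds the sofa document even though the query and the document have no keywords in common, which is the behavior the description attributes to semantic search.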

Downloads: 4 This Week

Last Update: 2026-04-29

Open Interpreter

A natural language interface for computers

Open Interpreter is an open-source tool that provides a natural-language interface for interacting with your computer. It lets large language models (LLMs) run code locally (Python, JavaScript, shell, etc.), so you can ask your computer in plain language to do tasks such as data analysis, file manipulation, and browsing (“chat with your computer”), with safeguards. It runs locally or via configured remote LLM servers or inference backends, so you can use models you trust or host yourself. It prompts you to approve code before executing it, and supports both online LLMs and local inference servers. ...

Downloads: 21 This Week

Last Update: 2025-09-12

tidytext

Text mining using tidy tools

tidytext brings tidy data principles to text mining by converting text into a tidy data frame format. It provides tools for tokenization, sentiment analysis, n‑gram creation, and term‑document matrices, enabling interoperability with dplyr, ggplot2, and other tidyverse workflows.

Downloads: 0 This Week

Last Update: 2025-07-30

Datasets

Hub of ready-to-use datasets for ML models

...Datasets naturally frees the user from RAM limitations: all datasets are memory-mapped using an efficient zero-serialization-cost backend (Apache Arrow). Smart caching: never wait for your data to be processed several times.

Downloads: 4 This Week

Last Update: 2026-04-27

Superlinked

Superlinked is a Python framework for AI Engineers

Superlinked is a Python framework designed for AI engineers to build high-performance search and recommendation applications that combine structured and unstructured data.

Downloads: 0 This Week

Last Update: 2025-10-22

LightAutoML

Fast and customizable framework for automatic ML model creation

LightAutoML is an automated machine learning (AutoML) framework optimized for efficient model training and hyperparameter tuning, focusing on both tabular and text data.

Downloads: 0 This Week

Last Update: 2025-12-04

SetFit

Efficient few-shot learning with Sentence Transformers

SetFit is an efficient and prompt-free framework for few-shot fine-tuning of Sentence Transformers. It achieves high accuracy with little labeled data - for instance, with only 8 labeled examples per class on the Customer Reviews sentiment dataset, SetFit is competitive with fine-tuning RoBERTa Large on the full training set of 3k examples.

Downloads: 1 This Week

Last Update: 2025-08-05

PaddleNLP

Easy-to-use and powerful NLP library with Awesome model zoo

PaddleNLP is a natural language processing development library for PaddlePaddle with three major features: an easy-to-use text-domain API, application examples for multiple scenarios, and high-performance distributed training. It aims to improve the modeling efficiency of PaddlePaddle developers working with text and provides rich examples of NLP applications. It offers industry-level preset task capabilities (Taskflow) and a full-process text-domain API: the Dataset API supports loading a rich set of Chinese datasets, the Data API handles data preprocessing flexibly and efficiently, the Embedding API presets 60+ pre-trained word vectors, and the Transformer API provides 100+ pre-trained models. Together these can greatly improve the efficiency of NLP task modeling.

Downloads: 1 This Week

Last Update: 2025-05-21

BettaFish

Public opinion analysis system

BettaFish is an open-source, multi-agent public opinion analysis system built to automate the collection, deep analysis, and reporting of social media data at scale through conversational queries. It uses a modular architecture of specialized agents that collaborate to crawl mainstream platforms, extract multimodal content like text and short video, and synthesize insights through both statistical and large language model techniques. With a design that lets users pose questions in natural language and receive structured reports, charts, and visualizations, the system aims to break information cocoons and provide comprehensive views of trends and public sentiment. ...

Downloads: 2 This Week

Last Update: 2026-02-17

NVIDIA NeMo

Toolkit for conversational AI

...NeMo has separate collections for Automatic Speech Recognition (ASR), Natural Language Processing (NLP), and Text-to-Speech (TTS) models. Each collection consists of prebuilt modules that include everything needed to train on your data. Every module can easily be customized, extended, and composed to create new conversational AI model architectures. Conversational AI architectures are typically large and require a lot of data and compute for training. NeMo uses PyTorch Lightning for easy and performant multi-GPU/multi-node mixed-precision training. Supported models: Jasper, QuartzNet, CitriNet, Conformer-CTC, Conformer-Transducer, Squeezeformer-CTC, Squeezeformer-Transducer, ContextNet, LSTM-Transducer (RNNT), LSTM-CTC. ...

Downloads: 3 This Week

Last Update: 2026-04-22

deepdoctection

A Repo For Document AI

deepdoctection is a Python library and document AI framework that applies deep learning to analyze and extract structured data from scanned documents, PDFs, and images. It orchestrates document extraction and document layout analysis tasks using deep learning models. It does not implement models itself; instead, it lets you build pipelines using widely recognized libraries for object detection, OCR, and selected NLP tasks, and provides an integrated framework for fine-tuning, evaluating, and running models. ...

Downloads: 3 This Week

Last Update: 6 days ago

MindNLP

Easy-to-use and high-performance NLP and LLM framework

MindNLP is a natural language processing library built on the MindSpore framework, providing tools and models for various NLP tasks.

Downloads: 1 This Week

Last Update: 2025-11-05

spaCy

Industrial-strength Natural Language Processing (NLP)

spaCy is a library for advanced Natural Language Processing (NLP) in Python and Cython, built on the very latest research. From its inception it was designed for real-world applications: building real products and gathering real insights. It comes with pretrained statistical models and word vectors, convolutional neural network models, easy deep learning integration, and much more. spaCy is the fastest syntactic parser in the world according to independent benchmarks, with...

Downloads: 4 This Week

Last Update: 2026-03-29

Spark NLP

State of the Art Natural Language Processing

...Spark ML provides a set of machine learning applications built from two main components: estimators and transformers. An estimator has a method that fits (trains) on a piece of data. A transformer is generally the result of a fitting process and applies changes to the target dataset. These components are also available in Spark NLP. Pipelines are a mechanism for combining multiple estimators and transformers in a single workflow, allowing multiple transformations to be chained within a machine-learning task.
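
The estimator, transformer, and pipeline roles described here can be sketched in plain Python. This is a toy illustration of the pattern only, not the Spark ML or Spark NLP API: every class name below (`Lowercase`, `VocabEstimator`, `VocabModel`, `Pipeline`, `PipelineModel`) is hypothetical, and the data is a plain list of strings rather than a Spark DataFrame.

```python
class Lowercase:
    """Transformer: changes the data and needs no training."""

    def transform(self, rows):
        return [row.lower() for row in rows]


class VocabModel:
    """Transformer produced by fitting VocabEstimator; maps tokens to ids."""

    def __init__(self, vocab):
        self.vocab = vocab

    def transform(self, rows):
        # Unknown tokens get the id -1.
        return [[self.vocab.get(tok, -1) for tok in row.split()] for row in rows]


class VocabEstimator:
    """Estimator: fit() learns from data and returns a transformer."""

    def fit(self, rows):
        vocab = {}
        for row in rows:
            for tok in row.split():
                vocab.setdefault(tok, len(vocab))
        return VocabModel(vocab)


class PipelineModel:
    """Fitted pipeline: applies every fitted stage in order."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline:
    """Chains estimators and transformers into a single workflow."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            # Estimators are fitted on the data produced by the earlier stages.
            model = stage.fit(rows) if hasattr(stage, "fit") else stage
            rows = model.transform(rows)
            fitted.append(model)
        return PipelineModel(fitted)


model = Pipeline([Lowercase(), VocabEstimator()]).fit(["Spark NLP", "NLP pipelines"])
ids = model.transform(["nlp spark", "unknown"])
```

Fitting the pipeline turns each estimator into a transformer, so the fitted result can be reused on new data with a single `transform` call.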

Downloads: 2 This Week

Last Update: 2026-04-07

Weaviate

Weaviate is a cloud-native, modular, real-time vector search engine

Weaviate in a nutshell: Weaviate is a vector search engine and vector database. Weaviate uses machine learning to vectorize and store data, and to find answers to natural language queries. With Weaviate you can also bring your custom ML models to production scale. Weaviate in detail: Weaviate is a low-latency vector search engine with out-of-the-box support for different media types (text, images, etc.). It offers Semantic Search, Question-Answer-Extraction, Classification, Customizable Models (PyTorch/TensorFlow/Keras), and more. ...

Downloads: 4 This Week

Last Update: 4 days ago

Stanford CoreNLP

Stanford CoreNLP, a Java suite of core NLP tools

...The centerpiece of CoreNLP is the pipeline. Pipelines take in raw text, run a series of NLP annotators on the text, and produce a final set of annotations. Pipelines produce CoreDocuments, data objects that contain all of the annotation information, accessible with a simple API, and serializable to a Google Protocol Buffer. CoreNLP generates a variety of linguistic annotations, including parts of speech, named entities, dependency parses, and coreference.

Downloads: 3 This Week

Last Update: 2025-06-07

Stanza

Stanford NLP Python library for many human languages

...The toolkit is designed to be parallel among more than 70 languages, using the Universal Dependencies formalism. Stanza is built with highly accurate neural network components that also enable efficient training and evaluation with your own annotated data.

Downloads: 3 This Week

Last Update: 2026-02-26

Search-Index

A persistent, network resilient, full text search library

Search-Index is a lightweight and fast JavaScript-based search engine that enables full-text search indexing and retrieval for web applications.

Downloads: 0 This Week

Last Update: 2025-03-12

Search Results for "data.6bin"

Showing 92 open source projects for "data.6bin"
