Data processing for and with foundation models
Self-learning data agent that grounds its answers in layers of content
An end-to-end Data Scientist
Synthetic Data Generation for tabular, relational and time series data
Official DeiT repository
Machine learning in Python
OCRmyPDF adds an OCR text layer to scanned PDF files
Label Studio is a multi-type data labeling and annotation tool
Training data (data labeling, annotation, workflow) for all data types
Conditional GAN for generating synthetic tabular data
The open-source tool for building high-quality datasets
A reactive notebook for Python
As little as 1 minute of voice data can be used to train a good TTS model
DataDreamer: Prompt. Generate Synthetic Data. Train & Align Models
Create HTML profiling reports from pandas DataFrame objects
Wan2.2: Open and Advanced Large-Scale Video Generative Model
Uncover insights, surface problems, monitor, and fine-tune your LLM
Code for running inference and finetuning with the SAM 3 model
AutoGluon: AutoML for Image, Text, and Tabular Data
Benchmarking synthetic data generation methods
RAGFlow is an open-source RAG (Retrieval-Augmented Generation) engine
ExtractThinker is a Document Intelligence library for LLMs
Efficient Triton Kernels for LLM Training
Interact with your documents using the power of GPT