Easy DataSet is an open-source tool for building high-quality datasets for large language model (LLM) fine-tuning, retrieval-augmented generation (RAG), and evaluation. It automates much of this work through intuitive interfaces and built-in parsing, segmentation, and labeling tools. It ingests domain-specific documents in a range of formats, including PDF, Markdown, DOCX, EPUB, and plain text, then segments, cleans, and structures the content into datasets suited to downstream LLM training. The system includes automated question generation, hierarchical label trees, and answer-generation pipelines that call LLM APIs to produce question-answer pairs from customizable templates. Beyond dataset creation, Easy DataSet provides a built-in evaluation system with model testing and blind-test features, which helps teams validate model performance on curated test sets.
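To make the segmentation step concrete, here is a minimal Python sketch of one way a Markdown document could be split into chunks before question generation. It is an illustration only and not Easy DataSet's actual implementation: the function name `split_markdown` and the `max_chars` parameter are hypothetical, and the strategy (split at headings, then cap chunk length) is an assumption.

```python
import re


def split_markdown(text, max_chars=500):
    """Split Markdown into chunks at heading lines, then cap chunk size.

    Hypothetical helper for illustration; not part of Easy DataSet.
    """
    # Split just before every line that starts with 1-6 '#' and a space.
    sections = re.split(r"(?m)^(?=#{1,6} )", text)
    chunks = []
    for section in sections:
        section = section.strip()
        if not section:
            continue  # skip the empty piece produced before the first heading
        # Break oversized sections into fixed-size pieces.
        for start in range(0, len(section), max_chars):
            chunks.append(section[start:start + max_chars])
    return chunks
```

Each returned chunk could then be passed to an LLM prompt that asks for questions about that passage. A real pipeline would likely split on sentence or paragraph boundaries rather than fixed character offsets.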
Features
- Document ingestion and intelligent parsing (PDF, DOCX, and more)
- Automatic dataset generation for fine-tuning
- Question and answer generation using LLMs
- Built-in model evaluation and testing systems
- Multiple export formats (JSON/JSONL, Hugging Face)
- Support for diverse dataset types (dialogue, image QA)
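As an illustration of the JSONL export mentioned in the feature list, the sketch below serializes question-answer pairs as one JSON object per line. The `instruction`/`input`/`output` field names follow a common instruction-tuning layout and are an assumption here, not the tool's documented schema. The function name `to_jsonl` is also hypothetical.

```python
import json


def to_jsonl(pairs):
    """Serialize (question, answer) tuples as JSONL text.

    Field names are assumed for illustration, not Easy DataSet's schema.
    """
    lines = []
    for question, answer in pairs:
        record = {"instruction": question, "input": "", "output": answer}
        # ensure_ascii=False keeps non-ASCII text readable in the output.
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)
```

Writing the returned string to a `.jsonl` file gives a dataset that most fine-tuning tools can read line by line.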