DocETL is an open-source system for building and running data processing pipelines powered by large language models, aimed at analyzing complex document collections and other unstructured datasets. Developers and researchers use it to construct structured workflows that extract, transform, and organize information from text-heavy sources such as reports, transcripts, and legal documents. Instead of relying on single prompts or ad-hoc scripts, DocETL provides a declarative pipeline framework that breaks a complex document analysis task into smaller operations, which the system can then optimize and orchestrate automatically. Pipelines are typically defined in a low-code YAML interface, which simplifies workflow creation while leaving users in full control of the prompts and processing steps.
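To make the declarative idea concrete, here is a minimal Python sketch of a pipeline described as plain data and executed by a small runner. It is not DocETL's actual schema or API: the `PIPELINE` layout, the `run_pipeline` helper, the `collect` key, and the `fake_llm` stand-in (used in place of a real model call) are all hypothetical names chosen for illustration.

```python
from typing import Callable, Dict, List

Doc = Dict[str, object]

# Hypothetical declarative spec: the steps are data, not code.
# This layout is an assumption for illustration, not DocETL's real format.
PIPELINE = {
    "operations": [
        {"name": "extract_topic", "type": "map",
         "prompt": "Name the main topic of: {text}", "output_key": "topic"},
        {"name": "topics_by_source", "type": "reduce",
         "group_by": "source", "collect": "topic", "output_key": "topics"},
    ]
}

def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call: returns the last word of the prompt."""
    return prompt.split()[-1]

def run_map(op: dict, docs: List[Doc], llm: Callable[[str], str]) -> List[Doc]:
    # Fill the prompt template from each document and store the model's answer.
    return [{**doc, op["output_key"]: llm(op["prompt"].format(**doc))} for doc in docs]

def run_reduce(op: dict, docs: List[Doc], llm: Callable[[str], str]) -> List[Doc]:
    # Group documents by a key and collect one field per group (no model call here).
    groups: Dict[object, list] = {}
    for doc in docs:
        groups.setdefault(doc[op["group_by"]], []).append(doc[op["collect"]])
    return [{op["group_by"]: key, op["output_key"]: values}
            for key, values in groups.items()]

RUNNERS = {"map": run_map, "reduce": run_reduce}

def run_pipeline(spec: dict, docs: List[Doc],
                 llm: Callable[[str], str] = fake_llm) -> List[Doc]:
    # Apply each declared operation in order, feeding its output to the next one.
    for op in spec["operations"]:
        docs = RUNNERS[op["type"]](op, docs, llm)
    return docs

DOCS = [
    {"source": "a", "text": "budget review"},
    {"source": "a", "text": "hiring plan"},
    {"source": "b", "text": "incident report"},
]

if __name__ == "__main__":
    print(run_pipeline(PIPELINE, DOCS))
```

Because the steps live in a data structure rather than in code, a system can inspect, reorder, or rewrite them before execution, which is the property that makes automatic optimization possible.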
Features
- Low-code YAML interface for defining document processing pipelines
- Specialized operators for entity resolution and contextual document analysis
- Agent-based optimization that improves pipeline accuracy and output quality
- Interactive development environment for experimenting with prompts and workflows
- Python package for running production pipelines via CLI or code
- Support for extracting structured data from large collections of unstructured documents
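As a rough illustration of what an entity resolution step does, the sketch below groups name variants that probably refer to the same entity. It uses plain string similarity from Python's standard library as a stand-in for the comparison step; it is not DocETL's operator, and the `resolve_entities` function, the `normalize` helper, and the 0.7 threshold are assumptions made for this example.

```python
import re
from difflib import SequenceMatcher
from typing import List

def normalize(name: str) -> str:
    """Lowercase and drop punctuation so trivial variants compare equal."""
    return re.sub(r"[^a-z0-9 ]", "", name.lower()).strip()

def resolve_entities(names: List[str], threshold: float = 0.7) -> List[List[str]]:
    """Greedily cluster names whose similarity to a cluster's first member
    meets the threshold. Hypothetical helper, for illustration only."""
    clusters: List[List[str]] = []
    for name in names:
        for cluster in clusters:
            representative = normalize(cluster[0])
            score = SequenceMatcher(None, representative, normalize(name)).ratio()
            if score >= threshold:
                cluster.append(name)
                break
        else:
            # No existing cluster was similar enough, so start a new one.
            clusters.append([name])
    return clusters

if __name__ == "__main__":
    print(resolve_entities(["Acme Corp", "ACME Corp.", "Globex", "Acme Corporation"]))
```

A string-similarity score is only a cheap proxy. A model-based comparison can also merge variants that share few characters, such as abbreviations and full names, which is where a purely lexical approach like this one falls short.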