AgentEX is an open framework from Scale for building, running, and evaluating agentic workflows, with an emphasis on reproducibility and measurable outcomes rather than ad-hoc demos. It treats an “agent” as a composition of a policy (the LLM), tools, memory, and an execution runtime, so you can test the whole loop, not just prompting.

The repo focuses on structured experiments: standardized tasks, canonical tool interfaces, and logs that make it possible to compare models, prompts, and tool sets fairly. It also includes evaluation harnesses that capture success criteria and partial credit, plus traces you can inspect to understand where reasoning or tool use failed. The design keeps experiment configuration cleanly separated from code, which makes sharing results and re-running baselines straightforward. Teams use it to move from prototypes to production-ready agent behaviors by iterating on prompts, adding tools, and validating improvements with consistent metrics.
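To make the policy / tools / memory / runtime split concrete, here is a minimal, framework-agnostic sketch of that composition. The `Agent` class, its field names, and the `tool:` action convention below are illustrative assumptions for this README, not the AgentEX API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Agent:
    # Hypothetical names, not the AgentEX API: one policy, a set of named tools,
    # and a memory trace driven by a small runtime loop.
    policy: Callable[[str, List[str]], str]           # the LLM: maps task + memory to an action
    tools: Dict[str, Callable[[str], str]]            # named tools the policy can invoke
    memory: List[str] = field(default_factory=list)   # running trace of observations

    def run(self, task: str, max_steps: int = 5) -> List[str]:
        """Minimal runtime loop: ask the policy for an action, dispatch to a tool,
        record the observation, and stop once the policy answers directly."""
        for _ in range(max_steps):
            action = self.policy(task, self.memory)
            if action.startswith("tool:"):
                name, _, arg = action[len("tool:"):].partition(" ")
                tool = self.tools.get(name, lambda a: f"unknown tool: {name}")
                self.memory.append(f"{name}({arg}) -> {tool(arg)}")
            else:
                self.memory.append(f"answer: {action}")
                break
        return self.memory

# Toy usage: a scripted "policy" that calls a calculator tool once, then answers.
steps = iter(["tool:calc 2+2", "the answer is 4"])
agent = Agent(policy=lambda task, mem: next(steps),
              tools={"calc": lambda expr: str(eval(expr))})  # toy evaluator, demo only
print(agent.run("What is 2+2?"))
```

The point of the loop living in one place is that the same runtime can be exercised with different policies, tools, and prompts, which is what makes whole-loop comparisons meaningful.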
Features
- Pluggable tools and memory with a stable runtime for end-to-end agent execution
- Declarative experiment configs to reproduce runs across models and prompts (a config sketch follows this list)
- Built-in evaluators and task suites to benchmark agent behavior
- Rich tracing and logging for step-by-step debugging and error analysis
- Multi-model support to compare providers and settings with the same tasks
- CLI and Python APIs so you can script workflows or integrate with CI
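As a rough illustration of the config/code separation these features rely on, the snippet below writes a declarative experiment description to disk and reloads it for a repeat run. The keys, file paths, and values are hypothetical placeholders, not the AgentEX config schema, CLI, or Python API.

```python
import json
from pathlib import Path

# Hypothetical experiment description: everything a re-run needs lives in data,
# not in code, so baselines can be shared and repeated exactly.
experiment = {
    "name": "web-qa-baseline",
    "model": "provider/model-name",        # swap providers/settings without code changes
    "prompt_file": "prompts/web_qa.txt",   # prompt kept outside the codebase
    "tools": ["search", "calculator"],     # canonical tool names
    "task_suite": "tasks/web_qa.json",     # standardized tasks to benchmark against
    "metrics": ["success_rate", "partial_credit"],
    "seed": 42,                            # fixed seed for reproducible re-runs
}

config_path = Path("experiments") / f"{experiment['name']}.json"
config_path.parent.mkdir(parents=True, exist_ok=True)
config_path.write_text(json.dumps(experiment, indent=2))

# A teammate or CI job can later reload the same file and hand it to whatever
# runner the framework provides (the runner call itself is omitted here).
reloaded = json.loads(config_path.read_text())
assert reloaded == experiment
```

Keeping the run description in a plain file like this is what lets the same experiment be scripted from the CLI, driven from Python, or replayed in CI with consistent metrics.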