muchori/terminal-bench

Python

Apache License 2.0

Terminal-Bench is a benchmark for testing AI agents in real terminal environments. It evaluates their ability to autonomously complete end-to-end tasks such as compiling code, training models, and setting up servers. It consists of a dataset of tasks and an execution harness that connects language models to a sandboxed terminal. It is intended for developers, researchers, and engineers who build or benchmark LLM agents. The project is currently in beta with ~100 tasks and aims to become a comprehensive testbed for AI agents in text-based environments.
