We built PEBench to evaluate AI agents on real enterprise work: private equity workflows that take humans 20+ hours and produce real document deliverables. Each task is validated by 150+ verifiers, and agents run in their native, vendor-provided harnesses.
| # | Model | Score |
|---|-------|-------|
Score — average of per-task verifier pass rates, with equal weight per task. Each model was evaluated on a single trajectory per task.
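The scoring rule above (per-task verifier pass rate, averaged with equal weight per task) can be sketched as follows. This is a minimal illustration, not PEBench's actual implementation; the function and variable names are hypothetical.

```python
def task_pass_rate(verifier_results):
    """Fraction of verifiers passed for one task (one trajectory)."""
    return sum(verifier_results) / len(verifier_results)

def leaderboard_score(tasks):
    """Average of per-task pass rates, equal weight per task."""
    rates = [task_pass_rate(v) for v in tasks.values()]
    return sum(rates) / len(rates)

# Illustrative data: task id -> per-verifier pass/fail from a single run.
results = {
    "task_a": [True, True, False, True],  # 3/4 verifiers passed -> 0.75
    "task_b": [True, False],              # 1/2 verifiers passed -> 0.50
}
print(leaderboard_score(results))  # 0.625
```

Note that equal weighting per task means a task with 150 verifiers counts the same as one with 10; only the pass *rate* within each task matters.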
Score — average verifier pass rate for this task, based on a single trajectory per model.
Explore how each model approached the task in a representative run. View the agent's step-by-step actions, tool calls, and verifier results for the selected model.