Real PE workflows.
Real enterprise complexity.

We built PEBench to evaluate AI agents on real enterprise work: private equity workflows that take humans 20+ hours and produce genuine document deliverables. Each task is validated by 150+ verifiers, and every agent runs in its native, vendor-provided harness.
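As a concrete illustration, here is a minimal sketch of what a single verifier might look like, in Python. The file path, sheet name, cell, and expected value are all hypothetical; PEBench's actual verifier code is not shown on this page.

```python
from openpyxl import load_workbook

def verify_entry_multiple(workspace: str) -> bool:
    """One illustrative check out of a task's 150+: confirm the agent
    wrote the expected entry multiple into the deliverable spreadsheet.
    Sheet name, cell, and expected value are hypothetical."""
    wb = load_workbook(f"{workspace}/Spreadsheet.xlsx", data_only=True)
    ws = wb["LBO Model"]           # hypothetical sheet name
    return ws["C4"].value == 10.5  # hypothetical expected value
```

A task's score is then the fraction of such verifiers that pass.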

20+ hour tasks
Real PE workflows
Live filesystem
150+ verifiers/task
Claude Code · Codex · Gemini CLI
Leaderboard
# · Model · Score
Common Failure Patterns
Expand each pattern to inspect model-specific examples; collapsed rows summarize each pattern's overall footprint at a glance.
Model Comparison — All Tasks

Score: average of per-task verifier pass rates, with equal weight per task. Each model was evaluated on a single trajectory per task.
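In code, the aggregation is a plain unweighted mean over tasks; the function and variable names below are illustrative.

```python
from statistics import mean

def task_score(verifier_results: list[bool]) -> float:
    """Per-task score: fraction of that task's verifiers that passed."""
    return sum(verifier_results) / len(verifier_results)

def model_score(results_by_task: dict[str, list[bool]]) -> float:
    """Benchmark score: mean of per-task pass rates, equal weight per task."""
    return mean(task_score(r) for r in results_by_task.values())
```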


Per-task score: the average verifier pass rate for that task, based on a single trajectory per model.

Model Insights

Explore how each model approached the task in a representative run. View the agent's step-by-step actions, tool calls, and verifier results for the selected model.
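The records behind this view presumably look something like the following sketch; the schema is illustrative, not PEBench's published format.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    index: int
    tool: str    # shell command, file edit, or MCP tool name
    args: dict
    output: str

@dataclass
class Trajectory:
    model: str
    task_id: str
    steps: list[Step] = field(default_factory=list)
    verifier_results: dict[str, bool] = field(default_factory=dict)
```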

Task Lifecycle
The end-to-end pipeline for creating evaluation tasks, validating their verifiers, and generating trajectories, from raw data procurement through final deliverable review.
Roles: Procurement (Sourcers) · QA · Domain Experts (Experts 1, 2, 3) · Reviewers (Reviewers 1, 2)
Work types: Manual Work · QC Work · Automated Work

1. Procurement of Data: Sourcers assemble the Source Repository + Operational Artifacts, which then pass Repo + Artifacts QA (QA and Expert 1).
2. Task Creation: domain experts author the task; it proceeds if pass@k ≤ 50% (see the sketch after this list).
3. Task Iteration & QC: proceed when verifiers PASS against golden data.
4. Verifier Iteration & QC: proceed when verifiers PASS against ALL golden data.
5. Trajectory Generation + QC
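The pass@k gate in step 2 can be computed with the standard unbiased estimator; the 0.5 threshold below mirrors the diagram, and the surrounding harness code is assumed.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n trajectories with c successes, passes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def task_is_hard_enough(n: int, c: int, k: int) -> bool:
    """Gate after Task Creation: proceed only if pass@k <= 50%."""
    return pass_at_k(n, c, k) <= 0.5
```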
Shared Filesystem
- Docs MCP (.docx)
- Spreadsheet (.xlsx)
- PDF MCP (.pdf)
- Presentation (.ppt)
- Calendar (.ical)
- Mail MCP (.mbox)
- Chat MCP (.json)
- Filesystem MCP
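Assuming the reference MCP Python SDK, a format-specific server like the Mail MCP above could be sketched as follows; the server name, tool, and mbox path are invented for illustration.

```python
import mailbox
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("mail")  # hypothetical server name

@mcp.tool()
def list_subjects(path: str = "/workspace/Mail.mbox") -> list[str]:
    """Return the subject line of every message in the shared mailbox."""
    return [msg["subject"] or "" for msg in mailbox.mbox(path)]

if __name__ == "__main__":
    mcp.run()
```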
Container Runtime
- Dockerfile + Repo
- Tool Execution
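A minimal sketch of that runtime step, assuming a standard Docker CLI; the image tag and mount paths are hypothetical.

```python
import subprocess

IMAGE = "pebench-task"               # hypothetical image tag
WORKSPACE = "/srv/shared-workspace"  # hypothetical host path for the shared filesystem

# Build the task image from the Dockerfile + repo, then run tools inside it
# with the shared filesystem mounted at /workspace so MCP servers and the
# agent's shell see the same files.
subprocess.run(["docker", "build", "-t", IMAGE, "."], check=True)
subprocess.run(
    ["docker", "run", "--rm", "-v", f"{WORKSPACE}:/workspace", IMAGE],
    check=True,
)
```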