Real PE workflows.
Real enterprise complexity.

We built PEBench to evaluate AI agents on real enterprise work: private equity workflows that take humans 20+ hours and produce genuine document deliverables. Each task is validated by 150+ verifiers, and every agent runs in its native, vendor-provided harness.
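As a concrete illustration, here is a minimal sketch of what a single verifier might look like, in Python. The file path, sheet name, cell, and expected value are all hypothetical; PEBench's actual verifier code is not shown on this page.

```python
from openpyxl import load_workbook

def verify_entry_multiple(workspace: str) -> bool:
    """One illustrative check out of a task's 150+: confirm the agent
    wrote the expected entry multiple into the deliverable spreadsheet.
    Sheet name, cell, and expected value are hypothetical."""
    wb = load_workbook(f"{workspace}/Spreadsheet.xlsx", data_only=True)
    ws = wb["LBO Model"]           # hypothetical sheet name
    return ws["C4"].value == 10.5  # hypothetical expected value
```

A task's score is then the fraction of such verifiers that pass.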

20+ hour tasks
Real PE workflows
Live filesystem
150+ verifiers/task
Claude Code · Codex · Gemini CLI
Leaderboard
# · Model · Score
Common Failure Patterns
Expand each pattern to inspect model-specific examples; collapsed rows summarize each pattern's overall footprint at a glance.
Model Comparison — All Tasks

Score: average of per-task verifier pass rates, with equal weight per task. Each model was evaluated on a single trajectory per task.
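In code, the aggregation is a plain unweighted mean over tasks; the function and variable names below are illustrative.

```python
from statistics import mean

def task_score(verifier_results: list[bool]) -> float:
    """Per-task score: fraction of that task's verifiers that passed."""
    return sum(verifier_results) / len(verifier_results)

def model_score(results_by_task: dict[str, list[bool]]) -> float:
    """Benchmark score: mean of per-task pass rates, equal weight per task."""
    return mean(task_score(r) for r in results_by_task.values())
```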


Per-task score: the average verifier pass rate for that task, based on a single trajectory per model.

Model Insights

Explore how each model approached the task in a representative run. View the agent's step-by-step actions, tool calls, and verifier results for the selected model.
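The records behind this view presumably look something like the following sketch; the schema is illustrative, not PEBench's published format.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    index: int
    tool: str    # shell command, file edit, or MCP tool name
    args: dict
    output: str

@dataclass
class Trajectory:
    model: str
    task_id: str
    steps: list[Step] = field(default_factory=list)
    verifier_results: dict[str, bool] = field(default_factory=dict)
```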

Task Lifecycle
The end-to-end pipeline for creating evaluation tasks, validating their verifiers, and generating trajectories, from raw data procurement through final deliverable review.
Roles: Procurement (Sourcers) · QA · Domain Experts (Experts 1, 2, 3) · Reviewers (Reviewers 1, 2)
Work types: Manual Work · QC Work · Automated Work

1. Procurement of Data: Sourcers assemble the Source Repository + Operational Artifacts, which then pass Repo + Artifacts QA (QA and Expert 1).
2. Task Creation: domain experts author the task; it proceeds if pass@k ≤ 50% (see the sketch after this list).
3. Task Iteration & QC: proceed when verifiers PASS against golden data.
4. Verifier Iteration & QC: proceed when verifiers PASS against ALL golden data.
5. Trajectory Generation + QC
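The pass@k gate in step 2 can be computed with the standard unbiased estimator; the 0.5 threshold below mirrors the diagram, and the surrounding harness code is assumed.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n trajectories with c successes, passes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def task_is_hard_enough(n: int, c: int, k: int) -> bool:
    """Gate after Task Creation: proceed only if pass@k <= 50%."""
    return pass_at_k(n, c, k) <= 0.5
```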
Shared Filesystem
- Docs MCP (.docx)
- Spreadsheet (.xlsx)
- PDF MCP (.pdf)
- Presentation (.ppt)
- Calendar (.ical)
- Mail MCP (.mbox)
- Chat MCP (.json)
- Filesystem MCP
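Assuming the reference MCP Python SDK, a format-specific server like the Mail MCP above could be sketched as follows; the server name, tool, and mbox path are invented for illustration.

```python
import mailbox
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("mail")  # hypothetical server name

@mcp.tool()
def list_subjects(path: str = "/workspace/Mail.mbox") -> list[str]:
    """Return the subject line of every message in the shared mailbox."""
    return [msg["subject"] or "" for msg in mailbox.mbox(path)]

if __name__ == "__main__":
    mcp.run()
```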
Container Runtime
- Dockerfile + Repo
- Tool Execution
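A minimal sketch of that runtime step, assuming a standard Docker CLI; the image tag and mount paths are hypothetical.

```python
import subprocess

IMAGE = "pebench-task"               # hypothetical image tag
WORKSPACE = "/srv/shared-workspace"  # hypothetical host path for the shared filesystem

# Build the task image from the Dockerfile + repo, then run tools inside it
# with the shared filesystem mounted at /workspace so MCP servers and the
# agent's shell see the same files.
subprocess.run(["docker", "build", "-t", IMAGE, "."], check=True)
subprocess.run(
    ["docker", "run", "--rm", "-v", f"{WORKSPACE}:/workspace", IMAGE],
    check=True,
)
```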