Why Your SWE-Bench Verified Score Is Already Obsolete

SWE-Bench Verified vs SWE-Bench Pro SEAL leaderboard comparison 2026

If your vendor's last pitch deck quoted an 80%+ SWE-Bench Verified score as proof their coding model is "production-ready," you were sold a number OpenAI itself stopped reporting in February 2026. The leaderboard that procurement teams trusted for two years is now, by the publisher's own admission, contaminated, saturated, and measuring memorization more than capability.

The replacement — SWE-Bench Pro on the Scale SEAL leaderboard — tells a starkly different story. Same models. Same week. A 20-to-35 point honesty gap. This sub-page unpacks exactly what changed, which scores still mean something, and how to write that distinction into your 2026 procurement contracts. For the full benchmark landscape, see our parent guide to the AI coding benchmarks leaderboard 2026, which maps every metric in this audit to the procurement decision it should — and shouldn't — drive.

  • SWE-Bench Verified is officially obsolete for frontier measurement. OpenAI's February 23, 2026 disclosure found 59.4% of audited failed tasks contained flawed tests, and confirmed every frontier model showed training-data contamination.
  • SWE-Bench Pro is the procurement-grade replacement, with 1,865 multi-language tasks across 41 repositories, 250-turn limits, and standardized SEAL scaffolding that isolates raw model capability.
  • The 2026 leader gap is now visible. Claude Opus 4.7 tops the vendor-reported SWE-Bench Pro numbers at 64.3% (Anthropic-reported); on the strictly standardized Scale SEAL board, Claude Opus 4.5 posts 45.9%.
  • The 20-35 point Verified-to-Pro drop is the most important number on your evaluation spreadsheet — it tells you how much of a vendor's reported score is real capability versus benchmark engineering.
  • Agent scaffolding can swing the same model by 10+ points. Comparing two "Claude Opus" rows without checking the harness is a procurement error, not a data point.

The Headline Most Buyers Still Don't Know: OpenAI Walked Away From Its Own Benchmark

On February 23, 2026, OpenAI published a post titled "Why SWE-bench Verified no longer measures frontier coding capabilities." It was a quiet detonation.

OpenAI's Frontier Evals team had audited 138 tasks — 27.6% of SWE-Bench Verified's 500-problem dataset — that its o3 model couldn't consistently solve across 64 independent runs. The verdict, after six engineers independently reviewed each case, was brutal: 59.4% of those audited problems contained flawed test cases.

Specifically:

  • 35.5% of failed tasks had "narrow" tests that enforced specific implementation details never mentioned in the problem statement — rejecting functionally correct submissions.
  • 18.8% had "broad" tests checking unrelated functionality scraped from pull request diffs.
  • Frontier models — including GPT-5.2, Claude Opus 4.5, and Gemini 3 Flash — could reproduce gold patches and verbatim problem statements, indicating training-data exposure.
"Improvements on SWE-bench Verified no longer reflect meaningful improvements in models' real-world software development abilities."

OpenAI's conclusion was not subtle. They stopped reporting Verified scores and recommended the industry move to SWE-Bench Pro.

For an even deeper look at how this contamination flowed through every major leaderboard, see our companion sub-page on the SWE-Bench training data leakage disclosure.

Why a Saturated Benchmark Is a Dangerous Benchmark

Top scores on Verified climbed from 74.9% to 80.9% in six months, then progress flattened. That ceiling wasn't a ceiling on model capability. It was a ceiling set by how many of the benchmark's own tasks were clean enough to be solved at all.

When the test is broken, every additional point of "progress" is just better pattern-matching against artifacts of the test.

For a procurement officer, this is the difference between a vendor claiming 87% accuracy and a vendor claiming 87% recognition of a memorized exam. Same number. Entirely different product.

SWE-Bench Verified vs SWE-Bench Pro: The Procurement-Grade Comparison

The two benchmarks share a name and almost nothing else. Here's the spec sheet that matters for your evaluation matrix:

| Dimension | SWE-Bench Verified | SWE-Bench Pro |
|---|---|---|
| Maintainer | Princeton (original authors) | Scale AI SEAL Lab |
| Task count | 500 | 1,865 (731 public split) |
| Languages | Python only | Python, Go, TypeScript, JavaScript |
| Repositories | 12 popular OSS libraries | 41 actively maintained repos + 18 proprietary startup codebases |
| Avg lines changed | ~15 | 107 lines across 4.1 files |
| Scaffold | Vendor-chosen | Standardized SEAL (250-turn limit) |
| Contamination risk | High (publicly disclosed) | Low (private holdout subset) |
| Status (May 2026) | Deprecated by OpenAI | Industry-recommended replacement |

The single most important row is the bottom one. A model scoring 80% on Verified can score 23% on Pro. That's not noise; that's the gap between memorizing a textbook and writing original code.

How SWE-Bench Pro Was Engineered to Resist Cheating

Scale AI's SEAL (Scale's Evaluation and Assessment Lab) didn't just refresh the dataset — they redesigned the evaluation contract:

  • Private holdout split. 18 proprietary codebases were licensed from startups under NDA. Models cannot have seen these during training. Performance on the private subset drops further still — GPT-5 fell from 23.1% to 14.9%, and Claude Opus 4.1 from 22.7% to 17.8%.
  • GPL-licensed public split. 11 repositories are openly available on HuggingFace, but selected from commit histories after major model training cutoffs.
  • Standardized scaffolding. Every model runs through the same mini-swe-agent harness with identical tool access and a 250-turn cap. This neutralizes the "we built a better agent loop" excuse; a minimal sketch of what that turn budget means in practice follows this list.
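
To make the 250-turn budget concrete, here is a minimal sketch of what a turn-capped evaluation loop looks like, assuming a generic tool-calling interface. The names used here (run_task, call_model, apply_tool_call, tests_pass) are illustrative placeholders, not the actual mini-swe-agent API.

```python
from dataclasses import dataclass

MAX_TURNS = 250  # the uniform cap SEAL applies to every model

@dataclass
class TaskResult:
    resolved: bool
    turns_used: int

def run_task(task, call_model, apply_tool_call, tests_pass) -> TaskResult:
    """Drive one model against one task under an identical turn budget.
    call_model / apply_tool_call / tests_pass are hypothetical stand-ins for
    the model endpoint, the sandbox executor, and the test runner."""
    transcript = [{"role": "user", "content": task.problem_statement}]
    for turn in range(1, MAX_TURNS + 1):
        action = call_model(transcript)        # model proposes an edit, command, or submission
        if action.get("type") == "submit":     # model declares the patch finished
            return TaskResult(resolved=tests_pass(task), turns_used=turn)
        observation = apply_tool_call(action)  # harness executes the step in the sandbox
        transcript.append({"role": "tool", "content": observation})
    # Budget exhausted: the task is scored as unresolved regardless of partial progress.
    return TaskResult(resolved=False, turns_used=MAX_TURNS)
```

The point of the fixed budget is that no lab can quietly buy extra attempts: when two models appear on the standardized board, they worked under the same ceiling of tool calls.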

For enterprises, the private subset is the killer feature. Your codebase is not Django. Your bugs are not in scikit-learn's issue tracker. Pro's private split is the closest a public benchmark gets to simulating that reality.

The SEAL Leaderboard in May 2026: Who Actually Leads

This is where the contrarian story gets sharper. The numbers vendors quote in their sales decks rarely match the numbers Scale AI publishes on the standardized SEAL board. Both can be technically true. Only one is comparable.

The Three Layers of "Leading" on SWE-Bench Pro

Buyers must distinguish three different score columns, because they measure three different things:

1. SEAL Standardized Scaffold (the apples-to-apples board)

This is the strictest comparison — every model runs the same mini-swe-agent v2 harness with a 250-turn limit. As of May 2026:

  • GPT-5.4 (xHigh) — 59.1%
  • GPT-5.3-Codex — 56.8%
  • Muse Spark (Meta) — 55.0%
  • Claude Opus 4.6 — 51.9% (Scale-run, mini-swe-agent harness)
  • Gemini 3.1 Pro — 46.1%
  • Claude Opus 4.5 — 45.9% (Scale-run, standardized)

2. Agent-System Scores (vendor's own scaffold)

Same model, custom harness. These scores aren't directly comparable but show the ceiling each lab can engineer toward:

  • Claude Opus 4.7 — 64.3% (Anthropic-reported)
  • Claude Opus 4.6 + WarpGrep v2 — 57.5% (Morph internal)
  • GPT-5.3-Codex CLI — 56.8%

3. SWE-Bench Verified (deprecated but still quoted)

For context only. Treat as a directional signal, not a measurement:

  • GPT-5.5 — 88.7% (OpenAI-reported)
  • Claude Opus 4.7 — 87.6%
  • GPT-5.3-Codex — 85.0%
  • Claude Opus 4.5 — 80.9%, Opus 4.6 — 80.8%, Gemini 3.1 Pro — 80.6%

The key procurement insight: the same Claude Opus 4.5 scores 80.9% on Verified, 45.9% on the SEAL standardized board, and somewhere in between with custom scaffolding. A 35-point spread on identical model weights. That spread is what your contract language must address.

The Scaffolding Gap Most Buyers Miss

Three different agent systems ran Claude Opus 4.5 against the SWE-Bench Pro public set. Scores ranged from 50.2% to 55.4%. That five-point spread came entirely from how the agent managed context windows, tool calls, and retrieval — not from the model.

Translation: half of "model quality" in production is actually scaffolding quality. When a vendor quotes their highest-scaffolded number against a competitor's standardized number, the comparison is sales theater, not data.

How SWE-Bench Pro Measures What Verified Couldn't

The original benchmark asked a simple question: can a model patch a known bug in a popular library? Pro asks something harder and closer to the work your engineers actually do:

  • Multi-file changes. Average task touches 4.1 files — closer to a real PR than a one-line fix.
  • Longer-horizon reasoning. 107-line average diff means context management matters as much as code generation.
  • Cross-language depth. Go, TypeScript, and JavaScript expose models that over-trained on Python.
  • Real commit pairs. Tasks are sourced from consecutive commits where one resolves a bug or adds a feature, paired with the actual test the developer wrote.

This last point is subtle but important. Verified used curated, polished test cases. Pro uses the test the original engineer wrote when they shipped the fix. That test is messier, more idiomatic, and far less gameable.
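
To make the commit-pair sourcing concrete, here is a hypothetical sketch of the fields such a task record would carry and how resolution is judged. The field names echo general SWE-Bench conventions (fail-to-pass and pass-to-pass tests) but are not Pro's published schema.

```python
from dataclasses import dataclass, field

@dataclass
class CommitPairTask:
    repo: str                    # one of the 41 source repositories
    language: str                # Python, Go, TypeScript, or JavaScript
    base_commit: str             # repository state before the developer's fix
    problem_statement: str       # the issue or feature request handed to the model
    fail_to_pass: list[str] = field(default_factory=list)  # developer-written tests that must flip to green
    pass_to_pass: list[str] = field(default_factory=list)  # existing tests that must stay green

def is_resolved(task: CommitPairTask, failing_after_patch: set[str]) -> bool:
    """A patch counts only if every target test now passes and nothing regressed."""
    target_fixed = not any(t in failing_after_patch for t in task.fail_to_pass)
    no_regression = not any(t in failing_after_patch for t in task.pass_to_pass)
    return target_fixed and no_regression
```

Because the pass-to-pass set comes from the repository's real suite, a patch that fixes the target bug but breaks an adjacent file still scores as a failure.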

The result is a benchmark that finally penalizes the failure modes that hurt enterprises most: forgotten edge cases, broken imports, regressions in adjacent files, and the dreaded "works in isolation, breaks in the suite."

What Procurement Teams Should Actually Demand in 2026 Contracts

If you're signing or renewing a coding-model contract in the next two quarters, the SEAL data forces three new clauses into your RFP.

1. Require the SEAL standardized score, not the vendor's agent-system score.

Ask explicitly for the most recent Scale AI SEAL mini-swe-agent v2 result on the SWE-Bench Pro public split. If the vendor can only produce an agent-system score, treat it as marketing collateral, not performance data.
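
If your evaluation matrix lives in a script or spreadsheet export, record the harness alongside every quoted number and refuse comparisons across mismatched scaffolds. The sketch below is one hypothetical way to enforce that rule; the field names are illustrative, and the example figures come from this article, not from any official schema.

```python
from dataclasses import dataclass

STANDARD_HARNESS = "mini-swe-agent v2"  # the SEAL scaffold referenced above

@dataclass(frozen=True)
class BenchmarkClaim:
    model: str
    benchmark: str     # e.g. "SWE-Bench Pro (public split)"
    score: float       # percentage of tasks resolved
    harness: str       # scaffold the score was produced under
    reported_by: str   # "Scale SEAL" vs. the vendor itself

def comparable(a: BenchmarkClaim, b: BenchmarkClaim) -> bool:
    """Two claims are apples-to-apples only on the same benchmark and the standardized harness."""
    return a.benchmark == b.benchmark and a.harness == b.harness == STANDARD_HARNESS

# A vendor agent-system score placed next to a competitor's SEAL score
# should be rejected from the matrix, not averaged into it.
vendor_claim = BenchmarkClaim("Vendor model", "SWE-Bench Pro (public split)", 64.3, "vendor agent system", "vendor")
seal_claim = BenchmarkClaim("Competitor model", "SWE-Bench Pro (public split)", 45.9, STANDARD_HARNESS, "Scale SEAL")
assert not comparable(vendor_claim, seal_claim)
```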

2. Demand the Verified-to-Pro delta.

Any model with a delta wider than 30 points is showing benchmark engineering, not capability. The deltas to track (a minimal scripted version of this check follows the list):

  • Claude Opus 4.5: 80.9% → 45.9% (35-point drop)
  • GPT-5.2 / GPT-5.3-Codex: ~85% → 56–57% (~28-point drop)
  • Gemini 3.1 Pro: 80.6% → 46.1% (34-point drop)

Smaller deltas across the field would indicate a maturing market. We're not there yet.
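
The check itself is trivial to script once both numbers sit on the same row. A minimal version, hard-coding the scores quoted in this article and the 30-point rule of thumb stated above, looks like this:

```python
RED_FLAG_DELTA = 30.0  # the rule-of-thumb threshold from this section, not an official cutoff

scores = {
    # model: (SWE-Bench Verified %, SEAL standardized SWE-Bench Pro %)
    "Claude Opus 4.5": (80.9, 45.9),
    "GPT-5.3-Codex":   (85.0, 56.8),
    "Gemini 3.1 Pro":  (80.6, 46.1),
}

for model, (verified, pro) in scores.items():
    delta = verified - pro
    verdict = "FLAG: likely benchmark engineering" if delta > RED_FLAG_DELTA else "within tolerance"
    print(f"{model}: {verified:.1f}% -> {pro:.1f}% (delta {delta:.1f} pts) {verdict}")
```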

3. Write a "leaderboard shift" clause.

SEAL refreshes monthly. Any 12-month contract signed today will outlast at least two ranking shuffles. Bake in a 90-day re-evaluation right and a tiered SLA tied to a specific benchmark score band, not a vendor name.
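
One hedged way to express the score-band idea in your own tooling is sketched below; the band boundaries and tier names are placeholders to be replaced with whatever your negotiation lands on, not recommended values.

```python
# Hypothetical score-band SLA mapping for a "leaderboard shift" clause.
SLA_BANDS = [
    (55.0, "Tier 1: full rate, no remedy owed"),
    (45.0, "Tier 2: service credits apply"),
    (0.0,  "Tier 3: 90-day re-evaluation / exit right triggers"),
]

def sla_tier(seal_pro_score: float) -> str:
    """Map the latest SEAL standardized SWE-Bench Pro score to a contract tier."""
    for floor, tier in SLA_BANDS:
        if seal_pro_score >= floor:
            return tier
    return SLA_BANDS[-1][1]

# Example: a model that slips from 51.9% to 44.0% after a monthly refresh
# drops from Tier 2 into Tier 3 and triggers the re-evaluation right.
print(sla_tier(51.9))  # Tier 2: service credits apply
print(sla_tier(44.0))  # Tier 3: 90-day re-evaluation / exit right triggers
```

The tier is keyed to a score band on a named benchmark and harness, not to a model name, so a vendor swapping in a newer checkpoint neither voids nor automatically satisfies the clause.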

The Bottom Line for 2026 Buyers

The SWE-Bench Verified score on your vendor's slide deck is no longer evidence. It is, at best, a directional artifact from a deprecated test — and at worst, a memorized answer to a contaminated exam.

The SWE-Bench Pro SEAL leaderboard is the replacement, and the 20-35 point delta between the two is now the most informative number in any AI coding RFP.

For procurement, the action is simple: rewrite your evidence standard before your next renewal. Demand SEAL standardized scores. Demand the Verified-to-Pro delta. Demand a pilot on your private code. Treat anything else as vendor narrative.


About the Author: Sanjay Saini

Sanjay Saini is an Enterprise AI Strategy Director specializing in digital transformation and AI ROI models. He covers high-stakes news at the intersection of leadership and sovereign AI infrastructure.


Frequently Asked Questions (FAQ)

What is the difference between SWE-Bench Verified and SWE-Bench Pro?

SWE-Bench Verified is a 500-task, Python-only benchmark that OpenAI deprecated in February 2026 after finding 59.4% of audited failures had flawed tests and widespread training contamination. SWE-Bench Pro is Scale SEAL's replacement: 1,865 multi-language tasks across 41 repositories, with standardized scaffolding and a private holdout subset.

Why did OpenAI stop relying on SWE-Bench Verified for internal evaluation?

OpenAI's Frontier Evals team audited 138 failed tasks and found 59.4% had broken test cases — either too narrow (enforcing undocumented implementation details) or too broad (testing unmentioned features). They also confirmed GPT-5.2, Claude Opus 4.5, and Gemini 3 Flash showed verbatim training-data exposure to benchmark solutions, making scores meaningless as a capability signal.

What is the SEAL leaderboard and who maintains it?

SEAL stands for Scale's Evaluation and Assessment Lab, the research arm of Scale AI. SEAL maintains the standardized SWE-Bench Pro leaderboard, running every submitted model through identical mini-swe-agent v2 scaffolding with a 250-turn limit. This isolates raw model capability from agent-engineering quality, making it the closest thing to an apples-to-apples 2026 benchmark.

Which model leads SWE-Bench Pro on the SEAL leaderboard in 2026?

As of May 2026, on the standardized SEAL scaffold, GPT-5.4 (xHigh) posts the top score at 59.1%, followed by GPT-5.3-Codex at 56.8%, Claude Opus 4.6 at 51.9%, and Claude Opus 4.5 at 45.9%. With Anthropic's own scaffolding, Claude Opus 4.7 reaches 64.3%. Rankings refresh monthly and remain volatile.

How does SWE-Bench measure agent capability versus base model?

SWE-Bench scores reflect two stacked variables: the underlying model and the agent scaffold wrapping it (tool access, context retrieval, turn limits). Three teams ran identical Claude Opus 4.5 against Pro and scored 50.2% to 55.4% — a five-point spread from scaffolding alone. SEAL's standardized harness strips scaffolding effects so the base model is what's measured.

What does 'training data contamination' mean in SWE-Bench?

Contamination means the model saw the benchmark's problems or solutions during pretraining. Because SWE-Bench Verified pulls from public GitHub issues, every frontier model likely ingested at least some gold patches. OpenAI confirmed GPT-5.2, Claude Opus 4.5, and Gemini 3 Flash could reproduce verbatim solutions — turning the score into a memory test rather than a reasoning test.

How do agent-system scores differ from SEAL standardized scores?

Agent-system scores come from a vendor's own custom scaffold — proprietary tools, retrieval, and context management optimized for the benchmark. SEAL standardized scores use one common harness (mini-swe-agent v2, 250-turn cap) across all models. Agent-system scores are higher but not comparable across vendors; SEAL scores are lower but the only fair head-to-head.

Should enterprises trust SWE-Bench for procurement decisions?

Trust SWE-Bench Pro SEAL scores as one input among several, never as a single source of truth. Pair them with the Verified-to-Pro delta, the private-subset performance drop, and a pilot on your own codebase. Treat SWE-Bench Verified scores as deprecated context — useful for trend analysis but unfit for 2026 procurement weighting.

How frequently does the SEAL leaderboard refresh?

The Scale SEAL leaderboard refreshes roughly monthly, with new model entries added as labs submit and Scale completes standardized runs. Major shifts have happened almost every cycle in 2026 — Claude Opus 4.7, GPT-5.5, and Muse Spark all entered within six weeks of each other. Bake monthly monitoring into your vendor governance cadence.

What scaffold is used to evaluate models on SWE-Bench Pro?

The default SEAL evaluation scaffold is mini-swe-agent v2, a minimalist harness designed by the SWE-Bench team to standardize tool access, file editing, and command execution. It enforces a 250-turn limit while leaving token spend uncapped, isolating model reasoning from agent engineering. Vendor-custom scaffolds (Claude Code, Codex CLI, Junie, ForgeCode) are reported separately and not directly comparable.

Sources & References