LMArena Coding Leaderboard 2026: Why GPT-5.2-Codex Beats Claude
- GPT-5.2-codex sits at #1 on the LMArena Code leaderboard as of April 2026, after entering on January 23, 2026 and consolidating a top-3 coding position within 90 days.
- Claude Opus 4.6 holds #2 on Code but ranks ahead of GPT-5.2-codex on Aider's Polyglot benchmark — the two leaderboards measure different things.
- The WebDev arena is a separate leaderboard from Code; GLM-4.7 became the first open-weight model to enter top-10 on both Text and WebDev simultaneously in late March 2026.
- LMArena Code does not test agentic multi-file editing — for that, Polyglot and SWE-Bench Verified are better predictors of production behavior.
- The gap between the proprietary and open-weight coding leaders has narrowed to roughly 60 Elo points (1521 for GPT-5.2-codex vs. 1462 for GLM-4.7 in the table below); for workloads above 200M tokens/month, self-hosting GLM-4.7 often wins on TCO.
The LMArena (formerly LMSYS Chatbot Arena) coding leaderboard is the single most consulted document in modern engineering procurement, and the most misread. The headline ranking says GPT-5.2-codex leads. Aider's Polyglot benchmark, which measures agentic multi-file editing, frequently disagrees and ranks Claude Opus 4.6 ahead. Both are right: they measure different things. This page is a comparison-table-first reference covering the rankings, the cross-leaderboard contradictions, and the procurement framework that resolves them.
LMArena Code Leaderboard — Top 10 (April 2026)
Snapshot freshness: updated weekly. Elo scores rounded; ± values denote 95% confidence interval.
| Rank | Model | Provider | Code Elo | 95% CI | Best for |
|---|---|---|---|---|---|
| 1 | GPT-5.2-codex | OpenAI | 1521 | ±6 | Single-file generation, structured programming |
| 2 | Claude Opus 4.6 | Anthropic | 1517 | ±5 | Long-context refactoring, architectural reasoning |
| 3 | Claude Opus 4.6 Thinking | Anthropic | 1512 | ±5 | Multi-step debugging, complex algorithm design |
| 4 | Gemini 3 Pro | Google | 1497 | ±4 | Cross-language translation, polyglot codebases |
| 5 | Grok 4.20-beta1 | xAI | 1485 | ±8 | Real-time API integration, live documentation |
| 6 | GPT-5.2 | OpenAI | 1481 | ±4 | General-purpose chat-style coding |
| 7 | Gemini 3 Flash | Google | 1473 | ±4 | Cost-optimized routine coding tasks |
| 8 | GLM-4.7 | Open-weight | 1462 | ±7 | Top-ranked open-weight; full self-host |
| 9 | DeepSeek-V4 | Open-weight | 1455 | ±6 | Token economics; inference cost-leader |
| 10 | Qwen 3.5-Coder | Open-weight | 1448 | ±8 | Asia-region deployment, multilingual code |
Source: LMArena Code leaderboard via arena-ai-leaderboards JSON feed; cross-referenced with the official LMArena Changelog. Always verify before procurement.
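Because several of these scores sit within a few points of each other, it helps to check which adjacent ranks are actually distinguishable. The sketch below is one rough way to do that, using the Elo scores and CI half-widths from the table. The function names are illustrative, and "overlapping intervals" is a conservative rule of thumb rather than a formal significance test. Since the published scores are rounded, intervals that merely touch are counted as overlapping.

```python
# Elo scores and 95% CI half-widths copied from the April 2026 table above.
LEADERBOARD = [
    ("GPT-5.2-codex", 1521, 6),
    ("Claude Opus 4.6", 1517, 5),
    ("Claude Opus 4.6 Thinking", 1512, 5),
    ("Gemini 3 Pro", 1497, 4),
    ("Grok 4.20-beta1", 1485, 8),
    ("GPT-5.2", 1481, 4),
    ("Gemini 3 Flash", 1473, 4),
    ("GLM-4.7", 1462, 7),
    ("DeepSeek-V4", 1455, 6),
    ("Qwen 3.5-Coder", 1448, 8),
]


def intervals_overlap(elo_a, ci_a, elo_b, ci_b):
    """Return True when [elo - ci, elo + ci] for the two models intersect or touch."""
    return abs(elo_a - elo_b) <= ci_a + ci_b


def adjacent_overlaps(rows):
    """Compare each model with the one ranked directly below it."""
    results = []
    for (name_a, elo_a, ci_a), (name_b, elo_b, ci_b) in zip(rows, rows[1:]):
        results.append((name_a, name_b, intervals_overlap(elo_a, ci_a, elo_b, ci_b)))
    return results


if __name__ == "__main__":
    for upper, lower, overlap in adjacent_overlaps(LEADERBOARD):
        verdict = "overlap (treat as a tie)" if overlap else "separated"
        print(f"{upper} vs {lower}: {verdict}")
```

On this snapshot, the only adjacent pair that comes out clearly separated is Claude Opus 4.6 Thinking versus Gemini 3 Pro; every other neighboring pair overlaps or touches, which is a reason to read the table as tiers rather than strict ranks.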
⚠ The Cross-Leaderboard Contradiction
LMArena Code says GPT-5.2-codex leads. Aider's Polyglot benchmark says Claude Opus 4.6 leads.
This is not a measurement error — the two leaderboards test different workloads. LMArena Code measures human preference on chat-style coding tasks (single-turn, IDE-assist style). Aider's Polyglot measures autonomous multi-file editing across six programming languages — closer to how production agentic coding tools (Aider, Cursor, Cline, Devin) actually behave.
For chat-style coding and pair-programming, trust the LMArena ranking. For autonomous agents that edit multiple files in a loop, trust Polyglot. Use both before procurement.
How LMArena Code, Polyglot, and SWE-Bench Verified Disagree
Three coding leaderboards matter for engineering procurement in 2026, and they regularly rank the same models in different orders. Here is how five leading models stack up across all three:
| Model | LMArena Code | Aider Polyglot | SWE-Bench Verified |
|---|---|---|---|
| GPT-5.2-codex | #1 | #3 | #2 |
| Claude Opus 4.6 | #2 | #1 | #1 |
| Claude Opus 4.6 Thinking | #3 | #2 | #3 |
| Gemini 3 Pro | #4 | #5 | #4 |
| GLM-4.7 (open-weight) | #8 | #6 | #7 |
Polyglot and SWE-Bench Verified rankings are illustrative based on most recent public reports; verify against the live Aider Polyglot and SWE-Bench dashboards.
Three patterns matter. First, Claude Opus 4.6 dominates the agentic benchmarks (Polyglot, SWE-Bench) where multi-file editing and tool-use are tested. Second, GPT-5.2-codex dominates LMArena Code where single-turn chat-coding preference is tested. Third, GLM-4.7 punches above its LMArena rank on Polyglot — meaning open-weight is closer to viable for agentic coding than the headline Code rank suggests.
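One way to make these patterns concrete is to compute, for each model, how far its LMArena Code rank sits from its average rank on the two agentic benchmarks. The sketch below uses the illustrative ranks from the table above; the `agentic_shift` name and the choice of a simple mean are assumptions made for this example, not an established metric.

```python
# Ranks copied from the comparison table above (illustrative per the source).
RANKS = {
    "GPT-5.2-codex": {"lmarena_code": 1, "aider_polyglot": 3, "swe_bench_verified": 2},
    "Claude Opus 4.6": {"lmarena_code": 2, "aider_polyglot": 1, "swe_bench_verified": 1},
    "Claude Opus 4.6 Thinking": {"lmarena_code": 3, "aider_polyglot": 2, "swe_bench_verified": 3},
    "Gemini 3 Pro": {"lmarena_code": 4, "aider_polyglot": 5, "swe_bench_verified": 4},
    "GLM-4.7": {"lmarena_code": 8, "aider_polyglot": 6, "swe_bench_verified": 7},
}

AGENTIC_BENCHMARKS = ("aider_polyglot", "swe_bench_verified")


def agentic_shift(ranks):
    """LMArena Code rank minus mean agentic rank.

    Positive: the model places better on agentic benchmarks than on LMArena Code.
    Negative: the model places better on LMArena Code.
    """
    mean_agentic = sum(ranks[b] for b in AGENTIC_BENCHMARKS) / len(AGENTIC_BENCHMARKS)
    return ranks["lmarena_code"] - mean_agentic


if __name__ == "__main__":
    shifts = {model: agentic_shift(r) for model, r in RANKS.items()}
    for model, shift in sorted(shifts.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{model}: {shift:+.1f}")
```

With the table's ranks, GLM-4.7 shows the largest positive shift (+1.5) and GPT-5.2-codex the largest negative one (-1.5), matching the first and third patterns described above.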
WebDev Arena vs Code Arena — They Are Different Leaderboards
This is the most common mistake in coding-model procurement: treating LMArena's WebDev rankings as a substitute for Code rankings. They are not. The WebDev arena specifically tests front-end code generation — HTML, CSS, JavaScript, React, framework-specific tasks. The Code arena covers general programming across all languages, with weighting toward backend, scripting, and algorithmic problem-solving.
A model can rank #1 on WebDev (visual front-end fluency, idiomatic React) and #5 on Code (broader algorithmic depth) — and that's not a contradiction, it's a feature. GLM-4.7's late-March 2026 entry into top-10 on both Text and WebDev simultaneously was the first time an open-weight model achieved this. For a team building a React dashboard, GLM-4.7 is genuinely competitive with proprietary leaders. For a team building distributed systems infrastructure, the picture flips.
Open-Source vs Proprietary: The Coding-Specific Math
The 2024 capability gap on coding tasks has effectively closed. GLM-4.7 sits at #8 on LMArena Code with an Elo of 1462, roughly 60 points behind GPT-5.2-codex (1521). DeepSeek-V4 follows at #9 (Elo 1455), and Qwen 3.5-Coder rounds out the open-weight top-10 at #10 (Elo 1448).
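To translate an Elo gap into something more tangible, the sketch below converts a rating difference into an expected head-to-head preference rate. It assumes the conventional logistic Elo formula with a 400-point scale; LMArena's actual rating fit may differ, so treat the output as a rough guide. The function name is illustrative, and the ratings come from the leaderboard table.

```python
def expected_win_rate(elo_a, elo_b):
    """Expected share of head-to-head votes won by model A over model B.

    Assumes the conventional Elo logistic curve with a 400-point scale.
    """
    return 1.0 / (1.0 + 10 ** ((elo_b - elo_a) / 400.0))


if __name__ == "__main__":
    # Ratings taken from the leaderboard table.
    pairs = {
        "GPT-5.2-codex vs GLM-4.7": (1521, 1462),
        "GPT-5.2-codex vs Claude Opus 4.6": (1521, 1517),
    }
    for label, (a, b) in pairs.items():
        print(f"{label}: gap {a - b} Elo, expected win rate {expected_win_rate(a, b):.1%}")
```

Under that assumption, the 59-point gap between GPT-5.2-codex and GLM-4.7 works out to voters preferring the leader in about 58% of head-to-head comparisons, while the 4-point gap to Claude Opus 4.6 is close to a coin flip.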
That Elo gap matters less than the cost gap. For workloads above approximately 200M tokens per month, self-hosting GLM-4.7 typically wins on TCO even after GPU amortization, ops headcount, and inference orchestration overhead are factored in. Below that volume, API access usually wins. The full break-even math is in our deep-dive on Open-Source LLM ROI, including why "free" can still cost 60% more than Claude for sub-threshold workloads.
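The break-even logic can be sketched as a simple fixed-versus-variable cost comparison. Every dollar figure below is a hypothetical placeholder, picked only so the arithmetic lands on the article's approximate 200M-token threshold; none of them are real prices, and the `CostModel` fields are assumptions for illustration. Substitute your own API quotes and infrastructure costs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostModel:
    api_price_per_m_tokens: float          # USD per million tokens via API
    selfhost_fixed_monthly: float          # USD/month: GPU amortization, ops, orchestration
    selfhost_marginal_per_m_tokens: float  # USD per million tokens when self-hosting


def monthly_costs(model, m_tokens):
    """Return (api_cost, selfhost_cost) in USD for a monthly volume in millions of tokens."""
    api_cost = model.api_price_per_m_tokens * m_tokens
    selfhost_cost = model.selfhost_fixed_monthly + model.selfhost_marginal_per_m_tokens * m_tokens
    return api_cost, selfhost_cost


def break_even_m_tokens(model):
    """Monthly volume (millions of tokens) where both options cost the same.

    Returns None when self-hosting never catches up, i.e. when its marginal
    cost per token is not lower than the API price.
    """
    saving_per_m = model.api_price_per_m_tokens - model.selfhost_marginal_per_m_tokens
    if saving_per_m <= 0:
        return None
    return model.selfhost_fixed_monthly / saving_per_m


if __name__ == "__main__":
    # HYPOTHETICAL placeholder figures, not real pricing.
    example = CostModel(
        api_price_per_m_tokens=20.0,
        selfhost_fixed_monthly=3000.0,
        selfhost_marginal_per_m_tokens=5.0,
    )
    print("Break-even (M tokens/month):", break_even_m_tokens(example))
    for volume in (100, 200, 400):
        api, hosted = monthly_costs(example, volume)
        print(f"{volume}M tokens: API ${api:,.0f} vs self-host ${hosted:,.0f}")
```

The design point is that self-hosting carries a fixed monthly cost that API access does not, so the comparison flips only once the per-token saving has paid off that fixed cost.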
What the LMArena Code Leaderboard Does NOT Measure
This is the procurement signal most engineering leaders miss. The LMArena Code arena tests human preference on single-turn coding chat. It does not test:
- Agentic multi-file editing. When a model autonomously navigates a repo, edits multiple files, runs tests, and iterates — closer to how Aider, Cursor, Cline, and Devin actually work — Aider's Polyglot benchmark is the better predictor.
- Real GitHub issue resolution. SWE-Bench Verified tests whether a model can resolve actual production issues end-to-end. Claude Opus 4.6 dominates here.
- Hallucinated import statements and broken dependencies. Single-turn chat doesn't catch what production agentic loops expose. Internal blind evals on your own codebase are the only reliable signal here.
- Long-context code understanding above 50K tokens. Most LMArena Code prompts are short. Claude Opus 4.6's 200K context advantage doesn't show up in the headline Elo but matters enormously for monorepo work.
- PR-merge rate and code review pass-through rate. The metrics that actually predict production engineering velocity require running an internal arena on your codebase.
For setting up that internal evaluation, see our walkthrough: Build Your Own LMArena: 7-Step Internal Eval Pipeline.
Procurement Framework: Reading All Three Leaderboards
For engineering teams choosing a coding model in 2026, here is the framework that actually works:
- Use LMArena Code to shortlist. Pick the top-5 from the Code leaderboard that match your language and workload profile. Ignore models below position 10 — the Elo gap to top-3 is too large.
- Validate on Polyglot if you are doing agentic coding. If your stack uses Aider, Cursor, Cline, or Devin, the Polyglot ranking matters more than the LMArena rank. Re-shortlist accordingly.
- Check SWE-Bench Verified for issue-resolution workloads. If your team is using AI to resolve real GitHub issues end-to-end, this is the leaderboard to weight highest.
- Run an internal blind eval on your own codebase. No public leaderboard predicts how a model behaves on your specific monorepo, your data residency rules, or your edge-case prompts. Plan a 1-week internal arena.
- Re-evaluate quarterly. The LMArena Code leaderboard reshuffled three times in Q1 2026 alone. A model that ranked top-3 in March can drop to mid-pack by May after a methodology change. Calendar it.
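The first three steps of the framework amount to weighting each leaderboard by how closely it matches your workload. A minimal sketch of that idea follows, using the illustrative ranks from the cross-leaderboard comparison table. The workload names and weights are hypothetical assumptions chosen for this example; they are not derived from any benchmark, and a weighted rank is no substitute for the internal blind eval in step four.

```python
# Ranks copied from the cross-leaderboard comparison table (illustrative per the source).
RANKS = {
    "GPT-5.2-codex": {"lmarena_code": 1, "aider_polyglot": 3, "swe_bench_verified": 2},
    "Claude Opus 4.6": {"lmarena_code": 2, "aider_polyglot": 1, "swe_bench_verified": 1},
    "Claude Opus 4.6 Thinking": {"lmarena_code": 3, "aider_polyglot": 2, "swe_bench_verified": 3},
    "Gemini 3 Pro": {"lmarena_code": 4, "aider_polyglot": 5, "swe_bench_verified": 4},
    "GLM-4.7": {"lmarena_code": 8, "aider_polyglot": 6, "swe_bench_verified": 7},
}

# HYPOTHETICAL weights per workload type; tune these to your own priorities.
WEIGHTS = {
    "chat_assist": {"lmarena_code": 0.7, "aider_polyglot": 0.2, "swe_bench_verified": 0.1},
    "agentic_editing": {"lmarena_code": 0.2, "aider_polyglot": 0.5, "swe_bench_verified": 0.3},
    "issue_resolution": {"lmarena_code": 0.2, "aider_polyglot": 0.3, "swe_bench_verified": 0.5},
}


def weighted_rank(ranks, weights):
    """Weighted average of a model's ranks; lower is better."""
    return sum(weights[board] * ranks[board] for board in weights)


def shortlist(all_ranks, workload, top_n=3):
    """Return the top_n model names for a workload, best first."""
    weights = WEIGHTS[workload]
    scored = {model: weighted_rank(r, weights) for model, r in all_ranks.items()}
    return sorted(scored, key=scored.get)[:top_n]


if __name__ == "__main__":
    for workload in WEIGHTS:
        print(workload, "->", shortlist(RANKS, workload))
```

With these placeholder weights, GPT-5.2-codex leads the chat-assist shortlist while Claude Opus 4.6 leads the agentic-editing and issue-resolution shortlists, which mirrors the article's reading of the three leaderboards.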
The Grok 4.20 Coding Question
Grok 4.20-beta1 sits at #5 on the LMArena Code arena with an Elo of 1485 — meaningfully behind Claude Opus 4.6 (#2) and GPT-5.2-codex (#1). On real-time API integration and live documentation tasks, Grok pulls ahead because of its built-in search capability. On traditional algorithmic problem-solving and refactoring, it lags.
The bigger procurement issue is data residency. Grok's xAI infrastructure footprint disqualifies it from most regulated procurement (financial services, healthcare, government, EU GDPR-strict workloads). For unregulated workloads where real-time data integration matters, Grok is competitive. For regulated workloads, it isn't a viable contender regardless of where it sits on the Code leaderboard. The full audit is in our Grok 4.20 B2B audit.
Building Sprint Capacity Around the Right Coding Model
Once you've shortlisted via the framework above, the operational question becomes: how do you size sprint capacity around an AI coding agent? Four guardrails:
- Estimate by token consumption, not story points. AI agents don't experience cognitive fatigue but do experience context-window degradation. A 50K-token epic that exceeds the model's effective context will fail catastrophically — break it into modular sub-tasks the model can process discretely.
- Treat the AI as a distinct contributor. Multi-agent orchestration approaches (one model for refactoring, one for tests, one for documentation) consistently outperform single-model approaches by 30-40% on internal benchmarks.
- Budget review time as 30% of generation time. The "vibe coding" workflow, where engineers focus on architecture and intent rather than syntax, is real, but it only works when senior engineers rigorously review every PR. Architectural guardrails are non-negotiable.
- Measure PR-merge rate, not generation speed. A model that generates 10× faster but produces PRs that fail review is net-negative. The metric that matters is "lines of accepted code per dollar of API spend."
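Two of these guardrails are easy to express in code: packing sub-tasks into batches that stay under a token budget, and tracking accepted lines per dollar. The sketch below shows one simple way to do both. The function names, the greedy packing strategy, and the sample numbers are assumptions for illustration; the right token budget depends on the model's effective context, which you would need to measure yourself.

```python
def split_by_token_budget(task_tokens, budget):
    """Greedily pack sub-tasks, in order, into batches that stay within the budget.

    task_tokens: estimated token counts for each sub-task of an epic.
    budget: maximum tokens the model should see in one batch.
    """
    batches, current, used = [], [], 0
    for tokens in task_tokens:
        if tokens > budget:
            raise ValueError(f"sub-task of {tokens} tokens exceeds budget {budget}; split it further")
        if used + tokens > budget:
            # Current batch is full: close it and start a new one.
            batches.append(current)
            current, used = [], 0
        current.append(tokens)
        used += tokens
    if current:
        batches.append(current)
    return batches


def accepted_lines_per_dollar(accepted_lines, api_spend_usd):
    """Lines of code that passed review and merged, per dollar of API spend."""
    if api_spend_usd <= 0:
        raise ValueError("api_spend_usd must be positive")
    return accepted_lines / api_spend_usd


if __name__ == "__main__":
    # Hypothetical sprint numbers for illustration only.
    print(split_by_token_budget([20_000, 15_000, 10_000, 30_000], budget=40_000))
    print(accepted_lines_per_dollar(accepted_lines=1_200, api_spend_usd=48.0))
```

Raising an error for an oversized sub-task, rather than silently truncating it, keeps the failure visible at planning time instead of at generation time.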
Frequently Asked Questions (FAQ)
On the LMArena Code leaderboard, GPT-5.2-codex sits at #1 — it was added on January 23, 2026 and consolidated a top-3 coding position by April. Claude Opus 4.6 holds the #2 slot. The two trade leadership weekly within overlapping confidence intervals. Aider's Polyglot benchmark, however, often ranks Claude Opus 4.6 ahead of GPT-5.2-codex on multi-file editing tasks. The right answer depends on whether your workload looks more like chat-style coding or agentic multi-file editing.
Grok 4.20-beta1 ranks roughly #5 on the LMArena Code arena versus Claude Opus 4.6 at #2 — a meaningful gap. However, Grok 4.20 outperforms Claude on real-time API integration tasks and live documentation generation because of its built-in search capability. For traditional algorithmic problem-solving and refactoring, Claude Opus 4.6 holds a clear edge. Grok carries data-residency caveats that disqualify it from most regulated procurement.
On the LMArena Code leaderboard, GPT-5.2-codex does rank ahead of Claude Opus 4.6, but only by a small margin (1521 vs. 1517) that sits inside overlapping confidence intervals. On Aider's Polyglot benchmark, which evaluates agentic multi-file editing rather than chat-style coding, Claude Opus 4.6 frequently ranks ahead. The two leaderboards measure different things: LMArena Code tests preference on coding chat, while Polyglot tests autonomous multi-file editing. Use both before procurement.
LMArena does not directly test agentic coding. Its Code arena and WebDev arena test single-turn coding preferences via human voting. They do not test multi-turn agentic flows where a model autonomously edits multiple files, runs tests, and iterates. For agentic evaluation, cross-reference with Aider's Polyglot benchmark and SWE-Bench Verified. The LMArena Code leaderboard predicts chat-style coding quality; Polyglot predicts production agentic behavior.
LMArena Code and Aider Polyglot are both worth trusting, but for different decisions. LMArena Code is best for selecting a model for IDE-assist, code review, and chat-style pair programming. Aider Polyglot is best for selecting a model for autonomous agents (Aider, Cursor, Cline, Devin). The two leaderboards regularly disagree on the top three. Sophisticated engineering teams shortlist models on LMArena Code, validate on Polyglot, then run an internal blind eval on their own codebase.
The WebDev arena tests front-end code generation specifically — HTML, CSS, JavaScript, React, framework-specific tasks. The Code arena covers general programming across all languages. A model can rank #1 on WebDev (visual front-end fluency) but #5 on Code (broader algorithmic depth). GLM-4.7 famously entered top-10 on both Text and WebDev simultaneously in late March 2026 — the first open-weight model to do so.
GLM-4.7 is the highest-ranked open-weight model on both LMArena Code and WebDev as of April 2026, sitting at approximately #8 on the Code leaderboard. DeepSeek-V4 and Qwen 3.5-Coder follow closely behind. The gap to the proprietary leader has narrowed to roughly 60 Elo points (1462 vs. 1521 for GPT-5.2-codex), close enough that for code-heavy workloads above 200M tokens per month, self-hosting GLM-4.7 often wins on TCO once GPU amortization is factored in.
LMArena Code rankings predict production performance only partially. LMArena Code Elo correlates with single-turn code generation quality but does not predict PR-merge rate, hallucinated import statements, or behavior across long agentic chains. SWE-Bench Verified, which tests real GitHub issue resolution, is a better proxy for production behavior. The most reliable predictor is an internal blind eval run against your own codebase and prompts.
Coding and general conversation reward different fine-tuning, which is why a model's Code rank can differ sharply from its Text rank. GPT-5.2-codex is fine-tuned aggressively for code generation, which makes it dominant on the Code arena but mid-pack on the Text leaderboard. Claude Opus 4.6 is more balanced: it ranks top-3 on both Text and Code. The lesson: ignore the Overall headline ranking and always consult the leaderboard that matches your actual use case.
Approximate input/output rates as of April 2026: Claude Opus 4.6 is the most expensive at standard tier; GPT-5.2-codex sits roughly 30-40% lower; Gemini 3 Pro is the cheapest among proprietary leaders. Open-weight self-hosting (GLM-4.7, DeepSeek-V4) breaks even versus API access at approximately 200M tokens per month after factoring GPU amortization, ops headcount, and inference orchestration overhead. See our open-source LLM ROI analysis for the full break-even math.
Conclusion: Read All Three Leaderboards Before You Sign Anything
The LMArena Code leaderboard is the most consulted coding-model document of 2026, and the most misread. GPT-5.2-codex's #1 ranking is real but tells only part of the story. Claude Opus 4.6 wins on Aider's Polyglot benchmark and SWE-Bench Verified, the leaderboards that predict production agentic behavior. GLM-4.7 has narrowed the open-weight gap enough that high-volume workloads should seriously consider self-hosting.
Pick your leaderboard based on your workload. Validate on at least two. Then run an internal blind eval on your own codebase before signing any procurement contract. For setting up that internal evaluation in a week, see Build Your Own LMArena. For the underlying methodology that makes confidence intervals matter more than the headline rank, see LMArena Elo Explained.