Who's #1 on LMArena Right Now? The Live Top-10 Decoded

[Image: LMArena live top-10 leaderboard dashboard, May 2026, showing current rankings with Elo scores and confidence intervals]

A live decode of the LMSYS/LMArena Top Models 2026 leaderboard — who's #1 right now, why the headline rank lies, and which "leadership" claims are statistically meaningless for procurement.

  • The Text leaderboard top-3 is statistically tied. Claude Opus 4.6 (1504), Gemini 3.1 Pro Preview (1500), and Claude Opus 4.6 Thinking (1500) all sit within overlapping 95% confidence intervals. The "headline #1" is essentially noise.
  • Five models rotate the top slot week over week. Add Grok 4.20-beta1 and Gemini 3 Pro and you have the rotating cast. Locking a procurement decision to a single screenshot is the single most expensive mistake teams make.
  • Preview models have wider CIs and unstable Elo. Anything tagged "Preliminary" can swing 20–40 Elo points before stabilizing. Wait for the tag to drop before signing.
  • Vote count matters more than headline rank. A model at #5 with 39,000 votes is more procurement-grade than a model at #2 with 4,000 votes — even if the #2 has a higher headline number.
  • Geographic and mirror variance are real. The same model can rank differently when prompted predominantly from U.S. vs India vs EU contexts, and OpenLM/Hugging Face mirrors weight benchmark blends differently than LMArena.

Five models trade the top slot on the LMSYS Chatbot Arena leaderboard weekly. Procurement teams that lock in the wrong "leader" lose 18% on tokens. That's not a marketing line — it's the average overspend we measure when teams sign 12-month contracts based on a Wednesday Elo screenshot that's already obsolete by the following Tuesday.

This page is the live snapshot decoded — what's actually at #1 right now, how often it changes, and which "leadership" claims are statistically meaningless. For the broader monthly tracker and weekly top-10 widget, start at the parent pillar: LMSYS/LMArena Top Models 2026. This sub-page zooms specifically into the real-time read — the question searchers actually type: "who is leading LMArena right now."

The official source we cross-reference throughout is the live LMArena leaderboard at lmarena.ai — verify any procurement-grade decision against it directly before you sign anything.

Today's Top-10 Snapshot — and Why It's Already Stale

The current LMArena Text leaderboard top-10 (rounded Elo, with 95% CI):

Live LMArena Text Leaderboard — Top 10

Snapshot freshness: updated weekly. Elo scores are rounded; ± values denote 95% confidence interval.

Rank | Model                    | Vendor    | Elo  | CI  | Status
1    | Claude Opus 4.6          | Anthropic | 1504 | ±5  | Stable
2    | Gemini 3.1 Pro Preview   | Google    | 1500 | ±9  | Preliminary
3    | Claude Opus 4.6 Thinking | Anthropic | 1500 | ±5  | Stable
4    | Grok 4.20-beta1          | xAI       | 1493 | ±8  | Preliminary
5    | Gemini 3 Pro             | Google    | 1485 | ±3  | Stable
6    | GPT-5.2                  | OpenAI    | 1481 | ±4  | Stable
7    | Gemini 3 Flash           | Google    | 1473 | ±4  | Stable
8    | Grok 4.1 Thinking        | xAI       | 1473 | ±5  | Stable
9    | MiniMax M2.1 Preview     | MiniMax   | 1466 | ±10 | Preliminary
10   | GLM-4.7                  | Open      | 1462 | ±7  | Stable

Source: LMArena Text leaderboard via arena-ai-leaderboards JSON feed. Always verify against lmarena.ai before procurement decisions.

The numbers are accurate at time of writing. They will not be accurate by next Wednesday. LMArena publishes weekly updates, and the top-3 in particular reshuffles continuously because the gaps between them are smaller than the confidence intervals. That's not a flaw — it's exactly what you should expect when three models are within 4 Elo points of each other.

For the underlying methodology that explains why these CIs matter more than the rank order, see our companion deep-dive: LMArena Elo Explained.

The Statistical Tie at the Top — Confidence Intervals That Cost You Money

The most expensive misreading of the LMArena leaderboard is treating it like a sports league table. It isn't. It's a statistical estimate with explicit error bars, and the error bars matter more than the headline ordering for any decision involving real budget.

How a #2 Model Statistically Ties With #1

Claude Opus 4.6 sits at Elo 1504 with ±5 CI. That means the true Elo lies between 1499 and 1509 with 95% confidence. Gemini 3.1 Pro Preview is at 1500 ±9 — true Elo between 1491 and 1509. The two ranges overlap from 1499 to 1509.

Translation: the "Claude is #1, Gemini is #2" headline is statistically meaningless. Either could be the true #1, and the headline order can flip week-to-week from random vote variance, not capability change.
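
A quick way to sanity-check any "we're #1" claim is to test the two confidence intervals for overlap before reacting to a rank change. A minimal sketch in Python, using the snapshot numbers above (the helper name is illustrative):

    def overlapping(elo_a, ci_a, elo_b, ci_b):
        """True if two 95% confidence intervals overlap, i.e. the rank gap is noise."""
        return (elo_a - ci_a) <= (elo_b + ci_b) and (elo_b - ci_b) <= (elo_a + ci_a)

    claude_opus_46 = (1504, 5)         # 1499-1509
    gemini_31_pro_preview = (1500, 9)  # 1491-1509

    # True -> treat "Claude #1 vs Gemini #2" as a statistical tie
    print(overlapping(*claude_opus_46, *gemini_31_pro_preview))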

This matters in three concrete ways:

  • A vendor pitch deck that says "we're #1 on LMArena" without the CI is doing motivated cherry-picking, not procurement-grade reporting.
  • A 12-month enterprise contract priced on the basis of "we're paying for the #1 model" is paying premium for a tied position.
  • A switch decision triggered by a headline rank change ("we just dropped from #1 to #3 — emergency!") is reacting to noise.

The "Preliminary" Tag and What It Hides

LMArena flags models with under ~4,000–5,000 votes as Preliminary. As of the snapshot above, three of the top-10 carry that tag: Gemini 3.1 Pro Preview, Grok 4.20-beta1, and MiniMax M2.1 Preview.

Preview Elo can swing 20–40 points as votes accumulate. Concrete example: a model that lands at "Preliminary 1505" might stabilize at 1475 once it has seen a broader prompt distribution — a 30-point drop that has nothing to do with the model getting worse. It just means the early voter sample wasn't representative.

Procurement rule: if you're committing budget for a year, prioritize models whose Preliminary tag has dropped and whose vote count exceeds 8,000. Speed-of-rank-entry is a marketing signal, not a quality signal.
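
In dashboard terms, that rule reduces to a two-condition filter. A minimal sketch against the snapshot above (the 8,000-vote cutoff is the one stated in the rule; the data layout is illustrative):

    # (model, approximate votes, carries Preliminary tag) -- values from the snapshot above
    watchlist = [
        ("Claude Opus 4.6",        8945,  False),
        ("Gemini 3.1 Pro Preview", 4042,  True),
        ("Grok 4.20-beta1",        5071,  True),
        ("Gemini 3 Pro",           39673, False),
    ]

    MIN_VOTES = 8000  # procurement threshold from the rule above

    procurement_grade = [name for name, votes, preliminary in watchlist
                         if not preliminary and votes >= MIN_VOTES]
    print(procurement_grade)  # ['Claude Opus 4.6', 'Gemini 3 Pro']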

Why Grok 4.20 Scores Keep Changing

Grok 4.20-beta1 is the most volatile model on the leaderboard right now. Its Elo has moved more than any other top-10 model over the past 30 days — a +22-point gain to its current 1493. There are three reasons:

  • It's still tagged Preliminary with vote count around 5,000. Every 1,000 new votes meaningfully tightens its CI and shifts its point estimate.
  • It performs well on the real-time and search-prompt subdomain that LMArena's voting pool happens to over-represent right now. As prompt mix broadens, expect the Elo to settle 5–15 points lower.
  • xAI ships beta variants frequently — grok-4-20-beta1 itself replaces an earlier grok-4-1 lineage that ranked ~20 Elo points lower. The "Grok" name on the leaderboard is a moving target.

For the procurement-specific implications of Grok's volatility — including why its Elo doesn't translate to enterprise readiness for regulated buyers — see our cross-cluster audit: Grok 4.20 B2B Audit: Why The Elo Score Is a Trojan Horse.

Vote Counts — Who Has the Most, Who Has the Least

Vote count is the single most underweighted column on the leaderboard. Approximate vote counts as of the current snapshot:

  • Gemini 3 Pro: ~39,673 votes — the most-tested top-10 model
  • GPT-5.2: ~22,118 votes
  • Gemini 3 Flash: ~18,902 votes
  • Grok 4.1 Thinking: ~11,540 votes
  • Claude Opus 4.6: ~8,945 votes
  • Claude Opus 4.6 Thinking: ~8,073 votes
  • GLM-4.7: ~5,884 votes
  • Grok 4.20-beta1: ~5,071 votes — Preliminary
  • Gemini 3.1 Pro Preview: ~4,042 votes — Preliminary
  • MiniMax M2.1 Preview: ~3,201 votes — Preliminary

The procurement read: Gemini 3 Pro at #5 with 39,000+ votes is, in pure statistical terms, more reliably characterized than Gemini 3.1 Pro Preview at #2 with 4,000 votes. That doesn't mean Gemini 3 Pro is "better" — it means the rank you see for it is much closer to the rank you'll see in three weeks.
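
One way to see why the 39,000-vote rank is steadier: if you assume the usual statistical behavior where a rating's confidence interval narrows roughly with the square root of the vote count, the snapshot's own numbers line up. Gemini 3 Pro has about ten times the votes of Gemini 3.1 Pro Preview and a CI about a third as wide (±3 vs ±9). A back-of-envelope check under that 1/sqrt(n) assumption:

    import math

    votes_pro, ci_pro = 39673, 3         # Gemini 3 Pro (snapshot above)
    votes_preview, ci_preview = 4042, 9  # Gemini 3.1 Pro Preview

    expected = math.sqrt(votes_pro / votes_preview)  # predicted CI-width ratio
    observed = ci_preview / ci_pro

    print(f"expected ~{expected:.1f}x narrower CI, observed {observed:.1f}x")
    # expected ~3.1x narrower CI, observed 3.0x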

Geographic and Mirror Variance

The third layer of leaderboard nuance most teams miss: the leaderboard you see depends on where you're looking from and which mirror you're reading.

U.S. Prompts vs India Prompts vs EU Prompts

LMArena aggregates votes globally, but the prompt distribution is geographically uneven. U.S. votes still dominate raw count, which biases the leaderboard slightly toward English-first prompts and U.S.-cultural-context evaluations.

  • Models tuned for U.S. business English (Claude, GPT-5.2) typically rank a few Elo points higher than they would on a balanced global corpus.
  • Models tuned for multilingual prompts (Gemini family, GLM-4.7, Qwen 3.5) are slightly under-rewarded on the global Text leaderboard but often over-perform in regional internal evals.
  • For procurement teams in India, EU, or LATAM markets, the global LMArena rank is a starting point, not the answer. Run an internal eval on your actual prompt distribution.

OpenLM vs LMArena — Why Same Model, Different Rank

OpenLM, Hugging Face Open LLM Leaderboard, OpenRouter, and Artificial Analysis all rank LLMs — but they rank different things:

  • LMArena: crowdsourced human-preference Elo on blind A/B prompts.
  • Hugging Face Open LLM: automated benchmark composite (MMLU-Pro, GPQA Diamond, IFBench, MATH).
  • OpenRouter: real production usage by token volume — a popularity signal, not a quality signal.
  • Artificial Analysis Intelligence Index v3: weighted blend of capability + speed + cost.

A model can rank #1 on LMArena and #4 on Hugging Face. Both are correct — they're measuring different things. If your workload is conversational and human-judged, weight LMArena. If it's structured-output with verifiable correctness (math, code, factual recall), weight Hugging Face benchmarks. For an open-source-specific cross-leaderboard view, see Open-Source LMArena Rankings: 7 Models Closing the Gap.

How to Check the Live Leaderboard Without lmarena.ai

If you need programmatic access — for a procurement dashboard, a Slack bot, or an internal Confluence page — you don't have to scrape lmarena.ai. The community-maintained arena-ai-leaderboards JSON feed on GitHub mirrors the official data as structured JSON and updates within hours of LMArena's own publish cycle.

Three practical use cases:

  • Daily Slack notification when a model crosses ±10 Elo from your procurement baseline.
  • Procurement dashboard that pulls Elo + CI + vote count for a watchlist of 5 models.
  • Internal alerting when a Preliminary tag drops on a model your team is evaluating — the moment its rank stabilizes.
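
A minimal sketch of the first use case, an Elo-drift alert, follows. The feed URL and JSON field names below are assumptions rather than the repo's documented schema, so verify them against arena-ai-leaderboards before wiring this into anything; only the threshold logic is the point:

    import requests

    # Assumed URL and schema -- check the arena-ai-leaderboards repo for the real ones.
    FEED_URL = "https://raw.githubusercontent.com/<org>/arena-ai-leaderboards/main/text.json"

    BASELINE = {"Claude Opus 4.6": 1504, "Gemini 3 Pro": 1485}  # your procurement baseline
    THRESHOLD = 10  # alert when a watched model drifts more than +/-10 Elo

    def drift_alerts():
        rows = requests.get(FEED_URL, timeout=10).json()  # assumed: list of {"model", "elo"}
        return [f'{r["model"]}: {BASELINE[r["model"]]} -> {r["elo"]}'
                for r in rows
                if r.get("model") in BASELINE
                and abs(r["elo"] - BASELINE[r["model"]]) >= THRESHOLD]

    for alert in drift_alerts():
        print(alert)  # or post to a Slack webhook / internal dashboard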

For the broader month-over-month picture (and the April 2026 vote-pipeline overhaul that caused three top-10 models to lose 30+ Elo overnight), see our companion update: April 2026 LMArena Shake-Up: 3 Models Crashed Out of the Top-10.

The Bottom Line — Don't Lock Procurement to a Wednesday Screenshot

The "who is #1 right now" question has a precise answer at any given moment, but a one-week-old answer is already obsolete for any procurement-grade decision. The actionable read for enterprise buyers is not the headline rank — it's the rank-plus-CI-plus-vote-count triple read.

  • Treat overlapping CIs as ties. Don't pay #1 prices for #2 capability.
  • Wait for Preliminary tags to drop before signing 12-month contracts.
  • Cross-reference against at least one non-LMArena leaderboard for your specific workload.
  • Run an internal blind eval on your own prompts before any procurement-grade commitment.

Need the live top-10 widget that refreshes weekly and the full hub? For category-specific leaderboards (Coding, Vision, WebDev, Writing) and the monthly tracker, see the live LMArena top-10 leaderboard →

About the Author: Sanjay Saini

Sanjay Saini is an Enterprise AI Strategy Director specializing in digital transformation and AI ROI models. He covers high-stakes news at the intersection of leadership and sovereign AI infrastructure.

Connect on LinkedIn

Frequently Asked Questions (FAQ)

Who is currently leading the LMArena Chatbot Arena right now?

Claude Opus 4.6 currently holds the headline #1 position on the Text leaderboard at Elo 1504. However, Gemini 3.1 Pro Preview and Claude Opus 4.6 Thinking sit within overlapping 95% confidence intervals, making the top-3 a statistical tie. The headline rank reshuffles weekly.

Is Claude Opus 4.6 still #1 in 2026?

On the Text leaderboard, yes — but within a confidence interval that overlaps with Gemini 3.1 Pro Preview and Claude Opus 4.6 Thinking. On the Code arena, GPT-5.2-codex has held #1 since January 2026. The "still #1" question depends on which leaderboard you read.

How does Gemini 3.1 Pro compare to GPT-5.2?

Gemini 3.1 Pro Preview leads GPT-5.2 by ~19 Elo points on the Text arena (1500 vs 1481) but carries a wider ±9 confidence interval because it's still tagged Preliminary. On the Code arena, GPT-5.2-codex leads decisively.

Why do Grok 4.20 scores keep changing?

Grok 4.20-beta1 is still flagged Preliminary with around 5,000 votes. Every 1,000 new votes tightens its confidence interval and shifts its point estimate. Expect 5–15 Elo points of additional drift before it stabilizes.

What does "preliminary" mean on LMArena?

The Preliminary tag flags models with fewer than approximately 4,000–5,000 votes. Their Elo can swing 20–40 points before stabilizing. For procurement-grade decisions, wait for the tag to drop and vote count to exceed 8,000 before signing contracts.

Which model has the most votes recorded?

Among the current top-10, Gemini 3 Pro leads at approximately 39,673 votes — by far the most-tested. GPT-5.2 follows at ~22,118. The Preliminary-tagged models (Gemini 3.1 Pro Preview, MiniMax M2.1 Preview, Grok 4.20-beta1) all sit under 5,500 votes.

Can a #2 model statistically tie with #1?

Yes — and it's the rule, not the exception, in the current top-3. Claude Opus 4.6 at 1504 ±5 and Gemini 3.1 Pro Preview at 1500 ±9 have overlapping confidence intervals (1499–1509 vs 1491–1509). Statistically, either could be the true #1.

Do top model rankings differ for U.S. vs India queries?

Slightly. LMArena's vote pool is U.S.-weighted, which marginally favors models tuned for U.S. business English. For India, EU, or LATAM procurement, treat the global rank as a shortlist filter, then run an internal eval on your actual regional prompt distribution.

How do I check the live leaderboard without using lmarena.ai?

Use the community-maintained arena-ai-leaderboards JSON feed on GitHub, which mirrors official data within hours. It powers Slack alerts, procurement dashboards, and watchlist notifications without scraping. Cross-reference against lmarena.ai before signing any contract.

Why does the same model rank differently on OpenLM vs LMArena?

They measure different things. LMArena ranks by crowdsourced human preference; OpenLM and Hugging Face rank by automated benchmark composites; OpenRouter ranks by production usage; Artificial Analysis blends capability with speed and cost. A model can rank #1 on one and #4 on another simultaneously.