April 2026 LMArena Shake-Up: 3 Models Crashed Out of the Top-10
- Three previously top-10 incumbent models lost 29 to 34 Elo points in the late March to mid-April window, and the cause was a methodology change, not a quality regression.
- LMArena's January 13, 2026 vote-pipeline overhaul applied identity-leak detection and quality filtering consistently across all votes, and enabled vote de-duplication in the text-to-image and video arenas.
- GPT-5.2-codex, added to the Code leaderboard on January 23, 2026, has consolidated a top-3 coding position by April.
- GLM-4.7 became the highest-ranked open-weight model on both Text and WebDev arenas simultaneously after entering the leaderboard in late December 2025.
- MiniMax M2.1 Preview is the surprise entrant — top-10 on Text with under 4,000 votes, meaning its Elo will swing materially as votes accumulate.
The April 2026 LMArena rankings tell a story the vendor blog posts have carefully avoided: three top-10 incumbents lost their slots between late March and mid-April 2026, and the cause had nothing to do with model quality. It was a methodology change. The January 13, 2026 LMArena data-pipeline overhaul applied identity-leak detection and quality filtering consistently across all votes and enabled vote de-duplication in the text-to-image and video arenas, and the models that had benefited from those gaps lost ground.
This page is the monthly snapshot. The Elo shift table below shows the biggest movers from the late March to mid-April 2026 window. For the current week's live top-10, see the pillar tracker linked further down.
The April 2026 Elo Shift Tracker
Movement reflects the late March to mid-April 2026 window, post-vote-pipeline overhaul. Δ values are net Elo change.
| Model | Vendor / tag | Current Elo | Δ (30 days) | Cause |
|---|---|---|---|---|
| Claude Opus 4.6 | Anthropic | 1504 | +4 | Confidence interval tightened; clean vote profile. |
| Gemini 3.1 Pro Preview | Google | 1500 | +18 | Preliminary tag drove vote velocity; CI still wide. |
| Claude Opus 4.6 Thinking | Anthropic | 1500 | +8 | Reasoning prompts skew vote share. |
| Grok 4.20-beta1 | xAI | 1493 | +22 | Strong gains on real-time and search prompts. |
| GPT-5.2-codex | OpenAI · Code arena | 1488 | +27 | Added Jan 23, 2026; rapid coding-vote accumulation. |
| MiniMax M2.1 Preview | MiniMax | 1466 | +30 | Preview status; CI very wide pending more votes. |
| GLM-4.7 | Open-weight | 1462 | +14 | Highest-ranked open model; entered top-10. |
| **The losers (vote-pipeline overhaul fallout)** | | | | |
| Incumbent A | redacted vendor | n/a | -34 | Identity-leak filter applied; dropped out of top-10. |
| Incumbent B | redacted vendor | n/a | -31 | Vote de-duplication; lost duplicate-vote inflation. |
| Incumbent C | redacted vendor | n/a | -29 | Quality filter applied uniformly; lost on noise. |
Source: LMArena Text leaderboard via arena-ai-leaderboards JSON feed; cross-referenced with the official LMArena Changelog. Model names of dropped incumbents withheld pending vendor disclosure.
What Actually Happened in Late March 2026
The April 2026 leaderboard reads like a vendor disaster, but the disaster was bookkeeping, not capability. On January 13, 2026, LMArena published a changelog entry titled "Data Pipeline Update: Identity Filter and Quality Improvements." Three things shipped at once: identity-leak detection that removes votes where a model accidentally revealed its name in its own response; quality filtering applied uniformly across every vote rather than just flagged ones; and vote de-duplication enabled in text-to-image and video arenas.
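The three filters described above can be sketched in a few lines. This is a minimal illustration, not LMArena's actual pipeline: the `Vote` schema, the substring-based leak check, the prompt-length quality rule, and the de-duplication key are all assumptions made for the example. It also applies de-duplication to every vote, whereas the changelog scopes it to the text-to-image and video arenas.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vote:
    # Hypothetical vote record; the real schema is not described in the changelog.
    voter_id: str
    prompt: str
    model_a: str
    model_b: str
    response_a: str
    response_b: str
    winner: str  # "a", "b", or "tie"


def leaks_identity(vote: Vote) -> bool:
    """True if either response contains the name of the model that wrote it."""
    return (
        vote.model_a.lower() in vote.response_a.lower()
        or vote.model_b.lower() in vote.response_b.lower()
    )


def passes_quality(vote: Vote, min_prompt_chars: int = 3) -> bool:
    """Placeholder quality rule (assumption): reject empty or near-empty prompts."""
    return len(vote.prompt.strip()) >= min_prompt_chars


def filter_votes(votes: list[Vote]) -> list[Vote]:
    """Apply the leak filter, the quality filter, then drop repeated votes."""
    seen: set[tuple[str, str, str, str]] = set()
    kept: list[Vote] = []
    for vote in votes:
        if leaks_identity(vote) or not passes_quality(vote):
            continue
        # Assumed duplicate definition: same voter, prompt, and model pair.
        key = (vote.voter_id, vote.prompt, vote.model_a, vote.model_b)
        if key in seen:
            continue
        seen.add(key)
        kept.append(vote)
    return kept
```

A model whose historical votes were disproportionately removed by any one of these steps would see its fitted rating move even though its responses did not change, which is the mechanism the article attributes the drops to.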
The validation team described the rank adjustments as minimal but real. That description undersold the reality. By the second week of February, three models that had ranked top-10 throughout 2025 were no longer top-10. By April, two of them remained outside the top-15. The official line is that smaller-vote models saw the largest fluctuations, which is true. But it was the largest-vote models with the dirtiest vote profiles that lost ground permanently.
For the underlying methodology — Bradley-Terry probability, bootstrap resampling, and the 95% confidence interval that determines what counts as a "real" Elo lead — see our deep-dive on LMArena Elo Explained.
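As a rough illustration of what an Elo gap means, the sketch below converts a rating difference into an expected head-to-head win rate and bootstraps a 95% interval for an observed win rate. It assumes the conventional Elo logistic form (base 10, scale 400); the function names and the sample data are made up for the example, and this is not LMArena's fitting code.

```python
import random


def win_probability(elo_a: float, elo_b: float, scale: float = 400.0) -> float:
    """Expected probability that A beats B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((elo_b - elo_a) / scale))


def bootstrap_ci(outcomes: list[int], n_resamples: int = 1000, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap 95% interval for the mean of 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample the votes with replacement and record each resample's win rate.
    means = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples))
    lo_idx = n_resamples * 25 // 1000        # 2.5th percentile
    hi_idx = n_resamples * 975 // 1000 - 1   # 97.5th percentile
    return means[lo_idx], means[hi_idx]


if __name__ == "__main__":
    # The 4-point gap between 1504 and 1500 is barely distinguishable from a coin flip.
    print(round(win_probability(1504, 1500), 4))
    # Hypothetical sample: 110 wins out of 200 head-to-head votes.
    print(bootstrap_ci([1] * 110 + [0] * 90))
```

The point of the bootstrap step is that the interval width depends on vote count: the same observed win rate from ten times as many votes yields a much narrower interval, which is why low-vote models carry wide confidence intervals.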
The Top-3: Tightening, Not Reshuffling
At the headline level, the April 2026 Text leaderboard top-3 is unchanged from late March: Claude Opus 4.6 sits at #1, with Gemini 3.1 Pro Preview and Claude Opus 4.6 Thinking statistically tied at #2 and #3 within overlapping confidence intervals. What changed is the precision. Claude Opus 4.6's confidence interval narrowed from ±8 to ±5 as its vote count crossed 8,945. Gemini 3.1 Pro Preview gained 18 points but still carries a ±9 CI because its Preliminary tag has not dropped.
Grok 4.20-beta1's +22 climb to position #4 (Elo 1493) was the largest legitimate gain among non-preview models — driven by strong real-time and search-prompt voting. The xAI team has not commented on the gain. For procurement teams evaluating Grok against Anthropic and Google offerings, our Grok 4.20 B2B audit and Grok vs Claude vs GPT-5.2 comparison walk through the data residency and PR-merge-rate tradeoffs that the Elo headline hides.
The Coding Story: GPT-5.2-codex Consolidates
If the Text leaderboard story was "tightening," the Code arena story was "consolidation." GPT-5.2-codex, added to the LMArena Code leaderboard on January 23, 2026, climbed 27 Elo points in 30 days to reach top-3 coding territory. The model is fine-tuned specifically for code generation, which is why it ranks materially higher on Code than on Text. This is the textbook case of why the headline ranking lies: a procurement team that picks the Text #1 for a coding workload is paying for a capability it does not need while underbuying the one it does.
The bigger surprise was on the open-weight side. GLM-4.7 entered top-10 on both Text and WebDev simultaneously — the first open-weight model to do so since the leaderboards split in late 2025. Our LMArena Coding Leaderboard deep-dive shows where GLM-4.7 outperforms proprietary models and where Aider's Polyglot leaderboard disagrees with LMArena Code on the same models.
The Open-Weight Surge
Three open-weight stories matter in April 2026. First, GLM-4.7's dual top-10 entry on Text and WebDev. Second, the broader open-weight cohort — OLMo 3.1, the latest Llama and DeepSeek iterations — narrowing the gap to within 25 Elo points of proprietary leaders on most categories. Third, the cost story: that 25-Elo gap doesn't automatically translate to "switch and save" — for workloads under 200M tokens per month, API access still wins on TCO.
The break-even math is unforgiving. GPU amortization, ops headcount, inference orchestration, and security audit overhead all push the open-weight ROI threshold higher than vendors' marketing implies. Our Open-Source LLM ROI walkthrough shows the exact monthly token-volume threshold above which self-hosting wins on TCO. For most enterprise workloads, that threshold is meaningfully higher than the "Llama is free" headline suggests.
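The break-even logic can be written down directly. The sketch below is a simplified model with assumed inputs: one fixed monthly self-hosting cost (GPU amortization, ops headcount, orchestration, audits rolled together) and flat per-million-token prices on both sides. All dollar figures are placeholders chosen for the example, not numbers from the ROI walkthrough.

```python
from typing import Optional


def break_even_tokens_per_month(
    fixed_monthly_cost: float,
    api_price_per_mtok: float,
    selfhost_marginal_per_mtok: float,
) -> Optional[float]:
    """Monthly token volume above which self-hosting is cheaper than API access.

    Returns None when self-hosting saves nothing per token, since it can
    then never recover its fixed costs.
    """
    saving_per_mtok = api_price_per_mtok - selfhost_marginal_per_mtok
    if saving_per_mtok <= 0:
        return None
    # Fixed cost divided by per-million-token saving gives millions of tokens.
    return fixed_monthly_cost / saving_per_mtok * 1_000_000


if __name__ == "__main__":
    # Placeholder inputs: $30,000/month fixed, $10 vs $2 per million tokens.
    threshold = break_even_tokens_per_month(30_000, 10.0, 2.0)
    print(f"{threshold:,.0f} tokens/month")
```

Every item the paragraph lists raises `fixed_monthly_cost`, and the threshold scales linearly with it, which is why realistic overhead pushes the break-even volume well above what a "free weights" comparison suggests.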
What the Vote-Pipeline Overhaul Means for Your Procurement
The April 2026 leaderboard is the most statistically defensible LMArena snapshot to date. The January 2026 vote-pipeline overhaul tightened confidence intervals across the board, eliminated identity-leak distortion, and enforced consistent quality filtering. For procurement teams, that means three things in practice.
- Trust the post-January numbers more than pre-January numbers. Any procurement deck citing pre-January 2026 LMArena scores is using contaminated data. Refresh the source.
- Respect the confidence interval as a hard rule. If two models' CIs overlap, the rank order is statistically meaningless. The 4-Elo gap between #1 and #2 in our top-3 is exactly this case.
- Skip the Preliminary models for procurement. Gemini 3.1 Pro Preview, MiniMax M2.1 Preview, and Grok 4.20-beta1 all carry wider CIs because they have not accumulated enough votes for stability. Their Elo will move 20–40 points as votes accumulate. Wait for the Preliminary tag to drop before signing a contract.
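The confidence-interval rule in the list above reduces to a one-line check. The sketch below treats each reported "±" value as a symmetric half-width, which is an assumption; the Elo and CI figures in the usage example are the ones quoted in this article for the Text top-2.

```python
def intervals_overlap(elo_a: float, ci_a: float, elo_b: float, ci_b: float) -> bool:
    """True if [elo_a - ci_a, elo_a + ci_a] and [elo_b - ci_b, elo_b + ci_b] overlap."""
    return abs(elo_a - elo_b) <= ci_a + ci_b


def rank_is_meaningful(elo_a: float, ci_a: float, elo_b: float, ci_b: float) -> bool:
    """Apply the article's rule: an ordering only counts if the intervals do not overlap."""
    return not intervals_overlap(elo_a, ci_a, elo_b, ci_b)


if __name__ == "__main__":
    # Claude Opus 4.6 (1504 ± 5) vs Gemini 3.1 Pro Preview (1500 ± 9).
    print(rank_is_meaningful(1504, 5, 1500, 9))  # intervals overlap, so the order is not meaningful
```

Non-overlap is a conservative test: two models can differ significantly even when their intervals touch slightly, so a stricter analysis would compare the pair directly. As a procurement screen, the simple rule errs on the side of not over-reading small gaps.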
Strategic Re-Alignment: What to Do This Month
If you operate a model-routing layer, the April 2026 data justifies a configuration change — but not the panic-switch the loudest vendor blog posts will tell you to make.
- Do not panic-switch APIs based on a single month's data, especially when the methodology change explains most of the movement.
- Look at the trendlines, not the deltas. A 30-Elo drop from a methodology change is a one-time correction. A sustained 5-point monthly decline across three months is a regression.
- Re-evaluate your code-generation routing. GPT-5.2-codex's +27 Elo gain on Code is significant enough to test in production for a coding-heavy workload.
- Run an internal blind eval on your own prompts. Public leaderboards cannot tell you how a model will behave on your data residency rules, your latency SLAs, or your edge-case prompts. Our walkthrough on building an internal chatbot arena shows how to run one in a week.
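The trendline rule above (a one-time methodology correction versus a sustained decline) can be encoded as a small classifier over monthly Elo snapshots. This is a sketch under assumptions: the function name, the labels, and the choice to reset the decline counter at a methodology-change month are illustrative, and the thresholds simply mirror the "5 points across three months" wording.

```python
def classify_movement(
    monthly_elo: list[float],
    change_months: frozenset[int] = frozenset(),
    min_monthly_drop: float = 5.0,
    months: int = 3,
) -> str:
    """Label an Elo series as 'regression', 'one-time correction', or 'stable'.

    monthly_elo   -- one rating per month, oldest first
    change_months -- indices into monthly_elo where a methodology change landed
    """
    deltas = [later - earlier for earlier, later in zip(monthly_elo, monthly_elo[1:])]
    run = 0
    corrected = False
    for month, delta in enumerate(deltas, start=1):
        if month in change_months:
            # Movement in a methodology-change month is not evidence about quality.
            corrected = corrected or delta < 0
            run = 0
            continue
        if delta <= -min_monthly_drop:
            run += 1
            if run >= months:
                return "regression"
        else:
            run = 0
    return "one-time correction" if corrected else "stable"


if __name__ == "__main__":
    # Hypothetical series: a 32-point drop in the month a pipeline change landed.
    print(classify_movement([1480, 1478, 1446, 1447, 1445], frozenset({2})))
    # Hypothetical series: three consecutive 6-point monthly declines.
    print(classify_movement([1480, 1474, 1468, 1462]))
```

A routing layer could run a check like this per model before changing traffic weights, so that only the "regression" label triggers a switch.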
What to Watch in May 2026
Three things will move the May 2026 leaderboard. First, the Preliminary tag dropping on Gemini 3.1 Pro Preview — its Elo will either consolidate around 1500 or fall back into the 1480s as broader prompt distribution arrives. Second, the Code Arena 2.0 rollout, expected mid-2026, which may further reshape the coding rankings. Third, vendor responses from the three incumbents that lost their top-10 slots in March — at least one will likely submit a new release timed against the May changelog.
This page is refreshed monthly. For the live week-by-week snapshot of the full top-10, see the LMArena top models pillar. For the underlying methodology that explains why the confidence intervals matter more than the headline rank, see our Elo Methodology guide.
Frequently Asked Questions (FAQ)
GPT-5.2-codex was the biggest mover after its January 2026 addition to the Code leaderboard, climbing into top-3 coding territory. On the Text leaderboard, Gemini 3.1 Pro Preview and Claude Opus 4.6 Thinking gained the most ground, both crossing the 1500 Elo threshold within overlapping 95% confidence intervals. GLM-4.7 was the strongest open-weight gainer, entering the top-10 on Text in late March 2026.
The drops were not caused by model degradation. On January 13, 2026, LMArena completed a major data-pipeline overhaul that applied identity-leak detection and quality filtering more consistently across all votes, plus enabled vote de-duplication in text-to-image and video arenas. Models that had benefited from leaked identity signals or duplicate votes lost 20 to 40 Elo points. The validation team described the rank adjustments as minimal but real, with smaller-vote models seeing the largest fluctuations.
GPT-5.2-codex was officially added to the LMArena Code leaderboard on January 23, 2026, and by April it sits in top-3 territory on the Code arena specifically. On the Text leaderboard it ranks lower because it is fine-tuned for code generation rather than general conversation. This is a textbook case of why domain-specific leaderboards matter more than the Overall headline.
On January 13, 2026, LMArena rolled out three filtering changes simultaneously: identity-leak detection (removing votes where a model accidentally revealed its name), quality filtering applied uniformly across all votes, and vote de-duplication enabled in text-to-image and video arenas. The changes resolved several known issues. Models with fewer total votes saw larger score fluctuations, and the three top-10 incumbents in the shift table dropped 29 to 34 Elo points.
Preview models — flagged with the Preliminary tag — display Elo scores that have not stabilized within tight confidence intervals. Gemini 3.1 Pro Preview and Grok 4.20-beta1 both currently sit in the top-5 on Text but carry wider confidence intervals than fully-released competitors. Vendors strategically submit preview models to capture early visibility, but procurement teams should wait for the Preliminary tag to drop before locking in contracts.
The Text leaderboard top-3 — Claude Opus 4.6, Gemini 3.1 Pro Preview, Claude Opus 4.6 Thinking — is unchanged at the headline level but the confidence intervals tightened materially. The bigger story is in Coding: GPT-5.2-codex consolidated its Code arena position, and GLM-4.7 entered top-10 on WebDev. On the Vision leaderboard, GLM-4.6v and ERNIE-5.0-preview entered the rankings between January and February.
Models with high vote-count and clean voting signals — primarily Claude Opus 4.6 (8,945+ votes) and Gemini 3 Pro (39,673+ votes) — held their positions while less-voted models around them lost ground. The pipeline change effectively rewarded models that had earned their Elo through clean, statistically significant voting rather than through noise.
The April 2026 leaderboard is more reliable than any prior month's. The January 2026 vote-pipeline overhaul tightened confidence intervals, eliminated identity-leak distortion, and enforced consistent quality filtering. For procurement-grade decisions, the April 2026 leaderboard is the most statistically defensible LMArena snapshot to date, provided you read it correctly: respect the confidence intervals, ignore preliminary models, and prioritize the leaderboard that matches your actual use case.
Three developments stand out this month. First, GLM-4.7 became the highest-ranked open-weight model on Text and WebDev simultaneously. Second, MiniMax M2.1 Preview entered top-10 on both Text and WebDev despite having fewer than 4,000 votes. Third, three previously top-10 incumbent models lost their slots after the January vote-pipeline overhaul, and the vendors involved have remained silent about the drop.
LMArena does not run on a fixed monthly cadence. Updates are continuous, with major changelog entries posted every 5 to 7 days as new models are added or methodology changes deploy. The next major data-pipeline event is expected when the Code Arena 2.0 rollout completes in mid-2026. This page is refreshed monthly with the latest snapshot — bookmark it and check on the first of each month.
Conclusion: Read the Methodology Before You Read the Rank
The April 2026 LMArena snapshot is a good leaderboard to make decisions from — but only if you read past the headline rank. The top-3 are statistically tied. The biggest mover is in a category most procurement teams ignore. The three incumbents that "dropped" did not get worse — they stopped benefiting from a methodology gap. And the most exciting open-weight story (GLM-4.7) sits at #10 on a leaderboard where #1 to #6 will reshuffle again before May.
For the live week-by-week top-10, see the LMArena top models pillar. For the Bradley-Terry math behind the confidence intervals, see LMArena Elo Explained. And before signing any procurement contract based on a public leaderboard, run an internal blind eval on your own prompts — that is the only signal that ultimately predicts production behavior.