Open-Source LMArena Rankings: 7 Models Closing the Gap
A procurement-grade decode of the open-weight tier on the LMSYS/LMArena Top Models 2026 leaderboard — which 7 models genuinely close the proprietary gap, and which licensing clauses turn the headline rank into a deployment blocker.
- Seven open-weight models now sit in the LMArena top-25 across Text, Code, and WebDev — the Elo gap to Claude Opus 4.6 has narrowed to 42 points at the top of the cohort, the smallest in two years.
- GLM-4.7 is the headline winner — first open-weight model to enter top-10 simultaneously on both Text and WebDev (late March 2026). Apache 2.0 licensed. Genuinely deployable.
- Llama 4 leads on raw Elo but carries the controversial 700M-MAU clause — a hard procurement blocker for any enterprise whose organization (or parent organization) exceeds that threshold.
- The Elo-to-Production gap is real: open-weight models retry 30-40% more often on agentic workloads. Cost-per-accepted-PR favors Claude despite a 5x per-token premium.
- The licensing tier matters more than the rank tier: Apache 2.0 (GLM-4.7, OLMo 3.1) → procurement-clean. Llama Community → conditional. Restricted-research → unusable in production.
OLMo and GLM closed the gap on Claude. Then enterprise procurement teams read the licensing fine print — and four of the top-10 open-weight models became unusable overnight. That's not a marketing line. That's the typical outcome of a real Q1 2026 vendor-risk review of open-source LMArena leaders.
This page is the procurement-grade decode of the open-weight tier — which 7 models genuinely close the proprietary gap, where they sit on the leaderboard, and which licensing clauses turn the headline rank into a deployment blocker. For the broader cross-cluster context and the live top-10 widget, start at the parent pillar: LMSYS/LMArena Top Models 2026. This sub-page zooms specifically into what enterprise architects actually need before signing off on an open-weight deployment.
The official source we cross-reference throughout is the live LMArena leaderboard at lmarena.ai — verify any procurement-grade decision against it directly before any infrastructure commitment.
The Open-Weight Top-7 — Where They Actually Rank
Seven open-weight models now hold credible LMArena positions in May 2026. Approximate Text leaderboard Elo (rounded, with 95% confidence intervals):
LMArena Open-Weight Top-7 — May 2026
Snapshot freshness: updated weekly. Elo scores are rounded; ± values denote the 95% confidence interval.
| Rank | Model | Text Elo | CI | License |
|---|---|---|---|---|
| 1 | GLM-4.7 (#10 Text, top-10 WebDev) | 1462 | ±7 | Apache 2.0 |
| 2 | DeepSeek-V4 (#11 Text) | 1455 | ±8 | DeepSeek License v2 |
| 3 | Llama 4 (#12 Text) | 1452 | ±5 | Llama Community |
| 4 | Qwen 3.5-Coder (#13 Text, top-10 Code) | 1448 | ±9 | Apache 2.0 |
| 5 | OLMo 3.1 (#15 Text — Preliminary) | 1441 | ±11 | Apache 2.0 |
| 6 | Mistral Large 3.1 (#16 Text) | 1438 | ±6 | Mistral Research |
| 7 | Yi-Lightning 2 (#17 Text) | 1432 | ±8 | Yi License |
Source: LMArena Text leaderboard via arena-ai-leaderboards JSON feed. Verify against lmarena.ai before procurement.
That's a 42-Elo-point gap from #1 (Claude Opus 4.6 at 1504) down to #10 GLM-4.7 — meaningful, but not disqualifying for most workloads. Two years ago that gap was 95+ points.
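To put a 42-point gap in concrete terms, the standard Elo model maps a rating difference to an expected head-to-head win rate. A minimal sketch (the logistic curve with a 400-point scale factor is the conventional Elo assumption; LMArena's exact fitting procedure may differ):

```python
def elo_win_probability(rating_a: float, rating_b: float) -> float:
    """Expected probability that model A beats model B under the
    standard Elo model (logistic curve, 400-point scale factor)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

# Claude Opus 4.6 (1504) vs GLM-4.7 (1462): a 42-point gap
p = elo_win_probability(1504, 1462)
print(f"Expected win rate for the leader: {p:.1%}")  # ~56%
```

In other words, a 42-point gap corresponds to the leader winning roughly 56% of head-to-head votes — a real but narrow preference, which is why the gap is "meaningful, but not disqualifying."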
For the full month-over-month picture and the specific January 2026 vote-pipeline overhaul that reshuffled these rankings, see our companion: April 2026 LMArena Shake-Up: 3 Models Crashed Out of the Top-10.
Why GLM-4.7 Became the New Open-Weight Reference
GLM-4.7's late-March 2026 entry into the top-10 on both Text and WebDev simultaneously was the first time any open-weight model achieved that. The reasons are structural, not coincidental.
The capability story:
- Text Elo 1462 (within 42 points of Claude Opus 4.6, the smallest gap any open-weight model has ever held)
- WebDev top-10 — front-end code generation that genuinely competes with proprietary leaders on React, Tailwind, and framework-specific tasks
- Code arena #8 — close behind GPT-5.2-codex, Claude Opus 4.6, and Gemini 3 Pro
- Apache 2.0 license — no MAU caps, no field-of-use restrictions, no acceptable-use policy that creates procurement friction
The procurement implication: for the first time, an enterprise can shortlist an open-weight model on capability grounds alone, deploy it under Apache 2.0, and avoid both vendor lock-in and licensing landmines. That combination did not exist 18 months ago.
For the cost-side analysis — including the 200M-token/month break-even where self-hosting GLM-4.7 stops winning on TCO — see our cross-cluster deep-dive: Open-Source LLM ROI: Why Free Costs 60% More Than Claude.
OLMo 3.1 vs Llama 4 — The Apache 2.0 vs MAU-Cap Contrast
This is the procurement comparison that matters most in 2026, because it captures the entire open-weight licensing tension in two models.
Llama 4:
- Higher headline Elo than OLMo 3.1
- Battle-tested in production at scale across thousands of enterprise deployments
- Critical clause: the Llama Community License caps free commercial use at organizations with under 700 million monthly active users. Above that threshold, an explicit license is required from Meta.
- Practical procurement impact: for almost all enterprises, this is irrelevant. For their parent organizations (consumer brands, telcos, large platforms with end-user MAU), it becomes a compliance question. Some legal teams flag it preemptively as an unmanageable risk.
OLMo 3.1:
- Slightly lower Elo, still flagged Preliminary (vote count under 5,000)
- Apache 2.0 license — no MAU caps, no field-of-use restrictions
- Released by AI2 with full training data, training code, and intermediate checkpoints. The most genuinely "open" of the open-weight cohort.
- Unlike Llama 4, no procurement landmine exists at any scale.
The procurement read: if your organization's MAU is comfortably below 700M and likely to remain there, Llama 4's higher Elo wins. If MAU could plausibly cross that line — or if your legal team weighs license clauses heavily — OLMo 3.1's slightly lower Elo combined with clean Apache 2.0 wins. The 11-point Elo gap matters less than the licensing tier in most enterprise procurement contexts.
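The licensing tiers described above reduce to a simple shortlist filter. A minimal sketch — the tier labels and the helper itself are illustrative, not legal advice; only the 700M-MAU threshold comes from the Llama Community License:

```python
LLAMA_MAU_CAP = 700_000_000  # Llama Community License threshold

def license_tier(model: str, org_mau: int) -> str:
    """Rough procurement triage for the open-weight cohort.
    Tier labels are illustrative shorthand, not legal advice."""
    apache_clean = {"GLM-4.7", "Qwen 3.5-Coder", "OLMo 3.1"}
    if model in apache_clean:
        return "clean"  # Apache 2.0: no MAU caps, no field-of-use limits
    if model == "Llama 4":
        # Free commercial use only below the MAU cap
        return "clean" if org_mau < LLAMA_MAU_CAP else "needs-Meta-license"
    return "review-required"  # Mistral Research, DeepSeek v2, Yi License

print(license_tier("Llama 4", org_mau=50_000_000))   # clean
print(license_tier("Llama 4", org_mau=900_000_000))  # needs-Meta-license
```

The point of encoding it this way: the MAU input is a property of the whole organization (including parents), not of the deploying team — which is exactly where legal reviews get stuck.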
Are Mistral and DeepSeek Still Competitive in 2026?
Yes — but with caveats that didn't apply 12 months ago.
Mistral Large 3.1:
- Strong general-purpose reasoning, particularly on European languages
- Mistral Research License restricts commercial deployment without a paid agreement — a significant procurement step versus pure Apache 2.0
- Mistral Medium and Small variants under Apache 2.0 sit lower on the leaderboard but are deployment-clean
- The "research-only top tier" pattern frustrates teams expecting Apache-grade openness across the catalog
DeepSeek-V4:
- Highest-Elo open-weight model after GLM-4.7
- DeepSeek License v2 is permissive but includes acceptable-use provisions broader than Apache 2.0
- China-origin model — triggers data-residency review in U.S. defense, EU GDPR-strict, and certain financial-services procurement
- Procurement teams in those regulated contexts often eliminate DeepSeek before capability evaluation begins, regardless of Elo
The procurement read: both remain capability-competitive, but the licensing-and-residency layer increasingly determines whether they make the shortlist. The Apache 2.0 cohort (GLM-4.7, Qwen 3.5-Coder, OLMo 3.1) holds an underweighted procurement advantage that a pure-Elo comparison hides.
Self-Hosting a Top-10 Open-Source Model — What It Actually Takes
The most expensive misreading of open-weight rankings is treating "downloadable from Hugging Face" as a synonym for "deployable in production."
The realistic stack for a top-10 open-weight model in production:
- 4-8x H100 or H200 GPUs for moderate concurrency. Quantized 4-bit deployments can reduce this to 2-4 GPUs but degrade quality measurably.
- 1.0 FTE MLOps engineer fully loaded (~$220K U.S. annually, ~$18K/month). Not a part-time devops handoff.
- Inference orchestration software — vLLM, TGI, SGLang, or proprietary equivalent. Either licensed or significant in-house engineering investment.
- Observability stack — typically Datadog or Grafana plus custom LLM-specific tracing. ~$1,500/month at moderate scale.
- Compliance audit overhead — SOC 2 attestation, security reviews, model-update regression testing.
- Capacity planning, autoscaling, and on-call rotation — non-trivial for 24/7 production.
The all-in monthly TCO floor: $18,000–$35,000 for a small-team self-hosted GLM-4.7 deployment. That math breaks even with Claude Opus 4.6 API at approximately 1.5–2 billion output tokens per month — far higher than most teams assume.
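The break-even arithmetic above is simple enough to sanity-check yourself. A sketch, assuming a placeholder API price of ~$15 per million output tokens (substitute the actual rate from your contract) and treating self-hosting's marginal per-token cost as near zero:

```python
def breakeven_tokens_per_month(fixed_monthly_usd: float,
                               api_price_per_million: float) -> float:
    """Monthly output-token volume at which self-hosting's fixed cost
    equals the API bill (self-hosting marginal cost assumed ~0)."""
    return fixed_monthly_usd / api_price_per_million * 1_000_000

# $18K-$35K monthly TCO floor against an assumed ~$15/M token price
low = breakeven_tokens_per_month(18_000, 15.0)   # 1.2B tokens/month
high = breakeven_tokens_per_month(35_000, 15.0)  # ~2.3B tokens/month
print(f"Break-even range: {low/1e9:.1f}B - {high/1e9:.1f}B tokens/month")
```

With these assumed inputs the break-even lands in the same 1.5-2B-token neighborhood cited above — and the exercise makes clear how sensitive the answer is to the fixed-cost estimate, which is the number teams most often understate.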
The hosted-API alternative (Together AI, Fireworks, Anyscale, OpenRouter, Groq) running open-weight models on shared infrastructure is genuinely cheaper for sub-1B-token workloads — and is the option most enterprises should actually compare against the proprietary APIs.
The Open-Weight Capability Tax — Why Per-Token Pricing Lies
The Elo gap from #1 (Claude Opus 4.6, 1504) to top-10 open-weight (GLM-4.7, 1462) understates the production gap on agentic workloads. The leaderboard measures human preference on single-turn chat. Production agentic tools (Aider, Cursor, Cline, Devin) measure something different: the rate at which a model autonomously edits multiple files, runs tests, and produces an accepted PR.
On agentic benchmarks, the gap widens:
- SWE-Bench Verified — Claude Opus 4.6 ~67%, GLM-4.7 ~52%. A 15-percentage-point gap.
- Aider Polyglot — Claude Opus 4.6 ~71%, GLM-4.7 ~58%. Comparable gap.
- Hallucinated-import rate (internal eval) — Claude ~1.8%, top open-weight ~4.1%. More than doubles the retry overhead.
- Agentic-loop retry rate — Claude Opus 4.6 ~9%, top open-weight ~18%. Doubles effective token consumption on complex tasks.
The cost-per-accepted-PR translation: Claude ~$1.85, GLM-4.7 ~$3.20-3.80 once retry costs are included. The 5x per-token pricing advantage of GLM-4.7 partially or fully reverses on engineering workloads. For tolerant workloads (content generation, summarization, data extraction) where retries are cheap, GLM-4.7 still wins on net economics. For agentic engineering, Claude usually wins.
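The reversal can be sketched with a toy cost model. All prices, token counts, and retry multipliers below are illustrative assumptions, not the article's measured figures — the point is the shape of the math, not the exact dollars:

```python
def cost_per_accepted_pr(price_per_million_usd: float,
                         tokens_per_accepted_pr: int,
                         retry_multiplier: float) -> float:
    """Effective cost to land one accepted PR: base token spend
    scaled by the retry/rework multiplier observed for the workload."""
    return (price_per_million_usd * tokens_per_accepted_pr / 1e6
            * retry_multiplier)

# Toy numbers: a 5x cheaper model that needs ~6x the effective tokens
# per accepted PR ends up costing more per unit of shipped work.
expensive = cost_per_accepted_pr(15.0, 120_000, 1.1)  # ~$1.98
cheap = cost_per_accepted_pr(3.0, 360_000, 2.0)       # ~$2.16
```

The design choice matters: denominate cost in accepted PRs (or whatever your unit of shipped work is), and feed in a retry multiplier measured on your own workload rather than the vendor's per-token sticker price.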
The Bottom Line — Capability Is the Easy Part
The open-weight tier has finally narrowed the proprietary capability gap. That's the headline. The procurement-grade story is less flattering.
- Capability: seven open-weight models genuinely compete with the proprietary tier on capability. The Elo gap is real but no longer disqualifying.
- Licensing: the Apache 2.0 cohort (GLM-4.7, Qwen 3.5-Coder, OLMo 3.1) is the only fully clean procurement option. Llama Community, Mistral Research, and DeepSeek License all carry asterisks that matter at enterprise scale.
- Production economics: the headline per-token cost advantage often reverses on agentic workloads where retry rates compound. Cost-per-accepted-PR is the metric that matters; cost-per-token is the metric vendors quote.
- Self-hosting: the right answer for above ~1.5B tokens monthly or for top-secret data. The wrong answer for almost everything else, where hosted-API providers (Together, Fireworks, Groq) deliver open-weight pricing without the ops overhead.
Frequently Asked Questions (FAQ)
Which open-weight model ranks highest on LMArena in 2026?
GLM-4.7 — the first open-weight model to enter the LMArena top-10 on both Text and WebDev simultaneously, in late March 2026 (Elo 1462 ±7). Apache 2.0 licensed, no MAU caps, deployment-clean for enterprise procurement. DeepSeek-V4 sits closely behind.
Is Llama 4 or OLMo 3.1 better for enterprise deployment?
Llama 4 leads on raw Elo (~1452 vs OLMo 3.1's ~1441) — an 11-point gap. OLMo 3.1's Apache 2.0 license is genuinely unrestricted, while Llama 4's Community License caps free use at 700M MAU. For most enterprises, Llama wins; for parent organizations near that threshold, OLMo wins.
Are Mistral and DeepSeek still competitive in 2026?
Yes on capability — both sit in the top-16 of the Text leaderboard. Mistral Large 3.1 carries a Research License that restricts free commercial use. DeepSeek-V4 has the higher Elo but triggers data-residency review in regulated U.S., EU, and defense procurement contexts.
Does self-hosting a top-10 open-weight model require significant infrastructure?
Yes — 4-8 H100/H200 GPUs plus a 1.0 FTE MLOps engineer plus orchestration plus observability. Realistic monthly TCO floor: $18,000-$35,000. Below 1.5-2 billion output tokens monthly, hosted-API providers (Together, Fireworks, Groq) typically beat self-hosting on net economics.
How large is the gap between open-weight and proprietary models on LMArena?
Approximately 42 Elo points — the smallest in two years. Claude Opus 4.6 leads Text at 1504; GLM-4.7 sits at 1462. On agentic benchmarks (SWE-Bench Verified, Aider Polyglot), the gap widens to 13-15 percentage points. The headline Elo gap understates production capability differences.
Which open-weight licenses allow unrestricted commercial use?
Only the Apache 2.0 cohort (GLM-4.7, Qwen 3.5-Coder, OLMo 3.1) allows genuinely unrestricted commercial use. Llama 4 caps free use at 700M MAU. The Mistral Research License restricts top-tier commercial deployment without a paid agreement. DeepSeek License v2 is permissive but broader than Apache.
Which open-weight model is best for fine-tuning?
OLMo 3.1 leads for fine-tuning research because AI2 published full training data and intermediate checkpoints — true reproducibility. For production fine-tuning, Llama 4 and GLM-4.7 have the strongest tooling ecosystems. Qwen 3.5-Coder dominates code-specific fine-tuning.
Is GLM-4.7 better than Llama 4 for coding?
Yes, by a small but consistent margin. GLM-4.7 sits in the LMArena Code top-10 and entered the top-10 on WebDev in late March 2026 — Llama 4 sits roughly 3 positions lower on Code. On Aider Polyglot, GLM-4.7 leads Llama 4 by approximately 4-6 percentage points.
How do I calculate open-weight TCO versus a proprietary API?
Six steps: estimate monthly token volume, calculate the API baseline cost, sum self-hosting fixed costs, find the break-even token volume, add the agentic-workload capability tax (~30-40%), and validate with a one-week pilot. Most teams discover their open-weight TCO is 30-50% higher than projected.
Is self-hosting an open-source model more expensive than a hosted API?
Yes — at moderate scale. GPU rental plus MLOps salary plus orchestration plus observability typically totals $18,000-$35,000/month for a small team. Hosted-API providers amortize this across many customers. Self-hosting only beats hosted-API pricing above approximately 800M-1.2B tokens monthly.
Sources & References
- LMArena (official) — Live LLM leaderboards including the open-source ranking baseline.
- LMArena Leaderboard Changelog — Official record of model additions and methodology changes.
- arena-ai-leaderboards JSON Feed — Open mirror of official LMArena data for programmatic access.
- AI2 OLMo — Reference for OLMo 3.1 training data, Apache 2.0 license, and full reproducibility.
- Llama 4 Community License — Official license including the 700M-MAU clause.
- Aider Polyglot Leaderboard — Multi-file agentic coding benchmark for capability tax calculation.
- SWE-Bench Verified — Real GitHub issue resolution benchmark.
- Together AI Pricing — Reference hosted-API pricing for GLM, DeepSeek, Llama, Qwen, Mistral.