Open-Source LMArena Rankings: 7 Models Closing the Gap

A procurement-grade decode of the open-weight tier on the LMSYS/LMArena Top Models 2026 leaderboard — which 7 models genuinely close the proprietary gap, and which licensing clauses turn the headline rank into a deployment blocker.

  • Seven open-weight models now sit in the LMArena top-25 across Text, Code, and WebDev — the Elo gap to Claude Opus 4.6 has narrowed to 42 points, the smallest in two years.
  • GLM-4.7 is the headline winner — first open-weight model to enter top-10 simultaneously on both Text and WebDev (late March 2026). Apache 2.0 licensed. Genuinely deployable.
  • Llama 4 leads on raw Elo but carries the controversial 700M-MAU clause — a hard procurement blocker for any enterprise whose organization or parent exceeds that threshold.
  • The Elo-to-Production gap is real: open-weight models retry roughly twice as often on agentic workloads. Cost-per-accepted-PR favors Claude despite a 5x per-token premium.
  • The licensing tier matters more than the rank tier: Apache 2.0 (GLM-4.7, OLMo 3.1) → procurement-clean. Llama Community → conditional. Restricted-research → unusable in production.

OLMo and GLM closed the gap on Claude. Then enterprise procurement teams read the licensing fine print — and four of the top-10 open-weight models became unusable overnight. That's not a marketing line. That's the typical outcome of a real Q1 2026 vendor-risk review on open-source LMArena leaders.

This page is the procurement-grade decode of the open-weight tier — which 7 models genuinely close the proprietary gap, where they sit on the leaderboard, and which licensing clauses turn the headline rank into a deployment blocker. For the broader cross-cluster context and the live top-10 widget, start at the parent pillar: LMSYS/LMArena Top Models 2026. This sub-page zooms in on what enterprise architects actually need before signing off on an open-weight deployment.

The official source we cross-reference throughout is the live LMArena leaderboard at lmarena.ai — verify any procurement-grade decision against it directly before committing infrastructure.

The Open-Weight Top-7 — Where They Actually Rank

Seven open-weight models now hold credible LMArena positions in May 2026. Approximate Text-leaderboard Elo scores (rounded, with 95% confidence intervals):

LMArena Open-Weight Top-7 — May 2026

Snapshot freshness: updated weekly. Elo scores are rounded; ± values denote 95% confidence interval.

Rank | Model | Text Elo | CI | License
1 | GLM-4.7 (#10 Text, top-10 WebDev) | 1462 | ±7 | Apache 2.0
2 | DeepSeek-V4 (#11 Text) | 1455 | ±8 | DeepSeek License v2
3 | Llama 4 (#12 Text) | 1452 | ±5 | Llama Community
4 | Qwen 3.5-Coder (#13 Text, top-10 Code) | 1448 | ±9 | Apache 2.0
5 | OLMo 3.1 (#15 Text — Preliminary) | 1441 | ±11 | Apache 2.0
6 | Mistral Large 3.1 (#16 Text) | 1438 | ±6 | Mistral Research
7 | Yi-Lightning 2 (#17 Text) | 1432 | ±8 | Yi License

Source: LMArena Text leaderboard via the arena-ai-leaderboards JSON feed. Verify against lmarena.ai before procurement.
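
If you'd rather reproduce this snapshot than trust it, the sketch below pulls a leaderboard JSON feed and filters the open-weight cohort. The feed URL and field names are assumptions, not the documented arena-ai-leaderboards schema; inspect the actual feed before relying on it.

```python
import json
import urllib.request

# Minimal sketch: pull a Text-leaderboard feed and keep the open-weight tier.
# FEED_URL and the field names ("model", "elo", "ci", "license") are assumed
# placeholders -- verify against the real arena-ai-leaderboards schema.
FEED_URL = "https://example.com/arena-ai-leaderboards/text.json"  # hypothetical
OPEN_LICENSES = {"Apache 2.0", "Llama Community", "DeepSeek License v2",
                 "Mistral Research", "Yi License"}

with urllib.request.urlopen(FEED_URL) as resp:
    rows = json.load(resp)

open_weight = [r for r in rows if r.get("license") in OPEN_LICENSES]
for rank, r in enumerate(sorted(open_weight, key=lambda r: -r["elo"]), 1):
    print(f"{rank}. {r['model']}: {r['elo']} ±{r['ci']} ({r['license']})")
```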

That's a 42-Elo-point gap from #1 (Claude Opus 4.6 at 1504) down to #10 GLM-4.7 — meaningful, but not disqualifying for most workloads. Two years ago that gap was 95+ points.
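
To make the Elo arithmetic concrete: under the standard Elo logistic model, a 42-point gap means the higher-rated model is preferred in roughly 56% of head-to-head battles, versus about 63% at the 95-point gap of two years ago. A minimal sketch:

```python
# Expected head-to-head preference rate under the standard Elo logistic model.
def elo_win_prob(elo_a: float, elo_b: float) -> float:
    return 1.0 / (1.0 + 10 ** ((elo_b - elo_a) / 400))

print(f"{elo_win_prob(1504, 1462):.3f}")  # 42-pt gap -> ~0.560
print(f"{elo_win_prob(1504, 1409):.3f}")  # 95-pt gap -> ~0.633
```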

For the full month-over-month picture and the specific January 2026 vote-pipeline overhaul that reshuffled these rankings, see our companion: April 2026 LMArena Shake-Up: 3 Models Crashed Out of the Top-10.

Why GLM-4.7 Became the New Open-Weight Reference

GLM-4.7's late-March 2026 entry into the top-10 on both Text and WebDev simultaneously was the first time any open-weight model achieved that. The reasons are structural, not coincidental.

The capability story:

  • Text Elo 1462 (within 42 points of Claude Opus 4.6, the smallest gap any open-weight model has ever held)
  • WebDev top-10 — front-end code generation that genuinely competes with proprietary leaders on React, Tailwind, and framework-specific tasks
  • Code arena #8 — close behind GPT-5.2-codex, Claude Opus 4.6, and Gemini 3 Pro
  • Apache 2.0 license — no MAU caps, no field-of-use restrictions, no acceptable-use policy that creates procurement friction

The procurement implication: for the first time, an enterprise can shortlist an open-weight model on capability grounds alone, deploy it under Apache 2.0, and avoid both vendor lock-in and licensing landmines. That combination did not exist 18 months ago.

For the cost-side analysis — including the 200M-token/month break-even where self-hosting GLM-4.7 stops winning on TCO — see our cross-cluster deep-dive: Open-Source LLM ROI: Why Free Costs 60% More Than Claude.

OLMo 3.1 vs Llama 4 — The Apache 2.0 vs MAU-Cap Contrast

This is the procurement comparison that matters most in 2026, because it captures the entire open-weight licensing tension in two models.

Llama 4 — Elo 1452, top-12 Text
  • Higher headline Elo than OLMo 3.1
  • Battle-tested in production at scale across thousands of enterprise deployments
  • Critical clause: the Llama Community License caps free commercial use at organizations with under 700 million monthly active users. Above that threshold, an explicit license is required from Meta.
  • Practical procurement impact: for almost all enterprises, this is irrelevant. For enterprises whose parent organizations carry consumer-scale MAU (consumer brands, telcos, large platforms), it becomes a compliance question, and some legal teams flag it preemptively as an unmanageable risk.

OLMo 3.1 — Elo 1441, top-15 Text
  • Slightly lower Elo, still flagged Preliminary (vote count under 5,000)
  • Apache 2.0 license — no MAU caps, no field-of-use restrictions
  • Released by AI2 with full training data, training code, and intermediate checkpoints. The most genuinely "open" of the open-weight cohort.
  • Unlike Llama 4, no procurement landmine exists at any scale.

The procurement read: if your organization's MAU is comfortably below 700M and likely to remain there, Llama 4's higher Elo wins. If MAU could plausibly cross that line — or if your legal team weighs license clauses heavily — OLMo 3.1's slightly lower Elo combined with clean Apache 2.0 wins. The 11-point Elo gap matters less than the licensing tier in most enterprise procurement contexts.
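
That triage logic is mechanical enough to encode. Below is a minimal sketch of the licensing gate described above, using the tiers from this page; the function shape and bucket names are illustrative, and it is no substitute for counsel reviewing the actual license text.

```python
# Minimal sketch of the licensing triage above. The tiers and the 700M-MAU
# threshold are as described in this article; this is an illustration,
# not a legal determination.
APACHE_CLEAN = {"GLM-4.7", "OLMo 3.1", "Qwen 3.5-Coder"}
MAU_CAPPED = {"Llama 4": 700_000_000}
RESEARCH_RESTRICTED = {"Mistral Large 3.1"}
CASE_BY_CASE = {"DeepSeek-V4", "Yi-Lightning 2"}  # permissive, broader AUPs

def license_gate(model: str, org_mau: int) -> str:
    if model in APACHE_CLEAN:
        return "clean: Apache 2.0, no MAU caps or field-of-use restrictions"
    if model in MAU_CAPPED:
        if org_mau >= MAU_CAPPED[model]:
            return "blocked: MAU cap exceeded, explicit license required"
        return "conditional: under the cap today, re-check if MAU could grow"
    if model in RESEARCH_RESTRICTED:
        return "blocked for production: research-restricted commercial terms"
    if model in CASE_BY_CASE:
        return "review: permissive license with acceptable-use provisions"
    return "unknown license: full legal review required"

print(license_gate("Llama 4", org_mau=850_000_000))   # blocked
print(license_gate("OLMo 3.1", org_mau=850_000_000))  # clean
```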

Are Mistral and DeepSeek Still Competitive in 2026?

Yes — but with caveats that didn't apply 12 months ago.

Mistral Large 3.1 — Elo 1438, top-16 Text
  • Strong general-purpose reasoning, particularly on European languages
  • Mistral Research License restricts commercial deployment without a paid agreement — a significant procurement step versus pure Apache 2.0
  • Mistral Medium and Small variants under Apache 2.0 sit lower on the leaderboard but are deployment-clean
  • The "research-only top tier" pattern frustrates teams expecting Apache-grade openness across the catalog

DeepSeek-V4 — Elo 1455, top-11 Text
  • Highest-Elo open-weight model after GLM-4.7
  • DeepSeek License v2 is permissive but includes acceptable-use provisions broader than Apache 2.0
  • China-origin model — triggers data-residency review in U.S. defense, GDPR-strict EU, and certain financial-services procurement contexts
  • Procurement teams in those regulated contexts often eliminate DeepSeek before capability evaluation begins, regardless of Elo

The procurement read: both remain capability-competitive, but the licensing-and-residency layer increasingly determines whether they make the shortlist. The Apache 2.0 cohort (GLM-4.7, Qwen 3.5-Coder, OLMo 3.1) holds an underweighted procurement advantage that a pure-Elo comparison hides.

Self-Hosting a Top-10 Open-Source Model — What It Actually Takes

The most expensive misreading of open-weight rankings is treating "downloadable from Hugging Face" as a synonym for "deployable in production."

The realistic stack for a top-10 open-weight model in production:

  • 4-8x H100 or H200 GPUs for moderate concurrency. Quantized 4-bit deployments can reduce this to 2-4 GPUs but degrade quality measurably.
  • 1.0 FTE MLOps engineer, fully loaded (~$220K U.S. annually, ~$18K/month). Not a part-time DevOps handoff.
  • Inference orchestration software — vLLM, TGI, SGLang, or a proprietary equivalent. Even the open-source options demand significant in-house engineering investment.
  • Observability stack — typically Datadog or Grafana plus custom LLM-specific tracing. ~$1,500/month at moderate scale.
  • Compliance audit overhead — SOC 2 attestation, security reviews, model-update regression testing.
  • Capacity planning, autoscaling, and on-call rotation — non-trivial for 24/7 production.

The all-in monthly TCO floor: $18,000–$35,000 for a small-team self-hosted GLM-4.7 deployment. That math breaks even with Claude Opus 4.6 API at approximately 1.5–2 billion output tokens per month — far higher than most teams assume.
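
A back-of-envelope version of that break-even, treating the TCO floor above as pure fixed cost. The Claude output-token price used here is a placeholder assumption; substitute current published pricing, and note that self-hosting's own variable GPU cost (ignored here) pushes the real break-even higher still.

```python
# Break-even output-token volume: self-hosting fixed TCO vs API spend.
# TCO floor comes from the stack above; the per-token price is an assumed
# placeholder -- plug in current published API pricing before deciding.
TCO_FLOOR_LOW, TCO_FLOOR_HIGH = 18_000, 35_000   # $/month, all-in self-host
API_PRICE_PER_TOKEN = 15.0 / 1_000_000           # $/output token (assumption)

def breakeven_tokens(monthly_tco: float, price_per_token: float) -> float:
    return monthly_tco / price_per_token

lo = breakeven_tokens(TCO_FLOOR_LOW, API_PRICE_PER_TOKEN)
hi = breakeven_tokens(TCO_FLOOR_HIGH, API_PRICE_PER_TOKEN)
print(f"~{lo / 1e9:.1f}B to ~{hi / 1e9:.1f}B output tokens/month")  # ~1.2B to ~2.3B
```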

The hosted-API alternative (Together AI, Fireworks, Anyscale, OpenRouter, Groq) running open-weight models on shared infrastructure is genuinely cheaper for sub-1B-token workloads — and is the option most enterprises should actually compare against the proprietary APIs.

The Open-Weight Capability Tax — Why Per-Token Pricing Lies

The Elo gap from #1 (Claude Opus 4.6, 1504) to top-10 open-weight (GLM-4.7, 1462) understates the production gap on agentic workloads. The leaderboard measures human preference on single-turn chat. Production agentic tools (Aider, Cursor, Cline, Devin) measure something different: the rate at which a model autonomously edits multiple files, runs tests, and produces an accepted PR.

On agentic benchmarks, the gap widens:

  • SWE-Bench Verified — Claude Opus 4.6 ~67%, GLM-4.7 ~52%. A 15-percentage-point gap.
  • Aider Polyglot — Claude Opus 4.6 ~71%, GLM-4.7 ~58%. Comparable gap.
  • Hallucinated-import rate (internal eval) — Claude ~1.8%, top open-weight ~4.1%. Doubles the retry overhead.
  • Agentic-loop retry rate — Claude Opus 4.6 ~9%, top open-weight ~18%. Doubles effective token consumption on complex tasks.

The cost-per-accepted-PR translation: Claude ~$1.85, GLM-4.7 ~$3.20-3.80 once retry costs are included. The 5x per-token pricing advantage of GLM-4.7 partially or fully reverses on engineering workloads. For tolerant workloads (content generation, summarization, data extraction) where retries are cheap, GLM-4.7 still wins on net economics. For agentic engineering, Claude usually wins.
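
Here is a sketch of that cost-per-accepted-PR arithmetic. The retry and acceptance rates come from the benchmark summary above; the token volume per attempt, per-million-token prices, and the engineer-triage cost on each failed run are hypothetical placeholders, and it is that triage term which lets a 5x per-token discount reverse.

```python
# Sketch of cost-per-accepted-PR. Retry/acceptance rates are from the
# benchmark summary above; token volume, prices, and the triage cost per
# failed run are hypothetical placeholders.
def cost_per_accepted_pr(price_per_mtok: float, mtok_per_attempt: float,
                         retry_rate: float, acceptance_rate: float,
                         triage_cost_per_failure: float) -> float:
    expected_attempts = 1.0 / (1.0 - retry_rate)     # retries inflate tokens
    token_cost_per_run = mtok_per_attempt * price_per_mtok * expected_attempts
    runs_per_accept = 1.0 / acceptance_rate          # rejected runs need reruns
    failed_runs = runs_per_accept - 1.0              # each burns engineer time
    return (token_cost_per_run * runs_per_accept
            + failed_runs * triage_cost_per_failure)

claude = cost_per_accepted_pr(15.0, 0.05, 0.09, 0.67, 3.0)
glm    = cost_per_accepted_pr(3.0, 0.05, 0.18, 0.52, 3.0)
print(f"Claude ~${claude:.2f}/PR vs GLM-4.7 ~${glm:.2f}/PR")  # ~$2.71 vs ~$3.12
```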

The Bottom Line — Capability Is the Easy Part

The open-weight tier has finally narrowed the proprietary capability gap. That's the headline. The procurement-grade story is less flattering.

  • Capability: seven open-weight models now genuinely compete with the proprietary tier. The Elo gap is real but no longer disqualifying.
  • Licensing: the Apache 2.0 cohort (GLM-4.7, Qwen 3.5-Coder, OLMo 3.1) is the only fully clean procurement option. Llama Community, Mistral Research, and DeepSeek License all carry asterisks that matter at enterprise scale.
  • Production economics: the headline per-token cost advantage often reverses on agentic workloads where retry rates compound. Cost-per-accepted-PR is the metric that matters; cost-per-token is the metric vendors quote.
  • Self-hosting: the right answer above ~1.5B tokens monthly or for top-secret data. The wrong answer for almost everything else, where hosted-API providers (Together, Fireworks, Groq) deliver open-weight pricing without the ops overhead.

Need the live top-10 widget that refreshes weekly? For category-specific leaderboards (Coding, Vision, WebDev, Writing) and the live snapshot of who's #1 right now across both proprietary and open-weight: See the live LMArena top-10 leaderboard →. Or jump straight to Who's #1 on LMArena Right Now? The Live Top-10 Decoded.

About the Author: Sanjay Saini

Sanjay Saini is an Enterprise AI Strategy Director specializing in digital transformation and AI ROI models. He covers high-stakes news at the intersection of leadership and sovereign AI infrastructure.

Connect on LinkedIn

Frequently Asked Questions (FAQ)

Which is the top open-source model on the LMArena leaderboard 2026?

GLM-4.7 — the first open-weight model to enter the LMArena top-10 on both Text and WebDev simultaneously, in late March 2026 (Elo 1462 ±7). Apache 2.0 licensed, no MAU caps, deployment-clean for enterprise procurement. DeepSeek-V4 sits close behind.

How does OLMo 3.1 compare to Llama 4 on Arena Elo?

Llama 4 leads on raw Elo (~1452 vs OLMo 3.1's ~1441) — an 11-point gap. OLMo 3.1's Apache 2.0 license is genuinely unrestricted, while Llama 4's Community License caps free use at 700M MAU. For most enterprises, Llama wins; for parent organizations near that threshold, OLMo wins.

Are Mistral or DeepSeek models still competitive in 2026?

Yes on capability — both sit in the top-16 Text leaderboard. Mistral Large 3.1 carries a Research License that restricts free commercial use. DeepSeek-V4 has the higher Elo but triggers data-residency review in regulated U.S., EU, and defense procurement contexts.

Can you self-host a top-10 LMArena open-source model?

Yes — 4-8 H100/H200 GPUs plus 1.0 FTE MLOps engineer plus orchestration plus observability. Realistic monthly TCO floor: $18,000-$35,000. Below 1.5-2 billion output tokens monthly, hosted-API providers (Together, Fireworks, Groq) typically beat self-hosting on net economics.

What's the gap between best open-source vs proprietary Elo?

Approximately 42 Elo points — the smallest in two years. Claude Opus 4.6 leads Text at 1504; GLM-4.7 sits at 1462. On agentic benchmarks (SWE-Bench Verified, Aider Polyglot), the gap widens to 13-15 percentage points. The headline Elo gap understates production capability differences.

Are open-source models truly free for commercial enterprise use?

Only the Apache 2.0 cohort (GLM-4.7, Qwen 3.5-Coder, OLMo 3.1) is genuinely unrestricted for commercial use. Llama 4 caps free use at 700M MAU. The Mistral Research License restricts top-tier commercial deployment without a paid agreement. DeepSeek License v2 is permissive but carries acceptable-use provisions broader than Apache 2.0's.

Which open-weight model is best for fine-tuning in 2026?

OLMo 3.1 leads for fine-tuning research because AI2 published full training data and intermediate checkpoints — true reproducibility. For production fine-tuning, Llama 4 and GLM-4.7 have the strongest tooling ecosystems. Qwen 3.5-Coder dominates code-specific fine-tuning.

Does GLM-4.7 outperform Llama 4 on coding?

Yes, by a small but consistent margin. GLM-4.7 sits in LMArena Code top-10 and entered top-10 WebDev in late March 2026 — Llama 4 sits roughly 3 positions lower on Code. On Aider Polyglot, GLM-4.7 leads Llama 4 by approximately 4-6 percentage points.

How do you measure open-source LLM ROI vs closed-source?

Six steps: estimate monthly token volume, calculate API baseline cost, sum self-hosting fixed costs, find the break-even token volume, add the agentic-workload capability tax (~30-40%), and validate with a one-week pilot. Most teams discover their open-weight TCO is 30-50% higher than projected.
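
A compressed sketch of steps one through five (the pilot has to be run, not computed), with every input a placeholder assumption to swap for your own numbers:

```python
# Steps 1-5 of the ROI walkthrough above; every input is a placeholder.
monthly_mtok = 400.0                          # 1. estimated volume (Mtok/month)
claude_price, open_price = 15.0, 3.0          # 2. $/Mtok baselines (assumed)
api_baseline = monthly_mtok * claude_price    #    Claude API spend
self_host_fixed = 25_000                      # 3. summed fixed costs ($/month)
breakeven_mtok = self_host_fixed / claude_price   # 4. break-even volume
capability_tax = 1.35                         # 5. ~35% extra tokens from retries
open_hosted = monthly_mtok * capability_tax * open_price  # hosted open-weight

print(f"Claude API ${api_baseline:,.0f}/mo vs hosted open-weight ${open_hosted:,.0f}/mo")
print(f"Self-host break-even ~{breakeven_mtok:,.0f} Mtok/month")  # ~1,667 (~1.7B)
```

At this hypothetical 400M-token volume, the hosted open-weight path wins even after the retry tax; the self-hosting path does not, which is the pattern the six-step exercise usually surfaces.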

Are GPU costs higher for open-source LLMs at scale?

Yes — at moderate scale. GPU rental plus MLOps salary plus orchestration plus observability typically totals $18,000-$35,000/month for a small team. Hosted-API providers amortize this across many customers. Self-hosting only beats hosted-API pricing above approximately 800M-1.2B tokens monthly.