Why Karpathy Just Killed Vibe Coding (Audit Inside) (May 2026)

On 6 February 2026, Andrej Karpathy — the engineer who coined "vibe coding" — declared the term passé and quietly hand-coded his next project.

Yet most enterprise CTOs are still shipping under that deprecated label, while AI-generated code now produces flaws at 2.74× the rate of human-written code and March 2026 alone logged 35 CVEs attributed to AI-generated code (up from 6 in January).

This is the definitive CTO audit for the post-vibe-coding world — the seven gates of agentic engineering, the discipline Karpathy is implicitly endorsing, and the exact compliance posture you need before your next sprint review.

Executive Summary — The 7-Gate Agentic Engineering Audit

If you skim nothing else, this is the snapshot for your next board memo.

#	Gate	What it Stops	Owner	Compliance Anchor
1	Intent Capture	Prompt drift, untraceable code provenance	Lead Engineer	EU AI Act Art. 12
2	Scoped Agent Execution	Blast-radius creep, secrets exfiltration	Platform Eng	NIST AI RMF MAP-3
3	Diff-Level Human Review	Silent regressions, hallucinated APIs	Senior Eng	EU AI Act Art. 14
4	Adversarial Test Synthesis	XSS, SSRF, prompt-injection regressions	AppSec	OWASP LLM Top 10
5	Provenance & SBOM Tagging	License poisoning, IP contamination	Release Mgmt	ISO/IEC 42001 §9
6	Production Readiness Score	Borderline merges, blast-radius mismatch	Eng Manager	SOC 2 CC8.1
7	Post-Merge Telemetry Loop	Drift in agent quality month-over-month	SRE / DevEx	EU AI Act Art. 15

The bottom line: vibe coding shipped outputs. Agentic engineering ships outputs plus oversight artifacts.

If your team cannot produce gate-by-gate evidence for any merged AI-generated commit in the last 90 days, you are non-compliant with the EU AI Act provisions taking effect 2 August 2026 — regardless of where you are headquartered, if you ship to EU users.

1. What Karpathy Actually Said in February 2026 — and Why It Reframes Everything

The pivot was quiet but unambiguous. In a short post, Karpathy noted that programming via LLM agents "is increasingly becoming a default workflow for professionals, except with more oversight and scrutiny" — and that the goal is to claim agent leverage "without any compromise on the quality of the software."

He then disclosed that his own latest project was hand-coded.

Read carefully, this is not nostalgia. It is a public concession that the original vibe-coding posture — describe what you want, accept what you get — has failed the production bar.

The discipline he is gesturing toward already has a name in industry: agentic engineering.

For enterprise leaders, this matters for three reasons.

First, terminology drives policy. Internal AI usage policies, vendor contracts, and even SOC 2 attestations still reference "vibe coding" or "AI-assisted coding" — terms that now signal an immature governance posture to auditors and acquirers.

Second, the empirical case has hardened. A 470-PR analysis published in early 2026 found AI-written code produces flaws at 2.74× the rate of human-written code, and OWASP-adjacent studies put the XSS protection failure rate at 86% across major coding assistants.

Third, the legal clock is running. EU AI Act enforcement of general-purpose AI provisions begins 2 August 2026, and high-risk system obligations follow on the same calendar.

Article 14 (human oversight) and Article 15 (accuracy, robustness, cybersecurity) are written exactly to penalize the vibe-coded SDLC.

If your engineering org has not yet evolved its terminology, its review gates, and its evidence trail, you are not running a "modern" workflow — you are running a deprecated one with a compliance liability attached.

PMO Warning: When auditors arrive in late 2026, they will not ask whether your team uses AI. They will ask for the review-trail artifacts for every AI-generated commit shipped to production. No artifact, no defense.

2. Vibe Coding vs Agentic Engineering — The Definitional Line

The distinction is not branding. It is structural.

Vibe coding treats the LLM as a creative collaborator: a developer issues a natural-language description, accepts large generated blocks, iterates on feel, and ships.

The artifact is the code; the prompt is ephemeral.

Agentic engineering treats the LLM as a delegated agent under formal oversight.

The developer still authors intent, but every agent action is scoped, logged, diff-reviewed, adversarially tested, and tagged with provenance. The artifact is the code and the oversight trail.

The table below makes the contrast explicit for an enterprise PMO context.

Dimension	Vibe Coding (Deprecated)	Agentic Engineering (Current)
Primary intent capture	Free-form chat prompt	Structured spec + acceptance criteria
Agent scope	Open-ended ("build me X")	Scoped task with explicit blast-radius limits
Review depth	"Looks right, merge"	Diff-level, gate-by-gate
Test posture	Generated code passes generated tests	Adversarial test synthesis (XSS, SSRF, prompt-injection regressions)
Provenance	Untagged AI commits	SBOM-tagged, signed agent attribution
Compliance fit	Fails EU AI Act Art. 14 & 15	Designed against Art. 14 & 15, NIST AI RMF, ISO 42001
Resume signal	Junior-to-mid	Senior-to-staff
Insurance posture	Often excluded as gross negligence	Coverable with documented controls

If your current workflow falls in the left column, your organization is carrying — at minimum — a documentation debt that will surface in your next audit or breach response.

The longer-form comparison, including a side-by-side rewrite of three real PR templates, lives in our companion analysis on the terminology shift and what it means for hiring.

3. The Hard Data — Why the CVE Curve Forced the Pivot

The vibe-coding model collapsed because the numbers got too loud to ignore.

Between January and March 2026, several independent measurements aligned in a way that left no room for the "AI code is fine if you just review it" defense.

The headline numbers, drawn from published 2026 studies and CVE trackers:

AI-generated code contains exploitable vulnerabilities in 40–62% of samples, depending on study and language.
AI-written code produces flaws at 2.74× the rate of human-written code across a 470-PR analysis.
Cross-site scripting protection fails 86% of the time in AI-generated handlers.
35 CVEs in March 2026 alone were directly attributable to AI-generated code — up from 6 in January.

The cost-per-CVE on AI code runs higher than the human-code baseline because root-cause analysis must reconstruct the agent session that produced it.

What changed is not the AI. The models got better.

What changed is deployment velocity — teams ship 4–10× more AI-generated code per engineer-week than 18 months ago, so even a stable per-line defect rate produces a CVE-curve that looks like an outbreak.
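
The volume effect is plain arithmetic, and a minimal sketch makes it concrete. The baseline volume and the defect rate below are hypothetical inputs chosen only to show the scaling; neither figure comes from the studies cited above.

```python
def expected_defects(kloc_shipped: float, defects_per_kloc: float) -> float:
    """Expected defect count when the per-line defect rate is held constant."""
    return kloc_shipped * defects_per_kloc


# Hypothetical inputs, for illustration only.
BASELINE_KLOC = 10.0     # AI-generated KLOC shipped in some fixed period, 18 months ago
DEFECTS_PER_KLOC = 0.5   # assumed stable defect rate

baseline = expected_defects(BASELINE_KLOC, DEFECTS_PER_KLOC)
low = expected_defects(BASELINE_KLOC * 4, DEFECTS_PER_KLOC)    # 4x the volume
high = expected_defects(BASELINE_KLOC * 10, DEFECTS_PER_KLOC)  # 10x the volume

print(baseline, low, high)
```

With the rate unchanged, the defect count scales one-for-one with volume: a 4–10× increase in shipped code yields 4–10× the defects, even though no individual line got worse.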

The mathematical exposure is fully decomposed, with the OWASP-mapped breakdown by tool and by language, in our deep-dive on AI Code CVE Statistics 2026.

Compliance Note: Under the EU AI Act, non-compliance with the obligations for high-risk AI systems carries fines of up to €15 million or 3% of global annual turnover, whichever is higher. Knowingly shipping AI-generated code without documented adversarial testing is precisely the conduct Article 15 was drafted to penalize.

4. The Information Gain — Why "More Review" Is Actually the Wrong Fix

This is the section most CTO playbooks get wrong. The default reflex when CVE rates rise is to mandate more human review — two reviewers per PR, longer review windows, more SAST scans.

The data says this does not work.

Here is the counter-intuitive finding from the 2026 reviewer-fatigue studies: doubling the reviewer count on AI-generated PRs increases catch rate by only 11–14%, while increasing time-to-merge by 60–80%.

Worse, on PRs over 400 lines, the second reviewer's catch rate drops below the first reviewer's by approximately 35% — a documented effect of inherited-trust bias. The second reviewer assumes the first reviewer caught the obvious flaws and skims.

The real problem is not review quantity. It is review modality.

Human reviewers reading AI-generated code look at the same surface (syntax, structure, naming) where AI code is already strongest. The flaws hide in semantically valid but contextually wrong constructs — authentication bypasses that pass linting, sanitization calls that don't sanitize the right field, race conditions in async patterns the model has never been corrected on.

The fix is not more eyes. The fix is different machinery: adversarial test synthesis at gate 4, scoped agent execution at gate 2, and a production readiness score at gate 6. Each replaces human attention with a different kind of attention that the AI's failure modes cannot pattern-match around.

Pro Tip: If your current AI code review policy is "two senior eyes on every PR over 100 lines," you are spending senior-engineer time on the work AI code is least likely to fail at, and you have no defense against the work it is most likely to fail at. Move two of those review-hours per week into adversarial test synthesis and watch your catch rate climb without adding headcount.

The full reviewer-fatigue dataset and the six-gate HITL playbook that replaces the brute-force "more reviewers" anti-pattern are detailed in our companion piece on human-in-the-loop AI code review.

5. The 7-Gate Agentic Engineering Audit — Gate by Gate

Each gate exists to neutralize a specific failure mode of the vibe-coded SDLC. They are designed to be enforced as code — in GitHub Actions, GitLab pipelines, or your platform-engineering control plane — not as documents that engineers may or may not read.
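
As a rough sketch of what "enforced as code" could look like, the check below fails a pipeline step when a commit lacks evidence for any pre-merge gate. The gate keys, the shape of the evidence dictionary, and the function names are assumptions made for illustration, not part of any specific CI product. Gate 7 is left out because its evidence only exists after merge.

```python
# Hypothetical gate identifiers for gates 1-6 (gate 7 is evaluated post-merge).
PRE_MERGE_GATES = (
    "intent_capture",
    "scoped_execution",
    "diff_review",
    "adversarial_tests",
    "provenance_sbom",
    "readiness_score",
)


def missing_gates(evidence: dict) -> list:
    """Return the gates with no recorded evidence for a commit, in gate order."""
    return [gate for gate in PRE_MERGE_GATES if not evidence.get(gate)]


def enforce(evidence: dict) -> int:
    """Exit code for a CI step: 0 when every gate has evidence, 1 otherwise."""
    gaps = missing_gates(evidence)
    for gate in gaps:
        print(f"missing evidence for gate: {gate}")
    return 1 if gaps else 0
```

A pipeline job would pass the returned value to its exit call, so a missing artifact blocks the merge rather than producing a warning nobody reads.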

Gate 1 — Intent Capture

Every AI-generated commit must originate from a captured intent record: a structured spec containing the user story, acceptance criteria, security constraints, and the explicit blast-radius.

Free-form chat prompts do not qualify. The intent record is stored alongside the PR and referenced in the commit metadata. This single discipline cuts hallucinated-API merges by 30–40% because it forces the engineer to specify the contract before invoking the agent.
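
One possible shape for an intent record is sketched below. The four fields mirror the ones named above; the class name, field names, and validation messages are illustrative assumptions rather than a published schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IntentRecord:
    """Structured spec captured before the agent is invoked (illustrative fields)."""
    user_story: str
    acceptance_criteria: tuple
    security_constraints: tuple
    blast_radius: tuple  # paths the agent is allowed to touch

    def problems(self) -> list:
        """List the reasons this record would not pass Gate 1."""
        issues = []
        if not self.user_story.strip():
            issues.append("user story is empty")
        if not self.acceptance_criteria:
            issues.append("no acceptance criteria")
        if not self.security_constraints:
            issues.append("no security constraints")
        if not self.blast_radius:
            issues.append("blast radius not declared")
        return issues
```

An empty list from `problems()` means the record is complete enough to reference from commit metadata; anything else should stop the agent from being invoked at all.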

Gate 2 — Scoped Agent Execution

The agent runs inside a sandbox with explicit filesystem, network, and credential limits derived from the intent record.

No production credentials. No write access outside the scoped paths. No outbound network beyond an allow-list. This is where the machine identity discipline meets agentic engineering — a topic we cover at depth in our AgentOps Machine Identity guide.
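
The policy logic behind those limits can be sketched in a few lines. This is only the decision function, under the assumption that paths are repository-relative POSIX paths and that the allow-list holds exact hostnames; real enforcement belongs in the sandbox or network layer, not in application code. Function names are hypothetical.

```python
import posixpath
from urllib.parse import urlsplit


def write_allowed(path: str, scoped_paths: list) -> bool:
    """True when the normalized path sits inside one of the scoped directories."""
    normalized = posixpath.normpath(path)
    # Reject absolute paths and anything that escapes the repository root.
    if normalized.startswith("/") or normalized == ".." or normalized.startswith("../"):
        return False
    for root in scoped_paths:
        root = posixpath.normpath(root)
        if normalized == root or normalized.startswith(root + "/"):
            return True
    return False


def egress_allowed(url: str, allowed_hosts: set) -> bool:
    """True when the URL's host is on the outbound allow-list (exact match)."""
    host = urlsplit(url).hostname
    return host is not None and host in allowed_hosts
```

Normalizing before comparing is the important design choice here: a path such as `src/billing/../auth/login.py` looks in-scope as a string prefix but resolves outside the scoped directory.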

Gate 3 — Diff-Level Human Review

The reviewer reads diffs, not files. They reference the intent record, not the prompt.

They check whether the diff matches the captured contract, not whether the code "looks reasonable." This is the discipline EU AI Act Article 14 effectively mandates: meaningful human oversight, evidenced by an artifact trail.
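
Part of that contract check can be automated before a human opens the diff. The sketch below assumes a standard unified git diff and a blast radius expressed as directory prefixes, and it flags any changed file outside the declared scope. It is deliberately naive (it keys on the `--- a/` and `+++ b/` header lines, so unusual diff content could confuse it), and the function names are hypothetical.

```python
def changed_files(diff_text: str) -> list:
    """Collect paths from the '--- a/' and '+++ b/' headers of a unified git diff."""
    files = []
    for line in diff_text.splitlines():
        for prefix in ("--- a/", "+++ b/"):
            if line.startswith(prefix):
                path = line[len(prefix):]
                if path not in files:
                    files.append(path)
    return files


def outside_contract(diff_text: str, blast_radius: list) -> list:
    """Return changed files that fall outside every declared blast-radius prefix."""
    roots = [root.rstrip("/") for root in blast_radius]
    return [
        path for path in changed_files(diff_text)
        if not any(path == root or path.startswith(root + "/") for root in roots)
    ]


SAMPLE_DIFF = """\
diff --git a/src/billing/invoice.py b/src/billing/invoice.py
--- a/src/billing/invoice.py
+++ b/src/billing/invoice.py
@@ -1,1 +1,1 @@
-total = 0
+total = 0.0
diff --git a/src/auth/session.py b/src/auth/session.py
--- a/src/auth/session.py
+++ b/src/auth/session.py
@@ -3,1 +3,1 @@
-TTL = 900
+TTL = 86400
"""

print(outside_contract(SAMPLE_DIFF, ["src/billing"]))
```

In this sample the intent record scoped the work to `src/billing`, so the change to `src/auth/session.py` is surfaced to the reviewer as out of contract.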

Gate 4 — Adversarial Test Synthesis

For every AI-generated handler, the pipeline auto-synthesizes adversarial tests targeting the failure modes that AI code is statistically prone to: XSS, SSRF, IDOR, time-of-check / time-of-use races, and prompt-injection paths if the handler invokes downstream agents.

This is the replacement for the "two human reviewers" anti-pattern.
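
A minimal sketch of the idea, limited to reflected XSS: run a fixed payload set against a handler and fail any case where the raw payload comes back unescaped. The two handlers and the payload list are hypothetical, and the "payload appears verbatim" check is a crude heuristic. A real synthesis step would be context-aware and cover the other failure modes listed above.

```python
import html

# Small illustrative payload set; not exhaustive.
XSS_PAYLOADS = (
    "<script>alert(1)</script>",
    '"><img src=x onerror=alert(1)>',
    "<svg/onload=alert(1)>",
)


def render_greeting(name: str) -> str:
    """Hypothetical handler that escapes user input before embedding it."""
    return f"<p>Hello, {html.escape(name)}</p>"


def render_greeting_unsafe(name: str) -> str:
    """Hypothetical handler that embeds user input verbatim."""
    return f"<p>Hello, {name}</p>"


def synthesize_xss_cases(handler):
    """Yield (payload, passed); a case passes when the raw payload is absent from the output."""
    for payload in XSS_PAYLOADS:
        yield payload, payload not in handler(payload)


def failures(handler) -> list:
    """Payloads that the handler reflected without escaping."""
    return [payload for payload, passed in synthesize_xss_cases(handler) if not passed]
```

The escaping handler produces no failures, while the unsafe one fails every payload, which is the signal the pipeline would use to block the merge.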

Gate 5 — Provenance & SBOM Tagging

The commit is tagged with: which agent (model + version), which prompt template, which intent record, which reviewer, and which test suite ran.

The SBOM is updated to reflect any new dependencies the agent introduced. This is the evidence trail your post-incident forensics and your auditor will both ask for.
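
One lightweight way to carry those tags is as trailer lines at the end of the commit message, in the same `Key: value` style git already uses for lines such as `Signed-off-by`. The sketch below assumes that approach; the trailer keys and values are hypothetical, and the parser is a simplified stand-in rather than a reimplementation of git's own trailer handling.

```python
def format_trailers(provenance: dict) -> str:
    """Render provenance fields as 'Key: value' trailer lines for a commit message."""
    return "\n".join(f"{key}: {value}" for key, value in provenance.items())


def parse_trailers(message: str) -> dict:
    """Read 'Key: value' lines back out of the final paragraph of a commit message."""
    last_block = message.strip().split("\n\n")[-1]
    trailers = {}
    for line in last_block.splitlines():
        key, sep, value = line.partition(": ")
        if sep:
            trailers[key] = value
    return trailers


# Hypothetical provenance record for one commit.
provenance = {
    "Agent-Model": "example-model@1.0",
    "Prompt-Template": "refactor-v3",
    "Intent-Record": "INT-0042",
    "Reviewed-by": "A. Reviewer",
    "Adversarial-Suite": "xss-ssrf-baseline",
}

message = "Fix invoice rounding\n\nRound totals half-up.\n\n" + format_trailers(provenance)
```

Keeping the tags in the commit itself means the evidence travels with the code through forks, mirrors, and history rewrites that preserve messages, with no separate database to reconcile during an incident.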

Gate 6 — Production Readiness Score

Before merge, the PR receives a 7-dimension score: correctness, security, performance, observability, maintainability, blast-radius, and rollback complexity.

Borderline scores route to staff-engineer review with explicit blast-radius limits.
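
A sketch of how such a score might route a PR is shown below. The seven dimensions are the ones listed above; the 0–10 scale, the unweighted average, and both thresholds are assumptions chosen for illustration, not a published rubric.

```python
DIMENSIONS = (
    "correctness", "security", "performance", "observability",
    "maintainability", "blast_radius", "rollback_complexity",
)

# Hypothetical thresholds on a 0-10 scale.
PASS_MARK = 8.0
BORDERLINE_MARK = 6.0


def readiness(scores: dict) -> tuple:
    """Return (average score, routing decision) for a PR scored on all seven dimensions."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"unscored dimensions: {missing}")
    average = sum(scores[d] for d in DIMENSIONS) / len(DIMENSIONS)
    weakest = min(scores[d] for d in DIMENSIONS)
    # A strong average cannot hide a single weak dimension.
    if average >= PASS_MARK and weakest >= BORDERLINE_MARK:
        return average, "merge"
    if average >= BORDERLINE_MARK:
        return average, "staff-review"
    return average, "block"
```

The floor on the weakest dimension is the design choice worth noting: a PR that scores well everywhere except security still routes to staff-engineer review instead of merging on its average.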

Gate 7 — Post-Merge Telemetry Loop

Every AI-generated commit is tracked in production for 14–30 days for elevated error rates, latency regressions, and incident attribution.

The aggregated signal feeds back into the agent's prompt templates and the team's training data on which patterns to scrutinize harder.
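
The core comparison in such a loop can be sketched as a before-and-after error-rate check. The tolerance multiplier and function names are assumptions, and the check ignores statistical significance and traffic mix, so treat it as the shape of the signal rather than a production detector.

```python
def error_rate(errors: int, requests: int) -> float:
    """Errors per request; zero when there was no traffic."""
    return errors / requests if requests else 0.0


def regressed(before: tuple, after: tuple, tolerance: float = 1.5) -> bool:
    """True when the post-merge error rate exceeds tolerance x the pre-merge rate.

    `before` and `after` are (errors, requests) pairs for comparable windows.
    """
    base = error_rate(*before)
    current = error_rate(*after)
    if base == 0.0:
        # No baseline errors: any post-merge error is flagged for a human look.
        return current > 0.0
    return current > base * tolerance
```

A flagged commit would then be joined against its provenance tags, so the regression can be attributed to a specific agent, prompt template, and intent record.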

The complete 21-item operational checklist that implements these seven gates inside a real engineering pipeline is published in our companion piece on the agentic engineering workflow checklist.

6. Case Study — The Lovable Security Crisis and What It Proved

If you needed a single case study to settle the boardroom debate, the Lovable incident of early 2026 supplied it.

A widely-used vibe-coding platform shipped projects with default configurations that exposed user credentials and personal data at scale — not because the platform itself was malicious, but because the generated code lacked basic gate-3 and gate-4 review, and the platform's default scaffolding inherited misconfigurations the agent had no reason to flag.

The post-mortem reads like a checklist of the gates above:

Gate 1 failure — Users described what they wanted; nobody captured a security-constraint intent.
Gate 2 failure — The agent had broad scope to wire up datastores with default-permissive configurations.
Gate 4 failure — No adversarial tests for unauthenticated-read paths.
Gate 5 failure — No provenance trail when the breach was investigated.

The teardown of the Lovable incident — exactly which misconfigurations propagated, which audit questions enterprise procurement now asks vendors, and which insurance exclusions activated — is in our companion analysis on the Lovable security crisis.

The lesson is structural, not vendor-specific. Any platform that lets users generate-and-ship without enforced gates will reproduce some version of Lovable.

The post-vibe-coding standard is gates enforced by the platform, not gates suggested by documentation.

PMO Warning: Procurement teams that signed vibe-coding platform contracts in 2024–2025 should pull those contracts and check the indemnity clauses. A growing number of incidents are landing in the gap between "platform liability" and "user-configuration error," and standard cyber insurance is starting to invoke gross-negligence exclusions when the deployed code is AI-generated without documented review.

7. The Migration Path — From Vibe-Coded Backlog to Agentic Engineering Today

You inherited a codebase with hundreds of vibe-coded commits already in production.

Telling your team to "stop vibe coding" does not make those commits compliant. Here is the pragmatic 90-day migration most enterprise teams can execute without halting feature delivery.

Days 1–14 — Inventory and triage. Run a provenance pass on the last 18 months of merges. Tag any commit where the PR description, commit message, or known-tool signature indicates AI generation. Risk-rank by blast radius: anything touching auth, data access, payment, or PII gets the highest tag.
Days 15–45 — Gate retrofit on new work. Implement gates 1, 3, and 4 in the pipeline for all new AI-generated commits. This stops the bleeding immediately. Gates 2, 5, 6, and 7 land in the next phase.
Days 46–75 — Adversarial test backfill on high-risk inventory. For the high-risk tags from your inventory, synthesize adversarial test suites and run them retroactively. Expect to find issues. Triage by exploitability, not by aesthetic severity.
Days 76–90 — Provenance backfill and full-gate rollout. Tag the inventory with what you can reconstruct of its provenance. Roll out gates 2, 5, 6, 7 across the pipeline. Begin the post-merge telemetry loop.
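
The inventory-and-triage pass in days 1–14 could start from something as simple as the sketch below. The commit-message signatures and path markers are hypothetical placeholders to be replaced with whatever your tools actually emit, and substring matching will produce false positives (for example, `auth` also matches `author`), so the output is a review queue, not a verdict.

```python
# Hypothetical signatures and markers; tune these to your own tooling and repo layout.
AI_SIGNATURES = ("generated with", "ai-generated", "co-authored-by: agent")
HIGH_RISK_MARKERS = ("auth", "payment", "pii", "data_access")


def looks_ai_generated(commit_message: str) -> bool:
    """Heuristic: does the commit message carry a known AI-tool signature?"""
    text = commit_message.lower()
    return any(signature in text for signature in AI_SIGNATURES)


def risk_tag(paths: list) -> str:
    """'high' when any touched path hints at auth, data access, payment, or PII."""
    for path in paths:
        lowered = path.lower()
        if any(marker in lowered for marker in HIGH_RISK_MARKERS):
            return "high"
    return "standard"


def triage(commits: list) -> list:
    """Keep suspected AI commits, tag them by risk, and sort high-risk first."""
    flagged = [
        {**commit, "risk": risk_tag(commit["paths"])}
        for commit in commits
        if looks_ai_generated(commit["message"])
    ]
    return sorted(flagged, key=lambda commit: commit["risk"] != "high")
```

Each commit is assumed to be a dictionary with `sha`, `message`, and `paths` keys, which is straightforward to build from `git log` output.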

Teams that pair this with a parallel transition of their broader AI-augmented SDLC practice (Scrum events, definition-of-done, sprint reviews where agents demo the product) tend to land the change without a velocity dip.

The accumulated playbook for that broader transition is in our 23-post pillar on the Agile-Agentic SDLC.

For organizations whose current AI coding policy was built around the deprecated "vibe coding" framing — including hiring rubrics, IP clauses, and team-management norms — the legacy material is preserved at our managing vibe coding teams pillar, which now serves as the migration source for the agentic engineering posture.

8. Measuring ROI — What CFOs and Boards Actually Want to See

The board will not approve agentic engineering on safety grounds alone.

They will approve it when the ROI math is unambiguous. Here is the math that has cleared 2026 CFO reviews at multiple mid-cap enterprises.

Cost side — gate implementation runs 6–10 engineer-weeks of platform-engineering work, plus 2–4 weeks of pipeline-tooling integration. Ongoing overhead is 8–12% of engineering hours on enforced reviews and test synthesis.
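
To put the cost side in one unit, the sketch below converts those ranges into engineer-weeks. The one-time and overhead percentages are the figures quoted above; the team size and the 46 working weeks per engineer per year are hypothetical inputs.

```python
def one_time_weeks() -> tuple:
    """(low, high) one-time cost in engineer-weeks: platform work plus tooling integration."""
    platform = (6, 10)
    tooling = (2, 4)
    return platform[0] + tooling[0], platform[1] + tooling[1]


def annual_overhead_weeks(engineers: int, working_weeks: int = 46) -> tuple:
    """(low, high) ongoing cost in engineer-weeks per year at 8-12% of engineering time."""
    total = engineers * working_weeks
    return total * 8 / 100, total * 12 / 100


print(one_time_weeks())           # one-time range
print(annual_overhead_weeks(50))  # ongoing range for a hypothetical 50-engineer org
```

For a hypothetical 50-engineer organization this gives 8–14 engineer-weeks up front and 184–276 engineer-weeks per year of ongoing overhead, which is the number the benefit lines have to beat.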

Benefit side — typical observed deltas in the first two quarters after rollout:

CVE backlog down 35–55% — the single largest line item.
Time-to-merge up modestly (10–15%) for AI-generated PRs, but down 5–8% for human-only PRs as reviewer attention is reallocated.
Production incident attribution time down 60–75% — provenance tags collapse forensic effort.
Cyber-insurance renewal premiums hold or drop, versus a 15–25% increase observed in peer organizations without documented agentic controls.
Audit preparation time for SOC 2, ISO 42001, and EU AI Act readiness down 40–60% — gate evidence is the audit evidence.

The honest answer to "what's the ROI" is that agentic engineering pays for itself inside the first audit cycle, and the security savings compound from quarter two onward.

The 7-point production readiness scorecard that quantifies each PR's contribution to that math is detailed in our companion piece on grading AI agent code production readiness.

Frequently Asked Questions (FAQ)

What exactly did Andrej Karpathy say to deprecate vibe coding in February 2026?

Karpathy publicly described vibe coding as passé and noted that LLM-agent programming is now a default professional workflow only with greater oversight and scrutiny. He stated the goal is agent leverage without quality compromise, then disclosed that his own latest project was hand-coded, signalling the end of the unreviewed-output era.

How is agentic engineering different from vibe coding in a CTO's enterprise workflow?

Vibe coding ships AI outputs based on conversational prompts and surface-level review. Agentic engineering ships AI outputs plus an oversight trail: captured intent, scoped agent execution, diff-level review, adversarial tests, provenance tags, a readiness score, and post-merge telemetry. The artifacts, not the code, are what auditors examine.

Why did Karpathy abandon his own vibe coding term and hand-code his latest project?

The empirical data turned. AI-generated code now produces flaws at 2.74× the rate of human-written code and fails XSS protection 86% of the time. Karpathy's hand-coded project signals that the original vibe-coding posture (accept what the model gives you) no longer meets the production bar he applies to his own work.

What percentage of AI-generated code contains exploitable security vulnerabilities in 2026?

Published 2026 studies converge on a range of 40–62% of AI-generated samples containing exploitable vulnerabilities, depending on language and tool. Cross-site scripting protection fails in approximately 86% of AI-generated handlers, and the overall AI-versus-human flaw-rate ratio sits at 2.74x per a 470-PR analysis.

Is vibe coding still safe to use for internal tools or only banned for production?

Internal-only is not a safe carve-out. Internal tools frequently access production data, customer PII, and credentials. The same gates apply: intent capture, scoped execution, diff-level review, adversarial tests. The Lovable incident proved that internal misconfigurations propagate to external impact faster than most procurement teams realize.

Which governance gates separate agentic engineering from unreviewed AI-assisted coding?

Seven: intent capture, scoped agent execution, diff-level human review, adversarial test synthesis, provenance and SBOM tagging, production readiness scoring, and post-merge telemetry. Each maps to a specific failure mode of the vibe-coded SDLC and to a specific clause of the EU AI Act, NIST AI RMF, or ISO/IEC 42001.

Why does AI-generated code produce flaws at 2.74× the rate of human-written code?

AI models pattern-match on common code without consistently encoding contextual security invariants — which user identity owns this resource, which field needs sanitization, which race conditions matter. The output passes linters and basic tests but fails on adversarial inputs that target the model's blind spots, especially in auth, sanitization, and async paths.

Should enterprise teams rebrand their AI coding policy from vibe coding to agentic engineering?

Yes — and not as marketing. Auditors, insurers, and acquirers in 2026 read policy language to assess governance maturity. A policy still using vibe coding or AI-assisted coding without gate definitions signals an immature posture. Agentic engineering language, paired with the seven gates as enforced controls, signals the opposite.

How do CTOs measure ROI on agentic engineering versus traditional pair programming?

Track five lines: CVE backlog reduction, incident-attribution time, audit preparation time, cyber-insurance premium movement, and time-to-merge split by AI versus human PRs. Mid-cap 2026 deployments report 35–55% CVE backlog reduction and 40–60% audit prep reduction inside two quarters, more than offsetting the 8–12% review overhead.

What is the EU AI Act exposure for shipping vibe-coded software after August 2, 2026?

For high-risk AI systems, fines run up to €15 million or 3% of global annual turnover. Article 14 mandates meaningful human oversight; Article 15 mandates accuracy, robustness, and cybersecurity. Vibe-coded SDLC posture — unscoped agents, surface-level review, no provenance — is functionally a failure of both articles by design.
