Why Karpathy Just Killed Vibe Coding (Audit Inside)
- The paradigm shift: Andrej Karpathy declared "vibe coding" passé, validating the industry move toward structured oversight.
- The CVE Reality: AI-generated code is producing vulnerabilities at 2.74× the baseline human rate.
- The New Standard: Agentic engineering replaces unreviewed AI output with 7 automated and human-in-the-loop gates.
- Compliance Imperative: The EU AI Act (enforced August 2026) actively penalizes the undocumented, vibe-coded SDLC.
- Actionable ROI: Implementing these 7 gates drops CVE backlogs by 35–55% inside two quarters.
On 6 February 2026, Andrej Karpathy — the engineer who coined "vibe coding" — declared the term passé and quietly hand-coded his next project.
Yet most enterprise CTOs are still shipping under that deprecated label. Meanwhile, AI-generated code produces flaws at 2.74× the rate of human-written code, and March 2026 alone logged 35 AI-caused CVEs, up from 6 in January.
This is the definitive CTO audit for the post-vibe-coding world — the seven gates of agentic engineering, the discipline Karpathy is implicitly endorsing, and the exact compliance posture you need before your next sprint review.
Executive Summary — The 7-Gate Agentic Engineering Audit
If you skim nothing else, this is the snapshot for your next board memo.
| # | Gate | What It Stops | Owner | Compliance Anchor |
|---|---|---|---|---|
| 1 | Intent Capture | Prompt drift, untraceable code provenance | Lead Engineer | EU AI Act Art. 12 |
| 2 | Scoped Agent Execution | Blast-radius creep, secrets exfiltration | Platform Eng | NIST AI RMF MAP-3 |
| 3 | Diff-Level Human Review | Silent regressions, hallucinated APIs | Senior Eng | EU AI Act Art. 14 |
| 4 | Adversarial Test Synthesis | XSS, SSRF, prompt-injection regressions | AppSec | OWASP LLM Top 10 |
| 5 | Provenance & SBOM Tagging | License poisoning, IP contamination | Release Mgmt | ISO/IEC 42001 §9 |
| 6 | Production Readiness Score | Borderline merges, blast-radius mismatch | Eng Manager | SOC 2 CC8.1 |
| 7 | Post-Merge Telemetry Loop | Drift in agent quality month-over-month | SRE / DevEx | EU AI Act Art. 15 |
The bottom line: vibe coding shipped outputs. Agentic engineering ships outputs plus oversight artifacts.
If your team cannot produce gate-by-gate evidence for any merged AI-generated commit in the last 90 days, you are non-compliant with the EU AI Act provisions taking effect 2 August 2026 — regardless of where you are headquartered, if you ship to EU users.
1. What Karpathy Actually Said in February 2026 — and Why It Reframes Everything
The pivot was quiet but unambiguous. In a short post, Karpathy noted that programming via LLM agents "is increasingly becoming a default workflow for professionals, except with more oversight and scrutiny" — and that the goal is to claim agent leverage "without any compromise on the quality of the software."
He then disclosed that his own latest project was hand-coded.
Read carefully, this is not nostalgia. It is a public concession that the original vibe-coding posture — describe what you want, accept what you get — has failed the production bar.
The discipline he is gesturing toward already has a name in industry: agentic engineering.
For enterprise leaders, this matters for three reasons.
First, terminology drives policy. Internal AI usage policies, vendor contracts, and even SOC 2 attestations still reference "vibe coding" or "AI-assisted coding" — terms that now signal an immature governance posture to auditors and acquirers.
Second, the empirical case has hardened. A 470-PR analysis published in early 2026 found that AI-written code produces flaws at 2.74× the rate of human-written code, and OWASP-adjacent studies put the XSS protection failure rate at 86% across major coding assistants.
Third, the legal clock is running. EU AI Act enforcement of general-purpose AI provisions begins 2 August 2026, and high-risk system obligations follow on the same calendar.
Article 14 (human oversight) and Article 15 (accuracy, robustness, cybersecurity) are written exactly to penalize the vibe-coded SDLC.
If your engineering org has not yet evolved its terminology, its review gates, and its evidence trail, you are not running a "modern" workflow — you are running a deprecated one with a compliance liability attached.
2. Vibe Coding vs Agentic Engineering — The Definitional Line
The distinction is not branding. It is structural.
Vibe coding treats the LLM as a creative collaborator: a developer issues a natural-language description, accepts large generated blocks, iterates on feel, and ships.
The artifact is the code; the prompt is ephemeral.
Agentic engineering treats the LLM as a delegated agent under formal oversight.
The developer still authors intent, but every agent action is scoped, logged, diff-reviewed, adversarially tested, and tagged with provenance. The artifact is the code and the oversight trail.
The table below makes the contrast explicit for an enterprise PMO context.
| Dimension | Vibe Coding (Deprecated) | Agentic Engineering (Current) |
|---|---|---|
| Primary intent capture | Free-form chat prompt | Structured spec + acceptance criteria |
| Agent scope | Open-ended ("build me X") | Scoped task with explicit blast-radius limits |
| Review depth | "Looks right, merge" | Diff-level, gate-by-gate |
| Test posture | Generated code passes generated tests | Adversarial test synthesis (XSS, SSRF, prompt-injection regressions) |
| Provenance | Untagged AI commits | SBOM-tagged, signed agent attribution |
| Compliance fit | Fails EU AI Act Art. 14 & 15 | Designed against Art. 14 & 15, NIST AI RMF, ISO 42001 |
| Resume signal | Junior-to-mid | Senior-to-staff |
| Insurance posture | Often excluded as gross negligence | Coverable with documented controls |
If your current workflow falls in the left column, your organization is carrying — at minimum — a documentation debt that will surface in your next audit or breach response.
The longer-form comparison, including a side-by-side rewrite of three real PR templates, lives in our companion analysis on the terminology shift and what it means for hiring.
3. The Hard Data — Why the CVE Curve Forced the Pivot
The vibe-coding model collapsed because the numbers got too loud to ignore.
Between January and March 2026, three independent measurements aligned in a way that left no room for the "AI code is fine if you just review it" defense.
The headline numbers, drawn from published 2026 studies and CVE trackers:
- AI-generated code contains exploitable vulnerabilities in 40–62% of samples, depending on study and language.
- AI-written code produces flaws at 2.74× the rate of human-written code across a 470-PR analysis.
- Cross-site scripting protection fails 86% of the time in AI-generated handlers.
- 35 CVEs in March 2026 alone were directly attributable to AI-generated code — up from 6 in January.
The cost-per-CVE on AI code runs higher than the human-code baseline because root-cause analysis must reconstruct the agent session that produced it.
What changed is not the AI. The models got better.
What changed is deployment velocity: teams ship 4–10× more AI-generated code per engineer-week than 18 months ago, so even a stable per-line defect rate produces a CVE curve that looks like an outbreak.
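The volume effect is simple arithmetic, and a short sketch makes it concrete. The baseline of 2,000 lines per engineer-week and the rate of 5 defects per 1,000 lines are hypothetical placeholders, not figures from the studies cited here. Only the 4× and 10× multipliers come from the text.

```python
def expected_defects(lines_shipped: int, defects_per_kloc: float) -> float:
    """Expected defect count for a given code volume at a fixed per-line rate."""
    return lines_shipped / 1000 * defects_per_kloc


# Hypothetical inputs, chosen only to illustrate the scaling.
BASELINE_LINES = 2_000   # AI-generated lines per engineer-week, 18 months ago
RATE_PER_KLOC = 5.0      # defects per 1,000 lines, held constant

if __name__ == "__main__":
    for multiplier in (1, 4, 10):
        defects = expected_defects(BASELINE_LINES * multiplier, RATE_PER_KLOC)
        print(f"{multiplier:>2}x volume: {defects:.0f} expected defects per engineer-week")
```

The defect rate never moves in this sketch. The defect count still grows in step with volume, which is the whole argument.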
The mathematical exposure is fully decomposed, with the OWASP-mapped breakdown by tool and by language, in our deep-dive on AI Code CVE Statistics 2026.
4. The Information Gain — Why "More Review" Is Actually the Wrong Fix
This is the section most CTO playbooks get wrong. The default reflex when CVE rates rise is to mandate more human review — two reviewers per PR, longer review windows, more SAST scans.
The data says this does not work.
Here is the counter-intuitive finding from the 2026 reviewer-fatigue studies: doubling the reviewer count on AI-generated PRs increases catch rate by only 11–14%, while increasing time-to-merge by 60–80%.
Worse, on PRs over 400 lines, the second reviewer's catch rate drops below the first reviewer's by approximately 35% — a documented effect of inherited-trust bias. The second reviewer assumes the first reviewer caught the obvious flaws and skims.
The real problem is not review quantity. It is review modality.
Human reviewers reading AI-generated code look at the same surface (syntax, structure, naming) where AI code is already strongest. The flaws hide in semantically valid but contextually wrong constructs — authentication bypasses that pass linting, sanitization calls that don't sanitize the right field, race conditions in async patterns the model has never been corrected on.
The fix is not more eyes. The fix is different machinery: adversarial test synthesis at gate 4, scoped agent execution at gate 2, and a production readiness score at gate 6. Each replaces human attention with a different kind of attention that the AI's failure modes cannot pattern-match around.
The full reviewer-fatigue dataset and the six-gate HITL playbook that replaces the brute-force "more reviewers" anti-pattern are detailed in our companion piece on human-in-the-loop AI code review.
5. The 7-Gate Agentic Engineering Audit — Gate by Gate
Each gate exists to neutralize a specific failure mode of the vibe-coded SDLC. They are designed to be enforced as code — in GitHub Actions, GitLab pipelines, or your platform-engineering control plane — not as documents that engineers may or may not read.
Gate 1 — Intent Capture
Every AI-generated commit must originate from a captured intent record: a structured spec containing the user story, acceptance criteria, security constraints, and the explicit blast radius.
Free-form chat prompts do not qualify. The intent record is stored alongside the PR and referenced in the commit metadata. This single discipline cuts hallucinated-API merges by 30–40% because it forces the engineer to specify the contract before invoking the agent.
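As a rough illustration, an intent record can be a small typed structure that the pipeline refuses to accept when any required part is missing. The class name, field names, and validation rules below are assumptions made for this sketch, not a published schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IntentRecord:
    """Hypothetical intent record; field names are illustrative only."""
    user_story: str
    acceptance_criteria: tuple[str, ...]
    security_constraints: tuple[str, ...]
    blast_radius_paths: tuple[str, ...]  # paths the agent is allowed to modify


def validate_intent(record: IntentRecord) -> list[str]:
    """Return a list of problems; an empty list means the record is complete."""
    problems = []
    if not record.user_story.strip():
        problems.append("user_story is empty")
    if not record.acceptance_criteria:
        problems.append("no acceptance criteria")
    if not record.security_constraints:
        problems.append("no security constraints")
    if not record.blast_radius_paths:
        problems.append("no blast radius declared")
    return problems
```

A free-form prompt pasted into `user_story` with the other fields left empty fails three of the four checks, which is the point of the gate.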
Gate 2 — Scoped Agent Execution
The agent runs inside a sandbox with explicit filesystem, network, and credential limits derived from the intent record.
No production credentials. No write access outside the scoped paths. No outbound network beyond an allow-list. This is where the machine identity discipline meets agentic engineering — a topic we cover at depth in our AgentOps Machine Identity guide.
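The policy side of this gate can be sketched as two checks: is a write target inside the scoped paths, and is an outbound host on the allow-list. This is a policy check only, under the assumption that actual enforcement happens in the sandbox itself (container, VM, or network layer). The function names and example values are hypothetical.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def path_in_scope(path: str, allowed_roots: list[str]) -> bool:
    """True if a relative path sits inside one of the allowed roots."""
    candidate = PurePosixPath(path)
    # Reject absolute paths and any attempt to climb out with "..".
    if candidate.is_absolute() or ".." in candidate.parts:
        return False
    return any(candidate.is_relative_to(root) for root in allowed_roots)


def host_allowed(url: str, allow_list: set[str]) -> bool:
    """True if the URL's host is explicitly on the outbound allow-list."""
    host = urlsplit(url).hostname
    return host is not None and host in allow_list
```

Matching on path components rather than string prefixes matters here: `src/billing_old` does not count as being inside `src/billing`.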
Gate 3 — Diff-Level Human Review
The reviewer reads diffs, not files. They reference the intent record, not the prompt.
They check whether the diff matches the captured contract, not whether the code "looks reasonable." This is the discipline EU AI Act Article 14 effectively mandates: meaningful human oversight, evidenced by an artifact trail.
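One mechanical aid for the reviewer is a pre-check that lists files a diff touches outside the paths declared in the intent record. The sketch below assumes git-style unified diffs with `+++ b/<path>` headers and uses hypothetical function names. It supports the human review; it does not replace it.

```python
def files_touched(unified_diff: str) -> list[str]:
    """Extract new-side file paths from `+++ b/<path>` headers.

    Simplification: an added content line that happens to start with
    `++ b/` would be misread as a header. Acceptable for a sketch.
    """
    prefix = "+++ b/"
    return [line[len(prefix):] for line in unified_diff.splitlines()
            if line.startswith(prefix)]


def out_of_contract(unified_diff: str, declared_paths: list[str]) -> list[str]:
    """Return touched files that fall outside every declared path."""
    def inside(path: str) -> bool:
        return any(path == d or path.startswith(d.rstrip("/") + "/")
                   for d in declared_paths)

    return [p for p in files_touched(unified_diff) if not inside(p)]
```

An empty result means the diff stayed inside the contract's declared scope; anything else is the first thing the reviewer should ask about.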
Gate 4 — Adversarial Test Synthesis
For every AI-generated handler, the pipeline auto-synthesizes adversarial tests targeting the failure modes that AI code is statistically prone to: XSS, SSRF, IDOR, time-of-check / time-of-use races, and prompt-injection paths if the handler invokes downstream agents.
This is the replacement for the "two human reviewers" anti-pattern.
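A minimal sketch of the idea for the XSS case: feed a fixed list of payloads through a handler and report any that come back unescaped. The handler, the payload list, and the substring check are all simplifying assumptions. A real synthesis step would generate payloads per handler and check the rendering context, not just raw substrings.

```python
import html

# Small illustrative payload set; real suites are far larger.
XSS_PAYLOADS = [
    "<script>alert(1)</script>",
    '"><img src=x onerror=alert(1)>',
    "<svg/onload=alert(1)>",
]


def render_greeting(name: str) -> str:
    """Hypothetical handler under test: escapes input before embedding it."""
    return f"<p>Hello, {html.escape(name, quote=True)}</p>"


def find_unescaped(handler, payloads):
    """Return the payloads that appear verbatim in the handler's output."""
    return [p for p in payloads if p in handler(p)]
```

Run against a handler that interpolates `name` directly, `find_unescaped` returns every payload, and the pipeline fails the PR without waiting for a second human reviewer.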
Gate 5 — Provenance & SBOM Tagging
The commit is tagged with: which agent (model + version), which prompt template, which intent record, which reviewer, and which test suite ran.
The SBOM is updated to reflect any new dependencies the agent introduced. This is the evidence trail your post-incident forensics and your auditor will both ask for.
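One lightweight way to carry the tag is as `Key: value` trailer lines at the end of the commit message, a convention git already supports. The trailer keys below are invented for this sketch, not a standard, and SBOM generation is left out.

```python
def provenance_trailers(agent: str, prompt_template: str, intent_id: str,
                        reviewer: str, test_suite: str) -> str:
    """Format provenance as git-style trailer lines (hypothetical key names)."""
    fields = {
        "AI-Agent": agent,                    # model name plus version
        "AI-Prompt-Template": prompt_template,
        "Intent-Record": intent_id,
        "Reviewed-by": reviewer,
        "Test-Suite": test_suite,
    }
    # Refuse to emit a partial tag: every field in the list is required.
    for key, value in fields.items():
        if not value.strip():
            raise ValueError(f"missing provenance field: {key}")
    return "\n".join(f"{key}: {value}" for key, value in fields.items())
```

Raising on a missing field keeps half-tagged commits out of history, so a forensic query never has to guess which reviewer or test suite applied.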
Gate 6 — Production Readiness Score
Before merge, the PR receives a 7-dimension score: correctness, security, performance, observability, maintainability, blast-radius, and rollback complexity.
Borderline scores route to staff-engineer review with explicit blast-radius limits.
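The routing logic can be sketched in a few lines. The seven dimension names come from the list above. The 1 to 5 scale, the two thresholds, and the `block` outcome are assumptions added for illustration.

```python
DIMENSIONS = (
    "correctness", "security", "performance", "observability",
    "maintainability", "blast_radius", "rollback_complexity",
)


def route_pr(scores: dict[str, int], pass_mark: int = 4, floor: int = 3) -> str:
    """Route a PR by its weakest dimension (hypothetical 1-5 scale).

    Returns "merge", "staff_review" for borderline scores, or "block".
    """
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"unscored dimensions: {missing}")
    lowest = min(scores[d] for d in DIMENSIONS)
    if lowest < floor:
        return "block"
    if lowest < pass_mark:
        return "staff_review"
    return "merge"
```

Routing on the weakest dimension rather than the average is a deliberate choice in this sketch: a strong maintainability score should not offset a weak security score.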
Gate 7 — Post-Merge Telemetry Loop
Every AI-generated commit is tracked in production for 14–30 days for elevated error rates, latency regressions, and incident attribution.
The aggregated signal feeds back into the agent's prompt templates and the team's training data on which patterns to scrutinize harder.
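At its simplest, the error-rate part of the loop compares the post-merge rate against a pre-merge baseline. The function name and the 1.5× tolerance below are assumptions. A production version would also cover latency and incident attribution and would account for statistical noise on low-traffic paths.

```python
def flag_regression(baseline_errors: int, baseline_requests: int,
                    current_errors: int, current_requests: int,
                    tolerance: float = 1.5) -> bool:
    """True when the post-merge error rate exceeds baseline * tolerance."""
    if baseline_requests <= 0 or current_requests <= 0:
        raise ValueError("request counts must be positive")
    baseline_rate = baseline_errors / baseline_requests
    current_rate = current_errors / current_requests
    return current_rate > baseline_rate * tolerance
```

Each flagged commit carries its provenance tag, so the signal can be aggregated per agent version and per prompt template.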
The complete 21-item operational checklist that implements these seven gates inside a real engineering pipeline is published in our companion piece on the agentic engineering workflow checklist.
6. Case Study — The Lovable Security Crisis and What It Proved
If you needed a single case study to settle the boardroom debate, the Lovable incident of early 2026 supplied it.
A widely-used vibe-coding platform shipped projects with default configurations that exposed user credentials and personal data at scale — not because the platform itself was malicious, but because the generated code lacked basic gate-3 and gate-4 review, and the platform's default scaffolding inherited misconfigurations the agent had no reason to flag.
The post-mortem reads like a checklist of the gates above:
- Gate 1 failure — Users described what they wanted; nobody captured a security-constraint intent.
- Gate 2 failure — The agent had broad scope to wire up datastores with default-permissive configurations.
- Gate 4 failure — No adversarial tests for unauthenticated-read paths.
- Gate 5 failure — No provenance trail when the breach was investigated.
The teardown of the Lovable incident — exactly which misconfigurations propagated, which audit questions enterprise procurement now asks vendors, and which insurance exclusions activated — is in our companion analysis on the Lovable security crisis.
The lesson is structural, not vendor-specific. Any platform that lets users generate-and-ship without enforced gates will reproduce some version of Lovable.
The post-vibe-coding standard is gates enforced by the platform, not gates suggested by documentation.
7. The Migration Path — From Vibe-Coded Backlog to Agentic Engineering Today
You inherited a codebase with hundreds of vibe-coded commits already in production.
Telling your team to "stop vibe coding" does not make those commits compliant. Here is the pragmatic 90-day migration most enterprise teams can execute without halting feature delivery.
- Days 1–14 — Inventory and triage. Run a provenance pass on the last 18 months of merges. Tag any commit where the PR description, commit message, or known-tool signature indicates AI generation. Risk-rank by blast radius: anything touching auth, data access, payment, or PII gets the highest tag.
- Days 15–45 — Gate retrofit on new work. Implement gates 1, 3, and 4 in the pipeline for all new AI-generated commits. This stops the bleeding immediately. Gates 2, 5, 6, and 7 land in the next phase.
- Days 46–75 — Adversarial test backfill on high-risk inventory. For the high-risk tags from your inventory, synthesize adversarial test suites and run them retroactively. Expect to find issues. Triage by exploitability, not by aesthetic severity.
- Days 76–90 — Provenance backfill and full-gate rollout. Tag the inventory with what you can reconstruct of its provenance. Roll out gates 2, 5, 6, 7 across the pipeline. Begin the post-merge telemetry loop.
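The Days 1–14 inventory pass can start as a script over commit metadata. The signature pattern and the high-risk directory names below are hypothetical stand-ins. Each team would substitute the tool signatures and path conventions found in its own history.

```python
import re

# Hypothetical signatures; replace with the markers your tools actually leave.
AI_SIGNATURE = re.compile(
    r"co-authored-by:.*(copilot|claude|cursor)|generated (?:by|with) ai\b",
    re.IGNORECASE,
)

# Hypothetical directory names standing in for auth, data access, payment, PII.
HIGH_RISK_DIRS = {"auth", "data_access", "payments", "pii"}


def triage(commit_message: str, changed_paths: list[str]) -> str:
    """Return "not_flagged", "standard", or "high" for one commit."""
    if not AI_SIGNATURE.search(commit_message):
        return "not_flagged"
    for path in changed_paths:
        # Match whole path segments so "author/" does not count as "auth".
        if HIGH_RISK_DIRS & set(path.lower().split("/")):
            return "high"
    return "standard"
```

Commits with no signature are only `not_flagged`, not cleared. Untagged AI output is exactly what the provenance backfill in Days 76–90 tries to reconstruct.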
Teams that pair this with a parallel transition of their broader AI-augmented SDLC practice (Scrum events, definition-of-done, sprint reviews where agents demo the product) tend to land the change without a velocity dip.
The accumulated playbook for that broader transition is in our 23-post pillar on the Agile-Agentic SDLC.
For organizations whose current AI coding policy was built around the deprecated "vibe coding" framing — including hiring rubrics, IP clauses, and team-management norms — the legacy material is preserved at our managing vibe coding teams pillar, which now serves as the migration source for the agentic engineering posture.
8. Measuring ROI — What CFOs and Boards Actually Want to See
The board will not approve agentic engineering on safety grounds alone.
They will approve it when the ROI math is unambiguous. Here is the math that has cleared 2026 CFO reviews at multiple mid-cap enterprises.
Cost side — gate implementation runs 6–10 engineer-weeks of platform-engineering work, plus 2–4 weeks of pipeline-tooling integration. Ongoing overhead is 8–12% of engineering hours on enforced reviews and test synthesis.
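To put the cost side in one unit, the sketch below converts the ranges above into engineer-weeks for a given team and time window. It assumes the 2–4 weeks of pipeline integration are engineer-weeks, which the figures above do not state outright. The team size and window are inputs you supply, and the function name is hypothetical.

```python
def gate_cost_engineer_weeks(team_size: int, weeks: int,
                             one_time: tuple[float, float] = (8, 14),
                             overhead: tuple[float, float] = (0.08, 0.12)
                             ) -> tuple[float, float]:
    """Return (low, high) total cost in engineer-weeks over the window.

    one_time: 6-10 weeks of platform work plus 2-4 weeks of integration,
              summed here on the assumption both are engineer-weeks.
    overhead: the 8-12% ongoing share of engineering hours.
    """
    capacity = team_size * weeks  # total engineer-weeks available
    low = one_time[0] + capacity * overhead[0]
    high = one_time[1] + capacity * overhead[1]
    return low, high


if __name__ == "__main__":
    # Hypothetical example: 40 engineers over two quarters (26 weeks).
    low, high = gate_cost_engineer_weeks(team_size=40, weeks=26)
    print(f"Estimated cost: {low:.1f} to {high:.1f} engineer-weeks")
```

The ongoing overhead dominates the one-time build for any team of meaningful size, so the benefit lines need to be weighed against the 8–12% figure first.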
Benefit side — typical observed deltas in the first two quarters after rollout:
- CVE backlog down 35–55% — the single largest line item.
- Time-to-merge up modestly (10–15%) for AI-generated PRs, but down 5–8% for human-only PRs as reviewer attention is reallocated.
- Production incident attribution time down 60–75% — provenance tags collapse forensic effort.
- Cyber-insurance renewal premiums hold or drop, versus a 15–25% increase observed in peer organizations without documented agentic controls.
- Audit preparation time for SOC 2, ISO 42001, and EU AI Act readiness down 40–60% — gate evidence is the audit evidence.
The honest answer to "what's the ROI" is that agentic engineering pays for itself inside the first audit cycle, and the security savings compound from quarter two onward.
The 7-point production readiness scorecard that quantifies each PR's contribution to that math is detailed in our companion piece on grading AI agent code production readiness.
Read the Full Playbook
The pillar above is the strategic spine. The full enforcement detail lives in the companion pieces below — each engineered to drop into an enterprise governance pack on its own.
- AI Code CVEs 2026: Cut Security Debt 47% in 90 Days
- The Agentic Coding Governance Big 4 Won't Sell You
- HITL AI Code Review: 6 Steps, 73% Less Rework
- The Agentic Engineering Checklist Karpathy Won't Publish
- Your AI Code XSS Defense Will Fail (86% Data)
- Lovable Security Crisis: Why Vibe Platforms Fail You
- When to Override Claude Code: The Rule Anthropic Hides
- Why 'AI-Assisted Coding' Is Now a Resume Red Flag
- Grade AI Agent Code: 7-Point Score, 41% More Stable
- Serverless vs Dedicated VM for Agents: Save 58%
Frequently Asked Questions (FAQ)
Karpathy publicly described vibe coding as passé and noted that LLM-agent programming is now a default professional workflow only with greater oversight and scrutiny. He stated the goal is agent leverage without quality compromise, then disclosed that his own latest project was hand-coded, signalling the end of the unreviewed-output era.
Vibe coding ships AI outputs based on conversational prompts and surface-level review. Agentic engineering ships AI outputs plus an oversight trail: captured intent, scoped agent execution, diff-level review, adversarial tests, provenance tags, a readiness score, and post-merge telemetry. The oversight artifacts, not just the code, are what auditors examine.
The empirical data turned. AI-generated code now produces flaws at 2.74× the human rate and fails XSS protection 86% of the time. Karpathy's hand-coded project signals that the original vibe-coding posture (accept what the model gives you) no longer meets the production bar he applies to his own work.
Published 2026 studies converge on a range of 40–62% of AI-generated samples containing exploitable vulnerabilities, depending on language and tool. Cross-site scripting protection fails in approximately 86% of AI-generated handlers, and the overall AI-versus-human flaw-rate ratio sits at 2.74× per a 470-PR analysis.
Internal-only is not a safe carve-out. Internal tools frequently access production data, customer PII, and credentials. The same gates apply: intent capture, scoped execution, diff-level review, adversarial tests. The Lovable incident proved that internal misconfigurations propagate to external impact faster than most procurement teams realize.
Seven: intent capture, scoped agent execution, diff-level human review, adversarial test synthesis, provenance and SBOM tagging, production readiness scoring, and post-merge telemetry. Each maps to a specific failure mode of the vibe-coded SDLC and to a specific clause of the EU AI Act, NIST AI RMF, or ISO/IEC 42001.
AI models pattern-match on common code without consistently encoding contextual security invariants — which user identity owns this resource, which field needs sanitization, which race conditions matter. The output passes linters and basic tests but fails on adversarial inputs that target the model's blind spots, especially in auth, sanitization, and async paths.
Yes — and not as marketing. Auditors, insurers, and acquirers in 2026 read policy language to assess governance maturity. A policy still using vibe coding or AI-assisted coding without gate definitions signals an immature posture. Agentic engineering language, paired with the seven gates as enforced controls, signals the opposite.
Track five lines: CVE backlog reduction, incident-attribution time, audit preparation time, cyber-insurance premium movement, and time-to-merge split by AI versus human PRs. Mid-cap 2026 deployments report 35–55% CVE backlog reduction and 40–60% audit prep reduction inside two quarters, more than offsetting the 8–12% review overhead.
For high-risk AI systems, fines run up to €15 million or 3% of global annual turnover. Article 14 mandates meaningful human oversight; Article 15 mandates accuracy, robustness, and cybersecurity. Vibe-coded SDLC posture — unscoped agents, surface-level review, no provenance — is functionally a failure of both articles by design.