The Agentic Engineering Checklist Karpathy Won't Publish

  • The Karpathy Gap: Karpathy signaled the end of unreviewed AI code but left the implementation details to the community.
  • The 21-Point Standard: A robust pipeline requires 21 specific checks mapped across 7 mandatory governance gates.
  • Automation is Mandatory: These checklist items must be enforced in GitHub Actions or GitLab pipelines, not just documented in a wiki.
  • SOC 2 Alignment: Skipping these checks exposes your enterprise to critical failures during SOC 2 CC8.1 (Change Management) audits.
  • Pipeline Velocity: When automated correctly, this checklist reduces AI-introduced CVEs by up to 55% without crippling sprint velocity.

On X (formerly Twitter), Andrej Karpathy recently hinted at the strict oversight required for modern LLM-agent programming, but he stopped short of publishing the exact operational gates.

Today, we decode the agentic engineering workflow checklist CTO teams use before merging AI code: a 21-point gate framework that industry leaders are quietly implementing, ready for your pipeline.

If you need the strategic background on why this shift from "vibe coding" occurred, read our foundational agentic engineering CTO playbook.

Teams transitioning away from the deprecated methods detailed in our legacy guide must immediately adopt these formalized checks to prevent catastrophic CVE spikes.

Decoding the 21-Point Agentic Engineering Workflow Checklist

The standard "looks good to me" PR approval is a liability in 2026. AI models write syntactically perfect code that often harbors deep, contextual security flaws.

To combat this, CTOs must enforce a rigorous checklist. We have organized the 21 critical checks into three operational phases that align with the core gates of agentic engineering.

Phase 1: Intent & Sandbox Constraints

Before the AI agent generates a single line of code, the foundation must be secure. Free-form chat prompts are the root cause of hallucinated APIs and untraceable provenance.

  • 1. Intent Contract Captured: A structured spec must be documented in the ticket.
  • 2. Acceptance Criteria Defined: Clear pass/fail conditions are explicitly listed.
  • 3. Security Constraints Noted: Required sanitization and auth boundaries are documented.
  • 4. Explicit Blast-Radius Set: The agent's allowable impact zone is defined.
  • 5. Sandbox Initialized: The agent environment is isolated from production.
  • 6. Filesystem Limits Enforced: Directory write-access is strictly limited.
  • 7. Network Allow-List Active: Outbound network calls are restricted to approved endpoints.
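As a rough illustration of checks 1 through 4 and 6, the intent contract can be modeled as a small data structure whose blast radius is compared against the files an agent actually touched. This is a minimal sketch, not a prescribed format: the `IntentContract` fields, the `paths_outside_blast_radius` helper, and the example paths are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class IntentContract:
    """Hypothetical structured spec attached to a ticket (checks 1-4)."""
    ticket_id: str
    acceptance_criteria: tuple[str, ...]
    security_constraints: tuple[str, ...]
    allowed_paths: tuple[str, ...]  # blast radius: directories the agent may modify


def paths_outside_blast_radius(contract: IntentContract, changed_files: list[str]) -> list[str]:
    """Return every changed file that falls outside the contract's allowed directories."""
    allowed = [PurePosixPath(p) for p in contract.allowed_paths]
    violations = []
    for name in changed_files:
        path = PurePosixPath(name)
        # Reject ".." segments outright, then require the path to sit under an allowed root.
        if ".." in path.parts or not any(path.is_relative_to(root) for root in allowed):
            violations.append(name)
    return violations
```

A pipeline step could feed this function the list of files in the agent's diff and fail the run when the returned list is non-empty. Real sandbox enforcement (checks 5 to 7) still belongs at the container or OS level; this only verifies the result after the fact.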

Phase 2: Execution & Diff-Level Validation

Once the code is generated, the review process must be fundamentally restructured. This phase works hand-in-hand with a robust review protocol.

  • 8. Diff-to-Contract Match: Reviewers check the diff against the Phase 1 Intent Contract.
  • 9. Logic Hallucination Check: Manual verification that no non-existent libraries were invoked.
  • 10. Override Protocol Ready: Clear documentation if the human rejects the AI's logic.
  • 11. Adversarial Test Synthesized: CI/CD automatically generates hostile tests for the new code.
  • 12. XSS Payload Simulated: The pipeline actively attempts cross-site scripting bypasses.
  • 13. SSRF & IDOR Checks Passed: Deep authorization edge-cases are validated.
  • 14. Prompt-Injection Resilient: Downstream agent handlers are tested against malicious prompt overrides.
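Part of check 9 can be assisted by tooling before a human looks at the diff. The sketch below, which assumes the generated code is Python, lists imported top-level modules that are not in a set of declared dependencies. The `undeclared_imports` helper and the `declared` set are hypothetical; a team would need to build that set itself, for example from the standard library plus its lockfile, keeping in mind that an import name can differ from the package's distribution name.

```python
import ast


def undeclared_imports(source: str, declared: set[str]) -> set[str]:
    """Return top-level module names imported by `source` that are not in `declared`."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # level == 0 means an absolute import; relative imports are project-local.
            found.add(node.module.split(".")[0])
    return found - declared
```

A non-empty result does not prove a hallucination, only that the reviewer should confirm the package exists and is approved. It supplements the manual verification step and does not replace it.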

Phase 3: Provenance & Production Readiness

The final phase transforms the code into a legally defensible, auditable artifact. This evidence trail helps reduce exposure to fines under regulations such as the EU AI Act, whose Article 15 sets requirements for accuracy, robustness, and cybersecurity.

  • 15. Model Version Tagged: The specific LLM used for generation is recorded.
  • 16. Prompt Template Logged: The exact prompt structure is attached to the commit.
  • 17. Reviewer Identity Cryptographically Signed: Immutable proof of human oversight.
  • 18. Dependency SBOM Updated: Any new packages introduced by the agent are cataloged.
  • 19. Performance Regression Scored: Latency and compute metrics are baseline-checked.
  • 20. Rollback Complexity Graded: The difficulty of reverting the AI code is evaluated.
  • 21. Post-Merge Telemetry Active: The commit is flagged for 30-day production monitoring.
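One lightweight way to carry checks 15 to 17 with the commit is as trailer lines in the commit message, which a gate can then verify are present. The following is a sketch under that assumption: the trailer names in `REQUIRED_TRAILERS` are hypothetical, and the parser is deliberately simplified compared with Git's own trailer handling.

```python
REQUIRED_TRAILERS = ("Model-Version", "Prompt-Template-SHA256", "Reviewed-By")  # hypothetical names


def parse_trailers(commit_message: str) -> dict[str, str]:
    """Parse 'Key: value' lines from the final paragraph of a commit message."""
    paragraphs = [p for p in commit_message.strip().split("\n\n") if p.strip()]
    if not paragraphs:
        return {}
    trailers = {}
    for line in paragraphs[-1].splitlines():
        key, sep, value = line.partition(":")
        # Accept only single-token keys so ordinary prose is not mistaken for a trailer.
        if sep and key.strip() and " " not in key.strip():
            trailers[key.strip()] = value.strip()
    return trailers


def missing_provenance(commit_message: str) -> list[str]:
    """Return the required trailer names that are absent or empty."""
    trailers = parse_trailers(commit_message)
    return [name for name in REQUIRED_TRAILERS if not trailers.get(name)]
```

Note that a `Reviewed-By` trailer is only a text label. The cryptographic guarantee in check 17 would have to come from a separately verified signature on the commit or review, which this sketch does not attempt.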

Automating the Gates in Your CI/CD

A CTO's agentic engineering workflow checklist is useless if it exists only in a Notion document. Engineers will bypass manual steps under sprint pressure.

You must codify these 21 points directly into your platform engineering control plane. Run the checks in GitHub Actions and mark them as required status checks on protected branches, so that any merge lacking a cryptographic provenance tag or failing an automated adversarial test is blocked.

By enforcing the checklist as code, you eliminate human error and automatically generate the exact evidence trail your auditors will demand.
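To make "checklist as code" concrete, here is a minimal sketch of a gate runner that a CI job could invoke. It assumes each checklist item is wrapped as a function returning a list of failure messages; the `run_gates` name and the gate labels in the usage note are illustrative, not part of any CI product.

```python
from typing import Callable

Check = Callable[[], list[str]]  # each check returns a list of failure messages


def run_gates(checks: dict[str, Check]) -> int:
    """Run every check, print a line per result, and return a process exit code."""
    exit_code = 0
    for name, check in checks.items():
        failures = check()
        if failures:
            exit_code = 1
            for message in failures:
                print(f"[FAIL] {name}: {message}")
        else:
            print(f"[ok]   {name}")
    return exit_code
```

A wrapper script would pass the result to `sys.exit`, for example `sys.exit(run_gates({"provenance-tag": check_provenance, "adversarial-tests": check_adversarial}))`, where both check functions are placeholders for a team's own implementations. A non-zero exit fails the CI job, and the printed lines double as a basic evidence log for auditors.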

Final Thoughts and Next Steps

The transition to agent-driven development requires a fundamental upgrade to your engineering governance.

Adopting the agentic engineering workflow checklist CTO leaders use is the fastest way to stabilize your pipeline and prepare for incoming regulations.

Do not wait for an audit failure or a CVE outbreak; begin codifying these 21 points into your CI/CD pipelines this week.

About the Author: Sanjay Saini

Sanjay Saini is an Enterprise AI Strategy Director specializing in digital transformation and AI ROI models. He covers high-stakes news at the intersection of leadership and sovereign AI infrastructure.


Frequently Asked Questions (FAQ)

1. What 21 items belong on a CTO's agentic engineering workflow checklist for 2026?

The 21 items span seven distinct gates: intent capture, sandboxed execution, diff-level review, adversarial testing, provenance tagging, readiness scoring, and telemetry. Each gate requires specific validation checks to ensure AI-generated code is secure and compliant.

2. Which agentic engineering checklist items did Karpathy hint at on X but never publish in full?

Karpathy emphasized the critical need for "greater oversight and scrutiny" regarding LLM agents. However, he withheld the granular, step-by-step pipeline configurations (such as explicit diff-to-contract matching and adversarial test synthesis) that actively enforce this oversight in production.

3. How does an agentic engineering workflow checklist differ for staff versus principal engineers?

Staff engineers primarily execute the checklist, ensuring individual PRs meet the 21-point standard. Principal engineers design and maintain the automation of the checklist itself, continuously updating adversarial tests and sandbox constraints as underlying AI models evolve.

4. Can the agentic engineering workflow checklist be enforced inside GitHub Actions or only manually?

It must be enforced inside tools like GitHub Actions or GitLab CI. Manual checklists fail under pressure. Automated CI/CD gates ensure that PRs lacking cryptographic provenance or failing adversarial tests cannot be merged.

5. Which steps on the agentic engineering workflow checklist are non-negotiable for SOC 2 audits?

For SOC 2 Type II (specifically CC8.1 Change Management), the most critical steps are structured intent capture, diff-level human review logs, and cryptographic SBOM/provenance tagging. These provide the required evidence that unauthorized changes were not introduced.

6. How long does an average pull request take when the agentic engineering checklist is applied?

While initial PRs may take 10-15% longer as engineers adapt to structured intent capture, overall velocity stabilizes. Reallocating human review time toward adversarial validation actually decreases total time-to-merge by eliminating massive post-merge rework cycles.

7. Should the agentic engineering workflow checklist live in the repo, the wiki, or the runbook?

The checklist's logic must live as code in the repository's CI/CD configuration files (e.g., .github/workflows). The underlying philosophy and escalation protocols should be documented in the engineering runbook for onboarding and audit reference.

8. What does Anthropic's internal agentic engineering checklist reportedly include that public ones don't?

Industry whispers suggest Anthropic's internal gates rely heavily on secondary "judge models" explicitly trained to catch edge-case hallucinations in primary agent outputs, alongside extremely rigid, dynamic sandbox constraints tailored to specific code modules.

9. How often should a CTO update the agentic engineering workflow checklist as models change?

CTOs should formally review and update the checklist quarterly. As foundational models become more capable, their failure modes shift, requiring updated adversarial test synthesis parameters and revised sandbox restrictions to address newly discovered vulnerability patterns.

10. Which agentic engineering checklist items break down at startup scale versus enterprise scale?

Startups often struggle with the overhead of dedicated post-merge telemetry and maintaining comprehensive adversarial synthesis suites. However, intent capture, sandboxed execution, and diff-level review are universally applicable and critical regardless of organizational size.