The Agentic Engineering Checklist Karpathy Won't Publish
- The Karpathy Gap: Karpathy signaled the end of unreviewed AI code but left the implementation details to the community.
- The 21-Point Standard: A robust pipeline requires 21 specific checks mapped across 7 mandatory governance gates.
- Automation is Mandatory: These checklist items must be enforced in GitHub Actions or GitLab pipelines, not just documented in a wiki.
- SOC 2 Alignment: Skipping these checks exposes your enterprise to critical failures during SOC 2 CC8.1 (Change Management) audits.
- Pipeline Velocity: When automated correctly, this checklist reduces AI-introduced CVEs by up to 55% without crippling sprint velocity.
On X (formerly Twitter), Andrej Karpathy recently hinted at the strict oversight required for modern LLM-agent programming, but he stopped short of publishing the exact operational gates.
Today, we decode the agentic engineering workflow checklist that CTO teams use before merging AI code: a 21-point gate framework that industry leaders are quietly implementing, ready for your pipeline.
If you need the strategic background on why this shift from "vibe coding" occurred, read our foundational agentic engineering CTO playbook.
Teams transitioning away from the deprecated methods detailed in our legacy guide must immediately adopt these formalized checks to prevent catastrophic CVE spikes.
Decoding the 21-Point Agentic Engineering Workflow Checklist
The standard "looks good to me" PR approval is a liability in 2026. AI models write syntactically perfect code that often harbors deep, contextual security flaws.
To combat this, CTOs must enforce a rigorous checklist. We have organized the 21 critical checks into three operational phases that together cover the seven governance gates of agentic engineering.
Phase 1: Intent & Sandbox Constraints
Before the AI agent generates a single line of code, the foundation must be secure. Free-form chat prompts are a common root cause of hallucinated APIs and untraceable provenance.
- 1. Intent Contract Captured: A structured spec must be documented in the ticket.
- 2. Acceptance Criteria Defined: Clear pass/fail conditions are explicitly listed.
- 3. Security Constraints Noted: Required sanitization and auth boundaries are documented.
- 4. Explicit Blast-Radius Set: The agent's allowable impact zone is defined.
- 5. Sandbox Initialized: The agent environment is isolated from production.
- 6. Filesystem Limits Enforced: Directory write-access is strictly limited.
- 7. Network Allow-List Active: Outbound network calls are restricted to approved endpoints.
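The Phase 1 items become machine-checkable once the intent contract is treated as data rather than free text. The sketch below is one possible shape, not a standard: the `IntentContract` class, its field names, and the example paths and hosts are all hypothetical. It covers items 1 to 4, 6, and 7 at the policy level only; the actual isolation in item 5 has to come from the container or VM runtime.

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath
from urllib.parse import urlparse


@dataclass
class IntentContract:
    """Hypothetical structured spec captured in the ticket before the agent runs."""
    ticket_id: str
    acceptance_criteria: list[str]          # item 2
    security_constraints: list[str]         # item 3
    writable_dirs: list[str]                # items 4 and 6: blast radius
    allowed_hosts: list[str] = field(default_factory=list)  # item 7

    def problems(self) -> list[str]:
        """Return reasons the contract is incomplete; an empty list means it passes."""
        issues = []
        if not self.acceptance_criteria:
            issues.append("no acceptance criteria")
        if not self.security_constraints:
            issues.append("no security constraints")
        if not self.writable_dirs:
            issues.append("no blast radius set")
        return issues

    def may_write(self, path: str) -> bool:
        """True if a repo-relative path sits inside an allowed directory."""
        target = PurePosixPath(path)
        # Reject absolute paths and any attempt to climb out with "..".
        if target.is_absolute() or ".." in target.parts:
            return False
        return any(target.is_relative_to(d) for d in self.writable_dirs)

    def may_call(self, url: str) -> bool:
        """True if the URL's host is on the outbound allow-list."""
        host = urlparse(url).hostname
        return host is not None and host in self.allowed_hosts


contract = IntentContract(
    ticket_id="T-1",
    acceptance_criteria=["export returns HTTP 200 for a valid invoice"],
    security_constraints=["sanitize invoice_id before querying"],
    writable_dirs=["src/billing"],
    allowed_hosts=["api.example.com"],
)
```

A sandbox wrapper could call `may_write` and `may_call` before honoring each agent action, and a ticket hook could refuse to start the agent while `problems()` is non-empty.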
Phase 2: Execution & Diff-Level Validation
Once the code is generated, the review process must be fundamentally restructured. This phase works hand-in-hand with a robust review protocol.
- 8. Diff-to-Contract Match: Reviewers check the diff against the Phase 1 Intent Contract.
- 9. Logic Hallucination Check: Manual verification that no non-existent libraries were invoked.
- 10. Override Protocol Ready: Clear documentation if the human rejects the AI's logic.
- 11. Adversarial Test Synthesized: CI/CD automatically generates hostile tests for the new code.
- 12. XSS Payload Simulated: The pipeline actively attempts cross-site scripting bypasses.
- 13. SSRF & IDOR Checks Passed: Deep authorization edge-cases are validated.
- 14. Prompt-Injection Resilient: Downstream agent handlers are tested against malicious prompt overrides.
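Part of item 9 can be assisted by tooling. As a minimal sketch, assuming the generated code is Python and the check runs in an environment with the project's declared dependencies installed, the hypothetical helper below lists imported top-level modules that cannot be resolved. It only detects packages that do not exist in the environment; it does not prove that the functions called on a real package exist, so it supplements rather than replaces the reviewer.

```python
import ast
import importlib.util


def unresolved_imports(source: str) -> list[str]:
    """Return top-level module names imported by `source` that cannot be found."""
    tree = ast.parse(source)
    roots = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            roots.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # level == 0 skips relative imports such as "from . import x".
            roots.add(node.module.split(".")[0])

    missing = []
    for name in sorted(roots):
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            found = False
        if not found:
            missing.append(name)
    return missing


generated = (
    "import json\n"
    "from os import path\n"
    "import totally_made_up_pkg_xyz\n"
    "from . import sibling\n"
)
print(unresolved_imports(generated))
```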
Phase 3: Provenance & Production Readiness
The final phase transforms the code into a legally defensible, auditable artifact. That evidence trail helps reduce exposure under regulations such as the EU AI Act, whose Article 15 covers accuracy, robustness, and cybersecurity for high-risk AI systems.
- 15. Model Version Tagged: The specific LLM used for generation is recorded.
- 16. Prompt Template Logged: The exact prompt structure is attached to the commit.
- 17. Reviewer Identity Cryptographically Signed: Immutable proof of human oversight.
- 18. Dependency SBOM Updated: Any new packages introduced by the agent are cataloged.
- 19. Performance Regression Scored: Latency and compute metrics are baseline-checked.
- 20. Rollback Complexity Graded: The difficulty of reverting the AI code is evaluated.
- 21. Post-Merge Telemetry Active: The commit is flagged for 30-day production monitoring.
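Items 15 to 17 amount to building a tamper-evident record per change. The sketch below shows the idea using only the standard library. The function names and record fields are hypothetical, the prompt template is stored as a hash here rather than attached in full, and HMAC with a shared key is a stand-in chosen to keep the example self-contained: it proves integrity, not reviewer identity. A real pipeline would more likely rely on asymmetric signatures such as signed commits.

```python
import hashlib
import hmac
import json


def build_provenance(model_version: str, prompt_template: str,
                     reviewer: str, diff: str) -> dict:
    """Assemble a provenance record for one AI-generated change."""
    return {
        "model_version": model_version,                                   # item 15
        "prompt_template_sha256": hashlib.sha256(
            prompt_template.encode()).hexdigest(),                        # item 16
        "diff_sha256": hashlib.sha256(diff.encode()).hexdigest(),
        "reviewer": reviewer,                                             # item 17
    }


def sign(record: dict, key: bytes) -> str:
    """Sign a canonical JSON encoding of the record (stand-in for a real signature)."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":")).encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify(record: dict, signature: str, key: bytes) -> bool:
    """Check the signature in constant time."""
    return hmac.compare_digest(sign(record, key), signature)


key = b"demo-key-not-for-production"  # assumption: real keys come from a secret store
record = build_provenance(
    model_version="example-model-2026-01",
    prompt_template="Implement {ticket} within {writable_dirs}.",
    reviewer="jane@example.com",
    diff="+ def export_invoice(): ...",
)
signature = sign(record, key)
```

Serializing with `sort_keys=True` keeps the signed payload stable regardless of the order in which fields were added, which matters when the record is re-verified later by a different tool.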
Automating the Gates in Your CI/CD
A CTO's agentic engineering workflow checklist is useless if it exists only in a Notion document. Engineers will bypass manual steps under sprint pressure.
You must codify these 21 points directly into your platform engineering control plane. Use GitHub Actions to block any merge that lacks a cryptographic provenance tag or fails an automated adversarial test.
By enforcing the checklist as code, you remove the manual steps where human error creeps in and automatically generate the evidence trail your auditors will demand.
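One way to express such a gate is a small script that a CI job runs against the merge candidate and that exits non-zero when evidence is missing. The sketch below assumes provenance is carried as commit-message trailers. The trailer names (`AI-Model`, `AI-Prompt-Template`, `AI-Reviewed-By`) are hypothetical, and the parser is deliberately simplified compared with Git's own trailer rules.

```python
import re
import sys

# Hypothetical trailer names; pick whatever convention your team standardizes on.
REQUIRED_TRAILERS = ("AI-Model", "AI-Prompt-Template", "AI-Reviewed-By")

TRAILER_RE = re.compile(r"^([A-Za-z][A-Za-z0-9-]*):[ \t]+(\S.*)$")


def parse_trailers(message: str) -> dict[str, str]:
    """Read `Key: value` lines from the last paragraph of a commit message."""
    paragraphs = [p for p in message.strip().split("\n\n") if p.strip()]
    if len(paragraphs) < 2:  # subject line only, so no trailer block
        return {}
    trailers = {}
    for line in paragraphs[-1].splitlines():
        match = TRAILER_RE.match(line.strip())
        if match:
            trailers[match.group(1)] = match.group(2).strip()
    return trailers


def missing_trailers(message: str, required=REQUIRED_TRAILERS) -> list[str]:
    """List required trailers that the commit message does not carry."""
    present = parse_trailers(message)
    return [name for name in required if name not in present]


def gate(message: str, adversarial_tests_passed: bool) -> int:
    """Return a process exit code: 0 passes the check, 1 fails it."""
    missing = missing_trailers(message)
    for name in missing:
        print(f"missing provenance trailer: {name}", file=sys.stderr)
    if not adversarial_tests_passed:
        print("adversarial tests failed", file=sys.stderr)
    return 0 if not missing and adversarial_tests_passed else 1


example_message = (
    "Add invoice export\n\n"
    "Implements ticket T-1.\n\n"
    "AI-Model: example-model-2026-01\n"
    "AI-Prompt-Template: sha256:abc123\n"
    "AI-Reviewed-By: Jane Doe <jane@example.com>\n"
)
```

In a pipeline, the job would pass the real commit message and test result to `gate` and hand its return value to `sys.exit`. The script alone does not block anything: the merge is only prevented when the platform treats that job as a required check on the protected branch.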
Final Thoughts and Next Steps
The transition to agent-driven development requires a fundamental upgrade to your engineering governance.
Adopting the agentic engineering workflow checklist CTO leaders use is the fastest way to stabilize your pipeline and prepare for incoming regulations.
Do not wait for an audit failure or a CVE outbreak; begin codifying these 21 points into your CI/CD pipelines this week.
Frequently Asked Questions (FAQ)
The 21 items span seven distinct gates: intent capture, sandboxed execution, diff-level review, adversarial testing, provenance tagging, readiness scoring, and telemetry. Each gate requires specific validation checks to ensure AI-generated code is secure and compliant.
Karpathy emphasized the critical need for "greater oversight and scrutiny" regarding LLM agents. However, he withheld the granular, step-by-step pipeline configurations—like explicit diff-to-contract matching and adversarial test synthesis—that actively enforce this oversight in production.
Staff engineers primarily execute the checklist, ensuring individual PRs meet the 21-point standard. Principal engineers design and maintain the automation of the checklist itself, continuously updating adversarial tests and sandbox constraints as underlying AI models evolve.
The checklist must be enforced inside tools like GitHub Actions or GitLab CI. Manual checklists fail under pressure. Automated CI/CD gates ensure that PRs lacking cryptographic provenance or failing adversarial tests are blocked from merging.
For SOC 2 Type II (specifically CC8.1 Change Management), the most critical steps are structured intent capture, diff-level human review logs, and cryptographic SBOM/provenance tagging. These provide the required evidence that unauthorized changes were not introduced.
While initial PRs may take 10-15% longer as engineers adapt to structured intent capture, overall velocity stabilizes. Reallocating human review time toward adversarial validation actually decreases total time-to-merge by eliminating massive post-merge rework cycles.
The checklist's logic must live as code in the repository's CI/CD configuration files (e.g., .github/workflows). The underlying philosophy and escalation protocols should be documented in the engineering runbook for onboarding and audit reference.
Industry whispers suggest Anthropic's internal gates rely heavily on secondary "judge models" explicitly trained to catch edge-case hallucinations in primary agent outputs, alongside extremely rigid, dynamic sandbox constraints tailored to specific code modules.
CTOs should formally review and update the checklist quarterly. As foundational models become more capable, their failure modes shift, requiring updated adversarial test synthesis parameters and revised sandbox restrictions to address newly discovered vulnerability patterns.
Startups often struggle with the overhead of dedicated post-merge telemetry and maintaining comprehensive adversarial synthesis suites. However, intent capture, sandboxed execution, and diff-level review are universally applicable and critical regardless of organizational size.