Stripe recently published a post about its internal AI agents — Minions. The numbers are striking: over 1,000 pull requests per week, produced autonomously by AI agents, reviewed and merged by human engineers. These are not toy examples. They are production changes to one of the most important financial infrastructure companies in the world.
This is the new normal. Every major engineering organization is deploying AI agents to write code at scale. GitHub Copilot, Cursor, Devin, internal systems like Stripe’s — the velocity of code production has fundamentally changed.
The part that is not getting equal attention: the velocity of security testing has not changed at all.
the asymmetry problem
AI agents produce code at a rate that would have been inconceivable two years ago. A thousand PRs per week at one company. Multiplied across every engineering team now using AI coding tools, the global volume of new code being written and shipped has increased by an order of magnitude.
How does that code get security-tested? For most organizations, the answer is: mostly it does not. Some companies run static analysis in CI — tools like Semgrep or CodeQL that check for known patterns. A smaller number run periodic penetration tests, typically quarterly. An even smaller number have dedicated security engineers who manually review high-risk changes.
The math does not work. AI agents writing a thousand PRs per week, and humans reviewing them for security at the rate of maybe twenty per week, is not a sustainable equilibrium. The gap between code production and security review grows every day.
static analysis is necessary but not sufficient
This is not an argument against static scanners. They catch real bugs, and they belong in every CI pipeline. But they have a fundamental limitation: they match patterns; they do not understand intent.
Every CVE the lab surfaced during the npm audit work was in code that would pass static analysis cleanly. The node-forge certificate forgery: the code was syntactically correct, followed the library’s internal conventions, and had no pattern that a linter would flag. The bug was a logical error — checking a property only when its container was present, rather than treating absence as a failure. You cannot write a regex for that.
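To make the bug class concrete, here is a minimal sketch of the "absent container treated as success" pattern described above. This is illustrative only, not node-forge's actual code; the certificate shape and function names are invented for the example.

```javascript
// Illustrative sketch (NOT node-forge's real code): a check that only runs
// when the extension object exists, so omitting it entirely passes validation.
function verifyBasicConstraintsBuggy(cert) {
  // BUG: the inner check is guarded by the container's presence.
  // A forged cert that simply omits the extension skips the check.
  if (cert.extensions && cert.extensions.basicConstraints) {
    if (!cert.extensions.basicConstraints.cA) return false;
  }
  return true; // absence silently treated as success
}

function verifyBasicConstraintsFixed(cert) {
  // FIX: absence of the required extension is a failure, not a pass.
  const bc = cert.extensions && cert.extensions.basicConstraints;
  if (!bc) return false;
  return bc.cA === true;
}

// A "forged" certificate that omits the extension block entirely:
const forged = { extensions: null };
console.log(verifyBasicConstraintsBuggy(forged)); // true  (accepted!)
console.log(verifyBasicConstraintsFixed(forged)); // false (rejected)
```

No regex matches this: both versions are syntactically clean; only the intended meaning of "missing" differs.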
The mysql2 connection override: a URL parser that processes query parameters in the wrong order. The Uptime Kuma SSTI bypass: a fallback code path that skipped validation. The jsPDF XSS: string concatenation instead of DOM construction. Each one is a semantic issue that requires understanding what the code is supposed to do, not just what it does.
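The ordering bug class is easy to sketch. The example below is an invented illustration, not mysql2's actual parser: when query-string parameters are merged after explicit caller options, a crafted connection URL can override security-sensitive settings.

```javascript
// Illustrative sketch (NOT mysql2's real parser): merge order decides
// whether a URL's query string can override explicit configuration.
function parseConnectionUrlBuggy(url, callerOptions) {
  const u = new URL(url);
  const fromQuery = Object.fromEntries(u.searchParams);
  // BUG: query params are spread last, so they win over caller options.
  return { ...callerOptions, ...fromQuery };
}

function parseConnectionUrlFixed(url, callerOptions) {
  const u = new URL(url);
  const fromQuery = Object.fromEntries(u.searchParams);
  // FIX: explicit caller options always take precedence.
  return { ...fromQuery, ...callerOptions };
}

const url = 'mysql://db.example.com/app?ssl=false';
console.log(parseConnectionUrlBuggy(url, { ssl: 'true' }).ssl); // prints: false
console.log(parseConnectionUrlFixed(url, { ssl: 'true' }).ssl); // prints: true
```

Both versions parse the same URL and return the same keys; a pattern matcher sees nothing wrong. Only the merge order, which encodes intent, differs.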
This is where AI agents change the game. An LLM-powered security agent can read the code, understand the intended behavior, trace the data flow, and identify where the implementation diverges from secure design. It does what a human security researcher does — without the throughput constraint.
why verification changes everything
The largest waste of time in security is not finding vulnerabilities. It is triaging false positives. Every static scanner produces a pile of “possible” findings that turn out to be nothing. Security teams spend 80% of their time proving things are not broken. This is why most organizations do not run aggressive scanning — the signal-to-noise ratio is too low to be actionable.
Real attackers do not have this problem. They attempt to exploit something. If it works, it is real. If it does not, they move on. That is the workflow worth automating.
This is why pwnkit runs an agentic pipeline, not a single scan, and why the verification agent is the most important piece. The pipeline has four stages:
- Discover: map the attack surface (endpoints, system prompts, tool schemas, auth flows, data flows).
- Attack: run systematic test cases against the target (prompt injection, tool poisoning, data exfiltration, auth bypass).
- Verify: independently re-exploit every finding with a different agent and a fresh context; if it can't reproduce, the finding dies.
- Report: generate SARIF for GitHub Security, markdown for humans, JSON for automation, with full evidence chains.
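The four stages above can be sketched as a small orchestration loop. This is a hedged sketch of the architecture as described, not pwnkit's actual internals; the agent interfaces are hypothetical. The key property is that a finding only survives if an independent verification pass reproduces it.

```javascript
// Hypothetical agent interfaces, for illustration only.
async function runPipeline(target, { attackAgent, verifyAgent }) {
  const surface = await attackAgent.mapSurface(target);   // discover
  const candidates = await attackAgent.attack(surface);   // attack

  const confirmed = [];
  for (const finding of candidates) {                     // verify
    // Fresh context: the verifier re-runs the exploit on its own.
    const result = await verifyAgent.reproduce(target, finding);
    if (result.reproduced) {
      confirmed.push({ ...finding, evidence: result.evidence });
    }
    // Unreproduced findings are dropped, not downgraded to "maybe".
  }
  return confirmed;                                       // input to report stage
}

// Demo with stub agents: two candidates, only one reproducible.
const attackAgent = {
  mapSurface: async () => ['/api/chat'],
  attack: async () => [{ id: 'A', payload: 'x' }, { id: 'B', payload: 'y' }],
};
const verifyAgent = {
  reproduce: async (target, f) =>
    f.id === 'A' ? { reproduced: true, evidence: 'poc transcript' } : { reproduced: false },
};

runPipeline('https://example.test', { attackAgent, verifyAgent })
  .then(confirmed => console.log(confirmed.map(f => f.id))); // [ 'A' ]
```

The report stage then only ever sees findings that carry a reproduced exploit and its evidence.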
The verification agent does not trust the attack agent. It re-runs each exploit independently, with its own analysis of the target. If the attack agent says “prompt injection found” but the verification agent cannot reproduce it, the finding is killed. If a finding only works with a contrived input no real user would send, it gets flagged and downgraded.
This is what separates an agentic security tool from a scanner that produces a list of maybes. The output is not “these 47 things might be problems.” The output is “these 6 things are confirmed vulnerabilities, here is the proof for each one, here is how to fix them.”
the stripe parallel
What Stripe built with Minions is instructive. The agents do not just generate code — they operate within a structured pipeline. The agent produces a PR. A human reviews and approves. The system learns from feedback. The result is high-throughput, high-quality code production.
The same architecture applies to security testing. An AI agent produces a security assessment. A human reviews the findings. The system refines its approach based on what is confirmed versus what is noise. High-throughput, high-quality security analysis.
The critical difference is that in security, the verification step can be automated. A human is not needed to confirm that a vulnerability is real if there is a working proof of concept. The PoC is the confirmation. An agent that produces a working exploit has already done the verification a human reviewer would do — faster, more consistently, and with better documentation.
what this means for the industry
A shift in how security testing works is approaching:
- Every PR gets a security review. Not a linter pass. An actual security review by an AI agent that reads the diff, understands the context, and checks for vulnerability classes static analysis cannot detect. The cost is low enough — cents per review — to run on every commit.
- Continuous pentesting replaces quarterly assessments. Instead of hiring a pentest firm once a quarter, organizations run AI agents against their own systems continuously. The agents adapt as the codebase changes. New endpoints get tested the day they ship.
- Supply chain auditing becomes table stakes. Most teams currently rely on implicit trust in their npm dependencies. When an AI agent can audit a package in minutes for a few cents, that posture is hard to defend.
- The false-positive problem reduces. Verification-based scanning means every reported finding comes with proof. Security teams stop spending 80% of their time on triage and start spending it on remediation.
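The per-PR review in the first bullet implies a merge gate with one unusual property: only verified findings block. A minimal sketch of that gate logic, with an invented finding shape (this is not a real pwnkit API):

```javascript
// Hypothetical gate: findings come from an agent review of the PR diff.
// Only findings that carry a reproduced exploit can block the merge.
function gatePullRequest(findings) {
  // findings: [{ id, severity, verified }]
  const blocking = findings.filter(f => f.verified && f.severity !== 'low');
  return {
    pass: blocking.length === 0,
    blocking,
    note: blocking.length
      ? `${blocking.length} verified finding(s) must be fixed before merge`
      : 'no verified findings',
  };
}

const result = gatePullRequest([
  { id: 'F1', severity: 'high', verified: false }, // unverified: never blocks
  { id: 'F2', severity: 'high', verified: true },  // verified: blocks
]);
console.log(result.pass); // false
```

Unverified "maybes" are reported but never block, which is exactly what keeps the gate tolerable to run on every commit.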
this is not hypothetical
Three weeks of Claude Opus auditing npm packages produced 73 findings. The disclosed vulnerabilities were published as CVEs, GHSAs, and fixed-without-public-ID advisories, in packages with 55+ million weekly npm downloads. The vulnerabilities were real — certificate forgery, connection hijacking, server-side template injection, PDF injection, XSS. Each one was verified with a working proof of concept. Each one was responsibly disclosed and fixed by the maintainers.
That is what an AI agent can do when pointed at source code with a security researcher’s methodology. pwnkit is the open-source version of that workflow. Autonomous agents. Discover, attack, verify, report. Point it at a target, get back confirmed findings with evidence.
The age of agentic coding is here. Stripe’s Minions are writing a thousand PRs a week. Other companies are doing the same. The volume of code being produced by AI agents is growing exponentially.
Agentic security has to keep pace. Every AI-written PR should be AI-tested for security. Every dependency should be AI-audited before it enters the supply chain. Every AI/LLM app and MCP server should be AI-pentested before it goes to production.
The alternative is simple: attackers will use AI agents too, and they will not bother with responsible disclosure.
```shell
npx pwnkit-cli scan --target https://your-app.com/api/chat
```