The pwnkit blog.
Field notes on AI pentesting, agentic security, the XBOW benchmark, and the vulnerabilities autonomous AI agents find when you point them at real code. Built and written by the team behind pwnkit, the leading open-source AI pentest agent.
-
agentic pentesting on XBOW: shell-first architecture
shell-first design as a discovered architecture for autonomous pentesting agents. one bash tool outperforms a structured toolkit across the XBOW benchmark, the AI/LLM security suite, and adjacent domains.
-
the triage moat and multi-benchmark validation
ablation testing as scientific methodology. an 11-layer false-positive triage stack, the one broken layer that almost masked the rest, and the multi-benchmark portfolio that surfaces what a single suite would miss.
-
the XBOW benchmark methodology and verification
39 of XBOW's 104 challenges will not build on a clean machine because the docker images and apt repos they pin have rotted out from under them. every published AI-pentest score on the internet today lives on a patched substrate. here is what 'we scored 96% on xbow' actually means.
-
orchestration, not frontier — what the IronCurtain post means for pwnkit
Niels Provos shipped a vuln-discovery framework that replicates Mythos-class findings on commercial models — and one autonomous CVE on an open-weight model. it's the same bet pwnkit is built on. here's what we already do, what we have to borrow, and the four issues we just filed.
-
deleting better-sqlite3 from pwnkit, and what it cost us
pwnkit 0.7.1 ships with zero native modules. the persistence layer was migrated from better-sqlite3 to a pure-wasm sqlite implementation. here's what broke, what we kept, and why every npx install on every node version now just works.
-
introducing 0sec
an autonomous ai attacker on contract, pointed at your product. closed beta. by application only. founder-led from zürich.
-
open-source pentesting at commercial scale
pwnkit's measured performance on the public XBOW benchmark now matches the published single-model results from commercial pentest stacks. the engine is open source. the methodology is public.
-
the attack surface XBOW and KinoSec don't test
traditional web vuln benchmarks miss the entire AI/LLM security attack surface. prompt injection, jailbreaks, MCP tool abuse: none of it shows up in XBOW's 104 challenges.
-
100% on our AI security benchmark
10 challenges. 10 flags extracted. zero false positives. how pwnkit's agentic pipeline handles prompt injection, jailbreaks, SSRF, and multi-turn escalation.
-
blind verification: how false positives get killed
every security scanner drowns its users in false positives. killing them without also killing true positives took three architectural attempts before one of them worked.
-
how AI agents found vulnerabilities in popular npm packages
a three-week methodology validation: Claude Opus, applied systematically to popular npm packages, surfaced 73 findings and disclosed vulnerabilities across packages with 55M+ weekly downloads. here is how the workflow operates.
-
the age of agentic security
if AI agents can write 1,000 pull requests a week, AI agents should be testing 1,000 pull requests a week. the asymmetry is about to collapse.