
the triage moat and multi-benchmark validation

ablation testing as scientific methodology. an 11-layer false-positive triage stack, the one broken layer that almost masked the rest, and the multi-benchmark portfolio that surfaces what a single suite would miss.

The default reaction to a bad ablation result is to ship a fix. The harder discipline is to re-run, suspect the measurement before the system, and let single-run findings stay single-run findings until a second pass replicates them. This post documents an ablation cycle where that discipline made the difference between turning off something that works and turning off the one thing that does not — and the broader multi-benchmark portfolio that gave the result its meaning.

why a moat exists at all

pwnkit v0.6 shipped with an 11-layer triage pipeline internally referred to as “the moat.” The marketing claim attached to it was: false-positive rate down from approximately 50% to under 5%, comparable to Endor Labs’ 95% and Semgrep Assistant’s 96%. The directory in the codebase was literally named moat/.

The layers each address a separate failure mode in agentic finding generation: proof-of-vulnerability gating, reachability analysis, multimodal cross-checking, adversarial debate between agents, memory-aware deduplication, exploit graph search (egats), and consensus voting, among others. None of them is novel in isolation — comparable patterns appear across Semgrep Assistant, Endor Labs, and academic work on agentic verification. The bet is in the composition.

the ablation that almost flipped the wrong switch

Stubborn-14 — the 14 hardest XBOW challenges in the suite — was the smallest signal-rich slice available. A single-attempt ablation produced these numbers:

| profile | flags on stubborn-14 |
| --- | --- |
| baseline (no triage) | 4 |
| moat enabled | 0 |

Read at face value, the moat cost four flags. The internal debate was whether to ship a release that turned it off by default. The decision instead was to re-run, this time at limit=50 (the full benchmark slice the moat was designed for), with feature-flagged single-layer isolation runs so the contribution of each layer was independently measurable.

Twenty-one runs. $300 in model spend. Six hours.
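For concreteness, here is a minimal sketch of what a feature-flagged isolation matrix amounts to. The runBenchmark entry point and its option names are hypothetical stand-ins, not pwnkit's real flag plumbing; the layer names match the isolation table further down.

```ts
// Hypothetical sketch of the single-layer isolation matrix. The
// runBenchmark entry point and its option names are stand-ins for
// pwnkit's real feature-flag plumbing, which is not shown here.
const LAYERS = [
  "pov", "reachability", "multimodal",
  "debate", "memories", "egats", "consensus",
] as const;

type Layer = (typeof LAYERS)[number];

interface RunResult {
  flags: number;    // challenges solved
  findings: number; // findings surviving triage
  costUsd: number;  // model spend for the run
}

// Assumed entry point: one benchmark slice, a chosen set of layers enabled.
declare function runBenchmark(opts: {
  suite: string;
  limit: number;
  layers: readonly Layer[];
}): Promise<RunResult>;

async function isolationMatrix(): Promise<Map<string, RunResult>> {
  const results = new Map<string, RunResult>();
  // Baseline first: the default profile with no extra triage layers.
  results.set("default", await runBenchmark({ suite: "xbow-stubborn", limit: 14, layers: [] }));
  // Then one run per layer, so each delta is measured against the same baseline.
  for (const layer of LAYERS) {
    results.set(`+${layer}`, await runBenchmark({ suite: "xbow-stubborn", limit: 14, layers: [layer] }));
  }
  return results;
}
```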

what the larger ablation showed

White-box XBOW, limit=50:

| profile | flags | findings | cost | $/flag |
| --- | --- | --- | --- | --- |
| none | 43/50 | 67 | $14 | $0.33 |
| no-triage | 44/50 | 67 | $17 | $0.39 |
| moat-only | 41/50 | 25 | $27 | $0.66 |
| moat | 41/50 | 25 | $22 | $0.53 |

The moat cut findings 63% (67 → 25) at a cost of 2 flags (44 → 41). A Pareto tradeoff, not a regression. Whether it is a good one depends on whether the downstream user wants fewer findings or more flags.

Black-box XBOW, limit=25 — where the agent has no source-code access:

| profile | flags | findings | cost | $/flag |
| --- | --- | --- | --- | --- |
| none | 18/25 | 27 | $14 | $0.76 |
| no-triage | 19/25 | 34 | $10 | $0.55 |
| moat-only | 18/25 | 13 | $11 | $0.62 |
| moat | 19/25 | 14 | $10 | $0.53 |

Strict Pareto dominance in black-box. More flags, fewer findings, cheaper per flag. The structural reason is plausible: when the agent has no source, triage layers add value by re-checking noisy external-signal findings. When the agent has source, it generates high-confidence exploits and triage layers second-guess a confident agent.

the one broken layer

The single-layer isolation runs surfaced the actual cause of the stubborn-14 zero:

| layer added to default | flags on stubborn-14 | delta | $/flag |
| --- | --- | --- | --- |
| default | 2/14 | | $3.62 |
| +pov | 4/14 | +2 | $2.39 |
| +reachability | 5/14 | +3 | $1.61 |
| +multimodal | 3/14 | +1 | $2.52 |
| +debate | 5/14 | +3 | $2.65 |
| +memories | 4/14 | +2 | $3.35 |
| +egats | 1/14 | −1 | $15.93 |
| +consensus | 3/14 | +1 | $2.67 |

Six of seven layers help. egats loses a flag and, at $15.93 per flag, costs four to five times the next-most-expensive layer and nearly ten times the best. When the full moat runs, egats prunes exploration branches that other layers would have used. The interaction is multiplicatively destructive on hard challenges.

This is the full explanation of the original “moat catastrophically regresses” finding. It was not the moat. It was one layer of the moat — and the layer was disabled in the default profile the same afternoon. The code stays in the tree, gated off. reachability is the standout in the other direction: +3 flags at $1.61 per flag, the best cost-per-flag of any layer in isolation.

the correction of the correction

A re-run of the full white-box matrix against the commit that disabled egats produced:

| profile | batch 1 | batch 2 | delta |
| --- | --- | --- | --- |
| none | 43/50 | 44/50 | +1 |
| no-triage | 44/50 | 43/50 | −1 |
| moat-only | 41/50 | 42/50 | +1 |
| moat | 41/50 | 42/50 | +1 |

Removing egats: +1 flag, −25% cost on moat profiles. All four profiles within a 1-flag band. The gap is noise.

A second batch 1 finding had claimed that “stable features” (early-stop, script templates, progress handoff) caused npm-bench FPR to climb from 0.11 to 0.19 on 27 safe packages. The batch 2 default got 0.11, matching batch 1’s none. The 0.19 was a 2-package swing on a 27-package sample. The conclusion: 27 safe packages is not enough for single-run FPR conclusions. A finding that does not replicate 12 hours later on the same code is noise, not signal. That sentence is now the lab’s internal rule for any ablation finding on a small slice.
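The "not enough" claim is checkable. A standard two-proportion z-test on the raw counts (a quick sketch, not part of the codebase) puts the 3-versus-5 false-positive gap well inside sampling noise:

```ts
// Two-proportion z-test for the npm-bench FPR swing: 3/27 vs 5/27
// false positives. A z-score well under 1.96 means the gap is
// indistinguishable from sampling noise at the 95% level.
function twoProportionZ(fp1: number, fp2: number, n: number): number {
  const p1 = fp1 / n;
  const p2 = fp2 / n;
  const pooled = (fp1 + fp2) / (2 * n);
  const se = Math.sqrt(pooled * (1 - pooled) * (2 / n));
  return Math.abs(p1 - p2) / se;
}

console.log(twoProportionZ(3, 5, 27).toFixed(2)); // ≈ 0.77, far below 1.96
```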

what replicated, what did not

Replicated across both batches:

  • egats is the one broken layer. Removing it improved the moat by +1 flag at −25% cost.
  • The moat cuts findings ~60% (67 → 25, 72 → 27).
  • 100% recall on npm-bench across every profile, both batches.
  • Black-box moat strictly dominates the baseline (37/50 at limit=50).

Did not replicate:

  • “The moat costs 2 flags on white-box.” After egats, the gap is 0–2 flags. Noise at this sample size.
  • “Stable features cause npm-bench FPR.” Batch 2 default got 0.11, matching batch 1 none.

The honest framing is that no single static triage policy wins on all three slices simultaneously. no-triage wins white-box by raw flags. moat wins black-box in strict Pareto terms. none wins npm-bench on FPR. This is the direct motivation for learned per-finding routing — a classifier that picks which layers to run based on the finding's features, not a flag the operator sets once. The per-layer telemetry generated during this ablation is the training data. The v2 dataset has 1514 rows. The architecture draws on VulnBERT (hybrid handcrafted features + neural embeddings, 91.4% recall at 5.9% FPR on Linux kernel commits), starting with XGBoost on the 45-feature vector as the simplest first cut.
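A sketch of the shapes involved, with illustrative field names rather than the real triage-dataset-v2.jsonl schema; the router shown is a placeholder threshold policy standing in for the XGBoost first cut:

```ts
// Sketch of the shapes behind per-finding routing. Field names are
// illustrative; the real triage-dataset-v2.jsonl schema and the
// 45-feature extraction are not reproduced here.
interface TriageRow {
  findingId: string;
  features: number[];                                // the 45-feature vector
  layerVerdicts: Record<string, "pass" | "reject">;  // per-layer telemetry
  truePositive: boolean;                             // ground-truth label
}

interface Router {
  // Pick which triage layers to run for one finding, instead of a
  // single static profile set by the operator.
  route(features: number[]): string[];
}

// Placeholder policy standing in for the XGBoost first cut: score the
// finding, then spend triage budget only where suspicion is high.
class ThresholdRouter implements Router {
  constructor(private scoreFalsePositive: (f: number[]) => number) {}
  route(features: number[]): string[] {
    return this.scoreFalsePositive(features) > 0.5
      ? ["pov", "reachability", "debate"] // suspicious: heavier re-checking
      : ["reachability"];                 // confident: cheapest effective layer
  }
}
```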

why multiple benchmarks exist at all

Most AI pentesting tools benchmark against a single suite — usually XBOW. That tells the buyer how the tool performs on traditional web vulnerabilities. It says nothing about everything else the same tool claims to do.

pwnkit operates across five domains: web pentesting, AI/LLM application security, npm supply-chain auditing, LLM safety boundary probing, and network pentesting. A single benchmark cannot cover that. The lab maintains five suites — one per domain — with the explicit goal of catching cases where a method that wins on one domain regresses on another.

web pentesting — XBOW (104 challenges)

The standard reference benchmark. The shell-first architecture and the moat ablation discussed above both ground out here. Latest CI runs: white-box 36/50 (72%), black-box 28/41 (68%), with 55 unique flags across all runs of the tested subset.

```bash
pnpm --filter @pwnkit/benchmark xbow --agentic
```

AI/LLM security (10 challenges)

A custom suite covering prompt injection, jailbreaks, system prompt extraction, encoding bypasses, SSRF via MCP tools, and multi-turn escalation. Each challenge hides a FLAG{...} extractable only by exploiting the underlying vulnerability. Baseline mode (no API key, deterministic checks) catches 3/10. Agentic mode catches 10/10 with zero false positives. The gap between baseline and agentic is the AI-specific attack surface — regex cannot match a jailbreak and a template cannot script multi-turn escalation.
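The scoring side of that zero-false-positive claim is mechanically simple. A sketch, with attemptChallenge as a hypothetical stand-in for the real runner:

```ts
// Sketch of how a FLAG{...}-gated challenge is scored. A solve only
// counts when the exact planted flag appears in the transcript, which
// is what keeps the agentic 10/10 at zero false positives.
// attemptChallenge is a hypothetical stand-in for the real runner.
declare function attemptChallenge(id: string): Promise<string>; // full transcript

const FLAG_PATTERN = /FLAG\{[^}]+\}/;

async function scoreChallenge(id: string, plantedFlag: string): Promise<boolean> {
  const transcript = await attemptChallenge(id);
  const match = transcript.match(FLAG_PATTERN);
  // Matching the pattern is not enough: the extracted flag must equal
  // the one planted behind the vulnerability.
  return match !== null && match[0] === plantedFlag;
}
```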

AutoPenBench (33 network pentest tasks)

AutoPenBench covers real network pentesting and CVE exploitation — service enumeration, vulnerability scanning, exploit development. The published bar is 21% (the original paper’s best automated agent). pwnkit’s runner hooks into the shell-first pipeline directly; the agent gets bash on a network of vulnerable targets. The runner builds, but full scoring needs a linux/amd64 CI host: the benchmark’s multi-container Docker topology only runs there, so that is where the full run lives.

HarmBench (510 LLM-safety behaviors)

HarmBench inverts the question: instead of testing whether pwnkit can break into an LLM, it tests whether pwnkit can elicit harmful behavior from an LLM under safety constraints. Attack Success Rate (ASR) is the metric. A lightweight harness reuses pwnkit’s sendPrompt() function — no separate infrastructure.
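A sketch of what such a harness can look like; sendPrompt is the primitive named above, while isHarmful stands in for HarmBench's behavior classifier, which this sketch does not implement:

```ts
// Minimal ASR harness sketch. sendPrompt is the existing pwnkit
// primitive named in the post; isHarmful stands in for HarmBench's
// behavior classifier, which this sketch does not implement.
declare function sendPrompt(prompt: string): Promise<string>;
declare function isHarmful(behavior: string, response: string): Promise<boolean>;

async function attackSuccessRate(behaviors: string[]): Promise<number> {
  let successes = 0;
  for (const behavior of behaviors) {
    const response = await sendPrompt(behavior);
    if (await isHarmful(behavior, response)) successes++;
  }
  // ASR = harmful completions / behaviors attempted (510 in HarmBench).
  return successes / behaviors.length;
}
```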

npm-bench (81 packages)

There was no existing npm security benchmark when this work started. The suite was built from scratch: 27 known-malicious packages (install scripts that exfiltrate env vars, obfuscated backdoors, typosquats), packages with real CVEs (prototype pollution, ReDoS, path traversal), and safe packages a scanner should not flag. Precision, recall, and F1 are the metrics. The npm supply chain is one of the highest-impact attack surfaces in software, and standardized measurement was missing. Current F1: 0.973 with 100% recall, FPR 0.11.
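Those numbers are internally consistent. With 81 packages and 27 of them safe, 100% recall means all 54 flag-worthy packages were caught, and FPR 0.11 means 3 false positives on the safe set; that split is inferred from the published metrics rather than stated outright. A quick check:

```ts
// Consistency check on the published npm-bench metrics. The 54/3 split
// is inferred, not stated: 81 packages minus 27 safe leaves 54 that
// should be flagged, and FPR 0.11 on 27 safe packages is 3 false positives.
function f1(tp: number, fp: number, fn: number): number {
  const precision = tp / (tp + fp);
  const recall = tp / (tp + fn);
  return (2 * precision * recall) / (precision + recall);
}

console.log(f1(54, 3, 0).toFixed(3)); // "0.973", matching the published F1
```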

the playwright addition

A persistent gap on the XBOW suite was zero XSS challenges cracked. Every XSS challenge requires a browser runtime — curl cannot trigger DOM-based XSS. pwnkit gained a browser tool powered by Playwright. The agent now opens pages in a headless Chromium instance, interacts with the DOM, injects scripts, and observes the results. It sits alongside bash; the agent picks whichever fits. Curl for API and header work, browser for anything that needs JavaScript execution. The same primitive supports real-world XSS testing.
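A minimal sketch of the kind of check this unlocks, using Playwright's public API; the probe, parameter handling, and payload are illustrative, not pwnkit's actual browser tool:

```ts
// Minimal browser-based XSS probe using Playwright's public API. The
// probe, parameter handling, and payload are illustrative; this is not
// pwnkit's actual browser tool.
import { chromium } from "playwright";

async function probeReflectedXss(target: string, param: string): Promise<boolean> {
  const payload = `<img src=x onerror="alert('xss-probe')">`;
  const browser = await chromium.launch(); // headless Chromium by default
  const page = await browser.newPage();

  let fired = false;
  // The payload only matters if the page's own JavaScript writes it into
  // the DOM and the handler executes, raising a dialog curl would never see.
  page.on("dialog", async (dialog) => {
    fired = dialog.message() === "xss-probe";
    await dialog.dismiss();
  });

  await page.goto(`${target}?${param}=${encodeURIComponent(payload)}`, {
    waitUntil: "networkidle",
  });
  await browser.close();
  return fired;
}
```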

overnight cross-benchmark, while the ablation was running

Five other benchmark suites completed during the same window:

| benchmark | score | notes |
| --- | --- | --- |
| picoctf | 8/10 (80%) | client-side, robots.txt, login bypass, sqli |
| hackbench | 3/5 (60%) | subset of the 16-challenge suite |
| argus | 2/5 (40%) | multi-step APT scenarios |
| portswigger | 0/10 | likely expired lab session (under investigation) |
| bountybench | 0/3 | different attack domain (patch/detect/exploit) |

portswigger at 0/10 with zero findings on basic SQLi and reflected XSS is suspicious in a “the session expired” way rather than a “the agent cannot do XSS” way. Tracking issue #120.

what shipped in 48 hours

After the ablation cycle resolved:

  1. Per-finding layer telemetry. Every finding logs which triage layer touched it, the verdict, the cost, the duration. Commit 6f1a889.
  2. npm-bench feature-flag support mirroring xbow-bench. Enables the same ablation matrix on the npm domain.
  3. triage-dataset-v2.jsonl. 1514 labeled rows from 32 results files, 163 with per-layer verdicts.
  4. egats disabled in all moat profiles. Commit aadcf32.
  5. Stable-feature isolation tokens (no-script-templates, no-handoff, no-early-stop).
  6. fp-reduction-moat.md rewritten with measured numbers; seven related docs updated to match.
  7. Dynamic routing design doc for the learned per-finding router.

the methodology lesson

The marketing claim attached to the moat was wrong in its first form, closer to right after the ablation, and most accurate after the replication run that disabled egats. The first data almost forced the team to turn off a triage pipeline that does its job — because one of its eleven layers was poisoning the rest, and the smallest signal-rich slice (n=14) could not resolve which layer was responsible.

Three rules the lab now treats as load-bearing:

  1. Stubborn-slice evaluations diagnose failure modes. They do not measure whether to ship.
  2. A finding that does not replicate 12 hours later on the same code is noise, not signal.
  3. Per-layer telemetry is the cheapest insurance against “the moat catastrophically regresses” framings. If the pipeline is composed of N independent components, the failure mode of one component cannot be allowed to mask the contribution of the other N-1.

The point of running five benchmarks across five domains is that no single suite would have produced the same answer. The black-box result on XBOW, the recall preservation on npm-bench, and the FPR noise floor on a small safe-package sample are three independent observations. Without all three, the right tuning is not visible.