
Orchestration, Not Frontier: What the IronCurtain Post Means for 0sec

Niels Provos shipped a vulnerability-discovery framework that replicates Mythos-class findings on commercial models — and one autonomous CVE on an open-weight model. It is the same bet 0sec is built on. Here is what we already do, what we need to borrow, and the four gaps we are closing.

On April 29, Niels Provos published Finding Zero-Days with Any Model. His headline thesis, in his own words: “vulnerability discovery is an orchestration problem, not a frontier-model problem.”

He replicates the 1998 OpenBSD TCP SACK bug — the one he committed himself, 27 years ago — using Sonnet 4.6 and Opus 4.6 driven by his open-source IronCurtain framework. He then points the same workflow at a foundational library, swaps the model out for Z.AI’s GLM 5.1 over a LiteLLM gateway, and finds an integer-truncation flaw that had been sitting on a memory-allocation path for 18 years. The orchestration layer does not change. Only the model does.

This is the same bet 0sec is built on. Our engine’s record of 103/104 (99.0%) on the public web vulnerability benchmark suite was set on Sonnet 4.6, not on a frontier-restricted preview model, using the same provider mix available to anyone with an OpenRouter key. The IronCurtain post validates the macro thesis: orchestration scaffolding extracts capability that vendors gate behind embargoed releases. The floor for what commodity models can do is now high enough to clear that bar.

The questions worth asking: where does our engine already map to this design, where is it behind, and what needs to be borrowed.

Where our engine already maps to his design

Hypothesize statically, validate by execution. Provos calls it the central FSM discipline. We call it blind verification: every finding is independently re-exploited before it appears in a report. Same idea, different name. The disclosure pipeline enforces this at the advisory layer too — no PoC, no advisory.

Tiered harness construction. Provos describes his harness ladder as: single-function isolation harness → multi-component harness → full end-to-end VM validation. The engine already runs tier 3 for kernel crashes — the QEMU validator compiles reproducers inside the guest and watches for KASAN/UBSAN oopses. That path was built for kernel work. The same plumbing now needs to extend into source-code review of foundational C libraries (more on that below).

Model-agnostic routing. Provos reroutes Anthropic identifiers to Z.AI through a LiteLLM gateway and ships GLM 5.1 end-to-end without changing IronCurtain. The engine already routes through OpenRouter and Azure OpenAI. Running GLM 5.1 against the web vulnerability benchmark is a config flag away — and it should be done, because “orchestration > frontier-model” is a thesis that has to keep being proven.

Where our engine is genuinely behind

Four gaps.

1. Append-only execution journal as source of truth

IronCurtain’s central architectural choice is that the Orchestrator agent does not read source code. It routes off an append-only execution journal. Every specialist agent gets a fresh context window and rehydrates the slice of journal it needs. The journal is the source of truth; the model’s working memory is disposable.
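A minimal sketch of that shape, to make the idea concrete. Every name here (`Journal`, `JournalEntry`, `sliceFor`, `nextAgent`, the topic tags) is hypothetical and chosen for illustration; none of it is taken from IronCurtain’s code. The point is only that the router decides from journal entries and never touches the source under review.

```typescript
// Hypothetical names throughout: this is an illustration of the pattern,
// not IronCurtain's or the engine's actual API.
type JournalEntry = {
  readonly seq: number;     // position in the log, assigned on append
  readonly agent: string;   // which specialist wrote the entry
  readonly topic: string;   // coarse tag used to select slices
  readonly summary: string; // what the agent observed or concluded
};

class Journal {
  private readonly entries: JournalEntry[] = [];

  // Append is the only mutation: entries are frozen and never edited or removed.
  append(agent: string, topic: string, summary: string): JournalEntry {
    const entry: JournalEntry = Object.freeze({
      seq: this.entries.length,
      agent,
      topic,
      summary,
    });
    this.entries.push(entry);
    return entry;
  }

  // A fresh agent context rehydrates only the topics it needs.
  sliceFor(topics: readonly string[]): JournalEntry[] {
    return this.entries.filter((e) => topics.indexOf(e.topic) !== -1);
  }

  size(): number {
    return this.entries.length;
  }
}

// The router reads journal entries only; it is never handed source code.
function nextAgent(journal: Journal): string {
  const hypotheses = journal.sliceFor(["hypothesis"]).length;
  const validations = journal.sliceFor(["validation"]).length;
  if (hypotheses === 0) return "analyst";
  if (validations < hypotheses) return "validator";
  return "reporter";
}
```

Because each dispatch starts from a journal slice rather than a long-lived conversation, a crashed run can in principle resume from the last appended entry, and independent slices can be handed to parallel agents.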

The engine’s loop, by contrast, carries investigation state in the conversation window. When the context fills up, a summarizer kicks in and lossily compresses it. This caps the size of investigation we can run, makes recovery from mid-run failures lossy, and prevents clean parallelization.

This is the same architectural shape that drove BoxPwnr (0ca’s framework) to a 97.1% score on the web vulnerability benchmark: durable journal, fresh contexts per dispatch, strategic router that never reads the artifact directly. Two independent groups have now validated the same answer. It is the technique.

The journal/orchestrator refactor is a 3–5 day effort with a benchmark validation cycle, gated behind a feature flag, and it must clear the current benchmark tally on a 30-run pilot before flipping default. Moving slowly here is the point.

2. YAML-defined FSM workflows

IronCurtain ships its workflows as plain YAML FSM definitions. One Orchestrator interprets the FSM. New workflows are contributed as YAML, not as TypeScript PRs against the loop driver.

The engine’s investigation flow is hard-coded in TypeScript across the playbooks, prompts, and loop drivers. A workflow cannot be forked without forking the codebase. That is acceptable while the workflow is one thing (web pentest). It stops scaling the moment you want vuln-discovery, kernel-crash-triage, package-audit, and code-review-c-cpp to live as four independent artifacts side by side.

This depends on the journal/orchestrator work: the Orchestrator is the FSM interpreter, and the journal is its observation surface.
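To show what "workflow as data" buys, here is a small sketch of an interpreter over a workflow definition. The object literal stands in for what a YAML file might parse into. The schema (`initial`, `states`, `agent`, `on`), the state names, and the `dispatch` callback are all assumptions made for this example and do not reproduce IronCurtain’s actual format.

```typescript
// Hypothetical schema, for illustration only.
type StateDef = {
  agent?: string;                     // specialist to dispatch in this state
  on?: { [outcome: string]: string }; // outcome -> next state name
};

type Workflow = {
  initial: string;
  states: { [name: string]: StateDef };
};

// Stand-in for a parsed YAML file: adding a workflow means adding data, not code.
const vulnDiscovery: Workflow = {
  initial: "hypothesize",
  states: {
    hypothesize: { agent: "analyst", on: { candidate: "validate", none: "done" } },
    validate: { agent: "harness-builder", on: { confirmed: "report", refuted: "hypothesize" } },
    report: { agent: "reporter", on: { written: "done" } },
    done: {}, // terminal: no agent, no transitions
  },
};

// Walks the workflow and returns the visited states. `dispatch` stands in for
// running an agent and reporting its outcome.
function run(
  wf: Workflow,
  dispatch: (agent: string, state: string) => string,
  maxSteps: number = 50
): string[] {
  const visited: string[] = [];
  let current = wf.initial;
  for (let step = 0; step < maxSteps; step++) {
    const def = wf.states[current];
    if (def === undefined) throw new Error("unknown state: " + current);
    visited.push(current);
    if (def.agent === undefined || def.on === undefined) return visited; // terminal state
    const outcome = dispatch(def.agent, current);
    const next = def.on[outcome];
    if (next === undefined) {
      throw new Error("no transition for outcome " + outcome + " in state " + current);
    }
    current = next;
  }
  throw new Error("step budget exhausted");
}
```

With one interpreter of this kind, vuln-discovery, kernel-crash-triage, package-audit, and code-review-c-cpp could each be a separate definition file rather than a fork of the loop driver.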

3. C/C++ source-code-review workflow

This is the gap that actually moves the needle on coverage. Provos’s post is explicitly about a media framework and an integer-truncation flaw on a memory allocation path in a foundational library. Both are C/C++ memory-safety primitives. Neither is in scope for the engine’s web/LLM scanning or its npm/pypi/cargo package audits.

The engine’s code-review path exists, but it is positioned and prompt-tuned for application-layer code review — type-safety bugs, auth gaps, business-logic flaws. It is not positioned as “give me a CVE in this widely-deployed C library.” The gap is positioning and workflow, not model capability.

Closing this gap delivers a new C-library review route, a tier-1 libFuzzer harness scaffolder, a tier-2 multi-component linker helper, and tier-3 validation that reuses the kernel-crash QEMU plumbing. It also includes one synthetic reference target, so that the test suite can prove the workflow finds a bug autonomously. The whole thing ships as a YAML-defined FSM workflow once the FSM interpreter lands.

4. Per-investigation cost reporting

Provos publishes hard $/investigation numbers on his marketing surface: ~$30 on Sonnet 4.6, ~$150 on Opus 4.6, ~$30-equivalent on GLM 5.1 at higher token volume. That number is now the metric defenders compare on. “How many libraries can I afford to audit per year” is a budget conversation, not a benchmark conversation.

The engine leads with flag counts. The numbers are defensible (103/104 on the public benchmark suite, evidence-backed under NDA), but flag counts do not give a buyer the math they need to size a contract. Tokens and per-run cost are tracked internally; a single headline $/flag is not surfaced.

This one is small and independent; it lands first. It centralizes pricing in one source-of-truth file, extends the benchmark consolidator to emit $/run and $/flag per profile, and adds a comparison table.
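The arithmetic behind $/run and $/flag is simple enough to sketch. Everything below is illustrative: the profile names, the prices, and the record shape are placeholders rather than real vendor rates or the engine’s actual consolidator, and the sketch assumes a flat per-token price with no caching discounts.

```typescript
// Placeholder prices and hypothetical names: not real vendor rates or engine code.
type Price = { inputPerMTok: number; outputPerMTok: number }; // USD per million tokens

// Single source of truth for pricing, keyed by profile.
const PRICING: { [profile: string]: Price } = {
  "profile-a": { inputPerMTok: 3, outputPerMTok: 15 },
  "profile-b": { inputPerMTok: 1, outputPerMTok: 4 },
};

type RunRecord = { profile: string; inputTokens: number; outputTokens: number; flags: number };

type CostRow = {
  profile: string;
  runs: number;
  flags: number;
  usdPerRun: number;
  usdPerFlag: number | null; // null when a profile captured no flags
};

// Cost of one run in USD under the flat-price assumption.
function runCost(r: RunRecord): number {
  const p = PRICING[r.profile];
  if (p === undefined) throw new Error("no pricing for profile " + r.profile);
  return (r.inputTokens * p.inputPerMTok + r.outputTokens * p.outputPerMTok) / 1e6;
}

// Aggregates run records into one comparison row per profile, sorted by name.
function consolidate(records: RunRecord[]): CostRow[] {
  const totals: { [profile: string]: { runs: number; flags: number; usd: number } } = {};
  for (const r of records) {
    let t = totals[r.profile];
    if (t === undefined) {
      t = { runs: 0, flags: 0, usd: 0 };
      totals[r.profile] = t;
    }
    t.runs += 1;
    t.flags += r.flags;
    t.usd += runCost(r);
  }
  return Object.keys(totals).sort().map((profile) => {
    const t = totals[profile];
    return {
      profile,
      runs: t.runs,
      flags: t.flags,
      usdPerRun: t.usd / t.runs,
      usdPerFlag: t.flags > 0 ? t.usd / t.flags : null,
    };
  });
}
```

Keeping prices in one table means a vendor price change touches one place, and every historical run can be re-costed from its stored token counts.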

A note on the responsibility framing

Provos closes the post with an argument worth quoting directly:

“Every defensive tool of the past 25 years (Metasploit, nmap, Burp Suite, AFL) faced the same debate, and the historical answer has been to put the tools in defender hands. On a local model, accountability rests directly with the researcher, as it has for those tools all along.”

Our position is the same. The engine’s H1-readiness work (auth-header redaction, a scope allowlist on the PoC runtime, refusal to render advisories with empty PoCs, and a verification footer that renders only on reverify+canary success) is the same bet from the disclosure-pipeline side: defenders ship vulnerability tooling under accountability, not under a permission gradient gated by the vendor.

The H1 Code of Conduct already takes a hard line on AI-generated low-quality submissions — Final Warning on first offense, 12-month ban on second, permanent ban on third. That is not a problem if the tooling produces verified PoCs and refuses to ship advisories without them. It is a problem if the tooling auto-submits static-analysis output. The engine is built for the former. The orchestration layer that finds the bug and the disclosure layer that filters before submission are the same pipeline.
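The two rendering rules described above (no advisory without a PoC, and a verification footer only when reverify and canary both succeed) can be sketched as a single gate. The `Finding` shape, the field names, and the footer wording are hypothetical; in particular, treating the canary as a boolean "observed" signal is an assumption made for this example.

```typescript
// Hypothetical types and wording: an illustration of the gating rules,
// not the engine's actual disclosure pipeline.
type Finding = {
  title: string;
  poc: string;             // proof-of-concept steps or script; may be empty
  reverified: boolean;     // independent re-exploitation succeeded
  canaryObserved: boolean; // assumed meaning: the canary check passed
};

type RenderResult =
  | { ok: true; advisory: string }
  | { ok: false; reason: string };

function renderAdvisory(f: Finding): RenderResult {
  // Rule 1: no PoC, no advisory.
  if (f.poc.trim().length === 0) {
    return { ok: false, reason: "empty PoC" };
  }
  const lines = ["Title: " + f.title, "", "Proof of concept:", f.poc.trim()];
  // Rule 2: the verification footer appears only when both checks passed.
  if (f.reverified && f.canaryObserved) {
    lines.push("", "Verified: independent re-exploitation and canary check succeeded.");
  }
  return { ok: true, advisory: lines.join("\n") };
}
```

Returning a refusal value instead of an empty advisory makes the "refuses to ship" behaviour explicit to whatever calls the renderer.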

What ships first

Cost telemetry is independent and small. It lands this week.

The C/C++ workflow scaffold ships in two slices: the new code-review route and tier-1 harness scaffolder land this week as a TypeScript playbook. Once the journal/orchestrator and FSM work land, it gets reframed as a YAML FSM workflow.

The journal/orchestrator refactor gets a design doc this week. Implementation lands behind a feature flag with a 30-run pilot benchmark before the default flips. Moving slowly here is the point — this is the load-bearing piece.

The YAML FSM workflows ship on top of the orchestrator once it is in place.

For anyone watching the IronCurtain release and asking whether 0sec is on the right track: yes, and the gap is execution, not architecture. The four gaps above are the punch list.

0sec is the autonomous pentesting engine behind the managed service.