Adversarial IaC Benchmark

Red Team vs Blue Team security evaluation for Infrastructure-as-Code.

What This Is For

This benchmark is for LLM evaluators, multi-agent researchers, and AWS practitioners who want to measure how well models reason about security in Infrastructure-as-Code.

Key differentiator

The benchmark measures genuine security reasoning, not pattern memorization.

How It Works

flowchart LR
    S[Scenario] --> R[Red Team]
    R --> V[Validator]
    V --> B[Blue Team]
    B --> J[Judge]
    J --> S2[Scores]

Scenario — "Create an S3 bucket for healthcare PHI data"
Red Team — Generates code with hidden vulnerabilities
Validator — Trivy/Checkov corroborate Red Team claims; unconfirmed entries are excluded from scoring (phantom concordance mitigation)
Blue Team — Analyzes the generated code to find the hidden vulnerabilities; a precision verification filter removes unsubstantiated findings
Judge — Scores precision, recall, F1, and evasion rate, using cross-provider consensus for novel vulnerabilities
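The scoring step can be illustrated with a minimal sketch. This is not the benchmark's actual implementation: the function name `score_round`, the `Scores` container, and the idea of matching vulnerabilities by exact string ID are all assumptions made for illustration. The sketch also assumes that evasion rate means the share of validator-confirmed vulnerabilities that the Blue Team missed. The benchmark's own Judge may match findings differently, for example through the cross-provider consensus described above.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Scores:
    """Hypothetical container for one round's metrics."""
    precision: float
    recall: float
    f1: float
    evasion_rate: float


def score_round(
    red_claims: set[str],
    validator_confirmed: set[str],
    blue_findings: set[str],
) -> Scores:
    """Score one Red vs Blue round (illustrative only).

    Assumption: vulnerabilities are compared by exact ID, and only
    Red Team claims corroborated by the validator count as ground truth.
    """
    # Validator step: drop Red Team claims that no scanner confirmed.
    ground_truth = red_claims & validator_confirmed

    # Blue Team findings that match a confirmed vulnerability.
    true_positives = blue_findings & ground_truth
    # Confirmed vulnerabilities the Blue Team did not report.
    missed = ground_truth - blue_findings

    precision = len(true_positives) / len(blue_findings) if blue_findings else 0.0
    recall = len(true_positives) / len(ground_truth) if ground_truth else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall) > 0
        else 0.0
    )
    # Assumed definition: fraction of confirmed vulnerabilities that evaded detection.
    evasion_rate = len(missed) / len(ground_truth) if ground_truth else 0.0

    return Scores(precision, recall, f1, evasion_rate)


if __name__ == "__main__":
    # Made-up IDs for a single S3 scenario.
    scores = score_round(
        red_claims={"unencrypted-bucket", "public-acl", "no-access-logging", "weak-kms-policy"},
        validator_confirmed={"unencrypted-bucket", "public-acl", "no-access-logging"},
        blue_findings={"unencrypted-bucket", "public-acl", "missing-versioning"},
    )
    print(scores)
```

In this made-up round, the unconfirmed `weak-kms-policy` claim is excluded before scoring, so it neither rewards the Red Team nor penalizes the Blue Team.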

Get Started

adversarial-iac play

The interactive wizard guides you through scenario selection and model choice, then explains the results.

Next Steps

Quick Start — Install and run your first game
CLI Reference — All commands and options with examples
Experiments — Batch runs with YAML config