Citation & Research¶

If you use the Adversarial IaC Benchmark in your research, please cite our work.

Key Contributions¶

This benchmark introduces several novel contributions to LLM security evaluation:

1. Adversarial Evaluation Framework¶

Unlike static benchmarks where models are tested on fixed datasets, our framework creates a dynamic adversarial environment:

Red Team LLM actively tries to evade detection
Blue Team LLM must detect previously unseen vulnerabilities
Judge uses optimal bipartite matching for fair scoring

This measures true security reasoning, not pattern memorization.

2. IaC-Specific Benchmark¶

First comprehensive benchmark focused on Infrastructure-as-Code security:

114 scenarios across 18 domains
Industry compliance: HIPAA, PCI-DSS, FedRAMP, SOC 2
Cloud-native patterns: AWS, multi-service architectures
Terraform and CloudFormation support

3. Multi-Agent Evaluation¶

Evaluates emerging LLM architectures for security:

Blue Team Ensemble: Specialist agents + consensus voting
Red Team Pipeline: Multi-stage attack generation
Adversarial Debate: Finding verification through argumentation

4. Standardized Metrics¶

Reproducible evaluation with:

Precision, Recall, F1 for detection quality
Evasion Rate for attack effectiveness
Hungarian algorithm for optimal vulnerability-to-finding matching

BibTeX¶

@software{adversarial_iac_bench_2026,
  author = {Terry, Brian},
  title = {Adversarial IaC Benchmark: An Adversarial Evaluation Framework for LLM Security in Infrastructure-as-Code},
  year = {2026},
  publisher = {GitHub},
  url = {https://github.com/brianterry/Adversarial-IaC-Evaluation},
  note = {Open-source adversarial benchmark for evaluating LLM security capabilities in Infrastructure-as-Code}
}

APA¶

Terry, B. (2026). Adversarial IaC Benchmark: An Adversarial Evaluation Framework for LLM Security in Infrastructure-as-Code [Computer software]. GitHub. https://github.com/brianterry/Adversarial-IaC-Evaluation

Research Questions¶

The game enables investigation of:

Research Area	Example Questions
Model Comparison	How does Claude compare to GPT-4 at detecting IaC vulnerabilities?
Adversarial Robustness	Which models are most resistant to evasion techniques?
Multi-Agent Systems	Does ensemble consensus improve detection accuracy?
Compliance Understanding	Do LLMs understand HIPAA differently than PCI-DSS?
Attack Strategies	Which Red Team strategies achieve highest evasion rates?
Defense Strategies	Does iterative analysis improve recall?
Tool Comparison	How do LLMs compare to Trivy/Checkov for IaC scanning?

What to Include When Citing¶

When using the benchmark in research, please specify:

Benchmark version: Include commit hash or release version
Models tested: Which LLMs for Red and Blue teams
Scenarios used: Domains, count, and difficulty levels
Evaluation modes: Single, pipeline, ensemble, or debate
Strategies: Attack and defense strategies used
Metrics reported: Precision, recall, F1, evasion rate

Example Usage in Papers¶

Methods Section¶

We evaluated LLM security capabilities using the Adversarial IaC Game v1.0 (Terry, 2026). Our experiments ran 200 games with Claude 3.5 Sonnet (Red Team) vs Nova Pro (Blue Team) across healthcare and financial domains at medium difficulty (3-4 vulnerabilities per game). We used Blue Team Ensemble mode with debate consensus. The Judge employed the Hungarian algorithm for optimal vulnerability-to-finding matching.

Results Section¶

Table 1 shows detection performance across models. Claude 3.5 Sonnet achieved the highest recall (78%) but lower precision (71%) compared to Nova Pro (recall: 65%, precision: 89%). The Red Team achieved an average evasion rate of 28% across all experiments.

Data Outputs¶

Each evaluation produces structured data suitable for analysis:

Output	Format	Contents
`red_team_manifest.json`	JSON	Ground truth vulnerabilities with types, locations
`blue_team_findings.json`	JSON	Detected issues with confidence, evidence
`game_result.json`	JSON	Scores, matches, timing, configuration
`code/*.tf`	HCL/YAML	Generated infrastructure code
`game.log`	Text	Full execution trace

Statistical Analysis¶

Results can be aggregated for statistical testing:

import pandas as pd
from pathlib import Path
import json

# Load all game results
results = []
for game_dir in Path("output/games").iterdir():
    with open(game_dir / "game_result.json") as f:
        results.append(json.load(f))

df = pd.DataFrame(results)

# Compare models
df.groupby("blue_model")["recall"].agg(["mean", "std"])

This benchmark builds on and differs from:

SecBench — Static security question answering (we use dynamic generation)
CyberSecEval — Multi-domain security (we focus on IaC specifically)
VulnGen — Vulnerability injection (we add adversarial detection evaluation)
SVEN — Static vulnerability datasets (we generate new vulnerabilities each game)

Our key differentiation is the adversarial game format where both attack and defense are AI-powered.