Batch Experiments
Run multiple games across model combinations and scenarios from a single YAML config file.
Usage
python scripts/run_experiment.py --config experiments/config/E1_model_comparison_v2.yaml --region us-east-1
Minimal Config
name: "My Experiment"
region: "us-east-1"
language: terraform
cloud_provider: aws
difficulties: [easy, medium, hard]
scenarios:
  - "Create an S3 bucket for logs"
  - "Create a VPC with subnets"
red_settings:
  red_team_mode: single
  red_strategy: balanced
  red_vuln_source: mixed  # v2.3: test novel reasoning
  novel_ratio: 0.5  # 50% novel vulnerabilities
blue_settings:
  blue_team_mode: single
  blue_strategy: comprehensive
  detection_mode: llm_only
  precision_strategy: "precise"  # v2.3: verification filter
judge_settings:
  use_llm_judge: true
  use_trivy: true  # v2.3: manifest validation
  use_checkov: true
  use_cross_provider_judge: true  # v2.3: phantom concordance mitigation
settings:
  repetitions: 3
  delay_between_games: 2
  output_dir: "experiments/results/my_exp"
batch_experiments:
  enabled: true
  model_combinations:
    - red: "us.anthropic.claude-3-5-haiku-20241022-v1:0"
      blue: "us.anthropic.claude-3-5-haiku-20241022-v1:0"
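
As a rough sanity check on experiment size, the sketch below counts the games a config like the one above would produce. It assumes the runner plays every model combination against every scenario and difficulty, and repeats each cell `repetitions` times. That enumeration and the helper name `count_games` are assumptions for illustration, not taken from `run_experiment.py`. The config is inlined as a dict so the example has no dependencies.

```python
from itertools import product

# Inlined equivalent of the minimal config (only the fields that affect game count).
config = {
    "difficulties": ["easy", "medium", "hard"],
    "scenarios": ["Create an S3 bucket for logs", "Create a VPC with subnets"],
    "settings": {"repetitions": 3, "delay_between_games": 2},
    "batch_experiments": {
        "enabled": True,
        "model_combinations": [
            {
                "red": "us.anthropic.claude-3-5-haiku-20241022-v1:0",
                "blue": "us.anthropic.claude-3-5-haiku-20241022-v1:0",
            }
        ],
    },
}


def count_games(cfg: dict) -> int:
    """Hypothetical helper: assumes a full cross product, repeated `repetitions` times."""
    combos = cfg["batch_experiments"]["model_combinations"]
    matrix = product(combos, cfg["scenarios"], cfg["difficulties"])
    return sum(1 for _ in matrix) * cfg["settings"]["repetitions"]


total = count_games(config)
print(total)  # 1 combo x 2 scenarios x 3 difficulties x 3 repetitions = 18
```

Under the same assumption, each extra model combination adds another 18 games to this config, which is worth knowing before setting `delay_between_games`.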
Full Example (with Backends)
name: "E1-S2: Qwen3.5 Thinking Mode"
region: "us-east-1"
language: terraform
cloud_provider: aws
difficulties: [hard]
scenarios:
  - "Create an S3 bucket for healthcare PHI data with HIPAA compliance"
  - "Create a VPC with public and private subnets"
red_settings:
  red_team_mode: single
  red_strategy: balanced
  red_vuln_source: mixed  # v2.3: test novel reasoning
  novel_ratio: 0.5
blue_settings:
  blue_team_mode: single
  blue_strategy: comprehensive
  detection_mode: llm_only
  precision_strategy: "precise"  # v2.3: verification filter
judge_settings:
  use_llm_judge: true
  use_trivy: true
  use_checkov: true
  use_cross_provider_judge: true  # v2.3: phantom concordance mitigation
settings:
  repetitions: 3
  output_dir: "experiments/results/E1S2"
  delay_between_games: 2
  save_intermediate: true
batch_experiments:
  enabled: true
  model_combinations:
    - name: "qwen35_thinking"
      red: "us.anthropic.claude-3-5-sonnet-20241022-v2:0"
      blue: "qwen3.5-plus"
      blue_backend_type: direct_api
      blue_thinking_mode: true
      blue_backend_extra:
        base_url: "https://dashscope.aliyuncs.com/compatible-mode/v1"
        api_key_env: "DASHSCOPE_API_KEY"
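
The `api_key_env` entry appears to name an environment variable rather than hold the key itself, so it can be useful to check that the variable is set before launching a long batch. The sketch below shows one way to do that. The helper `resolve_api_key` is hypothetical, and the project's own lookup logic may differ.

```python
import os


def resolve_api_key(backend_extra: dict) -> str:
    """Hypothetical helper: read the key from the env var named by `api_key_env`."""
    env_name = backend_extra.get("api_key_env")
    if not env_name:
        raise ValueError("backend_extra has no 'api_key_env' entry")
    key = os.environ.get(env_name)
    if not key:
        raise RuntimeError(f"environment variable {env_name} is not set")
    return key


# Values copied from the full example above.
blue_backend_extra = {
    "base_url": "https://dashscope.aliyuncs.com/compatible-mode/v1",
    "api_key_env": "DASHSCOPE_API_KEY",
}

try:
    resolve_api_key(blue_backend_extra)
    print("API key found")
except RuntimeError as err:
    print(f"Pre-flight check failed: {err}")
```

Failing fast here avoids discovering a missing key only after the Red Team side of the first game has already run.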
v2.3 Config Fields
| Field | Type | Default | Description |
|---|---|---|---|
| red_vuln_source | str | mixed | database, novel, or mixed |
| novel_ratio | float | 0.5 | Fraction of novel vulnerabilities when mixed |
| precision_strategy | str | precise | standard or precise (two-pass verification) |
| use_cross_provider_judge | bool | true | Require cross-provider consensus for novel vulns |
| blue_specialists | int | 4 | Number of Blue ensemble specialists (auto-matches Red pipeline) |
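
The allowed values in the table lend themselves to a quick pre-flight check. The following is a minimal sketch, assuming the fields live under `red_settings`, `blue_settings`, and `judge_settings` as in the example configs and that `novel_ratio` must fall in [0, 1] because it is a fraction. `validate_v23` is a hypothetical helper, not part of the project, and it skips `blue_specialists` because the examples do not show which section that field belongs to.

```python
ALLOWED_VULN_SOURCES = {"database", "novel", "mixed"}
ALLOWED_PRECISION = {"standard", "precise"}


def validate_v23(cfg: dict) -> list:
    """Return a list of problems; an empty list means the v2.3 fields look valid."""
    problems = []
    red = cfg.get("red_settings", {})
    blue = cfg.get("blue_settings", {})
    judge = cfg.get("judge_settings", {})

    # Missing fields fall back to the defaults listed in the table.
    source = red.get("red_vuln_source", "mixed")
    if source not in ALLOWED_VULN_SOURCES:
        problems.append(f"red_vuln_source: unknown value {source!r}")

    ratio = red.get("novel_ratio", 0.5)
    is_number = isinstance(ratio, (int, float)) and not isinstance(ratio, bool)
    if not is_number or not 0.0 <= ratio <= 1.0:
        problems.append(f"novel_ratio: expected a number in [0, 1], got {ratio!r}")

    strategy = blue.get("precision_strategy", "precise")
    if strategy not in ALLOWED_PRECISION:
        problems.append(f"precision_strategy: unknown value {strategy!r}")

    cross = judge.get("use_cross_provider_judge", True)
    if not isinstance(cross, bool):
        problems.append(f"use_cross_provider_judge: expected bool, got {cross!r}")

    return problems


issues = validate_v23({"red_settings": {"red_vuln_source": "mixed", "novel_ratio": 1.5}})
print(issues)
```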
Model Combination Fields
| Field | Type | Default | Description |
|---|---|---|---|
| red | str | - | Red Team model ID |
| blue | str | - | Blue Team model ID |
| name | str | - | Label for this combo |
| blue_backend_type | str | bedrock | bedrock, direct_api, sagemaker |
| blue_thinking_mode | bool | false | Enable reasoning mode |
| blue_backend_extra | dict | {} | Backend config (base_url, api_key_env, etc.) |
| red_backend_type | str | bedrock | Same options as blue |
| red_thinking_mode | bool | false | Enable reasoning for Red |
| red_backend_extra | dict | {} | Red Team backend config |
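
To make the defaults in the table concrete, here is a sketch of how a single `model_combinations` entry could be modelled with those defaults filled in. The `ModelCombination` class is hypothetical and is not the project's actual data structure. Treating `name` as optional is an assumption based on the minimal config omitting it.

```python
from dataclasses import dataclass, field
from typing import Optional

BACKEND_TYPES = {"bedrock", "direct_api", "sagemaker"}


@dataclass
class ModelCombination:
    """Hypothetical model of one entry; defaults mirror the table above."""

    red: str
    blue: str
    name: Optional[str] = None  # assumption: optional label
    blue_backend_type: str = "bedrock"
    blue_thinking_mode: bool = False
    blue_backend_extra: dict = field(default_factory=dict)
    red_backend_type: str = "bedrock"
    red_thinking_mode: bool = False
    red_backend_extra: dict = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Reject backend types that the table does not list.
        for side in ("red", "blue"):
            backend = getattr(self, f"{side}_backend_type")
            if backend not in BACKEND_TYPES:
                raise ValueError(f"{side}_backend_type: unknown backend {backend!r}")


# Entry taken from the full example; unspecified Red fields fall back to defaults.
combo = ModelCombination(
    name="qwen35_thinking",
    red="us.anthropic.claude-3-5-sonnet-20241022-v2:0",
    blue="qwen3.5-plus",
    blue_backend_type="direct_api",
    blue_thinking_mode=True,
    blue_backend_extra={
        "base_url": "https://dashscope.aliyuncs.com/compatible-mode/v1",
        "api_key_env": "DASHSCOPE_API_KEY",
    },
)
print(combo.red_backend_type, combo.red_thinking_mode)  # bedrock False
```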
Script Options
python scripts/run_experiment.py --config my_config.yaml --region us-east-1 [--output dir] [--delay 2] [--upload-to-s3]