Batch Experiments

Run multiple games across model combinations, scenarios, and difficulty levels from a single YAML config file.

Usage

python scripts/run_experiment.py --config experiments/config/E1_model_comparison_v2.yaml --region us-east-1

Minimal Config

name: "My Experiment"
region: "us-east-1"
language: terraform
cloud_provider: aws

difficulties: [easy, medium, hard]
scenarios:
  - "Create an S3 bucket for logs"
  - "Create a VPC with subnets"

red_settings:
  red_team_mode: single
  red_strategy: balanced
  red_vuln_source: mixed        # v2.3: test novel reasoning
  novel_ratio: 0.5               # 50% novel vulnerabilities

blue_settings:
  blue_team_mode: single
  blue_strategy: comprehensive
  detection_mode: llm_only
  precision_strategy: "precise"  # v2.3: verification filter

judge_settings:
  use_llm_judge: true
  use_trivy: true                # v2.3: manifest validation
  use_checkov: true
  use_cross_provider_judge: true # v2.3: phantom concordance mitigation

settings:
  repetitions: 3
  delay_between_games: 2
  output_dir: "experiments/results/my_exp"

batch_experiments:
  enabled: true
  model_combinations:
    - red: "us.anthropic.claude-3-5-haiku-20241022-v1:0"
      blue: "us.anthropic.claude-3-5-haiku-20241022-v1:0"

Full Example (with Backends)

name: "E1-S2: Qwen3.5 Thinking Mode"
region: "us-east-1"
language: terraform
cloud_provider: aws

difficulties: [hard]
scenarios:
  - "Create an S3 bucket for healthcare PHI data with HIPAA compliance"
  - "Create a VPC with public and private subnets"

red_settings:
  red_team_mode: single
  red_strategy: balanced
  red_vuln_source: mixed          # v2.3: test novel reasoning
  novel_ratio: 0.5

blue_settings:
  blue_team_mode: single
  blue_strategy: comprehensive
  detection_mode: llm_only
  precision_strategy: "precise"   # v2.3: verification filter

judge_settings:
  use_llm_judge: true
  use_trivy: true
  use_checkov: true
  use_cross_provider_judge: true  # v2.3: phantom concordance mitigation

settings:
  repetitions: 3
  output_dir: "experiments/results/E1S2"
  delay_between_games: 2
  save_intermediate: true

batch_experiments:
  enabled: true
  model_combinations:
    - name: "qwen35_thinking"
      red: "us.anthropic.claude-3-5-sonnet-20241022-v2:0"
      blue: "qwen3.5-plus"
      blue_backend_type: direct_api
      blue_thinking_mode: true
      blue_backend_extra:
        base_url: "https://dashscope.aliyuncs.com/compatible-mode/v1"
        api_key_env: "DASHSCOPE_API_KEY"
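
The `api_key_env` entry names an environment variable rather than holding the key itself, so a missing variable is an easy way to lose a long batch partway through. The following pre-flight check is a minimal sketch under the assumption that the runner reads the key from the named variable at game time. `missing_api_key_vars` is a hypothetical helper, not part of the project.

```python
import os


def missing_api_key_vars(combinations, env=None):
    """Hypothetical helper: names of unset variables referenced by api_key_env."""
    env = os.environ if env is None else env
    missing = set()
    for combo in combinations:
        for role in ("red", "blue"):
            # Each team may carry its own backend settings; both are optional.
            extra = combo.get(f"{role}_backend_extra") or {}
            var = extra.get("api_key_env")
            if var and not env.get(var):
                missing.add(var)
    return sorted(missing)


# The model combination from the full example above.
combinations = [
    {
        "name": "qwen35_thinking",
        "red": "us.anthropic.claude-3-5-sonnet-20241022-v2:0",
        "blue": "qwen3.5-plus",
        "blue_backend_type": "direct_api",
        "blue_thinking_mode": True,
        "blue_backend_extra": {
            "base_url": "https://dashscope.aliyuncs.com/compatible-mode/v1",
            "api_key_env": "DASHSCOPE_API_KEY",
        },
    }
]

# An empty mapping stands in for an environment where the key was never exported.
print(missing_api_key_vars(combinations, env={}))  # ['DASHSCOPE_API_KEY']
```

Calling it without `env` checks the real process environment, which is the form a pre-flight script would use.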

v2.3 Config Fields

| Field | Type | Default | Description |
|---|---|---|---|
| `red_vuln_source` | str | `mixed` | `database`, `novel`, or `mixed` |
| `novel_ratio` | float | `0.5` | Fraction of novel vulnerabilities when `red_vuln_source` is `mixed` |
| `precision_strategy` | str | `precise` | `standard` or `precise` (two-pass verification) |
| `use_cross_provider_judge` | bool | `true` | Require cross-provider consensus for novel vulnerabilities |
| `blue_specialists` | int | `4` | Number of Blue ensemble specialists (auto-matches Red pipeline) |
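
The allowed values in this table can be checked before a batch starts. The validator below is a sketch, not the project's own validation: `check_v23_fields` is a hypothetical name, and the section each field lives in (`red_settings`, `blue_settings`, `judge_settings`) is taken from the example configs. `blue_specialists` is left out because the examples do not show which section it belongs to.

```python
ALLOWED_VULN_SOURCES = {"database", "novel", "mixed"}
ALLOWED_PRECISION = {"standard", "precise"}


def check_v23_fields(cfg):
    """Hypothetical validator: return a list of problems found in the v2.3 fields."""
    red = cfg.get("red_settings", {})
    blue = cfg.get("blue_settings", {})
    judge = cfg.get("judge_settings", {})
    problems = []

    # Missing keys fall back to the defaults listed in the table.
    source = red.get("red_vuln_source", "mixed")
    if source not in ALLOWED_VULN_SOURCES:
        problems.append(f"red_vuln_source: unknown value {source!r}")

    ratio = red.get("novel_ratio", 0.5)
    is_number = isinstance(ratio, (int, float)) and not isinstance(ratio, bool)
    if not is_number or not 0.0 <= ratio <= 1.0:
        problems.append(f"novel_ratio: expected a number in [0, 1], got {ratio!r}")

    strategy = blue.get("precision_strategy", "precise")
    if strategy not in ALLOWED_PRECISION:
        problems.append(f"precision_strategy: unknown value {strategy!r}")

    cross = judge.get("use_cross_provider_judge", True)
    if not isinstance(cross, bool):
        problems.append(f"use_cross_provider_judge: expected a bool, got {cross!r}")

    return problems


good = {
    "red_settings": {"red_vuln_source": "mixed", "novel_ratio": 0.5},
    "blue_settings": {"precision_strategy": "precise"},
    "judge_settings": {"use_cross_provider_judge": True},
}
bad = {
    "red_settings": {"red_vuln_source": "random", "novel_ratio": 1.5},
    "blue_settings": {"precision_strategy": "precise"},
}

print(check_v23_fields(good))       # []
print(len(check_v23_fields(bad)))   # 2
```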

Model Combination Fields

| Field | Type | Default | Description |
|---|---|---|---|
| `red` | str | - | Red Team model ID |
| `blue` | str | - | Blue Team model ID |
| `name` | str | - | Label for this combo |
| `blue_backend_type` | str | `bedrock` | `bedrock`, `direct_api`, or `sagemaker` |
| `blue_thinking_mode` | bool | `false` | Enable reasoning mode for Blue |
| `blue_backend_extra` | dict | `{}` | Blue Team backend config (`base_url`, `api_key_env`, etc.) |
| `red_backend_type` | str | `bedrock` | Same options as `blue_backend_type` |
| `red_thinking_mode` | bool | `false` | Enable reasoning mode for Red |
| `red_backend_extra` | dict | `{}` | Red Team backend config |
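
Because each team's keys share a `red_` or `blue_` prefix, a combination can be split into two per-team records with the table's defaults filled in. This is an illustrative sketch only: `TeamBackend` and `team_backend` are hypothetical names, and the runner's internal representation may differ.

```python
from dataclasses import dataclass, field


@dataclass
class TeamBackend:
    """Hypothetical per-team view of one model combination."""
    model: str
    backend_type: str = "bedrock"
    thinking_mode: bool = False
    backend_extra: dict = field(default_factory=dict)


def team_backend(combo, role):
    """Read the `<role>_*` keys of a combination, falling back to the table defaults."""
    return TeamBackend(
        model=combo[role],
        backend_type=combo.get(f"{role}_backend_type", "bedrock"),
        thinking_mode=combo.get(f"{role}_thinking_mode", False),
        backend_extra=dict(combo.get(f"{role}_backend_extra", {})),
    )


combo = {
    "name": "qwen35_thinking",
    "red": "us.anthropic.claude-3-5-sonnet-20241022-v2:0",
    "blue": "qwen3.5-plus",
    "blue_backend_type": "direct_api",
    "blue_thinking_mode": True,
    "blue_backend_extra": {
        "base_url": "https://dashscope.aliyuncs.com/compatible-mode/v1",
        "api_key_env": "DASHSCOPE_API_KEY",
    },
}

red = team_backend(combo, "red")
blue = team_backend(combo, "blue")
print(red.backend_type, red.thinking_mode)    # bedrock False
print(blue.backend_type, blue.thinking_mode)  # direct_api True
```

In this combination only Blue overrides the defaults, so Red resolves to the `bedrock` backend with reasoning disabled.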

Script Options

python scripts/run_experiment.py --config my_config.yaml --region us-east-1 [--output dir] [--delay 2] [--upload-to-s3]
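
When several configs are queued from another script, the command line can be assembled programmatically. The sketch below uses only the flags listed above. `build_command` is a hypothetical wrapper, and treating `--delay` as a number of seconds is an assumption based on the example value.

```python
import shlex


def build_command(config, region, output=None, delay=None, upload_to_s3=False):
    """Hypothetical wrapper: build the argv list for scripts/run_experiment.py."""
    cmd = ["python", "scripts/run_experiment.py", "--config", config, "--region", region]
    if output is not None:
        cmd += ["--output", output]
    if delay is not None:
        cmd += ["--delay", str(delay)]
    if upload_to_s3:
        cmd.append("--upload-to-s3")
    return cmd


cmd = build_command(
    "my_config.yaml",
    "us-east-1",
    output="experiments/results/my_exp",
    delay=2,
)
print(shlex.join(cmd))
# python scripts/run_experiment.py --config my_config.yaml --region us-east-1 --output experiments/results/my_exp --delay 2
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)`, which avoids shell quoting problems in paths.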