Adversary Emulation: Testing Defenses Against Real Attack Paths
The Picus Blue Report 2025 found that security tools detect only 24% of attack techniques on average. CrowdStrike reported that the average breakout time — the window between initial access and lateral movement — collapsed to 48 minutes in 2025, with the fastest eCrime breakout taking just 51 seconds. These two statistics define the challenge: defenders have a shrinking response window and most of their detections do not fire. Adversary emulation is the discipline that closes both gaps by running known adversary behaviors against production defenses and measuring what actually triggers an alert.
This is part 3 in a series on threat-informed defense. Start with part 1.
Emulation, Red Teaming, and Penetration Testing: Three Different Objectives
The terms are frequently conflated. They differ in scope, methodology, and what they measure.
| Dimension | Adversary Emulation | Red Teaming | Penetration Testing |
|---|---|---|---|
| Objective | Validate detection and response against specific TTPs | Test overall security posture and resilience | Identify exploitable vulnerabilities |
| Scope | Constrained to a specific threat group's techniques | Unrestricted — any path to objective | Defined scope (application, network segment, API) |
| Threat model | Specific named group (e.g., APT29, FIN7) | Hypothetical advanced adversary | Unauthenticated or low-privilege attacker |
| Output | Detection coverage map per technique, MTTD/MTTR measurements | Attack narrative, objective achieved (yes/no) | Vulnerability list with severity ratings |
| Team model | Purple — offensive and defensive collaborate | Red operates independently; blue is unaware | Pentester operates independently |
| Frequency | Continuous — repeat per threat group per quarter | Annual or biannual | Annual or compliance-driven |
Adversary emulation produces the most actionable output for detection engineering because it maps results directly to ATT&CK techniques. A red team exercise might reveal that the team reached the crown jewels, but it does not systematically tell you which of your 200 detection rules failed and why. Penetration testing identifies vulnerabilities but does not validate whether your SIEM rules trigger when those vulnerabilities are exploited. Emulation fills this gap by providing technique-level pass/fail data against production telemetry.
The Emulation Planning Process
Effective adversary emulation follows a structured planning cycle. Skipping any step produces results that are either too narrow (testing a single technique in isolation) or too unreliable (lab conditions that do not match production).
1. Threat Group Selection
Select adversary groups based on sector relevance, geographic exposure, and observed targeting. Organizations in the financial sector should prioritize FIN7 and Lazarus Group; government and diplomatic entities should prioritize APT29; US critical infrastructure operators should prioritize Volt Typhoon. The MITRE ATT&CK Groups catalog provides the technique composition for over 150 tracked groups.
2. Technique Extraction
For each selected group, extract the full technique set from the ATT&CK Groups page. Cross-reference with threat intelligence reports from the past 12 months to verify the group still uses those techniques and to identify any newly observed behaviors not yet reflected in ATT&CK. The CTID Adversary Emulation Library on GitHub provides pre-built, peer-reviewed emulation plans for groups including APT29, APT28, and FIN7, complete with infrastructure requirements, step-by-step procedures, and expected observable indicators.
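Technique extraction can be scripted against the ATT&CK STIX data rather than done by hand. Below is a minimal sketch, assuming the public MITRE CTI repository on GitHub and its standard STIX 2.x object types (intrusion-set, relationship, attack-pattern); the function name and output format are illustrative:

```python
"""Sketch: pull a threat group's technique set from the ATT&CK STIX bundle.

Assumes the public MITRE CTI GitHub repository layout; the URL and STIX
field names below reflect the enterprise-attack bundle as published there.
"""
import json
import urllib.request

# The enterprise bundle is large (tens of MB); cache it locally in practice.
BUNDLE_URL = ("https://raw.githubusercontent.com/mitre/cti/"
              "master/enterprise-attack/enterprise-attack.json")

def group_techniques(group_name: str) -> dict[str, str]:
    with urllib.request.urlopen(BUNDLE_URL) as resp:
        objects = json.load(resp)["objects"]
    by_id = {o["id"]: o for o in objects}

    # Find the intrusion-set whose name or alias matches the group.
    group = next(o for o in objects
                 if o["type"] == "intrusion-set"
                 and group_name in ([o["name"]] + o.get("aliases", [])))

    techniques = {}
    # "uses" relationships link the group to attack-pattern objects.
    for rel in objects:
        if (rel["type"] == "relationship"
                and rel["relationship_type"] == "uses"
                and rel["source_ref"] == group["id"]):
            target = by_id.get(rel["target_ref"], {})
            if target.get("type") != "attack-pattern":
                continue
            if target.get("revoked") or target.get("x_mitre_deprecated"):
                continue  # skip retired techniques
            ext_id = next(r["external_id"] for r in target["external_references"]
                          if r["source_name"] == "mitre-attack")
            techniques[ext_id] = target["name"]
    return techniques

if __name__ == "__main__":
    for tid, name in sorted(group_techniques("APT29").items()):
        print(tid, name)
```

The revoked/deprecated filter matters: ATT&CK retires techniques between versions, and emulating a retired technique wastes a test cycle.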
3. Scenario Composition
Individual techniques are composed into attack paths — sequences that chain from initial access through execution, persistence, privilege escalation, lateral movement, and objective completion. The APT29 emulation plan, for example, includes two scenarios: Scenario 1 covers spearphishing attachment delivery with PowerShell execution and scheduled task persistence; Scenario 2 covers valid account initial access with OAuth application registration persistence in cloud environments. Each scenario is a full kill chain, not an isolated technique test.
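A scenario is ultimately an ordered list of technique steps. A minimal sketch of that structure, mirroring the Scenario 1 outline above (the class names are illustrative, not part of any emulation framework):

```python
"""Sketch: representing a composed attack path as an ordered kill chain.

Step content mirrors the APT29 Scenario 1 outline in the text; class
names are illustrative.
"""
from dataclasses import dataclass, field

@dataclass
class EmulationStep:
    tactic: str          # ATT&CK tactic, e.g. "initial-access"
    technique_id: str    # ATT&CK technique, e.g. "T1566.001"
    procedure: str       # concrete procedure the operator will run

@dataclass
class Scenario:
    name: str
    steps: list[EmulationStep] = field(default_factory=list)

apt29_scenario_1 = Scenario(
    name="APT29 Scenario 1",
    steps=[
        EmulationStep("initial-access", "T1566.001",
                      "Deliver weaponized attachment via spearphishing"),
        EmulationStep("execution", "T1059.001",
                      "Payload spawns PowerShell stager"),
        EmulationStep("persistence", "T1053.005",
                      "Register scheduled task for persistence"),
    ],
)

# Steps execute in order; the blue team records what each one produced.
for i, step in enumerate(apt29_scenario_1.steps, start=1):
    print(f"Step {i}: [{step.tactic}] {step.technique_id} - {step.procedure}")
```

Keeping the ordering explicit is the point: detection gaps at step 2 look different when step 1 has already executed than when step 2 is tested in isolation.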
4. Execution and Observation
Execute the plan step-by-step while the blue team monitors production detection systems. Record which techniques produced alerts, which produced logs but no alert, and which produced neither. This is the detection gap analysis — the core output of adversary emulation.
5. Feedback Loop
For each technique that produced no alert, identify whether the gap is a telemetry gap (the log source does not exist), a detection gap (the log exists but no rule triggers on it), or a tuning gap (the rule exists but was suppressed or threshold-filtered). Fixes are deployed, and the technique is re-tested in the next cycle. This loop is the engine of detection engineering maturity.
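The three gap types map cleanly to a classification routine. A minimal sketch, assuming the blue team records per-technique observations during step 4 (all field names are illustrative):

```python
"""Sketch: classifying emulation results into telemetry, detection, or
tuning gaps. Inputs come from the blue team's observations in step 4."""
from dataclasses import dataclass

@dataclass
class TechniqueResult:
    technique_id: str
    log_source_present: bool  # is the required log source onboarded at all?
    log_observed: bool        # did the technique actually produce a log?
    rule_exists: bool         # is there a detection rule for this log?
    alert_fired: bool         # did the rule produce an alert?

def classify_gap(r: TechniqueResult) -> str:
    if r.alert_fired:
        return "detected"
    if not (r.log_source_present and r.log_observed):
        return "telemetry gap"    # fix: onboard or forward the log source
    if not r.rule_exists:
        return "detection gap"    # fix: write the rule
    return "tuning gap"           # fix: rule exists but was suppressed

results = [
    TechniqueResult("T1059.001", True, True, True, True),
    TechniqueResult("T1053.005", False, False, False, False),
    TechniqueResult("T1055.001", True, True, True, False),
]
for r in results:
    print(r.technique_id, "->", classify_gap(r))
```

The ordering of the checks encodes the triage priority: telemetry gaps must be ruled out before detection logic is blamed, which is the same principle the takeaways at the end of this article return to.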
Emulation Frameworks and Tools
Several platforms support adversary emulation at different levels of automation and fidelity.
MITRE CALDERA
CALDERA is MITRE's open-source adversary emulation platform. It operates as a server-agent architecture: a central server manages operation planning, and lightweight agents deployed on target hosts execute techniques. CALDERA supports both autonomous emulation (the server chains techniques into full attack paths using a planning algorithm) and manual red team engagement (an operator selects techniques interactively). Version 5.3.0, released in April 2025, includes a plugin architecture with Atomic Red Team integration, sandbox-escape abilities, and the Stockpile plugin for extended technique coverage.
Strengths: Full attack path automation, ATT&CK-native technique mapping, decision engine that adapts technique chains based on results, agent-based execution that models adversary behavior more realistically than script execution.
Limitations: Technique mappings lag behind the current ATT&CK version — sub-technique coverage is incomplete, and deprecated techniques are not always removed. The Atomic Red Team plugin has documented issues with false failure reports and incomplete technique synchronization. CALDERA requires significant infrastructure setup (server, agents, network configuration) and is better suited for dedicated security operations teams than ad-hoc testing.
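For teams scripting against a running server, here is a minimal read-only sketch, assuming CALDERA's v2 REST API and its KEY-header authentication; endpoint paths, default port, and response field names should be verified against your server version's API documentation:

```python
"""Sketch: querying a CALDERA server before an operation. Assumes the
v2 REST API ("/api/v2/...") authenticated with a KEY header; verify
paths and field names against your CALDERA version."""
import json
import urllib.request

SERVER = "http://localhost:8888"   # default CALDERA port (assumption)
API_KEY = "ADMIN123"               # replace with your configured key

def get(path: str):
    req = urllib.request.Request(f"{SERVER}{path}", headers={"KEY": API_KEY})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Deployed agents: each agent is identified by its "paw".
for agent in get("/api/v2/agents"):
    print("agent:", agent.get("paw"), agent.get("platform"))

# Adversary profiles available for an operation.
for adversary in get("/api/v2/adversaries"):
    print("profile:", adversary.get("name"))
```

The same API can create and start operations, but the request schema varies across releases, so this sketch stays read-only.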
Atomic Red Team
Atomic Red Team is a library of individual technique tests — "atomics" — mapped to ATT&CK technique IDs. Each atomic is a standalone script or command that executes a single technique behavior. Invoke-AtomicRedTeam, Red Canary's execution wrapper, provides a PowerShell-based runner that executes atomics with configurable parameters and cleanup logic.
Strengths: Lowest barrier to entry — a security analyst with PowerShell access can execute a technique test in minutes. Extensive technique coverage (2,000+ atomics). Community-maintained and regularly updated. Ideal for rapid detection validation during detection engineering sprints.
Limitations: Atomics test individual techniques in isolation; they do not compose techniques into attack paths. Execution fidelity is limited — many atomics use test signals (e.g., writing a specific registry key) rather than the actual adversary tool or procedure. An atomic for T1055.001 (DLL Injection) may use a benign test DLL, while a real adversary uses a custom-compiled DLL with anti-analysis features. This fidelity gap means a passing atomic test does not guarantee production detection.
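To see exactly what an atomic will execute before running it, read its YAML definition directly. A minimal sketch, assuming a local clone of the atomic-red-team repository and its documented file layout (atomics/<technique>/<technique>.yaml):

```python
"""Sketch: enumerating the tests in one atomic's YAML definition from a
local clone of the atomic-red-team repository. Path layout and field
names reflect the repo's documented format; verify against your checkout."""
import yaml  # pip install pyyaml

ATOMIC = "atomic-red-team/atomics/T1059.001/T1059.001.yaml"

with open(ATOMIC, encoding="utf-8") as f:
    doc = yaml.safe_load(f)

print(doc["attack_technique"], "-", doc["display_name"])
for i, test in enumerate(doc["atomic_tests"], start=1):
    executor = test["executor"]
    platforms = ", ".join(test["supported_platforms"])
    print(f"  Test {i}: {test['name']} [{platforms}] via {executor['name']}")
```

Auditing the executor commands this way also makes the fidelity gap described above concrete: many executors are short test signals rather than the adversary's actual tooling.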
Commercial Platforms
SCYTHE provides a commercial adversary emulation platform that bridges the gap between atomics and full CALDERA-style automation. It allows operators to build threat group-specific campaigns from technique modules, execute them against targets, and generate detection gap reports. Picus Security combines emulation with continuous validation — agents execute techniques against production defenses and generate coverage scores. AttackIQ (now part of ReliaQuest) provides a similar continuous validation platform with pre-built adversary scenarios.
Emerging: AI-Augmented Emulation
Research in 2025 has produced frameworks that use large language models to automate emulation planning. Aurora (published at USENIX Security 2025) uses symbolic planning paired with LLMs to generate causality-preserving attack chains from threat intelligence reports. The system parses threat reports, extracts technique sequences, validates them against environmental constraints using PDDL (Planning Domain Definition Language), and produces executable emulation plans. SynthAPT takes a different approach: it uses AI to express malware behaviors in structured JSON, enabling rapid scenario generation from malware analysis reports. These systems are promising but remain in research prototype stage — they are not yet production-grade tools for enterprise emulation programs.
Measuring Emulation Results
Execution without measurement is an expensive exercise. The key metrics that make adversary emulation output actionable are:
| Metric | Definition | 2025 Benchmark / Target |
|---|---|---|
| Detection Coverage Rate | Percentage of executed techniques that produced at least one alert | Industry average: 24% (Picus Blue Report 2025); target: 60%+ for top-10 techniques |
| Mean Time to Detect (MTTD) | Time between technique execution and alert generation | Under 30 minutes (below the average breakout time) |
| Mean Time to Respond (MTTR) | Time between alert and containment action | Under 60 minutes (before lateral movement completes) |
| Telemetry Gap Rate | Percentage of techniques where the required log source is missing | Below 15% for priority techniques |
| False Positive Rate | Percentage of alerts triggered by legitimate admin activity during the test window | Average: 26% of alerts are false positives (Picus 2025) |
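Computing these metrics from emulation output is straightforward once per-technique results are recorded. A minimal sketch with illustrative data and field names:

```python
"""Sketch: computing detection coverage, MTTD, and telemetry gap rate
from per-technique emulation results. Data and fields are illustrative."""
from datetime import datetime, timedelta

# (technique_id, executed_at, alerted_at or None, telemetry_present)
runs = [
    ("T1566.001", datetime(2025, 9, 1, 10, 0), datetime(2025, 9, 1, 10, 4), True),
    ("T1059.001", datetime(2025, 9, 1, 10, 30), None, True),
    ("T1053.005", datetime(2025, 9, 1, 11, 0), None, False),
]

detected = [(exec_t, alert_t) for _, exec_t, alert_t, _ in runs if alert_t]
coverage = len(detected) / len(runs)
mttd = sum(((a - e) for e, a in detected), timedelta()) / len(detected)
telemetry_gap_rate = sum(1 for *_, present in runs if not present) / len(runs)

print(f"Detection coverage: {coverage:.0%}")   # 33% for this sample
print(f"MTTD: {mttd}")                          # 0:04:00
print(f"Telemetry gap rate: {telemetry_gap_rate:.0%}")
```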
The 24% detection rate from the Picus Blue Report is the anchor statistic. It means that for every four techniques an adversary executes, only one produces an alert on average. Adversary emulation directly addresses this: it identifies which 76% of techniques are invisible and whether the cause is missing telemetry, missing detection logic, or rule misconfiguration.
MTTD and MTTR must be benchmarked against breakout time. CrowdStrike's 2025 Global Threat Report recorded average breakout time at 48 minutes for eCrime operations, with the fastest at 51 seconds. By 2026, CrowdStrike reported the average had fallen further to 29 minutes, meaning adversaries move roughly 1.65 times faster than a year earlier, driven in part by AI-augmented attack tooling. If MTTD exceeds breakout time, defenders are responding to an attack that has already moved past the initially compromised host. The combination of a 24% detection rate and a 48-minute breakout window means most organizations lack both the visibility and the speed to contain real attacks.
MITRE ATT&CK Evaluations: The Reference Standard
MITRE's ATT&CK Evaluations apply adversary emulation to test commercial security products. The methodology follows the same planning cycle described above: MITRE selects one or more threat groups, builds detailed emulation plans from open-source intelligence, and executes them against participating vendor solutions in a controlled environment. Results are published as technique-level detection and response scores, enabling product comparison.
The 2024 evaluation round tested 21 vendors against ransomware behaviors and North Korean state-sponsored techniques across Windows, Linux, and macOS. The 2025 round expanded to cloud security and counter-espionage scenarios, reflecting the shift toward cloud-native and identity-based attacks. Key findings across rounds:
- Detection coverage varies significantly between technique categories; initial access techniques are detected at higher rates than persistence and defense evasion.
- Analytic (behavioral) detections outperform signature-based detections against novel adversary implementations.
- Managed detection and response (MDR) services that combine product output with human analyst judgment consistently outperform standalone product deployments.
Exceptions and Limits
Adversary emulation has structural limitations that determine when its results are reliable and when they are not.
Lab vs. Production Fidelity Gap
Emulation executes known techniques in controlled conditions. Real adversaries operate in production environments with unique configurations, legacy systems, and defensive tooling that may alter observable behavior. An Atomic Red Team test for T1059.001 (PowerShell) executed from a clean terminal produces one set of telemetry; the same technique executed from a compromised Outlook child process with AMSI bypass produces a completely different set. Emulation plans that do not account for the execution context produce optimistic detection rates.
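A toy illustration of the context problem: both events below run encoded PowerShell and differ only in parent process, so a context-aware rule behaves differently on the clean test than on the real chain. All event fields are illustrative, modeled loosely on process-creation telemetry:

```python
"""Sketch: why execution context changes what a rule sees. Both events
run powershell.exe; only the parent process differs."""

clean_test = {"image": "powershell.exe", "parent": "cmd.exe",
              "cmdline": "powershell -enc <payload>"}
real_chain = {"image": "powershell.exe", "parent": "outlook.exe",
              "cmdline": "powershell -enc <payload>"}

def naive_rule(event):
    # Fires on any encoded PowerShell, regardless of context.
    return event["image"] == "powershell.exe" and "-enc" in event["cmdline"]

def contextual_rule(event):
    # Fires only when PowerShell is spawned by an Office process.
    office_parents = {"outlook.exe", "winword.exe", "excel.exe"}
    return naive_rule(event) and event["parent"] in office_parents

for name, event in [("clean test", clean_test), ("real chain", real_chain)]:
    print(name, "| naive:", naive_rule(event),
          "| contextual:", contextual_rule(event))
```

The clean test validates only the naive rule; the contextual rule, which is the one a real attack would need to trip, never gets exercised. That is the shape of an optimistic detection rate.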
Technique-Level vs. Behavior-Level Fidelity
Most emulation tools test at the technique level — they execute a behavior consistent with the ATT&CK technique description. Real adversaries implement techniques with group-specific variations: custom tooling, specific command-line syntax, non-standard parent processes, and anti-forensic modifications. A detection that fires on the Atomic Red Team test for T1055.001 may not fire when Lazarus Group uses a custom DLL injection variant with modified PE headers. The ATT&CK Evaluations address this by running multiple implementations per technique (the "various" modifier), but internal emulation programs rarely have the resources to test multiple implementations.
Scope Creep in Purple Team Exercises
Purple team exercises — where red and blue teams collaborate during emulation — are effective for rapid detection tuning but carry a scope creep risk. Without strict technique boundaries, exercises expand from "validate detection for T1078" to "test our entire identity security stack," consuming weeks and producing inconclusive results. The discipline of maintaining a scoped technique list per exercise is the single most important operational practice.
Living-Off-the-Land Blind Spots
Groups like Volt Typhoon use exclusively legitimate administrative tools — PowerShell, WMI, net.exe — making their behavior indistinguishable from normal admin activity at the technique level. Emulation tools that execute these techniques generate the same telemetry that thousands of legitimate admin sessions produce daily. Detection for these groups requires behavioral baselining and anomaly detection, not technique-level alerting. Current emulation frameworks do not model baseline deviation well; they test whether a technique produces a signature match, not whether it represents a deviation from modeled behavior.
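As a toy illustration of what baseline-aware detection looks like, the sketch below flags (host, tool) pairs never seen during a baseline window. The data, field names, and scoring function are illustrative; real baselining would also model users, command arguments, and timing:

```python
"""Sketch: baseline-deviation scoring for living-off-the-land activity.
Counts how often each (host, tool) pair appeared during a baseline
window and flags pairs never seen before."""
from collections import Counter

baseline = Counter({
    ("web01", "powershell.exe"): 212,   # daily admin automation
    ("web01", "net.exe"): 34,
    ("db01", "wmic.exe"): 18,
})

def deviation_score(host: str, tool: str) -> float:
    seen = baseline[(host, tool)]
    total_on_host = sum(n for (h, _), n in baseline.items() if h == host)
    if total_on_host == 0 or seen == 0:
        return 1.0  # never observed on this host: maximum anomaly
    return 1.0 - (seen / total_on_host)

# Volt Typhoon-style activity: legitimate tools on an unusual host.
for host, tool in [("web01", "powershell.exe"), ("db01", "net.exe")]:
    print(host, tool, f"score={deviation_score(host, tool):.2f}")
```

Note what this implies for emulation: executing net.exe on db01 scores as anomalous only because the baseline exists. A technique-level test run without a modeled baseline proves nothing about this class of detection.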
Cloud and Identity Limitations
CALDERA's core technique coverage is Windows-centric. Cloud-native techniques (OAuth application persistence, IAM role assumption, storage bucket enumeration) are underrepresented in all open-source emulation tools. The 2025 ATT&CK Evaluations round began testing cloud scenarios, but the tooling to emulate cloud-native attack paths at scale remains immature. Organizations with significant cloud attack surface should supplement CALDERA and Atomic Red Team with cloud-specific adversarial testing (e.g., CloudSploit, Pacu) and identity-centric attack path validation.
Honest Assessment
| Dimension | Atomic Red Team (Isolated Tests) | CALDERA (Automated Attack Paths) | Full Emulation (CTID Plans) |
|---|---|---|---|
| Setup effort | Minutes — PowerShell execution | Hours — server, agents, network config | Days — infrastructure, tooling, coordination |
| Technique fidelity | Low — test signals, not real adversary tools | Medium — uses CALDERA's built-in abilities | High — mirrors group-specific tooling and procedures |
| Attack path coverage | None — isolated technique tests | High — automated chains via decision engine | Highest — full kill chain scenarios |
| Results reliability | Low — optimistic (test conditions ≠ production) | Medium — depends on agent deployment fidelity | High — realistic execution context |
| Repeatability | High — scripted, parameterized | High — server-driven, reproducible | Medium — manual steps may vary |
| Best use case | Quick detection validation during build sprints | Continuous automated assessment | Quarterly deep-dive against priority threat groups |
Actionable Takeaways
- Choose one threat group and build a full emulation cycle before expanding. Select the group most relevant to your sector using the ATT&CK Groups catalog. Download the CTID Adversary Emulation Plan for that group. Execute each scenario step-by-step while monitoring production detection. The output — a per-technique detection gap list — is more valuable than executing atomics for all 200+ techniques in isolation.
- Measure MTTD against breakout time, not against zero. A 15-minute mean time to detect is not acceptable when breakout time is 48 minutes and dropping — by 2026, CrowdStrike reports it has fallen to 29 minutes. Benchmark your detection speed against the adversary's lateral movement speed. If MTTD exceeds breakout time, invest in faster telemetry pipelines and alert noise reduction before adding new detection rules.
- Classify every gap as telemetry, detection, or tuning before fixing anything. A technique that produces no alert because Sysmon Event ID 10 is not forwarded to the SIEM is a telemetry gap — adding more Sigma rules will not solve it. A technique that produces logs but no alert is a detection gap — write the rule. A technique that should alert but the rule was disabled due to false positives is a tuning gap — refine the rule logic. Fixing the wrong gap type wastes resources and produces no improvement in coverage.
- Use Atomic Red Team for velocity, CALDERA for depth, and full CTID plans for fidelity. Each tool has a role. Atomic Red Team is the unit test — fast, isolated, good for verifying that a new Sigma rule fires on the intended technique. CALDERA is the integration test — it chains techniques into paths and reveals whether detections work when the adversary does not stop between steps. Full CTID emulation plans are the end-to-end test — they verify that the entire observation-to-alert-to-response pipeline functions against realistic adversary behavior.
- Plan for LotL and cloud blind spots. If Volt Typhoon or a similar living-off-the-land group is in your threat model, accept that technique-level emulation will not validate your defenses. Invest in behavioral baselining tools that model normal admin activity patterns and alert on deviations. For cloud environments, supplement CALDERA and Atomic Red Team with cloud-specific attack tooling (Pacu for AWS, MicroBurst for Azure) and validate identity-centric detection (impossible travel, anomalous service principal activity) separately.
Upcoming installments in this series will cover detection engineering for technique coverage, purple teaming operations, and AI-augmented threat profiling.