Reverse ML: AI Writes Detection Rules It Never Runs
In 2025, the National Vulnerability Database published over 48,000 new CVEs. Security teams cannot hand-write detection rules fast enough. Amazon's RuleForge system generates production-ready rules 336% faster than humans and cuts false positives by 67% — not by running AI at runtime, but by using AI to write deterministic rules that run without it. The pattern is called Reverse ML, and it is changing how detection engineering works.
The Problem: 48,000 CVEs, Not Enough Humans
Every new CVE needs a detection rule. Someone reads the advisory, understands the exploit pattern, writes a Sigma or YARA rule, tests it against sample data, tunes false positives, and deploys it. The National Vulnerability Database published over 48,000 new CVEs in 2025. Even a skilled detection engineer takes 30 to 60 minutes per rule. The math does not work.
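For concreteness, here is roughly what that artifact looks like — a hand-written detection rule in the Python style Panther uses (more on Panther below). The process name, flag, and event fields are hypothetical; the point is the shape of the work: one explicit, testable predicate per vulnerability.

```python
# A hypothetical hand-written detection rule in the Python style Panther
# uses. The process name, flag, and field names are illustrative only.

def rule(event: dict) -> bool:
    """Fire when a process launch matches the (hypothetical) exploit pattern."""
    return (
        event.get("event_type") == "process_start"
        and event.get("process_name") == "vulnerable-agentd"
        and "--allow-remote-eval" in event.get("command_line", "")
    )

def title(event: dict) -> str:
    return f"Possible exploit attempt on {event.get('hostname', 'unknown host')}"
```

Typing one of these is fast. Reading the advisory, collecting samples, and tuning out false positives is what consumes the 30 to 60 minutes.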
Meanwhile, the CrowdStrike 2026 Global Threat Report documented an 89% year-over-year surge in AI-enabled adversary operations. Average eCrime breakout time fell to 29 minutes — the fastest ever observed was 27 seconds. Defenders need rules in minutes, not hours or days.
The obvious answer is to put AI at runtime: let an LLM read alerts, triage them, decide what matters. That approach has been tried. It adds latency (2 to 5 minutes per analysis), loses determinism (same input, different outputs on different runs), and produces alerts nobody can audit because the reasoning is buried in 40 tool calls. Detection engineering needed a different approach.
What Reverse ML Actually Means
Skyhook, a codebase detection platform, coined the term in February 2026. The idea is straightforward: instead of training a model on data and running it in production, you use AI agents to explore real-world examples and synthesize explicit, deterministic rules. AI operates at development time. The rules execute without AI at runtime.
Skyhook needed to detect programming languages, frameworks, and configurations from codebases. Calling an LLM at request time would take 2 to 5 minutes per analysis, cost $0.10 to $0.50 per repo, and produce non-deterministic results. Hand-writing rules across 100+ repos and dozens of frameworks would take months. Reverse ML let them build comprehensive rules in days: AI labels the testbed in 2 to 5 minutes per repo instead of 30 minutes by hand, investigates failures in minutes instead of an hour, and produces the same readable, deterministic rules a human would write.
The pattern has three phases, sketched in code after the list:
- AI at authoring time: Agents explore examples, generate rules, and test coverage against known cases.
- Deterministic execution: Deployed rules run in milliseconds without any AI dependency — no LLM calls, no latency, no non-determinism.
- AI as fallback: When rules are uncertain, optionally invoke AI with structured context from the rule match, giving the model a head start rather than a cold start.
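A minimal sketch of the three phases, under loose assumptions: rules are plain data authored offline, a confidence threshold decides when to fall back, and `call_llm` stands in for whatever model client you use. None of the names or thresholds below come from Skyhook.

```python
# Sketch of the three-phase flow. Rule contents, the 0.9 threshold, and
# call_llm are illustrative assumptions, not Skyhook's actual design.

RULES = [  # Phase 1 output: authored offline by AI agents, reviewed, shipped as data
    {"id": "lang-python", "pattern": "pyproject.toml", "confidence": 0.95},
    {"id": "lang-python-weak", "pattern": ".py", "confidence": 0.60},
]

def call_llm(prompt: str, context: dict) -> str:
    """Stand-in for a real model client."""
    return "llm-classification"

def detect(file_paths: list[str]) -> dict:
    # Phase 2: deterministic execution. No model call, sub-millisecond.
    matches = [r for r in RULES if any(r["pattern"] in p for p in file_paths)]
    best = max(matches, key=lambda r: r["confidence"], default=None)
    if best and best["confidence"] >= 0.9:
        return {"result": best["id"], "source": "rule"}

    # Phase 3: AI as fallback, seeded with the partial matches so the
    # model starts warm instead of cold.
    context = {"candidates": matches, "paths": file_paths[:50]}
    return {"result": call_llm("classify this repository", context),
            "source": "llm"}
```

Phase 1 is absent from the runtime code by design: the agents that wrote `RULES` never execute in production.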
Guy Podjarny, founder of Snyk and Tessl, gave the pattern a broader name in September 2025: the Rule Maker Pattern. In a widely cited essay, he documented the same architecture emerging across security (Detections.ai, Snyk), code modernization (Moderne, CodeMod), data (AI-generated SQL queries), and infrastructure (AI-generated Terraform). The principle is identical in every domain: probabilistic generation driving deterministic automation.
RuleForge: The Production Proof
Amazon's RuleForge is the most fully realized production deployment of this pattern. Published as a peer-reviewed paper in April 2026 (arXiv:2604.01977) and detailed on Amazon Science, it is an agentic-AI system that generates web vulnerability detection rules at global scale.
RuleForge decomposes rule creation into four stages mirroring human expert workflows:
- Ingestion agent: Parses CVE descriptions, extracts exploit patterns, identifies affected components.
- Generation agent: Creates detection rules targeting the specific vulnerability pattern, using a 5×5 generation strategy — producing 5 candidate rules across 5 independent attempts, then selecting the best (sketched after this list).
- Evaluation agent: Tests each rule against positive and negative samples, measuring sensitivity and specificity.
- Validation agent: A separate LLM acts as judge, using domain-specific prompts and negative phrasing to catch over-broad rules.
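To make the generation stage concrete, here is a hedged sketch of the 5×5 select step. The paper describes producing 5 candidates across 5 independent attempts and keeping the best; the scoring function and rule format below (plain regexes) are assumptions for illustration, not RuleForge's implementation.

```python
import re

def score(rule_pattern: str, positives: list[str], negatives: list[str]) -> float:
    """Sensitivity on known-exploit samples minus false fires on benign ones."""
    hits = lambda samples: sum(bool(re.search(rule_pattern, s)) for s in samples)
    return hits(positives) / len(positives) - hits(negatives) / len(negatives)

def select_best(attempts: list[list[str]], positives, negatives) -> str:
    """attempts holds 5 lists of 5 candidate rules, each list from one
    independent generation call (the LLM step itself is not shown)."""
    candidates = [rule for attempt in attempts for rule in attempt]
    return max(candidates, key=lambda r: score(r, positives, negatives))
```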
The results:
| Metric | Traditional Manual | RuleForge | Improvement |
|---|---|---|---|
| Rule creation speed | Baseline | 336% faster | 4.4× throughput |
| False positive rate | Baseline | 67% reduction | 3× fewer false positives |
| AUROC | — | 0.75 | Production-viable |
| Coverage | Limited by team size | Scales with CVE volume | Full backlog addressable |
The 67% false positive reduction comes from the judge model, which uses negative phrasing — asking "does this rule match something that is NOT the vulnerability?" — to catch over-broad detections that human reviewers often miss because they only test positive cases.
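As a sketch, a negative-phrasing check might look like the following. The prompt wording and the `judge` callable are illustrative, not RuleForge's actual prompts.

```python
# Hedged sketch of negative-phrasing validation: ask the judge whether the
# rule fires on things that are NOT the vulnerability. Prompt text is
# illustrative; "judge" stands in for a second, independent model.

NEGATIVE_PROMPT = """You are reviewing a detection rule for {cve}.
Rule: {rule}
Below is a request that is NOT an exploit of this vulnerability:
{benign_sample}
Would the rule match it? Answer MATCHES or SAFE, then explain."""

def judge_rule(judge, rule: str, cve: str, benign_samples: list[str]) -> bool:
    """Reject the rule if the judge says it fires on any benign sample."""
    for sample in benign_samples:
        verdict = judge(NEGATIVE_PROMPT.format(cve=cve, rule=rule,
                                               benign_sample=sample))
        if verdict.strip().upper().startswith("MATCHES"):
            return False  # over-broad: fires on something that is NOT the vuln
    return True
```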
RuleForge is not a prototype. It runs in production at Amazon scale, processing real CVEs and deploying rules across Amazon's infrastructure. C. J. Moses, Amazon's VP and Distinguished Engineer, publicly attributed the system's success to its multi-agent architecture and human-in-the-loop design.
The Tool Landscape
RuleForge is the flagship, but the pattern is spreading across detection engineering:
| Tool | What It Does | Rule Type | Status |
|---|---|---|---|
| Skyhook Reverse ML | AI generates detection rules from codebase examples | Deterministic pattern rules | Production (Feb 2026) |
| Panther AI Detection Builder | Conversational AI creates/modifies detection rules in Panther Console | Python, YARA, Sigma | Open Beta (v1.118+) |
| SOC Prime Uncoder AI | Real-time threat analysis and cross-SIEM rule translation | Sigma, Roota, SIEM-specific | Production |
| FALCON (arXiv) | Autonomous CTI mining and IDS rule generation with self-reflection | Snort, YARA | Research (Aug 2025) |
| SigmAIQ (AttackIQ) | LangChain + pySigma + GPT-4 framework for Sigma creation and translation | Sigma | Open Source |
| yara-gen (Deconvolute) | Generates YARA rules for prompt injection and jailbreak detection | YARA | Open Source (Jan 2026) |
| RuleChef | LLM-powered synthesis of regex and spaCy pattern rules from examples | Regex, spaCy patterns | Open Source (Nov 2025) |
| Elastic AI-assisted rules | AI-assisted rule creation in Kibana's Detection Engine | Elastic query rules | Beta (PR #247674) |
The tools vary in maturity and scope, but they share the same architecture: AI generates the rule. The rule runs deterministically. The AI steps out.
Why Deterministic Rules Beat Runtime AI
Running an LLM at detection time sounds powerful. In practice, it has four problems that Reverse ML avoids:
Latency. An LLM analysis takes 2 to 5 minutes. A deterministic rule matches in under 10 milliseconds. When eCrime breakout time is 29 minutes, detection needs to be faster than the attacker — not slower than a language model.
Determinism. The same input should produce the same output. LLMs do not guarantee this. A detection rule that fires inconsistently erodes analyst trust and makes alert triage unpredictable. Sigma and YARA rules either match or they do not. There is no ambiguity.
Auditability. When a rule fires, the reasoning is transparent: this pattern matched this condition. When an LLM fires, the reasoning is buried in tool-call chains that nobody reads and fewer can reproduce. Compliance frameworks like SOC 2 and ISO 27001 require explainable detection logic.
Cost. LLM inference at scale is expensive. RuleForge handles 48,000+ CVEs per year. Running GPT-4 on every detection event across an enterprise would cost orders of magnitude more than rule execution. The economics only work if AI is a development tool, not a runtime dependency.
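Back-of-envelope arithmetic makes the gap vivid. All three numbers below are assumptions for illustration (they appear nowhere in the RuleForge paper), but the orders of magnitude are the point:

```python
# Illustrative cost comparison under assumed figures.
events_per_day = 10_000_000          # assumed enterprise telemetry volume
llm_cost_per_analysis = 0.05         # assumed dollars per LLM triage call
rule_cost_per_event = 0.000001      # assumed compute cost of a rule match

print(f"LLM at runtime:   ${events_per_day * llm_cost_per_analysis:,.0f}/day")
print(f"Rules at runtime: ${events_per_day * rule_cost_per_event:,.0f}/day")
# LLM at runtime:   $500,000/day
# Rules at runtime: $10/day
```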
The Validation Problem
If AI writes the rules, how do you trust them? This is where most skeptics focus, and it is a legitimate concern. RuleForge's answer is the most concrete: a separate judge model evaluates each generated rule using domain-specific prompts and negative phrasing. The judge does not just verify that rules catch the vulnerability; it actively tests whether rules match things that are NOT the vulnerability.
This two-model approach — one generates, one evaluates — mirrors how senior detection engineers work. A junior engineer writes a rule. A senior engineer reviews it for over-broad matches, logical errors, and edge cases. The judge model in RuleForge plays the senior engineer role, and the 67% false positive reduction it delivers validates the approach.
FALCON, the academic framework from arXiv (2508.18684), takes validation further with self-reflection: the LLM generates an IDS rule, then critiques its own output, identifies weaknesses, and iteratively refines the rule across multiple rounds. This is the "5×5 strategy" generalized — multiple attempts, evaluated and refined, with the best candidate selected.
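In outline, that self-reflection loop looks like this. The prompts, the round count, and the `model` callable are stand-ins, not FALCON's actual implementation.

```python
# Sketch of a generate-critique-refine loop in the style FALCON describes.
# "model" is a hypothetical stand-in for an LLM client.

def refine_rule(model, threat_intel: str, rounds: int = 3) -> str:
    rule = model(f"Write a Snort rule for this threat report:\n{threat_intel}")
    for _ in range(rounds):
        critique = model(
            "Critique this Snort rule for over-broad matches, syntax "
            f"errors, and missed variants:\n{rule}"
        )
        if "NO ISSUES" in critique.upper():
            break  # the model found nothing left to fix
        rule = model(
            f"Rewrite the rule to address this critique:\n{critique}\n"
            f"Original rule:\n{rule}"
        )
    return rule
```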
The broader principle: generated rules should be treated like any other code. They go through review, testing, and staged deployment. AI makes the first draft. Humans (or a second AI) validate it. The difference is that the first draft takes minutes instead of hours, and the validation catches what manual review misses.
Exceptions: When Runtime AI Is Necessary
Not every detection problem fits deterministic rules. Three patterns resist the Reverse ML approach:
Behavioral anomaly detection. Some threats — lateral movement patterns, insider data exfiltration — only manifest as statistical deviations over time. Rules cannot express "this user's access pattern is unusual compared to their baseline." Darktrace and similar tools use per-environment behavioral models that must run at detection time. These are not rules; they are statistical profiles that require runtime inference.
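A toy example shows why, assuming a simple z-score baseline (real products use far richer models): the detection logic is a fixed formula, but it cannot be evaluated without per-user history maintained at runtime.

```python
# Why this resists static rules: the "rule" depends on per-user state that
# changes daily. The 3.0 threshold is an arbitrary illustrative choice.

from statistics import mean, stdev

def is_anomalous(todays_bytes_out: float, history: list[float],
                 threshold: float = 3.0) -> bool:
    """Flag egress volume more than `threshold` std devs above this
    user's own rolling baseline (history needs at least two points)."""
    mu, sigma = mean(history), stdev(history)
    return sigma > 0 and (todays_bytes_out - mu) / sigma > threshold
```

There is no static pattern to write down: the same 2 GB upload is routine for a build server and alarming for an HR laptop.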
Novel attack patterns with no prior examples. Zero-day exploits by definition have no labeled data for AI to learn from. Rules are pattern matchers. If there is no pattern to match, there is no rule to write. Runtime behavioral analysis or sandbox detonation fills this gap.
Multi-step correlation across time and telemetry sources. Some detection requires correlating dozens of events across hours of log data — the kind of investigation a senior threat hunter performs. This is where autonomous threat hunting platforms (Dropzone AI, IBM ATOM) have a legitimate role, operating at runtime because the investigation itself is the detection.
The pattern that works: use Reverse ML for everything that can be expressed as a rule. Fall back to runtime AI only for the cases that resist rule-based expression. This is the "AI as fallback" phase in Skyhook's model — structured context from rule matches feeds the LLM, giving it a head start instead of a cold start.
Honest Assessment
| Dimension | Reverse ML (AI-writes-rules) | Runtime AI (LLM-at-detection-time) |
|---|---|---|
| Speed | Under 10ms per event | 2–5 minutes per analysis |
| Determinism | Exact: same input, same output | Variable: same input, different runs possible |
| Auditability | Full: rule logic is readable | Low: reasoning buried in tool calls |
| Cost at scale | Negligible (rule execution) | Significant (LLM inference per event) |
| Novel attack coverage | Limited to known patterns | Can reason about unseen patterns |
| Behavioral detection | Cannot express statistical baselines | Can model deviation from baselines |
| Compliance fit | Excellent (SOC 2, ISO 27001) | Poor (explainability requirements) |
| Rule creation speed | Minutes per rule (AI-assisted) | N/A (no rules to create) |
The table makes the tradeoff clear. Reverse ML wins on speed, determinism, auditability, cost, and compliance. Runtime AI wins on novel attacks and behavioral patterns. Neither replaces the other. The architecture that works is the one Skyhook described: deterministic rules handle 90%+ of cases instantly, and AI handles the remainder with rich context from the rule match.
Actionable Takeaways
Start with your rule backlog. Every detection team has a list of CVEs or threat reports that need rules but have not been written because the queue is too long. That is your Reverse ML starting point. Point an AI tool at the CVE descriptions and generate draft rules for each. The AI first draft is not production-ready — but it turns hours of writing into minutes of review.
Separate generation and validation. RuleForge's most impactful insight is not the generation agent; it is the judge agent. Whatever tool you use to generate rules, build a separate validation step. The judge should test what the rule does not match, not just what it does. Negative-phrasing prompts ("does this match something that is NOT the vulnerability?") catch over-broad rules that positive-only testing misses.
Treat generated rules like code. Put them in version control. Review them in pull requests. Deploy them through CI/CD pipelines. The Elastic Detection-as-Code beta and Google SecOps Terraform integrations both support this workflow. AI-generated rules are not less trustworthy than human-written rules — they are different, and they deserve the same testing and review process.
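A minimal sketch of what that gate can look like: a CI unit test every generated rule must pass before merge. The rule, samples, and format (a plain regex) are hypothetical; adapt them to your own rule engine and telemetry.

```python
# Hypothetical pytest-style gate for an AI-generated rule. The regex and
# sample traffic below are invented for illustration.

import re

GENERATED_RULE = r"POST\s+/api/v1/eval\b.*__import__"   # from the AI draft

TRUE_POSITIVES = ["POST /api/v1/eval payload=__import__('os')"]
KNOWN_BENIGN = ["POST /api/v1/evaluate score=0.9",
                "GET /api/v1/eval/docs"]

def test_rule_catches_exploits():
    assert all(re.search(GENERATED_RULE, s) for s in TRUE_POSITIVES)

def test_rule_ignores_benign_traffic():
    assert not any(re.search(GENERATED_RULE, s) for s in KNOWN_BENIGN)
```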
Measure coverage before quantity. The 48,000 CVEs are not equally important. Run your AI rule generator against your top-priority CVEs first, validate against your environment's telemetry, and measure true positive rate before deploying. RuleForge's 0.75 AUROC is production-viable but not perfect. Perfect is the enemy of deployed.
Reserve runtime AI for what rules cannot express. Behavioral anomaly detection, zero-day investigation, and multi-step correlation are legitimate runtime-AI use cases. But if a detection can be expressed as a deterministic rule, it should be. Every rule you generate is one fewer event requiring expensive, slow, non-deterministic LLM analysis at runtime.