When Shopify CEO Tobi Lutke described context engineering as "the art of providing all the context for the task to be plausibly solvable by the LLM," he was naming something practitioners had been discovering the hard way: most agent failures are not model failures. They are context failures. By early 2026, the discipline had expanded again. Harness engineering — the design of the entire production scaffolding around an AI agent — emerged as the third stage in a progression that tracks how AI systems move from demos to production.

Stage One: Prompt Engineering — Shaping the Query

Prompt engineering was the first discipline, and for good reason. When GPT-3 launched in 2020, the model was the entire interface. The prompt was the only lever. Practitioners discovered that how you asked mattered as much as what you asked — and a vocabulary of techniques emerged: chain-of-thought reasoning, few-shot examples, role assignment, output format constraints.

The techniques worked, within limits. Prompt engineering optimizes a single interaction: one query, one response. It treats the model as a function you call, and the prompt as the argument you pass. For chatbots answering isolated questions, that framing is sufficient. For agents executing multi-step workflows over dozens of turns, it breaks down. The prompt might be excellent, but everything around it — retrieved documents, conversation history, accumulated tool outputs — determines whether the agent stays coherent or drifts into irrelevance after turn ten.

Elasticsearch Labs drew the parallel to web development: just as web design once encompassed everything, then split into UI and UX as the field matured, AI development is splitting into prompt engineering (the query layer) and context engineering (the environment layer). They remain interconnected, but each requires distinct expertise.

Stage Two: Context Engineering — Curating the Information Environment

Context engineering expands the scope from the prompt to everything the model sees before it generates a response. Philipp Schmid at Hugging Face cataloged the components: structured output definitions, available tool schemas, retrieved information via RAG, long-term memory, short-term conversation state, and the user prompt itself. The discipline is not about writing better instructions. It is about building the information architecture the model operates within.
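
A minimal sketch of how those components might come together before a single model call is shown below. The field names and the assemble() ordering are assumptions for illustration, not a reference implementation of Schmid's catalog.

```python
from dataclasses import dataclass

# The components cataloged above, gathered into one structure before a model call.
# Field names and the assemble() ordering are illustrative assumptions.
@dataclass
class ContextPackage:
    system_instructions: str        # behavioral rules + structured-output definition
    tool_schemas: list[dict]        # usually passed as a separate API parameter
    retrieved_chunks: list[str]     # RAG results selected for this turn
    long_term_memory: list[str]     # facts persisted across sessions
    conversation_state: list[dict]  # recent turns, kept deliberately short
    user_prompt: str

    def assemble(self) -> list[dict]:
        """Flatten everything except tool schemas into a message list."""
        background = "\n\n".join(
            ["## Memory", *self.long_term_memory,
             "## Retrieved documents", *self.retrieved_chunks]
        )
        return [
            {"role": "system", "content": self.system_instructions},
            {"role": "system", "content": background},
            *self.conversation_state,
            {"role": "user", "content": self.user_prompt},
        ]
```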

The shift solves a specific class of failures that prompt engineering cannot touch. Three stand out:

  • Lost in the Middle. Liu et al. (2023) demonstrated that LLMs perform significantly worse on content placed in the middle of long contexts compared to content at the beginning or end — a U-shaped attention curve. A RAG system that concatenates 40 retrieved chunks in retrieval order might bury the most relevant chunk at position 20, exactly where the model's attention is weakest. Re-ranking and position-aware assembly (primacy-recency ordering) address this directly; a sketch of the ordering appears after this list.
  • Context Bloat. Every token in the context window costs money and consumes attention capacity. Boilerplate headers, redundant content, and low-relevance retrieved chunks do not just waste tokens — they dilute the signal-to-noise ratio for the model's attention mechanism. A 128K context window feels enormous until an agent is 30 steps deep with accumulated tool outputs, error messages, and retry attempts filling the window with stale content.
  • Memory Amnesia. Across sessions, most LLM systems start from scratch. User preferences established in one conversation are unknown in the next. Decisions made three sessions ago must be re-explained. For agents handling ongoing tasks — code agents on multi-session projects, customer-facing assistants with repeat users — this is a reliability problem, not a convenience problem.
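
The ordering fix referenced in the first bullet is small enough to sketch. The function below assumes relevance scores already exist (for example, from a cross-encoder re-ranker) and shows only the placement logic: strongest chunks at the edges of the context, weakest pushed toward the middle.

```python
def primacy_recency_order(chunks: list[str], scores: list[float]) -> list[str]:
    """Place the strongest chunks at the start and end of the context window,
    pushing the weakest toward the middle where attention is lowest."""
    ranked = [c for _, c in sorted(zip(scores, chunks), reverse=True)]
    front, back = [], []
    for i, chunk in enumerate(ranked):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

# Illustrative scores, e.g. from a cross-encoder re-ranker.
chunks = ["chunk A", "chunk B", "chunk C", "chunk D", "chunk E"]
scores = [0.91, 0.40, 0.75, 0.62, 0.15]
print(primacy_recency_order(chunks, scores))
# ['chunk A', 'chunk D', 'chunk E', 'chunk B', 'chunk C']
```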

Redis, defining context engineering as "the discipline of systematically selecting, structuring, and delivering the right context for LLM applications," identified four operational components that map to these failure modes: retrieval with re-ranking, position-aware assembly, context compression, and memory architecture. The pattern is consistent — each component exists because naive context assembly breaks in predictable, documented ways.

| Context Failure Mode | Root Cause | Engineering Response |
| --- | --- | --- |
| Lost in the Middle | U-shaped attention over long contexts | Primacy-recency ordering + re-ranking |
| Context Bloat | Uncontrolled token accumulation | Active compression + budgeting |
| Memory Amnesia | Stateless session boundaries | Long-term memory architecture + session continuity |
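
The "active compression + budgeting" response in the table above can be as simple as a relevance-ordered token budget. The sketch below is illustrative: the four-characters-per-token estimate and the default budget are assumptions, and a production system would use the model's actual tokenizer.

```python
def fit_to_budget(scored_chunks: list[tuple[float, str]],
                  max_tokens: int = 8_000) -> list[str]:
    """Keep the most relevant chunks that fit inside the token budget; drop the rest."""
    def estimate(text: str) -> int:
        return len(text) // 4                 # crude chars-per-token heuristic
    kept, used = [], 0
    for _score, chunk in sorted(scored_chunks, reverse=True):  # most relevant first
        cost = estimate(chunk)
        if used + cost > max_tokens:
            continue      # skip whole chunks rather than truncating them mid-sentence
        kept.append(chunk)
        used += cost
    return kept
```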

Stage Three: Harness Engineering — Building the Production Scaffold

If prompt engineering shapes the query and context engineering curates the information environment, harness engineering designs the production infrastructure that makes the entire system reliable. The formula is deceptively simple — Agent = Model + Harness — and the harness is everything else: tool schemas, permission models, context lifecycle management, feedback loops, sandboxing, documentation infrastructure, and architectural invariants.
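
Read as composition, the formula suggests a shape like the sketch below. The component names mirror the list above; the types are illustrative assumptions, not any particular framework's API.

```python
from dataclasses import dataclass
from typing import Callable

# "Agent = Model + Harness" read as composition. Types are illustrative assumptions.
@dataclass
class Harness:
    tool_schemas: list[dict]                  # what the agent is allowed to call
    permission_model: dict[str, str]          # e.g. {"delete_file": "require_approval"}
    context_policy: Callable[[list], list]    # lifecycle rules: compaction, resets
    guides: list[str]                         # AGENTS.md, architecture docs
    sensors: list[Callable[[], bool]]         # post-action checks: linters, tests
    sandbox: str                              # execution boundary, e.g. a container image

@dataclass
class Agent:
    model: str        # the model identifier
    harness: Harness  # everything else
```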

Anthropic's engineering blog introduced the concept while building Managed Agents, their hosted service for long-horizon autonomous tasks. The core observation: harnesses encode assumptions about what the model cannot do on its own. Those assumptions go stale as models improve. Claude Sonnet 4.5 would wrap up tasks prematurely as it sensed its context limit approaching — "context anxiety." The harness added context resets. When Claude Opus 4.5 eliminated that behavior, the resets became dead weight. The remedy is architectural: keep the session, harness, and sandbox behind stable, decoupled interfaces so the harness can evolve without breaking the rest of the system.

Tian Pan's analysis of SWE-bench results provides the clearest evidence for harness primacy: the same model scores 20–30 percentage points differently depending on the scaffold wrapping it. SWE-bench is not just testing the model. It simultaneously evaluates the harness. Teams treating model choice as the primary reliability variable are measuring the wrong thing.

Guides and Sensors: The Harness Taxonomy

The most useful framework for harness design distinguishes two fundamentally different controls:

  • Guides (feedforward, pre-action): AGENTS.md files, architecture documentation, bootstrapping scripts. They encode what good looks like and prevent bad outputs proactively by injecting project-specific knowledge that does not live in the model's weights.
  • Sensors (feedback, post-action): Type checkers, linters, end-to-end test suites, AI code reviewers. They observe what the agent did and create signals for correction, enabling self-correction within a session rather than requiring human intervention at each step.

Sensors split further by type. Computational sensors — type checkers, formatters, structural linters — are deterministic, run in milliseconds, and provide binary pass/fail feedback. Inferential sensors — AI reviewers that assess whether code actually satisfies the intent — are probabilistic, slower, and catch meaning-level errors that structural tools miss entirely. A mature harness uses both.
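
A minimal sketch of both sensor types feeding one correction loop, under the assumption that the agent edits Python files: the mypy invocation is a real CLI call, while ask_reviewer_model is a hypothetical stand-in for an AI reviewer.

```python
import subprocess

def ask_reviewer_model(files: list[str]) -> dict:
    """Hypothetical inferential sensor: a second model reviews the change against
    the stated intent. Stubbed here; a real version would call an LLM with a rubric."""
    return {"verdict": "approve", "notes": ""}

def run_sensors(changed_files: list[str]) -> list[str]:
    """Run both sensor types and return findings for the agent's next turn."""
    findings = []

    # Computational sensor: deterministic, milliseconds, binary pass/fail.
    result = subprocess.run(["mypy", *changed_files], capture_output=True, text=True)
    if result.returncode != 0:
        findings.append(f"type check failed:\n{result.stdout}")

    # Inferential sensor: probabilistic, slower, judges meaning rather than structure.
    review = ask_reviewer_model(changed_files)
    if review["verdict"] != "approve":
        findings.append(f"reviewer concerns: {review['notes']}")

    return findings
```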

| Harness Control | Direction | Examples | What It Prevents |
| --- | --- | --- | --- |
| Guides | Feedforward (before action) | AGENTS.md, architecture docs, bootstrapping scripts | Wrong assumptions, missing context, hallucinated paths |
| Computational Sensors | Feedback (after action) | Type checkers, linters, formatters | Structural errors, contract violations |
| Inferential Sensors | Feedback (after action) | AI reviewers, semantic validators, test suites | Intent mismatches, logical errors, correctness gaps |

How the Three Stages Connect

[Figure: three nested layers. ① Prompt Engineering: single query and response, instruction design, format control. ② Context Engineering: information environment, memory, retrieval, session management. ③ Harness Engineering: production infrastructure, guides and sensors, separate evaluator, structural invariants. Each layer subsumes the one below.]
The three disciplines as nested layers. Harness engineering subsumes context engineering, which subsumes prompt engineering. Each outer layer addresses failure modes the inner layers cannot.

These three disciplines are not competing approaches. They are layers in a stack, each building on the one below:

  • Prompt engineering remains essential for single-turn query optimization. It is the right tool when you need to coax a specific output format from a model, when you are building templates for consistent response structures, or when you are iterating on instruction clarity for a well-defined task.
  • Context engineering becomes necessary when the system spans multiple turns, retrieves external information, or maintains state. It is the right discipline when RAG is involved, when agents need to stay coherent across long sessions, or when you are managing memory architectures across session boundaries.
  • Harness engineering is required when the agent operates autonomously in production: making tool calls, modifying files, interacting with external systems. It is the right discipline when the agent's actions have real consequences and must be constrained by feedback loops and structural invariants.

A common error is treating these as either/or choices. In practice, effective AI systems use all three. The prompt defines the task. The context provides the environment for reasoning. The harness provides the safety boundaries and feedback mechanisms that keep the system reliable over time.

| Discipline | Scope | Failure Mode It Solves | When You Need It |
| --- | --- | --- | --- |
| Prompt Engineering | Single query ↔ response | Ambiguous instructions, wrong format | Chatbots, single-turn tools, template design |
| Context Engineering | Information environment across turns | Lost context, bloat, amnesia | RAG systems, multi-turn agents, memory-dependent tasks |
| Harness Engineering | Full production infrastructure | Unreliable autonomous actions, drift, self-evaluation bias | Autonomous agents, production code generation, long-running workflows |

Exceptions and Limits

The three-stage model has boundaries. Not every AI system needs all three layers. A classification endpoint with a fixed prompt and no retrieval operates entirely at the prompt layer. A Retrieval-Augmented Generation pipeline that answers questions from documents but does not take autonomous actions needs prompt and context engineering, but not a full harness. The discipline progression tracks the system's autonomy level, not its complexity.

Harness engineering carries its own risks. An overdesigned harness can constrain the model too tightly, preventing it from finding creative solutions that fall outside the guide and sensor boundaries. LLM-generated AGENTS.md files, despite seeming useful, actively hurt agent performance compared to human-curated versions — approximately a four-percentage-point degradation on agent benchmarks. The file that loads into every session and shapes every subsequent decision should be maintained by humans, not auto-generated.

Separating the generator from the evaluator adds architectural complexity. Anthropic's pattern — planner, generator, evaluator as separate agents — is powerful but requires a negotiation phase before implementation begins, agreeing on what "done" looks like before code is written. For small tasks, that overhead is not justified. Teams should assess whether the failure cost of an unconstrained agent justifies the harness engineering investment.
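
A minimal sketch of the generator/evaluator split is shown below, with call_model as a hypothetical wrapper around any chat-completion API and an illustrative loop structure; this is not Anthropic's published implementation.

```python
def call_model(role_prompt: str, task: str) -> str:
    """Hypothetical wrapper around any chat-completion API."""
    raise NotImplementedError("wire this to your model provider")

def generate_with_separate_evaluator(task: str, rubric: str, max_rounds: int = 3) -> str:
    """Generator and evaluator run as distinct calls; the evaluator sees only the
    task, the pre-agreed rubric, and the draft it must grade."""
    draft = call_model("You are the generator. Produce a solution.", task)
    for _ in range(max_rounds):
        verdict = call_model(
            "You are the evaluator. Grade the draft against the rubric. "
            "Reply PASS, or list the specific failures.",
            f"Task: {task}\n\nRubric: {rubric}\n\nDraft:\n{draft}",
        )
        if verdict.strip().upper().startswith("PASS"):
            return draft
        draft = call_model(
            "You are the generator. Revise the draft to fix the listed failures.",
            f"Task: {task}\n\nFailures:\n{verdict}\n\nPrevious draft:\n{draft}",
        )
    return draft
```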

Honest Assessment

| Dimension | Strength | Limitation |
| --- | --- | --- |
| Prompt Engineering | Fast iteration, low barrier to entry | Solves only single-turn problems; no state management |
| Context Engineering | Addresses multi-turn coherence and RAG reliability | Does not constrain autonomous actions; requires harness for safety |
| Harness Engineering | Enables production-grade autonomous agents | High design overhead; overconstraint risk; stale assumptions as models improve |
| Three-Stage Model | Clear scope boundaries; progressive adoption path | Terminology is still settling; vendor-driven definitions may fragment meaning |

Actionable Takeaways

  • Audit your current system against the three layers. If you are running a RAG pipeline and treating it as a prompt engineering problem, you are missing the context engineering layer — and likely seeing "model failures" that are actually context assembly failures. Map each failure to the discipline that owns it.
  • Build guides before sensors. Feedforward controls — architecture docs in the repository, AGENTS.md files, bootstrap scripts — cost less to implement than feedback controls and prevent failure modes earlier. Invest in making the repository legible to a zero-context agent before building complex evaluation pipelines.
  • Enforce invariants rather than describing preferences. An architectural boundary violation that fails CI is self-enforcing; the same rule in a style guide is a suggestion the agent will forget. Encode structural constraints as tests, not documentation. A test sketch appears at the end of this section.
  • Separate generation from evaluation. Any system where the generating agent judges its own output has confidence without accuracy. Models consistently overestimate the quality of their own work, and the bias compounds as sessions get longer. A separate evaluator with a defined rubric makes assessment explicit and inspectable.
  • Design harnesses for replacement. Harness assumptions go stale as models improve. Anthropic's pattern of decoupling session, harness, and sandbox means you can swap harness components without breaking the system. Build interfaces that stay stable even as implementations change.
  • Start at the layer your system actually needs. Not every system requires a full harness. Match the discipline to the autonomy level: prompt engineering for stateless tools, context engineering for multi-turn systems, harness engineering for autonomous agents with real-world side effects.
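
As a closing illustration of the "enforce invariants" takeaway, the test below encodes an architectural boundary as a CI gate. The layer names and directory layout are assumptions about a hypothetical repository.

```python
import pathlib
import re

def test_api_layer_does_not_import_db_internals():
    """Architectural invariant as a CI gate: the (hypothetical) api layer must not
    reach into db internals. Fails the build instead of relying on a style guide."""
    violations = []
    for path in pathlib.Path("src/api").rglob("*.py"):
        source = path.read_text(encoding="utf-8")
        if re.search(r"^\s*(from|import)\s+src\.db\.internal", source, re.MULTILINE):
            violations.append(str(path))
    assert not violations, f"api layer imports db internals: {violations}"
```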