An agent ships. The demo looks good. Then in real use, something goes sideways — wrong decision, wrong priority, output that completely misses the point. The default diagnosis is familiar: "The model isn't good enough."
New model gets swapped in. Prompt gets tweaked. Sometimes it gets better. Often the same failure shows up wearing different clothes.
The model wasn't the problem.
Last month I ran a small experiment in my own workflow. I set up an agent to process customer support tickets — weekly summaries, flagging recurring issues, classifying complaints by theme. The first output surprised me. The agent surfaced not the most frequently repeated problems, but the most dramatically written ones. High-emotion language, long tickets, multiple exclamation points.
Was the model wrong? No. I told it to find "important complaints." There was no reason for it to define "important" the way I would.
I had built the context wrong.
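To make the gap concrete, here is a minimal sketch of two equally defensible readings of "important". The tickets, the theme labels, and the drama heuristic (text length plus a bonus per exclamation point) are invented for illustration. They are not the data or the scoring from my actual experiment.

```python
from collections import Counter

# Hypothetical tickets as (theme, text) pairs, invented for illustration only.
TICKETS = [
    ("login", "Can't log in after the update."),
    ("login", "Login fails on mobile."),
    ("login", "Password reset link does nothing."),
    ("billing", "I was charged TWICE!!! This is unacceptable!!! "
                "I have been a customer for years and I expect an answer today!!!"),
]


def rank_by_recurrence(tickets):
    """Order themes by how many tickets mention them."""
    counts = Counter(theme for theme, _ in tickets)
    return [theme for theme, _ in counts.most_common()]


def rank_by_drama(tickets):
    """Order themes by their single loudest ticket (length plus exclamation marks)."""
    loudest = {}
    for theme, text in tickets:
        score = len(text) + 10 * text.count("!")
        loudest[theme] = max(loudest.get(theme, 0), score)
    return sorted(loudest, key=loudest.get, reverse=True)


print(rank_by_recurrence(TICKETS))  # ['login', 'billing']
print(rank_by_drama(TICKETS))       # ['billing', 'login']
```

Both functions are a reasonable answer to "find the important complaints." Only one of them matches what I meant, and nothing in my instruction said which.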
A lot of teams are stuck in the same loop right now. Agent behaves badly → prompt gets rewritten → agent behaves slightly better → agent behaves badly again in a different situation. The cycle doesn't end because nobody goes after the root cause.
The root is this: most product teams' definitions of "success" and "priority" are written to be understood by humans — not in a form an agent can actually use.
"Improve user satisfaction." "Fix critical bugs first." "Focus on high-value customers."
These are strategies. To an agent, they're empty words.
This is where the PM role is actually changing.
Old model: PM coordinates what gets done, engineering figures out how. Now coordination is being automated. What's left is defining what gets done, and the form that definition has to take is shifting too.
You can tell a human "focus on this customer segment" and they'll fill in the gaps over time. For an agent, that sentence isn't even a starting point.
For an agent to make usable decisions, you need to have already answered: Which signals determine priority? When goals conflict, which one wins? What counts as "normal" versus something to escalate? What data is reliable, and what's noise?
That list belongs to the PM. Nobody else is going to write it.
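One possible shape for that list is a small structure with one field per question, plus a way to render it as plain text for the agent. This is a sketch under assumptions: the class name `DecisionContext`, its field names, and every example value below are hypothetical, not a standard format or anything a particular agent framework requires.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecisionContext:
    # Which signals determine priority, most important first.
    priority_signals: tuple
    # When goals conflict, the goal listed earlier wins.
    goal_precedence: tuple
    # What counts as something to escalate instead of handling as normal.
    escalate_if: tuple
    # Which data is reliable; anything not listed is treated as noise.
    trusted_sources: tuple

    def winner(self, goal_a: str, goal_b: str) -> str:
        """Return whichever of the two goals ranks higher in goal_precedence."""
        return min(goal_a, goal_b, key=self.goal_precedence.index)

    def to_prompt(self) -> str:
        """Render the context as plain text for an agent's instructions."""
        sections = [
            ("Priority signals, most important first", self.priority_signals),
            ("When goals conflict, the earlier one wins", self.goal_precedence),
            ("Escalate to a human when", self.escalate_if),
            ("Trusted data sources (treat others as noise)", self.trusted_sources),
        ]
        lines = []
        for heading, items in sections:
            lines.append(heading + ":")
            lines.extend(f"- {item}" for item in items)
        return "\n".join(lines)


# Hypothetical example values, for illustration only.
SUPPORT_CONTEXT = DecisionContext(
    priority_signals=("number of distinct customers affected", "recurrence across weeks"),
    goal_precedence=("data integrity", "customer retention", "response speed"),
    escalate_if=("ticket mentions patient data", "one issue reported by many accounts in a week"),
    trusted_sources=("ticket system export", "billing records"),
)

print(SUPPORT_CONTEXT.winner("response speed", "data integrity"))  # data integrity
print(SUPPORT_CONTEXT.to_prompt())
```

The code itself is trivial. The point is that every field has to be filled in by someone who knows the product.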
I manage three products serving different healthcare audiences — dentists, general practitioners, dietitians — all on the same codebase. "Find the most requested features" is not a useful instruction to give an agent, because a dentist's feature request and a dietitian's feature request can't be evaluated with the same formula. Segment size, revenue impact, churn risk, technical cost — the weights shift by context.
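As a rough sketch of what "the weights shift by context" could look like once written down: the same request, with the same signal values, scored under two segment-specific weight sets. The weights, the 0-to-1 signal scale, and the linear formula are all assumptions made up for this example. They are not the numbers I actually use.

```python
# Hypothetical per-segment weights; each signal is assumed to be normalized to 0..1.
SEGMENT_WEIGHTS = {
    "dentist":   {"segment_size": 0.2, "revenue_impact": 0.5, "churn_risk": 0.2, "technical_cost": 0.1},
    "dietitian": {"segment_size": 0.3, "revenue_impact": 0.1, "churn_risk": 0.5, "technical_cost": 0.1},
}

BENEFIT_SIGNALS = ("segment_size", "revenue_impact", "churn_risk")


def priority_score(segment: str, signals: dict) -> float:
    """Weighted benefit minus weighted technical cost, using the segment's weights."""
    weights = SEGMENT_WEIGHTS[segment]
    benefit = sum(weights[name] * signals[name] for name in BENEFIT_SIGNALS)
    return benefit - weights["technical_cost"] * signals["technical_cost"]


# One hypothetical feature request, described by the same signals in both cases.
request = {"segment_size": 0.5, "revenue_impact": 0.9, "churn_risk": 0.2, "technical_cost": 0.5}

print(round(priority_score("dentist", request), 2))    # 0.54
print(round(priority_score("dietitian", request), 2))  # 0.29
```

Identical inputs, different priority. Which weight set applies is exactly the kind of thing the agent has no way to infer.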
An agent can't learn that from observing me. If I don't write it down, the agent guesses. And when it guesses, it gravitates toward the loudest signal: the most dramatic complaint, the most frequently repeated request, the longest open ticket.
That's not bad model behavior. That's rational behavior given incomplete context.
I've started calling this "context architecture." (The phrase is a bit academic, but I haven't found a better one.)
The basic idea: teams working with agents need to build a habit of documenting the reasoning behind their decisions — not just "what we decided," but "under what conditions we'd decide differently."
Even a simple living document does the job. But without it, every new agent deployment runs into the same wall from scratch.
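A living document like that can be very small. Below is one hypothetical shape for a single entry, recording both the decision and the conditions that would reverse it. The `DecisionRecord` name, its fields, and the example entry are illustrative assumptions, not an established template.

```python
from dataclasses import dataclass, field


@dataclass
class DecisionRecord:
    decision: str    # what we decided
    reasoning: str   # why we decided it
    # Conditions under which we would decide differently.
    decide_differently_if: list = field(default_factory=list)

    def to_text(self) -> str:
        """Render one entry as plain text that can be handed to an agent."""
        lines = [f"Decision: {self.decision}", f"Why: {self.reasoning}"]
        if self.decide_differently_if:
            lines.append("We would decide differently if:")
            lines.extend(f"- {condition}" for condition in self.decide_differently_if)
        return "\n".join(lines)


def render_log(records) -> str:
    """Join all entries into one document, separated by blank lines."""
    return "\n\n".join(record.to_text() for record in records)


# Hypothetical entry, for illustration only.
LOG = [
    DecisionRecord(
        decision="Rank weekly complaint themes by number of distinct customers.",
        reasoning="Ticket length and tone reflect writing style, not how widespread a problem is.",
        decide_differently_if=[
            "a single ticket reports data loss",
            "the ticket comes from an account flagged as at risk of churning",
        ],
    ),
]

print(render_log(LOG))
```

The third field is the one that usually goes unwritten, and it is the one an agent needs most.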
Model performance has improved dramatically over the last two years. Teams' ability to document decision logic clearly has not improved at the same rate.
That gap doesn't close on its own. And until it does, agent failures will keep coming — along with the wrong diagnosis that blames the model.
The right question isn't: "Is there a better model?"
It's: "Did we actually tell the agent what it needs to know?"