Last Tuesday, a message came in on Slack: "GPT-5.5 is out — you seen it?"
I had. It was already trending everywhere. Benchmarks were wild. Everyone was excited.
I was too. But for a slightly different reason. Because announcements like this aren't celebrations for me — they're decision moments.
Upgrade or not?
Sounds simple. New model is better, so switch. Right?
Not quite.
We run AI features across three different products. Each one has system prompts tuned to the current model, temperature values tested carefully, output formats that users have normalized into their daily workflow. Switching to a new model means revalidating all of that.
"Behavior change" sounds technical. What it actually means: a feature that always worked a certain way now works differently. Without warning. Without announcement.
We lived this in our health SaaS products. A module handling prescription suggestions — we updated the model version. Small change, nothing major on paper. A week later, support tickets started climbing. Users said "the suggestions feel different." The new model was technically more capable. But it had changed an output format that users had built their routine around.
Three business days to roll back to the previous behavior.
Nobody talks about this decision
At AI conferences we talk about: agent architecture, RAG, fine-tuning, LLM safety.
We don't talk about: how do you actually decide when to upgrade a model?
Here's the thing. This is a PM decision now. And there's no framework for it.
Developers want to upgrade — new capabilities are tempting, the benchmarks are shiny. Customer support worries about the friction of change. I'm standing in the middle, asking: now? later? never?
Last month I was reviewing a SaaS integration where the model version wasn't locked. The API was auto-updating behind the scenes. The customer had no idea. Until the behavior changed.
That's a vulnerability. Treat it like one.
Model lifecycle management
Software development has version management. Dependency management. Deprecation notices.
AI models? Still maturing. But we can't wait for the ecosystem to catch up.
Here's what we've started doing:
Always lock the model version. OpenAI and Anthropic both support specific version strings — use gpt-4-turbo-2024-04-09, not an alias. Aliases update silently. You don't notice until users do.
Run behavioral tests in staging. Before any model hits production, run your full prompt set against it. Not just functional testing — behavioral testing. Do the outputs feel the same? Same length? Same format? Do edge cases behave as expected?
Treat the upgrade like a feature. Put it in a sprint. Define acceptance criteria. Plan the rollout. "Let's quickly update it" doesn't work here. I learned that the hard way after 17 open support tickets traced back to a model update made three weeks earlier.
All of this sounds obvious. It didn't feel obvious in the moment.
The real question
GPT-5.5 might be excellent. DeepSeek V4 might top every benchmark. Claude Opus might handle anything you throw at it.
But none of that answers this: in my product, for this specific use case, does this model's behavior match what my users expect?
Only you can answer that. And asking that question matters as much as choosing the model.
When a new model drops, everyone shares benchmarks. Nobody asks "and now what do we actually do?"
I'm asking.