Sprint review last month. The metrics were looking good:
"AI suggestion feature usage rate: 71%. Sprint target: 50%."
Everyone was happy. There was actual celebration.
That same week, a user interview — a dentist. "Yeah, I see the AI recommendation every time," she said. "But I usually just close it. That's kind of my job."
And that's when it clicked. Usage numbers tell you almost nothing about whether your AI feature is actually working.
Traditional product metrics answer specific questions. Is the feature there? Can users find it? Are people clicking on it?
Those questions made sense for a button, a filter, a dropdown. They don't hold up for AI output.
A user can see your AI recommendation every single day. Dismiss it every single day. Never say a word. Your metric goes up. Your dashboard stays green. And you have no idea whether the model is producing anything useful.
A green dashboard feels good. But feeling good and working well are very different things.
ML engineers call this the "eval problem." There are whole disciplines built around it — ground truth datasets, precision, recall, benchmarks. Rigorous, systematic ways to know if a model is doing what you think it's doing.
Product managers don't have an equivalent. Or they do, and nobody uses it.
Think about this: You shipped an AI feature four months ago. Since then, there have been three model updates, a prompt change, and a UI revision. Is that feature producing good output right now?
You probably don't know.
A few months back, I was working on a diagnostic suggestion module for a clinical management platform. Doctors were seeing the recommendations — usage rate was high because the panel opened automatically on every patient record. But they almost never acted on it. Action rate was somewhere in the single digits.
Usage rate: looks impressive. Action rate: single digits.
The gap between those two numbers is where AI value actually lives. Most teams measure the first and forget the second. It's a blind spot that compounds quietly over time: each model update, sprint, or release can widen the gap a little more without anyone noticing.
So what do you do differently?
Sample your outputs manually. Once a month, pull 50 random AI outputs. Have someone on the product team, or a domain expert, review them. Not a metric. Human judgment. Yes, this is manual. There's no shortcut here. A machine can't yet reliably judge whether another machine's output was useful to your users.
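If you want the monthly pull to be truly random rather than "whatever was on the first page of the log," a few lines of Python will do it. This is a minimal sketch, not a prescribed tool: the function name sample_for_review, the seed parameter, and the made-up output IDs are all assumptions for illustration.

```python
import random


def sample_for_review(output_ids, k=50, seed=None):
    """Pick up to k output IDs uniformly at random, without replacement.

    Passing a seed makes the draw reproducible, so a second reviewer
    can pull exactly the same batch.
    """
    rng = random.Random(seed)
    pool = list(output_ids)
    return rng.sample(pool, min(k, len(pool)))


# Hypothetical IDs standing in for one month of logged AI outputs.
logged_ids = [f"out-{i:04d}" for i in range(1200)]
review_batch = sample_for_review(logged_ids, k=50, seed=2024)
print(len(review_batch), "outputs queued for human review")
```

The only design choice that matters here is sampling without replacement from the full month, so reviewers don't unconsciously pick the interesting cases.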
Track action rate, not just usage rate. Action rate is the percentage of shown recommendations that users actually act on. If you don't know this number, start measuring it today. Keep it separate from usage rate. When the gap between the two grows, that's your signal.
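To make the two numbers concrete, here is one way they could be computed side by side. It's a sketch under assumptions: the Session record and its two boolean fields are hypothetical stand-ins for whatever your event log actually stores, and the 100 sessions below are toy data, not real results.

```python
from dataclasses import dataclass


@dataclass
class Session:
    recommendation_shown: bool      # the AI panel was displayed
    recommendation_acted_on: bool   # the user applied the suggestion


def usage_rate(sessions):
    """Share of all sessions in which a recommendation was shown."""
    if not sessions:
        return 0.0
    return sum(s.recommendation_shown for s in sessions) / len(sessions)


def action_rate(sessions):
    """Share of shown recommendations that the user acted on."""
    shown = [s for s in sessions if s.recommendation_shown]
    if not shown:
        return 0.0
    return sum(s.recommendation_acted_on for s in shown) / len(shown)


# Toy data: 71 of 100 sessions showed a recommendation, 5 were acted on.
sessions = (
    [Session(True, True)] * 5
    + [Session(True, False)] * 66
    + [Session(False, False)] * 29
)

usage = usage_rate(sessions)
action = action_rate(sessions)
print(f"usage rate:  {usage:.0%}")
print(f"action rate: {action:.0%}")
print(f"gap:         {usage - action:.0%}")
```

Note the different denominators: usage rate divides by all sessions, action rate divides only by sessions where something was shown. Mixing them up is an easy way to make the second number look better than it is.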
Capture rejection reasons. Put a small dropdown next to the close button: "Why didn't you use this?" Three options, one click. That data can be worth more than twenty user interviews, because it's captured in real time, in real context, while the user's reasoning is still fresh.
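Once those clicks are logged, summarizing them is trivial. The sketch below assumes three reason codes that I made up for illustration (not_relevant, already_known, looks_wrong); your own three options will depend on the product and the users.

```python
from collections import Counter

# Hypothetical reason codes behind the "Why didn't you use this?" dropdown.
REASONS = ("not_relevant", "already_known", "looks_wrong")


def reason_shares(logged_reasons):
    """Return each known reason's share of all recognized dismissals."""
    valid = [r for r in logged_reasons if r in REASONS]
    counts = Counter(valid)
    total = len(valid)
    return {r: (counts[r] / total if total else 0.0) for r in REASONS}


# Toy log of four dismissals.
log = ["not_relevant", "looks_wrong", "not_relevant", "already_known"]
for reason, share in reason_shares(log).items():
    print(f"{reason}: {share:.0%}")
```

A rising share of "looks_wrong" after a model update would be a very different problem from a rising share of "already_known," which is exactly why the reasons are worth keeping separate.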
Build a regression baseline. When your model or prompt changes, does the action rate drop? You can only detect this if you have historical data to compare against. Without it, you'll never know when a quiet update silently made things worse. If you don't have this yet, go back to your update timeline and check whether your metrics show any breaks around those dates.
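Checking for a break around an update date can start as something this simple. It's a rough sketch, assuming you already store a daily action rate; the function name action_rate_shift, the 14-day window, and the sample numbers are all illustrative choices, and a plain before/after average is no substitute for a proper statistical test.

```python
from datetime import date, timedelta
from statistics import mean


def action_rate_shift(daily_rates, change_date, window_days=14):
    """Mean action rate after a change minus the mean before it.

    daily_rates maps a date to that day's action rate (0.0 to 1.0).
    Returns None when either side of the change has no data.
    """
    before = [
        rate for day, rate in daily_rates.items()
        if 0 < (change_date - day).days <= window_days
    ]
    after = [
        rate for day, rate in daily_rates.items()
        if 0 <= (day - change_date).days < window_days
    ]
    if not before or not after:
        return None
    return mean(after) - mean(before)


# Toy data: 8% action rate for the week before a prompt change, 5% after.
change = date(2024, 3, 15)
rates = {
    change + timedelta(days=offset): (0.08 if offset < 0 else 0.05)
    for offset in range(-7, 7)
}

shift = action_rate_shift(rates, change)
print(f"shift in action rate: {shift * 100:+.1f} percentage points")
```

Run it once per entry in your update timeline. A negative shift doesn't prove the update caused the drop, but it tells you which dates deserve a closer look.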
None of this is technically difficult. What's difficult is breaking the habit of treating AI features like everything else.
Because when a feature "looks like it's being used," questioning it creates friction. It's hard to say "but is it actually working?" when everyone's celebrating a 71% usage rate.
You shipped the AI feature. The metric went up. The dashboard is green. Everyone's happy.
Whether it's actually working for your users? That's a different question.
Most teams aren't asking it. And a metric that looks healthy while measuring the wrong thing can be more dangerous than not measuring at all.