Product & Strategy ps-3 20 min

Product Metrics for AI Behaviour

Learning Objectives

Define the three categories of AI product metrics
Identify which metric category is missing from your current AI feature
Design a metric set covering quality, trust, and efficiency for one feature
Evaluate what action to take when a metric falls outside its target range

Core Concepts

The Three Categories of AI Product Metrics

AI product metrics fall into three categories. Each one answers a different question about how the feature is performing.

Quality metrics answer: is the AI output good?

Quality measures the accuracy, usefulness, and completeness of what the AI produces. For generative features, this typically involves tracking what users do with the output: whether they accept it, edit it, or discard it. X-company's PRD drafting feature shows exactly this pattern. Of all AI-generated drafts, 68% are used with minor edits, 20% require major edits, and 12% are discarded entirely. Each bucket tells a different story. Minor edits suggest the output is directionally correct but imprecise. Major edits suggest the AI is missing important context. Discards suggest the output is not usable at all.

Quality metrics require user behaviour as a proxy when direct accuracy measurement is not possible. In most product contexts, you cannot automatically know whether a generated document is "good." You can observe what users do with it.
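
As a minimal sketch of this proxy, the snippet below computes disposition rates from a log of draft outcomes. The event labels (`minor_edit`, `major_edit`, `discard`) and the 100-draft sample are assumptions chosen for illustration to match the split described above; they are not X-company's actual instrumentation.

```python
from collections import Counter


def disposition_rates(dispositions):
    """Return each disposition's share of all drafts as a fraction in [0, 1]."""
    counts = Counter(dispositions)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: count / total for name, count in counts.items()}


# Hypothetical log of 100 drafts (labels and counts are illustrative).
events = ["minor_edit"] * 68 + ["major_edit"] * 20 + ["discard"] * 12
quality_rates = disposition_rates(events)

# A single "usage" number lumps both edit buckets together
# and hides the discard signal entirely.
usage_rate = quality_rates["minor_edit"] + quality_rates["major_edit"]
```

Keeping the three buckets separate, rather than reporting only `usage_rate`, is what makes the discard cohort visible enough to investigate.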

Trust metrics answer: are users confident enough in the AI to rely on it?

Trust is distinct from quality. A feature can produce high-quality output that users do not trust, and vice versa. X-company's customer support triage feature illustrates this. Adoption has grown week over week: users are engaging with the feature and acting on its routing suggestions. But opt-out rates spike sharply whenever the AI's confidence score drops below 0.7. Users are watching the signal, consciously or not, and withdrawing when it falls. The feature works. Users do not always believe it.

Trust metrics capture this relationship. They include adoption trends over time (not just a snapshot), explicit trust signals like thumbs up or down ratings, confidence score correlation with action rates, and opt-out or override frequency. A trust metric that shows growing adoption alongside a spike in overrides when confidence drops is telling you something important about the threshold at which users stop believing the system.
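
One way to measure the correlation between confidence and override behaviour is to split the override rate into confidence bands. The sketch below assumes each routing decision is logged as a `(confidence, overridden)` pair and uses a 0.7 threshold; the function name, the log shape, and the sample data are all hypothetical.

```python
def override_rate_by_confidence(decisions, threshold=0.7):
    """Split the override rate into low- and high-confidence bands.

    decisions: iterable of (confidence, overridden) pairs, where
    overridden is True when the user rejected the AI suggestion.
    Returns None for a band that has no decisions.
    """
    bands = {"low": [0, 0], "high": [0, 0]}  # [overrides, total]
    for confidence, overridden in decisions:
        band = "low" if confidence < threshold else "high"
        bands[band][0] += int(overridden)
        bands[band][1] += 1
    return {
        band: (overrides / total if total else None)
        for band, (overrides, total) in bands.items()
    }


# Hypothetical sample: 4 low-confidence and 5 high-confidence decisions.
sample = [
    (0.55, True), (0.62, True), (0.68, False), (0.40, False),
    (0.91, False), (0.85, False), (0.77, True), (0.95, False), (0.70, False),
]
trust_rates = override_rate_by_confidence(sample)
```

A large gap between the two bands suggests users are reacting to the confidence signal itself, which is the pattern a single average override rate would conceal.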

Efficiency metrics answer: is the AI saving real time or cost, without creating new problems?

X-company's onboarding documentation feature reduced documentation time by 60%, a strong efficiency signal. But three support escalations were traced directly to incorrect instructions in AI-generated documentation. The time saving is real. So is the cost introduced by errors. An efficiency metric that only captures the upside is incomplete.

Efficiency metrics include time-to-completion comparisons (before and after AI assistance), error or rework rates, cost per output, and downstream impact: what happens after the AI output is used. That last item, downstream impact, is where most teams have gaps.

When a Metric Falls Outside Its Target Range

Each category has characteristic failure modes and appropriate responses.

Quality falls: the discard rate climbs, or major-edit volume rises. The first question is whether the AI is missing context (a prompt or retrieval problem), whether the output format is wrong (a framing problem), or whether the underlying model behaviour has shifted (a regression problem). Investigate the discard cohort specifically: what are those users trying to do that the output did not support?

Trust falls: opt-out rates rise, adoption plateaus, or explicit ratings trend negative. The question is whether users have encountered real failures (a quality problem expressing itself as a trust signal) or whether the interface is not communicating confidence correctly (a presentation problem). Do not try to improve trust metrics by hiding low-confidence outputs: hiding them erodes trust faster than showing them.

Efficiency gains are offset by downstream costs: time is saved but errors, escalations, or rework increase. This is the most dangerous failure mode because it is invisible if you only measure the AI output stage. Audit the downstream path. If AI-generated onboarding documentation causes support escalations, the efficiency metric must include escalation rate as a counter-indicator, not just documentation time.

Key Points

Quality, trust, and efficiency are three distinct measurement categories, each answering a different question about AI feature performance
Quality metrics use user behaviour (edits, discards, acceptance) as a proxy for output accuracy
Trust metrics track adoption trends and override behaviour, not just a satisfaction score at a point in time
Efficiency metrics must capture downstream impact, not only the time saved at the point of AI output
A metric outside its target range is a diagnostic signal: the response depends on which category is failing and why

Actionable Takeaways

Audit your current AI feature against all three categories. If you have metrics in only one or two, name the gap explicitly and assign someone to close it this sprint.
Add a discard or override rate to any generative AI feature you own. If you cannot observe what users do with AI output, you cannot measure quality.
For trust metrics, look at trends and conditional behaviour: not just average adoption, but how adoption responds when confidence signals are low or when errors have recently occurred.
Define downstream checkpoints for every efficiency metric. If the AI saves time in step 3 but creates rework in step 5, your metric must span both steps.
Set target ranges with explicit response plans before you ship. "If discard rate exceeds 20%, we review the top 10 discard sessions within 48 hours" is a measurement practice. "We'll look at it if something seems wrong" is not.
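
The last takeaway can be sketched as data rather than prose: each target carries its own response plan, so an out-of-range value maps directly to an action. The `MetricTarget` class, the `triggered_plans` helper, and the major-edit response plan below are illustrative assumptions; only the discard-rate plan is taken from the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricTarget:
    name: str
    upper_limit: float  # the metric is out of range when it exceeds this value
    response_plan: str


def triggered_plans(targets, actuals):
    """Return the response plan for every measured metric above its limit."""
    return [
        f"{target.name}: {target.response_plan}"
        for target in targets
        if target.name in actuals and actuals[target.name] > target.upper_limit
    ]


targets = [
    MetricTarget("discard_rate", 0.20,
                 "Review the top 10 discard sessions within 48 hours"),
    # Hypothetical plan, added only to show a second target.
    MetricTarget("major_edit_rate", 0.25,
                 "Sample 10 major-edit sessions and check for missing context"),
]

# Hypothetical actuals for one reporting period.
plans = triggered_plans(targets, {"discard_rate": 0.23, "major_edit_rate": 0.18})
```

Metrics with no recorded actual are skipped here rather than treated as in range; a stricter version might flag missing measurements as their own problem.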

Practical Examples

X-company: PRD Drafting Feature (Quality Metrics)

The PRD drafting assistant generates first-draft product requirement documents from a structured brief. The product team tracks output disposition in three buckets:

Disposition	Rate	Interpretation
Used with minor edits	68%	Output is directionally correct; precision needs improvement
Used with major edits	20%	Output is missing key context or is misaligned on scope
Discarded	12%	Output is not usable; prompt or retrieval likely at fault

The 12% discard rate became the team's primary quality signal. They reviewed 30 discard sessions and found a consistent pattern: the AI was generating generic PRD structures when the brief contained highly domain-specific inputs (law firm workflows, billing codes, matter management terminology). The fix was a domain context layer in the prompt, not a model change.

Without the discard rate, they would have seen 88% "usage" and called it a success.

X-company: Customer Support Triage (Trust Metrics)

The support triage feature routes incoming tickets to the correct queue and suggests a priority level using an AI model. The trust metric that proved most diagnostic was the confidence-gated override rate: the share of AI routing suggestions that agents manually overrode when the model's internal confidence score was below 0.7.

When confidence was above 0.7, override rate was 8%. When it dropped below 0.7, override rate climbed to 41%. This told the team two things: agents were implicitly calibrated to the confidence signal even without formal training on it, and the feature was trusted when the model was confident but not when it was uncertain.

The response was not to hide low-confidence scores. It was to surface them explicitly as a visual indicator so agents could make informed decisions. Override rate at low confidence stayed high (which is appropriate: agents should override when the model is uncertain), but overall adoption improved because agents trusted that the system was being honest about its own limitations.

X-company: Onboarding Documentation (Efficiency Metrics)

The onboarding documentation generator produces client-facing setup guides from a structured template. Initial measurement showed a 60% reduction in documentation time per onboarding, which the team reported as a strong win.

Three months later, three enterprise clients escalated support tickets tied to incorrect instructions in AI-generated guides. Investigation revealed the AI was hallucinating specific configuration steps for integrations it had not been trained on. Each escalation required a senior engineer to spend four to six hours on remediation.

The team restructured the efficiency metric to include a downstream counter-indicator: support escalations traced to onboarding documentation errors per month. The revised target: documentation time reduction above 50% with fewer than one escalation per quarter attributable to documentation errors.

The feature was still net positive. But the metric now reflected the full picture.
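
A rough sketch of the restructured metric follows, evaluating the time reduction and the escalation counter-indicator together. The baseline of 10 hours per guide, the 20 onboardings per quarter, and the 5 remediation hours per escalation (the midpoint of the four-to-six-hour range) are assumptions chosen for illustration; the source does not state these figures.

```python
def efficiency_report(baseline_hours, assisted_hours, onboardings,
                      escalations, remediation_hours_each,
                      min_reduction=0.50, escalation_limit=1):
    """Evaluate the primary efficiency measure and its counter-indicator together."""
    if baseline_hours <= 0:
        raise ValueError("baseline_hours must be positive")
    reduction = (baseline_hours - assisted_hours) / baseline_hours
    gross_hours_saved = (baseline_hours - assisted_hours) * onboardings
    remediation_hours = escalations * remediation_hours_each
    return {
        "time_reduction": reduction,
        "net_hours_saved": gross_hours_saved - remediation_hours,
        "time_target_met": reduction > min_reduction,
        # "Fewer than one escalation per quarter" means strictly below the limit.
        "escalation_target_met": escalations < escalation_limit,
    }


# Hypothetical quarter: every input below is an illustrative assumption.
report = efficiency_report(
    baseline_hours=10, assisted_hours=4, onboardings=20,
    escalations=3, remediation_hours_each=5,
)
```

With these assumed inputs the time target is met and net hours remain positive, yet the escalation target fails, which is the combination a time-only metric would report as an unqualified win. The sketch also counts all hours equally; weighting senior-engineer time more heavily would be a reasonable refinement.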

Implementation Workflow

Use this workflow to design a metric set for one AI feature your team owns or is currently building. Work through each step in order.

Name the feature and its AI output. Write one sentence describing what the AI generates or decides. Example: "The feature generates a first-draft project status report from meeting notes and task data."
Define your quality metric. Identify how users interact with the AI output. Do they accept it, edit it, or discard it? If the feature does not produce user-visible output, identify the closest behavioural proxy. Write down: what you will measure, how you will capture it, and what your target range is. Example: "Discard rate below 15%; major-edit rate below 25%."
Define your trust metric. Identify whether the feature has a confidence signal (explicit or implicit). Map adoption over time, not as a single snapshot. Identify the conditions under which users override or opt out. Write down: what adoption trend you expect over the first 90 days, and what override rate at low confidence you consider acceptable.
Define your efficiency metric with a downstream checkpoint. Measure time or cost at the point of AI output. Then identify the next downstream stage where errors or rework would appear. Add that stage as a counter-indicator. Write down: the primary efficiency measure, the downstream checkpoint, and the target for each.
Set response plans for each metric. For each target range, write one sentence describing the action you will take if the metric falls outside it. This does not need to be a full investigation plan: it needs to be specific enough that whoever is on rotation can act without a meeting.
Check for gaps. Look at your current instrumentation. Which of the three categories is already measured? Which is missing entirely? Identify the one missing metric that would most change how you evaluate the feature's success and add it to your next sprint.
Review in four weeks. Bring the metrics, the actuals, and any out-of-range signals to a 30-minute review. The question is not "is the number good?" It is "what is the metric telling us, and what are we going to do about it?"
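
The gap check in the workflow above can be reduced to a small coverage test. The sketch below assumes each metric is recorded as a `(category, name)` pair; the function and the sample metric names are hypothetical.

```python
CATEGORIES = ("quality", "trust", "efficiency")


def missing_categories(metric_set):
    """Return the categories with no metric defined, in a fixed order."""
    covered = {category for category, _name in metric_set}
    return [category for category in CATEGORIES if category not in covered]


# Hypothetical instrumentation for one feature: no trust metric yet.
current_metrics = [
    ("quality", "discard_rate"),
    ("efficiency", "documentation_time_reduction"),
]
gaps = missing_categories(current_metrics)
```

An empty result only shows that each category has at least one metric; whether each metric has a target range and a response plan still needs a separate check.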