What to Measure and What Not To
Learning Objectives
- identify the four measurement traps that produce misleading AI metrics
- distinguish activity metrics from outcome metrics in AI adoption
- evaluate your current AI metrics against a baseline-first standard
- select the right measurement approach for one AI initiative
Core Concepts
The Baseline-First Standard
A metric without a baseline is an observation, not a measurement. Knowing that your team spends four hours per week on a task today tells you nothing unless you also know how long the task took before AI was involved.
The baseline-first standard is simple: before deploying any AI tool, record the current state of the process you expect it to improve. Pick whichever variable is relevant: time spent, volume, error rate, or cycle time. Then measure the same variable, the same way, after deployment. The comparison is the metric.
X-company's product team measured PRD drafting cycle time for six months before introducing AI assistance. The baseline: 8 days average from kick-off to sign-off. Post-AI, measured over four months: 2.1 days. That is a credible outcome metric. The before and after are comparable, the time windows are long enough to smooth variation, and the variable (cycle time) maps directly to a business outcome (faster delivery).
Without the six-month baseline, X-company could only have reported "PRDs now take 2.1 days." That is just a number.
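The comparison described above is simple arithmetic, and a minimal sketch makes the dependency on the baseline explicit. The function name `relative_change` is an illustrative choice, not part of any tool mentioned here; the two input figures come from the X-company example.

```python
def relative_change(baseline: float, post: float) -> float:
    """Return the fractional change from baseline to post (negative means a decrease)."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero to compute a relative change")
    return (post - baseline) / baseline


# Figures from the X-company PRD example.
baseline_days = 8.0  # average cycle time over six months pre-AI
post_days = 2.1      # average cycle time over four months post-AI

change = relative_change(baseline_days, post_days)
days_saved = baseline_days - post_days
print(f"Cycle time change: {change:.0%} ({days_saved:.1f} days saved per PRD)")
```

Without `baseline_days`, the function has nothing to compare against, which is the point: 2.1 days alone cannot be turned into a reduction of roughly 74%.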
Activity Metrics vs. Outcome Metrics
Activity metrics count what the AI or the team is doing. Outcome metrics measure whether that activity produced something valuable.
| Activity metric | Outcome metric |
|---|---|
| Prompts per day | Cycle time reduction |
| Seat utilization rate | Output volume per engineer |
| AI features enabled | Defect rate in AI-assisted code |
| Hours of AI usage | Time to first draft |
| Tokens consumed | Stakeholder approval rate |
Activity metrics are not inherently useless. They help early on, when you need to confirm that adoption is actually happening and do not yet have enough data to see outcomes. The mistake is continuing to treat them as success indicators once adoption is underway.
X-company initially reported seat count and prompts per day to its leadership team. Within six weeks, engineers were running more prompts per task than necessary, not because it helped, but because the metric rewarded volume. The team noticed the behavior shift. They retired both metrics and replaced them with output quality scores on shipped code and PRD cycle time. Behavior normalized within a month.
The Four Measurement Traps
Trap 1: Measuring adoption instead of impact. Tracking how many people use the tool, not what changes when they do. Seat utilization looks like progress. It tells you nothing about whether the tool is producing better work.
Trap 2: Measuring speed without quality. AI makes many tasks faster. Faster is not always better. A PRD drafted in two days that then requires three revision cycles is not an improvement over an eight-day PRD that ships without rework. Speed metrics must be paired with quality signals: revision rate, stakeholder approval, downstream defect rate.
Trap 3: Measuring at the wrong level. Team-level averages hide what is actually happening. If three engineers account for 80% of the AI-assisted output, the team average looks healthy while most of the team is not yet benefiting. Segment by role, seniority, or workflow stage to find where value is concentrating and where it is not.
Trap 4: Measuring without a counterfactual. Claiming that AI reduced time-to-feature without knowing what time-to-feature was before is not measurement. It is a story. The counterfactual (what would have happened without AI) is approximated by the baseline. Without it, you cannot attribute improvement to AI rather than to headcount, tooling, or a change in project complexity.
What Good AI Metrics Look Like
A good AI metric has four properties:
- Tied to a baseline: compared to a pre-AI state measured over a meaningful period
- Outcome-oriented: measures the result of work, not the activity of the tool
- Attributable: the change can be reasonably linked to AI adoption, not a confounding factor
- Actionable: tells you what to do next, not just what happened
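The four properties above can be treated as a checklist. The sketch below is one hypothetical way to record it; the class name `MetricCheck`, its field names, and the true/false judgments assigned to the two sample metrics are illustrative assumptions, since deciding whether a metric is attributable or actionable is a human call, not something code can infer.

```python
from dataclasses import dataclass


@dataclass
class MetricCheck:
    """Records a reviewer's judgment of one metric against the four properties."""
    name: str
    has_baseline: bool      # compared to a pre-AI state measured over a meaningful period
    measures_outcome: bool  # measures the result of work, not the activity of the tool
    attributable: bool      # change can reasonably be linked to AI adoption
    actionable: bool        # tells you what to do next

    def missing(self) -> list[str]:
        """Return the labels of the properties this metric does not satisfy."""
        checks = {
            "baseline-tied": self.has_baseline,
            "outcome-oriented": self.measures_outcome,
            "attributable": self.attributable,
            "actionable": self.actionable,
        }
        return [label for label, ok in checks.items() if not ok]


# Illustrative judgments only.
prd_cycle_time = MetricCheck("PRD cycle time", True, True, True, True)
prompts_per_day = MetricCheck("Prompts per day", False, False, False, False)

for metric in (prd_cycle_time, prompts_per_day):
    gaps = metric.missing()
    print(f"{metric.name}: {'passes all four' if not gaps else 'missing ' + ', '.join(gaps)}")
```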
Key Points
- A metric without a baseline is an observation, not a measurement
- Activity metrics (seat count, prompts per day) measure adoption, not impact
- Speed gains must be paired with quality signals or they mislead
- The four traps: measuring adoption instead of impact, speed without quality, wrong level of aggregation, and no counterfactual
- A good AI metric is baseline-tied, outcome-oriented, attributable, and actionable
Actionable Takeaways
Audit your current AI metrics this week. List every metric you are currently reporting on AI adoption. For each one, ask two questions: is it an outcome metric, and does it have a baseline? If the answer to either question is no, flag it for replacement.
Record a baseline before your next AI deployment. Choose one process you expect AI to improve. Measure the current state now, before the tool goes live. Cycle time, defect rate, revision count: pick the variable that maps to the outcome you care about.
Retire at least one activity metric. Seat count, prompts per day, and hours of AI usage are the most common culprits. Pick one your team currently reports. Identify the outcome metric it was meant to proxy. Replace it.
Segment your data before drawing conclusions. Before reporting an average, split it by role or workflow stage. If value is concentrating in one segment, that is a more useful finding than the average.
Pair every speed metric with a quality signal. For any claim that AI made something faster, identify the corresponding quality indicator. If you cannot, the speed claim is incomplete.
Practical Examples
X-company: From Activity Metrics to Outcome Metrics
When X-company's VP of Product presented the first AI adoption report to the executive team, the slides showed three numbers: 18 seats activated, 4,200 prompts run in the first month, and 94% of the product team using the tool at least once per week.
The CEO asked one question: "Is the product team shipping better work faster?"
No one could answer it. The metrics did not measure that.
The team spent the next two weeks identifying three outcome metrics to replace the activity metrics:
- PRD cycle time: from kick-off to stakeholder sign-off (baseline: 8 days over 6 months pre-AI)
- First-draft acceptance rate: percentage of PRDs accepted without a major revision in the first review
- Downstream ticket rework rate: percentage of engineering tickets that required significant scope change after PRD hand-off
After four months of measurement, the results were specific: cycle time dropped to 2.1 days, first-draft acceptance improved from 61% to 79%, and rework rate fell from 23% to 14%. Those numbers answered the CEO's question.
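When reporting rate metrics like the two above, it helps to state whether a change is in percentage points or relative to the starting value, because the two readings differ considerably. The following sketch computes both for the reported figures; the function names are illustrative.

```python
def point_change(before_pct: float, after_pct: float) -> float:
    """Change in percentage points (after minus before)."""
    return after_pct - before_pct


def relative_change_pct(before_pct: float, after_pct: float) -> float:
    """Change relative to the starting value, expressed as a percentage."""
    if before_pct == 0:
        raise ValueError("starting value must be non-zero")
    return (after_pct - before_pct) / before_pct * 100


# (before, after) percentages reported in the X-company example.
results = {
    "First-draft acceptance rate": (61.0, 79.0),
    "Downstream ticket rework rate": (23.0, 14.0),
}

for name, (before, after) in results.items():
    points = point_change(before, after)
    relative = relative_change_pct(before, after)
    print(f"{name}: {before:.0f}% -> {after:.0f}% "
          f"({points:+.0f} points, {relative:+.1f}% relative)")
```

First-draft acceptance rose 18 percentage points, which is about 30% relative to the 61% starting rate. Stating which one you mean avoids overstating or understating the result.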
Engineering Team: The Speed-Without-Quality Trap
X-company's engineering team ran into the second trap during the same period. Their initial metric was lines of AI-assisted code per engineer per sprint. The number went up consistently. So did the defect rate in the following release cycle.
The team had optimized for code volume. AI was helping engineers write more code, faster. The code was shipping with more bugs.
They replaced the metric with two paired signals: PR cycle time (speed) and post-merge defect rate per PR (quality). Within two sprints, the pattern changed. Engineers were spending more time reviewing AI-generated code before submitting PRs. Volume went down slightly. Defects dropped by 31% over six weeks.
The metric change did not fix the problem on its own. It made the problem visible, which allowed the team to fix it.
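One way to keep a speed metric from being read on its own is to evaluate it only together with its quality counterpart. The sketch below illustrates that pairing; the class and function names are assumptions, and the sample numbers are hypothetical, not X-company's actual data.

```python
from dataclasses import dataclass


@dataclass
class PairedSignal:
    """One speed signal and one quality signal for the same period."""
    pr_cycle_time_hours: float  # speed: lower is faster
    defects_per_pr: float       # quality: post-merge defects per PR, lower is better


def assess(before: PairedSignal, after: PairedSignal) -> str:
    """Judge a change using both signals together, never speed alone."""
    faster = after.pr_cycle_time_hours < before.pr_cycle_time_hours
    quality_held = after.defects_per_pr <= before.defects_per_pr
    if faster and quality_held:
        return "improvement: faster with quality held or better"
    if faster:
        return "incomplete: faster, but quality fell"
    if after.defects_per_pr < before.defects_per_pr:
        return "trade-off: quality improved at a speed cost"
    return "no improvement"


# Hypothetical numbers for illustration only.
before = PairedSignal(pr_cycle_time_hours=30.0, defects_per_pr=0.20)
volume_phase = PairedSignal(pr_cycle_time_hours=18.0, defects_per_pr=0.32)
review_phase = PairedSignal(pr_cycle_time_hours=22.0, defects_per_pr=0.15)

print(assess(before, volume_phase))
print(assess(before, review_phase))
```

The design choice is that `assess` has no code path that reports success from the speed signal alone, which mirrors the rule that a speed claim without a quality indicator is incomplete.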
Founders: The Wrong Level of Aggregation
A founder running a 12-person product team reported that AI adoption was "strong across the team" based on average prompt usage. In a quarterly review, a closer look revealed that two senior engineers were generating 70% of the AI-assisted output. The remaining ten team members had experimented briefly and returned to previous workflows.
The average obscured an adoption gap. Segmenting by seniority and role would have surfaced it in the first month. The corrective action (pairing senior AI users with junior teammates for structured co-working sessions) was simple once the problem was visible. It was invisible as long as the team reported averages.
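A short sketch shows how an average can hide this kind of concentration. The per-person counts below are hypothetical, chosen only so that two people account for 70% of output on a 12-person team as in the example; the names and the `top_share` helper are illustrative.

```python
from statistics import mean, median

# Hypothetical AI-assisted output counts per person (12 people, 100 outputs total).
outputs = {
    "senior_1": 40, "senior_2": 30,
    "member_01": 5, "member_02": 4, "member_03": 4, "member_04": 3,
    "member_05": 3, "member_06": 3, "member_07": 3, "member_08": 2,
    "member_09": 2, "member_10": 1,
}


def top_share(counts: dict[str, int], top_n: int) -> float:
    """Fraction of total output produced by the top_n contributors."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(sorted(counts.values(), reverse=True)[:top_n]) / total


# Segment by seniority (here inferred from the hypothetical name prefix).
by_group: dict[str, list[int]] = {}
for name, count in outputs.items():
    group = "senior" if name.startswith("senior") else "other"
    by_group.setdefault(group, []).append(count)

values = list(outputs.values())
print(f"Team mean: {mean(values):.1f}, median: {median(values):.1f}")
print(f"Share from top 2 contributors: {top_share(outputs, 2):.0%}")
for group, counts in by_group.items():
    print(f"{group}: {len(counts)} people, mean {mean(counts):.1f}")
```

In this made-up data the team mean is about 8 outputs per person while the median is 3, and the gap between the two is itself a quick signal that the average should be segmented before it is reported.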
Implementation Workflow
Follow these steps to evaluate and reconfigure your current AI measurement approach for one initiative.
List your current AI metrics. Write down every metric your team or organization is currently tracking related to AI adoption. Include anything reported to leadership, used in team reviews, or surfaced in dashboards.
Classify each metric. For each metric on your list, mark it as either an activity metric (measures what the AI or team is doing) or an outcome metric (measures the result of that activity). Use the table from Core Concepts as a reference.
Check for baselines. For each outcome metric, identify whether a pre-AI baseline exists. If no baseline was recorded before deployment, note the gap. You cannot retroactively create a true baseline, but you can establish one now for future comparison.
Identify the outcome each activity metric was meant to proxy. For every activity metric on your list, write one sentence explaining what business outcome it was intended to indicate. If you cannot write that sentence, the metric has no justification.
Select one activity metric to retire. Choose the weakest activity metric, the one most disconnected from a real outcome. Document what it was measuring, why it was misleading, and what you are replacing it with.
Define a replacement outcome metric. For the metric you are retiring, write the replacement outcome metric using this format:
[Variable] measured [before/after] [event], compared to baseline of [baseline value] over [time period].
Example: PRD cycle time measured before and after AI-assisted drafting, compared to a baseline of 8 days averaged over 6 months pre-deployment.
Choose one current or upcoming AI initiative and document a baseline plan. Identify the process, the variable to measure, the measurement method, and the time window for baseline collection. Commit to collecting it before the tool is deployed.
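If you keep metric definitions in a shared document or script, the replacement-metric format can be captured as a small template that refuses to render when a field is missing. This is a hypothetical sketch; the class name `OutcomeMetricDefinition` and its fields are assumptions that mirror the bracketed slots in the format.

```python
from dataclasses import dataclass, fields


@dataclass
class OutcomeMetricDefinition:
    """Holds the slots of the replacement-metric format."""
    variable: str         # what is measured, e.g. "PRD cycle time"
    event: str            # the change being evaluated
    baseline_value: str   # the recorded pre-AI value
    baseline_period: str  # how long the baseline was collected

    def render(self) -> str:
        """Fill in the format, rejecting any blank slot."""
        for f in fields(self):
            if not getattr(self, f.name).strip():
                raise ValueError(f"missing field: {f.name}")
        return (
            f"{self.variable} measured before and after {self.event}, "
            f"compared to a baseline of {self.baseline_value} "
            f"over {self.baseline_period}."
        )


prd = OutcomeMetricDefinition(
    variable="PRD cycle time",
    event="AI-assisted drafting",
    baseline_value="8 days",
    baseline_period="6 months pre-deployment",
)
print(prd.render())
```

Raising an error on a blank `baseline_value` is a deliberate choice: a definition with no baseline should fail loudly instead of producing a sentence that looks complete.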
Segment your next report. Before presenting any average metric in your next leadership update, split it by at least one dimension: role, seniority, team, or workflow stage. Note whether the segmented view changes the conclusion the average suggests.