Most experienced Product Managers already have a workable quality playbook: agree what ‘good’ looks like, ship in slices, test, monitor, iterate. When you move into AI products, especially those using Large Language Models (LLMs), that playbook still works, but crucial elements change too.
LLM-driven products are sensitive in ways that are easy to underestimate because the probabilistic nature of these systems fundamentally changes the notion of QA. A small prompt change, model upgrade, or retrieval adjustment can shift behaviour in ways that aren’t obvious in code review and can’t be caught reliably through spot checks. OpenAI’s guidance is direct on this: traditional testing methods are not sufficient on their own for generative AI, because ‘quality’ is harder to pin down (1).
Here, the mindsets and specific techniques associated with Lean provide useful mental models for Product Managers – and product teams as a whole. Not because AI development is manufacturing in the traditional sense, but because Lean focuses on standards, flow, the visibility of problems, and continuous improvement. This connects closely to the emergence of Evals as best practice.
What Evals are (and what they’re not)
An Eval is a repeatable way to measure whether an AI system meets defined quality criteria across representative scenarios. Most practical Eval setups include:
Scenario sets: realistic inputs with enough context to represent real usage.
A rubric / scoring method: human review, automated scoring, or a hybrid. Many teams use model-graded Evals (LLM-as-judge) for scale, with human spot checks for calibration.
Baselines and thresholds: what “good” currently looks like, and what counts as a regression.
A harness: the code and infrastructure to run these checks repeatedly.
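To make the shape of this concrete, here is a minimal harness sketch in Python. Everything in it is an assumption for illustration: the Scenario fields, the call_agent stub and the thresholds stand in for whatever your own system and rubric look like.

```python
# Minimal Eval harness sketch: scenarios + scoring + baseline comparison.
# `call_agent` is a placeholder for whatever produces your system's output.

from dataclasses import dataclass

@dataclass
class Scenario:
    name: str
    user_input: str
    expected_action: str  # e.g. "automate" or "escalate"

def call_agent(user_input: str) -> dict:
    # Placeholder: in a real harness this would call your prompt/model/tools.
    return {"action": "escalate", "reply": "Passing this to a colleague."}

def score(scenario: Scenario, output: dict) -> float:
    # Simplest possible rubric: did the agent choose the expected action?
    return 1.0 if output["action"] == scenario.expected_action else 0.0

def run_suite(scenarios: list[Scenario], baseline: float, allowed_drop: float = 0.05) -> bool:
    results = [score(s, call_agent(s.user_input)) for s in scenarios]
    avg = sum(results) / len(results)
    print(f"suite score: {avg:.2f} (baseline {baseline:.2f})")
    # Flag a regression if we fall meaningfully below the recorded baseline.
    return avg >= baseline - allowed_drop

if __name__ == "__main__":
    suite = [
        Scenario("refund_ineligible", "I want a refund for an order from 2019", "escalate"),
        Scenario("delivery_update", "Where is my parcel?", "automate"),
    ]
    ok = run_suite(suite, baseline=0.90)
    print("PASS" if ok else "REGRESSION")
```

Even a harness this small gives you the two things the rest of this article relies on: a repeatable score and a baseline to compare it against.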
The point is not to build a perfect measurement machine; it is to have a quality system that sits inside day-to-day delivery, with shared measurement that enables sensible trade-offs and continual learning. Anthropic’s overview is worth reading because it makes clear how difficult (and important) evaluation is in practice (2).
A simple way to keep your thinking grounded is to name the quality dimensions you actually care about. For many products, this boils down to:
Task success (did the user/agent achieve the goal?)
Factuality/grounding (is it accurate and properly supported?)
Safety/compliance (does it stay within policy and regulation?)
Tone/brand (does it communicate appropriately?)
Latency (is it fast enough?)
Cost (is it economically viable at scale?)
Note: Many of these criteria align with the Usability, Feasibility and Viability lenses we commonly apply to product development. Product-Market Fit and other desirability-type considerations are also crucial, but their connection to Evals might be slightly looser.
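If it helps to see these dimensions as something a release check could actually read, here is an illustrative sketch. The dimension names, numbers and units are assumptions, not recommendations.

```python
# Illustrative quality dimensions and thresholds for a release check.
# The values below are assumptions for the sketch, not recommendations.

QUALITY_THRESHOLDS = {
    "task_success":  {"min": 0.90},   # share of scenarios resolved correctly
    "factuality":    {"min": 0.95},   # grounded, accurate answers
    "safety":        {"min": 1.00},   # policy violations are never acceptable
    "tone":          {"min": 0.80},   # on-brand communication
    "latency_p95_s": {"max": 4.0},    # seconds
    "cost_per_task": {"max": 0.05},   # currency units per resolved task
}

def violations(measured: dict) -> list[str]:
    """Return the dimensions where measured values breach their thresholds."""
    problems = []
    for dim, rule in QUALITY_THRESHOLDS.items():
        value = measured[dim]
        if "min" in rule and value < rule["min"]:
            problems.append(f"{dim}: {value} < {rule['min']}")
        if "max" in rule and value > rule["max"]:
            problems.append(f"{dim}: {value} > {rule['max']}")
    return problems

print(violations({
    "task_success": 0.92, "factuality": 0.96, "safety": 1.0,
    "tone": 0.78, "latency_p95_s": 3.1, "cost_per_task": 0.04,
}))  # -> ['tone: 0.78 < 0.8']
```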
Let’s make this tangible
Imagine a contact-centre AI agent that reads an incoming customer email and decides what to do next.
Baseline: it classifies the request (refund, delivery issue, account change), retrieves the relevant policy, drafts a reply, and chooses one of two paths:
Automate (send the message and, where appropriate, trigger a simple backend action) or,
Escalate (route to a human supervisor with a summary and recommended next step).
Change: a few weeks later, you tweak the prompt to make the agent more “helpful” and reduce escalations.
Regression: the agent starts taking confident actions in borderline cases—promising refunds or processing changes that should have been escalated—because the decision threshold for escalation has effectively shifted.
The outputs still look good in isolation, but the failure shows up in the workflow: the wrong cases get automated, and humans see them only after customers complain. This is the pattern Evals are designed to prevent: a shared standard for ‘automate vs escalate’, fast signals when that boundary drifts, and a habit of turning these incidents into permanent tests.
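A sketch of what an Eval for that boundary might look like, with illustrative borderline cases and a decide stub standing in for the real agent:

```python
# Sketch of a decision-boundary Eval for automate vs escalate.
# The scenarios and the `decide` stub are illustrative assumptions.

BORDERLINE_CASES = [
    # (name, customer_message, expected_decision)
    ("refund_over_policy_limit", "I'd like a refund of £900 for a damaged item", "escalate"),
    ("ambiguous_account_change", "Please move my account to my ex-partner's name", "escalate"),
    ("simple_delivery_query",    "Can you tell me when my parcel arrives?",       "automate"),
]

def decide(message: str) -> str:
    # Placeholder for the real agent's routing decision (prompt + model + tools).
    return "escalate"

def wrongly_automated(cases) -> list[str]:
    """Cases the agent automated when the standard says a human should see them."""
    return [
        name for name, message, expected in cases
        if expected == "escalate" and decide(message) == "automate"
    ]

# A hard rule for this suite: zero tolerance for wrongly automated borderline cases.
bad = wrongly_automated(BORDERLINE_CASES)
assert not bad, f"Escalation boundary has drifted: {bad}"
```

The value is not the code itself but the habit: the borderline cases that caused pain become a permanent, executable definition of where the automate/escalate line sits.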
Lean > QA
On the surface, Evals look a lot like QA: test cases, regression suites, CI gates. That similarity is helpful.
However, the difference is operational. Where many software tests are binary and stable over time, Evals often are not. They measure degrees of quality across multiple dimensions, in a system whose behaviour can shift with prompts, data, retrieval, or model changes. Google’s ‘ML Test Score’ rubric captures the broader idea: production readiness for ML systems depends on ongoing evaluation and monitoring, not just pre-release testing (5). This is where Lean thinking comes in.
So, the framing is:
QA mechanics help you implement Evals.
Lean helps you run Evals as a quality system.
Mapping Lean principles to Evals
1) Standard work: define “good”, link it to outcomes, and keep it visible
In Lean, standard work is the baseline that makes improvement possible. Without a shared definition of ‘normal’, you can’t reliably spot abnormality.
In Eval terms, ‘standard work’ is:
Rubrics: concrete definitions of ‘correct’, ‘helpful’, ‘safe’, and ‘on-brand’. If you use a model to grade outputs, treat the grader prompt as part of the standard. If the grader drifts, your standard drifts.
Scenario classes: the request patterns you care about, including edge cases and known failure modes.
Thresholds: what must not regress and what is ‘good enough to ship’.
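One way to make ‘the grader prompt is part of the standard’ tangible is to version it alongside the rubric. This is a sketch under assumptions: the rubric text, version label and grade_with_model stub are illustrative.

```python
# Sketch: treat the rubric and the grader prompt as versioned standards.
# The version label, rubric text and `grade_with_model` stub are assumptions.

RUBRIC_VERSION = "2025-01-v3"

GRADER_PROMPT = f"""You are grading a contact-centre reply. Rubric {RUBRIC_VERSION}:
- correct: the reply follows the retrieved policy
- safe: no commitments outside policy (refunds, account changes)
- on_brand: polite, concise, no jargon
Return a JSON object: {{"correct": 0-1, "safe": 0-1, "on_brand": 0-1}}."""

def grade_with_model(reply: str, policy: str) -> dict:
    # Placeholder for an LLM-as-judge call; humans spot-check a sample to calibrate it.
    return {"correct": 1.0, "safe": 1.0, "on_brand": 0.9}

print(grade_with_model("Thanks for waiting - your parcel arrives on Friday.", policy="delivery"))

# If GRADER_PROMPT changes, that is a change to the standard itself:
# re-baseline and re-calibrate against human review before trusting the new scores.
```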
To make this usable inside a product team, communicate it like any other product standard:
Put the rubric and scenario set somewhere obvious (alongside design principles, accessibility standards, API contracts).
Make it part of normal rituals: sprint planning (‘What Eval coverage changes?’), release reviews (‘What moved vs baseline?’), and incident reviews (‘What scenarios should we add?’).
Tie it to the outcomes you are working towards. In the contact-centre agent example, that includes metrics like First Contact Resolution (FCR), escalation rate, after-call work (ACW) driven by incorrect automation, CSAT and key complaint themes, and cost per contact (including model and tooling costs).
2) Andon: make problems visible early, and keep the feedback loop fast
Andon is about surfacing issues quickly so they can be addressed before they spread (4).
In Eval terms, the idea is simple: regressions should be visible soon after they are introduced. But there is a practical constraint: an Andon cord is useless if it takes an hour to pull. So, treat speed and cost as product requirements for your Eval suite:
Developer-loop signals (fast and cheap): run in minutes on every meaningful change. These might include schema/JSON validation, tool-call correctness checks, simple heuristics, small-model graders, or targeted ‘golden path’ scenarios – all low cost per run.
Production signals (ongoing): sampling queues, drift indicators, escalation spikes, and ‘thumbs down’ themes, reviewed on a regular cadence.
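As a sketch of how cheap the developer-loop layer can be, here is an illustrative structural check on a tool call; the tool names and the refund limit are assumptions.

```python
# Sketch of a fast developer-loop check: validate the agent's tool call
# before anything slower or more expensive runs. Names and limits are illustrative.

import json

ALLOWED_TOOLS = {"issue_refund", "update_address", "escalate_to_human"}

def check_tool_call(raw_output: str) -> list[str]:
    """Cheap structural checks that run in milliseconds on every change."""
    problems = []
    try:
        call = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if call.get("tool") not in ALLOWED_TOOLS:
        problems.append(f"unknown tool: {call.get('tool')!r}")
    if call.get("tool") == "issue_refund" and call.get("amount", 0) > 250:
        problems.append("refund above the automation limit must escalate")
    return problems

print(check_tool_call('{"tool": "issue_refund", "amount": 900}'))
# -> ['refund above the automation limit must escalate']
```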
The key PM responsibility is the operating agreement: who responds when the signal goes red, what happens next, and how quickly you turn that signal into a prioritised countermeasure.
3) Jidoka: decide what stops the line, without creating Muda (waste)
Jidoka is commonly summarised as ‘build in quality and stop when abnormality appears,’ so defects do not keep flowing downstream. The AI version is a tiered approach that protects outcomes without creating muda (waste), including wasted developer time and unnecessary API spend (3). Here, we can separate cadence (when you run Evals) from severity (what they mean).
A model cadence might be:
PR gate (fast): runs on every meaningful change.
Nightly regression suite (comprehensive): wider coverage, slower, more expensive.
Pre-release suite (targeted + comprehensive): used when risk is higher or the change is large.
While severity might differentiate between:
Hard blockers (stop-the-line): unsafe behaviour or policy violations, clear factuality regressions in sensitive domains, consistent failure on core journeys, tool misuse with real-world consequences.
Warnings (triage): minor tone drift, small verbosity changes, modest latency/cost drift still within budget.
For example, in our contact-centre scenario, inaccurate agentic actions that trigger issues downstream might be hard blockers, while purely tonal issues in communications might be considered less urgent. Lean is useful because it makes the trade-off explicit: you are balancing flow, quality, and cost deliberately.
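A minimal sketch of that severity split, assuming illustrative category names and a simple CI-style exit code:

```python
# Sketch: separate severity (what a failure means) from cadence (when it runs).
# The category names and the CI exit behaviour are assumptions for illustration.

import sys

HARD_BLOCKERS = {"safety", "factuality_sensitive", "core_journey", "tool_misuse"}
WARNINGS      = {"tone_drift", "verbosity", "latency_within_budget"}

def gate(failures: set[str]) -> int:
    """Return a CI exit code: 1 stops the line, 0 lets the change flow with warnings."""
    blockers = failures & HARD_BLOCKERS
    warnings = failures & WARNINGS
    if blockers:
        print(f"STOP THE LINE: {sorted(blockers)}")
        return 1
    if warnings:
        print(f"Triage later: {sorted(warnings)}")
    return 0

# Example: a tonal regression alone does not block; a tool-misuse regression would.
sys.exit(gate({"tone_drift"}))
```

Cadence then decides where a gate like this runs: the fast categories on every PR, the fuller set nightly or pre-release.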
4) Kaizen and poka-yoke: make learning cumulative and reduce repeat failures
Kaizen is continuous improvement; poka-yoke is mistake-proofing. In Eval terms, the loop is straightforward:
If a failure matters, capture it as a scenario.
Add it to the Eval set so it cannot recur silently.
Implement a countermeasure (e.g. prompt/tool/retrieval/UX/guardrail).
Re-run Evals, update baseline and thresholds.
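In code, the loop can be as simple as appending the failed case to the scenario set. The file path and fields below are assumptions for illustration:

```python
# Sketch of the Kaizen loop in code: a production incident becomes a permanent
# regression scenario. The file path and fields are illustrative assumptions.

import json
from pathlib import Path

SCENARIO_FILE = Path("evals/regression_scenarios.jsonl")

def promote_incident(incident_id: str, customer_message: str, expected_decision: str) -> None:
    """Append a failed real-world case to the Eval set so it cannot recur silently."""
    scenario = {
        "id": incident_id,
        "input": customer_message,
        "expected_decision": expected_decision,
        "source": "production_incident",
    }
    SCENARIO_FILE.parent.mkdir(parents=True, exist_ok=True)
    with SCENARIO_FILE.open("a") as f:
        f.write(json.dumps(scenario) + "\n")

# After the countermeasure ships, re-run the suite and update the baseline.
promote_incident("INC-1042", "You promised me a refund last week, where is it?", "escalate")
```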
Synthetic data can make this stronger, but it needs guardrails:
Start small: generate a limited number of variants.
Promote only variants that resemble real user behaviour or plausible risk.
Where possible, seed from anonymised production patterns so you do not drift away from reality.
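A sketch of those guardrails, assuming a hypothetical paraphrase helper and a stand-in for the review step:

```python
# Sketch of guard-railed synthetic variants: generate a few, promote only those
# that pass review. `paraphrase` and the review stand-in are assumptions.

def paraphrase(seed_message: str, n: int = 3) -> list[str]:
    # Placeholder for a model call that rewords a real (anonymised) message.
    return [
        "I was promised a refund last week and still haven't received it.",
        "Where's the refund you said you'd send me?",
        "Refund please!!! You promised!!!",
    ][:n]

def promote(variants: list[str], approved_by_reviewer) -> list[str]:
    """Keep only variants a reviewer judges realistic and in scope."""
    return [v for v in variants if approved_by_reviewer(v)]

seed = "You promised me a refund last week, where is it?"
candidates = paraphrase(seed)
kept = promote(candidates, approved_by_reviewer=lambda v: "!!!" not in v)  # crude stand-in for human review
print(f"promoted {len(kept)} of {len(candidates)} variants")
```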
This is how Evals become a maintained asset, rather than a static benchmark.
Do things change for agentic workflows?
Not fundamentally. Many AI products are still workflows (classify → retrieve → draft → check → act), but many are increasingly ‘agentic’ – like the contact centre example, where the system can choose which steps and tools to use based on context (6)(7).
Agentic systems do not need a new Eval philosophy, but they do need clearer structure, because small component regressions can compound: a retrieval miss can turn into a wrong action, not just a slightly worse answer. Two layers of evaluation help:
End-to-end Evals (integration): did the assistant avoid promising refunds when the user is ineligible, while still resolving the query appropriately?
Component Evals (diagnostic): did retrieval return the policy section that contains the eligibility rule?
A simple rule is usually enough: start with end-to-end coverage on critical journeys, then add component Evals only where traces show repeated failure. Over-testing every step is a common source of waste. LangSmith’s guidance on evaluation approaches is a good reference for thinking about step-level vs system-level evaluation (8).
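To show the difference in practice, here is an illustrative pair of checks for the contact-centre agent; retrieve and run_agent are placeholders for the real components and pipeline.

```python
# Sketch: end-to-end vs component checks for the contact-centre agent.
# `retrieve` and `run_agent` are placeholders; the assertions are illustrative.

def retrieve(query: str) -> list[str]:
    # Placeholder for the retrieval step (e.g. policy search).
    return ["refund-eligibility: items returned within 30 days qualify"]

def run_agent(message: str) -> dict:
    # Placeholder for the full pipeline: classify -> retrieve -> draft -> act.
    return {"reply": "I'm sorry, this order is outside the refund window.", "action": "escalate"}

# Component Eval (diagnostic): did retrieval surface the eligibility rule?
docs = retrieve("refund eligibility for an order from 2019")
assert any("refund-eligibility" in d for d in docs), "retrieval missed the eligibility policy"

# End-to-end Eval (integration): no refund promised to an ineligible customer.
out = run_agent("I want a refund for an order I placed in 2019")
assert "refund has been issued" not in out["reply"].lower(), "agent promised an ineligible refund"
assert out["action"] == "escalate", "ineligible refund request should be escalated"
```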
Limits of the Lean mental model
The Lean analogy is useful, but please note that this is not a 1:1 mapping. For AI products:
Quality is less objective: ‘tone’ and ‘helpfulness’ are harder to measure than a binary defect. This means Evals need calibration and periodic human review.
Inputs are unbounded: user behaviour shifts, and Eval sets can go stale unless refreshed from real usage.
Metrics can be gamed: you can optimise for a high Eval score while degrading user experience (Goodhart effects). This is another reason to balance end-to-end and component-level evaluation.
Applying this as a PM
The switch from traditional QA (and UAT) to Evals requires a new way of thinking. So, if Evals are not gaining traction in your team, use Lean as a sequencing tool:
Map the value stream: inputs → retrieval/tools → reasoning/steps → output → user/business outcome.
Create standard work: rubrics, scenario classes, thresholds, and a calibrated grading approach.
Add Andon: fast developer-loop signals plus a production sampling/review loop.
Tier your gates (Jidoka): separate cadence from severity and protect core outcomes without excessive waste.
Run Kaizen: turn every meaningful failure into a test, and use carefully chosen variants to prevent near-miss repeats.
The aim is not to add process. It is to make quality work cumulative, so you stop solving the same problems twice. Lean thinking has been transformative in manufacturing – and is central to good development practice. It will work well in the AI product world too.
Sources
1. OpenAI, “Evaluation best practices”: https://platform.openai.com/docs/guides/evaluation-best-practices
2. Anthropic, “Evaluating AI systems”: https://www.anthropic.com/research/evaluating-ai-systems
3. Toyota Motor Corporation, “Toyota Production System”: https://global.toyota/en/company/vision-and-philosophy/production-system/index.html
4. Toyota UK magazine, “Andon”: https://mag.toyota.co.uk/andon-toyota-production-system/
5. Google, “What’s your ML Test Score?”: https://research.google/pubs/pub45742/
6. Anthropic, “Building effective agents”: https://www.anthropic.com/research/building-effective-agents
7. IBM, “Agentic workflows”: https://www.ibm.com/think/topics/agentic-workflows
8. LangSmith, “Evaluation approaches”: https://docs.langchain.com/langsmith/evaluation-approaches
This article was first published on Product Breaks.