Tim Scudder, Thoughts on Product
February 2026

Lean thinking for Evals

Most experienced Product Managers already have a workable quality playbook: agree what ‘good’ looks like, ship in slices, test, monitor, iterate. When you move into AI products, especially those built on Large Language Models (LLMs), that playbook still works, but crucial elements of it change.

LLM-driven products are sensitive in ways that are easy to underestimate, because the probabilistic nature of these systems fundamentally changes the notion of QA. A small prompt change, model upgrade, or retrieval adjustment can shift behaviour in ways that aren’t obvious in code review and can’t be caught reliably through spot checks. OpenAI’s guidance is direct on this: traditional testing methods are not sufficient on their own for generative AI, and ‘quality’ is harder to pin down (1).

Here, the mindsets and specific techniques associated with Lean provide useful mental models for Product Managers – and product teams as a whole. Not because AI development is manufacturing in the traditional sense, but because Lean focuses on standards, flow, the visibility of problems, and continuous improvement. This connects closely to the emergence of Evals as best practice.

What Evals are (and what they’re not)

An Eval is a repeatable way to measure whether an AI system meets defined quality criteria across representative scenarios. Most practical Eval setups include:

The point is not to build a perfect measurement machine. But we do want a quality system that sits inside day-to-day delivery, with shared measurement that enables sensible trade-offs and continual learning. Anthropic’s overview is worth reading because it makes clear how difficult (and important) evaluation is in practice (2).
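
At its simplest, the machinery is small: a set of scenarios, a grading function, and a threshold. The sketch below is illustrative only – the scenarios, grader, and the call_model stand-in are hypothetical placeholders, not a recommended standard:

    # Minimal Eval harness sketch. call_model stands in for the system under
    # evaluation and is assumed to return a dict like {"action": "..."}.
    scenarios = [
        {"input": "Customer asks for a refund on a damaged item",
         "expected_action": "escalate_to_human"},
        {"input": "Customer asks for opening hours",
         "expected_action": "auto_reply"},
    ]

    def grade(scenario, output):
        # Simplest possible grader: exact match on the chosen action.
        return 1.0 if output["action"] == scenario["expected_action"] else 0.0

    def run_evals(call_model, threshold=0.9):
        scores = [grade(s, call_model(s["input"])) for s in scenarios]
        average = sum(scores) / len(scores)
        return {"score": average, "passed": average >= threshold}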

A simple way to keep your thinking grounded is to name the quality dimensions you actually care about. For many products, this boils down to:

Note: Many of these criteria align with the Usability, Feasibility and Viability lenses we commonly apply to product development. Product-Market Fit and other desirability-type considerations are also crucial, but their connection to Evals might be slightly looser.

Let’s make this tangible

Imagine a contact-centre AI agent that reads an incoming customer email and decides what to do next.

The outputs still look good in isolation, but the failure shows up in the workflow: the wrong cases get automated, and humans see them only after customers complain. This is the pattern Evals are designed to prevent: a shared standard for ‘automate vs escalate’, fast signals when that boundary drifts, and a habit of turning these incidents into permanent tests.
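
One way to stop that boundary drifting silently is to pin it down as a concrete Eval case. The example below is a hedged sketch – triage_email and the action labels are hypothetical names, not the product’s real interface:

    # Hypothetical regression case for the 'automate vs escalate' boundary.
    # triage_email represents the agent's routing decision for an incoming email.
    def test_refund_dispute_is_escalated(triage_email):
        email = ("I've been charged twice and your last reply closed my case. "
                 "I want this refunded today or I'm going to my bank.")
        decision = triage_email(email)
        # A double-charge dispute with churn risk should not be auto-handled.
        assert decision == "escalate_to_human", f"expected escalation, got {decision!r}"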

Lean > QA

On the surface, Evals often look like QA: test cases, regression suites, CI gates. That similarity is helpful.

However, the difference is operational. Where many software tests are binary and stable over time, Evals often are not. They measure degrees of quality across multiple dimensions, in a system whose behaviour can shift with prompts, data, retrieval, or model changes. Google’s ‘ML Test Score’ rubric captures the broader idea: production readiness for ML systems depends on ongoing evaluation and monitoring, not just pre-release testing (5). This is where Lean thinking comes in.
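
To make the contrast concrete, here is a hedged sketch of graded, multi-dimension scoring rather than a binary pass/fail – the dimension names, baseline, and tolerance are illustrative:

    # Illustrative graded scoring: each dimension returns a score in [0, 1],
    # and the suite tracks movement against a recorded baseline.
    def score_response(scenario, output, graders):
        # graders maps a dimension name to a scoring function.
        return {name: fn(scenario, output) for name, fn in graders.items()}

    def regressions_vs_baseline(scores, baseline, tolerance=0.02):
        # Flag any dimension that drops more than the tolerance below its baseline.
        return {d: s for d, s in scores.items() if s < baseline[d] - tolerance}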

So, the framing is:

Mapping Lean principles to Evals

1) Standard work: define “good”, link it to outcomes, and keep it visible

In Lean, standard work is the baseline that makes improvement possible. Without a shared definition of ‘normal’, you can’t reliably spot abnormality.

In Eval terms, ‘standard work’ is:

To make this usable inside a product team, communicate it like any other product standard:
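
One way to keep the standard visible and shareable is to hold it as a small, versioned artefact alongside the Eval code. The fields below are a hedged sketch, not a prescribed schema:

    # Hypothetical 'standard work' record: what good looks like, in one place.
    STANDARD = {
        "version": "2026-02-01",
        "owner": "contact-centre product team",
        "scenario_classes": ["refund_dispute", "simple_faq", "complaint"],
        "thresholds": {"action_accuracy": 0.95, "tone": 0.80},
        "grading": "LLM grader calibrated against a human-labelled sample",
    }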

2) Andon: make problems visible early, and keep the feedback loop fast

Andon is about surfacing issues quickly so they can be addressed before they spread (4).

In Eval terms, the idea is simple: regressions should be visible soon after they are introduced. But there is a practical constraint: an Andon cord is useless if it takes an hour to pull. So, treat speed and cost as product requirements for your Eval suite:

The key PM responsibility is the operating agreement: who responds when the signal goes red, what happens next, and how quickly you turn that signal into a prioritised countermeasure.
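
A hedged sketch of what an Andon-style gate can look like in the developer loop: run a small smoke suite, compare it to the recorded baseline, and fail loudly – with an explicit time budget so the cord stays cheap to pull. The names and limits are illustrative:

    import sys
    import time

    # Illustrative fast gate: small smoke suite, hard time budget, loud failure.
    def andon_gate(run_smoke_suite, baseline_score, max_seconds=120):
        start = time.time()
        result = run_smoke_suite()          # assumed to return {"score": float}
        elapsed = time.time() - start
        if elapsed > max_seconds:
            print(f"ANDON: smoke suite too slow ({elapsed:.0f}s > {max_seconds}s)")
            sys.exit(1)
        if result["score"] < baseline_score:
            print(f"ANDON: score {result['score']:.2f} below baseline {baseline_score:.2f}")
            sys.exit(1)
        print("Smoke Evals passed")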

3) Jidoka: decide what stops the line, without creating Muda (waste)

Jidoka is commonly summarised as ‘build in quality and stop when abnormality appears,’ so defects do not keep flowing downstream. The AI version is a tiered approach that protects outcomes without creating muda (waste), including wasted developer time and unnecessary API spend (3). Here, we can separate cadence (when you run Evals) from severity (what they mean).

A model cadence might be:

While severity might differentiate between:

For example, in our contact centre, inaccurate agentic actions that trigger issues downstream might be hard blockers, while purely tonal issues in communications might be considered less urgent. Lean is useful because it makes the trade-off explicit: you are balancing flow, quality, and cost deliberately.
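
A hedged sketch of how cadence and severity can be kept separate in the gate itself – the tiers, dimensions, and thresholds are hypothetical:

    # Hypothetical severity tiers: hard blockers stop the line, soft checks only warn.
    HARD_BLOCKERS = ["action_accuracy", "policy_compliance"]
    SOFT_CHECKS = ["tone", "verbosity"]

    def release_gate(scores, thresholds):
        failures = [d for d in HARD_BLOCKERS if scores[d] < thresholds[d]]
        warnings = [d for d in SOFT_CHECKS if scores[d] < thresholds[d]]
        return {"block_release": bool(failures),
                "failures": failures,
                "warnings": warnings}

Cadence then becomes a scheduling decision – which subset runs on every change and which runs nightly – rather than something baked into the checks themselves.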

4) Kaizen and poka-yoke: make learning cumulative and reduce repeat failures

Kaizen is continuous improvement; poka-yoke is mistake-proofing. In Eval terms, the loop is straightforward:

  1. If a failure matters, capture it as a scenario.

  2. Add it to the Eval set so it cannot recur silently.

  3. Implement a countermeasure (e.g. prompt/tool/retrieval/UX/guardrail).

  4. Re-run Evals, update baseline and thresholds.
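
A minimal sketch of steps 1 and 2 of this loop – capturing a real incident as a permanent scenario. The fields are illustrative:

    # Illustrative: turn a production incident into a permanent Eval scenario.
    def incident_to_scenario(incident):
        return {
            "id": f"incident-{incident['ticket_id']}",
            "input": incident["customer_email"],
            "expected_action": incident["correct_action"],  # agreed in the review
            "notes": incident["what_went_wrong"],
            "added": incident["date"],
        }

Once appended to the Eval set, that failure can no longer recur silently.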

Synthetic data can make this stronger, but it needs guardrails:

This is how Evals become a maintained asset, rather than a static benchmark.
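
On the synthetic-data point, a hedged sketch of one guardrail: generate near-miss variants of a captured failure, but keep them labelled as synthetic and route them through human review before they join the Eval set. Here, paraphrase is a stand-in for whatever generation step you use:

    # Illustrative guardrails around synthetic variants of a real failure case.
    def propose_variants(scenario, paraphrase, n=3):
        variants = []
        for i in range(n):
            variants.append({
                "id": f"{scenario['id']}-synthetic-{i}",
                "input": paraphrase(scenario["input"]),
                "expected_action": scenario["expected_action"],
                "synthetic": True,           # never mixed silently with real cases
                "review_status": "pending",  # a human approves before inclusion
            })
        return variants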

Do things change for agentic workflows?

No. Many AI products are now workflows – classify-retrieve-draft-check-act – but many are increasingly ‘agentic’, like the contact-centre example, where the system can choose which steps and tools to use based on context (6)(7).

Agentic systems do not need a new Eval philosophy, but they do need clearer structure, because small component regressions can compound: a retrieval miss can turn into a wrong action, not just a slightly worse answer.

A simple rule is usually enough: start with end-to-end coverage on critical journeys, then add component Evals only where traces show repeated failure. Over-testing every step is a common source of waste. LangSmith’s guidance on evaluation approaches is a good reference for thinking about step-level vs system-level evaluation (8).
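
A hedged sketch of that rule in code: judge the end-to-end outcome on a critical journey first, and add a step-level check (retrieval here) only where traces show repeated failure. The trace fields are hypothetical:

    # Illustrative end-to-end check on an agent trace for a critical journey.
    def eval_end_to_end(trace, expected_action):
        return 1.0 if trace["final_action"] == expected_action else 0.0

    # Component check, added only after traces showed repeated retrieval misses.
    def eval_retrieval_step(trace, required_doc_id):
        retrieved = {doc["id"] for doc in trace.get("retrieved_docs", [])}
        return 1.0 if required_doc_id in retrieved else 0.0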

Limits of the Lean mental model

The Lean analogy is useful, but it is not a 1:1 mapping. For AI products:

Applying this as a PM

The switch from traditional QA (and UAT) to Evals requires a new way of thinking. So, if Evals are not gaining traction in your team, use Lean as a sequencing tool:

  1. Map the value stream: inputs → retrieval/tools → reasoning/steps → output → user/business outcome.

  2. Create standard work: rubrics, scenario classes, thresholds, and a calibrated grading approach.

  3. Add Andon: fast developer-loop signals plus a production sampling/review loop (a sketch follows this list).

  4. Tier your gates (Jidoka): separate cadence from severity and protect core outcomes without excessive waste.

  5. Run Kaizen: every meaningful failure becomes a test, and carefully chosen variants prevent near-miss repeats.
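
For step 3, a hedged sketch of the production half of the Andon loop: sample a small share of live interactions, grade them against the same standard, and raise a flag when the rolling score drifts. The sampling rate and tolerance are illustrative:

    import random

    # Illustrative production sampling loop, reusing the same grading standard.
    def sample_and_grade(interactions, grade, sample_rate=0.05):
        sampled = [i for i in interactions if random.random() < sample_rate]
        return [grade(i) for i in sampled]

    def drift_alert(recent_scores, baseline, tolerance=0.05):
        if not recent_scores:
            return False
        rolling = sum(recent_scores) / len(recent_scores)
        return rolling < baseline - tolerance   # pull the Andon when quality drifts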

The aim is not to add process. It is to make quality work cumulative, so you stop solving the same problems twice. Lean thinking has been transformative in manufacturing – and is central to good development practice. It will work well in the AI product world too.


Sources

1. OpenAI, “Evaluation best practices”: https://platform.openai.com/docs/guides/evaluation-best-practices

2. Anthropic, “Evaluating AI systems”: https://www.anthropic.com/research/evaluating-ai-systems

3. Toyota Motor Corporation, “Toyota Production System”: https://global.toyota/en/company/vision-and-philosophy/production-system/index.html

4. Toyota UK magazine, “Andon”: https://mag.toyota.co.uk/andon-toyota-production-system/

5. Google, “What’s your ML Test Score?”: https://research.google/pubs/pub45742/

6. Anthropic, “Building effective agents”: https://www.anthropic.com/research/building-effective-agents

7. IBM, “Agentic workflows”: https://www.ibm.com/think/topics/agentic-workflows

8. LangSmith, “Evaluation approaches”: https://docs.langchain.com/langsmith/evaluation-approaches

This article was first published on Product Breaks.
