Steve James | Product Leader & AI Strategy

Product Friction and the Comfort Crisis

Steve James — Sun, 31 May 2026 00:00:00 GMT

I read most of Michael Easter's The Comfort Crisis on the sofa, under a blanket, the heating on and a coffee going cold beside me, with a podcast queued for when I got bored. The book opens with Easter on a windy tarmac above the Arctic Circle, about to be dropped by bush plane into hundreds of miles of Alaskan backcountry for thirty-three days. Grizzlies, glacial rivers, no trail out. I was reading it while optimising my own discomfort down to zero, which is, more or less, his entire point.

The book's argument is that we have engineered hardship out of modern life so thoroughly that we are now suffering from its absence. We are, in Easter's words:

Sheltered, sterile, temperature-controlled, overfed, underchallenged, safety-netted.

He builds the case by going to find the people who have deliberately put the difficulty back: the NBA's top sports scientist using an ancient practice to forge championship athletes, an Oxford economist studying happiness through death in Bhutan, a neuroscientist measuring what extended time in the wild does to a stressed brain, and the bow-hunters who take him into the Arctic. The throughline is a body of research suggesting that the things we have spent centuries removing, hunger, cold, physical strain, boredom, sustained effort, were not only burdens. They were also inputs our minds and bodies are tuned to expect, and their total absence shows up as the modern epidemics of anxiety, restlessness and poor health.

His prescription is not suffering for its own sake. It is to reintroduce discomfort on purpose, in chosen and bounded doses: carry a heavy pack, sit with boredom instead of reaching for a screen, take on something hard enough that you might fail. Every advance that removes a small struggle feels like progress, because individually each one is. Warmer house, faster food, smoother everything. But the sum of all that smoothing is a life with no edges, and Easter marshals a fair amount of evidence that humans do badly without edges.

I am a product manager, not an adventurer, so the thought that kept interrupting my reading was not about the Arctic. It was that removing small struggles, one at a time, each removal obviously good on its own terms, is a precise description of what we do to software. For nearly every product team I have worked with, frictionless has quietly become an end in itself. For most of that history the goal stayed safely out of reach: there was always a step that could not yet be automated, a judgement only a person could make, so "reduce the friction" had a natural stopping point. AI is dissolving that limit. For the first time it is not impossible to picture a journey with no friction left in it at all, which makes a question we could once leave unanswered suddenly pressing: which friction were we ever right to remove? Strip enough of it out of a product, I have come to think, and you can end up with something similar to what comfort has done to us.

The reflex we have stopped questioning

Reducing friction is one of the most reliable instincts in product work. It is not the whole job, or even close to it; a good product manager is usually chasing bigger questions first, whether the thing is worth building at all, whether anyone actually wants it, whether it moves the outcome you set out to move. Nor is the experience itself the product manager's to draw: shaping the UX is the designer's domain. What lands on the product manager's desk is the prioritisation, the call on which problems are worth fixing and in what order, and smoothing a friction point is one of the most common items on that list. That is where the instinct bites. Fewer steps in the funnel. Fewer fields in the form. Fewer decisions for the user to make. Most of the time it is the right call, and I want to be clear about that before I argue with it, because the argument only works if you grant that the default is usually sound. A checkout that takes one tap instead of five really is almost always better. A signup that does not demand a fax number really is better. The history of good product work is, to a first approximation, the history of optimisation and refinement.

The trouble starts when a useful default hardens into an end in itself. Friction stops being one variable you tune against an outcome and becomes a thing that is simply bad, to be driven towards zero wherever it is found. Once that happens you stop asking the only question that matters, which is what a given piece of friction is doing. Some friction is waste: it costs the user effort and returns nothing. Some friction is load-bearing: take it away and the structure it was holding up comes down with it. Treating all of it as waste is how you saw through a structural beam because it was in the way. To be clear, none of what follows is an argument for making products harder, or for designing in struggle so that users feel they have earned the outcome. It is an argument against treating frictionlessness as a goal in its own right, when it is only ever a means, and one that sometimes works against the very thing you wanted.

Problem creep, and the satisfaction you can never reach

A really interesting idea I took from the book is one Easter borrows from the Harvard psychologist David Levari. In a series of studies, Levari and Daniel Gilbert had people pick the threatening faces out of a long sequence, then quietly made threatening faces rarer. People did not relax. They started reading neutral faces as threatening. Their assignment was to identify threatening faces, so they did. The same held when the task was judging research proposals as ethical or unethical: make the clearly unethical ones scarce and ambiguous ones start getting flagged. As a problem grows rarer, we expand our definition of it, so the number of problems we perceive never really falls. The unsettling implication is the one that matters here: we will report that something is wrong whether or not anything actually is. Levari called this prevalence-induced concept change; Easter calls it problem creep.

This is the most important thing a product manager could know about chasing a frictionless experience, and almost nobody frames it this way. Removing friction does not raise satisfaction, because the baseline resets to meet it. Cut the signup from eight fields to three and users will not bank the goodwill; they will start chafing at the email confirmation step that remains, and when that goes, the one onboarding screen left will feel like one too many. Satisfaction was never a function of how much friction there is, only of the gap between expectation and experience, and smoothing just teaches people to expect more. None of which is an argument against improving: continuous refinement is the work, and a substandard experience should never be left to stand. It is an argument against believing that each thing you smooth buys satisfaction, when often it only resets the baseline.

There is a deeper reason this shows up so reliably, and it is the same one behind the comfort crisis itself: deliberately introducing difficulty is not a natural instinct, but removing it is. We see an obstacle and want it gone, rarely pausing to ask what it was doing. That reflex is how we engineered the hardship out of modern life, and it is what surfaces when users tell us what is wrong with a product. They will reliably flag friction as a fault; it is vanishingly rare for anyone to point at a piece of friction and say it is making the experience, or the longer-term outcome, better, even when it is.

This is where a good product manager's rigour earns its keep, because the task problem creep leaves us with is exactly this: telling the real problems from the ones we are only hearing about because the threshold has moved. The study tells us the defect reports will keep coming whether or not anything is wrong; the judgement is in working out which is which. The discipline is to keep measuring against the outcome you set out to move and to refuse to be governed by individual incidents, whether a loud complaint, a counted click, or a stakeholder who disliked a screen. That is harder than it sounds, because it cuts against how our minds work: we react to the thing directly in front of us rather than holding a line on an outcome agreed weeks earlier. Treating each reported friction as a claim to be tested against the outcome, rather than a fault to be fixed on reflex, is a large part of the job and one of the least natural things it asks of you.

The value was in the effort

There is a second reason effort-free experiences disappoint, and it runs deeper than recalibration: for a great many activities, the effort was the part that made them worth doing. Mihaly Csikszentmihalyi spent decades studying the moments people report as most rewarding, the states of complete absorption he called flow, and they are never the easy ones. The best moments, he wrote, "usually occur when a person's body or mind is stretched to its limits in a voluntary effort to accomplish something difficult and worthwhile." Flow arrives when a challenge is matched to your skill and pitched just past your current reach: demanding enough to take all of your attention, not so demanding that you flounder. Too little challenge and attention drifts into boredom; too much and it tips into anxiety or frustration. The reward lives in the narrow band between, in the stretch itself.

That is an awkward finding for a discipline built on making things easier, because it means enjoyment and effort are not opposites. In the activities people care about most, the effort is the enjoyment. A climber in a flow state is not gritting through the ascent to reach a rest at the summit; the absorption on the wall is the point. Take the difficulty out of something that ran on flow and you do not raise the person to a higher state of ease. You remove the conditions that made it worth doing, and what is left is not contentment but a disappointing flatness. The lesson is not that hard is good and easy is bad, but that some of the effort we are quick to design away was carrying the experience, and smoothing it to nothing can drain out exactly what people came for. A product that automates the interesting part of a task and leaves the user the dull remainder has made their day easier and emptier at the same time.

This is exactly where AI-assisted creation gets uncomfortable. When a model does the whole job, the person who prompted it is left holding an output they did not struggle for, and the satisfaction of having made something does not arrive, because they did not make it. There is a related and well-observed problem: the more heavily AI-generated something is, the more generic it tends to be, since the models are trained on broadly the same material and regress towards the same middle. Jake Knapp and John Zeratsky have made the point that differentiation is the strongest predictor of whether a product survives, and that building fast with AI tends to pull teams towards the undifferentiated middle. So the frictionless creation tool can deliver a double disappointment: the work feels unearned, and it is not even distinctive. You have removed the effort that made the activity meaningful and the struggle that would have made the result yours. Some of that lost friction was the human staying in the loop, and it was doing more than anyone noticed: it was potentially elevating the work up from the average of everything the model had ever been trained on.

The friction some products keep

Easter's own prescription, again, is not suffering for its own sake. It is to reintroduce difficulty deliberately and in chosen doses, and his organising example is misogi: a practice, drawn from Japanese tradition and repurposed by a researcher he meets, of taking on something genuinely hard with a real chance of failure. Leave comfort, meet a challenge, struggle, return changed. The lesson for product work is narrower and a good deal less romantic: some of the effort we are quickest to remove is load-bearing, quietly doing a job, and the only way to know which is to look before you cut.

Look, then, at the products that declined to smooth everything away. Duolingo could hand you the translation and let you tap "continue" until a lesson counts as done; instead it drip-feeds material, drills you with repetition, and builds the experience around a daily streak. Peloton could be a quiet bike and instead puts a leaderboard in front of you. A game with no challenge is not relaxing, it is boring, and boredom is the fastest way out of an experience. The point is not to copy these and bolt difficulty onto your own product. It is that a blanket instinct to strip out friction would have wrecked every one of them, because in each case the effort was the thing the user came for. The UX field has a name for this balance. Nielsen Norman Group frame it as friction and flow: two tools to reach for as the journey demands, friction to slow people down where slowing down helps, flow to speed them through where it does not. In that framing friction is a design tool, sometimes there to prevent a careless action, sometimes there because it is carrying the value, not a defect to be removed on sight.

When the friction is ours

So far I have written as though friction is something we do to users. But the same thing is often pointed back at us, the people who build the products, and the most recent account I have read of what that feels like comes from Teresa Torres. She has spent years arguing that the synthesis at the heart of product discovery, the slow work of turning a stack of customer interviews into a structured map of the opportunity space, is cognitively demanding and worth doing properly. It is, in the terms of this piece, load-bearing friction. The effort is where the understanding forms.

She has also run her own interviews through AI, and she is candid about the result:

When I ran my interviews through Claude, it caught opportunities I missed. But I also caught things it missed. The highest-quality synthesis came from combining both perspectives... I still believe there's real value in doing the synthesis yourself. But I also know that a draft OST you actually refine is better than a perfect process you never get to. This is about raising the floor.

That is not a small concession from someone whose career is built on teaching the manual craft, and it is the agency question resolved sensibly: keep the human in the work, but let the machine raise the floor. The danger she is steering around is the obvious one. It is not that AI does some of the synthesis. It is letting it do all of it, so that the draft is never refined, the judgement is never applied, and the understanding that only forms through the effort never forms at all. Torres is keeping the load-bearing friction and shedding the rest. That is exactly the discrimination the job now requires, and it is the corrective the comfort-crisis argument needs, so it does not curdle into nostalgia for hardship.

Telling load-bearing friction from waste

None of this is a licence to make products annoying. The Arctic is not better than your living room because it is colder; Easter's case rests on the discomfort being chosen and purposeful. The same condition applies to product friction. The job is not to add hardship. It is to develop judgement about which friction is load-bearing and which is waste, and to stop treating the distinction as if it did not exist.

A few questions help me sort one from the other. What is this friction producing? If the honest answer is nothing, that the user pays and gets only delay, remove it without ceremony; that is exactly the friction the instinct exists to clear away. But if the friction is producing comprehension, or care, or commitment, or a sense of authorship, or a moment where the user actually understands the decision they are making, then it is holding something up, and removing it will quietly drop whatever that was. A confirmation step on an irreversible action is load-bearing. The effort of arranging your own thoughts before a tool acts on them is load-bearing.

The agency dimension is the one I would watch hardest as we wire more AI into products. Aishwarya Naresh Reganti and Kiriti Badam, who have shipped more than fifty enterprise AI products, frame it as a trade: every increment of autonomy you hand to a system is an increment of control the user gives up, and you should only ask for it once the system has earned the trust. Their counsel is to start with the problem rather than the solution, keeping human control high and agency low at first, and to hand over autonomy slowly. That is the same instinct as keeping load-bearing friction. An assistant that quietly does everything has removed the user's friction and their agency in the same motion, and turned a creator into a passive curator of outputs they cannot vouch for. The smooth version is not obviously the better product. It is just the smoother one, and we have learned to mistake those for the same thing.

This is why the distinction matters more now than it ever has. For as long as removing friction was genuinely hard, our blunt tools kept some of the load-bearing kind in place by accident, because there was always a step we could not yet take out. AI has the potential to remove that safety margin. Once it is possible to strip out all of the friction, keeping the right friction stops being a refinement at the edges and becomes the heart of the job. And much of what is worth keeping is the friction that holds a human in the loop: the effort that forces a judgement, the step that makes someone own a decision, the resistance that keeps the work from dissolving into the mediocre middle that a model, left to its own devices, will always drift towards.

That mistake is the comfort crisis, ported from Easter's wilderness into our roadmaps. Each removal of friction looks like progress because each one is, locally. The cost only shows up in aggregate: experiences with no edges, users whose satisfaction baseline has crept out of reach, work that feels unearned because it was. The frictionless ideal is worth questioning not because friction is good, but because "remove all of it" is not a strategy. It is the absence of one.

Governance Is a Differentiator: Safety, Regulation, and Incident Readiness for AI Products

Steve James — Mon, 18 May 2026 00:00:00 GMT

Here is a claim that rarely makes it into a pitch meeting: governance is a differentiator.

Not a constraint. Not a compliance checkbox you fill in so legal signs off and the product ships. A differentiator. The companies that treat safety, fairness, and incident readiness as features, not friction, are the ones that win high-stakes enterprise deals, pass regulator scrutiny, and retain customers through problems. The companies that treat governance as a Q4 nice-to-have are the ones that lose the deal when the customer's legal team asks to see their incident response playbook and there isn't one.

It is worth being honest about why governance gets deprioritised. The entire language of product management is built around speed: ship fast, iterate, grow, capture the market before the competition does. OKRs reward launches. Roadmap reviews reward velocity. Governance work is often invisible when it succeeds and only visible when it fails, which makes it structurally easy to defer in favour of the next feature. This has always been the tension for highly regulated products: pharmaceuticals, medical devices, financial services. Those industries learned (sometimes painfully) that governance is not opposed to speed; it is the thing that lets you sustain speed without a catastrophic reversal. AI products face the same lesson, but compressed into a much shorter timeline and with far less institutional muscle memory to draw on.

The good news is that governance does not have to conflict with speed to market or growth. You do not need to choose between shipping and being responsible. You just need to bake safety, fairness, and incident readiness into your processes the same way you bake in testing, code review, or design critique. Teams that treat governance as a parallel work-stream rather than a gate at the end find that it barely slows them down, and it materially de-risks the moments that would otherwise stop them dead: the enterprise deal that stalls on legal review, the regulator inquiry, the post-incident customer call. The cost of governance is predictable. The cost of its absence is not.

This is the final article in the AI Fluency series: Layer 4, Safety and Governance. If you've followed the series from the four-layer framework, you've built a mental model of:

How AI systems work (Layer 1),
How to measure whether they're working (Layer 2), and
How to build and debug the application layer (Layer 3, Part 2).

Those three layers let you build a product that works. This final layer answers the question that determines whether your product survives contact with the real world: when things go wrong, can you defend what you built? Can you fix it credibly? And can you do either without burning customer trust?

Governance conversations in product teams tend to fall into two modes: defensive (compliance says we must) or vague (we need to be responsible). Neither prepares you for the day a regulator asks how your model was validated, or a journalist calls because your product hallucinated legal advice, or an enterprise customer discovers your bias testing didn't account for their jurisdiction.

This layer is increasingly tested in interviews. It is also increasingly tested in contracts. European customers now ask for EU AI Act compliance documentation before signing. Enterprise legal teams ask to see your incident response playbooks. Regulators are moving from "AI safety is optional" to "high-risk AI systems must have risk management systems, human oversight, and documentation."

For a PM, this means Layer 4 is no longer deferrable. Everything that follows is evidence for the opening claim: governance is where trust is built, and trust is what keeps customers after the first incident.

Define the Problem Space: Safety vs Security

The first thing to untangle is this: "AI safety" and "AI security" are not the same thing, and conflating them in conversation erodes trust with technical customers.

AI Safety is about the model doing harm on its own. The model hallucinates. The model is biased. The model gives unsafe advice in normal operation, without any adversary involved. Safety is whether the system behaves as intended when used as designed.

AI Security is about adversaries making the system do harm. An attacker jailbreaks the model. An attacker embeds malicious instructions in a document the system reads. An attacker extracts training data through the API. Security is whether the system resists attacks.

This distinction matters because the causes are different, the mitigations are different, and the person responsible is different. A safety issue is a model problem. A security issue is an architecture and access-control problem.

Consider a legal AI assistant. A safety issue: the model confidently summarises a statute incorrectly, and a user relies on the wrong advice. A security issue: an attacker embeds instructions in a contract PDF; the model reads the contract and sends the document to an attacker's server instead of helping the user analyse it.

Both are bad. Both need to be prevented. But you fix them differently. Safety is about prompt iteration, eval design, and guardrails on output. Security is about privilege separation, content sanitisation, and input validation.

The OWASP Top 10 for LLM Applications is the standard reference for security. The 2025 edition (v2.0) covers prompt injection, sensitive information disclosure, supply chain vulnerabilities, data and model poisoning, improper output handling, excessive agency, system prompt leakage, vector and embedding weaknesses, misinformation, and unbounded consumption. If you have not reviewed this list in the context of your product, do that. Now.

For safety, the concern is broader: hallucination, bias, sycophancy, specification gaming, distributional shift. These are failures of the model to do what you intended, not failures of the system to resist attacks.

Understand What You Inherit: Alignment and Its Limits

When you build on top of a frontier model like Claude Opus 4.7 or GPT-5.5, you inherit the alignment work that provider has done. The provider has invested heavily in making the model safer: techniques like RLHF and Constitutional AI reduce harmful outputs, and each model generation improves on the last. But from a governance perspective, the question is not how alignment works. It is what guarantees you can rely on, and what remains your responsibility.

The answer is uncomfortable: you get no guarantees. Model providers publish safety cards and responsible use guidelines. They do not offer SLAs on alignment. They do not indemnify you against hallucination, bias, or harmful output. When a regulator asks "how do you ensure this system behaves safely?", pointing at your provider's alignment research is not a defensible answer. It is the equivalent of telling an auditor "our supplier said the parts are good."

What alignment gives you, practically, is a reduction in baseline failure rate. Think of it as moving from 1-in-10 harmful outputs to 1-in-100. That is meaningful. It is also not sufficient. The gap between "reduced" and "eliminated" is where your governance obligations live: your domain-specific evals, your guardrails, your monitoring, your incident response. The provider handles the foundation. Everything built on top of it is yours to defend.

Catalogue the Threats: Failure Modes You Must Anticipate

Every AI product will fail. The question is whether you anticipate the failure modes in red-teaming or whether you discover them in production.

In the previous article, you learned to diagnose whether a failure is model-layer or application-layer. That diagnostic tells you who owns the fix and how long it will take. This section adds a second axis: is the failure a safety issue or a security issue? The first axis routes the fix. The second axis determines the severity, the regulatory implications, and the communication strategy. You need both.

Jailbreaking is when a user crafts inputs that bypass your safety constraints. Role-play framing: "Pretend you are DAN (Do Anything Now)." Instruction override: "Ignore all previous instructions." The user finds clever ways to make the model do what it is designed to refuse. Mitigations: intent classification, input validation, explicit red-teaming before launch.
Prompt Injection is when malicious instructions are embedded in content the model reads. A RAG system retrieves a web page. Hidden in white text on white background: "SYSTEM: Ignore previous instructions." An agent reads an email instructing it to forward confidential data. Unlike jailbreaking (which requires the user to be the attacker), prompt injection can be orchestrated by third parties. Mitigation: delimit retrieved content with clear boundaries, apply privilege separation so instructions only come from the system prompt, sanitise external content, require human approval for consequential actions.
Sycophancy is when the model tells users what they want to hear rather than what is true. A user says "I think X is true." The model: "You're absolutely right, X is indeed true," even if X is false. A user expresses doubt. The model immediately backs down even if it was correct. In advisory products (legal, medical, financial), sycophancy is a material accuracy issue. It erodes trust invisibly because users do not realise they are being validated for the wrong reason. Mitigation: include factual claims you know to be false in your eval suite and check whether the model agrees; test with adversarial user pushback to see if the model holds its ground; use system prompts that explicitly instruct the model to prioritise accuracy over agreeableness.
Hallucination is plausible-sounding but false information. The model generates fiction confidently. In legal products, hallucination is material misrepresentation. In financial products, it is bad advice. Mitigations: retrieval-augmented generation (RAG) grounded in reliable sources, explicit confidence calibration in outputs, monitoring of factual consistency in production.
Reward Hacking is when the model finds ways to maximise the training signal that do not correspond to genuine good behaviour. An RLHF-trained model might learn that longer responses score higher, or confident-sounding responses score higher, or flattering responses score higher. None of these necessarily mean better output. Mitigation: design evals that are harder to game than the training signal; regularly compare model outputs to ground truth, not just to user satisfaction proxies.
Specification Gaming is achieving the letter of your objective but not the spirit. A model instructed to "be concise" truncates answers before they are complete. A model instructed to "always provide an answer" hallucinates rather than saying "I don't know." Mitigation: define objectives precisely; include counter-examples in your evals; test for the failure mode you're trying to prevent, not just the success case.
Distributional Shift is when the real world differs from your training data. You built evals on English legal text. You deploy in a multilingual context. Performance degrades on non-English queries in ways your evals did not capture. Users will find novel use cases that were not in your test data. Mitigation: regular production sampling; monitoring of query distribution; expanding evals based on real user behaviour.

The triage tree:

AI product failure detected
        │
        ├── Was there an adversary involved?
        │       │
        │      YES → SECURITY failure
        │       │
        │       ├── User-supplied attack?
        │       │       └── Jailbreaking
        │       │
        │       └── Embedded in retrieved content?
        │               └── Prompt Injection
        │
        └── No adversary involved?
                │
               YES → SAFETY failure
                │
                ├── Model agrees with false premise?
                │       └── Sycophancy
                │
                ├── Model invents information?
                │       └── Hallucination
                │
                ├── Model games its objective?
                │       └── Reward Hacking / Spec Gaming
                │
                └── Model fails on new data distribution?
                        └── Distributional Shift

Is the failure caused by something outside the model (an adversary, a system component) or inside the model (the model's own misalignment)? Is it jailbreaking (adversary, user-supplied input) or prompt injection (adversary, retrieved content) or a guardrail gap or hallucination (model)? Each diagnosis points to a different fix.

The failure modes that kill trust silently are sycophancy and hallucination. Both are invisible to users unless you test for them explicitly. Both erode the thing the product sells: trustworthiness.

Zoom In on the Hardest One: Bias and Fairness

Bias in AI is systematic, not random. It comes from training data, from annotators, from feedback loops, and from the evals you chose to run.

The sources are well-documented. Training data reflects human history. Annotators reflect their own demographics and preferences. Feedback loops amplify initial biases. And aggregate accuracy hides all of this: a model can score 95% accuracy overall while performing at 70% for a specific demographic or language.

For legal AI, the fairness risks that standard PM playbooks underweight are the ones that actually bite. A legal AI product is trained on reported case law, which over-represents appealed cases and commercial matters. Divorce, immigration, welfare, small-value employment are under-represented. A user seeking advice on family law gets outputs trained on a distribution that does not match their needs. Jurisdictional bias is worse: English and US law dominate training data. A model asked about Scottish or Northern Irish law may silently default to English common-law assumptions, which reads correct to someone unfamiliar with the distinction but is materially wrong.

Bias testing is not a checkbox. It requires sliced evaluation: testing performance disaggregated by jurisdiction, case type, language, and demographic proxy. A 95% accuracy number that hides 70% accuracy for a specific group is worse than useless; it is misleading. Your floor (the worst-performing bucket) is what you need to commit to and defend.

Fairness metrics are also in tension. Demographic parity (equal positive outcomes across groups) conflicts with equal opportunity (equal positive outcomes among those who deserve it). You cannot satisfy both simultaneously if base rates differ. This means you must choose which definition of fairness your product optimises for, and be prepared to explain and defend that choice.

Face the Regulatory Reality: Compliance as Launch Blocker

AI regulation is moving from "eventually, probably" to "now, non-negotiable." What follows is the landscape as of early 2026. This area moves fast; check primary sources (EU AI Act text, UK AI Regulation page, NIST AI RMF) for current status before making compliance decisions.

The EU AI Act is the strictest framework. It is risk-tiered. Unacceptable-risk applications are banned outright (social scoring by governments, real-time biometric ID in public spaces). High-risk AI (employment, education, credit, law enforcement, administration of justice) faces extensive obligations: risk management systems, data governance, human oversight, transparency, conformity assessment before launch. The original implementation timeline sets August 2026 as the date when high-risk AI systems face full compliance obligations, though the European Commission's Digital Omnibus proposal (November 2025) would defer this to December 2027. At the time of writing, the deferral has not been formally adopted; plan against the August 2026 deadline unless and until it is.

For a legal AI product, "administration of justice" potentially pulls you into high-risk territory. What that means: you need conformity assessment before launch in the EU. You need to document data sources, conduct bias testing, have qualified humans in the loop, and be prepared to defend all of this to regulators. Budget 3-6 months for this before an EU launch.

The UK has chosen a different path: principles-based regulation via existing sector regulators (the Solicitors Regulation Authority for legal services, the ICO for data protection). No dedicated AI Act, but the expectation that existing frameworks apply. Solicitors remain liable for AI outputs. "The AI did it" is not a defence. GDPR Article 22 rights apply: if an AI system makes a decision that materially affects a person, that person has the right to human review and contestation.

The US is fragmented: sectoral regulation (EEOC for hiring, CFPB for credit) plus state-level activity (Colorado AI Act is closest to the EU approach). No comprehensive federal AI law, but movement toward the NIST AI Risk Management Framework as de facto standard.

For a global product, EU compliance tends to set the highest bar. If you can ship in the EU, you can usually ship in the UK and US. If you cannot meet EU standards, your feature may be geofenced or not available globally.

What does this mean for a PM? If your product is high-risk (which most legal and financial AI is), regulation is not a Q4 nice-to-have. It is a launch blocker. You need to know your risk classification. You need conformity assessment in your critical path. You need human oversight workflows designed in, not bolted on. You need bias testing that is defensible to regulators. This is not optional.

Mik Kersten's Project to Product offers a useful lens here. Kersten argues that only four types of work flow through a product value stream: features, defects, risks, and debt. The categories are mutually exclusive and collectively exhaustive. Every item in your backlog is one of the four. The practical consequence is that risk work (compliance, safety testing, incident preparedness) is not a side activity that competes with "real work"; it is one of the four fundamental flows. If your flow distribution shows 90% features and near-zero risk, you are not shipping fast. You are accumulating exposure. Good PMs track all four flows and adjust the distribution deliberately, especially in a regulatory environment where unaddressed risk becomes a launch blocker or, worse, a post-launch crisis.

Plan for Failure: Incident Response

Your AI product will fail. The difference between companies that survive incidents and those that don't is not whether the incident happened; it is whether you detected it, contained it, and communicated about it credibly.

Incident readiness has six phases:

  1. Detection
     Monitor, alert, user reports
          ▼
  2. Triage
     Severity framework (S1-S4)
          ▼
  3. Containment
     Disable, revert, emergency guardrail
          ▼
  4. Investigation
     Root cause analysis
          ▼
  5. Remediation
     Fix, test, deploy
          ▼
  6. Post-Incident Review
     Blameless review → feed into eval suite

Detection is harder than you might think. You need monitoring and alerting on safety metrics (hallucination rate, safety filter triggers, unusual output patterns). You need users to be able to report issues. You need red-teaming that catches problems before production. If you cannot see the failure, you cannot respond to it.

Triage happens fast. What failed? Who is affected? How severe is it? Is it ongoing? Use a severity framework beforehand:

Severity 1: critical, causes serious harm
Severity 2: widespread bias or successful jailbreak
Severity 3: isolated failures
Severity 4: quality issues without safety implications

Knowing the severity upfront drives containment.

Containment stops the harm. Disable the feature. Revert to a previous model version. Add an emergency guardrail. Take the system offline if necessary. The right action depends on severity and the nature of the failure.

Investigation determines root cause. When did it start? Which component failed? Can you reproduce it? Is it a known issue with the model, or specific to your application?

Remediation fixes the problem. Change the prompt. Add a guardrail. Rollback the model. Update the knowledge base. Retrain if necessary. Test thoroughly before deploying.

Post-incident review is blameless (focused on systems, not people) and generative (every incident becomes a new test case in your eval suite). A post-mortem that does not feed into your evals is a post-mortem that taught you nothing.

The wrong response to an incident is: "This is a known issue with LLMs; we've adjusted the prompt." That statement is technically true and commercially fatal. It tells customers you do not understand the problem, you do not have a plan, and they should leave.

The right response is: "We detected this failure, contained it within [timeframe], identified the root cause, implemented a fix, and added test cases to prevent recurrence. Here is our incident report." That tells customers you have processes. You can be trusted to do better next time.

Test Your Understanding

Before you move on, test yourself on these questions:

You are building a legal AI assistant. Your model gives bad advice on family law but performs well on commercial litigation. Your evals show 92% accuracy overall. Is this a problem? Why, and what would you do?
A user successfully jailbreaks your product and makes it generate content it is designed to refuse. Is this primarily a safety problem or a security problem? What should you do?
Your regulator asks whether your product falls under the EU AI Act's high-risk provisions. You are deployed in the EU and your product assists with contract analysis. What factors would you consider in answering this?
You discover that prompt injection is possible in your RAG system: malicious instructions in retrieved documents can hijack the model. What is your immediate containment step, and what is your architectural fix?
You want to build an incident response playbook. What are the six phases, and what would you include in each one?

Why This Layer Matters

This layer is easy to defer. It feels defensive rather than generative. It requires knowledge that sits outside your usual domain. It demands conversations with legal and compliance teams that slow down velocity. It means red-teaming before launch. It means admitting your product is not perfect.

But this is the layer that determines whether customers stay after the first incident. The companies that say "yes" to European enterprise deals have conformity assessment in progress. The companies that recover credibly from incidents have playbooks. The companies that raise Series B on the back of "we built this responsibly" rather than "we built this fast" are the ones that treated governance as a feature, not friction.

That is the argument this article opened with, and everything in between is the evidence for it. Safety vs security distinctions, alignment understanding, failure mode taxonomies, bias testing, regulatory awareness, incident response playbooks: none of these are compliance theatre. They are the substance of trustworthiness. They are what you point to when a customer's general counsel asks "can we trust your product?" and you need a better answer than "we're working on it."

The Complete Framework

This is the final article in the series. If you've followed it from the beginning, step back and consider what you now have.

┌────────────────────────────────────┐
│  Layer 4: Safety and Governance    │
│  Trust, regulation, incidents      │
│  You can defend what you built     │
│  and survive the first crisis.     │
├────────────────────────────────────┤
│  Layer 3: Product Architecture     │
│  Context engineering, triage       │
│  You can design and debug the      │
│  application layer.                │
├────────────────────────────────────┤
│  Layer 2: Evaluation and Quality   │
│  Golden datasets, regression,      │
│  eval flywheel                     │
│  You can measure whether your      │
│  product works.                    │
├────────────────────────────────────┤
│  Layer 1: How Models Work          │
│  Tokens, attention, inference      │
│  You can hold your own in a        │
│  technical conversation.           │
└────────────────────────────────────┘

Layer 1 gave you the mechanical understanding to sit in a room with engineers and ask the right questions. Layer 2 gave you the measurement discipline to make evidence-based product decisions instead of guessing. Layer 3 gave you the design knowledge and diagnostic method to build and debug the application layer, and to give stakeholders a credible timeline when something breaks. Layer 4 gave you the governance, safety, and regulatory awareness to defend what you built, survive an incident, and retain customers through problems.

The layers are not independent. A fabricated citation is a Layer 1 failure (next-token prediction confabulated a reference), a Layer 2 failure (the eval suite didn't test for introduced entities), a Layer 3 failure (no enforcement guardrail behind the prompt rule), and a Layer 4 failure (the customer's general counsel needs a credible incident response, not "it's a known issue with LLMs"). Real production problems are cross-layer. Your job as a PM is to move between them fast enough to diagnose, fix, and communicate before the problem becomes a crisis.

That movement is the fluency this series set out to build. It is not about knowing everything. It is about knowing what you know, knowing what you don't, and being honest about the boundary. It is being able to say: "I'd need to pull the eval logs to give you a precise number, but here's how I'd think about whether precision is even the right metric for this use case."

When I started working on AI products, I didn't know what I needed to learn. I just started researching, and the information I collected in that Obsidian vault along the way formed the basis for this series. I wanted to make it available to every PM who is feeling the same gap I felt. I hope it's been useful.

Triage: The Diagnostic Discipline That Separates Fixable From Fundamental

Steve James — Sun, 10 May 2026 00:00:00 GMT

Remember the returns policy failure from the previous article? Your AI support agent told a customer they had 90 days to return an item. Your actual policy is 30 days. The customer tried on day 45, got rejected by a human agent, and is now furious. You now know how the context pipeline should have been built: the five types of context, retrieval techniques, assembly decisions. What you don't yet have is a way to use that knowledge when things go wrong. A repeatable method for figuring out where a failure lives, who owns the fix, and how long it will take.

Your engineering lead asks in standup: "Should we file this as a model limitation with the provider, or is this something we can fix?" Your instinct might be to defer. After this article, you won't need to.

This is Layer 3, Part 2: diagnosis. Part 1 taught you how to design the application layer. This article teaches you what to do when it breaks. Together, they complete Layer 3 of the AI Fluency framework: Product Architecture.

The Reproduction Test

Here is the single most useful diagnostic question in AI product work:

Does this failure reproduce when you send the same prompt directly to the raw model, with no retrieval, no orchestration, no guardrails?

Open your provider's API playground: Anthropic's Workbench, OpenAI's Playground, or the equivalent for whichever model you use. Strip out your RAG pipeline. Strip out your system prompt. Give the model the same question the user asked, with just the relevant context pasted in by hand. See what happens.

If the failure reproduces, it's a model-layer problem. The model genuinely cannot do this task reliably, even with perfect context. Your options are limited: stronger prompt engineering, a more capable model, or escalation to the provider. These are slow fixes, measured in weeks to months.

If the failure disappears, it's an application-layer problem. Something in your pipeline is breaking the interaction between the model and the user. Your retrieval is pulling the wrong documents. Your context assembly is burying the answer. Your output processing is mangling the response. These are fast fixes, measured in hours to days.

    User reports: "The AI got this wrong"
                    │
                    ▼
    Reproduce on the raw model?
    (Same prompt, no pipeline,
     direct API call)
                    │
           ┌────────┴────────┐
           │                 │
          YES                NO
           │                 │
           ▼                 ▼
     MODEL LAYER        APP LAYER
           │                 │
     Slow fix:          Fast fix:
     prompt iteration,  debug retrieval,
     switch model,      context assembly,
     escalate to        guardrails,
     provider           output processing

This is not a theoretical framework. It's a practical test you can run in ten minutes. And the result changes everything about who works on the problem, how long it takes, and what you tell the customer.

What the Test Actually Looks Like

The reproduction test sounds simple, but doing it well requires some care. Here's what to actually do.

First, grab the exact user query that failed. Not a paraphrase of it, the actual input. Failures are often sensitive to phrasing, and testing with a tidied-up version defeats the purpose.

Second, open the provider's API playground: Anthropic's Workbench or OpenAI's Playground, whichever you use. The point is to bypass your entire application stack.

Third, paste in the user's query along with the context the model should have had: the relevant policy document, the product specs, whatever the retrieval pipeline was supposed to provide. Paste it manually so you'll know exactly what the model is seeing.

Fourth, run it. If the model gets it right with clean context and no pipeline, the pipeline is the problem. If it gets it wrong even with perfect context handed to it directly, the model is the problem.

This test has a secondary benefit: it builds your intuition for what the model can and cannot do. After running a dozen of these, you start to develop a sense for which failures are likely to be application-layer before you even test. That intuition is worth developing deliberately.

The Investigation: A Customer Support Agent Gone Wrong

Let's walk through the returns policy example. Your AI support agent is telling customers they have 90 days to return items. Your actual policy is 30 days. A customer tried to return something on day 45, was told by a human agent that it was too late, and is now furious because "your AI told me I could."

You run the reproduction test. You paste the customer's question ("What's the return window for this product?") into the API console along with your returns policy document. The model answers correctly: 30 days. So the pipeline is broken somewhere. Now you need to find where.

Step 1: Is the returns policy in your knowledge base at all?

This sounds obvious, but check. Teams add product information, troubleshooting guides, and FAQ content to their knowledge bases, and sometimes miss the mundane operational documents. If your returns policy was never ingested, the model has no source of truth to draw from. It will fall back on its training data, which might include outdated versions of your policy, a competitor's policy, or a generic "most retailers offer 90 days" inference. Fix: ingest the document. This takes hours.

Step 2: Is it being retrieved when customers ask about returns?

The document exists, but is it surfacing? You need to see what the retrieval pipeline actually returns for the customer's query. How you do this depends on your stack. If you're using a managed platform like LangSmith, Langfuse, or Arize, open the trace for the failed request and look at the retrieval step: it will show you exactly which chunks were pulled and in what order. If your team has built a custom pipeline, ask your engineer to point you to the retrieval logs or to run the query against the vector store directly so you can see the ranked results. Most vector databases (Pinecone, Weaviate, Qdrant, pgvector) have a console or API endpoint where you can run a similarity search and inspect what comes back. The key question is simple: when a customer asks about returns, does the returns policy appear in the top results? If it isn't ranking in the top chunks, your retrieval is the bottleneck. Maybe the document's title is "Operational Policies Q1 2026" and the query "can I return this?" doesn't match it semantically. Maybe your chunking split the 30-day clause away from the surrounding context. Fix: adjust chunking boundaries, add metadata tags, or switch to hybrid retrieval. This takes days.

Step 3: Is the model ignoring the retrieved content?

The policy is being retrieved and it's in the context, but the model is still saying 90 days. This is a grounding failure. The model is preferring what it learned in training over the document you gave it. Your grounding instructions aren't assertive enough. Fix: strengthen the relevant part of your system prompt. Be explicit: "When answering questions about company policies, use only the provided documents. Do not infer or supplement with general knowledge." This takes hours.

Step 4: Is the content in your knowledge base stale?

Everything is working correctly, and the model is faithfully using the retrieved document. But the document itself says 90 days because it's from last year, before you changed the policy. This is a content management problem, not an AI problem. Fix: update the document and establish a refresh cadence for policy content. This takes days to fix and weeks to systematise.

Step 5: Is the model reasoning incorrectly even with correct, current content?

You've confirmed the right document is being retrieved, it's current, and the grounding instructions are strong. But the model is still getting the answer wrong. Maybe the policy has a complex conditional structure ("30 days for electronics, 60 days for clothing, 14 days for perishables") and the model is applying the wrong condition. This is a genuine model-layer limitation. Your options: add structured reasoning instructions to force step-by-step evaluation of which condition applies, switch to a more capable model, or restructure the policy document to make conditions unambiguous. This is the rarest outcome.

To recap the sequence: missing document, retrieval miss, grounding failure, stale content, genuine model limitation. In practice, steps 1 through 4 account for roughly 90% of failures that initially look like "the model is wrong." Step 5 is the residual. Diagnosing it requires ruling out everything else first, which is why the sequence matters.

Why This Is a PM Skill, Not an Engineering Task

You might be thinking: isn't this debugging? Shouldn't the engineers handle it?

They should handle the fix. But the triage belongs to you, for three reasons.

First, you're the one fielding the customer complaint or the stakeholder question. If you can't diagnose the layer in the meeting where the problem is raised, you lose a day to "let me check with the team." If you can, you give a credible answer on the spot. "This looks like a retrieval issue. We can investigate today and likely have a fix this week." That sentence changes the temperature of the conversation.

Second, you own prioritisation. Knowing the layer tells you the cost of the fix, which determines where it sits in the backlog. Application-layer fixes are cheap and fast, which means they should rarely be deferred: if you can fix a retrieval miss in a day, there is no good reason to leave it in the backlog for a sprint. Model-layer problems are expensive and slow, which means deferral is sometimes the right call: you might work around the limitation, accept it as a known constraint, or invest in a model switch when the roadmap allows. That asymmetry is a product decision, and you're the one making it.

Third, you're the person who spots patterns. A single retrieval failure is a bug. Ten retrieval failures across different document types are an architectural problem that needs investment. You can only see that pattern if you're doing triage consistently, not delegating every quality report to engineering with a "please investigate."

The Stakeholder Conversation

The practical payoff of layer diagnosis is the conversation it enables.

Without it, you say: "We're looking into the issue and will get back to you." This is the default, and it's weak. It communicates nothing about severity, timeline, or whether you understand the problem.

With it, you say: "We've traced this to a retrieval issue. Our pipeline isn't surfacing the returns policy when customers ask about returns. The document is in the knowledge base, but the semantic match is poor because of how it's titled and chunked. We can fix the chunking and add metadata tags this week. I'll confirm when it's deployed and we'll monitor the same query class for a few days after."

That is a different conversation. It names the layer, identifies the specific failure, scopes the fix, commits to a timeline, and describes the verification step. It's the difference between forwarding a ticket and owning the problem. The diagnosis that makes it possible takes ten minutes.

When the Layer Isn't Obvious

Not every failure falls cleanly into one bucket. Some common ambiguities and how to handle them:

The model gets it right sometimes and wrong sometimes. This is almost always application-layer. Inconsistent retrieval (the right document surfaces on some queries but not others), temperature set too high (the randomness parameter from Article 2), or context ordering issues (the answer is in the context but buried in the middle where the model attends poorly). Run the reproduction test multiple times with the exact same input. If the raw model is consistent but your app isn't, the pipeline is introducing variance.

The model gets it mostly right but adds incorrect details. This is a grounding problem. The model is using the retrieved content as a starting point and then embellishing with training knowledge. Strengthen your grounding instructions and consider adding an output validation step that checks claims against the source documents.

The model refuses to answer when it should. Check your guardrails. Overly aggressive input filters or safety instructions can cause false refusals. This is application-layer: your constraints are too tight, not too loose.

The failure only appears at scale. Some issues only surface with high concurrency, long conversations, or large context windows. These are infrastructure-layer or application-layer issues (rate limiting, context truncation, session management), not model issues. Don't blame the model for your orchestration.

Test Your Understanding

Here are three scenarios. For each one, identify the layer and draft the first two sentences of the message you'd send to the stakeholder who raised the issue.

Scenario 1. Your e-commerce product recommendation agent keeps suggesting items that are out of stock. Customers are clicking through and finding "unavailable" pages. When you test the same query on the raw model with current inventory data pasted in, it correctly excludes out-of-stock items.

Scenario 2. Your internal knowledge assistant is asked "What's our parental leave policy?" and responds with a policy from 2019. Your HR team updated the policy in 2024. You test the raw model with the 2024 policy pasted in and it answers correctly.

Scenario 3. Your AI coding assistant is asked to refactor a complex recursive function into an iterative one. It produces code that compiles but has a subtle off-by-one error in the loop termination condition. You test the same prompt on the raw model with the same code context and it makes the same mistake.

The answers are less important than the discipline. In each case, the reproduction test tells you the layer. The layer tells you the fix. The fix tells you the timeline. And the timeline is what your stakeholder actually needs to hear.

What's Next

This completes Layer 3: Product Architecture. Part 1 taught you how to design the application layer. This article taught you how to diagnose it when it breaks. Between the two, you can build an application layer that performs and debug it when it doesn't. The core insight across both: most failures that look like model problems are actually application problems, and application problems are fixable, fast, and within your control.

Step back for a moment and consider what you now have. Layer 1 gave you enough mechanical understanding to hold your own in a technical conversation. Layer 2 gave you the measurement discipline to make evidence-based product decisions. Layer 3 gave you the design knowledge and diagnostic method to build and debug the application layer. Three layers in, you can sit in a room where someone reports "the AI got this wrong," diagnose the layer in ten minutes, name the fix, estimate the timeline, and communicate all of that credibly. Six months ago, most of us would have said "let me check with the team."

The final article in this series is Layer 4: Safety and Governance. You've built the product and you know how to debug it. Now, what happens when it fails in ways that matter beyond your backlog? When a regulator asks how your model was validated, when a customer discovers bias in your outputs, when an incident lands and your response determines whether the customer stays or leaves. That is what Layer 4 prepares you for.

Context Engineering: What Comes After Prompt Engineering

Steve James — Mon, 04 May 2026 00:00:00 GMT

Your AI support agent is telling customers they have 90 days to return items. Your actual policy is 30 days. A customer tried to return something on day 45, was told by a human agent that it was too late, and is now furious because "your AI told me I could." The model didn't hallucinate the 90-day figure out of nowhere. Something in the system fed it the wrong information, or failed to feed it the right information, or buried the right information where the model couldn't attend to it. Before you can diagnose what went wrong (which we'll cover in the next article), you need to understand how the agent's context was assembled in the first place.

This is Layer 3 of the AI Fluency framework: Product Architecture. Where Layer 1 explained how models work and Layer 2 discussed how to measure whether they're working, Layer 3 is about the application you build around the model: the retrieval pipeline, the guardrails, the orchestration.

Every product team has access to the same frontier models. The application layer, and specifically the decisions you make about what information reaches the model, is what determines whether your product is useful or not. That is what this article is about. I've split the layer across two articles. This one covers design: what information should reach the model, how it should be structured, and why those decisions belong to you. The next covers diagnosis: what to do when it breaks.

This article builds on the foundations from Article 2, so a quick reminder. A model processes text as tokens: roughly one token per four characters of English. The context window is everything the model can see at once: system prompt, conversation history, retrieved documents, and the current query. It is a hard limit, measured in tokens. Everything described in this article competes for space inside that window. And models do not attend equally to all of it: information at the beginning and end of the context is recalled more reliably than information buried in the middle. If those concepts are unfamiliar, read Article 2 first.

From Prompts to Context Engineering

Prompt engineering was about phrasing: "Use a step-by-step approach." "Think like a financial expert." "Format your output as JSON." All of it happened in the final prompt the model saw.

Context engineering asks a different question: what information should enter the model's context window at all? Not how to phrase the question, but what you think it should know in order to be able to answer your question or carry out your task. Retrieved documents. Conversation history. Tool descriptions. Strategic context about your business. Negative examples showing what not to do. All of this happens before the final prompt is written, and all of it competes for the same limited token budget.

A perfectly worded prompt with the wrong context produces poor results. A mediocre prompt with excellent context often outperforms it. The quality ceiling of your AI product is set by what reaches the model, not by how you phrase the question.

Why This Is a Product Decision

GitHub Copilot's quality depends on which files it pulls into context. Open the right files in your editor and it writes code that fits your codebase. Close them and it produces generic snippets that could belong to any project. The model is the same either way. Someone at GitHub decided that "currently open files" is a useful proxy for "what is relevant to this developer right now." That is a product question, not an infrastructure question.

Customer support bots illustrate it at the simplest level. A bot that knows your order history, your subscription tier, and your last three support tickets feels like it understands you. A bot running the same model without that context asks you to repeat your account number for the third time. Same model. Different context. Completely different experience.

The pattern holds everywhere you look: Perplexity's search quality lives in its retrieval, not its model. ChatGPT's memory feature is a set of product decisions about what facts to persist and when to surface them. In every case, the model is a commodity. Everyone has access to the same frontier models. What makes one product better than another is what reaches the model before it generates.

Context engineering is a product decision because it requires knowing what "relevant" means for your users in your domain. Engineering builds the pipeline. The PM defines what should flow through it.

It is also a prioritisation problem. The context window is a fixed token budget. Should you spend 2,000 tokens on detailed instructions or 2,000 tokens on more retrieved documents? Should you include the last 10 messages of conversation history or compress them to 3? These are trade-offs with direct quality, cost, and latency implications: the same kind you make when you prioritise features for a sprint, except the currency is tokens instead of engineering days.

Model providers publicise their context windows like a spec war: 200k tokens, 1 million tokens, bigger every quarter. It is tempting to assume that context engineering will stop mattering once the window is large enough to fit everything. It won't. Bigger windows cost more per request, respond slower, and attend less reliably to information buried in the middle. Researchers call this degradation pattern context rot: model performance declines as the window fills, and the pattern of what gets lost shifts depending on how full the window is. Below 50% capacity, the model loses information in the middle. Above 50%, it starts losing the earliest tokens entirely, favouring only the most recent input. A 200k-token window filled indiscriminately performs worse than a 30k-token window filled with the right information in the right order. The constraint was never the size of the window. It is the relevance of what you put in it.

The Five Types of Context

Everything that fills a context window falls into one of five categories. Each competes for the same limited space. Each is a product decision.

  Context Window (limited tokens)
  ════════════════════════════════

  1. Instructions    ~500-1500 tokens
     Role, objective, constraints, format

  2. Examples         ~200-500 tokens per example
     Positive and negative demonstrations

  3. Knowledge        Variable
     Domain facts, task context, structured data

  4. Memory           Varies
     Short-term (chat history), long-term (stored)

  5. Tools            ~50-300 tokens per tool
     Function definitions, parameter specs

  All five compete for the same token budget.
  Every token spent on one is unavailable to others.

1. Instructions

Instructions tell the model who it is and what it should do. They include:

Role: "You are a market research analyst." Encourages the model to adopt a persona and reasoning pattern.
Objective: Why this task matters. What success looks like. A perfectly-written objective improves the model's autonomy by giving it strategic context, not just mechanical steps.
Requirements: The steps, conventions, constraints. "Respond in JSON. Always cite sources. Do not speculate."

Instructions are non-negotiable. Every system prompt needs them. But instructions are also expensive: a detailed system prompt easily consumes 500-1500 tokens. The tradeoff is simple and unavoidable: the more you specify, the less room you have for context.

2. Examples

Examples show the model what you want by demonstrating it. Positive examples ("here's a good answer"). Negative examples ("here's what to avoid").

Examples are more powerful than instructions alone. Three to five well-chosen examples often outperform ten pages of written guidance. But they cost tokens: each example might be 200-500 tokens.

The product decision is not whether to include examples, but which examples matter most for your use case. An example that shows the model how to cite sources correctly prevents hallucination more reliably than an instruction that says "always cite sources."

3. Knowledge

Knowledge is external context: domain information, business strategy, market facts, system architecture, workflow procedures, structured data from your database.

A customer support agent needs knowledge about your return policy. A recruitment assistant needs knowledge of your company values. A contract reviewer needs access to your standard terms.

Knowledge fills the space between instructions (which are generic) and the specific task (which is user-generated). It's the most variable category, because different use cases need different knowledge. A financial advisor needs market data. A healthcare chatbot needs clinical guidelines. A product manager assistant needs your roadmap and strategy docs.

4. Memory

Memory divides into short-term (conversation history within a session) and long-term (facts and preferences stored in a database and retrieved when relevant).

Short-term memory is often automatic: your orchestration layer appends the last five messages. Long-term memory is a product decision: what user preferences should you remember? Should you save conversation summaries? Do you store extracted facts that help the model make better future decisions?

Memory enables continuity. Without it, each interaction is stateless. With it, the model can reference earlier conversations, build on previous context, and adapt to individual users.

5. Tools

Tools are function definitions that the model can invoke. A tool description is a micro-prompt: name, description of what it does, parameter definitions, examples of use.

A tool description is product specification work. The difference between an agent that reliably calls the right tool and one that thrashes between wrong tools is often the clarity of the tool description.

Too many tools create noise. Forty-nine tools in an Atlassian MCP server consume 1,387 tokens just describing them. If your agent only needs the "create ticket" function, the other forty-eight tools are pure waste. Limiting tools to only what an agent needs is a core context engineering decision.

Information Retrieval Techniques

Before you assemble context, you need to retrieve it. This is where the first layer of intelligence happens.

Query rewriting: An LLM rephrases the user's question before retrieval. "What's your return policy for items bought in-store?" becomes "Return policy physical retail locations." Better queries retrieve better documents.
Hybrid RAG: Combines dense vector search (semantic similarity) with sparse keyword search (exact matches). A question about "vehicle collisions" finds documents about "car accidents" through vectors, and finds documents about "traffic incidents policy" through keyword matching. Most production systems use hybrid.
HyDE (Hypothetical Document Embeddings): Generate a hypothetical answer to the user's question. Turn that answer into an embedding. Use it to search for real documents similar to the hypothetical answer. Often retrieves more relevant chunks than the raw query.
Adaptive RAG: Make retrieval self-correcting. If the first retrieval round returns low-confidence results, try again with a rewritten query. If the model later indicates it needs more information, retrieve again.
Agentic RAG: Instead of one fixed retrieval step, let the agent decide when to retrieve, what to retrieve, and how to use it. This adds flexibility but also complexity and cost.

For product managers, these are not just technical options. They are quality levers. Invest in smarter retrieval and you can reduce the volume of context you need to stuff into the prompt. Invest in query rewriting and you may not need a more expensive embedding model.

Context Assembly Techniques

Retrieval finds candidates. Assembly turns candidates into the context that actually reaches the model. This is where the second layer of intelligence happens.

Filtering: Only include documents matching criteria. Last 30 days. Relevant department. High confidence. Filters reduce noise.
Deduplication: Remove near-duplicates. Embeddings tend to return similar items multiple times. Deduping saves tokens.
Re-ranking: Sort retrieved results by true relevance, not just embedding similarity. A cross-encoder model scores each chunk. You retrieve top 20, re-rank, then use top 5. Re-ranking is expensive but often worth it; it improves quality more than retrieving more documents.
Scoring: Assign importance or priority to each chunk. Use a scoring function to decide whether to include, compress, or skip a document.
Compressing: Shorten documents while preserving key meaning. A 1000-token document becomes a 200-token summary. Saves context window space.
Combining: Merge results from multiple sources. User profile from CRM. Support history from Zendesk. Product docs from Confluence. All assembled into one coherent context.
Splitting: Distribute context across specialised models. One model for legal review. Another for market analysis. Another for technical evaluation. Each gets only what it needs.
Chunking: Break documents into pieces before storage (this happens at ingestion, not assembly, but it affects everything downstream). Too-small chunks miss context. Too-large chunks waste tokens and miss precision.
Chunk stitching: Reconstruct logical flow when you retrieve fragmented pieces. The model sees coherent narrative, not disjointed blurbs.
Templating: Wrap selected context in structured format. Add metadata headers: "Date: 2026-01-15, Author: Jane Smith, Confidence: High." Makes the input easier to parse.

The product insight is this: your retrieval might find the right documents, but if assembly is poor, the model sees noise. Each assembly technique is a quality lever. Filtering reduces false positives. Re-ranking improves precision. Compression saves tokens. Your job is to sequence them intentionally.

The Surface Area You Now Own

Step back and look at what this article has covered. Instructions, examples, knowledge, memory, tools: five types of context competing for the same token budget. Retrieval techniques that determine which information even reaches the model. Assembly techniques that determine how it is structured when it arrives. Tool descriptions that determine which actions the model can take.

Every one of these decisions shapes what your user experiences. The model is the same for everyone. Frontier models are a commodity; your competitors can access the same ones you can. The context is what makes your product yours. It is where your domain knowledge, your user understanding, and your product judgment translate into quality that a competitor cannot replicate by switching to the same API.

This is why context engineering is a PM skill. Not because you need to write the retrieval code, but because you need to define what "relevant" means, decide which trade-offs to make when tokens are scarce, and know enough about the mechanics to have a useful opinion when someone suggests "we should just load the whole document in."

Test Your Understanding

You have a 128k-token context window. Your system prompt is 1,000 tokens. You can retrieve 10 documents. Should you retrieve 10x1,000-token documents, or 100x100-token chunks, and why?
Your retrieval system returns semantically similar documents but the model still gives wrong answers. Name two assembly techniques that could help, and explain how.
You're designing a tool set for an agent. You have access to 15 functions but the agent only needs 3. What is the context cost of exposing all 15, and what should you do?
A colleague argues that with a 200k-token window, you should forget about RAG and just stuff entire documents into the prompt. What are three counter-arguments?

The shift from prompt engineering to context engineering is a shift in where you invest your attention. Prompt engineering optimises the last thing the model sees. Context engineering optimises everything that reaches it first.

As models improve, they become better at extracting value from context. The ceiling on quality is no longer "can the model understand this prompt," but "does the model have the right information to begin with."

That's your leverage now.

What's Next

If you've followed the series to this point, you now have three layers of the framework in place.

Layer 1 gave you the mechanics: tokens, context windows, attention, the training/inference distinction. Layer 2 gave you measurement: golden datasets, eval hierarchies, regression suites, the eval flywheel. This article, the first half of Layer 3, gave you the application layer: the five types of context, retrieval and assembly as quality levers, and the recognition that these are product decisions, not engineering details.

What you don't yet have is what to do when it breaks. The returns policy failure from the opening of this article is still undiagnosed. You know the context was wrong, but you don't yet have a repeatable method for tracing the failure to a specific component and routing the fix to the right team.

That is the next article: Layer 3, Part 2. A diagnostic discipline for figuring out whether a failure lives in the model or your pipeline, and how to turn that diagnosis into a credible stakeholder conversation.

The Eval Gap: Why Most AI Products Ship Blind

Steve James — Sun, 26 Apr 2026 00:00:00 GMT

Two days after shipping the latest model update to production, the complaints start trickling in. A customer's workflow that was working fine last week is now returning hallucinated case citations. A legal summary that used to be precise is now adding unsupported caveats. A contract extraction task is occasionally returning fields in the wrong order.

You had no warning. No eval suite caught it. The model provider had silently updated their checkpoint and your product quietly degraded.

This is the norm, not the exception.

You shipped blind.

This is Layer 2 of the AI Fluency framework: evaluation. And it is the biggest gap most product managers have. Not because evals are mysterious or require a PhD in statistics. Because most teams don't have them at all, or have them in fragmented, ad hoc forms that don't catch real failures.

If Layer 1 (How Models Work) is about knowing what you're building with, Layer 2 is about knowing whether it works.

You cannot make evidence-based product decisions without it. You cannot measure the impact of a model upgrade, a prompt change, or a retrieval improvement. You cannot set realistic expectations for stakeholders. And you will learn about regressions from angry users, not from a test that runs before deployment.

Why Evals Are a PM Skill, Not Just an Engineering Problem

The scarcest skill in AI product management right now is not prompt engineering. It is evaluation. And most PMs do not have it.

You are accountable for quality. If you don't have evals, you are flying blind. That should make you uncomfortable.

When engineering says "the new retrieval pipeline looks better," you need to ask: better according to what measurement? If they can't answer quickly, you don't have evals. When an executive asks whether upgrading the model is worth the cost, you need data. If you don't have historical eval runs on your golden dataset, you're guessing.

Evals are how you:

Prioritise: should I invest in improving the prompt, the retrieval, or the model?
Set expectations: "we're 85% accurate on the golden dataset, with these specific failure modes"
Make evidence-based decisions: this model change improved faithfulness by 3 percentage points, and here's what that means for users
Catch regressions before shipping: run evals before deployment, not after complaints arrive
Compound quality over time: every bug that reaches production becomes a regression test, preventing it from happening again

Without evals, every claim about quality is opinion. With them, you have the data to defend or challenge it.

The Evaluation Hierarchy: Four Layers from Cheap to Reliable

Not all evals are equal. Think of them as a hierarchy from fast and cheap to slow and reliable:

                     ▲  high reliability, high cost
                    ╱ ╲
                   ╱   ╲   Expert Review
                  ╱     ╲  (days · £££ · gold standard)
                 ╱───────╲
                ╱         ╲  Human Spot Check
               ╱           ╲ (hours · ££ · catches edge cases)
              ╱─────────────╲
             ╱               ╲ LLM-as-Judge
            ╱                 ╲(minutes · £ · scalable, needs calibration)
           ╱───────────────────╲
          ╱                     ╲ Automated Unit Tests
         ╱                       ╲(seconds · ~free · narrow, brittle)
        ▼  low reliability, low cost

You don't pick one. A healthy eval system uses all four, with expensive methods calibrating cheap ones.

Level 4: Automated Unit Tests. Seconds to run, nearly free. Test narrow, deterministic cases. Catch obvious format violations: does the output have the required JSON fields? Is the extracted date in the right format? These are your canary tests: they detect catastrophic failure quickly.
Level 3: LLM-as-Judge. Minutes to run, cheap at scale. A second language model scores the outputs from your production model. The judge reads the input, the model's response, and optionally a reference answer, then scores on a specific dimension: faithfulness, relevance, tone, whatever you specify. Scales to thousands of examples. Requires calibration.
Level 2: Human Spot Checks. Hours to days to run, expensive. Humans read through a sample of outputs and score them. This is where you catch what automation misses. Catches edge cases, subjective judgments, domain-specific context that an LLM judge might miss.
Level 1: Expert Review. Days to weeks, most expensive, gold standard. For high-stakes failures or major releases, you bring in domain experts (lawyers, doctors, etc.) to comprehensively assess the output.

Expert review is also the rarest. Most companies simply cannot afford to keep domain specialists on retainer for ongoing eval work. A medical AI product needs clinicians to review outputs; a legal AI product needs practising lawyers. These people are expensive, scarce, and have day jobs that aren't reviewing your model's outputs. The companies that do manage to build sustained expert review into their eval process tend to develop a substantial quality moat. Their golden datasets are richer, their failure modes are better understood, and their products improve faster because they have ground truth that competitors are guessing at. If you can find a way to get even occasional expert review, even a few hours a quarter on your hardest cases, it is worth disproportionate investment.

A healthy production eval system runs Level 4 continuously (every commit, every model update). Level 3 runs at every deployment candidate. Level 2 runs periodically, maybe weekly or monthly. Level 1 runs for major releases or after quality incidents.

The Golden Dataset: Your Foundation

I learned this one the hard way. I built a tool that monitors Reddit threads across subreddits relevant to my product space, tracking commentary about my product and its competitors and scoring sentiment across four dimensions: usability, pricing, accuracy, and reliability. The scores feed an interactive dashboard and trigger alerts when sentiment shifts sharply.

The problem was consistency. The same type of comment would score differently across runs. A frustrated post about pricing might register as moderate negativity one day and severe the next. I spent time tuning the prompt, adjusting the scoring rubric, tweaking the model parameters. Nothing stuck.

The fix was embarrassingly simple: I didn't have a golden dataset. The model had no baseline for what a "strongly negative usability comment" actually looked like versus a "mildly negative" one. I spent a few hours manually curating a set of real Reddit comments with hand-scored classifications across all four dimensions. Once the model had concrete examples to calibrate against, consistency improved dramatically.

That is the golden dataset in miniature. Without it, you are asking the model to invent a standard. With it, you are giving it one.

The golden dataset is the spine of your eval system. It is a curated set of (input, expected output) pairs representing the core tasks your product performs.

A good golden dataset has these properties:

Representative: covers the distribution of real user queries, not just the easy cases that your model already handles well
Diverse: includes edge cases, adversarial inputs, underrepresented scenarios that are likely to trip up the model
High quality: expected outputs were produced or verified by domain experts, not pulled from model outputs
Versioned: tracked in version control; every change is documented
Sized right: large enough to be statistically meaningful (100-500 examples for most tasks) but small enough that you can manually review it

How do you build one? Start with production: collect real user queries (with privacy protections). Don't use the model's previous outputs as ground truth; that is circular reasoning. Instead, have domain experts produce or verify the correct outputs. Then deliberately add hard cases: queries designed to provoke hallucinations, ambiguous inputs, edge cases. Make sure you cover different task types, user segments, domains.

The anti-patterns are worth naming because teams fall into them constantly:

Sampling only from easy cases: if your golden dataset over-represents queries the model already handles well, your evals will be optimistic and your failure modes invisible
Using model outputs as ground truth: this is circular reasoning; the model confirms its own correctness
Freezing the dataset: as user behaviour evolves, a static eval set becomes less representative; review and refresh it regularly

Once you have a golden dataset, it becomes your regression baseline. Every model update, every prompt change, every retrieval improvement gets evaluated against it before shipping.

LLM-as-Judge: How It Works and Why It Drifts

LLM-as-judge is the most important scalability mechanism in AI evaluation. Here is the pattern:

Your production model generates a response to an input.
A judge prompt sends the input, the response, and optionally the expected output or retrieved context to a judge model (typically a frontier model, something more capable than the model under evaluation).
The judge returns a score, a classification, or a rationale.

That's it. And it scales to thousands of examples at a fraction of the cost of human review.

But judge quality depends entirely on the judge prompt. Don't ask for general "quality"; ask for specific dimensions. A good judge prompt:

Evaluates one dimension at a time: faithfulness, completeness, schema compliance, or whatever matters most. Combining dimensions in a single prompt muddies the signal.
Provides a rubric: the judge needs to know what constitutes a 1 vs. a 4, with concrete examples of each score level.
Asks for chain-of-thought reasoning before scoring: this forces the judge to justify its rating, which reduces arbitrary scoring.
Includes a reference answer when available: giving the judge the expected output makes comparison explicit rather than leaving it to infer what "correct" means.

Judge models also have known biases:

Verbosity bias: rating longer outputs higher, independent of quality
Position bias: favouring the first option presented in a comparison
Self-enhancement bias: rating the model's own outputs higher than they deserve
Sycophancy: agreeing with the framing in the eval prompt without scrutiny

So you must calibrate. Compare judge scores to human scores on a sample of outputs. Measure agreement using Cohen's Kappa or Spearman correlation. Identify systematic biases. Document the margin of error. For teams ready to go further, Parloa Labs' research on Bayesian A/B testing for AI agents shows how hierarchical statistical models can account for variation across conversation types and reduce the risk of overinterpreting small samples.

The trap many teams fall into is treating LLM-as-judge as the entire eval system rather than one layer of it. It is tempting because it scales so easily: you can score thousands of outputs overnight for a few dollars' worth of API calls. But scale is not the same as reliability. An LLM judge is still a language model, with all the failure modes that implies. It can confidently score an output as faithful when a domain expert would immediately spot the hallucination. It can miss subtle reasoning errors that a human reviewer catches in seconds. It can systematically overlook an entire class of failure because nothing in the judge prompt draws attention to it.

LLM-as-judge works best as a screening layer: fast, cheap, and good enough to flag candidates for closer inspection. It does not replace human review; it reduces how much human review you need. Teams that skip the calibration step, or that never run human spot checks against their judge scores, end up with an eval system that tells them everything is fine while quality quietly degrades. A calibrated LLM judge is powerful. An uncalibrated one is a false sense of security.

The Eval Flywheel: How Quality Compounds

This is the pattern that separates mature AI teams from ones that ship blind:

        Ship feature ──► Collect feedback ──► Add hard cases
             ▲              & production        to golden
             │              samples             dataset
             │                                    │
             │                                    ▼
        Ship fix with  ◄── Identify & fix  ◄── Run evals
        new regression      failures            on golden
        test                                    dataset

Every time you complete this cycle, your product gets better and your eval suite gets more comprehensive. The golden dataset grows. The regression suite grows. Your understanding of failure modes deepens. Quality compounds over time.

Teams that skip this cycle ship features once and hope for the best. Teams that follow it ship features, learn from real-world behaviour, add the hard cases that tripped them up to the eval suite, and prevent the same failure from happening again. After six months, the difference in product quality is dramatic.

But this only works if evals are fast enough to run on a cadence. If running evals takes three days and requires booking someone's calendar, it won't happen. Build for speed. Your regression suite should run in under 30 minutes. It should be runnable as a pre-deployment gate.

Regression Suites: Treating Every Bug as a Test

This is not a new idea. High-performing software teams have used automated regression suites for years: every bug that reaches production becomes a test case, and the suite runs before every deployment to make sure you never ship the same failure twice. The principle is identical for AI products. The difference is what you are testing: not whether code executes correctly, but whether model outputs meet a quality bar.

Every major bug that reaches production should become a regression test.

A regression test is a specific example from your golden dataset (or a new example you create) designed to catch a particular failure mode. When a user reports a quality issue, your team does root cause analysis. Was it a prompt problem? A retrieval problem? A model problem? Once you know the cause, you create a test case that reproduces the failure and add it to the regression suite.

Then you fix the issue, run the test to confirm the fix works, and leave the test in the suite permanently. Next time you update the model, change the prompt, or modify the retrieval pipeline, that test runs automatically. If the regression suite fails, the deployment is blocked. This is your CI/CD gate for AI quality.

This is critical because AI regression happens silently. Model providers update their models without notice. A checkpoint update under the same name can degrade your product without any code change on your side. A prompt modification intended to improve one dimension might accidentally hurt another. A retrieval pipeline change might start returning worse documents.

If you run evals only at release time, you will miss degradations that happen between releases. Run the regression suite on a schedule: weekly, fortnightly, monthly, depending on your risk tolerance. If a regression suite fails, treat it as a production incident. Figure out what changed, whether to rollback or fix, and communicate clearly to stakeholders: which capability degraded, how severe, who is affected, what you are doing about it.

How to Start: The Minimum Viable Eval Setup

You do not need perfection to get started. You need enough signal to catch obvious regressions.

Pick your core task. For a legal AI product, it might be contract clause extraction. For a research assistant, it might be answer quality on a specific domain. For a customer service bot, it might be whether the response is safe and on-topic.

Create a golden dataset of 50-100 examples. Real user queries where possible. Have one or two domain experts produce or verify the correct outputs. Include at least 10-15 deliberately hard cases: adversarial inputs, edge cases, things the model might hallucinate about.

Write a judge prompt for one key evaluation dimension: faithfulness, correctness, format compliance, whatever is most critical for your use case. Run the judge on your golden dataset. Compare the judge's scores to a sample of human scores. Document the calibration.

Create a regression suite of 20-30 cases: your most important examples plus any cases where the model previously failed.

Add this to your CI/CD pipeline. Run it before every deployment. If any dimension drops below your threshold, block the deployment and investigate.

You also don't need to build all of this from scratch. Several platforms exist specifically for LLM evaluation and can accelerate setup significantly. Braintrust offers managed eval pipelines with deployment gating and a generous free tier. Langfuse is open source and self-hostable, with built-in support for LLM-as-judge scoring, dataset versioning, and human annotation queues. Arize Phoenix is strong on agent-level tracing and observability. For teams that want a CLI-first approach, Promptfoo is lightweight and good for red-teaming. The right choice depends on your stack and whether you need managed infrastructure or prefer to self-host, but any of them will get you running evals faster than building a bespoke pipeline from zero. This space moves quickly; these recommendations are accurate as of April 2026.

That is enough to start. From there, the flywheel kicks in. Every failure that reaches production becomes a new regression test. Your golden dataset grows. Your evals get richer. Within three months, you have a system that catches 80% of regressions before they reach users.

Test Your Understanding

Your team just upgraded to a newer version of your production model. What should happen before that change goes to production, and why?
You have an LLM judge that scores your outputs on faithfulness. Human review of a sample shows the judge is consistently 15 percentage points more generous than humans. What does this tell you, and what do you do?
Your golden dataset includes 500 examples. You run evals and see a 94% pass rate. Is that good? What additional information do you need before concluding the model is production-ready?
A product manager on another team argues they don't need to invest in evals because "the model seems to be working fine." How do you make the case for eval infrastructure? What specific risks does skipping evals introduce?

Connection to the Framework

This is Layer 2 of the four-layer AI Fluency framework introduced in the first article in this series. Layer 1 gave you the mental model for how inference, context windows, and training shape what a model can and cannot do. Layer 2 is how you turn that understanding into measurement. If Layer 1 tells you that a model's behaviour is probabilistic and sensitive to context, Layer 2 is the discipline that detects when that behaviour drifts in your product.

Together, they form the foundation for evidence-based decisions about AI products: you understand the machinery, and you can measure whether it is working.

The next layer covers the architecture decisions that actually deliver quality. I'm planning to split this layer across two articles: first, how you engineer context to get reliable results in your specific domain; then, how you diagnose whether a problem lives in the model or in your application.

What Product Managers Actually Need to Know About How Models Work

Steve James — Sat, 18 Apr 2026 00:00:00 GMT

You're in a technical review meeting. An engineer mentions that "the model's context window is the real bottleneck here," and you need to know whether that's a fundamental constraint or an excuse. Two minutes later, someone else says "we should fine-tune instead of prompting," and you need to ask the right question: is that the cheapest solution, or are they reaching for a tool they know instead of thinking through the problem?

This is Layer 1 of the AI Fluency framework I introduced in the first article of this series: enough to never say something wrong in a technical conversation, and enough to ask the questions that separate clear thinking from engineering mythology.

You don't need to actually build any of this. You just need to understand it well enough to make product decisions and hold ground in conversations where technical trade-offs are real and money is at stake.

Tokens: Why They Matter More Than You'd Think

A token is the smallest unit of text a language model processes. It's not a word. The distinction matters.

Roughly: one token is about four characters of English text, or 0.75 words. A 1,000-word article is about 1,300 tokens. The reason we talk about tokens instead of words is that models handle subword units using algorithms like Byte Pair Encoding, which lets them handle unknown words, code, punctuation, and multiple languages with a single vocabulary. "Unhappiness" might become two tokens: "un" and "happiness." "iPhone" might be one.

Why you care: Token counting affects your pricing model, your cost per query, and whether a feature is economically viable at scale.

If you're building a document Q&A feature and charging per query, you need to estimate how many tokens each query will consume. If your average document is 50,000 words, that's roughly 67,000 tokens. If you include that entire document in every query, your inference cost balloons. This is why retrieval systems exist: to send only the relevant chunks, not the whole document. Understanding tokens forces you to think about chunking strategy early, not as an afterthought when costs explode.

Test your thinking: if a feature needs to include 100 customer support tickets as context (to handle multi-turn support conversations), and each ticket averages 200 tokens, what's your per-request cost at current API prices? If that number makes your pricing model break, you now know retrieval is not optional.

Context Windows: The Hard Limit

The context window is everything the model can see at once. It includes the system prompt, conversation history, any retrieved documents, and the current user query. The model has no memory outside of it. This is a hard architectural constraint, not a software limitation you can engineer around.

Claude Sonnet 4.6 offers a 1 million-token context window (with a 200K standard tier). Gemini 3.1 Pro also offers 1 million tokens. GPT-5.4 exposes 272,000 tokens by default and up to 1 million via the API. These numbers change rapidly, and they matter for your product architecture in three ways.

First, cost. Larger contexts cost more to process, both in tokens consumed and in time to generate the first token. A 1 million-token context doesn't help your user experience if their first response takes ten seconds.

Second, quality. Models don't attend equally to all parts of a long context. Research shows a "lost in the middle" phenomenon: information buried in the middle of a very long context is recalled less reliably than information at the beginning or end. If you're stuffing 100 retrieved documents into a context window, the model may ignore the most relevant ones if they appear in the middle. This makes retrieval quality more important than raw window size. A smaller, well-chosen set of documents often outperforms a larger, unsorted dump.

Third, architectural decisions. Large context windows invite a kind of laziness in product design: just fit everything in and let the model sort it out. That rarely works. Instead, context limits drive you towards smarter retrieval, better chunking strategies, and summarisation pipelines that compress long documents before inclusion. These constraints often improve your product.

The mistake to avoid: "Bigger context window = better product." It usually doesn't. Retrieval quality matters more than raw window size. And if someone suggests using a 1 million-token context to avoid building a retrieval system, push back.

Training vs Inference: Where the Fix Lives

A language model has two distinct phases that are easy to conflate but very different in cost, time, and product implication.

Training is the one-time, massively expensive process where a model learns from data. Thousands of GPUs run for weeks or months. Parameters (the model's internal weights) get updated billions of times. The output is a large file encoding everything the model learned. Training a current frontier model (the Claude 4 series, GPT-5 series, Gemini 3 series) runs into the hundreds of millions of dollars, and the trend is upward. Training creates a hard knowledge cutoff: the model only knows what was in its training data, up to a specific date.

Inference is using the model. That is the whole idea. Training produces a finished object: a large file of learned patterns, sitting on a disk somewhere. Inference is running that file against a new input to produce an output. The model itself does not change. It is a finished thing being used.

An analogy that might help it stick. Training is the years a chef spent developing a cookbook: experimenting, failing, adjusting, and eventually writing everything down. Inference is you cooking dinner from that cookbook tonight. You are not rewriting the recipes. You are following the instructions already there, with your ingredients, to produce a meal. Every ChatGPT message, every API call, every autocomplete suggestion is someone cooking dinner. The cookbook itself never changes.

For a language model specifically, that means your prompt flows in, and the model generates output one token at a time, each new token conditioned on everything that came before. This is called autoregressive generation, and three things follow from it that you cannot ignore.

First, latency is dominated by output length, not input length. A 50-token prompt that produces a 1,000-token answer takes roughly 20x as long to generate as the same prompt with a 50-token answer. This is also why streaming interfaces feel responsive: tokens appear as they are produced, so the user sees progress instead of staring at a spinner. Without streaming, a long response feels broken even when it is not.

Second, there are two different latency metrics and they have different causes. Time to first token (TTFT) is how long the user waits before anything appears. It is dominated by prompt length, because the whole input has to be processed before generation starts. Tokens per second (TPS) is how fast the response streams once it starts. It is dominated by output length and the model's decoding speed. A product can have fast TTFT and slow TPS, or the reverse, and the user experiences these as two different problems. Know which one is biting you before you try to fix it.

Third, inference is stateless. The model has no memory between calls. If you are building a chat product, every turn sends the entire conversation history back through the network, and you pay for all of it, every time. This is why a long chat becomes expensive linearly, and then, if you are not careful, quadratically once the context gets large enough that the model slows down. Summarisation pipelines, context truncation, and prompt caching exist to blunt this curve.

On pricing: input tokens and output tokens are priced separately, and output is typically 3 to 5 times more expensive. This quietly shapes architecture. RAG systems pump up your input count, which is cheap. Long generations pump up your output count, which is where the bill lives. A chatbot that produces concise answers is not just better UX, it is better unit economics.

On caching: in 2026, every major provider offers some form of prompt caching, where a prefix you reuse across requests (your system prompt, a retrieved document, a large code file) can be charged at a fraction of the normal input rate, sometimes as low as 10 percent. For conversational products this changes the economics materially. If you have a 5,000-token system prompt, you almost certainly want it cached. If you are sending the same retrieved documents across a multi-turn conversation, you want those cached too. This is a standard PM question to ask your engineers: are we actually hitting the cache?

Most of the product decisions you make as a PM are about inference, not training. Training is where capability comes from. Inference is where your unit economics, your latency, and most of your visible user experience live.

The consequence: if a user says "the model gave wrong information," the fix could be any of these, each with different timelines and costs:

A prompt change (inference-time, immediate, cheap)
A retrieval fix via RAG (inference-time with moderate complexity)
A fine-tuning run (training-time, days to weeks, expensive)
Full retraining (training-time, months, very expensive)

Understanding which bucket the problem falls into determines the timeline and cost you communicate to stakeholders. A prompt change is an hour. A fine-tuning run is a project. These are not equivalent.

Why hallucinations happen about recent events is also now obvious: the model's training data has a cutoff. Without RAG to inject fresh information at inference time, the model has literally no way to know about last week's news. This is why every serious AI product that touches current information needs retrieval.

Also: when an API provider updates their model (GPT-5.3 to GPT-5.4, or Claude Sonnet 4.5 to 4.6), this is a new training run. Behaviour can shift in unexpected ways. Previously-passing evaluations may fail. This is why you need evaluation suites to detect when a model update breaks your product before it reaches users.

The Three Levers: Prompting, RAG, and Fine-tuning

When you want a model to behave differently or know more, you have three primary levers. Picking the wrong one is one of the most expensive mistakes a PM can make.

Prompting means shaping behaviour through the input: system prompt, user message, few-shot examples. It's immediate, cheap, reversible, and requires no new data. But it consumes context window space, can be fragile (the model can be distracted from instructions by conflicting content), and is bad for injecting large bodies of knowledge.

Use prompting for style, format, tone, and output structure. It's your first iteration tool. Change a prompt on a Tuesday and see results on Wednesday. Iterate until you're hitting quality targets.

Retrieval-Augmented Generation (RAG) injects relevant information from an external knowledge base at query time. You embed documents into vectors, store them in a vector database, and at query time you retrieve the most similar chunks and insert them into the prompt. The model then generates a response grounded in that context.

Use RAG when your knowledge base changes frequently (news, documentation, internal data), when factual accuracy matters, when the knowledge is too large to fit in a context window, or when you're handling proprietary data that shouldn't appear in training. RAG is updatable without retraining: just update the index.

The catch: retrieval quality is a bottleneck. The model can only be as good as what it retrieves. Poor chunking strategies break context. The pipeline has multiple failure points.

Fine-tuning continues training a pre-trained model on task-specific data. It's expensive and slow to iterate, requires high-quality labelled data (often hundreds to thousands of examples), and carries the risk of "catastrophic forgetting" where the model's general capabilities degrade. But fine-tuning deeply encodes behaviour into the model's weights. It requires no large system prompt at inference time. For narrow, well-defined tasks with stable patterns, it can be very effective.

Use fine-tuning when you have a stable task with clear input/output patterns, sufficient high-quality training data, and when prompting alone isn't hitting your quality targets.

Here's the decision framework:

What are you trying to change?
        │
        ├── Knowledge (facts, data, documents)
        │       └── Use RAG
        │
        ├── Style, format, or tone
        │       ├── Try prompting first
        │       └── Fine-tune only if prompting fails
        │             AND you have labelled data
        │
        └── Deep behavioural patterns
                ├── Do you have 500+ quality examples?
                │       ├── Yes → Fine-tune
                │       └── No  → Better prompting + examples
                └── Is the task stable and well-defined?
                        ├── Yes → Fine-tune may be worth it
                        └── No  → RAG + prompting (iterate faster)

Most PM instincts that reach for fine-tuning first are actually knowledge problems (RAG) or style problems (better prompting) in disguise.

The mistake to avoid: using fine-tuning to inject facts into a model. Fine-tuning teaches behaviour and style, not new knowledge. You'll waste time and money. Use RAG instead.

Embeddings and Vector Search: The Engine Behind Retrieval

An embedding is a numerical representation of text as a vector: a list of floating-point numbers in a high-dimensional space. The key property: semantically similar content produces numerically similar vectors. This is geometric. It's why you can search for "vehicle collision" and find documents about "car accidents" even if those exact words never appear.

This geometric property is the entire reason RAG works.

A semantic search (vector search) finds the closest vectors to a query vector. Unlike keyword search, which matches exact words, vector search matches meaning. "Car accident" finds "vehicle collision" naturally. Typos are handled better. Long-form meaning works well. The trade-off: you can see keyword matches ("there's the word!"), but vector similarity is more opaque ("it's close in embedding space... somehow").

The quality of your embedding model is the ceiling on how good your retrieval can ever be. If the embedding model doesn't understand your domain well, retrieval fails, and the generation model can't fix bad retrieval. This is the garbage-in, garbage-out problem of RAG systems.

For specialised domains (legal, medical, financial), general-purpose embedding models often underperform. The terminology has different meanings. The training data under-represents the domain. This is why domain-specific embedding models exist (e.g., voyage-law-2 for legal text). A smaller, domain-specific model often outperforms a larger, general-purpose one.

In production, the best retrieval systems combine vector search with keyword search (hybrid search). Vector search catches conceptual matches and paraphrase. Keyword search catches exact matches and rare terms that embeddings might miss. Fusion algorithms combine both result sets.

The decision to make: For your specific domain, is a general-purpose embedding model sufficient, or do you need a domain-specific one? If you're handling legal documents, medical records, or specialised technical content, test both and measure. The cost difference is usually small; the quality difference can be large.

Attention and Transformers: Why Position Matters

Transformers (the architecture behind all modern LLMs) work by letting every token attend to every other token and deciding which ones matter. That's "attention." Everything you care about flows from this mechanism.

In the sentence "The lawyer filed the brief because she thought it was incomplete," the token "she" needs to know it refers to "lawyer," not "brief." Attention does this: it lets "she" attend strongly to "lawyer" and weakly to everything else.

More importantly for your product: attention is position-sensitive. Models reliably attend to the beginning of the context (system prompts and early instructions) and the end of the context (the current query), but poorly to the middle. This is the "lost in the middle" phenomenon.

If you have ten retrieved documents in a RAG system, the model may not attend equally to all of them. The most relevant document should be placed at the beginning or end of the context, not buried in the middle.

This matters for prompt design. If you're crafting a system prompt with instructions, put the most critical instructions first and last. If you're building a RAG system, your retrieval ranking algorithm should decide not just which documents are relevant, but where in the context they'll be placed.

Also: attention requires every token to attend to every other token. That's O(N²) complexity. This is why longer contexts are disproportionately more expensive. This is why extending context windows is hard. This is why a 1 million-token context required serious architectural innovation.

The intuition: the cost of processing a sequence doesn't scale linearly. It scales quadratically. Double your context window, and you quadruple the computation.

Temperature and Sampling: The Creativity Dial

After processing your prompt, a language model doesn't simply decide the next word. It produces a probability distribution over its entire vocabulary: a score for every possible next token. Temperature controls how the model samples from that distribution.

Temperature	Behaviour	Best for	Avoid for
0.0	Deterministic. Always picks the top token.	Structured data extraction, evals and auditing, safety-critical features	Creative tasks where variety matters
0.3 to 0.5	Mostly predictable, with light variation	Factual Q&A, summarisation, classification	High-variance creative output
~1.0 (default)	Balanced between likely and surprising tokens	General chat, drafting, conversational UX	Tasks requiring reproducibility
1.2 to 1.5	Creative, surprising, exploratory	Brainstorming, marketing copy, ideation	Safety-critical features, evals, auditing
1.8 to 2.0	Chaotic, often incoherent	Rare experimental use	Almost everything in production

Temperature = 0 means greedy decoding: always pick the highest-probability token. Outputs are deterministic. The same prompt always produces the same output.

Temperature = 1.0 (default) samples from the distribution. Some variability, but the most likely tokens are still favoured.

Temperature > 1.0 flattens the distribution. Lower-probability tokens become more likely. Outputs are more creative, surprising, and random.

Temperature doesn't make the model smarter or dumber. It only changes how adventurously the model samples from probabilities it's already computed. A less probable but correct answer is still in the distribution. Temperature affects whether it gets sampled.

For your product:

Use temperature = 0 for anything that needs to be consistent, reproducible, or auditable. Structured data extraction, factual Q&A, legal review, evaluations. You want determinism.

Use higher temperature (0.7-1.2) for creative tasks: brainstorming, marketing copy, ideation. You want variety.

One mistake: running evaluations with high temperature. If your eval results vary every run due to sampling randomness, you can't tell whether a change you made actually improved things. Eval suites must run at temperature = 0.

Another mistake: using temperature = 0.8 in a safety-critical feature (medical triage, contract review) because you want "natural-sounding" responses. Higher temperature increases the probability of unexpected outputs, including outputs that violate safety constraints. In safety-critical contexts, use low temperature.

Test Your Understanding

Here are four scenarios. Work through each one and think about what you'd ask or decide.

Scenario 1: Knowledge Cutoff Problem

A user reports your AI product gave factually incorrect information about an event that happened last month. What are the two most likely causes, and how would you investigate which one it is?

Scenario 2: Feature Architecture Decision

You're designing a customer support chatbot. Each conversation can involve multiple back-and-forths over several minutes. The model needs to remember earlier context. What would you choose: increasing the context window, or implementing a summarisation pipeline that compresses conversation history before each request? What's the trade-off?

Scenario 3: The Fine-tuning Proposal

Your engineering team proposes fine-tuning a model to "improve quality on our specific domain." You have 200 labelled examples. Is fine-tuning the right move? What questions would you ask before committing to a fine-tuning run?

Scenario 4: Retrieval Ordering

You're building a RAG system for legal document research. You retrieve five relevant documents. Based on what you know about attention and context position, how would you order them in the prompt to maximise the model's ability to use them?

Take time with these. The answers reveal whether you've internalised the concepts or just skimmed them. If any of these feel unclear, go back to the relevant section and re-read.

What's Next

This is Layer 1: how models work at a mechanical level. The next layer, which I'll cover in Article 3, is evaluation and benchmarking: how to measure whether your product decisions actually worked.

If you understand tokens, context windows, training vs inference, the three levers, embeddings, attention, and temperature, you've crossed a threshold. You can sit in a room with engineers and ask the right questions. You can push back on bad ideas. You can make decisions about architecture and trade-offs without deferring to whoever sounds most confident.

That's the point of this framework. Not to make you an ML engineer. To make you a product person who thinks clearly about AI systems.

The Four-Layer Model: A PM's Framework for AI Product Quality

Steve James — Mon, 13 Apr 2026 00:00:00 GMT

A legal AI tool summarises a Share Purchase Agreement. The summary is crisp and accurate. It flags an unusual warranty clause. But when it cites Hedley Byrne v Heller [2019] UKSC 14 as the governing authority, there's a problem: Hedley Byrne was decided in 1964, and there is no 2019 Supreme Court case at that citation. The model invented it. The customer's general counsel reads it back to you and asks a simple, devastating question: "Can we trust any of your tool's output?"

You have one hour to know what went wrong.

This is where most PMs get stuck. They see the problem and they know it's bad. But they don't have the language to move between the different layers of causation fast enough to diagnose it, fix it, and explain it.

About 6 months ago, I decided to fix that gap for myself.

How this series came about

I'm a Product Manager working in AI, and I reached a point where I realised that surface-level familiarity with the technology wasn't enough. I could talk about AI at a strategy level, but when a technical conversation got specific, when someone mentioned attention mechanisms, eval rubrics, or grounding failures, I didn't always have the depth to push back or ask the right follow-up question. That bothered me.

So I started reading everything I could get my hands on. Pawel Huryn's Product Compass became one of my most valuable sources, particularly his work on context engineering and AI agent architectures. His writing has a rare quality: it's technically rigorous without losing the product lens. If you're a PM working anywhere near AI and you're not subscribed, fix that. Aakash Gupta's Product Growth has been another consistent source of sharp thinking on where the PM role is heading as AI reshapes the discipline. Both of them are doing genuinely important work making this knowledge accessible to product people.

Beyond those two, I consumed research papers, Anthropic's and OpenAI's developer documentation, Simon Willison's blog (consistently one of the sharpest voices on the practical realities of building with LLMs), Hamel Husain's writing on evals (if you read one thing on AI evaluation, make it "Your AI Product Needs Evals"), and anything else I could find that helped me build a more complete picture. I captured all of it in an Obsidian vault that grew, over months, into a fairly comprehensive AI knowledge repository structured around a layered learning curriculum.

This series is that curriculum, rewritten for a wider audience. I'm publishing it because I think every PM working with AI needs this knowledge, and too much of it is scattered across technical papers, engineering blogs, and paywalled courses that assume you're building models rather than building products with them.

The framework I landed on organises everything into four layers. It's not the only way to structure this knowledge, but it's the one that keeps proving useful in practice, because real AI product problems don't stay in one layer. They cross all of them at once.

Why four layers?

Most PMs specialise in one layer and miss the others. An engineer might understand Layer 1 (how models work) deeply and miss Layer 2 (how to evaluate them). A quality leader might own Layer 2 without seeing Layer 3 (how architecture amplifies or suppresses failures). A compliance officer might own Layer 4 without understanding Layers 1 or 2, and so end up writing governance policies that don't address the root causes.

The real power comes from moving between them. The fabricated case citation above won't make sense if you only look at one layer. It requires understanding how next-token prediction works (Layer 1), why the eval suite missed it (Layer 2), what the prompt architecture did wrong (Layer 3), and what the customer's GC actually needs to hear (Layer 4).

Here are the four layers, stacked from mechanics to governance:

┌────────────────────────────────────┐
│  Layer 4: Safety and Governance    │
│  Trust, regulation, incidents      │
├────────────────────────────────────┤
│  Layer 3: Product Architecture     │
│  RAG, prompts, guardrails          │
├────────────────────────────────────┤
│  Layer 2: Evaluation and Quality   │
│  Evals, regression, benchmarks     │
├────────────────────────────────────┤
│  Layer 1: How Models Work          │
│  Tokens, attention, inference      │
└────────────────────────────────────┘

Layer 1: How Models Actually Work

This is the foundation layer. You don't need a PhD in transformers, but you need to understand the basic mechanics: what tokens are, why context matters, how attention works, how models generate text, and why the distinction between training and inference shapes product timelines.

Layer 1 is where you learn concepts like temperature (which controls whether a model produces the same output every time or samples from multiple possibilities), context windows (which limit how much information you can feed the model), fine-tuning (expensive but powerful) versus RAG (retrieval-augmented generation: asking the model to answer based on specific documents you feed it), and embeddings (the mathematical representation that makes semantic search possible).

When someone says "the model is hallucinating", that's Layer 1 language. It doesn't explain why, but it names what happened at the token level: the model's next-token prediction led it down a path that confabulated information not present in its input.

In this incident, the model's Layer 1 failure was straightforward. When it drafted the words "the seller's tortious liability is governed by", the next token it predicted was a case citation. This is statistically probable: English legal prose trains models to expect citations in this position. The model had seen "Hedley Byrne v Heller" thousands of times in training data associated with tortious liability. It sampled that citation, then confabulated the year and court to match the format pattern it had learned: [YEAR] COURT ABBREVIATION NUMBER.

The grounding instruction ("never cite cases not in the source material") was present in the prompt. But because of how attention works in transformers, an instruction buried in a long context has weaker weight than a deep prior learned across thousands of training examples. The model chose the prior.

Layer 2: Evaluation and Quality

This is the layer that separates PMs who control their own destiny from those who get surprised by customers.

Layer 2 is about defining quality rigorously enough that you can measure whether your model is actually doing what you promised. It covers precision, recall, and F1 scores; hallucination types (intrinsic: inventing information; extrinsic: confusing sources); the difference between faithfulness and factuality; how to build regression suites so you catch silent model degradation; and how to read benchmark claims without getting hoodwinked.

A faithfulness eval tests whether the model's answer agrees with the source material it was given. The summary passed this test: the warranty clauses it described were accurate. The citation hallucination was not detected by a faithfulness eval because the eval doesn't measure "did the model introduce an entity that wasn't in the source".

This is the biggest gap most PMs have. They focus on whether the output is good, not on whether the eval is measuring what matters. They measure happy paths and miss the failure modes. In this case, the eval suite detected accuracy, but it didn't have a slice that tested "citation introduced outside source material". That's a rubric design failure.

Layer 2 is also where you track regression. Did the model work last month and break this week? If so, the provider silently updated the checkpoint, and that's a governance conversation with your customer, not just a bug fix.

Layer 3: Product Architecture and Design Patterns

Layer 3 is how the actual product is built. It covers RAG pipelines (retrieval, re-ranking, prompt assembly), agents and tool use, the distinction between model-layer problems (the provider's job) and app-layer problems (your job), trade-offs between latency, cost, and quality, and guardrails and output validation.

This is where you learn that a prompt rule isn't an enforcement mechanism. The system prompt said "do NOT cite cases". That's Layer 3 work. But rules are suggestions to models. The model followed the suggestion most of the time, but on Friday, it decided to cite a case anyway. A rule without enforcement is a hope, not an architecture.

The fix is a guardrail: a post-generation step that detects citation patterns and verifies each one against a legal citator before returning the response. That's enforcement. No citation, no output.

Layer 3 also teaches you how problems move through the pipeline. Retrieval gives you the right chunks? Good: the problem isn't in the retrieval logic. The chunks are promoted correctly by re-ranking? Move on. The grounding instruction is in the prompt? Check. But is it at the top of the prompt (high attention) or buried in the middle (low attention)? That's architecture and it matters.

Layer 4: Safety, Ethics, and Governance

Layer 4 is what you tell the customer's general counsel, the regulator, and your board.

This layer covers alignment and constitutional AI (how models are made safe), failure mode classification (is this jailbreaking, prompt injection, sycophancy, or distributional shift?), bias and fairness, the regulatory landscape (EU AI Act, UK GDPR, US fragmentation), the difference between AI safety failures (a model making a mistake) and AI security failures (an attacker exploiting a model), and incident response.

In this scenario, the Layer 4 conversation is: "We've confirmed the issue. The model generated a citation not present in your document. This is a known failure mode. We mitigate it through three defences: a prompt rule, a post-generation verifier, and a regression eval suite. The verifier wasn't yet enabled for this feature. We're enabling it this week and pausing the feature in the meantime. You'll have a full post-mortem in five working days."

That's not "we have a hallucination problem". It's not "we're so sorry". It's: "Here's where it failed, here's how you can check we've fixed it, here's our timeline, and here's the evidence that we think about these failures systematically."

Under the EU AI Act, a contract-review tool used by qualified lawyers sits close to the "high-risk" line. The regulator doesn't expect perfection. It expects documented mitigation. A governance framework that shows you've thought about failure modes, tested for them, and have a response plan. The Layer 4 conversation is what saves the relationship.

Why It's All Four Layers at Once

Here's the critical insight, traced through the incident:

The Fabricated Citation: One Bug, Four Layers

Layer 1  │ Next-token prediction produced a statistically
(Model)  │ plausible case citation from training priors.
         │ Grounding instruction lost to stronger prior.
         │
    ─────┼──────────────────────────────────────────────
         │
Layer 2  │ Eval suite tested faithfulness but NOT
(Eval)   │ "introduced entities." Rubric gap.
         │
    ─────┼──────────────────────────────────────────────
         │
Layer 3  │ Prompt rule said "do NOT cite cases."
(Arch)   │ No enforcement guardrail behind it.
         │
    ─────┼──────────────────────────────────────────────
         │
Layer 4  │ Client's GC needs a credible incident response,
(Gov)    │ not "it's a known issue with LLMs."

Real production problems are cross-layer. You can't diagnose this incident with Layer 1 knowledge alone. Yes, the model produced a confabulation, but so what? Why did it get to the customer? Because Layer 2 missed it. Why did Layer 2 miss it? Because the eval rubric had a gap. Why didn't the gap matter less? Because Layer 3 had no enforcement layer. Why did the customer not drop you? Because Layer 4 had a governance response ready.

No single layer owns the fix. No single layer owns the blame.

The PM's job is to move between layers fast enough that by the time you're in the room with the customer, you've already diagnosed which layer each fix belongs to, what the timeline is for each fix, and which fix gets shipped first.

Consider a few other escalations:

"The AI is slower than it was last week." Is the context window longer (Layer 1 attention cost)? Is there a regression that shows when the slowdown started (Layer 2)? Did the provider update the model (Layer 3 model-layer vs app-layer)? If it's a silent update, check the customer contract for a notice requirement (Layer 4 governance).

"The eval suite passes but users say quality dropped." Your eval has a gap (Layer 2). Interview users, extract the failure class, build a new test slice. Did the provider silently update the checkpoint (Layer 1)? Did any app-layer code change: a new prompt, a re-ranker threshold, a new guardrail (Layer 3)?

"We're being attacked about hallucination rates on a public benchmark." This is Layer 2 entirely: benchmark literacy. What dataset? What prompt? Are their numbers even comparable to yours? And Layer 4: the marketing response matters. Publish your own hallucination rate, defined precisely, on a held-out domain-specific benchmark.

Every one of these incidents touches multiple layers. If your trace doesn't, you're missing something.

Fluency as Movement

The differentiator isn't knowing everything. It's knowing what you know, knowing what you don't, and being honest about the boundary. It's being able to say: "I'd need to pull the eval logs to give you a precise number, but here's how I'd think about whether precision is even the right metric for this use case."

It's moving from Layer 1 to Layer 2 to Layer 3 to Layer 4 and back again, fast enough that you can diagnose a problem before it becomes a customer crisis, fix it before your user has to call their lawyer, and communicate the fix in terms that matter to the people who depend on your product.

This is the fluency the series will build. It's what I set out to learn when I started filling that Obsidian vault, and it's what I want to make available to every PM who's feeling the same gap I felt. Each subsequent article goes deep on one layer, plus a dedicated piece on context engineering, the cross-cutting discipline that determines what information reaches the model in the first place. This is the frame you'll use to connect them.

Test Your Understanding

Before moving to the deeper layers, check yourself on these questions:

Layer 1 vs Layer 2: A model produces a grammatically perfect answer that's factually wrong. Is this a Layer 1 problem, a Layer 2 problem, or both? What would you measure to tell the difference?
Layer 3 diagnosis: You have a system prompt that says "cite only sources from the provided context". The model cites external sources anyway. Is this a prompt-writing problem or an architectural problem? What would you add to fix it?
Cross-layer incident: A feature works perfectly in the lab, passes all evals, and then fails on a customer's data. Which layers would you check first and why?
Layer 4 communication: A customer asks, "Is this a bug or is this what LLMs just do?" What's the difference you're communicating in your answer, and which layers justify your answer?

The series ahead:

Layer 1: How Models Work — tokens, context windows, attention, and the mechanics that explain why models behave the way they do
Layer 2: The Eval Gap — why evaluation is the biggest gap most PMs have, and how to close it
Layer 3: Model-Layer vs App-Layer — the diagnostic question that determines who owns the fix
Context Engineering — the cross-layer discipline of controlling what information reaches the model
Layer 4: Safety and Governance — the conversations PMs keep dodging, and why they determine whether customers stay

EU AI Act Compliance Is Already Breaking Startups: What PMs Need to Know

Steve James — Mon, 06 Apr 2026 00:00:00 GMT

I am bullish on AI. I have been for a while, and nothing I have seen in the last twelve months has changed that view. The pace of capability improvement is staggering, the application space is enormous, and the economic incentives are pulling in one direction: more AI, faster, everywhere.

But I have also spent enough years in product development to know that "move fast" without "think carefully" is how you end up with a mess that takes twice as long to clean up as it took to create. And right now, the EU AI Act is forcing a conversation that the industry has been avoiding.

It is not a future concern. It is already here.

The Reality on the Ground

The EU AI Act is not coming. It is already here, and the enforcement milestones are arriving fast:

Aug 2024 ─── Act entered into force
    │
Feb 2025 ─── Prohibitions on unacceptable AI practices took effect
    │
Aug 2025 ─── General Purpose AI model requirements took effect
    │
Aug 2026 ─── ⚠ HIGH-RISK AI SYSTEM REQUIREMENTS TAKE FULL EFFECT
    │
Aug 2027 ─── Full enforcement begins

That critical August 2026 deadline is four months away.

The penalties are not theoretical. Violations of prohibited practices carry fines of up to EUR 35 million or 7% of global annual turnover, whichever is higher. High-risk violations reach EUR 15 million or 3%. Even providing incorrect information to regulators can cost EUR 7.5 million.

For a seed-stage startup, even the "proportionate caps" designed to protect SMEs can be existential. As one compliance source put it bluntly: a EUR 140,000 fine for a seed-stage company is a death sentence.

Founders on Reddit are already discussing this. One described having to "rewrite a big part of my AI-powered chatbot to meet the new regulations." Another flagged the health AI classification problem: "the line between 'health advice' and 'medical device' is blurry and the EU is not messing around." A third described the moment their first EU customer asked about AI Act compliance and they had nothing prepared.

This is not hypothetical risk. It is operational reality.

The Two Extremes

There are broadly two perspectives in this conversation, and both have a point.

The US "Wild West" argument holds that America's relatively light-touch approach to technology regulation has been a primary engine of its dominance. Apple, Google, Amazon, Microsoft, Meta: the most valuable and influential technology companies on the planet were all built in an environment where founders could experiment, ship, iterate, and scale without navigating a compliance labyrinth first. For fifty years, US tech companies have led the world in innovation, and that track record is difficult to argue with. Whether the US still leads in every frontier, particularly AI where the debate about China's position is genuinely open, is a fair question. But the broader pattern is clear: permissive regulatory environments have historically correlated with explosive technological growth.

The EU "precautionary" argument holds that unchecked AI development creates real harm: discriminatory hiring algorithms, opaque credit scoring, surveillance creep, and safety risks in critical infrastructure. The AI Act is an attempt to draw lines before the damage is done, not after. Proponents argue that without guardrails, we are sleepwalking into a world where AI systems make consequential decisions about people's lives with no transparency, no accountability, and no recourse.

Both positions contain truth. And both, taken to their logical extreme, lead to bad outcomes.

The unregulated approach creates enormous value quickly, but it also creates enormous risk. When an AI system wrongly denies someone a mortgage, or a medical chatbot gives dangerous advice, or a hiring tool systematically discriminates, the harm is real and often falls on people who have no visibility into the system that affected them. "Move fast and break things" might be acceptable when you are iterating on a landing page. It is a dangerous philosophy when the "things" being broken are people's livelihoods, health, or civil rights.

The heavily regulated approach protects citizens, but it can also strangle the innovation it claims to want. European startups have been vocal about this. Some welcomed the clarification the Act provides, but others argued the original timeline "favoured deep-pocketed American tech giants who could afford to hire armies of compliance lawyers." When compliance costs create a structural advantage for incumbents, regulation stops being a shield and starts being a moat, just one that protects the wrong people.

A Product Manager's Perspective

Here is where I think the conversation goes wrong. Both sides frame regulation as something that happens to product development, an external force that either helps or hinders. But if you have spent any time building products in complex environments, you know that constraints are not inherently good or bad. What matters is how you design around them.

This is exactly the same principle I wrote about in the context of agile governance. Governance is not the enemy of delivery. Bad governance is the enemy of delivery. Good governance supports the flow of value by creating guardrails that keep teams on track without dictating every step.

The EU AI Act, at its core, is asking for things that good product teams should want anyway:

Transparency: Tell users they are interacting with AI. Explain how decisions are made. This is not a burden; it is basic product integrity.
Explainability: If your AI system makes a consequential decision, you should be able to explain why. If you cannot, that is not a regulation problem. That is a product quality problem.
Human oversight: For high-risk decisions, keep a human in the loop. Again, this is not radical. It is the kind of thing experienced PMs already advocate for.
Logging and monitoring: Maintain records of how your system behaves. This is just good engineering practice dressed up in legal language.
Risk assessment: Before you build, think about what could go wrong and for whom. This is discovery. This is what we are supposed to do.

The issue is not that these requirements exist. The issue is that they are being imposed on products that were never designed with them in mind. Retrofitting explainability into a system that was built as a black box is expensive and painful. Building it in from day one is a design decision, not a compliance project.

The Forcing Function

The tension between regulation and innovation has been a recurring theme on Lenny's Podcast recently, and several conversations have landed on a point that I think is underappreciated: constraints do not just limit what you can build. They change how you think about what you should build.

Brian Balfour described companies that define "incredibly hard constraints" as a strategy, not a problem. One company he worked with benchmarked against competitors and set a constraint that each function would be one-fifth the size. That constraint didn't slow them down. It forced them to find fundamentally different ways of working, including aggressive adoption of AI tooling.

Regulatory constraints can work the same way, if you let them.

Amol Avasare from Anthropic put it even more directly: "As the risks get higher and the stakes get higher, I think the fact that we are taking a stance and safety is critical to what we do, is actually going to become a significant competitive advantage."

This is not compliance-as-cost. This is compliance-as-differentiation. In a market where trust is increasingly scarce, the company that can say "we built this responsibly, and here is the evidence" has an advantage over the one scrambling to bolt on compliance features before a deadline.

Eric Ries made a related point about AI alignment: "Everyone's talking about AI alignment. I'd be a little more sanguine about AI alignment if the companies doing the aligning were better at aligning their human intelligences." The Act's explainability and transparency requirements essentially force companies to make their organisational values explicit. Ries argues this is work the tech industry has "severely underinvested" in. He is right.

Lightweight Governance, Not Bureaucratic Theatre

My position is not that the EU AI Act is perfect. It is not. The classification system is ambiguous in places. The "wellness tool vs medical device" boundary is genuinely unclear. The phased timeline has created perverse incentives, with some companies rushing to launch high-risk products before enforcement kicks in. And the compliance burden falls disproportionately on startups who can least afford it.

But the answer to imperfect regulation is not no regulation. It is better regulation. And the principles we apply in product development point the way.

In product work, we know that heavyweight governance kills velocity. Stage gates, committee approvals, and thick compliance documents slow teams to a crawl. But we also know that zero governance is chaos. Without any constraints, teams build the wrong things, accumulate unmanageable risk, and lose alignment with the broader organisation.

The sweet spot is lightweight governance: clear guardrails, empowered teams, and fast feedback loops. You define the boundaries, then give teams freedom within them. You inspect and adapt. You treat the governance model itself as a product that evolves.

AI regulation should work the same way:

Classification should be clear and predictable. Founders should not be guessing whether their wellness app counts as a medical device. The boundaries need to be sharp enough that a product team can make a confident call during discovery, not after months of legal review.
Compliance should be proportionate. The requirements for a chatbot recommending restaurants should not be the same as for a system making parole decisions. Risk-based tiering is the right idea, but the tiers need to be practical, not just theoretical.
The cost should not be a moat. If compliance is so expensive that only large incumbents can afford it, the regulation is failing. Tooling, templates, and shared infrastructure for common compliance patterns would help enormously.
The focus should be on outcomes, not paperwork. Does the system behave safely? Can affected users understand and challenge decisions? Is there meaningful human oversight where it matters? These are the questions that matter, not whether a specific template was filled in.

What This Means for PMs

If you are a Product Manager working on anything that touches AI, this is your problem now. Not legal's problem. Not compliance's problem. Yours.

The best PMs I know have always understood that constraints are not obstacles; they are design inputs. A screen size constraint forces better information hierarchy. A performance budget forces cleaner architecture. A regulatory requirement forces you to think about who your product affects and how.

Here is what I would do right now:

Assess your risk classification early. During discovery, not after launch. If your product touches hiring, credit, education, healthcare, law enforcement, or critical infrastructure, assume you are high-risk until proven otherwise. Build that assumption into your PRDs from the start.

Design for explainability from day one. If you cannot explain why your system made a decision, you have a product quality problem regardless of what the EU thinks. Explainability is not just a compliance feature. It is a trust feature, and trust drives retention.

Own compliance as a product concern. As Ian McAllister said on Lenny's Podcast: "The more you grow, you have to increasingly find the constraints or barriers to your success and knock them down no matter what they are." Do not wait for legal to hand you a checklist. Understand the requirements yourself and factor them into your roadmap.

Watch the US convergence. Colorado's SB 24-205 introduces risk management policies, algorithmic impact assessments, and consumer notice mechanisms. This is not an EU-only trend. The direction of travel is clear globally, and PMs who treat EU compliance as an isolated European problem will find themselves retrofitting again when similar requirements land closer to home.

The Bottom Line

I remain deeply optimistic about AI. The technology is transformative, and the potential to improve lives at scale is real. But potential and impact are not the same thing. The gap between them is filled with product decisions, and those decisions need guardrails.

The EU AI Act is imperfect, but its core instinct is right: consequential AI systems should be transparent, explainable, and accountable. These are not anti-innovation principles. They are pro-trust principles. And in the long run, trust is the foundation that sustainable innovation is built on.

The PMs who treat compliance as a checkbox will resent it. The PMs who treat it as a product constraint, one that forces clearer thinking, better architecture, and stronger user trust, will build better products because of it.

We should not be choosing between the American model of unchecked speed and the European model of cautious restraint. We should be applying the same principles we use in product development: encourage experimentation, accept that risk is inherent, but apply lightweight governance to guide the flow of value without blocking it.

That is not a regulatory philosophy. That is just good product management.

Agentic Engineering vs Vibe Coding: A Product Manager's Guide to Knowing the Difference

Steve James — Fri, 27 Mar 2026 00:00:00 GMT

The distinction that matters

Scroll through any PM community right now and you will find two very different conversations happening under the same banner of "AI-assisted development." In one, experienced practitioners are describing how AI agents are amplifying their existing expertise, letting them move faster, think bigger, and validate ideas in hours rather than weeks. In the other, people are asking whether Product Managers now need to ship production code because, after all, an LLM can write it for you.

These are not the same conversation. And conflating them is where people get into trouble.

Simon Willison's Agentic Engineering Patterns guide draws a sharp line between two modes of working with AI. Agentic engineering is what happens when domain experts use AI agents to amplify skills they already possess. Vibe coding is what happens when someone without deep understanding of the output delegates entirely to the model and hopes for the best.

For Product Managers, this distinction is everything. Agentic engineering is, by far, the area we should be embracing. It is going to 10X our productivity and unlock superpowers we never had before. Vibe coding has a place in our toolkit too, but we need to be very clear-eyed about what that place is.

Why agentic engineering is our superpower

The core pattern of agentic engineering, as Willison describes it, is delegation and supervision. Instead of treating an LLM as a glorified autocomplete, you define goals, constraints, quality criteria, and workflows for the AI agent. You review and refine its output. You bring your expertise to bear on what the agent produces.

This should sound familiar. It is, fundamentally, what Product Managers do. We have spent our entire careers defining problems clearly, setting constraints, evaluating outcomes, and making trade-off decisions. We orchestrate. We prioritise. We apply judgement to ambiguous situations. Agentic engineering takes these existing strengths and amplifies them to a degree that was simply not possible before.

Lazar Jovanovic, who works as a professional vibe coder at Lovable and was featured on Lenny's podcast, captures the underlying principle perfectly:

AI is an amplifier regardless of your background. If you do not know what you are doing, you are just going to produce garbage faster.

The corollary is equally true. If you do know what you are doing, you produce excellence faster. For PMs with strong product sense, clear thinking, and good judgement, agentic AI is a force multiplier unlike anything we have had before.

What 10X actually looks like

Let me be concrete about what this means in practice, because "10X productivity" can sound like empty hype if you do not ground it.

Research and synthesis at speed. Competitive analysis that used to take a week of desk research can be synthesised in an afternoon. User research themes can be clustered, cross-referenced against quantitative data, and tested for patterns in hours. One PM described an AI agent that saves hours of manual checking each week by surfacing competitive moves automatically. This is not replacing the PM's judgement about what matters. It is eliminating the drudgery that sits between a question and an informed answer.
Faster feedback loops. Elena Verna described on Lenny's podcast how her team at Amplitude went from the traditional multi-month cycle of user research to design sprints to engineering roadmap prioritisation, down to prototyping in a couple of weeks. The compression is not about cutting corners. It is about removing the wait time between having an idea and being able to test it against reality.
Richer stakeholder conversations. When you can spin up an interactive prototype to stress-test a hypothesis before you have even written the first user story, the quality of your conversations with engineering, design, and leadership changes fundamentally. You are no longer describing an abstract concept in a PRD. You are showing people a working example of what those requirements describe.
Judgement as the bottleneck, not bandwidth. Aparna Chennapragada, formerly a senior PM leader at Google, made the point on Lenny's podcast that the taste-making and the editing function becomes really important in an AI-augmented world. If your role was mostly process management and report generation, you should be concerned. But if your value lies in judgement, prioritisation, and the ability to frame problems clearly, you are about to become significantly more powerful. Agentic engineering shifts the constraint from "I do not have enough time to do the analysis" to "I need to make better decisions with the analysis I now have."

The expertise amplifier effect

There is a pattern here that I think is worth naming explicitly. Every capability that agentic engineering unlocks for PMs depends on the expertise you bring to the table. The AI agent does not know which competitive signal matters. You do. The agent does not know which user research theme connects to your strategic priorities. You do. The agent does not know whether the prototype it just built actually solves the customer's problem. You do.

This is what Willison means when he talks about "hoarding things you know how to do." The more patterns, frameworks, and hard-won experience you have accumulated over your career, the more effectively you can direct AI agents.

Your expertise is not threatened by agentic engineering. It is the prerequisite for it.

Jovanovic made the same point using the Aladdin and the Genie analogy. You rub the lamp, the genie comes out, and your first wish is to be taller. The genie makes you 13 feet tall because you were not specific enough. The quality of what you get from an AI agent is directly proportional to the clarity and precision of your instructions. And clarity and precision in describing what needs to be built, for whom, and why, is quite literally the core PM skill.

David Mytton from Arcjet articulated the boundary well: AI coding only works when there are clear guardrails in place, meaning good documentation and comprehensive tests the agent can run. That, he says, is what differentiates vibe coding from agentic engineering. The guardrails require expertise to define. PMs who have spent years learning to write clear requirements, define acceptance criteria, and think through edge cases are already building the muscle that agentic engineering demands.

The vibe coding caveat

All of which brings me to vibe coding, and why it deserves a more measured take than it usually gets.

I am bullish on Product Managers using vibe coding. But specifically for what it is: a way to supercharge the things we are already doing. When you vibe code a prototype, you are not becoming a software engineer. You are doing what PMs have always done, articulating requirements and demonstrating a potential solution, except that now the output is a working interactive example rather than a static wireframe or a written specification. It is analogous to writing a PRD, except you are showing those requirements in a living, testable form.

I have experienced this firsthand. I built a story mapping tool using Claude Code because every commercially available option was either wildly over-engineered and expensive, or there were templates in Miro and Lucid that ended up being more work than they were worth. I built this thing and I could not be happier with it for my own tasks. It does exactly what I need.

But the moment you start thinking about releasing something like that beyond your own use, you need to be honest about what you do not know. When you vibe code something, you have no real understanding of how it works under the surface. You can see the inputs and the outputs. You can test the happy path and a few edge cases. But you have no idea what shortcuts the model took, what security considerations were ignored, what performance problems are lurking, or what happens when the thing encounters a scenario that neither you nor the model anticipated.

There are sensible mitigations. Get a different LLM to conduct a code review. Ask the model to explain each part of the codebase so you have a basic understanding of how it is put together. Jake Knapp and John Zeratsky, updating their Design Sprint methodology for the AI era, noted on Lenny's podcast that teams who jumped to vibe coding prototypes too quickly produced output that was "super generic" and did not really describe what the product was. Their advice: you will move faster if you slow down a little at the beginning. That same discipline applies here.

But the fundamental limitation remains. There will always be something you did not know to ask about. Product Managers have been working with engineers for decades, so we know the kinds of things to look out for at a high level. We know about technical debt, scalability, security reviews, testing. But unless you have actually written production code, maintained it, debugged it at 2am, and dealt with the consequences of architectural decisions made years ago, you are not going to catch everything. And in software, what you miss can range from mildly annoying to catastrophically expensive.

The rule of thumb is simple. Vibe code prototypes enthusiastically. Use them to stress-test ideas, to have better conversations with your engineering teams, to show stakeholders what a solution could look and feel like. And when it is time for that prototype to become production-ready, bring in the engineers with decades of experience to make it work properly. The prototype proves the concept. The engineering team makes it real.

What the full picture could look like

I heard someone describe a workflow recently that I think paints a compelling picture of where this is all heading. Their PM team still conducts user research the way they always have. They send out surveys, capture usage metrics, and go out and speak to customers. All of that is captured, transcribed, and ingested into a data store. Agents then go over this data continuously, looking for trending themes, recurring problems, and emerging opportunities. These are automatically added to a dynamically constructed opportunity solution tree.

The PMs monitor this tree, and when something looks to have enough weight behind it, they build a prototype of the potential solution using a tool like Claude Code. That prototype is then dogfooded by the organisation for a few weeks to stress-test both the idea and the proposed solution. If it looks like it has legs, the prototype is handed to the engineering teams, who pull it apart and build a production-ready version using the correct guardrails and technologies to ensure it is performant, secure, and scalable.

What strikes me about this example is that it is not science fiction. Every piece of this workflow exists today. Agentic engineering handles the research synthesis and opportunity identification, amplifying the PM's expertise rather than replacing it. Vibe coding handles the prototype, giving the team something real to react to rather than a static specification. And professional engineering handles production, because that is where the decades of hard-won expertise in building reliable, secure software actually matters. Each mode of working with AI is used for exactly what it is good at, and nothing more.

That, to me, is the future that PM teams should be aspiring towards.

Where this leaves us

The distinction between agentic engineering and vibe coding is not just semantic. It is the difference between a power tool in the hands of a craftsperson and a power tool in the hands of someone who watched a YouTube tutorial. Both can produce impressive results. Only one can be trusted when it matters.

Product Managers have spent decades building the exact skills that agentic engineering rewards: clarity of thought, structured problem framing, judgement under ambiguity, and the ability to orchestrate people and processes towards an outcome. AI does not diminish any of that. It amplifies all of it. The PMs who recognise this, and who learn to wield these tools with the same discipline they bring to everything else, are going to be extraordinarily effective.

Embrace agentic engineering fully. Use vibe coding enthusiastically for prototypes and personal tools. And when it is time to make something real, bring in the engineers. Not because AI has failed, but because knowing the limits of your expertise has always been the most product-management thing you can do.

The Great PM Skills Debate: What AI Won't Replace

Steve James — Sat, 14 Mar 2026 00:00:00 GMT

There's a debate happening in product management right now that I find genuinely fascinating, not because it's new (people have been asking "will AI replace PMs?" for two years), but because it's finally getting specific. The question has shifted from "will it happen?" to "which bits, exactly?"

The conversation is playing out across Podcasts, research, Medium think pieces, and every PM WhatsApp channel I'm in. Zoë Yang's analysis of 284 Lenny's Podcast episodes captures the trajectory well: PMs are shifting from feature delivery to systems thinking, building evals, instrumenting failures, and operating in roles with increasingly blurred boundaries. What's emerged isn't a clean answer. It's a set of contradictions that tell us something important about where the profession is headed.

The one thing everyone agrees on

Let's start with the common ground, because there is some.

The administrative layer of product management is being compressed. Meeting notes, backlog grooming, ticket management, roadmap formatting, basic prioritisation frameworks, these are already being automated or dramatically accelerated. Nobody seriously disputes this. All opinions seem to frame AI as enhancing PM capabilities rather than replacing them, but even these optimistic takes acknowledge that the coordination and documentation layer is being hollowed out.

Marty Cagan put it bluntly on Lenny's Podcast: if your job is fundamentally "backlog administrator", that work is already being done by AI, and it's only going to get better supported. The question is what's left once you strip that layer away.

Aparna Chennapragada framed it well: if you're mostly a process person, tracking things, sending emails, managing the machinery, you've got a real question to answer about your value add. But on the flip side, she argues, "the taste-making and the editing function becomes really, really important."

That word "editing" keeps coming up. And it's worth digging into.

The Editor-in-Chief thesis

There's a compelling argument gaining traction that the PM role is shifting from "builder" to "editor". Eric M. De Castro captured it directly: the bots can manage the backlog, the agents can optimise the velocity, and the role of the Senior PM has fundamentally shifted from "Builder" to "Editor-in-Chief".

The logic goes like this. When AI can generate strategies, write PRDs, draft user stories, and even prototype features, the PM's job is no longer to produce these artefacts. It's to curate, refine, and judge them. You're not writing the first draft any more. You're deciding which of five AI-generated drafts is actually good, and why.

A similar point surfaces in a piece on taste in the AI age: we're entering an era where anyone can make a lot, so the differentiator isn't how much you can produce, it's how much you can discard. Building is becoming abundant. Editing is becoming the craft.

This shouldn't feel entirely alien to experienced PMs. The best Product Managers were always defined less by the ideas they greenlit and more by the ones they killed. Saying no to the majority of what comes down the funnel, the feature requests, the stakeholder pet projects, the shiny distractions, has always been the job. What's changed is the volume. When AI can generate ten plausible strategies before lunch, the filtering muscle doesn't become less important. It becomes the whole game. Same skill, but on steroids.

The question of whether machines can even develop taste is explored well by Web Designer Depot, who argue that while AI can learn aesthetic patterns, true taste involves cultural context, intentional rule-breaking, and emotional resonance that remain distinctly human.

I find this persuasive up to a point. But it raises an uncomfortable question that nobody seems to have a good answer to yet: how do you develop editorial judgment if you never do the building? Taste requires reps. If AI does the work, where do the reps come from?

Hilde Dybdahl Johannessen pushed back on the "taste as moat" narrative directly, arguing that taste itself can become a form of gatekeeping and that AI may eventually learn to simulate it through preference learning. It's a useful corrective. The taste argument is appealing, but it's not as airtight as its proponents suggest.

The strategy paradox

Here's where it gets really interesting, and where the PM community is genuinely split.

On one side, you have people arguing that strategy is the skill most vulnerable to AI replacement. The reasoning: AI can process vastly more market data, competitive intelligence, and customer feedback than any human. Strategic frameworks are well-documented and replicable. Pattern recognition from thousands of successful strategies is exactly what AI excels at.

On the other side, you have people arguing that strategy is the skill most protected from AI. The reasoning: strategy requires contrarian thinking, and AI is trained on consensus data. It demands human judgment about what to ignore. It involves organisational politics and relationship dynamics that AI cannot navigate. As Hilde Dybdahl Johannessen points out, without deliberate steering, AI will give your company roughly the same advice as your competitors. The market doesn't reward copycats. It rewards contrarians who are right.

Asha Sharma and Lenny discussed this tension directly on the podcast. You'd think that an AI with all the information about where the market is going, your metrics, and your product today would be excellent at developing strategy. And yet many people believe it's the one thing AI won't be good at for a long time, because that's where human judgment is most irreplaceable.

I don't think either side has won this argument. But I think the framing is slightly wrong. Strategy isn't one thing. The analytical component of strategy, market sizing, competitive mapping, trend identification, is clearly vulnerable. The judgment component, what to bet on, what to ignore, when to zig while everyone zags, is clearly protected. The question for any individual PM is: which of those two things do you actually spend your time doing?

The jagged frontier

Ethan Mollick's concept of the "jagged frontier" is the best mental model I've found for thinking about all of this. The idea is that AI's capabilities aren't a clean line. They're uneven. AI excels at some surprisingly complex tasks while failing at some surprisingly simple ones. And the frontier keeps moving.

The practical implication is that you can't make blanket statements about what AI can and can't do. You have to test it, task by task, and develop an instinct for where the frontier currently sits. Mollick's four principles are useful here:

always invite AI to the table,
be the human in the loop,
treat AI like a person (but remember it isn't one), and
assume this is the worst AI you'll ever use.

That last one matters. Whatever AI can't do today, it will probably do tomorrow. So building your career strategy around AI's current limitations is a losing game. The question isn't "what can't AI do?" It's "what will remain uniquely human even as AI gets dramatically better?"

The buy-in problem nobody talks about

Multiple podcast guests raised a point that I think is under-discussed: AI can't do stakeholder management.

Claire Vo put it well when she noted that she doesn't know how an AI bot achieves buy-in and alignment, "unless everybody's got their own little bot and they're all talking to each other."

This sounds like a throwaway observation, but I think it's profound. A huge amount of product management is persuasion. Convincing an engineering lead to prioritise your feature. Getting a sceptical executive to fund an experiment. Navigating the politics of a cross-functional team where everyone has different incentives. Building trust with a design team that's been burned by PMs before.

None of this is analytical. None of it can be synthesised from data. And it's not going away any time soon, because it's fundamentally about human relationships and organisational dynamics. Michele Galli made a related point: many PMs have built their competence around communication, organisation, and alignment. Those skills are useful, but they're also relatively safe. AI compresses the value of coordination, but not the value of genuine influence.

The blurring of roles

One of the most consistent themes across Product related discussions is that the boundaries between PM, engineer, and designer are dissolving.

Tamar Yehoshua predicted that in five to ten years, these lines will blur significantly because AI will enable PMs to build prototypes and designers to code. Meta PMs are already using AI coding tools to become builders themselves, with one PM describing it as being handed "superpowers", operating less like a conductor moving work between functions and more like a product owner who can execute directly.

Casey Winters offered the sharpest version of this: if you thought the PM job was just filling in frameworks and collecting promotions, then yes, AI will replace you. The real PM job, the one requiring genuine subject matter expertise, is the least likely to be replaced.

Zevi Arnovitz pushed back strongly against the concern that AI weakens PM skills, arguing instead that it's a collaborative learning opportunity. He sees AI-assisted building as a way for PMs to deepen their craft, not atrophy it. I think he's partly right. But the risk of atrophy is real for people who skip the fundamentals entirely and go straight to AI-generated outputs without understanding what good looks like.

Where I land

I've been thinking about this a lot, partly because I've lived through enough technology shifts to know that the conventional wisdom is usually wrong in at least one important way.

Here's my take: experienced Product Managers who keep up with the technology are going to become more in demand, not less. The "human" skills that make a great PM, judgment, taste, persuasion, the ability to synthesise conflicting signals into a coherent direction, don't get replaced by AI. They get amplified by it.

The reason is straightforward. If AI handles the execution burden, the research synthesis, the first drafts, the data crunching, the documentation, then experienced PMs can finally step back from the production line and focus on what they should have been doing all along: orchestration. Setting direction. Making judgment calls. Editing rather than writing. That's a 10x opportunity for people who have the foundational skills to take advantage of it.

But, and this is the critical caveat, only if they actually engage with the tools. The PMs who will struggle are the ones who either refuse to use AI (and get outpaced) or rely on it blindly (and lose their edge). The sweet spot is what Mollick describes: being the human in the loop, with genuine expertise about when to trust the output and when to override it.

The bifurcation is real. Administrative PMs are in trouble. Strategic, taste-driven, judgment-heavy PMs are entering their golden age. As Saeed Khan argues, AI won't magically fix role definitions, bad objectives, or lack of strategy, those are human problems that require human solutions. The uncomfortable middle ground is that most PMs are a bit of both, and the transition isn't going to be comfortable.

The junior PM problem

There's a question that almost nobody in this debate is addressing honestly, and it's the one that worries me most: what happens to the people who haven't gotten good yet?

Every optimistic take on AI and product management, including mine, rests on the same assumption: that experienced PMs with strong judgment and taste will thrive. Fine. But where do experienced PMs come from? They come from junior PMs who were, at one point, not very good at the job.

The basic premise of career development in almost every profession has always been the same. When you start, you're bad at it. The company understands you're bad at it. They pay you to do enough of the basic work, the ticket grooming, the meeting notes, the stakeholder chasing, the first-draft PRDs, with the expectation that over time you'll learn from it, develop judgment, and become genuinely valuable. The grunt work isn't just work. It's training.

Now companies are looking at that same grunt work and seeing that AI can do it faster, cheaper, and without needing a desk. The signal from hiring managers is becoming difficult to ignore: they're no longer prepared to pay someone to be bad at the job long enough for them to get good. Junior PM roles are being cut or simply not backfilled. The entry-level pipeline is narrowing.

This is, of course, extremely short-sighted. The senior PMs everyone is so keen to retain won't be around forever. They'll move on, burn out, retire, or get poached. And if there's nobody coming up behind them, nobody who spent two years in the trenches learning what a good user story looks like by writing five hundred bad ones, then you've got a succession crisis dressed up as a cost saving.

It also circles back to the taste question raised earlier. If taste requires reps, and the reps are being automated away, how does the next generation develop the judgment that everyone agrees is irreplaceable? You can't edit what you've never written. You can't curate if you've never built. The "Editor-in-Chief" thesis is compelling for people who already have twenty years of context. It's a dead end for someone in their first role.

I don't have a clean answer to this. But I think it's the most important structural question the profession faces. The debate about whether AI will replace experienced PMs is interesting. The question of whether we're quietly dismantling the path to becoming one is urgent.

The bottom line

The PM skills debate isn't really about AI at all. It's about something the profession has been avoiding for years: what is the actual, irreducible value of a Product Manager?

AI is forcing that question into the open. The coordination gets automated. The strategy gets challenged. What remains is judgment, taste, and the ability to make things happen through people. For experienced PMs who engage with the tools, that's a genuinely exciting shift. The work gets harder, but it also gets closer to the work that matters.

The problem is that this optimism only holds if you zoom in on the people who are already good. Zoom out and the picture is more troubling. If the profession celebrates the rise of the "Editor-in-Chief" PM while quietly eliminating the junior roles where people learn to write in the first place, we're not evolving the discipline. We're hollowing it out.

The companies that get this right will be the ones who recognise that AI doesn't remove the need to develop people. It changes how you do it. The grunt work might look different, the apprenticeship might be shorter, the tools might be better. But the principle remains: you have to let people be bad at something long enough for them to get good. Any organisation that forgets that is saving money today and buying a talent crisis tomorrow.

So yes, experienced PMs who embrace AI are entering a golden age. But the real test of the profession isn't whether the current generation thrives. It's whether we're building the conditions for the next one to exist at all.

Speak Up, Write It Down, Put It Out There

Steve James — Sat, 21 Feb 2026 00:00:00 GMT

Early in my career, I received a piece of feedback that stuck with me. It was the kind of thing that stings a little because you know it's true:

"I would like Steve to be more forceful and vocal in meetings. In my eyes it appears that Steve providing his insights earlier in discussions could at times 'short-cut' meetings."

I knew immediately he was right. It wasn't one of those bits of feedback you file away and rediscover years later with fresh eyes. It landed with the uncomfortable clarity of something I'd already been half-aware of. I'd sit in meetings, form thoughts, wait for the perfect moment to contribute, and then watch as the conversation moved on without me. Or I'd speak up late, only to find that the group had already landed on what I'd been thinking ten minutes earlier. The feedback didn't tell me anything new. It just said it out loud, which, ironically, is exactly what I'd been failing to do.

That realisation sent me down a path I'm still on: articulating my thinking, out loud, to other people, in writing, in public, because it changes the quality of the thinking itself.

Why putting your thinking out there matters

Let's start with what happens when people do articulate their thinking, both for themselves and for the people around them.

According to a Salesforce study reported in Inc., employees who feel their voice is heard at work are 4.6 times more likely to feel empowered to perform their best work. That's not just a nice-to-have. It suggests that the act of contributing, and being received, fundamentally changes how people experience their own capability. Meanwhile, Grammarly's State of Business Communication report found that 66% of knowledge workers and 72% of leaders wish their organisations would invest in tools to help them communicate more effectively. The appetite is there. People want to share their thinking. The friction is in the how.

Deb Liu, author of Take Back Your Power, captures a version of this tension. She tells the story of a talented colleague who kept getting overlooked: "Every time she came up for promotion or calibration, people were like, 'Oh, what does she do?' And it was because she was not good at broadcasting or explaining what she does." This isn't a story about someone who lacked insight. It's a story about insight that never made it out of one person's head and into the space where others could benefit from it.

That's the real opportunity here. Articulating your thinking doesn't just help your career, though it does that too. It sharpens the thinking itself. Every thought you put into words is a thought you've been forced to examine, structure, and stress-test. And every thought you share is one that other people can build on, challenge, or redirect. The value isn't just in being heard. It's in what the act of speaking does to the quality of your ideas.

Why articulation changes your thinking

Here's where things get interesting. Software engineers have known this for decades, through a concept called rubber duck debugging.

The idea, widely attributed to The Pragmatic Programmer by Andrew Hunt and David Thomas, is beautifully simple: when you're stuck on a bug, you explain your problem aloud to a rubber duck. That's it. You talk to a duck.

It sounds absurd, but it works because of something fundamental about how our brains process information. When you're scanning code silently, you're in "pattern-matching mode", your eyes are moving, you feel productive, but you're glossing over assumptions. The moment you force yourself to explain the problem step by step, as if to someone who knows nothing about your code, you shift into "explaining mode." You have to be explicit about what you think is happening versus what's actually happening. And that's usually where the flaw reveals itself.

Many developers find they solve the problem halfway through explaining it. Sometimes before they even finish the question. It's also why posting a question on Stack Overflow and then immediately finding the answer yourself is such a universal experience. The discipline of framing the problem clearly is often all it takes.

This isn't a programming trick. It's a thinking trick. And it maps directly onto that meeting feedback I received. I was doing the silent equivalent of scanning code, processing information internally, pattern-matching, feeling like I was contributing by listening carefully. But the act of forcing my thoughts into words, into the room, would have sharpened them in ways that internal processing simply couldn't.

Getting started (when it feels impossible)

If you're reading this and thinking "that's easy for you to say," I hear you. Speaking up is genuinely uncomfortable, especially in environments where you feel junior, uncertain, or simply outnumbered.

So how do you actually start? Claire Hughes Johnson, in her book Scaling People, offers a deceptively simple entry point: ask a question. A question is not threatening. It doesn't require you to have the answer. It just forces the room to examine an assumption. That's the meeting equivalent of rubber ducking. You're not declaring anything. You're just making implicit thinking explicit.

The fear, of course, is that you'll stumble. That you'll start a sentence and lose the thread halfway through, and everyone will notice. But research on the peak-end rule suggests otherwise: people remember how an experience ends far more than how it begins. That awkward pause where you gathered your thoughts? Nobody remembers it. They remember where you landed.

And here's the thing that took me longest to learn: the words matter less than you think. What matters is whether you're genuinely present in the conversation or holding back out of self-protection. People can tell the difference. A half-formed thought offered with genuine curiosity lands better than a polished point delivered defensively. The barrier to speaking up is almost always internal, and the cost of staying silent is almost always higher than you think.

The introvert question

Everything I've said so far assumes a fairly specific kind of person: someone who has thoughts but holds them back. But not everyone processes information the same way. Some people think by talking. Others think by reflecting. Telling introverts to "just speak up more" can feel like telling left-handed people to write with their right hand.

Donna Lichaw, in her book The Leader's Journey, tells the story of a leader who "was so quiet that her team thought she was not interested in them... it really was detrimental to them all working together." But the solution wasn't to force her into a different personality. It was communication about process: "She just started communicating with them more about, 'Hey, this is my style. I'm a little slower. I often need a couple of hours to really process things. I'm here, and I want you to know that.'"

This feels like a crucial middle path. The goal isn't to turn everyone into the loudest person in the room. It's to make your thinking visible, whatever that looks like for you. For some people, that's speaking up in the moment. For others, it's following up after with a considered email. For others still, it's writing.

From speaking up to writing it down

Deb Liu's manager gave her a piece of advice that changed her career trajectory: "Just write what you repeat. If you say something more than once, just write it down."

This led to years of monthly internal posts, which eventually became external publishing, and ultimately her book. The progression is worth examining: she didn't start by writing for the public. She started by capturing thoughts she was already having and putting them somewhere others could find them.

Ed Elson, on a recent podcast with Kyla Scanlon, pushes this idea further:

"If you want to connect the dots and be smarter about what is happening in the world, you should just make yourself speak more. Like, you should say more things at work. When you're in a meeting, you should just force yourself to say something, have an opinion."

And then the leap to public:

"You should do it online. Like, you should post on LinkedIn. You should start a newsletter. Like, I feel like the more you hold yourself accountable to producing thoughts and ideas, which is really uncomfortable to do, but the more you force yourself to do it, I think it really helps you connect the dots and also develop your own perspective."

What Ed is describing is rubber ducking at scale. When you write for an audience, even a small one, you're forced to examine your assumptions more rigorously than when the thoughts stay in your head. You have to structure your argument. You have to anticipate objections. You have to decide what you actually believe versus what you're merely parroting from someone else.

The accountability mechanism matters too. Deb Liu had a monthly cadence. Ed talks about holding yourself accountable to "producing thoughts and ideas." The regularity is what transforms occasional insight into a consistent practice of thinking clearly.

The critical caveat: informed takes, not hot takes

Now, I know what you might be thinking. Doesn't the internet already have enough people sharing their opinions? Do we really need more LinkedIn posts?

Kyla Scanlon provides the essential correction:

"I don't know if we need like more hot takes. I think we need more informed takes."

This is the distinction that separates valuable public thinking from noise. Hot takes can spark lively debate and provide fresh perspectives, but they also run the risk of oversimplifying complex issues or sensationalising topics for clicks.

And to be clear, none of this is an argument for speaking up purely for the sake of being heard. If you're in a meeting about a topic you know nothing about, interrupting the conversation just to register your presence helps nobody. That's not articulation, it's performance. The goal is to share thinking that you've actually done, not to fill silence with sound. There's a meaningful difference between "I haven't spoken yet so I should say something" and "I have a perspective here that might be useful."

But here's where the rubber duck metaphor becomes particularly powerful. The whole point of rubber ducking is that it forces you to slow down and examine your assumptions before you present your conclusion. The discipline of articulating your thinking carefully, whether to a duck, a colleague, or a LinkedIn audience, is itself the preparation that transforms a hot take into an informed one.

The answer isn't "share less." It's "think more before you share." And the best way to think more is, paradoxically, to commit to sharing regularly, because that commitment forces the preparation.

A progression, not a leap

Looking back at all of this, I think the real insight is that speaking up isn't a single skill. It's a progression:

Talk to the duck. When you're stuck on a problem, explain it out loud. To a colleague, to a rubber duck, to your dog. The act of articulating the problem will often reveal the solution.
Speak in the room. In meetings, force yourself to contribute earlier. Ask a question. Share an observation. Own it as your perspective, not a universal truth. The thought you're holding back might be the one that short-cuts the entire discussion.
Write what you repeat. If you find yourself making the same point in multiple conversations, write it down. Share it internally. You're already doing the thinking; capturing it creates leverage.
Put it out there. When you've built confidence in your perspective through the earlier stages, share it publicly. Start a newsletter. Post on LinkedIn. The accountability of a public audience will sharpen your thinking further.

Each stage builds on the last. Each one involves a slightly larger audience and a slightly higher bar for clarity. And each one delivers the same core benefit: forcing yourself to articulate your thinking makes the thinking better.

Back to the feedback

I think about that career feedback sometimes. Not because it took me ages to absorb, it didn't. But because understanding something intellectually and actually changing your behaviour are two very different things. Knowing you should speak up earlier in meetings is easy. Doing it, consistently, when every instinct is telling you to wait, to refine, to hold back until you're certain, that's the work.

Rubber duck debugging works because it forces you to be explicit about what you think is true. Speaking up in meetings works for the same reason. Writing works for the same reason. Publishing works for the same reason.

The medium changes. The mechanism doesn't.

So speak up. Write it down. Put it out there. Not because the world needs more noise, but because your thinking deserves to be examined, and the only way to examine it properly is to say it out loud.

The Product Sense Myth: Why Your Best PMs Aren't Born, They're Built

Steve James — Thu, 05 Feb 2026 00:00:00 GMT

We've all heard it. Someone in a hiring debrief leans back and says, "They just have great product sense."

It sounds meaningful. Decisive, even. But what does it actually mean?

Usually, it's a way of saying "I can't quite explain why I like this person, but I do." And that's a problem. Because when we treat product sense as some kind of innate gift, we stop asking whether it can be taught. Spoiler: it can.

Why "Product Sense" Is a Confusing Term

Here's the thing. Product sense does describe something real. When a seasoned PM looks at a prototype and immediately spots what's wrong, that's not luck. Their brain is doing something useful, pulling together years of experience into a quick judgement.

Christina Wodtke explains this well. She draws on Davenport and Prusak's concept that intuition is "compressed experience", and points out that when experienced PMs react quickly, "they're not being mystical. Their brain is processing hundreds of micro-signals: user flow friction, business model misalignment, technical complexity, competitive dynamics. Years of experience get compressed into a split-second gut reaction."

That's genuinely valuable. But by calling it "sense", we make it sound like something you're born with. And that's where we go wrong.

Jules Walter, who's spent over a decade building products at YouTube, Slack, and Google, is direct about this: "Contrary to what a lot of PMs believe, product sense is not something you need to be born with. It's a learned skill, just like any other PM skill."

So why do we keep treating it like magic?

What Baseball Can Teach Us About Product Hiring

Bear with me here, because this analogy is worth it.

Before Moneyball came along, baseball scouts picked players based on gut feel. They watched how someone moved, how they looked, whether they had that indefinable "it factor". They were experienced. They were confident. And they were often wrong.

Yahia Hassan makes this connection in his critique of how we think about product sense. He asks: "Could a baseball team win by constructing a roster based solely on their scouts' 'sharp eye for talent'? What would that look like? Selecting players based on their physical appearance? Or perceived skill?" His point is that product management is making the same mistake baseball made before the data revolution.

Now, I'm not saying intuition is useless. Far from it. But when we hire PMs based on whether they "seem" to have good product sense, we're doing something similar to those old baseball scouts. We're mistaking confidence for competence and experience for ability.

Hassan's takeaway is pointed: "The key takeaway here is, like a baseball GM, product managers need to use their product sense as a complement to being data-informed. This is why I think the ability to understand what metrics are important and measure them is what separates great PMs from the pack. More so than product sense."

So What Actually Is Product Sense?

If we're going to develop something, we need to define it properly. And vague definitions don't help anyone.

Jules Walter offers the clearest one I've come across:

Product sense is the skill of consistently being able to craft products (or make changes to existing products) that have the intended impact on their users.

Notice what's useful here. It's measurable. You can actually track whether your product bets are paying off more often than they used to. If you started out being right two times in ten and now you're right eight times in ten, your product sense has improved. Simple as that.

Walter breaks it down into two parts:

(1) empathy to discover meaningful user needs and

(2) creativity to come up with solutions that effectively address those needs.

Neither of those is mystical. Both can be worked on.

Austin Yang takes a slightly different angle. He finds the usual definition, "the ability to understand what makes a product great", to be "a bit self-repeating and ambiguous." Instead, he argues that "product sense should be equated to a person's ability to do certain things when limited information is given. The ability to outline all potential paths and obstacles will take a product closer to its destination, even in extreme ambiguity."

I like this framing because it's practical. It's not about having the right answer. It's about navigating uncertainty well. And that's something you can get better at.

As Yang puts it: "Product sense is a skill you refine through practice, not a talent you are born with."

How Meta Thinks About It

Say what you will about Meta, but they've thought hard about what makes a good PM.

Mihika Kapoor who led products at Meta before joining Figma, explains their approach: "Meta basically distilled the product role into two core capabilities: product sense and execution."

What's interesting is how she defines product sense in practice. It's not some innate gift. It's "just like having good intuition. And so there's kind of this question about like, okay, how do you build up intuition? And I think that it's just by like, having this insatiable curiosity and talking to users at every chance you get."

That's the key insight. Intuition isn't magic. It's pattern recognition built through exposure. And exposure is something you can deliberately seek out.

Stewart Butterfield, the co-founder of Slack, uses a cooking analogy that I find helpful. He compares developing product taste to becoming a chef. No one's born knowing how to balance flavours or time multiple dishes. You learn by doing it over and over again.

He also makes a sharp observation: "Most people don't have good taste and don't invest" in developing it. Which means if you do invest, you've got an edge.

How to Actually Get Better at This

Alright, so product sense is learnable. How do you learn it?

Wodtke argues that most people go about it wrong. They read articles and books about product thinking, which she compares to trying to learn tennis by studying physics. You need reps.

Here's what actually works:

Use products deliberately. Every app on your phone is a case study. Someone made decisions about every button, every flow, every piece of copy. Start asking why. What problem is this solving? Why did they make this trade-off instead of another one?
Spend time with users. Not reading summaries. Not watching highlight reels. Actually sitting with people, watching them use your product, listening to how they describe their problems. Kapoor's advice about "talking to users at every chance you get" is spot on.
Pay attention to trends. Walter recommends tracking both big shifts, like new platforms or regulations, and smaller patterns in your specific area. The PMs who seem to predict the future usually aren't psychic. They've just been paying attention longer.
Reflect on your bets. When you make a product decision, write down what you expect to happen. Then check back later. This turns vague intuition into something testable.

Data vs. Intuition Is a False Choice

One of the most common misunderstandings is that you have to pick a side. Either you're a "data-driven" PM or you trust your gut. But that's a false choice.

Marcy Farrell, writing for Harvard Business Publishing puts it well:

"Intuition is often seen as the opposite of reason, and when cast in this binary way, intuition is often defined as having no place in the age of science and data." But that framing misses the point. "... many decisions are too complex to rely on metrics or gut feelings alone. The best leaders and decision makers use both data and intuition to their advantage."

Laura Huang's research at Harvard Business School shows that gut feelings work best "in highly uncertain circumstances where further data gathering won't sway the decision maker one way or the other."

But there's a catch. The same piece warns that "at its best, intuition is a powerful form of pattern recognition, something human brains are wired to do. But when not managed well, pattern recognition and trusting one's gut may lead to bias and incomplete or overly simplistic thinking."

The best PMs use data to sharpen their intuition, and intuition to know what data to look for.

What This Means for Hiring

If product sense is learnable, we need to rethink how we hire.

Right now, most companies treat product sense interviews as a test of natural ability. Give someone an ambiguous prompt and see if they "get it."

But as Yang points out, junior PMs often haven't had enough experience to develop meaningful intuition yet. Evaluating them on product sense isn't identifying talent. It's identifying people who've been lucky enough to get relevant experience somewhere else.

This creates a cycle that rewards the usual backgrounds and filters out everyone else. Then we congratulate ourselves on finding "natural talent", when really we've just found people who started with more advantages.

A better approach? Hire for curiosity and learning ability. Invest in the experiences that build product judgement over time. Stop treating product sense as something people either have or don't.

The Bottom Line

Scott Belsky, Adobe's Chief Product Officer, captures what good product thinking actually looks like:

"Empathy > passion when building a new product. Empathy for those suffering from a problem outperforms passion for a potential solution."

When products fail despite hard work, he says, "usually, it's due to a lack of empathy for the customer." The answer is to go "shoulder to shoulder with them to identify this problem first before crafting your vision. You must talk to customers, watch them go about their day, and ask why they're struggling."

That's not magic. That's craft. And craft can be learned.

Product sense isn't a gift some people are born with. It's the result of caring about users, spending time with them, thinking hard about problems, and building feedback loops that make your judgement sharper over time.

Anyone can do that. The question is whether we're willing to put in the work, and whether our industry is ready to stop pretending otherwise.

Adventures with Clawdbot: From Autonomy to Economic Reality

Steve James — Wed, 28 Jan 2026 00:00:00 GMT

The New AI Hotness

Unless you've been living under a rock in the tech world lately, you've likely heard about the latest AI obsession: "Clawdbot" or "Moltbot" as it’s now called to avoid Anthropic's wrath. It is an open-source framework designed to give an AI model a persistent memory and personality, promising a future where your AI isn't just a tab in your browser, but a permanent, thinking resident on your hardware.

For the last few months, I’ve been exploring options that go far beyond simple server scripts. I wanted a 24/7 assistant that could 10X my ability to do stuff. A proactive partner that didn't just wait for a prompt, but actually thought and acted on my behalf. I wanted an agent with a "Brain."

Beyond the Sentinel: The Agentic Pivot

If you've visited my website, you may have read about how I spent considerable time building the Autonomous Homelab Sentinel. That project uses n8n for central orchestration, a web of hard-coded logic that, while effective, felt rigid. It was automation, but it wasn't agency.

The arrival of Moltbot offered a chance to move "up-stack." I repurposed an old 2012 Mac Mini to load Ubuntu Server and host "Jennifer", my resident agent. Unlike the node-based logic of n8n, Jennifer used a SOUL.md file for personality and a MEMORY.md file to maintain a persistent context across every interaction, turning my new mini server into the engine room for a full-blown personal OS.

The "Astonishingly Simple" 10X Assistant

The most striking thing about Jennifer wasn't just what she could do, but how astonishingly simple it was to achieve. With traditional automation, you spend hours debugging nodes and logic gates. Even using Claude Code with n8n MCP integration and skills, it was still extremely time-consuming. With Jennifer, very technical tasks were just completed without any bother.

Once Moltbot was set up—a process that included a simple Telegram integration, it didn't feel like working with a "chatbot" anymore. It felt like having a highly intelligent technical assistant on the payroll. I would simply tell her my goal, whether in the terminal, the web UI, or via Telegram, and the work happened behind the scenes:

Life Logistics: She integrated with and managed my to-do list, moving fluidly between conversation and structured data.
Email Synthesis: I gave her read-only access to my email inbox with instructions to identify anything that she felt she could help with. I soon noticed to-do items showing up in my list based on emails I’d received, without me lifting a finger.
Network Sentinel: She monitored my entire network via read-only APIs and alerted me proactively about any issues.
Proactive Engineering: Without me needing to spec out the "how," Jennifer built and deployed three functional apps on a local web server (accessible anywhere via Tailscale): an Interactive Homelab Dashboard, a Blogging Ideas Management tool, and a Project Management Kanban app.

I didn't have to write requirements. I didn't have to put together a PRD. I discussed what I wanted to do and Jennifer just went and did it. Each of these mini "Apps" is well-designed and just works. All of this, from first loading clawdbot to everything integrated was achieved in a few hours.

The Security Shadow: Autonomy vs. Vulnerability

However, this level of power is a double-edged sword. To make Jennifer a "10X" assistant, I needed to give her deep access to my local machine.

There are "gaping holes" in these early agentic frameworks that the community is still grappling with. By giving an AI shell access, you are creating a wide-open gateway into your network. If the framework is compromised, or the agent is "jailbroken" via prompt injection, you aren't just losing data, you have a persistent, autonomous actor working against you from inside your firewall.

The Economic Wall: Subscriptions vs. Reality

Even if you stomach the security risks, you hit the Agency Tax. Here is the cold reality: Using an agent like Moltbot 24/7 essentially violates the "personal use" spirit of Anthropic’s Pro or Max subscription plans. Those plans are built for human-speed chatting, not high-frequency machine orchestration. To stay above board, you have to move to a Pay-As-You-Go API billing model.

This is when the reality hits, hard. When you can literally see your token costs moving in real-time as your assistant endlessly polls your environment to "stay on top of everything," the dream becomes a budget item. To be a "Jarvis," Jennifer has to stay "alive," and that life is measured in millions of tokens.

You want to give your assistant the best brain possible, but using top-tier models (like Opus 4.5) for every background check is economically ruinous. I asked Jennifer to save money by dynamically up- and down-grading her model use, something Claude Code handles natively, and she enthusiastically agreed before promptly losing connection to all models entirely. Took about 20 mins to get that back up and running again.

Moving Up-Stack: The Current State

I’ve had to concede: until costs drop by an order of magnitude and security frameworks catch up, the 24/7 autonomous agent is a luxury I can’t justify. If an agent can't afford to "listen" proactively, it stops being an agent and goes back to being a high-latency chatbot.

I’ve shifted back from Agency to Tooling. I now use Claude Code on the Ubuntu Server, a human-triggered, high-reasoning engine. My monitors are back to being efficient, silent, and free cron jobs. It lacks Jennifer’s "human-speak," but it is Strategically Realistic.

The Jennifer experiment was a preview of the next age of product management. The tech is ready; the economics and security are simply waiting to catch up. I cannot wait until we get there.

The Case for Strategic Realism: Why You Shouldn't Abandon the Ideal (Even When It Hurts)

Steve James — Thu, 15 Jan 2026 00:00:00 GMT

The "Reality Gap" in Product Management

There is a recurring thread I see on Reddit and in private PM Slack communities that goes something like this:

"Marty Cagan, Teresa Torres, and Mik Kersten are have great ideas, but they don't live in the real world. They don't know what it's like to work in "my company" where Legal needs six weeks to approve a survey and the CEO prioritises features based on who he played golf with last weekend."

I understand this sentiment. Viscerally.

When you are fighting for inches of progress in an organisation that seems determined to move backwards, reading Empowered or Project to Product can feel less like inspiration and more like gaslighting. It creates a painful dissonance between the job you thought you had and the job you actually do.

But I want to challenge the conclusion that many Product Managers draw from this. The conclusion usually is: "These theories are idealistic nonsense, so I should just stop trying and accept that Product Management is just taking orders."

That is a mistake.

The authors aren't wrong. The models aren't broken. The problem is that we are underestimating the sheer psychological weight of organisational muscle memory. And more importantly, we are trying to sprint a marathon, burning ourselves out in the process.

We need a new approach. Not a new framework, but a philosophy of Strategic Realism.

The Science of Why It Hurts

It is important to understand that the frustration you feel isn't just professional annoyance; it is a psychological response to structural resistance.

Research into structural inertia theory suggests that organisations are designed to resist change. It is a feature, not a bug. Older organisations have high inertia because their structures and routines are solidified to ensure survival and reliability. When you try to introduce "continuous discovery" or "flow metrics" into this environment, you aren't just changing a process; you are fighting the organisation's survival instinct.

As I explored in my previous post on The Neuroscience of Resistance to Change, this resistance is deeply rooted in how human brains, and by extension, human organisations, process uncertainty. The "muscle memory" of the organisation will always pull it back to the status quo until a critical threshold of dissonance is reached.

This fight has a quantifiable cost:

71% of middle managers report feeling burned out, the highest of any employee group.
"Change fatigue" is a primary driver of disengagement, with nearly half of the global workforce reporting exhaustion severe enough to impact performance.
The failure rate for organisational change initiatives remains stubbornly high, estimated around 70% for decades.

So, if you feel like you are banging your head against a wall, it’s because you are. But that doesn't mean the wall is immovable. It just means you can't knock it down with your forehead.

Breaking the Binary: A New Philosophy

If Option A is "Naïve Idealism" (trying to force perfection immediately and burning out) and Option B is "Cynical Defeatism" (giving up and becoming a feature factory), we are left stuck in the middle without a map.

I suspect many of you are reading this and nodding along, perhaps a bit wearily. The dissonance between "what good looks like" and "what we have to do today" is the defining tension of our role right now.

That is why I'm stealing the term: Strategic Realism.

I am proposing this terminology because we need a way to describe this coping mechanism that doesn't sound like failure. "Compromising" sounds weak. "Giving up" sounds defeated. But Strategic Realism describes exactly what we are doing: making calculated, survival-based decisions to preserve our agency so we can fight another day.

It is a badge of honour, not a mark of shame.

Here is how to apply it:

1. Reframe Resistance as Muscle Memory

When a stakeholder demands a roadmap with fixed dates, they aren't necessarily being "anti-Agile." They are exhibiting muscle memory. They have spent 20 years believing in a system where dates equal certainty.

The Idealist fights the date and loses trust.
The Strategic Realist provides the date but wraps it in caveats, then works quietly to shorten the feedback loop so the date becomes irrelevant sooner.

2. The "Pocket of Air" Strategy

You cannot oxygenate the whole ocean, but you can create a bubble. Find one small area where you have autonomy, a specific feature, a single squad, a minor internal tool, and run it "the right way."

Do the discovery.
Measure the outcomes.
Protect the team. Creating this "pocket of air" proves that the model works without threatening the entire organisational immune system.

3. Accept "Good Enough" for Now

This is the hardest part for high achievers. Sometimes, a B-minus outcome is a victory if it keeps you in the game. If you manage to get one customer interview done this month, that is infinitely better than zero. If you manage to shift the roadmap from "features" to "problems" for just one quarter, take the win. Don't measure your success against a case study in a book. Measure it against where your organisation was six months ago.

Why You Must Not Give Up

The reason I push back against the Reddit cynicism is that the turning point is coming.

As I wrote previously, we are in a transition period. The old models of "project management" are objectively failing to deliver value in the Age of Software. The companies that cling to them will eventually face an existential crisis.

When that crisis hits, when the old muscle memory stops working, the organisation will panic. They will look for answers.

If you have given up and become a cynical ticket-mover, you won't be able to help. But if you have practiced Strategic Realism, if you have kept the flame of "true" Product Management alive in your pocket of air, waiting for the right moment, you will be the one they turn to.

Marty Cagan isn't wrong. He is just describing the destination. The terrain between here and there is swampy, foggy, and full of obstacles.

Your job isn't to pretend the swamp doesn't exist. Your job is to navigate it without drowning.

Take the pragmatic approach for your sanity, but keep your eyes on the ideal. The industry needs you to survive the journey.

The Innovation Paradox: Why You Can't Measure a Paradigm Shift

Steve James — Mon, 05 Jan 2026 00:00:00 GMT

In the modern Product Management universe, "Data-Driven" has become a religious mantra. We don't just use data; we rely on it to absolve us of responsibility. We want A/B tests, validation metrics, and market signals to tell us exactly what to build next.

It feels safe. It feels scientific. But there is a dangerous ceiling to this approach. A friend recently mentioned a thought that really stuck with me:

"No great achievement would have ever survived a cost-benefit analysis."

This isn't just about money; it’s about certainty. If you wait for the data to prove that a paradigm shift is definitely going to be a "good idea", you have already missed the boat. Data is a record of the past, and you cannot measure the performance of a product that creates a behaviour that doesn't exist yet.

The Trap of the Local Maximum

In product theory, we often talk about the Local Maximum.

Figure 1: Graph illustrating the local maximum versus the global maximum.

Imagine you are climbing a hill in the fog. You use your data (your altimeter) to make sure every step you take is "up". You optimise your path perfectly, eventually reaching the very top of the hill. The data says you can go no higher. You have reached the peak.

But because of the fog (the unknown future), you can’t see that you are standing on a small foothill, and right next to you is a massive mountain that soars ten times higher.

This is what a purely data-driven approach does. It is incredible at optimisation (climbing the current hill). It is terrible at innovation (going back down the valley to climb the next mountain).

Data could tell Blockbuster how to optimise late fees and store layouts.
Data could not tell them to abandon their stores and start a streaming service.

To move from the foothill to the mountain requires a temporary drop in metrics. It requires a leap of faith across the gap.

The Innovation Gap: Lessons from the Walkman

This is where the "Product Manager as Scientist" metaphor breaks down. In science, you observe existing phenomena. In innovation, you are trying to create new phenomena.

Consider the famous case of the Sony Walkman. In the late 1970s, market research was explicitly clear: consumers did not want a cassette player that couldn't record. It was a "broken" product. The data was accurate regarding current consumer expectations, but it was blind to future potential.

Akio Morita, Sony’s co-founder, famously ignored the research. He didn't have data; he had a conviction. As noted in innovation case studies, Morita believed that people loved music enough to want it with them everywhere, even if they couldn't articulate it yet. He made a unilateral decision to launch, stating, "The market research is in my head."

If he had waited for the data to validate his decision, the Walkman would never have existed.

Discovery is Not Taking Orders

Does this mean we should abandon research and just guess? Absolutely not.

The mistake many organisations make is treating research as a way to ask customers for solutions. If you ask a user what they want, they will describe a slightly better version of what they already have (a faster horse).

True Product Discovery isn't about gathering requirements; it's about uncovering unknown opportunities.

Don't ask: "Would you buy a portable cassette player that doesn't record?" (They will say no).
Do observe: How often do people struggle to listen to music while travelling? How does music change their mood? Where are the gaps in their current experience?

We must utilise qualitative research, observation, and deep discovery to understand the Problem Space intimately. But when it comes to the Solution Space, we cannot expect the customer to lead us. That is our job.

Conviction vs. Validation

This brings us to the core conflict in modern product leadership: the battle between Validation and Conviction.

Validation is the act of asking the world to confirm you are right. It is external. It seeks safety in numbers. It is the tool of the optimiser.

Conviction is the act of believing you are right before the world agrees. It is internal. It accepts the risk of being wrong in exchange for the chance to be transformative.

This is why true innovation is rare. It isn't a resource problem; it's a courage problem. When you lead with conviction, you strip away the protection of the spreadsheet. You are saying, "The data doesn't see this yet, but I do." You are putting your reputation on the line.

Leading Through the Fog

We shouldn't discard data. We just need to understand its role.

Data is for Navigation: It tells you where you are and how to optimise your current path.
Conviction is for Destination: It tells you where you need to go, even when the path isn't visible.

The best product leaders are Conviction-Led and Data-Informed. They prioritise research to understand the human struggle, but they do not expect research to design the solution.

To achieve something great, you have to accept that for a long time, the spreadsheet will look ugly. You have to be willing to walk down the hill to find the mountain.

Homework: Look at your current roadmap. Are you only building things that you can prove will work? If so, you might be optimising your way to a dead end.

The Critical Mass of Belief: Why We Biologically Resist the Shift to Product

Steve James — Wed, 31 Dec 2025 00:00:00 GMT

The Curse of the "Fixer"

I have a confession to make: I am incapable of looking at a broken process and walking away.

Throughout my career, this trait has been both a blessing and a curse. I’m the person who sees a software team drowning in bureaucracy, or a release cycle that takes six months for a one-line code change, and I have a pathological need to fix it. I’ve spent years working with organisations, trying to help them optimise their development practices, move away from rigid planning, and actually deliver value.

But more often than I care to admit, I’ve hit a brick wall.

I would present the data. I would show the cycle times. I would prove, mathematically, that the rigid need for predictability was damaging our ability to continuously deliver value. And yet, leadership would nod, agree, and then proceed to change absolutely nothing. It has always mystified me. Why, when presented with the solution to their problems, do smart people cling so tightly to the very systems that are sinking them?

The "Aha" Moment

I began to understand why in the most unexpected place.

I was recently reading Dan Brown’s new novel, The Secret of Secrets. I was trying to switch off from thinking about work, but I noticed an interesting passage that made me pause. Our protagonist, Robert Langdon, references the seminal work of philosopher Thomas Kuhn, The Structure of Scientific Revolutions.

The book references Kuhn's argument that a paradigm-altering change cannot truly take place until it has reached a "critical mass." Until that tipping point is reached, the old way of thinking doesn't just persist, it actively fights to survive to protect the status quo.

"Normal science, for example, often suppresses fundamental novelties because they are necessarily subversive of its basic commitments." — Thomas Kuhn

It changed my perspective on the resistance I’ve faced in boardrooms. It wasn't about logic, and it wasn't about data. It was about the psychology of belief.

The Guardian's Dilemma

It is easy to vilify resistant leaders as stubborn or self-interested, but that is rarely the truth. Most leaders aren't resisting because they are trying to protect their jobs, they are resisting because they are trying to protect the company.

These leaders view themselves as the guardians of the organisation's stability and growth. They are responsible for revenue, churn, and shareholder value. When we propose a shift to a Product Operating Model, moving from "fixed scope and dates" to "outcomes and experimentation", they don't hear "agility." They hear "risk."

This reaction is rooted in Loss Aversion. As explained by Wall Street Prep, the psychological pain of a potential loss (e.g., a dip in revenue or a failed release) is twice as powerful as the pleasure of an equivalent gain.

"Loss aversion refers to the tendency for people to prefer avoiding losses to acquiring equivalent gains... The pain of losing is psychologically about twice as powerful as the pleasure of gaining." — Behavioural Economics Principle

To a conscientious leader, the "known bad" of the current model feels safer than the "unknown good" of the new one. They stick to the old ways not out of ignorance, but out of a genuine, albeit misplaced, duty to prevent irreparable damage to the metrics that matter.

The Biology of "No"

This protective instinct is reinforced by our biology. Organisations are made of human beings, and despite what we'd like to believe, human beings are not data-driven machines, we are emotional, biological creatures.

According to research on the neuroscience of change by groups like Neurofied and Eighth Mile Consulting, the brain is wired to perceive uncertainty not as an intellectual challenge, but as a physical threat. When a leader is told to embrace "empiricism" (where the outcome isn't known upfront), it triggers the amygdala.

This flight-or-fight response bypasses the logic centres of the prefrontal cortex. The resistance you face is a biological safety mechanism. The brain favours the path of least resistance because it requires less caloric energy. This is Cognitive Inertia: the tendency to persist in established mental models simply because they are established.

Identity-Protective Cognition

Beyond the fear of damaging the company, there is also the subtle force of Identity-Protective Cognition (IPC).

Yale researcher Dan Kahan’s work suggests that we process information to protect our standing within our "tribe." If a Project Manager has spent 20 years mastering the art of the Gantt chart to give stakeholders a sense of certainty, they have built their professional identity on being "The Predictor."

To accept that predictability is impossible in software is to admit that their primary skill set is no longer relevant.

This triggers Identity-Protective Reasoning. The more data you provide contradicting their worldview, the more they may double down on their original belief. It’s not just about keeping a job; it’s about maintaining their sense of professional worth.

The Age of Software: A Crisis of Paradigms

This brings us back to Kuhn and the current state of our industry.

As Mik Kersten argues in Project to Product, and Carlota Perez maps in Technological Revolutions and Financial Capital, we are currently in the Turning Point of the Age of Software. The old paradigm is not the concept of a "project" itself, segmenting work is necessary. The old paradigm is the management model that demands we predict the scope, time, and cost of those segments before we have even begun.

But the anomalies are piling up. The failed digital transformations, the tech debt bankruptcies, and the inability of legacy banks to compete with digital natives, these are the cracks in the paradigm.

"Technological revolutions are not just about new technology; they are about the new common sense... The old way of doing things no longer works, and the new way is not yet fully understood." — Carlota Perez

The Inevitable End

So, does this biology of resistance mean I am giving up? Absolutely not.

I will not stop pushing for this change. I will continue to highlight the flaws in the old system, continue to provide the data, and continue to build the psychological safety required for leaders to let go of the illusion of control. I am committed to helping this industry reach that tipping point.

But I am also realistic. The "Product" way of thinking has been around for over two decades, yet we are still fighting for it. This suggests that the Critical Mass of belief Kuhn described has not yet been achieved. While I will keep fighting, history suggests that if we cannot win the argument, the paradigm shift will happen in one of two ways:

1. The Generational Shift (The Max Planck Route) As physicist Max Planck famously noted, "A new scientific truth does not triumph by convincing its opponents... but rather because its opponents eventually die, and a new generation grows up that is familiar with it." It is possible that we are simply waiting for the old guard to retire, making way for a generation that doesn't need to be convinced that the world is uncertain, because they've never known it to be anything else.

2. The Extinction Event (The Perez & Kersten Route) If organisations refuse to adapt, the market will make the decision for them. As Mik Kersten and Carlota Perez predict, we are exiting the installation phase of this technological revolution. The companies that cannot adopt the new "common sense" will simply fail, replaced by the digital natives and adaptive firms that embraced the new paradigm.

I’m betting on the change. Whether it happens through enlightenment, retirement, or extinction is up to us.

AI Will Supercharge Your Value Stream, But Only If You Fix Your Architecture First

Steve James — Tue, 16 Dec 2025 00:00:00 GMT

The New Promise of Flow

When we talk about AI in software delivery, the conversation usually stops at the IDE. We obsess over GitHub Copilot writing boilerplate code or LLMs drafting unit tests.

While these gains are real, they are local optimisations. Making a developer type 20% faster doesn't mean the customer gets value 20% sooner, especially if that code sits in a queue for two weeks waiting for approval.

The true revolution isn't in code generation; it is in Flow Intelligence.

AI is arguably the first technology capable of seeing the "invisible" parts of a value stream. It promises to predict bottlenecks, quantify delays, and orchestrate context across teams. But before we get carried away with the technology, we have to acknowledge an uncomfortable truth: we have known how to fix delivery for years, and we still haven't done it.

The Manuals We Ignored

The concept of the Value Stream is not new. It has been a staple of Lean manufacturing for decades and has been discussed in software circles for nearly as long.

The "gold standard" texts on this topic have been sitting on our bookshelves for years:

Mik Kersten’s Project to Product meticulously outlines why the project model fails in the Age of Software and provides the Flow Framework to fix it.
Donald Reinertsen’s The Principles of Product Development Flow provides the mathematical and economic foundation for managing queues, batch sizes, and Cost of Delay.

Despite this wealth of knowledge, true adoption has been poor. Most organisations still pay lip service to "Agile" while maintaining rigid, project-based governance structures.

What is new, however, is that AI has changed the calculus. Previously, implementing Flow required immense manual effort to gather data and police processes. Now, AI offers a game-changing capability to make Flow visible, predictive, and actionable, but only if the underlying streams are constructed correctly.

A Reminder: What Do We Mean by Value Streams and Flow?

Before we look at the potential AI advantage, it is worth reminding ourselves of the fundamentals.

In system development, a Value Stream is simply the sequence of activities required to deliver a product or service to a customer. It starts when a request is made (or a market opportunity is spotted) and ends only when value is realized in the hands of the user.

Flow is the measure of how easily work moves through that stream. In a healthy system:

Flow Velocity is high (value is delivered frequently).
Flow Efficiency is high (work isn't sitting in wait states).
Flow Load is managed (teams aren't drowning in WIP).

If you view your organisation through this lens, you stop managing people (resources) and start managing the work itself.

How AI Supercharges the Stream

If you have a defined Value Stream, AI becomes the ultimate accelerator.

1. Predictive Visibility

AI agents can ingest signals from disparate tools (Jira, GitHub, Slack) to construct a high-fidelity map of your stream. It can analyse Flow Load and historical throughput to predict bottlenecks before they form.

2. Dynamic Cost of Delay

Reinertsen teaches us that Cost of Delay is the key to economic decision-making. AI can finally make this practical. Instead of a theoretical debate, an AI agent can flag a stalled feature and quantify exactly how much potential revenue is being lost every hour it sits in "Ready for QA."

3. Context Orchestration

AI can carry the "context payload" through the stream. It can synthesize user research from the discovery phase directly into acceptance criteria for delivery, ensuring the "why" never gets lost as the work flows to the "how".

Why You Aren't Ready for AI-Driven Flow

You cannot supercharge a system that is fundamentally broken. The reason most organisations cannot leverage AI for flow optimisation isn't a lack of tools; it is a lack of autonomy and structural alignment.

The "Feature as a Project" Trap

Many organisations have stable teams, they don't fire and re-hire developers for every release. However, they still force these stable teams to operate within a Project Funding model.

Instead of funding a continuous stream of value, the business funds a "Feature Release" as a distinct project. As Kersten argues, this forces teams to batch work into massive, high-risk releases to justify the budget.

AI thrives on small batch sizes. If your governance model forces you to stockpile six months of code into one "Big Bang" release, AI cannot help you. It can predict the date you will miss, but it cannot fix the risk inherent in the batch size.

Architecture That Blocks Autonomy

Even if you have a "Product Team," do they actually own the full vertical slice of their value stream?

Often, a team is responsible for a product but lacks the architectural autonomy to deliver it. They might own the application code but depend on a central "Platform Team" for infrastructure, a "Data Team" for schema changes, and a "QA Team" for testing.

These dependencies are flow killers. If a team cannot deploy what they build without three other teams signing off, the value stream is severed. AI can visualize this blockage, but it cannot write the code to decouple your monolith.

What do we do then?

AI offers us the chance to finally solve the "visibility problem" in software delivery. It promises to turn the lights on in the factory, revealing exactly where value is leaking.

But technology amplifies the underlying habits of an organisation. If you apply AI to a well-architected value stream, you can achieve unprecedented speed and adaptability. If you apply it to a bureaucracy of gated releases and dependencies, you will just hit the wall faster.

The blueprints for success, written by Kersten and Reinertsen—have been available for years. AI is simply the urgent wake-up call to finally read them.

A Product Management Manifesto for the AI Era

Steve James — Mon, 08 Dec 2025 00:00:00 GMT

The Agile Manifesto Won the War on Delivery. Now We Need One for Value.

In 2001, seventeen software engineers met in a ski lodge in Utah and changed the world. The Agile Manifesto was a rebellion against heavy-handed documentation and rigid planning. It was necessary, it was effective, and it fundamentally solved the problem of how to build software.

But twenty-five years later, we face a different problem.

Agile frameworks have been so successful that "delivery" is often commoditised. We can ship code faster than ever. And now, with the arrival of Generative AI, the cost of producing code, content, and features is trending toward zero.

The bottleneck has shifted. The challenge today isn't can we build it? The challenge is should we build it?

A Return to Essence, Not a New Definition

To be clear: Prioritising value over volume has always been the true role of a Product Manager.

The best Product Managers have always understood that velocity is a metric for output, not outcomes. They knew that you could double your speed and still build something nobody wanted.

However, for the last two decades, the sheer difficulty of software delivery gave the industry an excuse. It was easy to drift into the role of "Backlog Administrator", spending days writing tickets, managing Jira, and "feeding the beast" of engineering. It was dysfunctional, but it was considered necessary work.

That excuse is now gone.

In the AI era, the "tactical ladder" of ticket writing, summarising interviews, and basic analysis is being automated. If your primary contribution is moving tickets around a board, an agent will soon do it faster and cheaper.

This manifesto is not about inventing a new job. It is about stripping away the administrative safety blanket and focusing on the only thing that has ever really mattered: judgement, strategy, and the definition of value.

The Shift: From Delivery to Value

We respect the original four values. They ground us in the reality of software creation:

Individuals and interactions over processes and tools

Working software over comprehensive documentation

Customer collaboration over contract negotiation

Responding to change over following a plan

But for Product Managers – especially those grappling with the speed and uncertainty of AI – we need to overlay a new set of priorities.

1. Customer & Outcome Learning over Backlog Administration

The most valuable time a PM spends is with the problem, not the ticket. We must get close to customers, frame the outcome, and decide how we will measure success. Backlog hygiene matters, but only as a means to an end. We must pair product outcomes (activation, retention, NRR) with delivery health (lead time, deployment frequency) to tell the complete story: "We shipped this, and it improved that". Refs: DORA, Accelerate

2. Validated Value over Shipped Features

A feature isn’t valuable because it shipped; it’s valuable because it changed user behaviour. In the AI era, this is doubly true. "Done" is no longer just code in production; it includes clean data, pre-agreed evaluation thresholds, and safety guardrails. We value the evidence of impact more than the volume of output. Refs: Model Cards, AI PRD

3. Continuous Collaboration over Periodic Alignment

Quarterly planning is too slow for modern product work. We value a steady, "little and often" rhythm of orchestration between customers, engineering, data science, and compliance. We prefer short discovery loops and frequent, transparent updates over "big bang" alignment meetings that are obsolete the moment they finish. Refs: INSPIRED, Project to Product

4. Adaptive Strategy over Fixed Roadmaps

Strategy is a set of bets, not a set of certainties. In a world of fat-tail risks and rapidly evolving AI capabilities, assumptions rot quickly. We treat the roadmap as a living hypothesis tied to outcomes, reserving the right to adjust scope and sequence as we learn. We value the agility of our direction more than the fidelity of our Gantt chart. Ref: Fat-tail uncertainty

12 Principles for Modern Product Management

If those are the values, how do we live them? Here are 12 principles to guide the daily work of a Product Manager in 2025.

1. Start with problems, frame outcomes. Begin every initiative by writing the customer problem in plain English. Identify who is affected and how you will measure "better." If you can't articulate the success metric – activation goes up, friction comes down – you aren't ready to build.

2. Flow and Value are inseparable. Engineering capability and commercial results are linked. Make this visible. Review DORA metrics (how fast we ship) alongside product KPIs (how much value we create). When delivery slows, outcomes suffer. You cannot manage one without the other.

3. Decide with evidence, not opinion. Replace long debates with small tests: a prototype, a concierge trial, a fake-door prompt. Agree on the thresholds before you test – know what "pass" and "fail" look like before the data comes in to prevent post-hoc rationalisation.

4. Shape small bets; sequence by risk. Break work into thin slices that prove the riskiest assumptions first. Whether it’s legal constraints, API latency, or user willingness-to-pay, tackle the "unknowns that can kill you" before polishing the UI.

5. Make AI "Definition of Done++". If you are building with AI, "done" means more than "it works on my machine." It requires documented data lineage, clear offline benchmarks, online success metrics, and safety guardrails. Operations should be boring – write the docs to keep them that way.

6. Organise by value streams, not projects. Fund durable teams that own a customer outcome end-to-end. Avoid short-lived projects that treat teams like mercenaries. When a team owns the long-term result, they make better long-term decisions.

7. Roadmaps tell the story, not the secrets. Use "Now/Next/Later" to signal direction without locking yourself into a lie. Stakeholders need to know the goals and the guardrails, but the specific features should emerge from discovery. Build trust through transparency, not false certainty.

8. Prioritise capabilities over ceremony. The best teams aren't faster because they have better meetings; they are faster because they invest in architecture, automated testing, and feature flags. Invest in the machinery that reduces the cost of experimentation.

9. Design for responsible impact. Users judge products by how they behave on their worst day. Bake privacy, safety, and explainability into discovery. Ask: "What is the worst plausible misuse of this?" and build the mitigation before you launch.

10. Make trade-offs explicit. Every choice has a cost. Write down the options considered (e.g., Build vs Buy, Fine-tune vs Prompt Engineering) and why you chose your path. A short "Decision Record" saves months of re-litigation later.

11. Teach the organisation to learn. Hold short reviews that focus on what we learned, not just what we did. Celebrate the null results – experiments that failed early save the company money. Share the learning, not just the launch.

12. Lead with clarity and kindness. PMs often lack formal authority, so our superpower is shared understanding. Be precise with language, generous with credit, and assume good intent. People follow leaders who make the complex feel manageable.

The Steward of "Why"

The original Agile Manifesto was a reaction to a time of scarcity, where shipping software was difficult, expensive, and slow. Its principles were designed to remove friction.

Today, we face a crisis of abundance. AI and modern tooling have made the act of creation easier than ever. The barriers to entry have collapsed. But this ease of creation brings a new danger: the ability to build the wrong things, faster and with more confidence than ever before.

This is the turning point for Product Management. The tactical ladder of ticket-writing and backlog administration is disappearing. What remains is the harder, more human work: judgement, orchestration, ethics, and the courage to stop a feature that adds no value.

We must stop pretending we can predict the future with Gantt charts and instead build the capability to respond to it. We must move from being the managers of schedules to the stewards of value.

That is the manifesto for the next age. Not just to work faster, but to work with purpose.

Agile Governance: Changing the Rules Without Losing Control

Steve James — Thu, 20 Nov 2025 00:00:00 GMT

When organisations talk about "going agile", governance is often the awkward topic that everyone sidesteps.

Leaders worry that if they relax traditional controls, work will go off the rails. Teams worry that if they keep the old controls, nothing will really change and the agile transformation will remain a set of rituals on top of business as usual.

The key point is simple:

Agile ways of working do not remove the need for governance.
They change how governance is designed so that it supports the flow of value rather than blocking it.

To get this right, you cannot just bolt an agile framework like Scrum or SAFe onto a traditional project management model. The real work is to apply agile principles to the way you govern. That means shaping governance around your own culture, risk appetite, and constraints. What works in one organisation may fail completely in another.

Agile Is More Than Scrum

I always feel that it's important to remind people that Agile doesn't necessarily mean Scrum. It is very easy to fall into the trap of thinking that "agile" means Scrum ceremonies and story points.

Agile refers to a set of principles and values. Frameworks like Scrum, Kanban, XP or SAFe are simply different ways of applying those principles in context.

Some of the core agile principles that matter for governance are:

Prioritising outcomes and customer value over internal activity
Welcoming change because learning is continuous
Delivering in small increments so that risk and uncertainty are reduced
Working in close collaboration with stakeholders and users
Building sustainable pace into the way teams deliver
Making decisions based on feedback from real use, not just plans

If governance is not aligned with these principles, it does not matter which framework you choose. You will still have friction.

What Governance Is Really For

Governance is not a dirty word. It is simply a structured way of making decisions and providing oversight.

At the delivery level, governance exists to answer questions like:

Are we investing in the right things
Are we delivering value at an acceptable pace
Are we managing risk in a responsible way
Are we learning and adapting as we go

These are good questions. The problem is not the intent. The problem is the way many organisations have answered them, usually through heavy, document driven processes and a stage gate mindset.

In traditional project management, that typically means:

Large batches of requirements defined up front
A detailed plan that is expected to hold for months or years
A sequence of gates where documents are reviewed and signed off
A strong focus on time, scope, and budget as success criteria

This model optimises for predictability in a stable environment. Modern digital product work lives in an unstable environment and is full of uncertainty. The old model clashes with that reality.

When Agile Principles Meet Traditional Governance

Agile delivery, regardless of framework, assumes that:

You will not know everything up front
You will learn as you go and adjust direction
Smaller, more frequent releases are safer than big bangs
Teams need local decision making authority
Feedback loops should be short and continuous

Traditional governance often assumes the opposite. The result is friction at several levels.

1. Up front certainty vs emergent learning

Traditional governance likes fixed scope, fixed dates, and fixed budgets presented early in the process. Agile principles accept that scope will evolve, that dates will be refined as we learn, and that investment should follow evidence of value.

When governance insists on a detailed, frozen plan, product teams end up writing documents for gate approval while the real decisions live in backlogs, roadmaps, and conversations.

2. Big, infrequent decisions vs continuous decision making

Stage gates condense decision making into occasional large events with big packs and formal sign offs.

Agile principles favour many small decisions. Teams refine what to build next, stakeholders react to real increments of working software or service, and priorities shift based on new information.

When you try to run both models in parallel, either the agile decisions are slowed down by the gate schedule, or the gates become rubber stamps that add cost without real control.

3. Centralised sign off vs empowered product ownership

Traditional approaches rely heavily on committees, steering groups, and central approval bodies.

Agile principles expect clear, empowered roles. A Product Owner or similar role is accountable for value and priorities. The delivery team is accountable for quality and execution. Decisions are made as close to the work as possible, within clear organisational constraints.

When every meaningful decision still has to go back to a board, you remove the very autonomy that agile depends on. The result is delay, frustration, and a lot of duplicated storytelling.

4. Slowing change vs flowing change

A classic risk response is to add more controls and approvals. If something went wrong in the past, the instinct is to require an extra sign off or another checklist.

Agile principles and modern engineering practice take a different approach. They reduce risk by making changes smaller, more frequent, and better tested. Automated tests, continuous integration, deployment pipelines, and peer review increase the safety of change without slowing it down.

If governance is still built on the idea that fewer, larger releases are safer, it will continually fight the engineering practices that actually reduce risk in complex systems.

What Agile Governance Looks Like In Practice

Agile governance is not the removal of control. It is the redesign of control so that it lives inside the way work is done and supports the flow of value.

Some common elements of agile aligned governance are:

1. Clear decision rights based on roles

Product roles (Product Owner, Product Manager) are given real authority over what gets built and in what order
Teams are given autonomy over how they deliver, within constraints on quality, security, and compliance
Risk, legal, finance, and other control functions have defined points of engagement, and their input is integrated into the regular rhythm of delivery rather than handled as a separate track

Everyone knows who decides what and on what basis.

2. Self organising teams with explicit boundaries

Self organisation does not mean chaos. It means teams are trusted to organise their work to meet agreed goals.

Governance defines the boundaries. For example:

Non negotiable standards for security, privacy, and architecture
Expectations for testing, documentation, and support
Release practices, such as the need for automated checks and rollback plans

Within those boundaries, teams have freedom. Governance is expressed as guardrails, not detailed instructions.

3. Built in inspection and adaptation

Agile teams use a variety of regular events and feedback loops, for example:

Short planning cycles
Frequent reviews or demos of working software or service
Regular retrospectives that look at process, not just output
Continuous monitoring of systems in production

These are not just team rituals. They are governance mechanisms. They create transparency and trigger adaptation. The presence and quality of these practices can be a formal governance concern.

4. Technical practices as controls

Definitions of Ready and Done, code review, automated tests, and deployment pipelines are all part of governance.

Instead of asking "Did you fill in this template", governance can ask:

Do we have a reliable way to prove that this change is safe
Can we deploy and roll back quickly
Is the environment observable enough that we can detect and respond to issues fast

These questions are more directly connected to risk than a sign off signature on a slide deck.

5. Governance that consumes real artefacts

In an agile setting, the most reliable sources of truth are living artefacts:

Product backlogs and roadmaps
Delivery metrics and flow measures
Pipeline status and test results
Observability dashboards and incident reports

Governance bodies should review these real artefacts, not separate status packs that summarise them weeks later. That keeps the conversation grounded in reality.

There Is No One Size Fits All Model

A critical point that often gets overlooked is that agile governance is not a single template you can copy from another organisation.

Agile principles are universal. How you realise them in your context is not.

Different organisations have different:

Cultures and levels of trust
Regulatory obligations and risk profiles
Legacy systems and technical constraints
Talent profiles and experience levels
Business models and strategic horizons

A bank with heavy regulatory requirements will design governance differently from a start up that is still exploring product market fit. Both can be agile in intent, but their controls will look very different.

Trying to copy a governance model from another organisation without adapting it to your own context usually fails. People go through the motions, but either risk is not managed properly or delivery slows to a crawl.

The right question is not "What is the best practice governance model". The right question is "How do we apply agile principles to governance for our organisation, given our culture and constraints".

Why Agile Governance Cannot Sit On Top Of Traditional Project Management

If agile principles are driving your delivery model, and traditional project management is driving your governance model, you are effectively running two operating systems at once.

They optimise for different outcomes:

Traditional project governance	Agile aligned governance
Optimises for adherence to plan	Optimises for learning, flow, and outcomes
Assumes scope can be known and fixed up front	Assumes scope will evolve as we learn
Prefers large, infrequent releases	Prefers small, frequent, reversible changes
Centralises decision making at formal gates	Distributes decision making within clear boundaries
Treats change as something to minimise	Treats change as normal and expected
Measures success by time, cost, and scope	Measures success by outcomes and impact

You can already see the conflicts:

The project plan says scope is fixed, the product team wants flexibility
The gate calendar says decisions happen monthly, the team wants to deploy daily
Governance asks for a big pack, the team points to live dashboards and backlogs

If you persist with this split brain model, you usually get the worst of both worlds. Delivery slows down, and the apparent control is more illusion than reality.

For agile governance to be effective, you have to let agile principles shape the governance model itself. You cannot treat governance as an untouched layer that sits above delivery.

Designing Governance For Flow Of Value

So how do you start to redesign governance in a way that supports agile delivery and still respects your responsibilities as an organisation

Here are some practical starting points.

1. Anchor governance in outcomes and value

Make business and customer outcomes the centre of the conversation. Ask:

What problems are we solving
How will we know if we have made things better
How quickly can we detect that something is not working

Let these questions guide investment and stop or pivot decisions.

2. Define clear accountabilities and decision rights

Map out who is accountable for:

Product direction and priorities
Delivery approach and quality
Risk management and compliance input
Financial oversight

Create simple, explicit rules about what teams can decide independently and what must be escalated or shared.

3. Use flow and quality metrics instead of document counts

Track things like:

Lead time from idea to production
Deployment frequency
Change failure rate and time to restore
Defect trends and incident patterns

These metrics give a much clearer picture of health than the number of documents produced.

4. Approve capabilities, not individual changes

Instead of signing off every release in a committee, invest in the capabilities that make changes safe:

Automated tests at multiple levels
Reliable pipelines
Strong peer review practices
Good observability

Once those are in place and monitored, allow teams to release within defined boundaries.

5. Tailor governance to your context and iterate

Start with a light, simple governance model that applies agile principles in your environment. Put it in writing, but keep it short and clear.

Then treat that model as a product:

Inspect it regularly
Gather feedback from teams, stakeholders, and control functions
Adapt it when you see bottlenecks, misunderstandings, or blind spots

Governance itself should evolve through experiments and learning, just like your products do.

Wow, you've made it this far?

So, to summarise...

Governance and agile delivery are not enemies. They are only in conflict when governance is frozen in a world that no longer exists.

To make governance work in an agile environment:

Think in terms of principles, not frameworks
Let those principles shape governance, not just delivery rituals
Tailor governance to your culture, your risks, and your constraints
Focus on enabling flow of value, not simply enforcing a plan

You cannot just "run agile" inside the walls of a traditional project management model and hope it all fits together. You need a governance approach that is just as thoughtfully designed and just as adaptive as the teams it aims to guide.

Product Management at the AI Turning Point: Why the Next Age Demands a New Role

Steve James — Sun, 28 Sep 2025 00:00:00 GMT

Intro

Every few decades, the world crosses a turning point, a moment when old paradigms no longer work and entire industries, organisational models, and roles are upended. In Technological Revolutions and Financial Capital, Carlota Perez maps how each major innovation wave (steam, railways, electricity, mass production, information) passes through an Installation Period, then a crisis-ridden Turning Point, and finally a Deployment Period when the new paradigm becomes the operating system of the economy. Companies that reorganise around the new paradigm survive, those that cling to the old rules fade. Buy Perez (Amazon UK)

Mik Kersten’s Project to Product extends this lens to modern software and beyond. He argues that at the turning point, organisations must rethink how they create value, not by adding a tool or two, but by overhauling how work is organised, funded, and measured. As he summarises (paraphrased), at the turning point, businesses either master the new means of production or become relics of the last age. Buy Kersten (Amazon UK)

It’s not surprising then that this is dominating conversation. A quick Google search for “Product Management and AI” returns tens of thousands of results across blogs, newsletters, and LinkedIn posts. The sheer volume of writing on the subject is a signal that product leaders everywhere are grappling with the implications of this paradigm shift, wondering what parts of their role will endure and which will be automated away.

AI is reshaping not just the products we build but the very nature of work itself. For Product Managers, this turning point demands a shift in mindset, moving beyond execution to strategy, ethics, and orchestration.

1) Turning points mean overhaul, not tweaks

Perez shows each revolution reconfigures the economy’s rules: capital flows, institutions, and the organisation of production all change. Companies that handled electricity like “just a new machine” lost to those that redesigned factories, supply chains, and skills for continuous, scalable power. The lesson carries forward: you don’t “install” a revolution, you rebuild around it.

Carlota Pérez puts it starkly:

“Societies are profoundly shaken and shaped by each technological revolution and, in turn, the technological potential is shaped and steered as a result of intense social, political and ideological confrontations and compromises.” — Carlota Pérez

This is the scale of what we face with AI. It is not a feature set, it is a societal shift that will demand structural responses.

2) Why the AI turning point is different

In prior ages, machines replaced muscle, digitisation replaced paper, and software scaled logic. AI now automates and scales analysis, drafting, synthesis, and decision support. The consequences:

Human-centred responsibilities become automatable. Research synthesis, competitive scans, early wireframes and copy, call triage, contract pre-review, first-line support, large chunks can be handled by AI agents.
Barriers to entry collapse. Small, AI-augmented teams can rival large incumbents on output speed and variety.
Feedback loops compress. AI cycles through ideate, build, measure faster, enabling “learn before launch” and rapid iteration.

Marty Cagan has noted the magnitude of this shift:

“What makes this discussion so hard is that almost every day things are changing … AI is a goldmine of opportunity. It’s also the biggest threat to how we do products.” — Marty Cagan

Paweł Huryn, in collaboration with Miqdad Jaffer (Product Lead @ OpenAI), cautions that most failures come from treating AI like a bolt-on feature rather than a system requiring new alignment, risk thinking and evaluation. Their widely shared AI PRD Template forces teams to consider business case, failure modes (for example hallucinations), and guardrails up front, because skipping AI-specific considerations is exactly how teams ship clever demos that don’t survive contact with reality.

“Given the hype around AI, many implement AI without a clear, justified business case… AI-specific considerations are often overlooked.” — Product Compass (AI PRD, Huryn & Jaffer)

Read it here: AI PRD Template — Product Compass

3) What changes for Product Managers (and why junior tasks vanish)

For years, junior PMs climbed by doing repeatable, tactical work:

Writing user stories and acceptance criteria
Summarising interviews and clustering insights
Competitive analysis and teardown decks
Drafting copy and low-fi wireframes
Pulling analytics and building dashboards

AI agents can already handle large portions of this. Huryn and Jaffer both emphasise AI literacy and AI intuition as the new baseline: understanding probabilistic behaviour, non-determinism, error analysis, and how to iterate prompts, retrieval and context.

Marty Cagan captures the enduring essence of PM work:

“The most important thing to remember about product management is that it’s not your job to make things happen, it’s your job to make sure things happen.” — Marty Cagan

This distinction is even sharper in the AI era. PMs must orchestrate, align, and decide, not write all the tickets themselves.

Lenny Rachitsky echoes this: AI will automate some PM activities, while amplifying others (influence, judgement, storytelling, prioritisation). PMs are well positioned if they move up-stack to what cannot be automated: Why PMs Are Best Positioned to Thrive in the AI Era

4) A possible playbook to adapt

There is no single blueprint for how Product Managers should respond to the AI era. Each organisation, product, and market context will be different. That said, there are some patterns that appear promising, and which could serve as a starting point for PMs looking to adapt. Think of this less as a prescriptive checklist and more as a set of suggestions that might help guide your approach.

Build AI literacy and “AI product sense”

You don’t need to become a data scientist, but you do need to understand how AI systems behave in practice. Concepts like embeddings, retrieval-augmented generation (RAG), fine-tuning, hallucination, and model drift should feel as familiar to you as “MVP” or “retention curve”. Without this fluency, it’s difficult to make credible product decisions or challenge technical trade-offs.

Example: Instead of just accepting an engineering estimate, a PM could prototype a feature themselves using a no-code tool like LangChain or a custom GPT. This hands-on experiment helps them understand whether the model reliably returns accurate results. They can then have an informed discussion with engineering about latency, cost per API call, and what failure handling will look like.

Use AI-augmented discovery (but validate with people)

Discovery is still about learning what customers need, but AI lets you accelerate early steps. Large language models can cluster survey responses, identify patterns in support tickets, and even suggest unmet needs. But this is a starting point, not a substitute for talking to customers.

Example: Imagine you receive 1,000 pieces of open-text feedback from beta users. Instead of reading them all, you feed them into an AI model to generate clusters: “speed issues”, “confusing navigation”, “missing integrations”. The PM then validates those categories by calling five users in each cluster to ask: “Tell me more about when this slows you down.” AI accelerates pattern recognition, humans still validate what matters most.

Redesign metrics for outcomes, not outputs

Traditional product metrics, velocity, number of tickets closed, or features shipped, are less meaningful when AI can generate outputs instantly. Instead, focus on outcomes: do customers trust the AI? Are they retaining? Do they get value faster?

Example: A team launches an AI-powered writing assistant. Instead of reporting “we shipped 12 features”, the PM tracks Time to Value (TTV), how quickly a new user produces their first useful draft. If TTV drops from 20 minutes to 3 minutes after redesigning onboarding, that’s evidence the AI is working. Similarly, tracking Time to Learn (TTL) (for example how fast the team can validate whether a new prompt-engineering change reduces hallucinations) becomes central to iteration speed.

Treat safety, ethics and explainability as product requirements

AI systems behave probabilistically, they can and will fail in unexpected ways. If PMs ignore ethical risks, bias, or explainability, they risk reputational damage or regulatory pushback. Treat these as first-class features in your roadmap, not afterthoughts.

Example: A PM working on an AI-powered recruitment tool might add a requirement: “The system must provide a plain-language rationale for why a candidate was shortlisted.” This could take the form of an “explain” button showing which parts of the CV matched the role description. Rather than being an extra feature, explainability is part of the core product requirement, because trust is the product.

Lead operating-model change (not just roadmaps)

AI is not just a product feature, it changes how organisations must work. PMs need to step into a leadership role, helping decide how teams are structured, how budgets are allocated, and where human oversight is required.

Example: At a SaaS company, a PM notices multiple teams are experimenting with their own AI chatbots, duplicating work. Instead of letting each team go their own way, the PM champions the creation of a shared AI platform team. This avoids duplicated infrastructure costs and ensures consistent guardrails for safety. In doing so, the PM is not just managing a feature, they are helping re-architect the operating model.

Become the narrative architect

With AI products, much of the PM’s influence comes from explaining why this AI matters, how it works, and what guardrails are in place. Engineers, executives, sales, and customers all need confidence. A good PM turns technical complexity into a story people can rally around.

Example: A PM launching an AI-powered financial advisor doesn’t just say “it uses GPT-4 for recommendations”. Instead, they frame it as: “Our AI helps you make decisions in seconds that used to take hours. It suggests options but always gives you the source so you can double-check. Think of it as a co-pilot, not a replacement for your judgement.” This framing reassures compliance teams, inspires marketing, and sets realistic customer expectations.

Final Thoughts

Perez and Kersten remind us that turning points don’t reward half-measures. You can’t “install” a technological revolution, you rebuild around it. AI isn’t just another wave inside the Age of Software, it’s a fundamental shift in how cognitive work is done. That means roles, operating models, and industries are up for renegotiation.

For Product Managers, the implication is clear:

The tactical ladder is disappearing as AI handles junior-level tasks.
The new job is judgement, orchestration, ethics, metrics, and organisational change.
The winners won’t be the teams that “adopt a framework”, but the ones that overhaul how they operate, anchored in outcomes and trust.

The turning point is here. Adapt now, or be reshaped by the next age.

References & Further Reading

Scaling Agile Without Losing Its Soul: A Product Leader’s Perspective on SAFe, Flow, and Value Streams

Steve James — Thu, 14 Aug 2025 00:00:00 GMT

Agile began with a promise: faster delivery, empowered teams, and relentless customer focus.

For those of us leading Product organisations, staying true to those ideals becomes exponentially harder as we scale — when small, nimble teams become hundreds of engineers and multiple interdependent systems.

The question becomes:
How do we scale without losing what made Agile powerful in the first place?

Many have tried to answer it.
Dean Leffingwell’s SAFe, Marty Cagan’s Product Operating Model, Mik Kersten’s Flow Framework — each offers valuable tools and insights.

But here's the truth:
There is no silver bullet. No framework, no model, no operating system can simply be "installed" to make scaling Agile effortless.

Instead, great Product organisations treat these frameworks as toolkits, not dogma.
We adapt concepts thoughtfully — never blindly applying "best practices" without understanding our context.

What SAFe Gets Right About Scaling Complex Systems

At its core, SAFe offers something important:
A way to think about scaling agility across large, high-risk, high-complexity environments where many teams must move in concert.

The Value Stream Layer in SAFe recognises that:

Multiple Agile Release Trains (ARTs) must be coordinated without losing autonomy.
Larger-scale solution development requires new roles (like Value Stream Engineers and Solution Management) to balance technical, process, and content leadership.
Solutions live inside complex contexts, and understanding that environment is essential.
Capabilities must be incrementally delivered, not deferred into giant waterfall releases.

These concepts — Value Streams, Solution Intent, Capabilities, customer collaboration across scale — are powerful ideas that every scaling organisation should grapple with, whether they formally "use SAFe" or not.

Where Scaling Efforts Go Wrong

The failure mode isn’t SAFe itself, or Flow, or any other model.
It’s how organisations misinterpret or misuse them:

Turning flexible frameworks into rigid bureaucratic process templates.
Prioritising governance and reporting over outcomes and learning.
Smothering teams under compliance paperwork instead of enabling autonomy.
Rebuilding the very command-and-control structures Agile was meant to replace — but now with new names.

When scaling frameworks are weaponised as control mechanisms, we don't just fail to scale Agile — we actively destroy it.

The tragedy is that organisations then blame the framework, when the real issue is a failure of mindset and leadership.

Leading Scaled Agile - Principles Over Prescriptions

As a Product leader, my focus is not on "installing" frameworks.
It’s on guiding scaling efforts based on enduring Agile principles:

1. Organise Around Value

Start with Value Streams — not project budgets, not team head-counts.
Understand how value flows to customers and align teams around that.

2. Empower Teams at Every Level

No matter how large the system, autonomy remains essential.
Roles like Solution Management or Value Stream Engineering exist to enable teams — not control them.

3. Deliver Incrementally

Whether you call them Features, Capabilities, or Flow Items, the mandate is the same:
Ship value early and often. Learn fast. Avoid big-bang integration disasters.

4. Preserve Flexibility, Then Converge

Scaling doesn’t mean fixing all requirements up front.
It means managing uncertainty thoughtfully — using modelling, analysis, and incremental validation to converge on the right solution over time.

5. Treat Customers and Suppliers as Full Members of the Value Stream

Customers aren't stakeholders "over there."
Suppliers aren't external vendors.
They are part of the end-to-end system and must be engaged continuously, not just at contract milestones.

Frameworks Are Maps, Not Mandates

SAFe offers a map.
So does Cagan's Product Operating Model.
So does Kersten's Flow Framework.

But maps are not a mandate.

Every organisation is different:
Its market, its technology, its history, its people.
Blindly applying someone else’s map will inevitably lead you into dead ends and wasted effort.

Instead, Product leadership at scale is about thoughtful adaptation:

Borrow the best concepts.
Respect the need for coordination at scale.
Preserve Agile’s core — customer focus, empowered teams, fast feedback, continuous learning.

And above all, stay skeptical of anything that claims to be a one-size-fits-all solution.

Agility at scale is possible.
But it’s not installed.
It’s built — deliberately, collaboratively, patiently, and always with eyes open.

The Innovation-Killing Myth of Predictability

Steve James — Thu, 24 Jul 2025 00:00:00 GMT

"It ain’t what you don’t know that gets you into trouble. It’s what you know for sure that just ain’t so."
—Mark Twain

For decades, businesses have treated predictability as the holy grail of software development. If we can just get the estimates right, tighten up our roadmaps, and control the delivery pipeline, we tell ourselves, innovation will follow. But the truth is more uncomfortable: our pursuit of certainty is actively killing innovation.

Why Innovation Matters More Than Ever

Chasing predictable practices leads to predictable outcomes. And while predictability might offer short-term comfort, it rarely delivers long-term growth. Innovation is the engine of differentiation. Without it, your product or service becomes commoditised, and when you’re only competing on price, you're in a race to the bottom.

As Scott Galloway puts it, there are three questions every product or service must answer to succeed:

1. Differentiation

Is the product or service truly differentiated? Does the strategy clear the hurdle of differentiation from our competitors?

If your offering isn’t meaningfully different, then why should customers choose you? Innovation is how you escape the sea of sameness. Differentiation isn’t just about having more features, it’s about positioning, usability, quality, emotional appeal, and solving a real pain better than anyone else. If you look like everyone else in your market, then price is all that’s left to compete on.

2. Relevance

Does anyone care about the differentiation?

Relevance is the filter that determines whether differentiation has any actual value. You might have an elegant solution, but if it doesn’t speak to your users' jobs-to-be-done, it's irrelevant. Relevance means being attuned to changing customer needs, market dynamics, and usage patterns. It’s not just about what you can do, it’s about what your users are trying to do.

3. Sustainability

What steps can we take to ensure that our products and services continue to be differentiated and relevant?

Markets evolve. Competitors copy. Technology shifts. A point of differentiation today can be table stakes tomorrow. Sustainability means building a repeatable system of learning and adapting. That includes investing in research, creating feedback loops, enabling rapid iteration, and maintaining organisational flexibility.

In short: without innovation, your brand is just a logo. With it, you build something people want, remember, and come back for.

The Cost of Certainty

Donald Reinertsen, in The Principles of Product Development Flow, puts it succinctly:

“We will create risk-averse development processes that strive to 'do it right the first time.' This risk aversion will drive innovation out of our development process.”

In trying to remove variability, we remove the very conditions required for discovery.

Variability Is the Work

Lean and Agile weren’t designed to enforce predictability, they were created to manage variability in service of value.

“Responding to change over following a plan.”
— Agile Manifesto

Marty Cagan is clear on this point in Inspired:

“No software development methodology, Agile, Lean, or otherwise, can ensure predictability.”

The goal is not to follow a plan perfectly. It’s to inspect, adapt, and course-correct based on what you learn. As Dan North puts it:

“Predictability is a trap. If you're measuring success by your ability to predict the future, you’re optimising for the wrong thing.”

Agile works not because it avoids variability, but because it embraces it.

Context Matters: The Cynefin Framework

Dave Snowden’s Cynefin Framework is a decision-making model for managing uncertainty. Most innovation work happens in the Complex domain — where cause and effect can only be known in retrospect.

When we force complex work into predictable containers, we choke innovation.

Local Optimisations, Global Dysfunction

In The Phoenix Project, Gene Kim and co-authors write:

“Local optimisations are the enemy of global flow.”

When teams are judged by outputs instead of outcomes, they optimise for activity, not value. As Marty Cagan reinforces in Empowered:

“Empowered product teams are assigned problems to solve... and are held accountable to results.”

Why We Still Chase Control

In Radical Focus, Christina Wodtke warns against turning OKRs into glorified task lists.

In Antifragile, Nassim Taleb argues:

“If you see fraud and do not say fraud, you are a fraud.”

We pretend to plan our way out of complexity. But planning is not foresight, it’s often just structured denial.

A Better Way Forward

Shift from Projects to Products Long-lived teams build context and evolve solutions.
Use Objectives, Not Deadlines Let teams focus on problems, not fixed outputs.
Create Slack for Innovation 100% utilisation destroys adaptability.
Trust Teams with the Why Empower them to find the right how.
Measure What Matters Focus on outcomes, not just throughput.
Match Approach to Domain Use Cynefin to know when to explore and when to execute.
Design for Differentiation, Relevance, and Sustainability Solve meaningful problems in ways that last.

Final Thoughts

Predictability isn’t inherently bad, but it must be applied with care. Roadmaps, OKRs, and deadlines are tools, not doctrine. When everything is optimised for predictability, innovation withers.

The real work of building great products lies in navigating uncertainty, not eliminating it.

Because if you can predict it, you’re probably not innovating.

“Innovation is the child of freedom and the parent of growth.”
—William J. McDonough

Understanding Before Arguing: Why Belief Without Evidence Is So Dangerous

Steve James — Sun, 15 Jun 2025 00:00:00 GMT

Recently, I came across a quote from Charlie Munger that has been stuck in my head:

“I never allow myself to have an opinion on anything that I don’t know the other side’s argument better than they do.”

At first glance, it seems like common sense. But the more I sit with it, the more I realise how rare, and radical, it actually is.

We live in a time where belief, not evidence, not nuance, not data, often seems to be the primary currency for truth. As long as something aligns with what we already think, it feels “right.” The problem is, that kind of thinking doesn’t leave a lot of room for understanding. It turns political identity into a kind of faith system: I believe, therefore it is true. You don’t, therefore you are wrong, or worse, dangerous.

What Munger’s talking about is a kind of intellectual discipline, or maybe just humility. It’s the idea that before you criticise someone’s viewpoint, you should be able to explain it fairly and fully, in a way they would recognise. It’s not about agreeing with them. It’s about giving their position the respect of genuine attention.

Why This Matters in How We Argue

I think about this every time I scroll through political debates online. So many arguments are just people yelling past each other, beating up a cartoon version of the other side. We call this strawmanning, the act of misrepresenting someone’s position so it’s easier to attack. Instead of engaging with what someone actually believes, we invent a flimsy imitation and tear that down.

Closely related, and just as common, is the misuse of a rhetorical tool called reductio ad absurdum. In its proper form, it’s a logical technique used to disprove a point by showing that it leads to an absurd or contradictory outcome (Britanica). But in everyday argument, it’s often weaponised: someone exaggerates their opponent’s view to an extreme or ridiculous scenario, and then critiques that instead.

For example:

“You want to regulate car emissions? What’s next — banning all cars and making us walk to work?”

At that point, the original argument isn’t even in the room anymore. It’s not dialogue, it’s mockery dressed up as logic.

What we need more of is the opposite: steelmanning, engaging with the strongest, most thoughtful version of someone else’s view, even if (especially if) we disagree with it.

The Reverence and Risk of Belief

Belief and faith are often seen as admirable, especially in religious traditions. They signal conviction, loyalty, identity. But that admiration can come at a cost when belief is decoupled from evidence.

Philosopher W.K. Clifford made this point forcefully:

“It is wrong always, everywhere, and for anyone, to believe anything upon insufficient evidence.”
(Clifford's Principle)

Clifford argued that belief isn’t just a private matter, it shapes our actions and our society. When we build entire ideologies, policies, or communities around ideas we haven't truly tested, we open the door to harm.

And yet, the opposite view exists too. William James argued that sometimes belief without evidence is necessary, like believing in the possibility of love before it’s proven, or the success of a risky venture. He called this “The Will to Believe” (source).

Both positions reflect the complexity of belief. But in public discourse, when beliefs calcify into identity, when “I believe” becomes a substitute for “I know”, the danger increases.

This Isn’t Just Politics

It’s tempting to think this only applies to politics, Labour vs Tory, Democrat vs Republican, Leave vs Remain. But the same dynamic shows up everywhere:

iOS vs Android
Religion vs Atheism
Meat eaters vs Vegans
Pro-vaccine vs Vaccine-sceptic
Climate change believers vs deniers
Free speech absolutists vs content moderation advocates
Science vs spirituality
Parenting styles: Gentle vs Authoritative

These debates often escalate not because the issues are unsolvable, but because each side reduces the other to caricature. We don’t engage, we retreat into tribes.

But what if, instead of digging in, we paused and asked: Could I explain their point of view, fairly, generously, clearly? That’s the kind of thinking Munger was getting at.

Why We Struggle: The Psychology of Stubborn Belief

Why is this so hard?

Because we’re not nearly as rational as we think we are. Multiple psychological effects explain why we hold tight to beliefs, even when they’re wrong.

Motivated reasoning: We unconsciously distort facts to support what we already believe. As psychologist Peter Ditto puts it, this is “the emotional tail wagging the rational dog.” (Maine Public)
Confirmation bias: We selectively seek and favour information that confirms our views and avoid what challenges them.
Belief perseverance: Even after being shown that a belief is wrong, we often continue to cling to it. The original belief becomes anchored in our sense of identity.

And at a deeper level, Moral Foundations Theory, developed by Jonathan Haidt, shows that people weight moral values like fairness, loyalty, authority, or sanctity very differently, which makes certain arguments persuasive to one group and meaningless to another. (moralfoundations.org)

A Simple Challenge

So what do we do with all this?

Here’s a challenge I’m trying to live by, and one I think Charlie Munger would appreciate:

Pause before jumping into disagreement.
Steelman the opposing view: articulate it as clearly and charitably as you can.
Ask yourself: “Would they agree with my summary of their position?”
Only then — if you still feel the need — share your response.

Because understanding doesn’t weaken your argument. It strengthens your humanity.

And maybe, if more of us did this, we’d stop yelling across divides and start building bridges across them.

How Security and Data Governance Improve User Experience

Steve James — Wed, 07 May 2025 00:00:00 GMT

We live in an age where our personal information is increasingly digitised. Every click, every search, and every online purchase leaves behind a digital footprint. As a Product Manager, my main goal is not just to deliver feature-rich applications but to ensure that these applications are safe, trustworthy, and provide an excellent user experience. One might assume that optimising for security and data governance might hinder the user experience. Contrary to that belief, I'm here to argue that enhancing security and data governance is, in fact, a significant booster to the overall user experience.

Building Trust The most apparent benefit of security optimisation is trust. When users realise that a product or service prioritises their security, they are more likely to trust it. Trust, in the digital realm, is not just a nice-to-have; it's a necessity. A product that loses its users' trust might as well lose its users.
Smooth Experience with Fewer Interruptions Imagine a scenario where an application gets compromised. This would lead to downtimes, emergency maintenance, or even data breaches. Each of these outcomes disrupts the user experience. By investing in security from the get-go, we ensure a seamless experience for our users, free from unexpected disruptions.
Empowerment through Transparency Data governance isn't just about keeping user data safe; it's about being transparent about how this data is used. By giving users a clear insight into what data is collected, why it's collected, and how it's used, we empower them. An empowered user is an engaged user, and an engaged user is the key to enhanced user experience.
Tailored User Experiences without the Creepiness One of the main reasons companies collect data is to provide a more tailored experience to their users. However, there's a fine line between personalisation and intrusiveness. Proper data governance ensures that personalisation is done right, without crossing the boundaries of user comfort.
Compliance Equals Access In today's global market, regulatory compliance is not optional. Regulations like GDPR in Europe or CCPA in California have strict requirements regarding data privacy and security. By adhering to these regulations, we not only avoid hefty fines but ensure that our product is accessible to a broader audience without regional restrictions.
Future-proofing the Business Security threats and data misuse are not static; they evolve. By building a foundation based on robust security and data governance, we are preparing our product for the challenges of the future. This ensures that our users continue to have a consistent and safe experience even as digital landscapes change.
User Feedback and Continuous Improvement When users feel safe and trust a platform, they are more likely to provide genuine feedback. This feedback is invaluable. It allows us to continuously refine and improve, ensuring the user experience keeps getting better.

In conclusion, security and data governance are not mere checkboxes in the product development process. They are crucial pillars that hold the weight of user trust and engagement. By integrating them into the core of our products, we are not sacrificing user experience; we are elevating it.

Remember, in the end, it's not just about building a product; it's about building a relationship with our users. And like any successful relationship, it must be built on trust, transparency, and mutual respect.

Embracing Iterative Requirement Development in Product Ownership

Steve James — Fri, 25 Apr 2025 00:00:00 GMT

"The definition of insanity is doing the same thing over and over again and expecting different results."
— Attributed to Einstein

The Myth of Up-Front Certainty

As a Product Owner, one of the most common questions I get is:

“How do you manage requirement development in an Agile way?”

It’s a fair question — traditional product requirement processes often assume you can capture everything upfront, get it signed off, and hand it over to developers like a blueprint for a house.

This old-school approach typically follows these steps:

Plan the analysis approach
Elicit requirements
Analyse and design the solution
Prioritise the requirements
Get approval and sign-off

Once finalized, these requirements are passed along to the development team, with the expectation that the resulting product will match the original vision.

But here's the problem: this process almost never works in software. According to the CHAOS Report by the Standish Group, projects that follow this linear model consistently underperform — and many outright fail.

What Experience Has Taught Us

Years of building software products have made a few truths crystal clear:

You can’t know everything at the beginning.
Requirements will change.
Written requirements are always subject to interpretation.

Despite our desire for predictability, software development is inherently complex and uncertain. Efforts to impose rigid plans often backfire, introducing waste, rework, and stakeholder frustration.

Instead of trying to predict the future, successful product teams embrace empiricism — the practice of making decisions based on what is, rather than what we hope will happen.

“Empiricism means working in a fact-based, experience-based, and evidence-based manner.”
— Scrum.org: The Three Pillars of Empiricism

Predictability vs Progress

In traditional settings, predictability is prized. But demanding it in uncertain environments leads to a dangerous illusion of control.

“We work in an uncertain world, and our main goal in pursuing agility is to confront the unknown… Pursuing predictability causes us to lay a veneer of fiction over the real world.”
— Scrum.org: Escaping the Predictability Trap

This is why Agile frameworks such as Scrum promote short feedback loops, continuous learning, and transparent decision-making. It allows Product Owners to continuously refine and reprioritize based on:

Actual user feedback
Technical insights
Market signals

In this model, the product backlog is a living document, not a requirements bible.

Iterative Requirement Development in Practice

So how should Product Owners approach requirements in this dynamic environment?

Instead of detailed specs for a year’s worth of work, aim for just enough clarity — at just the right time.

A healthy product backlog typically contains:

2–3 sprints worth of refined stories
Lightly defined epics for the next quarter
High-level ideas for future exploration

This model aligns with the Cone of Uncertainty — a concept that acknowledges we know the least at the beginning of a project and gradually gain clarity through discovery and iteration.

By delaying commitment until the last responsible moment, teams can:

Minimize waste
Reduce rework
Respond quickly to change

The Role of Discovery and Up-Front Analysis

Does this mean we abandon up-front planning altogether?

Absolutely not.

Good Product Ownership requires thoughtful discovery work — especially at the outset of new initiatives. The key is to balance early analysis with flexibility.

Marty Cagan’s concept of Product Discovery offers a powerful framework for this. Discovery involves defining the right problems to solve, validating solutions with customers, and aligning stakeholders — all before investing in development.

“The purpose of product discovery is to quickly separate the good ideas from the bad. We don’t want to build things customers won’t use or that won’t work for the business.”
— Marty Cagan, SVPG

Not every part of a product requires the same depth of analysis. Skilled Product Owners must evaluate:

What’s known and stable (requiring less detail)
What’s unknown or risky (requiring deeper investigation)

This risk-based approach to discovery ensures we invest time where it matters most.

Final Thoughts: Empiricism is Empowering

When Product Owners embrace iterative, empirical requirement development, they create the conditions for:

Better alignment with end users
More resilient roadmaps
Higher-value outcomes
Stronger partnerships with development teams

We stop pretending we know the future and instead build the capability to respond to it.

As Agile manifesto co-author Mike Cohn puts it:

“We want to delay decisions until the last responsible moment to preserve flexibility and avoid unnecessary work.”

If you're still clinging to thick BRDs and waterfall mindsets in your product practice, it's time to evolve.

Additional Reading

Book Review: The Principles of Product Development Flow – A Masterclass in Modern Software Development Thinking

Steve James — Thu, 24 Apr 2025 00:00:00 GMT

As an experienced Product Manager who draws inspiration from thought leaders like Mik Kersten (Project to Product), Marty Cagan (Inspired, Empowered), and Gene Kim (The Phoenix Project, The Unicorn Project), Donald Reinertsen’s The Principles of Product Development Flow feels like a deep intellectual homecoming.

In a world where agile slogans are sometimes thrown around without understanding the economic drivers behind them, Reinertsen delivers a technical and philosophical tour de force. This is not a beginner's guide or a book of platitudes. It's a dense, deliberate, and at times, uncompromising exploration of how to truly think about product development — as a system of flows constrained by economics, physics, and human behavior.

What Stands Out

✅ Economics First Reinertsen relentlessly emphasizes that every product development decision should be economically motivated. Whether deciding to ship with known bugs or balancing feature creep, he demands that we quantify the Cost of Delay.
In the same spirit that Mik Kersten ties software delivery to business outcomes in Project to Product, Reinertsen reminds us: "Proxy metrics are dangerous distractions. Profitability is king."

✅ Queues are the Hidden Killer Just like WIP bottlenecks in SAFe or the flow disruptions that Kersten describes, Reinertsen lays bare that unmanaged queues — unseen and unmeasured — are what sabotage cycle time, morale, and innovation.
His industrial-grade treatment of queueing theory was a wake-up call for me: traditional timeline management is not just ineffective, it’s actively harmful if queues remain invisible.

✅ Variability is Not the Enemy Much like Gene Kim’s "local optimizations are the enemy of global flow" argument in The Phoenix Project, Reinertsen’s nuanced view of variability resonates deeply.
In software, uncertainty is not a bug — it's the nature of the work. Instead of driving it out (as Six Sigma urges), we must manage and even harness variability economically.

✅ The Power of Small Batches and Fast Feedback For any PM who loves the Lean Startup ethos but sometimes feels it's been dumbed down to "fail fast" memes, this book restores rigor. The push for smaller batch sizes and faster feedback isn’t just a process trick — it’s a calculated economic optimization.

✅ Decentralization with Purpose Echoing the autonomy principles Marty Cagan champions for empowered teams in Empowered, Reinertsen shows how decentralized decision-making — when combined with clear economic frameworks — is the only way to survive in high-variability environments.

Where It Might Challenge Modern PMs

No Handholding This is not a "feel-good" book. Reinertsen assumes you have an appetite for systems thinking, applied mathematics, and a fair bit of discomfort as he dismantles traditional PM orthodoxies.
Economic Thinking is Hard He demands more from PMs than many are used to. It’s no longer enough to ask “Is this feature valuable?” — you must ask “How much is a week of delay on this feature worth to our lifecycle profits?”
It’s an intellectually heavier lift, but an essential one.

Final Verdict: Essential for Serious Product Thinkers

If you’re serious about mastering product management in the age of continuous delivery, platform thinking, and relentless business alignment, The Principles of Product Development Flow deserves a place alongside Inspired, Project to Product, and The Phoenix Project on your shelf.

More than anything, Reinertsen empowers product leaders to think for themselves — to ditch proxy metrics, the worship of efficiency, and one-size-fits-all agile prescriptions — and design systems that deliver value economically, efficiently, and sustainably.

⭐️⭐️⭐️⭐️⭐️ (5/5)

Key Takeaways

Quantify the Cost of Delay - If you can only measure one thing, measure Cost of Delay. It’s the "golden key" to prioritization and flow optimization.
Prioritize Managing Queues Over Managing Timelines - Unseen work-in-process queues create hidden delays and risk. Control them, and cycle time will take care of itself.
Variability Can Be an Asset - Variability fuels innovation. Your goal isn’t to eliminate it, but to manage it economically.
Small Batch Sizes and Fast Feedback Are Non-Negotiable - Move in small, economic increments to accelerate learning, reduce risk, and improve adaptability.
Push Decisions Down, But Align on Outcomes - Empower teams with decentralized control frameworks tied tightly to economic priorities, not just project plans.

References

The Principles of Product Development Flow, Donald Reinertsen, 2009
Mik Kersten, Project to Product
Marty Cagan, Inspired / Empowered
Gene Kim, The Phoenix Project / The Unicorn Project

From Project to Product: A Paradigm Shift for the Age of Software

Steve James — Sun, 20 Apr 2025 00:00:00 GMT

By Mik Kersten | Summary by STVPJ

As we progress through the digital era, the rules of business are rapidly changing. Companies that once dominated their industries are finding themselves outpaced by nimble startups and tech giants who’ve mastered the art of software delivery. In his groundbreaking book, Project to Product, Mik Kersten reveals why traditional project-based management models are failing and offers a new framework—the Flow Framework—to help organizations survive and thrive in the Age of Software.

The Premise: Why the Project Model is Obsolete

Kersten opens the book with a compelling argument: the project-based structures that powered businesses through the Age of Mass Production are no longer fit for today’s software-driven world. Instead of treating software delivery as a series of finite, scope-bound projects, companies must adopt a product mindset—focusing on long-term value streams and continuous delivery.

"We're managing software like we're still building bridges." — Mik Kersten

The old model, built for predictability and control, stifles innovation. It creates silos, breaks accountability, and fails to account for the non-linear, iterative nature of modern software development.

The Flow Framework: A New Way to Think About Work

At the core of the book is the Flow Framework, Kersten’s blueprint for transforming how organizations manage software delivery. It introduces four types of work items that flow through software value streams:

Features – User-driven enhancements
Defects – Quality issues
Risks – Security, compliance, or other risk-related work
Debts – Technical or infrastructure improvements

These “Flow Items” are the units of measurement in the Flow Framework. Organizations then track Flow Metrics like:

Flow Velocity - Flow Efficiency - Flow Time - Flow Load

The ultimate goal? Aligning IT with business outcomes in a measurable, real-time way.

The Age of Software and the Turning Point

Kersten situates this shift in the broader context of Carlota Perez’s theory of technological revolutions. We are now entering the Deployment Period of the Age of Software—the point where companies must either adapt to software-driven ways of working or fade into irrelevance.

Companies like Nokia failed, not because they didn’t try to adapt, but because they measured success with flawed metrics—like Agile adoption rates—rather than true value delivery.

Value Streams: The Heart of the Product Model

Kersten urges organizations to orient around value streams, not departments or projects. A value stream represents the end-to-end activities required to deliver value to the customer. Rather than shifting people between projects, companies should:

Fund long-lived product teams, not one-off projects
Track flow, not just budget and timeline
Integrate tooling into a seamless Value Stream Network

Case Studies and Epiphanies

Throughout the book, Kersten shares stories of transformation—both failed and successful. He reflects on lessons from:

The BMW Leipzig plant (a marvel of lean production)
Nokia’s Agile rollout
A global bank’s $1B transformation gone sideways

He outlines three epiphanies from his journey:

Architecture must align to value stream, not the reverse
Disconnected toolchains destroy productivity 3. Software is not a pipeline—it’s a collaborative network

Why Most Digital Transformations Fail

Many organizations fail despite well-funded Agile/DevOps efforts. Kersten identifies three core issues:

Focus on activities (e.g., Agile training), not outcomes - Lack of business–IT alignment - Inability to see or measure value flow

The solution is not more tooling or process frameworks—it’s a shift in mindset and management logic.

Beyond the Turning Point

Kersten finishes with a clear call to action. The Age of Software is here, and those who fail to adapt risk becoming obsolete. But those who embrace product thinking and build connected Value Stream Networks will:

Deliver better customer outcomes
Make smarter investments
Attract and retain top talent
Regain competitive advantage

TL;DR: Key Takeaways

The project model is broken for modern software delivery.
The Flow Framework replaces it with measurable, value-centric metrics.
Shift focus from projects to products and value streams.
Digital transformation fails without visibility into Flow.
The Age of Software demands a new management paradigm.

Inspired by Mik Kersten’s Project to Product. For more, visit projecttoproduct.org.

What Lean Really Means

Steve James — Sat, 19 Apr 2025 00:00:00 GMT

Lean Thinking: What It Is—and What It Isn’t

Lean is one of the most misunderstood concepts in modern business. It’s often mistaken for a buzzword, a quick-fix method, or worse—just another cost-cutting exercise.

Let’s start by setting the record straight.

Lean is not:

A miracle cure for all business problems
A synonym for “doing more with less” at any cost
A new religion, fad, or acronym

Lean is a long-term, strategic approach to improving how organizations work—one that puts customer value at the centre and treats anything that doesn’t contribute to that value (or the safety, quality, and stability of the organization) as waste.

Lean is a philosophy of continuous improvement that eliminates waste, empowers people, and focuses relentlessly on delivering value to customers.

The Three Pillars of Lean

All credible definitions of Lean align around three core goals:

Deliver better value to customers 2. Do more with less 3. Ensure quality and long-term sustainability are never compromised

Organizations that embrace Lean behave in consistent, observable ways:

Everyone understands what customers truly value
Continuous improvement is part of daily operations
Respect for people is central
The organization aligns long-term thinking with short-term actions
Lean becomes part of the culture—not just a program

The Toyota Way: 14 Principles of Lean Excellence

Jeffrey Liker’s The Toyota Way breaks Lean into 14 principles, grouped into four themes:

1. Philosophy as a Foundation

1. Make decisions based on long-term thinking, not short-term gains.

2. The Right Process Produces the Right Results

2. Create continuous flow to surface problems
3. Use pull systems to avoid overproduction
4. Level out workloads (heijunka)
5. Stop to fix problems and get quality right the first time
6. Standardize tasks as the basis for improvement and empowerment
7. Use visual controls so no problems stay hidden
8. Use only reliable, thoroughly tested technology that supports people

3. Add Value by Developing People and Partners

9. Grow leaders who understand and teach the philosophy
10. Build great teams that align with company goals
11. Respect partners and suppliers—help them improve too

4. Continuous Problem Solving Drives Learning

12. Go see for yourself (genchi genbutsu) to understand situations
13. Make decisions slowly by consensus, implement them quickly
14. Become a learning organization through reflection and improvement

The 8 Wastes of Lean

Lean identifies 8 common forms of waste, present in both manufacturing and service sectors:

Waiting – Delays between process steps
Overproduction – Doing more than needed
Rework – Fixing mistakes or defects
Motion – Unnecessary movement of people
Transport – Unneeded movement of materials or information
Processing – Extra work that doesn’t add value
Inventory – Excess stock or queued activity
Talent – Underusing people's skills and creativity

Final Thoughts: Lean as a Culture, Not a Checklist

Lean is not a toolset. It’s a mindset.
It’s not a one-time transformation—it’s a way of thinking.

It asks us to:

Empower people
Eliminate waste
Pursue purpose
Continuously improve

It’s hard work. But for organizations willing to commit, Lean offers a path to more meaningful value, healthier teams, and resilient long-term success.

Agile Isn’t What You Think

Steve James — Thu, 20 Mar 2025 00:00:00 GMT

The Pitfalls of One-Size-Fits-All Agile: Why Most Agile Implementations Miss the Point

Agile is everywhere. Or so it seems.

Walk into nearly any modern software organization and you’ll be told they “do agile.” There are stand-ups. Sprints. Story points. Jira boards. Retros. There's probably a release train or two. On paper, it looks like agility. But in reality, it's anything but.

Despite widespread adoption of agile methodologies and frameworks, most companies still fail to embody what agile truly stands for. What we’re seeing isn’t agile—it’s agile theater. A performance of rituals devoid of the mindset and principles that actually make agile work.

Why? Because no one wants to talk about the real reason agile fails in big organizations:

True agility means you won’t know exactly how long something will take, or how much it will cost, before you begin.

That’s the uncomfortable truth. And it's the one truth that almost no one in a boardroom wants to hear.

The Illusion of Predictability

Let’s be clear: agile is not a framework. It’s a mindset.

When the Agile Manifesto was written, it was a response to the rigidity of traditional software development—a world in which long-term plans rarely survived contact with reality. Agile promised something better: flexibility, collaboration, and continuous delivery of value.

But large organizations have tried to retrofit agile into the very structures it was meant to replace. They’ve imposed predictability on top of adaptability. They've institutionalized delivery cadences and backlogs and roadmaps and burndown charts in an effort to make agile feel safe and familiar. In doing so, they've fundamentally undermined it.

Frameworks like SAFe and LeSS may offer a sense of control at scale, but they often do so by compromising on the core agile principle: embrace uncertainty.

Because at the heart of true agility is a simple reality: you are building something new, and you cannot plan certainty into discovery.

Why Agile Needs Uncertainty

The Agile Manifesto begins with individuals and interactions over processes and tools. It emphasizes working software over comprehensive documentation. It encourages customer collaboration over contract negotiation. And above all, it values responding to change over following a plan.

These are not project management practices. These are philosophical commitments.

And if you take them seriously, then it follows that:

You cannot guarantee fixed scopes, timelines, or budgets up front.
You will learn things mid-flight that will force you to change direction.
You must give teams the autonomy to solve problems in ways that aren't fully spec’d before they begin.

This is what makes agile work—and what makes it so uncomfortable for command-and-control management styles.

Velocity Is Not Value

In the absence of real agility, organizations default to what they can measure. And the easiest thing to measure is velocity—how many story points a team completes per sprint. But velocity is a local optimization. It tells you how busy your team is, not how effective they are.

Velocity is about output. Agile is about outcomes.

You can double your team’s velocity and still build something no one wants. You can hit every sprint goal and still miss the market. You can have high throughput and zero impact.

This is where most agile transformations break down: they prioritize activity over value.

So, What Should We Measure?

If agile is about delivering value, then value must be what we measure.

But value isn’t a simple thing. It’s not a number you pull out of Jira. It’s not a burndown chart or a feature count.

Value is what helps your organization succeed.

It shows up in:

Increased revenue
Reduced customer churn
Improved user satisfaction
Greater market share
Stronger engagement or retention

These are lagging indicators of success—they tell you after the fact whether what you built actually made a difference. But because they lag, they’re not always useful for day-to-day decision-making.

That’s where proxy metrics come in.

Proxy Metrics: Aligning Toward Impact

To make agile actionable in the short term, teams need leading indicators that signal whether they’re moving in the right direction.

Engineering teams focus on delivery health:
- Cycle time
- Deployment frequency
- Lead time for changes
- Change failure rate
These help identify bottlenecks, inefficiencies, and delivery risks. They don’t measure value directly—but they indicate how reliably the team can deliver value when it’s found.
Product teams focus on signals of product-market fit:
- Activation rates
- Feature adoption
- Retention curves
- Customer satisfaction scores
- Net Promoter Scores (NPS)
- Task success rates
These help gauge whether users are finding what’s being built useful and valuable.

The real power comes from aligning these two sets of indicators: ensuring that delivery teams are enabled to move fast and safely, while product teams are focused on the highest-leverage problems.

Efficiency vs. Effectiveness: Know the Difference

Here’s the critical distinction that many agile teams blur:

Delivery teams should be optimized for efficiency—to execute with speed, quality, and stability.
Product teams should be optimized for effectiveness—to ensure that what gets delivered actually matters.

If you optimize only for speed, you risk building the wrong thing faster.
If you optimize only for insight, you risk discovering great opportunities too slowly.

Agile is about continuously improving both sides of that equation.

That means:

Creating a clear separation of concerns between prioritization and execution.
Funding and structuring teams around outcomes, not projects.
Giving product and delivery teams joint ownership of goals, but distinct accountability for what gets done and how it gets done.

Agility Without Illusions

The real tragedy of faux-agile is that it gives organizations the illusion of adaptability without requiring any of the discipline or humility that true agility demands.

Agile is not about adopting a framework. It’s about cultivating a mindset:

One that values learning over knowing.
One that embraces change over control.
One that prioritizes customer outcomes over internal process fidelity.

When you strip away the jargon, the tickets, and the tooling, this is the real work:
→ Helping organizations get comfortable with uncertainty. → Focusing teams on value, not velocity. → And measuring success not by how much you deliver—but by how much it matters.

Final Thought

If your agile process gives you the illusion of predictability but none of the adaptability, it’s not agile. It’s a façade.

And if your stakeholders expect certainty in timelines and cost before any real discovery has happened, they’re not signing up for agile—they’re asking for waterfall with daily standups.

The question isn’t “are we agile?”
The real question is: do we have the courage to let go of control in pursuit of value?

Why Product Managers Should Trust Their Gut When Data Falls Short

Steve James — Fri, 14 Feb 2025 00:00:00 GMT

As Product Managers, we’re often told to “let the data speak” and to rely on hard evidence to validate our hypotheses. And rightly so—data is a powerful tool for minimizing risk and understanding user behavior. However, there’s a vital piece of the product innovation puzzle that is often overlooked: intuition. When the right data isn’t available or doesn’t exist, your gut can be your most valuable ally.

In fact, some of the most transformative innovations we celebrate today were born in moments when data was either nonexistent or irrelevant. Let’s explore why Product Managers shouldn’t shy away from trusting their instincts when navigating uncharted waters.

Data Is a Compass, Not a Crystal Ball

Marty Cagan, author of Inspired, emphasizes the limitations of relying solely on data. He writes, “Data is essential, but it only tells you about the past. If you want to invent the future, you have to look beyond the data.” Data can show us what has worked before, but it rarely points to what will work in the future, especially when you’re venturing into unexplored territory.

When a Product Manager tries to innovate in a space that’s new or undefined, historical data often falls short. Imagine being the first to propose something like the iPhone. What data would have validated that hypothesis? None. Yet, Steve Jobs famously said, “People don’t know what they want until you show it to them.”

Visionary Thinking Often Precedes Data

Some of the most game-changing innovations weren’t built on data—they were built on a deep understanding of human needs and the courage to make bold decisions. Steve Jobs’ intuition about user experience led to products that redefined industries. Had Apple waited for market research to prove demand, the iPhone might never have happened.

Similarly, Pawel Huryn, product leader and author, argues that “great product managers are great storytellers—they imagine a better world and then find a way to make it real.” In the early stages of product development, storytelling and vision often fill the gaps where data cannot.

When to Trust Your Gut

Trusting your gut doesn’t mean ignoring data—it means knowing when it isn’t enough. Here are some scenarios where intuition can be invaluable:

When entering uncharted markets: No historical data exists for truly innovative ideas.
When experimenting with new paradigms: Early-stage products often lack the metrics to guide decisions.
When responding to qualitative insights: User interviews and anecdotal evidence sometimes reveal truths that numbers can’t.

Pawel Huryn puts it succinctly: “Your gut feeling is your subconscious mind processing years of experience, knowledge, and observations. It’s not irrational; it’s informed.”

Balancing Data and Intuition

The key is to strike a balance. Use data when it’s available and relevant, but don’t let the absence of perfect data paralyze you. Great Product Managers are not just analysts—they’re visionaries who can navigate ambiguity and take calculated risks.

As Marty Cagan reminds us, “At the end of the day, your job is to solve problems in a way that creates value for your customers and your company. Don’t let data become an excuse for inaction.”