
The Agent Reliability Cliff: Why PE Deployments Fail at Month Six

By Dr. Leigh Coney · Published April 29, 2026 · 10 minute read

AI agents that look great in pilot quietly fall apart at month six. The same agent that produced impressive output in week three is producing slightly off output in month four, and clearly wrong output by month seven. The team stops trusting it. The deployment dies without ever being formally killed. The cause is not the model. It is the compounding of three drifts that nobody is monitoring.


The pattern is so consistent across PE AI deployments that I have stopped being surprised by it.

The pilot looks great. Month one, the agent does exactly what the demo promised. Month two, the team is using it daily and the output quality is high. Month three, the team is happy enough that they sign the annual contract.

Month four, the first weird outputs appear. The team notices but does not flag them. The agent is mostly right. Nobody wants to be the person who triggers a vendor escalation over an edge case.

Month five, the weird outputs are more frequent. The team is now spot-checking the agent's work before relying on it. The time savings the agent was supposed to deliver have shrunk because the team is doing more verification.

Month six, the head of operations notices that the team is bypassing the agent on important deals. They use the agent for low-stakes work and do high-stakes work manually. The deployment is dying.

By month seven or eight, the agent is still installed but nobody is really using it. The annual contract auto-renews. The team has reverted to pre-deployment workflows for anything that matters.

This is the agent reliability cliff. It is not a technology failure. It is a compounding of three drift types that almost nobody monitors, and that almost every vendor underplays.

Drift Type 1: Data Drift

The agent was calibrated against the data the firm had at deployment. Six months later, the data has changed. Not dramatically. Just enough.

A new company joined the portfolio. Its chart of accounts is structured differently from the others. The agent normalizes the data, but the normalization is slightly off because the agent has not been retrained on the new structure.

A vendor that supplies sector data updated its taxonomy. The agent's classification logic, which was calibrated to the old taxonomy, now miscategorizes 8% of incoming deals. The deals do not get filtered correctly. Small enough to miss in casual review. Big enough to compound across a quarter.

An ERP at a portfolio company got upgraded. The export format changed slightly. The agent's parser still produces output, but the dollar figures are now off by a factor of 1,000 in one specific field because the new format reports in thousands instead of single units. The agent does not catch its own mistake because the figures still parse as valid numbers.
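A guardrail for exactly this failure is cheap to build. Below is a minimal sketch in Python of an order-of-magnitude check that compares each parsed dollar figure against the historical baseline for the same field. The function names, values, and threshold are illustrative assumptions, not any vendor's actual implementation.

```python
import statistics

def magnitude_flags(parsed_values, baseline_values, max_ratio=100.0):
    """Flag parsed dollar figures whose scale departs wildly from the
    historical baseline for the same field. A 1,000x unit shift
    (thousands vs. single units) produces a ratio far above max_ratio
    and gets flagged even though every value still looks like a
    plausible number. Assumes positive dollar amounts."""
    baseline_median = statistics.median(baseline_values)
    flags = []
    for value in parsed_values:
        if value <= 0 or baseline_median <= 0:
            continue
        ratio = max(value, baseline_median) / min(value, baseline_median)
        if ratio > max_ratio:
            flags.append((value, round(ratio)))
    return flags

# Hypothetical example: a revenue field historically in the low millions,
# where the upgraded export suddenly reports in thousands-of-dollars units.
history = [2_400_000, 3_100_000, 1_800_000]
new_export = [2_600, 2_900]  # same deals, wrong unit scale
print(magnitude_flags(new_export, history))  # both values flagged
```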

Each of these is small. None of them, individually, would be caught by a quarterly QA check. All of them together compound until the agent's outputs are wrong in ways that look right.

Most vendors do not monitor data drift in production. They check the agent at deployment and assume the inputs stay structurally similar. They do not stay structurally similar. Six months is enough time for the underlying data shape to shift in ways that quietly break agent output.
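Monitoring for structural drift does not require anything exotic. Here is a hedged sketch, assuming a tabular input feed and pandas: fingerprint the feed's shape at deployment, then diff each month's feed against that snapshot. Every field name and tolerance here is an assumption for illustration.

```python
import pandas as pd

def shape_snapshot(df: pd.DataFrame) -> dict:
    """Record the structural fingerprint of an input feed:
    column names, dtypes, and per-column null rates."""
    return {
        "columns": list(df.columns),
        "dtypes": {c: str(t) for c, t in df.dtypes.items()},
        "null_rates": df.isna().mean().round(3).to_dict(),
    }

def diff_snapshots(baseline: dict, current: dict, null_tol=0.05) -> list[str]:
    """Return human-readable drift warnings between the deployment-time
    snapshot and the current feed."""
    warnings = []
    added = set(current["columns"]) - set(baseline["columns"])
    removed = set(baseline["columns"]) - set(current["columns"])
    if added:
        warnings.append(f"new columns: {sorted(added)}")
    if removed:
        warnings.append(f"missing columns: {sorted(removed)}")
    for col, dtype in baseline["dtypes"].items():
        if current["dtypes"].get(col, dtype) != dtype:
            warnings.append(f"{col}: dtype {dtype} -> {current['dtypes'][col]}")
    for col, rate in baseline["null_rates"].items():
        cur = current["null_rates"].get(col, rate)
        if abs(cur - rate) > null_tol:
            warnings.append(f"{col}: null rate {rate:.0%} -> {cur:.0%}")
    return warnings
```

An empty warnings list does not prove the inputs are healthy, but a non-empty one is exactly the early signal most deployments never collect.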

Drift Type 2: Prompt Drift

Prompt drift is the one most vendors will not admit to.

When an agent is deployed, the vendor configures a set of prompts that drive its behavior. How to read a CIM. How to score a deal against the firm's thesis. How to format the output. The prompts are tuned over the first few weeks of deployment based on the team's feedback.

Then the team's needs evolve. New thesis criteria get added. The IC starts asking for output in a slightly different format. The deal team adopts a new diligence framework. None of these changes are explicitly fed back to the agent's prompt set, because nobody at the firm thinks of the agent's instructions as something that needs to be updated.

Two months later, the agent is producing output that matches its original prompt instructions but is no longer aligned with how the team actually works. The output looks fine. The output is wrong, in the sense that it is answering a question the team is no longer asking.

Vendors deal with this differently. The honest ones build a process for prompt revision and run quarterly check-ins where the team validates that the prompts still match current workflows. The dishonest ones do not mention this is needed and let the deployment quietly drift out of alignment.
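One lightweight way to make prompt maintenance visible is to store each prompt with review metadata and surface the stale ones automatically. A minimal sketch: the dataclass fields and the 90-day window are assumptions tied to the quarterly cadence just described, not a standard any vendor ships.

```python
from dataclasses import dataclass
from datetime import date, timedelta

REVIEW_INTERVAL = timedelta(days=90)  # assumed quarterly review cadence

@dataclass
class Prompt:
    name: str
    text: str
    last_reviewed: date

def stale_prompts(prompts: list[Prompt], today: date | None = None) -> list[str]:
    """Return the names of prompts whose last review is older than the
    agreed interval. Wired into a monthly job, this makes prompt drift
    surface as an alert instead of a surprise."""
    today = today or date.today()
    return [p.name for p in prompts if today - p.last_reviewed > REVIEW_INTERVAL]
```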

Dr. Leigh Coney, Founder of WorkWise Solutions, notes: "The vendors that survive in PE are the ones that treat prompt maintenance as a recurring service, not a one-time configuration. The vendors that fail at month six are the ones that treat the prompt set as something configured once and then forgotten. The team's needs always evolve. The prompts have to evolve with them."

Drift Type 3: Supervision Fatigue

Supervision fatigue is the human side of the cliff. It is the most predictable and the least talked about.

In month one, the team checks every agent output carefully. They are still building trust. The check is part of the workflow.

In month three, the team checks samples. The agent is reliable enough that full review is unnecessary. They spot-check.

In month five, the team has stopped spot-checking systematically. They do not check at all unless something looks obviously wrong. The agent is now operating without supervision.

This is the same pattern behind automation-related airline accidents and surgical errors. NASA's research on cockpit automation shows that pilots who monitor automated systems through long routine stretches become significantly worse at catching the moments when the system needs intervention. The same effect operates in financial services with AI agents. Humans who have learned to trust an agent become bad at catching the agent when it is wrong.

By the time supervision fatigue compounds with data drift and prompt drift, the agent is producing wrong outputs that nobody is catching. When the wrongness becomes too obvious to ignore, the team's reaction is not to fix the agent. Their reaction is to lose trust in it. The deployment dies because the team cannot tell, without significant work, where the agent is reliable and where it is not.

The cruel part of supervision fatigue is that it is rational. The team's degraded checking is the correct response to the agent's apparent reliability in early months. The team is being efficient, which is what the agent was supposed to enable. The fatigue is not laziness. It is exactly the behavior change the deployment promised would happen.

Why Month Six Specifically

The cliff hits at month six because that is roughly when all three drifts have compounded enough to produce visible failures.

Data drift alone is usually catchable in month two or three. Prompt drift alone usually surfaces around month four when the team's workflow has shifted enough that the agent's output feels off. Supervision fatigue alone is harmless if the agent is not drifting.

When all three compound, the agent is producing increasingly wrong output, the team is decreasingly checking it, and the prompts are decreasingly aligned with how the team works. By month six, the wrongness becomes visible. By month seven, trust is broken. By month eight, the deployment is effectively dead.

The exact timing varies by deployment. Some firms hit the cliff at month four because their data shifts faster. Some firms last until month nine because they have a more disciplined supervision process. Six months is the modal case.

What Prevents the Cliff

Three practices, run together. None of them are dramatic. None of them are technically interesting. All of them are easy to skip and most deployments do skip them.

1. Monthly drift audits. Once a month, someone runs the agent against a fixed set of validation cases. Same inputs every month. The outputs should match the original calibration outputs within a defined tolerance. When they do not match, the audit produces a flag and a diagnosis. This catches data drift early, while the team still remembers what the agent should be producing. (A minimal sketch of the audit loop follows this list.)

2. Quarterly prompt reviews. Once a quarter, the team and the vendor sit down and walk through the agent's prompt set. What has changed in how the team works? What new thesis criteria have been added? What output formats are now obsolete? The prompts are updated to match. This sounds like overhead. It is overhead. It is also the difference between an agent that survives and an agent that dies.

3. Random output verification, scheduled into the team's calendar. Once a week, someone on the team picks three agent outputs at random and verifies them against the underlying source data. Not the obvious-looking ones. Random. The verification is a 30-minute calendar block. It is not enthusiastic. It is disciplined. It is what keeps supervision fatigue from compounding into trust collapse. (A sketch of the random draw also follows below.)
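In code terms, the monthly audit from practice 1 is a fixture file plus a comparison loop. A sketch under stated assumptions: the agent is callable as a function returning a dict of numeric fields, the validation cases were saved at calibration time, and the 2% tolerance is a placeholder policy, not a recommendation.

```python
import json

TOLERANCE = 0.02  # assumed relative tolerance on numeric output fields

def run_drift_audit(agent, fixture_path: str) -> list[dict]:
    """Replay the fixed validation cases saved at calibration and
    flag any output field that has moved outside tolerance."""
    with open(fixture_path) as f:
        cases = json.load(f)  # [{"input": ..., "expected": {field: number}}, ...]
    flags = []
    for case in cases:
        actual = agent(case["input"])
        for field, expected in case["expected"].items():
            got = actual.get(field)
            if got is None:
                flags.append({"field": field, "issue": "missing from output"})
            elif expected and abs(got - expected) / abs(expected) > TOLERANCE:
                flags.append({"field": field, "expected": expected, "got": got})
    return flags  # an empty list means the agent still matches calibration
```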
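And the weekly draw from practice 3 is deliberately dumb: the sample comes from a random number generator, not from anyone's eye, so it cannot drift toward outputs that already look fine. A minimal sketch with assumed names; seeding by ISO week number makes each week's draw reproducible for audit purposes.

```python
import datetime
import random

def pick_verification_sample(output_ids: list[str], k: int = 3) -> list[str]:
    """Select k agent outputs uniformly at random for manual
    verification against the underlying source data."""
    week = datetime.date.today().isocalendar()[1]  # ISO week number
    rng = random.Random(week)                      # reproducible weekly draw
    return rng.sample(output_ids, k=min(k, len(output_ids)))
```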

Together, these three practices add up to roughly four hours of work per agent per month. Most deployments do not budget for them. Most deployments fail at month six.

The Vendor Question

When evaluating any AI agent vendor, the question that matters most is not "how good is the agent at deployment." It is "what does the vendor do in months four through nine."

Ask the vendor specifically. Do they monitor for data drift, and how? Do they run prompt reviews, and how often? Do they have a process for catching supervision fatigue before it kills trust? If they do not have specific answers to all three, they do not have a serious post-deployment practice. The agent will work in pilot. It will hit the cliff at month six. The deployment will quietly fail.

The vendors that survive in PE are the ones that have built an ongoing service practice around the agent, not just a deployment service. The deployment is the easy part. The five years afterward are where the work is. The vendors that ignore the maintenance phase are not building agents. They are selling demos.

For more on the broader frame, see our post on why AI projects fail in PE and our analysis of AI reliability in private equity contexts.

The reliability cliff is not inevitable. It is the predictable outcome of treating AI agents as software that gets installed once. The agents that deliver value at year three are the same agents that survived month six, which are the same agents that had a real maintenance practice from day one.

If you are evaluating a deployment, the question is not whether the pilot will go well. The pilot will go well. The question is whether anybody is going to be looking after the agent when month six arrives. If the answer is no, the deployment is already dead. The team just does not know it yet.

Ready to deploy AI agents that survive past month six?

See our high-stakes AI blueprint, or read about Dr. Leigh Coney's approach to embedded AI work.
