AI Reliability in Private Equity: Why Bigger Context Windows Don't Mean Better Results
Dr. Leigh Coney
February 7, 2026
12 minutes
AI vendors sell context window size like car makers sell top speed. Impressive on paper, misleading in practice. Research shows most models break down long before they hit the claimed limits. PE firms deploying AI need to check actual reliability, not the spec sheet.
By Dr. Leigh Coney, Founder of WorkWise Solutions
AI reliability is the most overlooked risk in PE technology evaluation. The AI marketing playbook has a favorite number: the context window. Google's Gemini touts a million tokens. Meta's Llama 4 claims ten million. The arms race is on, and the pitch is simple: bigger windows, better AI.
If you are a PE operating partner checking AI across a portfolio, or a family office principal reviewing a target's AI, these numbers look reassuring. They suggest AI can handle entire deal rooms, process full financial models, and reason over hundreds of pages of due diligence in one pass. The spec sheet says it can handle it.
Here is what nobody puts in the press release. A model claiming 200,000 tokens might start falling apart at 30,000. One claiming a million might stumble well before it gets there. The context window number is like a car's top speed. Technically true, practically irrelevant, and dangerous to chase. The real question is how much of that information the AI can actually use.
Training costs for frontier AI models have risen fast. GPT-4 is estimated to have cost $78 million in training compute, Google's Gemini Ultra $191 million (Stanford HAI, AI Index Report 2025). Spending that much to train a model does not make it reliable. Context window specs are marketing numbers, not performance guarantees.
The Gap Between Specs and Reality
Researchers have been quietly building the case that context windows are misleading, and the evidence is now hard to ignore. It matters for any firm relying on AI to process complex, high-stakes information.
In February 2025, researchers from Adobe Research and LMU Munich published NoLiMa, a benchmark that tested models on something harder than the old "needle-in-a-haystack" trick [1]. Old benchmarks let models cheat by matching keywords between the question and the buried answer. NoLiMa removed those cues and forced models to actually reason about what they were holding. The results were ugly. At 32,000 tokens (a fraction of what most frontier models claim to support), 11 of 13 tested models dropped below half their short-context performance. Even GPT-4o, one of the strongest, fell from 99.3% to 69.7%.
Later that year, Norman Paulsen published "Context Is What You Need," introducing a concept the industry had been avoiding: the Maximum Effective Context Window [2]. After collecting hundreds of thousands of data points across several major models, he found the gap between advertised context window sizes and actual usable performance was enormous. Some models fell short of their claimed capacity by over 99%. Some top-of-the-line models showed severe accuracy drops with as little as 1,000 tokens in context.
Then came the work that gave the problem a name. In June 2025, a Hacker News commenter coined "context rot" [16]. The next month, the AI database startup Chroma published a widely read study that put rigor behind the term [3]. Chroma tested 18 models and confirmed that the drop-off is not just about length. The type of irrelevant content in the window matters, and some kinds of noise hurt more than others. Complex tasks fall apart faster than simple ones, and even tasks as trivial as repeating a string of words show measurable decline as input grows.
The research tells a consistent story. Models do not use their context evenly. Reliability breaks down as input grows, often in unpredictable ways.
Why This Matters for PE and Alternative Investment Firms
This is not an academic curiosity. It is an operational risk, and it affects AI performance in exactly the places where PE firms, family offices, and independent sponsors are deploying it.
Think about a typical AI-assisted due diligence process. An analyst loads financial statements, management presentations, market analyses, and legal documents into an AI system. The combined context might run 80,000 to 150,000 tokens, well within the advertised window of most frontier models. The AI answers questions. It flags risks. It summarizes findings. Everyone feels confident because the spec sheet says the model handles half a million tokens.
But the research says at 80,000 tokens, the model may already be breaking down. Critical details buried in the middle of a CIM might be invisible. A material risk factor on page 47 of a legal review might not register while the model processes 120 pages of financial data. The model will not tell you it missed something. It will produce an answer that looks complete but has gaps nobody catches until the deal closes.
The same thing happens across portfolio operations. A generic AI tool used for customer service might handle the first few messages well, then lose track as the conversation grows. An AI financial analysis tool might produce accurate results on small datasets but break down as inputs get more complex. A throughput-boosting workflow built on the assumption that AI can reliably handle large volumes might be producing outputs with quietly dropping accuracy.
The attention mechanism in modern AI is both its strength and the root of this problem. Every token has to "attend" to every other token, so as context grows, each piece of information competes with everything else for the model's attention. Signal gets drowned by noise. The critical instruction on page one gets buried under forty pages of background material. Giving an AI more context can make it perform worse.
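To make the dilution concrete, here is a deliberately simplified sketch. Real transformers use learned, multi-head attention across many layers, so the scores below are toy values, but the dynamic is the same: even when the relevant token scores higher than the noise, its share of a softmax attention distribution shrinks as irrelevant tokens pile in.

```python
# Toy illustration (not any production model): softmax attention over a
# growing context. One "relevant" key scores higher than the noise, yet its
# share of attention still collapses as irrelevant tokens accumulate.
import numpy as np

def attention_on_signal(n_noise_tokens: int, signal_score: float = 4.0,
                        noise_score: float = 1.0) -> float:
    """Fraction of attention the single relevant token receives."""
    scores = np.array([signal_score] + [noise_score] * n_noise_tokens)
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights[0]

for n in (10, 100, 1_000, 10_000, 100_000):
    print(f"{n:>7} noise tokens -> {attention_on_signal(n):.4f} on the signal")
```

With these toy scores, the relevant token's share falls from about two-thirds at 10 noise tokens to a fraction of a percent at 100,000. No real model behaves this simply, but the direction of the effect is exactly what the benchmarks above measured.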
The Vendor Evaluation Problem
For firms evaluating AI vendors (internal ops, portfolio companies, or a target's tech stack during diligence), context window size has become a dangerous shorthand for capability. A vendor claims a million-token window and it sounds like it can handle anything. The industry needs a more honest way to talk about context. Move past the theoretical maximum. Start measuring the range where a model holds up on the tasks people actually care about.
When we do AI due diligence for PE firms, context window claims are the first thing we pressure-test. The right questions are not about raw window size. They are: At what point does accuracy drop for this specific use case? How does the system handle conflicting information across a large context? What has the vendor done to fight context rot?
These are the questions that separate AI that creates lasting value from AI that looks good in a demo and quietly fails in production. Answering them requires technical evaluation that most standard diligence processes skip.
How the Best Providers Are Fixing This
Most of the industry has focused on making context windows bigger. The best providers have been working on something that matters more: making them smarter.
Think of context like an orchestra. Adding more musicians does not help if nobody knows when each section should play. You need a conductor.
Anthropic's engineering blog said it directly: "Context is a critical but finite resource for AI agents" [5]. While the rest of the industry tried to expand the resource, Anthropic asked a different question: what if we just used it better? That produced two strategies that get at the core of the problem.
Strategy 1: Keep the window clean. In a typical AI conversation, every message, every response, every tool result stacks up in the window. After enough back and forth, the window fills with stale information. The model's attention gets spread across everything, relevant or not. Anthropic's answer is automatic conversation compaction [8]. When a conversation nears its limit, the system summarizes what was discussed, what was decided, and what is still in progress. Then it continues from that summary. It looks like a convenience feature. It is a performance strategy. The window stays focused on what matters now, not cluttered with a transcript from three hours ago.
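Here is a minimal sketch of that pattern. Everything in it is illustrative: count_tokens and summarize are stand-ins for a real tokenizer and a real summarization call, the threshold is arbitrary, and this shows the shape of the strategy rather than Anthropic's implementation.

```python
# Sketch of conversation compaction. count_tokens() and summarize() are
# hypothetical stand-ins, not any vendor's API.
from dataclasses import dataclass, field

COMPACTION_THRESHOLD = 150_000  # assumed budget, kept below the advertised window

def count_tokens(messages: list[dict]) -> int:
    # Crude proxy: ~4 characters per token. A real system would use the
    # model's tokenizer or the usage metadata returned by the API.
    return sum(len(m["content"]) for m in messages) // 4

def summarize(messages: list[dict]) -> str:
    # Placeholder. In practice, the model itself would be asked to summarize
    # what was discussed, what was decided, and what is still in progress.
    return " | ".join(m["content"][:80] for m in messages)

@dataclass
class Conversation:
    messages: list[dict] = field(default_factory=list)

    def add(self, role: str, content: str) -> None:
        self.messages.append({"role": role, "content": content})
        if count_tokens(self.messages) > COMPACTION_THRESHOLD:
            self.compact()

    def compact(self) -> None:
        # Summarize everything except the most recent exchanges, then continue
        # from the summary plus the live tail. The window stays focused on now.
        head, tail = self.messages[:-4], self.messages[-4:]
        note = {"role": "system",
                "content": "Summary of earlier discussion: " + summarize(head)}
        self.messages = [note] + tail
```

The design point is that compaction trades raw transcript length for a dense statement of current working state.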
Strategy 2: Do not make one window do everything. Instead of cramming everything into a single window, the best architectures split work across multiple fresh windows. Anthropic's Research feature [6] uses a lead agent that analyzes a query, builds a strategy, and spawns specialized subagents to explore different aspects at the same time. Each subagent gets its own clean window, searches and reasons within it, and returns a summary to the lead. In Anthropic's testing, this beat a single model by 90.2% on research tasks [6]. Token usage alone explained about 80% of the performance variance. In plain terms: splitting reasoning across fresh, focused windows beats stuffing everything into one big window, no matter how large.
Agent Teams [11] takes this further. Multiple AI instances work on different parts of a shared project at once, each with its own window, coordinating through a shared task list. Early access partner Rakuten reported that this approach "autonomously closed 13 issues and assigned 12 issues to the right team members in a single day, managing a ~50-person organization across 6 repositories" [15]. That kind of coordination only works because the load is split. No single window has to hold the whole problem.
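The orchestration pattern itself is simple to sketch, even though the production systems described above add memory, tools, and error handling that a sketch omits. In the toy version below, llm() is a hypothetical stateless wrapper around a single model call, not any vendor's API.

```python
# Sketch of split-window research: a lead step plans subtasks, each subtask
# runs in its own fresh context, and only compact summaries flow back.
from concurrent.futures import ThreadPoolExecutor

def llm(prompt: str) -> str:
    # Hypothetical wrapper for one stateless model call with a clean window.
    raise NotImplementedError("wire up a model client here")

def research(query: str) -> str:
    # 1. Lead step: decompose the query into focused, independent subtasks.
    plan = llm("Break this research query into 3-5 independent subtasks, "
               "one per line:\n" + query)
    subtasks = [line.strip() for line in plan.splitlines() if line.strip()]

    # 2. Each subtask gets its own fresh window; no transcript is shared.
    with ThreadPoolExecutor() as pool:
        findings = list(pool.map(
            lambda task: llm("Investigate and summarize in under 200 words: "
                             + task),
            subtasks))

    # 3. The lead synthesizes summaries, never the subagents' full context.
    return llm("Synthesize these findings into one answer to: " + query
               + "\n\n" + "\n\n".join(findings))
```

The choice that matters is in the last step: only summaries flow upward, so no single window ever has to hold the whole problem.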
What This Means for Portfolio AI Strategy
The gap between context size and context quality has real implications for how PE firms should think about AI across portfolios.
1. Stop using context window size as a buying criterion. If a portfolio company's CTO recommends a vendor because it has "the biggest context window," that is a red flag, not a selling point. Ask how the vendor manages context instead. What compression strategies, multi-agent architectures, and attention techniques have they built? A 200,000-token window with smart management beats a million-token window that dumps everything in and hopes for the best.
2. Design for split reasoning, not one giant task. Portfolio companies building AI workflows should assume no single AI instance can hold and reason over everything at once. Break complex tasks (financial analysis, legal review, operational audits) into focused subtasks, each with its own clean context. This is how good human teams work. You do not hand one analyst the entire data room. You split the work, specialize the focus, and combine the findings.
3. Monitor AI deployments from day one. Context rot is real. Portfolio companies need ways to spot when their AI systems are breaking down. Log not just whether the AI produced an output, but whether that output matched the full input. Test deployed systems against known-answer benchmarks on a schedule (a minimal sketch of such a check follows this list). Build governance frameworks that catch silent failures before they compound.
4. Factor context quality into AI-driven EBITDA projections. Value creation plans that assume AI can reliably handle large volumes need to be stress-tested. Throughput gains from AI are real, but they depend on proper architecture. A model that is 95% accurate at low loads but drops to 70% at production scale generates very different ROI than the investment memo suggests.
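For point 3, a known-answer check can be as small as the sketch below. The names and contents are invented for illustration: ask_model() stands in for the deployed system, the benchmark case is not a real deal document, and substring matching is the crudest possible grader; real harnesses score answers with more care.

```python
# Sketch of scheduled known-answer monitoring. ask_model() and the benchmark
# contents are hypothetical placeholders.
import datetime

ALERT_THRESHOLD = 0.90  # assumed acceptable pass rate

BENCHMARKS = [
    # Each case pairs a realistic prompt with phrases a correct answer must
    # mention. These entries are illustrative only.
    {"prompt": "Summarize the key risks in the attached 120-page data room.",
     "must_contain": ["change-of-control", "customer concentration"]},
]

def ask_model(prompt: str) -> str:
    raise NotImplementedError("wire up the deployed AI system here")

def run_known_answer_check() -> float:
    passed = 0
    for case in BENCHMARKS:
        answer = ask_model(case["prompt"]).lower()
        if all(phrase in answer for phrase in case["must_contain"]):
            passed += 1
    rate = passed / len(BENCHMARKS)
    if rate < ALERT_THRESHOLD:
        # Flag silent degradation before it compounds across the portfolio.
        print(f"{datetime.date.today()}: pass rate {rate:.0%} is below "
              f"{ALERT_THRESHOLD:.0%}; investigate context-related drift")
    return rate
```

Run it on a schedule and watch the trend; a slow slide in the pass rate is exactly how context rot shows up in production.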
The Bigger Picture
Epoch AI found that while context windows have grown roughly 30x a year since mid-2023, the effective length where top models stay 80% accurate has grown even faster, rising over 250x in nine months [4]. Some of that comes from better models. A lot comes from better context management: the engineering work of deciding what gets loaded, what gets compressed, and what gets delegated.
The context window arms race is not over. Google, Meta, and others will keep pushing the numbers higher. Those numbers do matter. A million-token window can attempt things a 4,000-token model cannot. But the gap between "can attempt" and "can reliably do" is where the real value lives. Bigger windows will not close that gap. Smarter management will.
For PE and alternative investment firms deploying AI across portfolios, this is the difference between technology that creates lasting value and technology that quietly builds risk. The context window number on the box tells you one thing. What happens inside that window tells you everything.
References
[1] Modarressi, A., Deilamsalehy, H., Dernoncourt, F., Bui, T., Rossi, R. A., Yoon, S., & Schütze, H. (2025). "NoLiMa: Long-Context Evaluation Beyond Literal Matching." Forty-second International Conference on Machine Learning.
[2] Paulsen, N. (2025). "Context Is What You Need: The Maximum Effective Context Window for Real World Limits of LLMs." Open Access Journal of Artificial Intelligence and Machine Learning, September 2025.
[3] Hong, K., Troynikov, A., & Huber, J. (2025). "Context Rot: How Increasing Input Tokens Impacts LLM Performance." Chroma Research Technical Report, July 14, 2025.
[4] Burnham, G. & Adamczewski, T. (2025). "LLMs now accept longer inputs, and the best models can use them more effectively." Epoch AI, June 25, 2025.
[5] Anthropic Applied AI Team. (2025). "Effective context engineering for AI agents." Anthropic Engineering Blog, September 29, 2025.
[6] Anthropic Engineering Team. (2025). "How we built our multi-agent research system." Anthropic Engineering Blog, June 13, 2025.
[8] Anthropic. (2026). "Compaction: Server-side context compaction for managing long conversations." Claude API Documentation.
[11] Anthropic. (2026). "Agent teams: Orchestrate teams of Claude Code sessions." Claude Code Documentation.
[15] Anthropic. (2026). "Introducing Claude Opus 4.6." Anthropic News, February 5, 2026.
[16] Lee, T. B. (2025). "Context rot: the emerging challenge that could hold back LLM progress." Understanding AI, November 10, 2025.
Vendor evaluation and context architecture are core to reliable AI. See where they fit in our High-Stakes AI Blueprint.
Rolling AI out across your portfolio?
We help PE firms, family offices, and independent sponsors vet AI vendors, build reliable AI workflows, and set up governance frameworks that catch silent failures. See how we've helped firms with AI due diligence and portfolio-wide rollouts.
Schedule a Consultation