AI Reliability in Private Equity: Why Bigger Context Windows Don't Mean Better Results
Dr. Leigh Coney
February 19, 2026
12 minutes
AI reliability is the most overlooked risk in private equity technology evaluation. The AI marketing playbook has a favorite number: the context window. Google's Gemini touts a million tokens. Meta's Llama 4 claims ten million. The arms race is on, and the marketing writes itself. Bigger windows, better AI.
If you're a PE operating partner evaluating AI deployments across a portfolio, or a VC partner assessing a target company's AI infrastructure, these numbers look reassuring. They suggest that AI tools can ingest entire deal rooms, process complete financial models, and reason over hundreds of pages of due diligence materials in a single pass. The spec sheet says it can handle it.
But here's what nobody puts in the press release: a model claiming 200,000 tokens of context might start falling apart at 30,000. One claiming a million might stumble well before it gets there. The context window number on the spec sheet is like a car's top speed—technically achievable, practically irrelevant, and potentially dangerous to chase. The real question is not how much information a model can hold, but how much of it the model can actually use.
The Gap Between Specs and Reality
Researchers have been quietly building the case that context windows are deeply misleading. The evidence is now hard to ignore—and it has direct implications for any firm relying on AI to process complex, high-stakes information.
In February 2025, researchers from Adobe Research and LMU Munich published NoLiMa, a benchmark that tested models on something harder than the old "needle-in-a-haystack" trick [1]. Traditional benchmarks let models cheat by matching keywords between the question and the buried answer. NoLiMa removed those surface-level cues and forced models to actually reason about the information they were holding in context. The results were ugly. At 32,000 tokens—a fraction of what most frontier models claim to support—10 out of 12 tested models dropped below half their short-context performance. Even GPT-4o, one of the strongest performers, went from a near-perfect 99.3% baseline down to 69.7%.
Later that year, Norman Paulsen published "Context Is What You Need," introducing a concept the industry had been avoiding: the Maximum Effective Context Window [2]. After collecting hundreds of thousands of data points across several major models, he found that the gap between advertised context window sizes and usable performance was enormous. In some cases, models fell short of their claimed capacity by over 99%. Some top-of-the-line models showed severe accuracy degradation with as few as 1,000 tokens in context.
Then came the work that gave the problem a name. In June 2025, a Hacker News commenter coined the phrase "context rot" [16]. The following month, the AI database startup Chroma published a widely read study that put rigor behind the term [3]. Chroma's team tested 18 models and confirmed that performance degradation isn't just about length. The type of irrelevant content in the context window matters. Some kinds of noise degrade performance more than others. And more complex tasks degrade faster than simple ones, even on tasks as trivial as repeating a string of words.
Put together, the research tells a consistent story: models don't use their context uniformly. Their reliability erodes as input length grows, often in unpredictable and non-linear ways.
Why This Matters for PE and VC Firms
This isn't an academic curiosity. It's an operational risk that directly affects how AI performs in the environments where PE and VC firms are deploying it.
Consider what a typical AI-assisted due diligence process looks like. An analyst loads financial statements, management presentations, market analyses, and legal documents into an AI system. The combined context might run 80,000 to 150,000 tokens—well within the advertised window of most frontier models. The model answers questions about the materials. It identifies risks. It summarizes key findings. Everyone feels confident because the spec sheet says the model can handle half a million tokens.
But the research says that at 80,000 tokens, the model may already be experiencing significant degradation. Critical details buried in the middle of a CIM might be effectively invisible. A material risk factor on page 47 of a legal review might not register when the model is simultaneously processing 120 pages of financial data. The model won't tell you it missed something. It will confidently produce an answer that looks complete but has gaps that no one will catch until the deal closes.
The same dynamic plays out across portfolio company operations. A generic AI tool deployed for customer service might handle the first few messages in a conversation well but start losing track of earlier context as the interaction grows longer. An AI-powered financial analysis tool might produce accurate outputs on small datasets but degrade as the input complexity increases. A throughput-multiplying workflow built on the assumption that the AI can reliably process large volumes of information might be producing outputs with quietly declining accuracy.
The attention mechanism that powers modern AI is both the source of its strength and the root of this problem. Every token in a context window has to "attend" to every other token. As the context grows, each token competes for the model's attention with every other token. Signal gets drowned by noise. The critical instruction on page one gets buried under forty pages of background material. Giving an AI more context can actually make it perform worse.
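The dilution effect can be seen in a toy softmax calculation. This is an illustration only: the scores and distractor counts below are invented, and real transformer attention is far more complex. But the arithmetic shows why a relevant token that outscores every distractor can still receive a vanishing share of attention as the context grows.

```python
import math

def softmax_weight_on_signal(signal_score: float, noise_score: float,
                             n_noise: int) -> float:
    """Softmax attention weight on one relevant token when it competes
    with n_noise lower-scoring distractor tokens."""
    signal = math.exp(signal_score)
    noise = n_noise * math.exp(noise_score)
    return signal / (signal + noise)

# The relevant token outscores each distractor (2.0 vs 0.0), yet its
# share of attention collapses as the number of distractors grows.
for n in (10, 1_000, 100_000):
    print(n, round(softmax_weight_on_signal(2.0, 0.0, n), 4))
```

At 10 distractors the signal token holds roughly 42% of the attention; at 100,000 it holds a hundredth of a percent. Real models mitigate this with multiple heads and layers, but the competitive pressure on attention is structural.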
The Vendor Evaluation Problem
For firms evaluating AI vendors—whether for internal operations, portfolio company deployments, or assessing a target's technology stack during diligence—context window size has become a dangerously easy shorthand for capability. A vendor claims a million-token context window, and it sounds like it can handle anything. But the industry needs a more honest way to talk about context. We need to move past the theoretical maximum and start measuring the effective range where a model maintains reliable performance on the kinds of tasks real people actually care about.
When we conduct AI due diligence on behalf of PE firms, context window claims are one of the first things we pressure-test. The questions that matter aren't about how big the window is. They are: At what point does accuracy degrade for this specific use case? How does the system handle conflicting information spread across a large context? What architecture decisions has the vendor made to mitigate context rot?
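The first of those questions can be operationalized as a degradation probe: ask the same known-answer question at increasing context lengths and find where accuracy falls off. The harness below is a minimal sketch; `ask_model`, the stub standing in for it, and the filler text are all placeholders, and a real probe would call the vendor's API and average over many question and filler pairs.

```python
import random

def probe_degradation(ask_model, question, answer, filler_chunks, lengths):
    """For each target padding length, bury the question under irrelevant
    filler and record whether the model still answers correctly."""
    results = {}
    for n in lengths:
        filler = " ".join(random.choices(filler_chunks, k=n))
        prompt = f"{filler}\n\n{question}"
        results[n] = answer.lower() in ask_model(prompt).lower()
    return results

# Stub in place of a real model call; this toy "model" loses the
# question once the prompt gets too long, mimicking context rot.
def stub_model(prompt: str) -> str:
    return "Paris" if len(prompt) < 5_000 else "I'm not sure."

report = probe_degradation(
    stub_model,
    question="What is the capital of France?",
    answer="Paris",
    filler_chunks=["unrelated boilerplate text"],
    lengths=[100, 1_000, 10_000],
)
print(report)
```

Run against a real endpoint with your own documents as filler, a probe like this turns "at what point does accuracy degrade?" from a vendor conversation into a measurement.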
These are the kinds of questions that separate AI deployments that create durable value from ones that look impressive in a demo and quietly fail in production. And they require a level of technical evaluation that most standard diligence processes aren't equipped to deliver.
How the Best AI Providers Are Solving This
While much of the industry has been focused on making context windows bigger, the most sophisticated providers have been working on something that matters more: making context windows smarter.
If you think of a language model's context as an orchestra, adding more musicians won't help if no one decides when each section plays, when instruments come in and out, and how the whole performance holds together as the piece grows longer and more complex. You need a conductor.
Anthropic's engineering blog framed this philosophy directly: "Context is a critical but finite resource for AI agents" [5]. While the rest of the industry was trying to expand that resource, they were asking a different question: what if we just used it better? That question has produced two strategies that get at the core of the problem.
Strategy 1: Keep the Window Clean. In a typical AI conversation, every message, every response, every tool result piles up inside the context window. After enough back and forth, the window is packed with old, stale information that's no longer relevant. The model's attention is spread across everything, relevant or not. Anthropic's answer is automatic conversation compaction [8]: when a conversation approaches its context limit, instead of hitting a wall, the system generates a structured summary of what's been discussed, what decisions have been made, and what's still in progress. Then it continues from that compressed foundation. This looks like a convenience feature, but it doubles as a performance strategy. A compacted conversation keeps the context window focused on what matters right now, not cluttered with the transcript of everything that happened three hours ago.
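Anthropic's compaction runs server-side, but the shape of the idea can be sketched in a few lines. This is a hypothetical client-side version: `count_tokens`, `summarize`, the 80% trigger, and the number of recent messages kept are all assumptions, not the actual implementation.

```python
def maybe_compact(messages, count_tokens, summarize, limit, keep_recent=4):
    """If the transcript nears the context limit, collapse everything
    except the most recent messages into a single summary message.
    count_tokens and summarize stand in for real tokenizer/model calls."""
    if count_tokens(messages) < 0.8 * limit:  # 80% trigger is an assumption
        return messages
    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = summarize(older)
    return [{"role": "system",
             "content": f"Summary of earlier conversation: {summary}"}] + recent

# Stubs for demonstration; a real deployment would use the provider's
# tokenizer and a summarization call to the model itself.
count = lambda msgs: sum(len(m["content"].split()) for m in msgs)
summ = lambda msgs: f"{len(msgs)} earlier messages covering project setup"

history = [{"role": "user", "content": "word " * 50} for _ in range(20)]
compacted = maybe_compact(history, count, summ, limit=500)
print(len(history), "->", len(compacted))
```

The point of the pattern is that the model's next turn reasons over one summary plus a few recent exchanges, not the full transcript.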
Strategy 2: Don't Make One Window Do Everything. Instead of cramming everything into a single context window and hoping the model can sort it out, the most effective architectures distribute the work across multiple fresh windows. Anthropic's Research feature [6] uses a lead agent that analyzes a query, develops a strategy, and spawns specialized subagents to explore different aspects simultaneously. Each subagent gets its own clean context window, searches and reasons within it, and returns a compressed summary back to the lead. In Anthropic's internal testing, this architecture outperformed a single model by 90.2% on research tasks [6]. Token usage alone explained roughly 80% of the performance variance. In plain terms: distributing reasoning across fresh, focused context windows beats stuffing everything into one big window, no matter how large that window is.
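The lead-agent pattern can be sketched as follows. This is a simplified illustration, not Anthropic's implementation: `plan`, `run_subagent`, and `synthesize` stand in for model calls, and in a real system each subagent call would be a separate session with its own clean context window.

```python
from concurrent.futures import ThreadPoolExecutor

def research(query, plan, run_subagent, synthesize):
    """Lead-agent pattern: decompose a query into focused subtasks, run
    each in parallel with its own fresh context, then synthesize the
    compressed findings that come back."""
    subtasks = plan(query)
    with ThreadPoolExecutor() as pool:
        findings = list(pool.map(run_subagent, subtasks))
    return synthesize(query, findings)

# Stubs; real versions would each be a model call in its own session.
plan = lambda q: [f"{q}: market size", f"{q}: competitors", f"{q}: risks"]
run = lambda task: f"compressed findings on '{task}'"
merge = lambda q, findings: " | ".join(findings)

print(research("Acme Corp diligence", plan, run, merge))
```

The lead agent never sees the subagents' raw working context, only their compressed summaries, which is what keeps every individual window small and focused.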
This principle has been pushed further with Agent Teams [11]—multiple AI instances working simultaneously on different aspects of a shared project, each maintaining its own context window while coordinating through a shared task list. Early access partner Rakuten reported that this approach "autonomously closed 13 issues and assigned 12 issues to the right team members in a single day, managing a ~50-person organization across 6 repositories" [15]. That kind of coordination across a large project is only possible because the cognitive load is distributed. No single context window has to hold the entire problem.
What This Means for Portfolio AI Strategy
The distinction between context size and context quality has practical implications for how PE firms should think about AI across their portfolios.
1. Stop using context window size as a procurement criterion. When a portfolio company's CTO recommends an AI vendor because it has "the biggest context window," that should be a red flag, not a selling point. The relevant question is how the vendor manages context—what compression strategies, multi-agent architectures, and attention optimization techniques they've implemented. A 200,000-token window with intelligent context management will outperform a million-token window that just throws everything in and hopes for the best.
2. Architect for distributed reasoning, not monolithic processing. Portfolio companies building AI workflows should design them around the principle that no single AI instance should be asked to hold and reason over everything at once. Complex tasks—financial analysis, legal review, operational auditing—should be decomposed into focused subtasks, each processed with a clean, relevant context. This mirrors how effective human teams work: you don't hand one analyst the entire data room and expect perfect recall. You divide the work, specialize the focus, and synthesize the findings.
3. Build monitoring into AI deployments from day one. If context rot is real—and the research says it is—then portfolio companies need ways to detect when their AI systems are degrading. This means logging not just whether the AI produced an output, but whether that output was accurate relative to the full input. It means testing deployed systems against known-answer benchmarks on a regular cadence. It means building the kind of governance frameworks that can catch silent failures before they compound into material problems.
4. Factor context engineering into AI-driven EBITDA projections. Value creation plans that assume AI will reliably process large volumes of information need to be stress-tested against the reality of context degradation. The throughput gains from AI are real, but they're conditional on proper architecture. A model that produces accurate outputs 95% of the time at low context loads but drops to 70% at production-scale loads will generate a very different ROI than the one projected in the investment memo.
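The known-answer monitoring in point 3 can be sketched as a canary suite that replays questions with verified answers through the deployed system on a schedule. This is a minimal illustration: the questions, answers, alert threshold, and `ask_model` hook are all hypothetical and would be specific to the deployment.

```python
def run_canary_suite(ask_model, canaries, alert_threshold=0.9):
    """Replay known-answer questions through the deployed system and
    flag when accuracy drops below the threshold."""
    correct = sum(answer.lower() in ask_model(question).lower()
                  for question, answer in canaries)
    accuracy = correct / len(canaries)
    return accuracy, accuracy >= alert_threshold

# Hypothetical canaries and a stub model for demonstration only.
canaries = [
    ("What is the covenant threshold in the sample credit agreement?", "3.5x"),
    ("Which subsidiary holds the real-estate assets?", "PropCo"),
]
stub = lambda q: "The answer is 3.5x." if "covenant" in q else "Unclear."

accuracy, healthy = run_canary_suite(stub, canaries)
print(accuracy, healthy)
```

Scheduled against the production system with realistic context loads, a suite like this is what turns context rot from a silent failure into an alert.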
The Bigger Picture
The Epoch AI research group found that while context window sizes have grown roughly 30x per year since mid-2023, the effective length where top models maintain 80% accuracy has been improving even faster, rising over 250x in just nine months [4]. Part of that improvement comes from better models. But a large part comes from better context management—the engineering work that determines what gets loaded, what gets compressed, and what gets delegated elsewhere.
The context window arms race isn't over. Google, Meta, and others will keep pushing the raw numbers higher. And those numbers do matter. A model with a million-token window can attempt things a 4,000-token model simply cannot. But the gap between "can attempt" and "can reliably accomplish" is where real value lives. Bigger windows alone won't close that gap. Smarter management of the windows that already exist will.
For PE and VC firms deploying AI across portfolios, this distinction is the one that will actually determine whether the technology creates durable value or quietly accumulates risk. The context window number on the tin tells you one thing. What happens inside that window tells you everything.
References
[1] Modarressi, A., Deilamsalehy, H., Dernoncourt, F., Bui, T., Rossi, R. A., Yoon, S., & Schütze, H. (2025). "NoLiMa: Long-Context Evaluation Beyond Literal Matching." Forty-second International Conference on Machine Learning.
[2] Paulsen, N. (2025). "Context Is What You Need: The Maximum Effective Context Window for Real World Limits of LLMs." Open Access Journal of Artificial Intelligence and Machine Learning, September 2025.
[3] Hong, K., Troynikov, A., & Huber, J. (2025). "Context Rot: How Increasing Input Tokens Impacts LLM Performance." Chroma Research Technical Report, July 14, 2025.
[4] Burnham, G. & Adamczewski, T. (2025). "LLMs now accept longer inputs, and the best models can use them more effectively." Epoch AI, June 25, 2025.
[5] Anthropic Applied AI Team. (2025). "Effective context engineering for AI agents." Anthropic Engineering Blog, September 29, 2025.
[6] Anthropic Engineering Team. (2025). "How we built our multi-agent research system." Anthropic Engineering Blog, June 13, 2025.
[8] Anthropic. (2026). "Compaction: Server-side context compaction for managing long conversations." Claude API Documentation.
[11] Anthropic. (2026). "Agent teams: Orchestrate teams of Claude Code sessions." Claude Code Documentation.
[15] Anthropic. (2026). "Introducing Claude Opus 4.6." Anthropic News, February 5, 2026.
[16] Lee, T. B. (2025). "Context rot: the emerging challenge that could hold back LLM progress." Understanding AI, November 10, 2025.
AI vendor evaluation and context architecture design are core components of our approach to reliable, high-impact AI deployment. See how it fits into our High-Stakes AI Blueprint for reliable AI systems.