AI Model Evals
Which AI models actually excel at financial analysis? We analyzed the leading foundation models across 537 real-world finance questions to find out.
WorkWise Solutions | February 2026 | Data: Vals.ai Finance Agent Benchmark
Top accuracy: Claude Opus 4.6 (Thinking) leads the benchmark with 60.65% overall.
537 questions: curated by Stanford researchers, GSIB practitioners, and financial industry experts.
Nine categories: from equity analysis to options pricing, spanning easy, medium, and hard difficulty levels.
For PE and VC firms evaluating AI-powered due diligence or portfolio analysis, these numbers matter. The gap between the best and worst performing models is over 25 percentage points — meaning the wrong model choice could leave critical financial insights on the table. This benchmark provides the empirical foundation for smarter model selection in high-stakes financial environments.
Model Leaderboard
Overall accuracy rankings across the full 537-question benchmark. Results reflect each model's ability to reason through complex financial analysis tasks.
| Rank | Model | Vendor | Accuracy | Cost / Query |
|---|---|---|---|---|
| 1 | Claude Opus 4.6 (Thinking) | Anthropic | 60.65% | $1.82 |
| 2 | GPT 5.1 | OpenAI | 56.55% | $0.35 |
| 3 | Claude Sonnet 4.5 (Thinking) | Anthropic | 55.32% | $0.98 |
| 4 | Gemini 3 Pro | Google | 54.84% | $0.52 |
| 5 | GPT 5 | OpenAI | 53.47% | $0.40 |
| 6 | Claude Sonnet 4.5 | Anthropic | 52.62% | $0.21 |
| 7 | o3 | OpenAI | 52.28% | $1.45 |
| 8 | Gemini 2.5 Flash | Google | 50.52% | $0.12 |
| 9 | GPT 4o | OpenAI | 45.23% | $0.09 |
| 10 | Llama 4 Maverick | Meta | 42.15% | $0.06 |
| 11 | DeepSeek R1 | DeepSeek | 35.10% | $0.04 |
Improvement trend: The top-performing models have improved accuracy by over 15 percentage points in the past six months, suggesting rapid capability gains in financial reasoning. For enterprise deployments, this means today's selection is a moving target — continuous evaluation is essential.
Cost vs. Accuracy
The Pareto frontier reveals which models deliver the best accuracy per dollar — critical intelligence for enterprise-scale deployment decisions.
Enterprise deployment implication: GPT 5.1 and Gemini 3 Pro sit on or near the Pareto frontier, offering strong accuracy at moderate cost. Claude Opus 4.6 commands a premium but delivers the highest accuracy — a trade-off that may be justified for high-stakes financial analysis where errors are costly. For high-volume screening tasks, Gemini 2.5 Flash offers an attractive cost-accuracy ratio.
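To make the frontier concrete, the following sketch (Python, using only the accuracy and cost figures from the leaderboard above) flags every model that no other model beats on both dimensions at once:

```python
# Pareto frontier from the leaderboard: a model is on the frontier if no other
# model is simultaneously cheaper and at least as accurate (or equally cheap
# and strictly more accurate).
models = [
    ("Claude Opus 4.6 (Thinking)", 60.65, 1.82),
    ("GPT 5.1", 56.55, 0.35),
    ("Claude Sonnet 4.5 (Thinking)", 55.32, 0.98),
    ("Gemini 3 Pro", 54.84, 0.52),
    ("GPT 5", 53.47, 0.40),
    ("Claude Sonnet 4.5", 52.62, 0.21),
    ("o3", 52.28, 1.45),
    ("Gemini 2.5 Flash", 50.52, 0.12),
    ("GPT 4o", 45.23, 0.09),
    ("Llama 4 Maverick", 42.15, 0.06),
    ("DeepSeek R1", 35.10, 0.04),
]

def pareto_frontier(entries):
    """Return the models not dominated on both cost (lower) and accuracy (higher)."""
    frontier = []
    for name, acc, cost in entries:
        dominated = any(
            (c <= cost and a > acc) or (c < cost and a >= acc)
            for _, a, c in entries
        )
        if not dominated:
            frontier.append((name, acc, cost))
    return sorted(frontier, key=lambda m: m[2])  # cheapest first

for name, acc, cost in pareto_frontier(models):
    print(f"{name}: {acc:.2f}% accuracy at ${cost:.2f}/query")
```

With these figures, the frontier runs from DeepSeek R1 at the low end through Gemini 2.5 Flash, Claude Sonnet 4.5, and GPT 5.1 up to Claude Opus 4.6; Gemini 3 Pro sits just off it, edged out on both axes by GPT 5.1, consistent with the deployment guidance above.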
Dataset & Methodology
Question Distribution
Provenance
The benchmark was curated by researchers from Stanford, practitioners at global systemically important banks (GSIBs), and financial industry experts. Questions are drawn from real-world financial analysis scenarios and publicly available financial data.
Tools Provided to Agents
Each AI agent was given access to four tools during evaluation: a calculator, a document retriever, a code interpreter, and a plotter.
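The benchmark's actual tool interfaces are not published in this report, so the sketch below is an illustrative assumption only: one way those four tools could be exposed to an agent as callables, with placeholder implementations standing in for the real harness.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    name: str
    description: str
    run: Callable[[str], str]  # takes a query or expression, returns a text result

def _calculator(expression: str) -> str:
    # Restricted arithmetic evaluation: no builtins available to the expression.
    return str(eval(expression, {"__builtins__": {}}, {}))

def _retriever(query: str) -> str:
    # Placeholder: a real harness would search the provided filings here.
    return f"[top passages matching {query!r}]"

def _code_interpreter(code: str) -> str:
    # Placeholder: a real harness would execute the code in a sandbox.
    return "[sandboxed execution result]"

def _plotter(spec: str) -> str:
    # Placeholder: a real harness would render a chart and return its location.
    return "[path to rendered chart]"

TOOLS = {
    "calculator": Tool("calculator", "Evaluate arithmetic expressions", _calculator),
    "retriever": Tool("retriever", "Search the provided filings and reports", _retriever),
    "code_interpreter": Tool("code_interpreter", "Run Python for multi-step math", _code_interpreter),
    "plotter": Tool("plotter", "Render charts from tabular data", _plotter),
}
```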
Question Taxonomy
Nine categories of financial analysis questions, organized by difficulty. Each tests different aspects of financial reasoning, from basic retrieval to complex multi-step analysis.
Metrics & Ratios
Calculate standard financial ratios from provided statements.
"What was Apple's current ratio in Q3 2024?"
Financial Data Retrieval
Extract specific data points from filings and reports.
"What was Tesla's total revenue in FY 2024?"
Time Series Basics
Identify trends and patterns in sequential financial data.
"Calculate the 3-year CAGR for Amazon's AWS segment."
Equity Analysis
Valuation, comparable analysis, and investment thesis development.
"Estimate the intrinsic value of MSFT using a DCF with 10% WACC."
Credit Analysis
Evaluate creditworthiness, debt structures, and default risk.
"Assess the debt service coverage ratio for Boeing's 2024 filings."
Corporate Actions
Analyze M&A, buybacks, dividends, and restructuring impacts.
"How many shares did Netflix repurchase in Q4 2024?"
Options & Derivatives
Price options, calculate Greeks, and analyze derivative strategies.
"Calculate the implied volatility for SPY Jan 2026 calls."
Multi-Step Reasoning
Chain multiple analyses to arrive at a complex financial conclusion.
"Compare the risk-adjusted returns of a tech vs. healthcare portfolio over 5 years."
Scenario Analysis
Model outcomes under varying economic assumptions and stress tests.
"Model the impact of a 200bp rate hike on regional bank portfolios."
Top 3 Models by Category
Tool Usage Analysis
How effectively models leveraged each tool reveals their strengths and weaknesses in different aspects of financial analysis.
Key finding: Models consistently excelled at quantitative computation (calculator and code interpreter) but struggled with document retrieval — the tool most critical for real-world financial analysis where data must be located before it can be analyzed. This suggests that for enterprise deployments, retrieval-augmented generation (RAG) pipeline quality may matter more than raw model intelligence.
Real-World Example
A concrete example shows how models diverge on the same financial question. This Netflix Q4 2024 stock repurchase question illustrates the practical difference between a correct and incorrect answer.
"Based on Netflix's Q4 2024 10-Q filing, how many shares of common stock did the company repurchase during the quarter, and what was the total cost?"
Representative correct responses:
"Netflix repurchased 2.6 million shares for approximately $2.5 billion in Q4 2024, at an average price of ~$935 per share."
"2.6M shares repurchased at a total cost of $2.5B. The program was authorized under the board's $15B repurchase plan."
"Per the filing, Netflix bought back 2.6 million shares in Q4 2024, spending $2.5 billion under its existing authorization."
Representative incorrect response:
"Netflix repurchased 1.8 million shares for $1.6 billion in Q4 2024."
Incorrect share count and total cost — likely hallucinated from older filing data.
This example illustrates a common failure mode: models that lack robust retrieval capabilities may confabulate plausible but incorrect financial figures, pulling from training data rather than the provided documents. In financial due diligence, this distinction is the difference between a sound investment thesis and a flawed one.
What This Means for Financial Services
AI-Powered Due Diligence Is No Longer Hypothetical
With top models now exceeding 60% accuracy on a benchmark designed by financial professionals, AI agents are crossing the threshold from experimental to deployable for specific financial analysis tasks. For PE firms running hundreds of deal screenings annually, the productivity gain from even semi-automated financial analysis is substantial — but only if the right model is selected for the task.
Model Selection Is a Strategic Decision
The benchmark reveals that no single model dominates across all categories. Claude Opus leads overall, but GPT 5.1 offers compelling cost efficiency. Gemini 3 Pro excels at specific category types. For organizations deploying AI across multiple financial analysis workflows, a multi-model strategy — routing different question types to specialized models — may yield the best results.
Cost Is Not a Proxy for Quality
The most expensive model (Claude Opus at $1.82/query) does deliver the highest accuracy, but GPT 5.1 at $0.35/query captures over 93% of that accuracy at less than 20% of the cost. For large-scale deployments processing thousands of queries, this cost differential compounds dramatically. Understanding the Pareto frontier is essential for budget-conscious enterprises.
The Retrieval Gap Is the Real Bottleneck
Perhaps the most important finding is that models struggle most with information retrieval — the foundational step of any financial analysis. This suggests that enterprises should invest as heavily in their data pipelines and RAG infrastructure as they do in model selection. A mediocre model with excellent retrieval may outperform a top model with poor data access.
Attribution & Methodology
Data Source
All benchmark data is sourced from the Vals.ai Finance Agent Benchmark. Vals.ai is an independent AI evaluation platform that provides standardized benchmarks for assessing AI model performance across professional domains.
Contributors
The benchmark questions were developed by researchers at Stanford University, practitioners at global systemically important banks, and financial industry subject matter experts. The evaluation framework was designed to test capabilities that directly mirror real-world financial analysis workflows.
Methodology Notes
Models were evaluated in an agentic setting with access to four tools (calculator, retriever, code interpreter, plotter). Accuracy reflects exact-match scoring on the full 537-question test set. Cost per query represents the average API cost at standard pricing as of February 2026. Analysis and editorial commentary by WorkWise Solutions.
Deploy AI-Powered Financial Analysis
Selecting the right AI model for your financial workflows requires more than benchmarks — it requires understanding your specific use cases, data infrastructure, and risk tolerance. Let us help you build a deployment strategy grounded in data.
Schedule Strategic Consultation