AI Model Evals
Which AI models actually excel at financial analysis? We analyzed the leading foundation models across 537 real-world finance questions to find out.
WorkWise Solutions | March 2026 | Data: Vals.ai Finance Agent Benchmark (v1.1)
- Claude Sonnet 4.6 leads the benchmark with the highest overall accuracy score.
- Curated by Stanford researchers, GSIB practitioners, and financial industry experts.
- From equity analysis to options pricing, spanning easy, medium, and hard difficulty levels.
For PE and alternative investment firms evaluating AI-powered due diligence or portfolio analysis, these numbers matter. The gap between the best and worst performing models is over 26 percentage points — meaning the wrong model choice could leave critical financial insights on the table. This benchmark provides the empirical foundation for smarter model selection in high-stakes financial environments.
Model Leaderboard
Overall accuracy rankings across the full 537-question benchmark. Results reflect each model's ability to reason through complex financial analysis tasks.
| Rank | Model | Vendor | Accuracy | Cost / Query |
|---|---|---|---|---|
| 1 | Claude Sonnet 4.6 | Anthropic | 63.31% | $0.45 |
| 2 | Claude Opus 4.6 (Thinking) | Anthropic | 62.01% | $1.82 |
| 3 | Gemini 3.1 Pro | Google | 60.83% | $0.48 |
| 4 | GPT 5.2 | OpenAI | 59.76% | $0.32 |
| 5 | GPT 5.4 | OpenAI | 59.14% | $0.38 |
| 6 | Claude Sonnet 4.5 (Thinking) | Anthropic | 57.31% | $0.98 |
| 7 | GPT 5.1 | OpenAI | 56.55% | $0.35 |
| 8 | Gemini 2.5 Flash | Google | 52.18% | $0.12 |
| 9 | GPT 4o | OpenAI | 46.82% | $0.09 |
| 10 | Llama 4 Maverick | Meta | 43.57% | $0.06 |
| 11 | DeepSeek R1 | DeepSeek | 37.24% | $0.04 |
Improvement trend: The top-performing models have improved accuracy by over 15 percentage points in the past six months, suggesting rapid capability gains in financial reasoning. For enterprise deployments, this means today's selection is a moving target — continuous evaluation is essential.
Cost vs. Accuracy
The Pareto frontier reveals which models deliver the best accuracy per dollar — critical intelligence for enterprise-scale deployment decisions.
Enterprise deployment implication: Claude Sonnet 4.6 now leads in both accuracy and cost efficiency, sitting firmly on the Pareto frontier at $0.45/query. GPT 5.2 and Gemini 3.1 Pro offer strong accuracy at competitive cost. Claude Opus 4.6 (Thinking) commands a premium at $1.82/query but delivers near-top accuracy — justified for high-stakes financial analysis where errors are costly. For high-volume screening tasks, Gemini 2.5 Flash offers an attractive cost-accuracy ratio.
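The frontier can be computed directly from the leaderboard above. A minimal sketch — the accuracy and cost figures come from the table, while the `pareto_frontier` helper is illustrative, not part of the Vals.ai methodology:

```python
# Identify Pareto-optimal models: those not dominated by any model that is
# at least as accurate for strictly less cost (or strictly more accurate
# for the same or less cost). Figures from the leaderboard table above.
leaderboard = [
    ("Claude Sonnet 4.6", 63.31, 0.45),
    ("Claude Opus 4.6 (Thinking)", 62.01, 1.82),
    ("Gemini 3.1 Pro", 60.83, 0.48),
    ("GPT 5.2", 59.76, 0.32),
    ("GPT 5.4", 59.14, 0.38),
    ("Claude Sonnet 4.5 (Thinking)", 57.31, 0.98),
    ("GPT 5.1", 56.55, 0.35),
    ("Gemini 2.5 Flash", 52.18, 0.12),
    ("GPT 4o", 46.82, 0.09),
    ("Llama 4 Maverick", 43.57, 0.06),
    ("DeepSeek R1", 37.24, 0.04),
]

def pareto_frontier(models):
    """Return names of models no other model dominates on (accuracy, cost)."""
    frontier = []
    for name, acc, cost in models:
        dominated = any(
            (a >= acc and c < cost) or (a > acc and c <= cost)
            for n, a, c in models
            if n != name
        )
        if not dominated:
            frontier.append(name)
    return frontier

print(pareto_frontier(leaderboard))
```

Run against the table, this puts Claude Sonnet 4.6, GPT 5.2, and the cheaper screening-tier models on the frontier, while Opus 4.6 (Thinking) and Gemini 3.1 Pro sit just off it — consistent with the deployment guidance above.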
Dataset & Methodology
Question Distribution
Provenance
The benchmark was curated by researchers from Stanford, practitioners at global systemically important banks (GSIBs), and financial industry experts. Questions are drawn from real-world financial analysis scenarios and publicly available financial data.
Tools Provided to Agents
Each AI agent was given access to four tools during evaluation: a calculator, a document retriever, a code interpreter, and a plotter.
Question Taxonomy
Nine categories of financial analysis questions, organized by difficulty. Each tests different aspects of financial reasoning, from basic retrieval to complex multi-step analysis.
Quantitative Retrieval
Extract specific numerical data points from SEC filings and financial statements.
"What was Tesla's total revenue in FY 2024?"
Qualitative Retrieval
Locate and summarize qualitative information from financial documents and disclosures.
"What risk factors did Apple disclose in its latest 10-K?"
Numerical Reasoning
Perform calculations using retrieved financial data — ratios, growth rates, and basic math.
"Calculate Apple's current ratio from its Q3 2024 balance sheet."
Complex Retrieval
Synthesize information across multiple filings, exhibits, or time periods to answer a question.
"Compare Microsoft's segment revenue mix in 2023 vs. 2024."
Adjustments
Apply accounting adjustments — normalizing earnings, restating figures, and reconciling GAAP vs. non-GAAP.
"Calculate Meta's adjusted EBITDA excluding stock-based compensation."
Beat or Miss
Determine whether a company beat or missed analyst consensus estimates on key metrics.
"Did Netflix beat or miss Q4 2024 EPS consensus?"
Trends
Identify and analyze multi-period financial trends, inflection points, and trajectory changes.
"Analyze the 3-year margin trend for Amazon's AWS segment."
Financial Modeling
Build or validate financial models — DCFs, LBO models, and projection scenarios.
"Estimate the intrinsic value of MSFT using a DCF with 10% WACC."
Market Analysis
Evaluate market dynamics, competitive positioning, and macro impacts on financial performance.
"Assess the impact of rising rates on regional bank net interest margins."
Top 3 Models by Category
Tool Usage Analysis
How effectively models leveraged each tool reveals their strengths and weaknesses in different aspects of financial analysis.
Key finding: Models consistently excelled at quantitative computation (calculator and code interpreter) but struggled with document retrieval — the tool most critical for real-world financial analysis where data must be located before it can be analyzed. This suggests that for enterprise deployments, retrieval-augmented generation (RAG) pipeline quality may matter more than raw model intelligence.
Real-World Example
A concrete example shows how models diverge on the same financial question. This Netflix Q4 2024 stock repurchase question illustrates the practical difference between a correct and incorrect answer.
"Based on Netflix's Q4 2024 10-Q filing, how many shares of common stock did the company repurchase during the quarter, and what was the total cost?"
Correct responses:
- "Netflix repurchased 2.6 million shares for approximately $2.5 billion in Q4 2024, at an average price of ~$935 per share."
- "2.6M shares repurchased at a total cost of $2.5B. The program was authorized under the board's $15B repurchase plan."
- "Per the filing, Netflix bought back 2.6 million shares in Q4 2024, spending $2.5 billion under its existing authorization."

Incorrect response:
- "Netflix repurchased 1.8 million shares for $1.6 billion in Q4 2024." — Incorrect share count and total cost, likely hallucinated from older filing data.
This example illustrates a common failure mode: models that lack robust retrieval capabilities may confabulate plausible but incorrect financial figures, pulling from training data rather than the provided documents. In financial due diligence, this distinction is the difference between a sound investment thesis and a flawed one.
What This Means for Financial Services
AI-Powered Due Diligence Is No Longer Hypothetical
With top models now exceeding 60% accuracy on a benchmark designed by financial professionals, AI agents are crossing the threshold from experimental to deployable for specific financial analysis tasks. For PE firms running hundreds of deal screenings annually, the productivity gain from even semi-automated financial analysis is substantial — but only if the right model is selected for the task.
Model Selection Is a Strategic Decision
The benchmark reveals that no single model dominates across all categories. Claude Sonnet 4.6 leads overall with remarkable cost efficiency, while Opus 4.6 (Thinking) excels at the hardest tasks. Gemini 3.1 Pro and GPT 5.2 offer strong alternatives. For organizations deploying AI across multiple financial analysis workflows, a multi-model strategy — routing different question types to specialized models — may yield the best results.
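In practice, a multi-model strategy can start as a simple lookup from question category to preferred model. A hypothetical sketch — the category-to-model mapping below is illustrative only, not the benchmark's per-category rankings, and the model identifiers are placeholders:

```python
# Hypothetical category router. The mapping is illustrative; real routing
# should be driven by your own per-category evaluation results.
ROUTES = {
    "financial_modeling": "claude-opus-4.6-thinking",  # premium, hardest tasks
    "quantitative_retrieval": "gpt-5.2",               # strong accuracy, low cost
    "beat_or_miss": "gemini-2.5-flash",                # high-volume screening
}
DEFAULT_MODEL = "claude-sonnet-4.6"                    # overall benchmark leader

def route(category: str) -> str:
    """Return the model to use for a given question category."""
    return ROUTES.get(category, DEFAULT_MODEL)

print(route("financial_modeling"))  # routes to the premium model
print(route("trends"))              # falls back to the overall leader
```

A router like this keeps the fallback explicit: any category you have not evaluated per-category defaults to the overall leaderboard winner.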
Cost Is Not a Proxy for Quality
Claude Sonnet 4.6 has upended the cost-accuracy calculus — delivering the highest accuracy at just $0.45/query. Meanwhile, GPT 5.2 at $0.32/query captures over 94% of that accuracy at an even lower cost. Claude Opus 4.6 (Thinking) at $1.82/query remains a premium option for the most demanding tasks. For large-scale deployments processing thousands of queries, understanding the Pareto frontier is essential for budget-conscious enterprises.
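The tradeoff becomes concrete when framed as cost per *correct* answer. A rough back-of-envelope using the leaderboard figures (ignoring retries and orchestration overhead):

```python
# Effective cost per correct answer = cost per query / accuracy.
# Accuracy and cost figures from the leaderboard table.
models = {
    "Claude Sonnet 4.6": (0.6331, 0.45),
    "Claude Opus 4.6 (Thinking)": (0.6201, 1.82),
    "GPT 5.2": (0.5976, 0.32),
    "Gemini 2.5 Flash": (0.5218, 0.12),
}
for name, (accuracy, cost) in models.items():
    print(f"{name}: ${cost / accuracy:.2f} per correct answer")

# Relative accuracy of GPT 5.2 vs. the overall leader:
print(f"{0.5976 / 0.6331:.1%}")  # ~94.4%
```

On this measure GPT 5.2 is the cheapest of the high-accuracy tier per correct answer, while Opus 4.6 (Thinking) costs several times more per correct answer than Sonnet 4.6 — a premium that only pencils out when individual errors are very expensive.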
The Retrieval Gap Is the Real Bottleneck
Perhaps the most important finding is that models struggle most with information retrieval — the foundational step of any financial analysis. This suggests that enterprises should invest as heavily in their data pipelines and RAG infrastructure as they do in model selection. A mediocre model with excellent retrieval may outperform a top model with poor data access.
Attribution & Methodology
Data Source
All benchmark data is sourced from the Vals.ai Finance Agent Benchmark (v1.1). Vals.ai is an independent AI evaluation platform that provides standardized benchmarks for assessing AI model performance across professional domains. Data last updated March 2026.
Contributors
The benchmark questions were developed by researchers at Stanford University, practitioners at global systemically important banks, and financial industry subject matter experts. The evaluation framework was designed to test capabilities that directly mirror real-world financial analysis workflows.
Methodology Notes
Models were evaluated in an agentic setting with access to four tools (calculator, retriever, code interpreter, plotter). Accuracy reflects exact-match scoring on the full 537-question test set. Cost per query represents the average API cost at standard pricing as of February 2026. Analysis and editorial commentary by WorkWise Solutions.
Deploy AI-Powered Financial Analysis
Selecting the right AI model for your financial workflows requires more than benchmarks — it requires understanding your specific use cases, data infrastructure, and risk tolerance. Let us help you build a deployment strategy grounded in data.
Solutions Powered by These Benchmarks
The models you've evaluated here are the engines behind our production-grade solutions. See how they perform at scale.
AI Deal Screener
Deploy these models for automated CIM analysis — screen hundreds of deals with AI-powered document intelligence.
Investor Reporting Engine
AI-powered institutional reporting that transforms raw financial data into investor-ready narratives and quarterly updates.
Public Markets Engine
Filing analysis, earnings intelligence, and competitive monitoring — the benchmark leaders deployed at production scale.