Benchmark Analysis

AI Model Evals

Which AI models actually excel at financial analysis? We analyzed the leading foundation models across 537 real-world finance questions to find out.

WorkWise Solutions | March 2026 | Data: Vals.ai Finance Agent Benchmark (v1.1)

63.31%
Top Accuracy

Claude Sonnet 4.6 leads the benchmark with the highest overall accuracy score.

537
Questions

Curated by Stanford researchers, GSIB practitioners, and financial industry experts.

9
Categories

From equity analysis to options pricing, spanning easy, medium, and hard difficulty levels.

For PE and alternative investment firms evaluating AI-powered due diligence or portfolio analysis, these numbers matter. The gap between the best- and worst-performing models is over 26 percentage points — meaning the wrong model choice could leave critical financial insights on the table. This benchmark provides the empirical foundation for smarter model selection in high-stakes financial environments.

Model Leaderboard

Overall accuracy rankings across the full 537-question benchmark. Results reflect each model's ability to reason through complex financial analysis tasks.

| Rank | Model | Vendor | Accuracy | Cost / Query |
|------|-------|--------|----------|--------------|
| 1 | Claude Sonnet 4.6 | Anthropic | 63.31% | $0.45 |
| 2 | Claude Opus 4.6 (Thinking) | Anthropic | 62.01% | $1.82 |
| 3 | Gemini 3.1 Pro | Google | 60.83% | $0.48 |
| 4 | GPT 5.2 | OpenAI | 59.76% | $0.32 |
| 5 | GPT 5.4 | OpenAI | 59.14% | $0.38 |
| 6 | Claude Sonnet 4.5 (Thinking) | Anthropic | 57.31% | $0.98 |
| 7 | GPT 5.1 | OpenAI | 56.55% | $0.35 |
| 8 | Gemini 2.5 Flash | Google | 52.18% | $0.12 |
| 9 | GPT 4o | OpenAI | 46.82% | $0.09 |
| 10 | Llama 4 Maverick | Meta | 43.57% | $0.06 |
| 11 | DeepSeek R1 | DeepSeek | 37.24% | $0.04 |

Improvement trend: The top-performing models have improved accuracy by over 15 percentage points in the past six months, suggesting rapid capability gains in financial reasoning. For enterprise deployments, this means today's selection is a moving target — continuous evaluation is essential.

Cost vs. Accuracy

The Pareto frontier reveals which models deliver the best accuracy per dollar — critical intelligence for enterprise-scale deployment decisions.

Enterprise deployment implication: Claude Sonnet 4.6 now leads in both accuracy and cost efficiency, sitting firmly on the Pareto frontier at $0.45/query. GPT 5.2 and Gemini 3.1 Pro offer strong accuracy at competitive cost. Claude Opus 4.6 (Thinking) commands a premium at $1.82/query but delivers near-top accuracy — justified for high-stakes financial analysis where errors are costly. For high-volume screening tasks, Gemini 2.5 Flash offers an attractive cost-accuracy ratio.
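The frontier described above can be recomputed directly from the leaderboard figures. A minimal sketch in Python: a model is kept as frontier-optimal when no other model is strictly better on both axes (higher accuracy and lower cost).

```python
# Leaderboard figures from the table above: (model, accuracy %, cost $/query).
LEADERBOARD = [
    ("Claude Sonnet 4.6", 63.31, 0.45),
    ("Claude Opus 4.6 (Thinking)", 62.01, 1.82),
    ("Gemini 3.1 Pro", 60.83, 0.48),
    ("GPT 5.2", 59.76, 0.32),
    ("GPT 5.4", 59.14, 0.38),
    ("Claude Sonnet 4.5 (Thinking)", 57.31, 0.98),
    ("GPT 5.1", 56.55, 0.35),
    ("Gemini 2.5 Flash", 52.18, 0.12),
    ("GPT 4o", 46.82, 0.09),
    ("Llama 4 Maverick", 43.57, 0.06),
    ("DeepSeek R1", 37.24, 0.04),
]


def pareto_frontier(rows):
    """Keep each model that no other model beats on both accuracy and cost."""
    return [
        (name, acc, cost)
        for name, acc, cost in rows
        if not any(a > acc and c < cost for _, a, c in rows)
    ]
```

Run against this leaderboard, the sketch keeps Claude Sonnet 4.6, GPT 5.2, Gemini 2.5 Flash, GPT 4o, Llama 4 Maverick, and DeepSeek R1; every other model is dominated by a cheaper, more accurate alternative.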

Dataset & Methodology

Question Distribution

Provenance

The benchmark was curated by researchers from Stanford, practitioners at global systemically important banks (GSIBs), and financial industry experts. Questions are drawn from real-world financial analysis scenarios and publicly available financial data.

Tools Provided to Agents

Each AI agent was given access to four tools during evaluation:

Calculator — Arithmetic and financial computations
Retriever — Look up financial data from provided documents
Code Interpreter — Execute Python for analysis and modeling
Plotter — Generate charts and data visualizations
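To make the agentic setup concrete, here is a hypothetical sketch of how two of these tools might be exposed to an agent as a name-to-callable registry. The function names mirror the list above, but the signatures, the AST-based safe evaluator, and the keyword-match retriever are illustrative assumptions, not the benchmark's actual API.

```python
import ast
import operator as op

# Supported arithmetic operators for the calculator sketch.
_OPS = {ast.Add: op.add, ast.Sub: op.sub, ast.Mult: op.mul,
        ast.Div: op.truediv, ast.Pow: op.pow, ast.USub: op.neg}


def calculator(expression: str) -> float:
    """Safely evaluate an arithmetic expression such as '2500 / 2.6'."""
    def _eval(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp):
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp):
            return _OPS[type(node.op)](_eval(node.operand))
        raise ValueError("unsupported expression")
    return _eval(ast.parse(expression, mode="eval").body)


def retriever(query: str, documents: dict) -> list:
    """Naive keyword lookup over provided documents (a stand-in for real RAG)."""
    terms = query.lower().split()
    return [name for name, text in documents.items()
            if any(t in text.lower() for t in terms)]


# Registry the agent loop would dispatch tool calls against.
TOOLS = {"calculator": calculator, "retriever": retriever}
```

A real harness would add the code interpreter and plotter the same way; the point is that tool quality (especially the retriever) is part of the system under test, not just the model.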

Question Taxonomy

Nine categories of financial analysis questions, organized by difficulty. Each tests different aspects of financial reasoning, from basic retrieval to complex multi-step analysis.

Easy

Quantitative Retrieval

Extract specific numerical data points from SEC filings and financial statements.

"What was Tesla's total revenue in FY 2024?"

Easy

Qualitative Retrieval

Locate and summarize qualitative information from financial documents and disclosures.

"What risk factors did Apple disclose in its latest 10-K?"

Easy

Numerical Reasoning

Perform calculations using retrieved financial data — ratios, growth rates, and basic math.

"Calculate Apple's current ratio from its Q3 2024 balance sheet."

Medium

Complex Retrieval

Synthesize information across multiple filings, exhibits, or time periods to answer a question.

"Compare Microsoft's segment revenue mix in 2023 vs. 2024."

Medium

Adjustments

Apply accounting adjustments — normalizing earnings, restating figures, and reconciling GAAP vs. non-GAAP.

"Calculate Meta's adjusted EBITDA excluding stock-based compensation."

Medium

Beat or Miss

Determine whether a company beat or missed analyst consensus estimates on key metrics.

"Did Netflix beat or miss Q4 2024 EPS consensus?"

Hard

Trends

Identify and analyze multi-period financial trends, inflection points, and trajectory changes.

"Analyze the 3-year margin trend for Amazon's AWS segment."

Hard

Financial Modeling

Build or validate financial models — DCFs, LBO models, and projection scenarios.

"Estimate the intrinsic value of MSFT using a DCF with 10% WACC."

Hard

Market Analysis

Evaluate market dynamics, competitive positioning, and macro impacts on financial performance.

"Assess the impact of rising rates on regional bank net interest margins."

Top 3 Models by Category

Tool Usage Analysis

How effectively each model leveraged the four tools reveals its strengths and weaknesses across different aspects of financial analysis.

Key finding: Models consistently excelled at quantitative computation (calculator and code interpreter) but struggled with document retrieval — the tool most critical for real-world financial analysis where data must be located before it can be analyzed. This suggests that for enterprise deployments, retrieval-augmented generation (RAG) pipeline quality may matter more than raw model intelligence.

Real-World Example

A concrete example shows how models diverge on the same financial question. This Netflix Q4 2024 stock repurchase question illustrates the practical difference between a correct and incorrect answer.

Question

"Based on Netflix's Q4 2024 10-Q filing, how many shares of common stock did the company repurchase during the quarter, and what was the total cost?"

Correct Answers
Claude Sonnet 4.6

Netflix repurchased 2.6 million shares for approximately $2.5 billion in Q4 2024, at an average price of ~$935 per share.

Gemini 3.1 Pro

2.6M shares repurchased at a total cost of $2.5B. The program was authorized under the board's $15B repurchase plan.

GPT 5.2

Per the filing, Netflix bought back 2.6 million shares in Q4 2024, spending $2.5 billion under its existing authorization.

Incorrect Answers
GPT 4o

Netflix repurchased 1.8 million shares for $1.6 billion in Q4 2024.

Incorrect share count and total cost — likely hallucinated from older filing data.

This example illustrates a common failure mode: models that lack robust retrieval capabilities may confabulate plausible but incorrect financial figures, pulling from training data rather than the provided documents. In financial due diligence, this distinction is the difference between a sound investment thesis and a flawed one.

What This Means for Financial Services

AI-Powered Due Diligence Is No Longer Hypothetical

With top models now exceeding 60% accuracy on a benchmark designed by financial professionals, AI agents are crossing the threshold from experimental to deployable for specific financial analysis tasks. For PE firms running hundreds of deal screenings annually, the productivity gain from even semi-automated financial analysis is substantial — but only if the right model is selected for the task.

Model Selection Is a Strategic Decision

The benchmark reveals that no single model dominates across all categories. Claude Sonnet 4.6 leads overall with remarkable cost efficiency, while Opus 4.6 (Thinking) excels at the hardest tasks. Gemini 3.1 Pro and GPT 5.2 offer strong alternatives. For organizations deploying AI across multiple financial analysis workflows, a multi-model strategy — routing different question types to specialized models — may yield the best results.
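One way to implement the multi-model strategy described above is a simple category-to-model router. A minimal sketch; the model identifiers and routing assignments below are illustrative assumptions (informed by the overall results, with the hardest categories sent to the premium model), and a production table should come from your own per-category evaluation:

```python
# Illustrative routing table: question category -> model identifier.
# Assignments are assumptions for the sketch, not benchmark-derived facts.
ROUTING_TABLE = {
    "quantitative_retrieval": "claude-sonnet-4.6",
    "numerical_reasoning": "gpt-5.2",
    "financial_modeling": "claude-opus-4.6-thinking",
    "market_analysis": "claude-opus-4.6-thinking",
}

# Fall back to the overall leaderboard leader for unmapped categories.
DEFAULT_MODEL = "claude-sonnet-4.6"


def route(category: str) -> str:
    """Pick the model for a question category, falling back to the default."""
    return ROUTING_TABLE.get(category, DEFAULT_MODEL)
```

The design choice here is deliberate simplicity: a static lookup is auditable and cheap, which matters in regulated financial workflows; a learned router can replace it later without changing the calling code.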

Cost Is Not a Proxy for Quality

Claude Sonnet 4.6 has upended the cost-accuracy calculus — delivering the highest accuracy at just $0.45/query. Meanwhile, GPT 5.2 at $0.32/query captures over 94% of that accuracy at an even lower cost. Claude Opus 4.6 (Thinking) at $1.82/query remains a premium option for the most demanding tasks. For large-scale deployments processing thousands of queries, understanding the Pareto frontier is essential for budget-conscious enterprises.
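The tradeoff above reduces to simple arithmetic on the leaderboard figures. A quick sketch, with the query volume as an illustrative assumption:

```python
# Leaderboard figures from the table above.
sonnet_acc, sonnet_cost = 63.31, 0.45   # Claude Sonnet 4.6
gpt52_acc, gpt52_cost = 59.76, 0.32     # GPT 5.2
opus_cost = 1.82                         # Claude Opus 4.6 (Thinking)

# GPT 5.2 captures "over 94%" of Sonnet 4.6's accuracy.
relative_accuracy = gpt52_acc / sonnet_acc

# Illustrative monthly volume for an enterprise screening workload.
monthly_queries = 10_000
sonnet_monthly = monthly_queries * sonnet_cost   # ~$4,500
gpt52_monthly = monthly_queries * gpt52_cost     # ~$3,200
opus_monthly = monthly_queries * opus_cost       # ~$18,200
```

At this volume the premium model costs roughly four times the frontier options, which is why reserving it for the highest-stakes queries (rather than using it everywhere) is the budget-rational choice.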

The Retrieval Gap Is the Real Bottleneck

Perhaps the most important finding is that models struggle most with information retrieval — the foundational step of any financial analysis. This suggests that enterprises should invest as heavily in their data pipelines and RAG infrastructure as they do in model selection. A mediocre model with excellent retrieval may outperform a top model with poor data access.

Attribution & Methodology

Data Source

All benchmark data is sourced from the Vals.ai Finance Agent Benchmark (v1.1). Vals.ai is an independent AI evaluation platform that provides standardized benchmarks for assessing AI model performance across professional domains. Data last updated March 2026.

Contributors

The benchmark questions were developed by researchers at Stanford University, practitioners at global systemically important banks, and financial industry subject matter experts. The evaluation framework was designed to test capabilities that directly mirror real-world financial analysis workflows.

Methodology Notes

Models were evaluated in an agentic setting with access to four tools (calculator, retriever, code interpreter, plotter). Accuracy reflects exact-match scoring on the full 537-question test set. Cost per query represents the average API cost at standard pricing as of February 2026. Analysis and editorial commentary by WorkWise Solutions.
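Exact-match scoring as described above can be sketched in a few lines. The normalization rules here (lowercasing, stripping currency symbols and thousands separators) are assumptions for illustration, not Vals.ai's actual grader:

```python
def normalize(answer: str) -> str:
    """Lowercase and strip whitespace, commas, and dollar signs."""
    return answer.strip().lower().replace(",", "").replace("$", "")


def exact_match_accuracy(predictions: list, references: list) -> float:
    """Fraction of predictions whose normalized answer equals the reference."""
    assert len(predictions) == len(references), "one reference per prediction"
    correct = sum(normalize(p) == normalize(r)
                  for p, r in zip(predictions, references))
    return correct / len(references)
```

For example, a prediction of "$2,500,000" matches a reference of "2500000" after normalization, while "1.8M" against "2.6M" does not; over those two questions the score is 0.5.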

Deploy AI-Powered Financial Analysis

Selecting the right AI model for your financial workflows requires more than benchmarks — it requires understanding your specific use cases, data infrastructure, and risk tolerance. Let us help you build a deployment strategy grounded in data.

Schedule Consultation