Benchmark Analysis

AI Model Evals

Which AI models are actually good at financial analysis? We tested the top foundation models on 537 real finance questions to find out.

WorkWise Solutions | March 2026 | Data: Vals.ai Finance Agent Benchmark (v1.1)

63.31%
Top Accuracy

Claude Sonnet 4.6 leads the benchmark with the highest overall accuracy score.

537
Questions

Curated by Stanford researchers, GSIB practitioners, and financial industry experts.

9
Categories

From equity analysis to options pricing, spanning easy, medium, and hard difficulty levels.

For PE and alternative investment firms picking AI for due diligence or portfolio analysis, these numbers matter. The gap between the best and worst models is over 26 percentage points. Pick the wrong one and you leave critical insights on the table. This benchmark gives you the data to choose well.

Model Leaderboard

Overall accuracy across the full 537-question benchmark. Results show each model's ability to reason through complex financial analysis.

| Rank | Model | Vendor | Accuracy | Cost / Query |
|------|-------|--------|----------|--------------|
| 1 | Claude Sonnet 4.6 | Anthropic | 63.31% | $0.45 |
| 2 | Claude Opus 4.6 (Thinking) | Anthropic | 62.01% | $1.82 |
| 3 | Gemini 3.1 Pro | Google | 60.83% | $0.48 |
| 4 | GPT 5.2 | OpenAI | 59.76% | $0.32 |
| 5 | GPT 5.4 | OpenAI | 59.14% | $0.38 |
| 6 | Claude Sonnet 4.5 (Thinking) | Anthropic | 57.31% | $0.98 |
| 7 | GPT 5.1 | OpenAI | 56.55% | $0.35 |
| 8 | Gemini 2.5 Flash | Google | 52.18% | $0.12 |
| 9 | GPT 4o | OpenAI | 46.82% | $0.09 |
| 10 | Llama 4 Maverick | Meta | 43.57% | $0.06 |
| 11 | DeepSeek R1 | DeepSeek | 37.24% | $0.04 |

Improvement trend: Top models have gained over 15 percentage points in accuracy in the last six months. Financial reasoning is improving fast. For enterprise deployments, today's pick is a moving target. Keep evaluating.

Cost vs. Accuracy

The Pareto frontier shows which models give you the best accuracy per dollar. Critical for enterprise-scale deployments.

What it means: Claude Sonnet 4.6 sits on the Pareto frontier at $0.45/query, pairing the top accuracy with a mid-range cost. GPT 5.2 and Gemini 3.1 Pro are strong runners-up at a competitive cost. Claude Opus 4.6 (Thinking) is the premium pick at $1.82/query but delivers near-top accuracy, worth it for high-stakes analysis where errors are expensive. For high-volume screening, Gemini 2.5 Flash gives you a strong cost-accuracy ratio.
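The Pareto frontier in this section can be computed directly from the leaderboard table: a model is on the frontier if no other model is simultaneously cheaper and more accurate. A minimal sketch, using the accuracy and cost figures published above:

```python
# Leaderboard data from the table above: (model, accuracy %, cost per query $).
models = [
    ("Claude Sonnet 4.6", 63.31, 0.45),
    ("Claude Opus 4.6 (Thinking)", 62.01, 1.82),
    ("Gemini 3.1 Pro", 60.83, 0.48),
    ("GPT 5.2", 59.76, 0.32),
    ("GPT 5.4", 59.14, 0.38),
    ("Claude Sonnet 4.5 (Thinking)", 57.31, 0.98),
    ("GPT 5.1", 56.55, 0.35),
    ("Gemini 2.5 Flash", 52.18, 0.12),
    ("GPT 4o", 46.82, 0.09),
    ("Llama 4 Maverick", 43.57, 0.06),
    ("DeepSeek R1", 37.24, 0.04),
]

def pareto_frontier(entries):
    """A model is on the frontier if no other model is both at least as
    cheap and at least as accurate (and different on one of the two)."""
    frontier = []
    for name, acc, cost in entries:
        dominated = any(
            o_cost <= cost and o_acc >= acc and (o_cost, o_acc) != (cost, acc)
            for _, o_acc, o_cost in entries
        )
        if not dominated:
            frontier.append(name)
    return frontier

print(pareto_frontier(models))
```

On these numbers, Claude Sonnet 4.6 and GPT 5.2 land on the frontier, while higher-cost models at lower accuracy (e.g. the Thinking variants) are dominated; for a deployment decision you would weight the two axes by your own query volume and error cost.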

Dataset & Methodology

Question Distribution

Provenance

Stanford researchers, GSIB practitioners, and financial industry experts wrote the benchmark. Questions come from real financial analysis scenarios and public financial data.

Tools Provided to Agents

Each agent got four tools during evaluation:

Calculator — Arithmetic and financial computations
Retriever — Look up financial data from provided documents
Code Interpreter — Execute Python for analysis and modeling
Plotter — Generate charts and data visualizations
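To make the agentic setup concrete, here is a hypothetical sketch of how two of the four tools might be exposed to an agent as callable functions. The names, signatures, and lookup logic are illustrative assumptions, not the benchmark's actual harness (real retrievers use embeddings rather than keyword matching):

```python
# Hypothetical tool implementations; illustrative only, not Vals.ai's harness.

def calculator(expression: str) -> float:
    """Arithmetic and financial computations on a restricted expression."""
    allowed = set("0123456789.+-*/() ")
    if not set(expression) <= allowed:
        raise ValueError("unsupported characters in expression")
    # eval is tolerable here only because input is restricted to arithmetic
    return eval(expression)

def retriever(query: str, documents: dict) -> list:
    """Naive keyword lookup over the provided documents (stand-in for RAG)."""
    terms = query.lower().split()
    return [doc_id for doc_id, text in documents.items()
            if any(t in text.lower() for t in terms)]

TOOLS = {"calculator": calculator, "retriever": retriever}

docs = {
    "tsla_10k": "Tesla total revenue for FY 2024 ...",
    "aapl_10k": "Apple risk factors disclosed in the latest 10-K ...",
}
print(TOOLS["retriever"]("Tesla revenue", docs))
print(TOOLS["calculator"]("2500000000 / 2600000"))
```

The agent loop would dispatch model-emitted tool calls through a registry like `TOOLS`; the code interpreter and plotter would slot in the same way.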

Question Taxonomy

Nine categories of financial analysis questions, sorted by difficulty. Each tests a different part of financial reasoning, from basic lookups to complex multi-step analysis.

Easy

Quantitative Retrieval

Pull specific numbers from SEC filings and financial statements.

"What was Tesla's total revenue in FY 2024?"

Easy

Qualitative Retrieval

Find and summarize qualitative information from financial documents and disclosures.

"What risk factors did Apple disclose in its latest 10-K?"

Easy

Numerical Reasoning

Run calculations on retrieved data: ratios, growth rates, basic math.

"Calculate Apple's current ratio from its Q3 2024 balance sheet."
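This category reduces to simple formula application once the inputs are retrieved. A sketch of the current-ratio calculation, using placeholder figures rather than Apple's actual Q3 2024 balance sheet:

```python
# Current ratio = current assets / current liabilities.
# Figures are hypothetical placeholders, not Apple's actual Q3 2024 numbers.
current_assets = 125_435       # $ millions (hypothetical)
current_liabilities = 131_624  # $ millions (hypothetical)

current_ratio = current_assets / current_liabilities
print(f"Current ratio: {current_ratio:.2f}")
```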

Medium

Complex Retrieval

Pull information across multiple filings, exhibits, or time periods to answer a question.

"Compare Microsoft's segment revenue mix in 2023 vs. 2024."

Medium

Adjustments

Apply accounting adjustments: normalize earnings, restate figures, reconcile GAAP vs. non-GAAP.

"Calculate Meta's adjusted EBITDA excluding stock-based compensation."

Medium

Beat or Miss

Determine whether a company beat or missed analyst consensus estimates on key metrics.

"Did Netflix beat or miss Q4 2024 EPS consensus?"

Hard

Trends

Find and analyze multi-period trends, inflection points, and trajectory changes.

"Analyze the 3-year margin trend for Amazon's AWS segment."

Hard

Financial Modeling

Build or validate financial models: DCFs, LBO models, projection scenarios.

"Estimate the intrinsic value of MSFT using a DCF with 10% WACC."
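The mechanics behind a question like this can be sketched as a standard DCF: discount projected free cash flows at the WACC and add a Gordon-growth terminal value. The cash-flow inputs below are hypothetical, not an actual MSFT forecast:

```python
# Minimal DCF sketch: PV of projected FCFs plus a discounted terminal value.
# All inputs are illustrative assumptions, not a real forecast.

def dcf_value(fcfs, wacc, terminal_growth):
    """Enterprise value = sum of discounted FCFs + discounted terminal value."""
    pv_fcfs = sum(fcf / (1 + wacc) ** t for t, fcf in enumerate(fcfs, start=1))
    terminal = fcfs[-1] * (1 + terminal_growth) / (wacc - terminal_growth)
    pv_terminal = terminal / (1 + wacc) ** len(fcfs)
    return pv_fcfs + pv_terminal

projected_fcfs = [70, 77, 85, 93, 102]  # $ billions, hypothetical 5-year path
value = dcf_value(projected_fcfs, wacc=0.10, terminal_growth=0.03)
print(f"Enterprise value: ${value:.0f}B")
```

Benchmark questions in this category test whether the model gets each of these steps (discounting, terminal value, the WACC minus growth denominator) right, not just the final number.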

Hard

Market Analysis

Judge market dynamics, competitive positioning, and macro impacts on financial performance.

"Assess the impact of rising rates on regional bank net interest margins."

Top 3 Models by Category

Tool Usage Analysis

How well models used each tool tells you where they're strong and where they struggle.

Key finding: Models were consistently strong at quantitative work (calculator and code interpreter) but weak at document retrieval. Retrieval is the tool that matters most in real-world financial analysis because you have to find the data before you can analyze it. The lesson for enterprise deployments: your RAG pipeline quality may matter more than raw model intelligence.

Real-World Example

One concrete example shows how models split on the same question. This Netflix Q4 2024 stock repurchase question illustrates the practical difference between a correct and an incorrect answer.


Question

"Based on Netflix's Q4 2024 10-Q filing, how many shares of common stock did the company repurchase during the quarter, and what was the total cost?"

Correct Answers
Claude Sonnet 4.6

Netflix repurchased 2.6 million shares for approximately $2.5 billion in Q4 2024, at an average price of ~$935 per share.

Gemini 3.1 Pro

2.6M shares repurchased at a total cost of $2.5B. The program was authorized under the board's $15B repurchase plan.

GPT 5.2

Per the filing, Netflix bought back 2.6 million shares in Q4 2024, spending $2.5 billion under its existing authorization.

Incorrect Answers
GPT 4o

Netflix repurchased 1.8 million shares for $1.6 billion in Q4 2024.

Incorrect share count and total cost — likely hallucinated from older filing data.

This shows a common failure mode. Models with weak retrieval make up plausible but wrong numbers, pulling from training data instead of the provided documents. In financial due diligence, that's the difference between a sound thesis and a flawed one.

What This Means for Financial Services

AI-Powered Due Diligence Isn't Hypothetical Anymore

Top models now clear 60% accuracy on a benchmark built by financial professionals. AI agents have crossed from experimental to deployable for specific tasks. For PE firms running hundreds of screenings a year, the productivity gain from even semi-automated financial analysis is huge. But only if you pick the right model.

Model Selection Is a Strategic Decision

No single model wins across all categories. Claude Sonnet 4.6 leads overall with strong cost efficiency. Opus 4.6 (Thinking) is best at the hardest tasks. Gemini 3.1 Pro and GPT 5.2 are strong alternatives. For firms running AI across many workflows, a multi-model strategy (routing different question types to different models) often works best.
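A multi-model strategy like the one described here can be as simple as a routing table keyed on question category. The assignments below are illustrative assumptions; in practice you would populate the table from your own per-category evaluation results:

```python
# Hypothetical multi-model router: map question categories to models.
# Model assignments are illustrative, not prescriptive.
ROUTES = {
    "financial_modeling": "claude-opus-4.6-thinking",  # hardest tasks
    "quantitative_retrieval": "gpt-5.2",               # cheap, accurate lookups
    "numerical_reasoning": "claude-sonnet-4.6",
}
DEFAULT_MODEL = "claude-sonnet-4.6"  # overall leader as the fallback

def route(question_category: str) -> str:
    """Pick a model for a question based on its taxonomy category."""
    return ROUTES.get(question_category, DEFAULT_MODEL)

print(route("financial_modeling"))
print(route("market_analysis"))  # unmapped category falls back to the default
```

The design choice worth noting: routing on a coarse, auditable key (category) rather than per-question model selection keeps costs predictable and makes the system easy to re-benchmark when new models ship.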

Cost Isn't a Proxy for Quality

Claude Sonnet 4.6 has flipped the cost-accuracy math. The highest accuracy at just $0.45/query. GPT 5.2 at $0.32/query gets you 94% of that for less. Claude Opus 4.6 (Thinking) at $1.82/query is the premium pick for the hardest work. For large deployments running thousands of queries, understanding the Pareto frontier matters.

The Retrieval Gap Is the Real Bottleneck

The most important finding: models struggle most with retrieval. That's the foundation of any financial analysis. Invest as much in your data pipelines and RAG as you do in picking models. A mediocre model with great retrieval will beat a top model with poor data access.

Attribution & Methodology

Data Source

All benchmark data comes from the Vals.ai Finance Agent Benchmark (v1.1). Vals.ai is an independent AI evaluation platform that runs standardized benchmarks across professional domains. Data last updated March 2026.

Contributors

The questions were built by researchers at Stanford University, practitioners at global systemically important banks, and financial industry experts. The framework was designed to test capabilities that mirror real financial analysis work.

Methodology Notes

Models were evaluated in an agentic setting with four tools (calculator, retriever, code interpreter, plotter). Accuracy is exact-match scoring on the full 537-question test set. Cost per query is the average API cost at standard pricing as of February 2026. Analysis by WorkWise Solutions.
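Exact-match scoring, as described above, can be sketched in a few lines. The normalization rules here (lowercasing, whitespace collapsing) are our assumption, not Vals.ai's actual grader:

```python
# Sketch of exact-match accuracy; normalization is an illustrative assumption.

def exact_match_accuracy(predictions, references):
    """Fraction of predictions matching the reference after trivial normalization."""
    def norm(s):
        return " ".join(s.lower().split())
    matches = sum(norm(p) == norm(r) for p, r in zip(predictions, references))
    return matches / len(references)

preds = ["2.6 million shares for $2.5 billion",
         "1.8 million shares for $1.6 billion"]
refs  = ["2.6 million shares for $2.5 billion",
         "2.6 million shares for $2.5 billion"]
print(exact_match_accuracy(preds, refs))  # 0.5
```

Exact match is a strict grader: an answer that is substantively right but formatted differently scores zero, which is worth keeping in mind when comparing absolute accuracy numbers across benchmarks.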

Deploy AI-Powered Financial Analysis

Picking the right AI model for your financial workflows takes more than benchmarks. It takes understanding your specific use cases, data infrastructure, and risk tolerance. We'll help you build a deployment strategy grounded in data.

Schedule Consultation