Benchmark Analysis

AI Model Evals

Which AI models actually excel at financial analysis? We analyzed the leading foundation models across 537 real-world finance questions to find out.

WorkWise Solutions | March 2026 | Data: Vals.ai Finance Agent Benchmark (v1.1)

63.31%
Top Accuracy

Claude Sonnet 4.6 leads the benchmark with the highest overall accuracy score.

537
Questions

Curated by Stanford researchers, GSIB practitioners, and financial industry experts.

9
Categories

From equity analysis to options pricing, spanning easy, medium, and hard difficulty levels.

For PE and alternative investment firms evaluating AI-powered due diligence or portfolio analysis, these numbers matter. The gap between the best- and worst-performing models is over 26 percentage points — meaning the wrong model choice could leave critical financial insights on the table. This benchmark provides the empirical foundation for smarter model selection in high-stakes financial environments.

Model Leaderboard

Overall accuracy rankings across the full 537-question benchmark. Results reflect each model's ability to reason through complex financial analysis tasks.

| Rank | Model | Vendor | Accuracy | Cost / Query |
|------|-------|--------|----------|--------------|
| 1 | Claude Sonnet 4.6 | Anthropic | 63.31% | $0.45 |
| 2 | Claude Opus 4.6 (Thinking) | Anthropic | 62.01% | $1.82 |
| 3 | Gemini 3.1 Pro | Google | 60.83% | $0.48 |
| 4 | GPT 5.2 | OpenAI | 59.76% | $0.32 |
| 5 | GPT 5.4 | OpenAI | 59.14% | $0.38 |
| 6 | Claude Sonnet 4.5 (Thinking) | Anthropic | 57.31% | $0.98 |
| 7 | GPT 5.1 | OpenAI | 56.55% | $0.35 |
| 8 | Gemini 2.5 Flash | Google | 52.18% | $0.12 |
| 9 | GPT 4o | OpenAI | 46.82% | $0.09 |
| 10 | Llama 4 Maverick | Meta | 43.57% | $0.06 |
| 11 | DeepSeek R1 | DeepSeek | 37.24% | $0.04 |

Improvement trend: The top-performing models have improved accuracy by over 15 percentage points in the past six months, suggesting rapid capability gains in financial reasoning. For enterprise deployments, this means today's selection is a moving target — continuous evaluation is essential.

Cost vs. Accuracy

The Pareto frontier reveals which models deliver the best accuracy per dollar — critical intelligence for enterprise-scale deployment decisions.

Enterprise deployment implication: Claude Sonnet 4.6 now leads in both accuracy and cost efficiency, sitting firmly on the Pareto frontier at $0.45/query. GPT 5.2 and Gemini 3.1 Pro offer strong accuracy at competitive cost. Claude Opus 4.6 (Thinking) commands a premium at $1.82/query but delivers near-top accuracy — justified for high-stakes financial analysis where errors are costly. For high-volume screening tasks, Gemini 2.5 Flash offers an attractive cost-accuracy ratio.
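The frontier described above can be recomputed directly from the leaderboard figures. A minimal sketch in Python: a model is kept as frontier-optimal when no other model is strictly better on both axes (higher accuracy and lower cost).

```python
# Leaderboard figures from the table above: (model, accuracy %, cost $/query).
LEADERBOARD = [
    ("Claude Sonnet 4.6", 63.31, 0.45),
    ("Claude Opus 4.6 (Thinking)", 62.01, 1.82),
    ("Gemini 3.1 Pro", 60.83, 0.48),
    ("GPT 5.2", 59.76, 0.32),
    ("GPT 5.4", 59.14, 0.38),
    ("Claude Sonnet 4.5 (Thinking)", 57.31, 0.98),
    ("GPT 5.1", 56.55, 0.35),
    ("Gemini 2.5 Flash", 52.18, 0.12),
    ("GPT 4o", 46.82, 0.09),
    ("Llama 4 Maverick", 43.57, 0.06),
    ("DeepSeek R1", 37.24, 0.04),
]


def pareto_frontier(rows):
    """Keep each model that no other model beats on both accuracy and cost."""
    return [
        (name, acc, cost)
        for name, acc, cost in rows
        if not any(a > acc and c < cost for _, a, c in rows)
    ]
```

Run against this leaderboard, the sketch keeps Claude Sonnet 4.6, GPT 5.2, Gemini 2.5 Flash, GPT 4o, Llama 4 Maverick, and DeepSeek R1; every other model is dominated by a cheaper, more accurate alternative.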

Dataset & Methodology

Question Distribution

Provenance

The benchmark was curated by researchers from Stanford, practitioners at global systemically important banks (GSIBs), and financial industry experts. Questions are drawn from real-world financial analysis scenarios and publicly available financial data.

Tools Provided to Agents

Each AI agent was given access to four tools during evaluation:

Calculator — Arithmetic and financial computations
Retriever — Look up financial data from provided documents
Code Interpreter — Execute Python for analysis and modeling
Plotter — Generate charts and data visualizations
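To make the agentic setup concrete, here is a hypothetical sketch of how two of these tools might be exposed to an agent as a name-to-callable registry. The function names mirror the list above, but the signatures, the AST-based safe evaluator, and the keyword-match retriever are illustrative assumptions, not the benchmark's actual API.

```python
import ast
import operator as op

# Supported arithmetic operators for the calculator sketch.
_OPS = {ast.Add: op.add, ast.Sub: op.sub, ast.Mult: op.mul,
        ast.Div: op.truediv, ast.Pow: op.pow, ast.USub: op.neg}


def calculator(expression: str) -> float:
    """Safely evaluate an arithmetic expression such as '2500 / 2.6'."""
    def _eval(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp):
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp):
            return _OPS[type(node.op)](_eval(node.operand))
        raise ValueError("unsupported expression")
    return _eval(ast.parse(expression, mode="eval").body)


def retriever(query: str, documents: dict) -> list:
    """Naive keyword lookup over provided documents (a stand-in for real RAG)."""
    terms = query.lower().split()
    return [name for name, text in documents.items()
            if any(t in text.lower() for t in terms)]


# Registry the agent loop would dispatch tool calls against.
TOOLS = {"calculator": calculator, "retriever": retriever}
```

A real harness would add the code interpreter and plotter the same way; the point is that tool quality (especially the retriever) is part of the system under test, not just the model.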

Question Taxonomy

Nine categories of financial analysis questions, organized by difficulty. Each tests different aspects of financial reasoning, from basic retrieval to complex multi-step analysis.

Easy

Quantitative Retrieval

Extract specific numerical data points from SEC filings and financial statements.

"What was Tesla's total revenue in FY 2024?"

Easy

Qualitative Retrieval

Locate and summarize qualitative information from financial documents and disclosures.

"What risk factors did Apple disclose in its latest 10-K?"

Easy

Numerical Reasoning

Perform calculations using retrieved financial data — ratios, growth rates, and basic math.

"Calculate Apple's current ratio from its Q3 2024 balance sheet."

Medium

Complex Retrieval

Synthesize information across multiple filings, exhibits, or time periods to answer a question.

"Compare Microsoft's segment revenue mix in 2023 vs. 2024."

Medium

Adjustments

Apply accounting adjustments — normalizing earnings, restating figures, and reconciling GAAP vs. non-GAAP.

"Calculate Meta's adjusted EBITDA excluding stock-based compensation."

Medium

Beat or Miss

Determine whether a company beat or missed analyst consensus estimates on key metrics.

"Did Netflix beat or miss Q4 2024 EPS consensus?"

Hard

Trends

Identify and analyze multi-period financial trends, inflection points, and trajectory changes.

"Analyze the 3-year margin trend for Amazon's AWS segment."

Hard

Financial Modeling

Build or validate financial models — DCFs, LBO models, and projection scenarios.

"Estimate the intrinsic value of MSFT using a DCF with 10% WACC."

Hard

Market Analysis

Evaluate market dynamics, competitive positioning, and macro impacts on financial performance.

"Assess the impact of rising rates on regional bank net interest margins."

Top 3 Models by Category

Tool Usage Analysis

How effectively each model leveraged the four tools reveals its strengths and weaknesses across different aspects of financial analysis.

Key finding: Models consistently excelled at quantitative computation (calculator and code interpreter) but struggled with document retrieval — the tool most critical for real-world financial analysis where data must be located before it can be analyzed. This suggests that for enterprise deployments, retrieval-augmented generation (RAG) pipeline quality may matter more than raw model intelligence.

Real-World Example

A concrete example shows how models diverge on the same financial question. This Netflix Q4 2024 stock repurchase question illustrates the practical difference between a correct and incorrect answer.

Question

"Based on Netflix's Q4 2024 10-Q filing, how many shares of common stock did the company repurchase during the quarter, and what was the total cost?"

Correct Answers
Claude Sonnet 4.6

Netflix repurchased 2.6 million shares for approximately $2.5 billion in Q4 2024, at an average price of ~$935 per share.

Gemini 3.1 Pro

2.6M shares repurchased at a total cost of $2.5B. The program was authorized under the board's $15B repurchase plan.

GPT 5.2

Per the filing, Netflix bought back 2.6 million shares in Q4 2024, spending $2.5 billion under its existing authorization.

Incorrect Answers
GPT 4o

Netflix repurchased 1.8 million shares for $1.6 billion in Q4 2024.

Incorrect share count and total cost — likely hallucinated from older filing data.

This example illustrates a common failure mode: models that lack robust retrieval capabilities may confabulate plausible but incorrect financial figures, pulling from training data rather than the provided documents. In financial due diligence, this distinction is the difference between a sound investment thesis and a flawed one.

What This Means for Financial Services

AI-Powered Due Diligence Is No Longer Hypothetical

With top models now exceeding 60% accuracy on a benchmark designed by financial professionals, AI agents are crossing the threshold from experimental to deployable for specific financial analysis tasks. For PE firms running hundreds of deal screenings annually, the productivity gain from even semi-automated financial analysis is substantial — but only if the right model is selected for the task.

Model Selection Is a Strategic Decision

The benchmark reveals that no single model dominates across all categories. Claude Sonnet 4.6 leads overall with remarkable cost efficiency, while Opus 4.6 (Thinking) excels at the hardest tasks. Gemini 3.1 Pro and GPT 5.2 offer strong alternatives. For organizations deploying AI across multiple financial analysis workflows, a multi-model strategy — routing different question types to specialized models — may yield the best results.
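One way to implement the multi-model strategy described above is a simple category-to-model router. A minimal sketch; the model identifiers and routing assignments below are illustrative assumptions (informed by the overall results, with the hardest categories sent to the premium model), and a production table should come from your own per-category evaluation:

```python
# Illustrative routing table: question category -> model identifier.
# Assignments are assumptions for the sketch, not benchmark-derived facts.
ROUTING_TABLE = {
    "quantitative_retrieval": "claude-sonnet-4.6",
    "numerical_reasoning": "gpt-5.2",
    "financial_modeling": "claude-opus-4.6-thinking",
    "market_analysis": "claude-opus-4.6-thinking",
}

# Fall back to the overall leaderboard leader for unmapped categories.
DEFAULT_MODEL = "claude-sonnet-4.6"


def route(category: str) -> str:
    """Pick the model for a question category, falling back to the default."""
    return ROUTING_TABLE.get(category, DEFAULT_MODEL)
```

The design choice here is deliberate simplicity: a static lookup is auditable and cheap, which matters in regulated financial workflows; a learned router can replace it later without changing the calling code.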

Cost Is Not a Proxy for Quality

Claude Sonnet 4.6 has upended the cost-accuracy calculus — delivering the highest accuracy at just $0.45/query. Meanwhile, GPT 5.2 at $0.32/query captures over 94% of that accuracy at an even lower cost. Claude Opus 4.6 (Thinking) at $1.82/query remains a premium option for the most demanding tasks. For large-scale deployments processing thousands of queries, understanding the Pareto frontier is essential for budget-conscious enterprises.
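The tradeoff above reduces to simple arithmetic on the leaderboard figures. A quick sketch, with the query volume as an illustrative assumption:

```python
# Leaderboard figures from the table above.
sonnet_acc, sonnet_cost = 63.31, 0.45   # Claude Sonnet 4.6
gpt52_acc, gpt52_cost = 59.76, 0.32     # GPT 5.2
opus_cost = 1.82                         # Claude Opus 4.6 (Thinking)

# GPT 5.2 captures "over 94%" of Sonnet 4.6's accuracy.
relative_accuracy = gpt52_acc / sonnet_acc

# Illustrative monthly volume for an enterprise screening workload.
monthly_queries = 10_000
sonnet_monthly = monthly_queries * sonnet_cost   # ~$4,500
gpt52_monthly = monthly_queries * gpt52_cost     # ~$3,200
opus_monthly = monthly_queries * opus_cost       # ~$18,200
```

At this volume the premium model costs roughly four times the frontier options, which is why reserving it for the highest-stakes queries (rather than using it everywhere) is the budget-rational choice.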

The Retrieval Gap Is the Real Bottleneck

Perhaps the most important finding is that models struggle most with information retrieval — the foundational step of any financial analysis. This suggests that enterprises should invest as heavily in their data pipelines and RAG infrastructure as they do in model selection. A mediocre model with excellent retrieval may outperform a top model with poor data access.

Attribution & Methodology

Data Source

All benchmark data is sourced from the Vals.ai Finance Agent Benchmark (v1.1). Vals.ai is an independent AI evaluation platform that provides standardized benchmarks for assessing AI model performance across professional domains. Data last updated March 2026.

Contributors

The benchmark questions were developed by researchers at Stanford University, practitioners at global systemically important banks, and financial industry subject matter experts. The evaluation framework was designed to test capabilities that directly mirror real-world financial analysis workflows.

Methodology Notes

Models were evaluated in an agentic setting with access to four tools (calculator, retriever, code interpreter, plotter). Accuracy reflects exact-match scoring on the full 537-question test set. Cost per query represents the average API cost at standard pricing as of February 2026. Analysis and editorial commentary by WorkWise Solutions.
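Exact-match scoring as described above can be sketched in a few lines. The normalization rules here (lowercasing, stripping currency symbols and thousands separators) are assumptions for illustration, not Vals.ai's actual grader:

```python
def normalize(answer: str) -> str:
    """Lowercase and strip whitespace, commas, and dollar signs."""
    return answer.strip().lower().replace(",", "").replace("$", "")


def exact_match_accuracy(predictions: list, references: list) -> float:
    """Fraction of predictions whose normalized answer equals the reference."""
    assert len(predictions) == len(references), "one reference per prediction"
    correct = sum(normalize(p) == normalize(r)
                  for p, r in zip(predictions, references))
    return correct / len(references)
```

For example, a prediction of "$2,500,000" matches a reference of "2500000" after normalization, while "1.8M" against "2.6M" does not; over those two questions the score is 0.5.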

Deploy AI-Powered Financial Analysis

Selecting the right AI model for your financial workflows requires more than benchmarks — it requires understanding your specific use cases, data infrastructure, and risk tolerance. Let us help you build a deployment strategy grounded in data.

Schedule Consultation