AI Model Evals
Which AI models are actually good at financial analysis? We tested the top foundation models on 537 real finance questions to find out.
WorkWise Solutions | March 2026 | Data: Vals.ai Finance Agent Benchmark (v1.1)
Claude Sonnet 4.6 leads the benchmark with the highest overall accuracy score.
Curated by Stanford researchers, GSIB practitioners, and financial industry experts, the questions span equity analysis to options pricing across easy, medium, and hard difficulty levels.
For PE and alternative investment firms picking AI for due diligence or portfolio analysis, these numbers matter. The gap between the best and worst models is over 26 percentage points. Pick the wrong one and you leave critical insights on the table. This benchmark gives you the data to choose well.
Model Leaderboard
Overall accuracy across the full 537-question benchmark. Results show each model's ability to reason through complex financial analysis.
| Rank | Model | Vendor | Accuracy | Cost / Query |
|---|---|---|---|---|
| 1 | Claude Sonnet 4.6 | Anthropic | 63.31% | $0.45 |
| 2 | Claude Opus 4.6 (Thinking) | Anthropic | 62.01% | $1.82 |
| 3 | Gemini 3.1 Pro | Google | 60.83% | $0.48 |
| 4 | GPT 5.2 | OpenAI | 59.76% | $0.32 |
| 5 | GPT 5.4 | OpenAI | 59.14% | $0.38 |
| 6 | Claude Sonnet 4.5 (Thinking) | Anthropic | 57.31% | $0.98 |
| 7 | GPT 5.1 | OpenAI | 56.55% | $0.35 |
| 8 | Gemini 2.5 Flash | Google | 52.18% | $0.12 |
| 9 | GPT 4o | OpenAI | 46.82% | $0.09 |
| 10 | Llama 4 Maverick | Meta | 43.57% | $0.06 |
| 11 | DeepSeek R1 | DeepSeek | 37.24% | $0.04 |
Improvement trend: Top models have gained over 15 percentage points in accuracy in the last six months. Financial reasoning is improving fast. For enterprise deployments, today's pick is a moving target. Keep evaluating.
Cost vs. Accuracy
The Pareto frontier shows which models give you the best accuracy per dollar. Critical for enterprise-scale deployments.
What it means: Claude Sonnet 4.6 now leads on both accuracy and cost, sitting on the Pareto frontier at $0.45/query. GPT 5.2 and Gemini 3.1 Pro are strong runners-up at a competitive cost. Claude Opus 4.6 (Thinking) is a premium pick at $1.82/query, but delivers near-top accuracy. Worth it for high-stakes analysis where errors are expensive. For high-volume screening, Gemini 2.5 Flash gives you a strong cost-accuracy ratio.
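The frontier described above can be reproduced directly from the leaderboard table. A minimal sketch (accuracy and cost figures are the ones reported in the table; a model is Pareto-optimal if no other model is both cheaper and at least as accurate):

```python
# Leaderboard data from the table above: (model, accuracy %, cost per query $)
models = [
    ("Claude Sonnet 4.6", 63.31, 0.45),
    ("Claude Opus 4.6 (Thinking)", 62.01, 1.82),
    ("Gemini 3.1 Pro", 60.83, 0.48),
    ("GPT 5.2", 59.76, 0.32),
    ("GPT 5.4", 59.14, 0.38),
    ("Claude Sonnet 4.5 (Thinking)", 57.31, 0.98),
    ("GPT 5.1", 56.55, 0.35),
    ("Gemini 2.5 Flash", 52.18, 0.12),
    ("GPT 4o", 46.82, 0.09),
    ("Llama 4 Maverick", 43.57, 0.06),
    ("DeepSeek R1", 37.24, 0.04),
]

def pareto_frontier(models):
    """Keep models that no other model dominates on both accuracy and cost."""
    frontier = []
    for name, acc, cost in models:
        dominated = any(
            (a >= acc and c < cost) or (a > acc and c <= cost)
            for n, a, c in models if n != name
        )
        if not dominated:
            frontier.append(name)
    return frontier

print(pareto_frontier(models))
```

Running this confirms the takeaway: Claude Opus 4.6 (Thinking) falls off the frontier because Claude Sonnet 4.6 is both cheaper and more accurate, while GPT 5.2 and Gemini 2.5 Flash hold frontier spots at their respective price points.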
Dataset & Methodology
Question Distribution
Provenance
Stanford researchers, GSIB practitioners, and financial industry experts wrote the benchmark. Questions come from real financial analysis scenarios and public financial data.
Tools Provided to Agents
Each agent got four tools during evaluation: a calculator, a document retriever, a code interpreter, and a plotter.
Question Taxonomy
Nine categories of financial analysis questions, sorted by difficulty. Each tests a different part of financial reasoning, from basic lookups to complex multi-step analysis.
Quantitative Retrieval
Pull specific numbers from SEC filings and financial statements.
"What was Tesla's total revenue in FY 2024?"
Qualitative Retrieval
Find and summarize qualitative information from financial documents and disclosures.
"What risk factors did Apple disclose in its latest 10-K?"
Numerical Reasoning
Run calculations on retrieved data: ratios, growth rates, basic math.
"Calculate Apple's current ratio from its Q3 2024 balance sheet."
Complex Retrieval
Pull information across multiple filings, exhibits, or time periods to answer a question.
"Compare Microsoft's segment revenue mix in 2023 vs. 2024."
Adjustments
Apply accounting adjustments: normalize earnings, restate figures, reconcile GAAP vs. non-GAAP.
"Calculate Meta's adjusted EBITDA excluding stock-based compensation."
Beat or Miss
Determine whether a company beat or missed analyst consensus estimates on key metrics.
"Did Netflix beat or miss Q4 2024 EPS consensus?"
Trends
Find and analyze multi-period trends, inflection points, and trajectory changes.
"Analyze the 3-year margin trend for Amazon's AWS segment."
Financial Modeling
Build or validate financial models: DCFs, LBO models, projection scenarios.
"Estimate the intrinsic value of MSFT using a DCF with 10% WACC."
Market Analysis
Judge market dynamics, competitive positioning, and macro impacts on financial performance.
"Assess the impact of rising rates on regional bank net interest margins."
Top 3 Models by Category
Tool Usage Analysis
How well models used each tool tells you where they're strong and where they struggle.
Key finding: Models were consistently strong at quantitative work (calculator and code interpreter) but weak at document retrieval. Retrieval is the tool that matters most in real-world financial analysis because you have to find the data before you can analyze it. The lesson for enterprise deployments: your RAG pipeline quality may matter more than raw model intelligence.
Real-World Example
One concrete example shows how models split on the same question: a Netflix Q4 2024 stock repurchase query where the gap between a correct and an incorrect answer has practical consequences.
"Based on Netflix's Q4 2024 10-Q filing, how many shares of common stock did the company repurchase during the quarter, and what was the total cost?"
Correct responses:

- "Netflix repurchased 2.6 million shares for approximately $2.5 billion in Q4 2024, at an average price of ~$935 per share."
- "2.6M shares repurchased at a total cost of $2.5B. The program was authorized under the board's $15B repurchase plan."
- "Per the filing, Netflix bought back 2.6 million shares in Q4 2024, spending $2.5 billion under its existing authorization."

Incorrect response:

- "Netflix repurchased 1.8 million shares for $1.6 billion in Q4 2024." Incorrect share count and total cost, likely hallucinated from older filing data.
This shows a common failure mode. Models with weak retrieval make up plausible but wrong numbers, pulling from training data instead of the provided documents. In financial due diligence, that's the difference between a sound thesis and a flawed one.
What This Means for Financial Services
AI-Powered Due Diligence Isn't Hypothetical Anymore
Top models now clear 60% accuracy on a benchmark built by financial professionals. AI agents have crossed from experimental to deployable for specific tasks. For PE firms running hundreds of screenings a year, the productivity gain from even semi-automated financial analysis is huge. But only if you pick the right model.
Model Selection Is a Strategic Decision
No single model wins across all categories. Claude Sonnet 4.6 leads overall with strong cost efficiency. Opus 4.6 (Thinking) is best at the hardest tasks. Gemini 3.1 Pro and GPT 5.2 are strong alternatives. For firms running AI across many workflows, a multi-model strategy (routing different question types to different models) often works best.
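A multi-model routing layer can start as a simple lookup table. This sketch follows the category-level findings above (overall leader for the default, premium model for the hardest tasks, cheap model for high-volume screening); the model identifier strings are hypothetical placeholders, not real API names:

```python
# Hypothetical model identifiers -- substitute your provider's actual API names.
ROUTES = {
    "financial_modeling": "claude-opus-4.6-thinking",  # premium model for the hardest tasks
    "quantitative_retrieval": "gemini-2.5-flash",      # cheap model for high-volume screening
}
DEFAULT_MODEL = "claude-sonnet-4.6"                    # overall benchmark leader

def route(category: str) -> str:
    """Pick a model for a question category; fall back to the overall leader."""
    return ROUTES.get(category, DEFAULT_MODEL)

print(route("financial_modeling"))  # routes to the premium model
print(route("trends"))              # falls back to the default
```

In production, the routing key would come from a lightweight classifier over the incoming question rather than a hand-assigned category, but the cost logic is the same.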
Cost Isn't a Proxy for Quality
Claude Sonnet 4.6 has flipped the cost-accuracy math, delivering the highest accuracy at just $0.45/query. GPT 5.2 at $0.32/query gets you 94% of that for less. Claude Opus 4.6 (Thinking) at $1.82/query is the premium pick for the hardest work. For large deployments running thousands of queries, understanding the Pareto frontier matters.
The Retrieval Gap Is the Real Bottleneck
The most important finding: models struggle most with retrieval. That's the foundation of any financial analysis. Invest as much in your data pipelines and RAG as you do in picking models. A mediocre model with great retrieval will beat a top model with poor data access.
Attribution & Methodology
Data Source
All benchmark data comes from the Vals.ai Finance Agent Benchmark (v1.1). Vals.ai is an independent AI evaluation platform that runs standardized benchmarks across professional domains. Data last updated March 2026.
Contributors
The questions were built by researchers at Stanford University, practitioners at global systemically important banks, and financial industry experts. The framework was designed to test capabilities that mirror real financial analysis work.
Methodology Notes
Models were evaluated in an agentic setting with four tools (calculator, retriever, code interpreter, plotter). Accuracy is exact-match scoring on the full 537-question test set. Cost per query is the average API cost at standard pricing as of February 2026. Analysis by WorkWise Solutions.
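Exact-match scoring itself is simple to state. A sketch of the idea; the specific answer normalization Vals.ai applies (case, whitespace, number formatting) is an assumption here:

```python
def exact_match_accuracy(predictions, answers):
    """Fraction of predictions that exactly match the reference answer
    after a simple normalization (lowercase, trimmed whitespace)."""
    correct = sum(
        p.strip().lower() == a.strip().lower()
        for p, a in zip(predictions, answers)
    )
    return correct / len(answers)
```

Exact-match is strict by design: a numerically close but differently formatted answer scores zero, which is one reason agentic tool use (calculator, code interpreter) helps models hit the precise figure.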
Deploy AI-Powered Financial Analysis
Picking the right AI model for your financial workflows takes more than benchmarks. It takes understanding your specific use cases, data infrastructure, and risk tolerance. We'll help you build a deployment strategy grounded in data.
Solutions Powered by These Benchmarks
The models on this page are the engines behind our production tools. See how they perform at scale.
AI Deal Screener
Use these models for automated CIM analysis. Screen hundreds of deals with AI document intelligence.
Investor Reporting Engine
AI institutional reporting that turns raw financial data into investor-ready narratives and quarterly updates.
Public Markets Engine
Filing analysis, earnings intelligence, and competitive monitoring. The benchmark leaders running at production scale.