Close Menu
World Forbes – Business, Tech, AI & Global Insights
  • Home
  • AI
  • Billionaires
  • Business
  • Cybersecurity
  • Education
    • Innovation
  • Money
  • Small Business
  • Sports
  • Trump
What's Hot

Monzo launches Undo Payments tool to boost safety for digital bank transfers

May 15, 2025

Chrome 136 Update Patches Vulnerability With ‘Exploit in the Wild’

May 15, 2025

Harvey reportedly in discussions to raise $250M at $5B valuation

May 15, 2025
Facebook X (Twitter) Instagram
Trending
  • Monzo launches Undo Payments tool to boost safety for digital bank transfers
  • Chrome 136 Update Patches Vulnerability With ‘Exploit in the Wild’
  • Harvey reportedly in discussions to raise $250M at $5B valuation
  • South Africa’s Ramaphosa to meet Trump in US next week amid rising tensions | Politics News
  • FIFA targets $1 billion revenue from Women’s World Cup – Sport
  • Players stuck in middle with IPL, national teams on collision course – Sport
  • HRW warns of migrant worker deaths in 2034 World Cup hosts Saudi Arabia – World
  • Harvard thought it had a cheap copy of the Magna Carta. It turned out be extremely rare
World Forbes – Business, Tech, AI & Global InsightsWorld Forbes – Business, Tech, AI & Global Insights
Thursday, May 15
  • Home
  • AI
  • Billionaires
  • Business
  • Cybersecurity
  • Education
    • Innovation
  • Money
  • Small Business
  • Sports
  • Trump
World Forbes – Business, Tech, AI & Global Insights
Home » The rise of AI ‘reasoning’ models is making benchmarking more expensive
AI

The rise of AI ‘reasoning’ models is making benchmarking more expensive

adminBy adminApril 10, 2025No Comments4 Mins Read
Facebook Twitter Pinterest LinkedIn Tumblr WhatsApp Telegram Email
Share
Facebook Twitter LinkedIn Pinterest Email
Post Views: 18


AI labs like OpenAI claim that their so-called “reasoning” AI models, which can “think” through problems step by step, are more capable than their non-reasoning counterparts in specific domains, such as physics. But while this generally appears to be the case, reasoning models are also much more expensive to benchmark, making it difficult to independently verify these claims.

According to data from Artificial Analysis, a third-party AI testing outfit, it costs $2,767.05 to evaluate OpenAI’s o1 reasoning model across a suite of seven popular AI benchmarks: MMLU-Pro, GPQA Diamond, Humanity’s Last Exam, LiveCodeBench, SciCode, AIME 2024, and MATH-500.

Benchmarking Anthropic’s recent Claude 3.7 Sonnet, a “hybrid” reasoning model, on the same set of tests cost $1,485.35, while testing OpenAI’s o3-mini-high cost $344.59, per Artificial Analysis.

Some reasoning models are cheaper to benchmark than others. Artificial Analysis spent $141.22 evaluating OpenAI’s o1-mini, for example. But on average, they tend to be pricey. All told, Artificial Analysis has spent roughly $5,200 evaluating around a dozen reasoning models, close to twice the amount the firm spent analyzing over 80 non-reasoning models ($2,400).

OpenAI’s non-reasoning GPT-4o model, released in May 2024, cost Artificial Analysis just $108.85 to evaluate, while Claude 3.6 Sonnet — Claude 3.7 Sonnet’s non-reasoning predecessor — cost $81.41.

Artificial Analysis co-founder George Cameron told TechCrunch that the organization plans to increase its benchmarking spend as more AI labs develop reasoning models.

“At Artificial Analysis, we run hundreds of evaluations monthly and devote a significant budget to these,” Cameron said. “We are planning for this spend to increase as models are more frequently released.”

Artificial Analysis isn’t the only outfit of its kind that’s dealing with rising AI benchmarking costs.

Ross Taylor, the CEO of AI startup General Reasoning, said he recently spent $580 evaluating Claude 3.7 Sonnet on around 3,700 unique prompts. Taylor estimates a single run-through of MMLU Pro, a question set designed to benchmark a model’s language comprehension skills, would have cost more than $1,800.

“We’re moving to a world where a lab reports x% on a benchmark where they spend y amount of compute, but where resources for academics are recent post on X. “[N]o one is going to be able to reproduce the results.”

Why are reasoning models so expensive to test? Mainly because they generate a lot of tokens. Tokens represent bits of raw text, such as the word “fantastic” split into the syllables “fan,” “tas,” and “tic.” According to Artificial Analysis, OpenAI’s o1 generated over 44 million tokens during the firm’s benchmarking tests, around eight times the amount GPT-4o generated.

The vast majority of AI companies charge for model usage by the token, so you can see how this cost can add up.

Modern benchmarks also tend to elicit a lot of tokens from models because they contain questions involving complex, multi-step tasks, according to Jean-Stanislas Denain, a senior researcher at Epoch AI, which develops its own model benchmarks.

“[Today’s] benchmarks are more complex [even though] the number of questions per benchmark has overall decreased,” Denain told TechCrunch. “They often attempt to evaluate models’ ability to do real-world tasks, such as write and execute code, browse the internet, and use computers.”

Denain added that the most expensive models have gotten more expensive per token over time. For example, Anthropic’s Claude 3 Opus was the priciest model when it was released in May 2024, costing $75 per million output tokens. OpenAI’s GPT-4.5 and o1-pro, both of which launched earlier this year, cost $150 per million output tokens and $600 per million output tokens, respectively.

“[S]ince models have gotten better over time, it’s still true that the cost to reach a given level of performance has greatly decreased over time,” Denain said. “But if you want to evaluate the best largest models at any point in time, you’re still paying more.”

Many AI labs, including OpenAI, give benchmarking organizations free or subsidized access to their models for testing purposes. But this colors the results, some experts say — even if there’s no evidence of manipulation, the mere suggestion of an AI lab’s involvement threatens to harm the integrity of the evaluation scoring.

“From [a] scientific point of view, if you publish a result that no one can replicate with the same model, is it even science anymore?” wrote Taylor in a follow-up post on X. “(Was it ever science, lol)”.



Source link

Share. Facebook Twitter Pinterest LinkedIn Tumblr Email
admin
  • Website

Related Posts

Harvey reportedly in discussions to raise $250M at $5B valuation

May 15, 2025

Grok is unpromptedly telling X users about South African ‘white genocide’

May 14, 2025

OpenAI brings its GPT-4.1 models to ChatGPT

May 14, 2025

OpenAI pledges to publish AI safety test results more often

May 14, 2025

Stability AI releases an audio-generating model that can run on smartphones

May 14, 2025

DeepMind claims its newest AI tool is a whiz at math and science problems

May 14, 2025
Add A Comment
Leave A Reply Cancel Reply

Don't Miss
Billionaires

Here’s How Much Selena Gomez-Actress, Singer, Entrepreneur-Is Worth

May 13, 2025

Contrary to reports of her 10-figure status, Forbes estimates the Disney star turned business mogul’s…

Looking Back At Trump’s Years-Long Obsession With Oversized Airplanes

May 13, 2025

Selena Gomez’s Mental Health Startup Wondermind Lays Off Nearly Two-Thirds Of Its Employees

May 13, 2025

Billionaires And CEOs Are Seeking Personal Security At Record Rates

May 9, 2025
Our Picks

Monzo launches Undo Payments tool to boost safety for digital bank transfers

May 15, 2025

Chrome 136 Update Patches Vulnerability With ‘Exploit in the Wild’

May 15, 2025

Harvey reportedly in discussions to raise $250M at $5B valuation

May 15, 2025

South Africa’s Ramaphosa to meet Trump in US next week amid rising tensions | Politics News

May 15, 2025

Subscribe to Updates

Subscribe to our newsletter and never miss our latest news

Subscribe my Newsletter for New Posts & tips Let's stay updated!

About Us
About Us

Welcome to World-Forbes.com
At World-Forbes.com, we bring you the latest insights, trends, and analysis across various industries, empowering our readers with valuable knowledge. Our platform is dedicated to covering a wide range of topics, including sports, small business, business, technology, AI, cybersecurity, and lifestyle.

Our Picks

Harvey reportedly in discussions to raise $250M at $5B valuation

May 15, 2025

Grok is unpromptedly telling X users about South African ‘white genocide’

May 14, 2025

OpenAI brings its GPT-4.1 models to ChatGPT

May 14, 2025

Subscribe to Updates

Subscribe to our newsletter and never miss our latest news

Subscribe my Newsletter for New Posts & tips Let's stay updated!

Facebook X (Twitter) Instagram Pinterest
  • Home
  • About Us
  • Advertise With Us
  • Contact Us
  • DMCA Policy
  • Privacy Policy
  • Terms & Conditions
© 2025 world-forbes. Designed by world-forbes.

Type above and press Enter to search. Press Esc to cancel.