The rise of AI ‘reasoning’ models is making benchmarking more expensive

By admin | April 10, 2025 | 4 min read


AI labs like OpenAI claim that their so-called “reasoning” AI models, which can “think” through problems step by step, are more capable than their non-reasoning counterparts in specific domains, such as physics. But while this generally appears to be the case, reasoning models are also much more expensive to benchmark, making it difficult to independently verify these claims.

According to data from Artificial Analysis, a third-party AI testing outfit, it costs $2,767.05 to evaluate OpenAI’s o1 reasoning model across a suite of seven popular AI benchmarks: MMLU-Pro, GPQA Diamond, Humanity’s Last Exam, LiveCodeBench, SciCode, AIME 2024, and MATH-500.

Benchmarking Anthropic’s recent Claude 3.7 Sonnet, a “hybrid” reasoning model, on the same set of tests cost $1,485.35, while testing OpenAI’s o3-mini-high cost $344.59, per Artificial Analysis.

Some reasoning models are cheaper to benchmark than others. Artificial Analysis spent $141.22 evaluating OpenAI’s o1-mini, for example. But on average, they tend to be pricey. All told, Artificial Analysis has spent roughly $5,200 evaluating around a dozen reasoning models, more than twice the amount the firm spent analyzing over 80 non-reasoning models ($2,400).
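A quick back-of-the-envelope using those totals puts the gap in per-model terms (a rough sketch in Python; the model counts above are approximate, so the averages are only indicative):

# Average benchmarking cost per model, using Artificial Analysis' approximate totals above.
reasoning_total_usd, reasoning_model_count = 5_200, 12              # ~a dozen reasoning models
non_reasoning_total_usd, non_reasoning_model_count = 2_400, 80      # 80+ non-reasoning models

avg_reasoning = reasoning_total_usd / reasoning_model_count              # ~$433 per model
avg_non_reasoning = non_reasoning_total_usd / non_reasoning_model_count  # ~$30 per model

print(f"~${avg_reasoning:,.0f} vs ~${avg_non_reasoning:,.0f} per model "
      f"(~{avg_reasoning / avg_non_reasoning:.0f}x more per reasoning model)")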

OpenAI’s non-reasoning GPT-4o model, released in May 2024, cost Artificial Analysis just $108.85 to evaluate, while Claude 3.6 Sonnet — Claude 3.7 Sonnet’s non-reasoning predecessor — cost $81.41.

Artificial Analysis co-founder George Cameron told TechCrunch that the organization plans to increase its benchmarking spend as more AI labs develop reasoning models.

“At Artificial Analysis, we run hundreds of evaluations monthly and devote a significant budget to these,” Cameron said. “We are planning for this spend to increase as models are more frequently released.”

Artificial Analysis isn’t the only outfit of its kind that’s dealing with rising AI benchmarking costs.

Ross Taylor, the CEO of AI startup General Reasoning, said he recently spent $580 evaluating Claude 3.7 Sonnet on around 3,700 unique prompts. Taylor estimates a single run-through of MMLU Pro, a question set designed to benchmark a model’s language comprehension skills, would have cost more than $1,800.

“We’re moving to a world where a lab reports x% on a benchmark where they spend y amount of compute, but where the resources for academics are [far less than] y,” Taylor wrote in a recent post on X. “[N]o one is going to be able to reproduce the results.”

Why are reasoning models so expensive to test? Mainly because they generate a lot of tokens. Tokens represent bits of raw text, such as the word “fantastic” split into the syllables “fan,” “tas,” and “tic.” According to Artificial Analysis, OpenAI’s o1 generated over 44 million tokens during the firm’s benchmarking tests, around eight times the amount GPT-4o generated.

The vast majority of AI companies charge for model usage by the token, so you can see how this cost can add up.
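As a minimal sketch of how per-token billing turns into a benchmark bill, the snippet below prices a run from its output-token count alone; the 44-million-token figure is the one cited above for o1, while the per-million-token rates are illustrative assumptions, and real invoices also include input tokens:

def eval_cost_usd(output_tokens: int, price_per_million: float) -> float:
    """Approximate cost of a benchmark run, counting output tokens only."""
    return output_tokens / 1_000_000 * price_per_million

o1_output_tokens = 44_000_000                 # per Artificial Analysis' benchmarking runs
gpt4o_output_tokens = o1_output_tokens // 8   # GPT-4o generated roughly an eighth as many

# The per-million rates below are illustrative assumptions, not quoted prices.
print(f"o1:     ${eval_cost_usd(o1_output_tokens, 60.0):,.0f}")      # ~$2,640
print(f"GPT-4o: ${eval_cost_usd(gpt4o_output_tokens, 10.0):,.0f}")   # ~$55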

Modern benchmarks also tend to elicit a lot of tokens from models because they contain questions involving complex, multi-step tasks, according to Jean-Stanislas Denain, a senior researcher at Epoch AI, which develops its own model benchmarks.

“[Today’s] benchmarks are more complex [even though] the number of questions per benchmark has overall decreased,” Denain told TechCrunch. “They often attempt to evaluate models’ ability to do real-world tasks, such as write and execute code, browse the internet, and use computers.”

Denain added that the most expensive models have gotten more expensive per token over time. For example, Anthropic’s Claude 3 Opus was the priciest model when it was released in March 2024, costing $75 per million output tokens. OpenAI’s GPT-4.5 and o1-pro, both of which launched earlier this year, cost $150 per million output tokens and $600 per million output tokens, respectively.

“[S]ince models have gotten better over time, it’s still true that the cost to reach a given level of performance has greatly decreased over time,” Denain said. “But if you want to evaluate the best largest models at any point in time, you’re still paying more.”
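To make that trend concrete, the sketch below prices one hypothetical benchmark run (the roughly 44 million output tokens o1 produced in Artificial Analysis’ tests) at the three output-token rates quoted above; it is a simplification that ignores input tokens and the fact that each model would generate a different amount of output:

# Price the same hypothetical 44M-output-token benchmark run at the quoted output rates.
rates_usd_per_million = {      # USD per million output tokens, as cited above
    "Claude 3 Opus": 75,
    "GPT-4.5": 150,
    "o1-pro": 600,
}
output_tokens = 44_000_000

for model, rate in rates_usd_per_million.items():
    print(f"{model}: ${output_tokens / 1_000_000 * rate:,.0f}")
# Claude 3 Opus: $3,300   GPT-4.5: $6,600   o1-pro: $26,400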

Many AI labs, including OpenAI, give benchmarking organizations free or subsidized access to their models for testing purposes. But this colors the results, some experts say — even if there’s no evidence of manipulation, the mere suggestion of an AI lab’s involvement threatens to harm the integrity of the evaluation scoring.

“From [a] scientific point of view, if you publish a result that no one can replicate with the same model, is it even science anymore?” wrote Taylor in a follow-up post on X. “(Was it ever science, lol)”.



