The rise of AI ‘reasoning’ models is making benchmarking more expensive

By admin | April 10, 2025 | 4 min read


AI labs like OpenAI claim that their so-called “reasoning” AI models, which can “think” through problems step by step, are more capable than their non-reasoning counterparts in specific domains, such as physics. But while this generally appears to be the case, reasoning models are also much more expensive to benchmark, making it difficult to independently verify these claims.

According to data from Artificial Analysis, a third-party AI testing outfit, it costs $2,767.05 to evaluate OpenAI’s o1 reasoning model across a suite of seven popular AI benchmarks: MMLU-Pro, GPQA Diamond, Humanity’s Last Exam, LiveCodeBench, SciCode, AIME 2024, and MATH-500.

Benchmarking Anthropic’s recent Claude 3.7 Sonnet, a “hybrid” reasoning model, on the same set of tests cost $1,485.35, while testing OpenAI’s o3-mini-high cost $344.59, per Artificial Analysis.

Some reasoning models are cheaper to benchmark than others. Artificial Analysis spent $141.22 evaluating OpenAI’s o1-mini, for example. But on average, they tend to be pricey. All told, Artificial Analysis has spent roughly $5,200 evaluating around a dozen reasoning models, more than twice the amount the firm spent analyzing over 80 non-reasoning models ($2,400).
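A quick back-of-the-envelope using those totals puts the gap in per-model terms (a rough sketch in Python; the model counts above are approximate, so the averages are only indicative):

# Average benchmarking cost per model, using Artificial Analysis' approximate totals above.
reasoning_total_usd, reasoning_model_count = 5_200, 12              # ~a dozen reasoning models
non_reasoning_total_usd, non_reasoning_model_count = 2_400, 80      # 80+ non-reasoning models

avg_reasoning = reasoning_total_usd / reasoning_model_count              # ~$433 per model
avg_non_reasoning = non_reasoning_total_usd / non_reasoning_model_count  # ~$30 per model

print(f"~${avg_reasoning:,.0f} vs ~${avg_non_reasoning:,.0f} per model "
      f"(~{avg_reasoning / avg_non_reasoning:.0f}x more per reasoning model)")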

OpenAI’s non-reasoning GPT-4o model, released in May 2024, cost Artificial Analysis just $108.85 to evaluate, while Claude 3.6 Sonnet — Claude 3.7 Sonnet’s non-reasoning predecessor — cost $81.41.

Artificial Analysis co-founder George Cameron told TechCrunch that the organization plans to increase its benchmarking spend as more AI labs develop reasoning models.

“At Artificial Analysis, we run hundreds of evaluations monthly and devote a significant budget to these,” Cameron said. “We are planning for this spend to increase as models are more frequently released.”

Artificial Analysis isn’t the only outfit of its kind that’s dealing with rising AI benchmarking costs.

Ross Taylor, the CEO of AI startup General Reasoning, said he recently spent $580 evaluating Claude 3.7 Sonnet on around 3,700 unique prompts. Taylor estimates a single run-through of MMLU Pro, a question set designed to benchmark a model’s language comprehension skills, would have cost more than $1,800.

“We’re moving to a world where a lab reports x% on a benchmark where they spend y amount of compute, but where the resources for academics are [far less than] y,” Taylor wrote in a recent post on X. “[N]o one is going to be able to reproduce the results.”

Why are reasoning models so expensive to test? Mainly because they generate a lot of tokens. Tokens represent bits of raw text, such as the word “fantastic” split into the syllables “fan,” “tas,” and “tic.” According to Artificial Analysis, OpenAI’s o1 generated over 44 million tokens during the firm’s benchmarking tests, around eight times the amount GPT-4o generated.

The vast majority of AI companies charge for model usage by the token, so you can see how this cost can add up.
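As a minimal sketch of how per-token billing turns into a benchmark bill, the snippet below prices a run from its output-token count alone; the 44-million-token figure is the one cited above for o1, while the per-million-token rates are illustrative assumptions, and real invoices also include input tokens:

def eval_cost_usd(output_tokens: int, price_per_million: float) -> float:
    """Approximate cost of a benchmark run, counting output tokens only."""
    return output_tokens / 1_000_000 * price_per_million

o1_output_tokens = 44_000_000                 # per Artificial Analysis' benchmarking runs
gpt4o_output_tokens = o1_output_tokens // 8   # GPT-4o generated roughly an eighth as many

# The per-million rates below are illustrative assumptions, not quoted prices.
print(f"o1:     ${eval_cost_usd(o1_output_tokens, 60.0):,.0f}")      # ~$2,640
print(f"GPT-4o: ${eval_cost_usd(gpt4o_output_tokens, 10.0):,.0f}")   # ~$55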

Modern benchmarks also tend to elicit a lot of tokens from models because they contain questions involving complex, multi-step tasks, according to Jean-Stanislas Denain, a senior researcher at Epoch AI, which develops its own model benchmarks.

“[Today’s] benchmarks are more complex [even though] the number of questions per benchmark has overall decreased,” Denain told TechCrunch. “They often attempt to evaluate models’ ability to do real-world tasks, such as write and execute code, browse the internet, and use computers.”

Denain added that the most expensive models have gotten more expensive per token over time. For example, Anthropic’s Claude 3 Opus was the priciest model when it was released in March 2024, costing $75 per million output tokens. OpenAI’s GPT-4.5 and o1-pro, both of which launched earlier this year, cost $150 per million output tokens and $600 per million output tokens, respectively.

“[S]ince models have gotten better over time, it’s still true that the cost to reach a given level of performance has greatly decreased over time,” Denain said. “But if you want to evaluate the best largest models at any point in time, you’re still paying more.”
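To make that trend concrete, the sketch below prices one hypothetical benchmark run (the roughly 44 million output tokens o1 produced in Artificial Analysis’ tests) at the three output-token rates quoted above; it is a simplification that ignores input tokens and the fact that each model would generate a different amount of output:

# Price the same hypothetical 44M-output-token benchmark run at the quoted output rates.
rates_usd_per_million = {      # USD per million output tokens, as cited above
    "Claude 3 Opus": 75,
    "GPT-4.5": 150,
    "o1-pro": 600,
}
output_tokens = 44_000_000

for model, rate in rates_usd_per_million.items():
    print(f"{model}: ${output_tokens / 1_000_000 * rate:,.0f}")
# Claude 3 Opus: $3,300   GPT-4.5: $6,600   o1-pro: $26,400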

Many AI labs, including OpenAI, give benchmarking organizations free or subsidized access to their models for testing purposes. But this colors the results, some experts say — even if there’s no evidence of manipulation, the mere suggestion of an AI lab’s involvement threatens to harm the integrity of the evaluation scoring.

“From [a] scientific point of view, if you publish a result that no one can replicate with the same model, is it even science anymore?” wrote Taylor in a follow-up post on X. “(Was it ever science, lol)”.



