Close Menu
World Forbes – Business, Tech, AI & Global Insights
  • Home
  • AI
  • Billionaires
  • Business
  • Cybersecurity
  • Education
    • Innovation
  • Money
  • Small Business
  • Sports
  • Trump
What's Hot

Farmers’ Almanac to cease publication after 2 centuries of predicting the weather

November 7, 2025

Rockefeller Christmas tree begins journey to NYC from upstate

November 6, 2025

What to do if your airport is on the FAA’s flight cut list

November 6, 2025
Facebook X (Twitter) Instagram
Trending
  • Farmers’ Almanac to cease publication after 2 centuries of predicting the weather
  • Rockefeller Christmas tree begins journey to NYC from upstate
  • What to do if your airport is on the FAA’s flight cut list
  • Why autoimmune diseases mostly strike women and are often misdiagnosed
  • Why autoimmune diseases mostly strike women and are often misdiagnosed
  • How A $500 Million Cash Infusion From Wall Street Adds Billions To Ripple’s Founders’ Net Worths
  • Thousands of miles of lost Roman roads are uncovered using aerial photos
  • Toy Hall of Fame recognizes Slime, Battleship, Trivial Pursuit
World Forbes – Business, Tech, AI & Global InsightsWorld Forbes – Business, Tech, AI & Global Insights
Friday, November 7
  • Home
  • AI
  • Billionaires
  • Business
  • Cybersecurity
  • Education
    • Innovation
  • Money
  • Small Business
  • Sports
  • Trump
World Forbes – Business, Tech, AI & Global Insights
Home » Debates over AI benchmarking have reached Pokémon
AI

Debates over AI benchmarking have reached Pokémon

By adminApril 14, 2025No Comments2 Mins Read
Facebook Twitter Pinterest LinkedIn Tumblr WhatsApp Telegram Email
Share
Facebook Twitter LinkedIn Pinterest Email
Post Views: 96


Not even Pokémon is safe from AI benchmarking controversy.

Last week, a post on X went viral, claiming that Google’s latest Gemini model surpassed Anthropic’s flagship Claude model in the original Pokémon video game trilogy. Reportedly, Gemini had reached Lavendar Town in a developer’s Twitch stream; Claude was stuck at Mount Moon as of late February.

Gemini is literally ahead of Claude atm in pokemon after reaching Lavender Town

119 live views only btw, incredibly underrated stream pic.twitter.com/8AvSovAI4x

— Jush (@Jush21e8) April 10, 2025

But what the post failed to mention is that Gemini had an advantage.

As users on Reddit pointed out, the developer who maintains the Gemini stream built a custom minimap that helps the model identify “tiles” in the game like cuttable trees. This reduces the need for Gemini to analyze screenshots before it makes gameplay decisions.

Now, Pokémon is a semi-serious AI benchmark at best — few would argue it’s a very informative test of a model’s capabilities. But it is an instructive example of how different implementations of a benchmark can influence the results.

For example, Anthropic reported two scores for its recent Anthropic 3.7 Sonnet model on the benchmark SWE-bench Verified, which is designed to evaluate a model’s coding abilities. Claude 3.7 Sonnet achieved 62.3% accuracy on SWE-bench Verified, but 70.3% with a “custom scaffold” that Anthropic developed.

More recently, Meta fine-tuned a version of one of its newer models, Llama 4 Maverick, to perform well on a particular benchmark, LM Arena. The vanilla version of the model scores significantly worse on the same evaluation.

Given that AI benchmarks — Pokémon included — are imperfect measures to begin with, custom and non-standard implementations threaten to muddy the waters even further. That is to say, it doesn’t seem likely that it’ll get any easier to compare models as they’re released.





Source link

Share. Facebook Twitter Pinterest LinkedIn Tumblr Email
admin
  • Website

Related Posts

After Klarna, Zoom’s CEO also uses an AI avatar on quarterly call

May 23, 2025

Anthropic CEO claims AI models hallucinate less than humans

May 22, 2025

Anthropic’s latest flagship AI sure seems to love using the ‘cyclone’ emoji

May 22, 2025

A safety institute advised against releasing an early version of Anthropic’s Claude Opus 4 AI model

May 22, 2025

Anthropic’s new AI model turns to blackmail when engineers try to take it offline

May 22, 2025

Meta adds another 650 MW of solar power to its AI push

May 22, 2025
Add A Comment
Leave A Reply

Don't Miss
Billionaires

How A $500 Million Cash Infusion From Wall Street Adds Billions To Ripple’s Founders’ Net Worths

November 6, 2025

The company behind the world’s fourth largest crypto is reinventing itself as a conglomerate. Two…

World’s Largest Bubble Tea Chain Mixue Mints Two Newcomers To China’s 100 Richest List

November 5, 2025

Combined Wealth Surges Nearly A Third To $1.35 Trillion; Bottled Water Billionaire Zhong Shanshan Is No. 1

November 5, 2025

The Biggest Billionaire Donors To HBCUs

November 5, 2025
Our Picks

Farmers’ Almanac to cease publication after 2 centuries of predicting the weather

November 7, 2025

Rockefeller Christmas tree begins journey to NYC from upstate

November 6, 2025

What to do if your airport is on the FAA’s flight cut list

November 6, 2025

Why autoimmune diseases mostly strike women and are often misdiagnosed

November 6, 2025

Subscribe to Updates

Subscribe to our newsletter and never miss our latest news

Subscribe my Newsletter for New Posts & tips Let's stay updated!

About Us
About Us

Welcome to World-Forbes.com
At World-Forbes.com, we bring you the latest insights, trends, and analysis across various industries, empowering our readers with valuable knowledge. Our platform is dedicated to covering a wide range of topics, including sports, small business, business, technology, AI, cybersecurity, and lifestyle.

Our Picks

After Klarna, Zoom’s CEO also uses an AI avatar on quarterly call

May 23, 2025

Anthropic CEO claims AI models hallucinate less than humans

May 22, 2025

Anthropic’s latest flagship AI sure seems to love using the ‘cyclone’ emoji

May 22, 2025

Subscribe to Updates

Subscribe to our newsletter and never miss our latest news

Subscribe my Newsletter for New Posts & tips Let's stay updated!

Facebook X (Twitter) Instagram Pinterest
  • Home
  • About Us
  • Advertise With Us
  • Contact Us
  • DMCA Policy
  • Privacy Policy
  • Terms & Conditions
© 2025 world-forbes. Designed by world-forbes.

Type above and press Enter to search. Press Esc to cancel.