A Gaming Classic Nears 40 and Supports Contemporary Research: Super Mario Meets AI


# Not a New Benchmark, but Still Intriguing: Mario as a Test for AI

## AI Metrics: From Logic to Interactive Gaming

Artificial intelligence (AI) models are usually evaluated on benchmarks that measure their mathematical, logical, and analytical skills. A team at the University of California San Diego (UCSD) took a different approach: instead of relying on conventional tests, they had AI models play *Super Mario Bros.* The experiment, reported by *TechSpot*, highlights an aspect of AI performance that standard benchmarks rarely capture: timing and split-second decision-making.

## The Study: GamingAgent as an AI Operator

The research group at UCSD’s Hao AI Lab built **GamingAgent**, a framework that lets AI models control Mario by issuing inputs as Python code. The game itself ran in an emulator of the original NES version of *Super Mario Bros.*

The models received simple instructions such as **“Leap over this foe”**, along with screenshots of the current game state for context. The goal was to measure how well each model could plan its moves and adapt in real time.
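To make the setup more concrete, here is a minimal sketch of an observe, ask, act loop of the kind described above. It is not GamingAgent's actual code: `Observation`, `query_model`, `parse_action`, `ACTIONS`, and `INSTRUCTION` are hypothetical names, the screenshot is replaced by a single number (distance to the nearest enemy), and the model call is a rule-based stub rather than a real LLM API.

```python
from dataclasses import dataclass

# Hypothetical stand-in for a screenshot: a real agent would send an image,
# here we pass only the horizontal distance (in pixels) to the nearest enemy.
@dataclass
class Observation:
    enemy_distance: int  # pixels; a negative value means no enemy ahead

# Hypothetical mapping from action names to NES controller buttons
# (in Super Mario Bros., A jumps and B runs).
ACTIONS = {
    "run": ["RIGHT", "B"],
    "jump": ["RIGHT", "A"],
    "wait": [],
}

# Hypothetical instruction text, in the spirit of the prompts described above.
INSTRUCTION = "If an enemy is near, jump over it; otherwise keep running right."

def query_model(instruction: str, obs: Observation) -> str:
    """Hypothetical stub for the model call. A real agent would send the
    instruction plus a screenshot to a model and read back its reply."""
    if 0 <= obs.enemy_distance <= 32:
        return "jump"
    return "run"

def parse_action(reply: str) -> list[str]:
    """Map the model's reply to buttons, falling back to 'wait' if unknown."""
    return ACTIONS.get(reply.strip().lower(), ACTIONS["wait"])

def run_episode(observations: list[Observation]) -> list[list[str]]:
    """Make one decision per observation: observe, ask the model, press buttons."""
    presses = []
    for obs in observations:
        reply = query_model(INSTRUCTION, obs)
        presses.append(parse_action(reply))
    return presses

if __name__ == "__main__":
    frames = [Observation(-1), Observation(80), Observation(20)]
    print(run_episode(frames))
    # [['RIGHT', 'B'], ['RIGHT', 'B'], ['RIGHT', 'A']]
```

Mapping replies onto a fixed whitelist of actions, rather than executing whatever the model returns, is a design choice made here for safety and simplicity; it is an assumption of this sketch, not a description of how GamingAgent handles the model's Python output.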

## Unexpected Findings: Claude 3.7 Shines, GPT-4o Faces Challenges

The results were surprising. **Anthropic’s Claude 3.7** outperformed every other model tested, showing precise jumps, skillful enemy avoidance, and aggressive play. Its predecessor, **Claude 3.5**, also did well, though less impressively.

In contrast, **OpenAI’s GPT-4o and Google’s Gemini 1.5 Pro struggled considerably**. Although both perform well on conventional reasoning benchmarks, they often failed at basic game actions: they mistimed jumps, fell into pits, or ran into enemies. This suggests that their decisions arrived too slowly for the game’s fast pace.

## Quickness Trumps Logic

One key insight from the experiment is that **fast reactions and real-time adaptability matter more than elaborate reasoning, at least when it comes to playing Mario**.

When some models tried to “reason out” a situation before acting, the extra deliberation **delayed their inputs**. In *Super Mario Bros.*, a few milliseconds can decide between clearing a jump and losing a life. The researchers hypothesize that models like GPT-4o spend too long computing before each action, which hurts their in-game performance.
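A quick back-of-the-envelope calculation shows why response time matters so much. An NTSC NES renders roughly 60 frames per second (about 16.6 ms per frame), so every second a model spends deciding lets the game advance about 60 frames without new input. The sketch below assumes the approximate NTSC frame rate and uses hypothetical latency values; it does not reproduce any measurement from the study.

```python
# Approximate frame rate of an NTSC NES (assumption; PAL systems run at ~50 Hz).
NES_NTSC_FPS = 60.0988

def frames_elapsed(latency_s: float, fps: float = NES_NTSC_FPS) -> int:
    """Whole frames the game advances while the agent is still deciding."""
    return int(latency_s * fps)

if __name__ == "__main__":
    # Hypothetical decision latencies, in seconds, for illustration only.
    for latency in (0.1, 2.0, 5.0):
        print(f"{latency:>4} s -> {frames_elapsed(latency)} frames")
    # 0.1 s -> 6 frames
    #  2.0 s -> 120 frames
    #  5.0 s -> 300 frames
```

Even a two-second pause, short by the standards of a chatbot reply, corresponds to more than a hundred frames in which Mario keeps moving (or standing still) on stale instructions.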

## Can Retro Games Function as AI Benchmarks?

This study poses a compelling question: **Can classic video games serve as useful AI metrics?**

Excelling at *Super Mario Bros.* does not guarantee that an AI can handle complex real-world problems, but the evaluation still offers useful insight. It highlights the importance of **fast, reflexive decision-making**, which is essential in many real-world settings, including robotics, autonomous navigation, and real-time planning.

## Final Thoughts

The UCSD research illustrates that AI effectiveness depends not only on sheer computational capacity or logical reasoning but also on **how swiftly and adeptly an AI can respond to changing environments**. Though *Super Mario Bros.* may not evolve into a standard AI benchmark, it provides an intriguing method for assessing AI adaptability and responsiveness.

As AI continues to advance, unconventional evaluations like this one could help researchers better understand the strengths and limitations of different models, paving the way for more capable AI systems in the future.