Introducing REAL Bench: A New Standard for Web AI Agent Evaluation

April 18, 2025

Bridging the Gap Between AI Agents and the Real Web

As one of the first public initiatives at AGI Inc., we are releasing our web agent evaluation benchmark, REAL Bench (REAL for short). REAL is a compact yet fully functional ‘mini-Internet,’ complete with near-exact replicas of the most popular real-life sites. It’s not just a playground; it’s a standardized test lab for browser agent systems, designed to advance our understanding of web performance, security, and next-gen AI interactions.

At AGI Inc, our wider goal is to redefine human-AI collaboration by combining cutting-edge research with practical applications, aiming to create systems that go beyond simple chat interfaces to deliver real impact in work and personal life.

Towards this goal, REAL is an engine for improving agent performance together with the research community. Anyone can evaluate agents, get a task success score (the REAL Score), and compete on a leaderboard. REAL is the ultimate testbed for next-gen AI on the web.

Benchmarks like this are how we build trust that agents can actually do useful things. Because our environments are near-exact replicas of real websites, an agent that can complete tasks on our replicas should also be able to complete those same tasks on the actual websites.

Why REAL Bench?

Right now, there are no realistic, standard evaluation benchmarks for web agents. 

Current evaluation frameworks for AI agents, such as WebArena, OSWorld, WebVoyager, and WebCanvas, often fall short in replicating the intricacies of real-world web interactions. They tend to be too simplistic, failing to capture the dynamic nature of the internet. REAL addresses this gap by providing a comprehensive, realistic testing ground for AI agents, enabling developers and researchers to:

  • Assess Real-World Performance: Evaluate how AI agents navigate and interact with web environments that closely mimic actual websites.
  • Ensure Security and Reliability: Test agents against scenarios that would require adherence to security protocols and reliable operation on actual websites.
  • Benchmark Against Industry Standards: Compare agent performance using standardized metrics and scenarios, facilitating objective assessments.

Key Features of REAL

Realistic Web Environment

REAL includes sandbox replicas of widely used websites across various domains, such as e-commerce, social media, news, and more. These replicas maintain the structural and functional elements of their real-world counterparts, providing a familiar and challenging environment for AI agents.

Our test framework includes 11 static websites, hosted on the internet, that replicate top real-world sites:

  1. Staynb (Airbnb clone)
  2. Omnizon (Amazon clone)
  3. DashDish (DoorDash clone)
  4. GoCalendar (Google Calendar clone)
  5. GoMail (Gmail clone)
  6. OpenDining (OpenTable clone)
  7. NetworkIn (LinkedIn clone)
  8. Udriver (Uber clone)
  9. Fly Unified (United Airlines clone)
  10. TopWork (Upwork clone)
  11. Zilloft (Zillow clone)

Leaderboard and Community Engagement

REAL features a public leaderboard where developers can submit their agents' performance. This fosters a competitive and collaborative environment, encouraging continuous improvement and innovation within the AI community.

To measure the performance of closed LLM APIs and open-source models on web agent tasks, we benchmarked them using a baseline agent we are releasing (see the AGI SDK package below).

We also benchmark existing web agent frameworks: Browser Use and Stagehand. For each query, we specified the same natural language prompt and output field schema across all models. For Stagehand and Browser Use, we used GPT-4o. We compare them with our own internal agent, AGI Agent-0, which we will be giving access to soon.

Model Evaluation:

We compare open- and closed-source frontier models using our baseline agent.

We find that reasoning models perform better than standard pre-trained models; the highest score was achieved by Claude-3.7-Sonnet-Thinking (41.1% REAL Score), showing the advantage of improved reasoning capabilities.

Open-source models lag far behind, with DeepSeek-V3 at 19.6% and Llama models scoring even lower (around 11-13%). The three main frontier models perform head-to-head: Claude-3.7-Sonnet-Thinking (41.1%), Gemini-2.5-Pro-Experimental (38.4%), and o3 (34.8%).

Qualitatively, while frontier reasoning models have impressive capabilities, they still face significant challenges: they often fail to recognize task completion accurately (inadequate state verification), get stuck in navigation dead ends without recovering, struggle with complex UI elements, and find multi-step tasks difficult.

Top Closed-Source Model Scores:

REAL Score (% of tasks completed) on each website

Top Open-Source Model Scores:

REAL Score (% of tasks completed) on each website

Framework Evaluation:

Current agent frameworks show unimpressive performance on REAL Bench, indicating that they are not yet ready for the real world. OpenAI’s CUA achieves a 7.1% REAL Score, with Stagehand and Browser Use at 19% and 31%, respectively. Anthropic’s Computer Use reaches a score of 41%.

Our yet-to-be-released internal agent, AGI Agent-0, reaches a score of 45%.

Manual inspection of trajectories showed that these agents often exit the interaction loop arbitrarily and have trouble following the exact task specification provided over multiple steps. Often, systems using screenshots for navigation have a mismatch between what they plan to click and the actual action they perform.

We are releasing the evaluation harness for reproducing REAL Scores on agent tasks to build transparency in the community, and we welcome contributions.

The frameworks with the best scores:

REAL Score (% of tasks completed) on each website

Comprehensive Evaluation Metrics

Agents are assessed using a range of metrics that capture both their effectiveness and reliability:

We provide two outcome reward functions for each environment: one for judging information-retrieval task success, and one for judging action-taking task success.

For tasks involving information retrieval, we use a rubric-based LLM judge to decide whether the retrieved answer matches the ground truth in our test set. For tasks involving executing actions on a website, we have a state-diff checking mechanism that compares the difference between the initial and final states to the ground truth defined as part of our evaluation set.

Task completion is determined by a single binary outcome reward. For tasks involving both retrieval and actions, we check that both outcome reward functions return 1. Agent performance (accuracy on the benchmark) is the success rate across all 112 tasks.
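
To make the scoring concrete, here is a minimal sketch of how the two outcome rewards could be combined into a single binary task reward and aggregated into a REAL Score. The names used here (Task, retrieval_check, action_check, real_score) are illustrative assumptions for this sketch, not the AGI SDK’s actual API.

from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Task:
    # Hypothetical task record; either check may be absent for single-mode tasks.
    retrieval_check: Optional[Callable[[], bool]] = None  # rubric-based LLM judge vs. ground truth
    action_check: Optional[Callable[[], bool]] = None     # state diff vs. ground-truth final state

def task_reward(task: Task) -> int:
    # Binary outcome reward: 1 only if every applicable check passes, otherwise 0.
    checks = [c for c in (task.retrieval_check, task.action_check) if c is not None]
    return int(all(check() for check in checks))

def real_score(tasks: list[Task]) -> float:
    # REAL Score = success rate (%) across all tasks, e.g. the 112 benchmark tasks.
    return 100.0 * sum(task_reward(t) for t in tasks) / len(tasks)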

A detailed breakdown of each model’s performance is available by clicking into the model on the leaderboard.

Here, we can see which kinds of problems the model struggled with and how task completion breaks down by website. You can get finer granularity by clicking on individual websites to see what each task was and whether the model was able to clear it.

Here is the detailed breakdown of the tasks under the TopWork website, where we can see what “Types” of problems there are and how well the model performs on each.

Getting Started in 3 Simple Steps

You can check out our GitHub repo: AGI SDK

Step 1: Install the AGI SDK

First, install the AGI SDK package using pip:

pip install agisdk

Step 2: Initialize the Evaluation Harness

Create a Python script with the following code:

from agisdk import real
harness = real.Harness(model="gpt-4o")  # specify your model here
results = harness.run()  # runs the benchmark tasks and returns the evaluation results

Step 3: Analyze Your Results

Once the evaluation completes, you'll receive a comprehensive report with:

  • Overall REAL Score
  • Breakdown by task category
  • Comparison to baseline models
  • Detailed logs of model behavior
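
As a rough illustration of what such a report enables, the snippet below computes an overall score and a per-website breakdown. It assumes, purely for illustration, that the results can be flattened into (website, task_succeeded) pairs; the actual structure returned by harness.run() may differ.

from collections import defaultdict

# Hypothetical flattened results: (website, task_succeeded) pairs for illustration only.
task_results = [
    ("Omnizon", True),
    ("GoMail", False),
    ("Staynb", True),
]

# Overall REAL Score: percentage of tasks completed successfully.
overall = 100.0 * sum(ok for _, ok in task_results) / len(task_results)
print(f"Overall REAL Score: {overall:.1f}%")

# Per-website breakdown of task success rates.
per_site = defaultdict(list)
for site, ok in task_results:
    per_site[site].append(ok)
for site, outcomes in sorted(per_site.items()):
    print(f"{site}: {100.0 * sum(outcomes) / len(outcomes):.1f}% ({len(outcomes)} tasks)")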

Additionally, under the Profile section (top right), you can register your own custom model and then run it.

Use Cases and Applications

Developers or companies looking to deploy an AI agent on their website can use REAL to pick the best model or agent framework.

REAL has three main use cases:

  • Standardized Environment for AI Web Agents: A controlled, customizable testbed enables consistent evaluation of capabilities like navigation, information retrieval, and task completion across different frameworks and models.
  • Quantitative Benchmarking System: Establishing objective metrics lets researchers track performance improvements, identify capability gaps, and compare different approaches with standardized difficulty levels and success criteria.
  • Reinforcement Learning Foundation: The combination of structured environments, well-defined tasks, and validation mechanisms creates an ideal setup for training autonomous web agents through reinforcement learning techniques (see the sketch below).
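
For the reinforcement learning use case, the episode structure could look like the sketch below. Everything here (WebEnv, reset, step, the act policy) is a hypothetical interface for illustration, not REAL’s actual API; the point is that REAL’s binary outcome reward arrives as a sparse signal at the end of each episode.

import random

class WebEnv:
    # Hypothetical wrapper around a REAL website replica; not the SDK's API.
    def reset(self, task: str) -> str:
        self.steps = 0
        return f"start page for task: {task}"  # initial observation (e.g. DOM or screenshot)

    def step(self, action: str) -> tuple[str, bool]:
        self.steps += 1
        done = self.steps >= 10 or action == "finish"
        return f"page after {action}", done

def outcome_reward(env: WebEnv) -> int:
    # Stand-in for REAL's binary check (LLM judge and/or state diff); random here.
    return random.randint(0, 1)

def act(observation: str) -> str:
    # Placeholder policy; a real agent would choose clicks, typing, navigation, etc.
    return random.choice(["click", "type", "navigate", "finish"])

def run_episode(env: WebEnv, task: str) -> int:
    obs, done = env.reset(task), False
    while not done:
        obs, done = env.step(act(obs))
    return outcome_reward(env)  # sparse binary reward used as the training signal

print(run_episode(WebEnv(), "book a stay on Staynb"))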

Conclusion

REAL represents a significant advancement in the evaluation of AI agents, offering a realistic, comprehensive, and standardized platform for testing and development. By bridging the gap between controlled testing environments and the complexities of the real web, REAL empowers developers, researchers, and businesses to create more robust, reliable, and effective AI agents.

Join the REAL community today and contribute to shaping the future of AI agent development.

Ready to get started with REAL Bench today and learn more about use cases tailored to you? Let's chat 🚀

Partner with us 

We are open to partnering with companies, research labs, developers, and enterprises looking to adopt agents. You can reach our team at partner@theagi.company.