April 18, 2025
As one of the first public initiatives at AGI, Inc., we are releasing our web agent evaluation benchmark, REAL Bench (REAL for short). REAL is a compact yet fully functional ‘mini-Internet,’ complete with near-exact replicas of the most popular real-life sites. It’s not just a playground; it’s a standardized test lab for browser agent systems, designed to supercharge our understanding of web performance, security, and next-gen AI interactions.
At AGI Inc, our wider goal is to redefine human-AI collaboration by combining cutting-edge research with practical applications, aiming to create systems that go beyond simple chat interfaces to deliver real impact in work and personal life.
Towards this goal, REAL is an engine for improving agent performance together with the research community. Anyone can evaluate agents, get a task success score (REAL Score), and compete on a leaderboard. REAL is the ultimate testbed for next-gen AI on the web.
These benchmarks are how we build trust that agents can actually do useful things. Because the environments are near-exact replicas of real websites, an agent that can complete a task on our replica should also be able to complete the same task on the actual site.
Right now, there are no realistic, standard evaluation benchmarks for web agents.
Current evaluation frameworks for AI agents, such as WebArena, OSWorld, WebVoyager, and WebCanvas, often fall short in replicating the intricacies of real-world web interactions. They tend to be too simplistic, failing to capture the dynamic nature of the internet. REAL addresses this gap by providing a comprehensive, realistic testing ground that lets developers and researchers evaluate agents under the conditions they will face on the live web.
REAL includes sandbox replicas of widely-used websites across various domains, such as e-commerce, social media, news, and more. These replicas maintain the structural and functional elements of their real-world counterparts, providing a familiar and challenging environment for AI agents.
We created a test framework with static replicas of the top 11 websites, hosted on the internet:
REAL features a public leaderboard where developers can submit their agents' performance. This fosters a competitive and collaborative environment, encouraging continuous improvement and innovation within the AI community.
To measure the performance of closed LLM APIs and open-source models on web agent tasks, we benchmarked them using a baseline agent we are releasing (see our AGI SDK package below).
We also benchmark existing web agent frameworks: Browser-Use and Stagehand. For each query, we specified the same natural-language prompt and output field schema across all models. For Stagehand and Browser-Use, we used GPT-4o. We compare them with our own internal agent, AGI Agent-0, which we will be giving access to soon.
We compare open and closed source frontier models using our baseline agent.
We find that reasoning models perform better than standard pre-trained models; the highest score was achieved by Claude-3.7-Sonnet-Thinking (41.1% REAL Score), showing the advantage of improved reasoning capabilities.
Open-source models lag far behind, with DeepSeek-V3 at 19.6% and Llama models scoring even lower (around 11-13%). The three main frontier models perform head-to-head: Claude-3.7-Sonnet-Thinking (41.1%), Gemini-2.5-Pro-Experimental (38.4%), and o3 (34.8%).
Qualitatively, while frontier reasoning models have impressive capabilities, they still face significant challenges: they often fail to recognize task completion accurately (inadequate state verification), get stuck in navigation dead ends without recovering, struggle with complex UI elements, and find multi-step tasks difficult.
Top Closed-Source Model Scores:
Top Open-Source Model Scores:
For current agent frameworks, we observe unimpressive performance on REAL Bench, showing that real-world readiness is not there yet. OpenAI’s CUA achieves a 7.1% REAL Score, with Stagehand and Browser-Use at 19% and 31% respectively. Anthropic Computer-Use reaches a score of 41%.
Our yet-to-be-released internal agent, AGI Agent-0, reaches a score of 45%.
Manual inspection of trajectories showed that these agents often exit the interaction loop arbitrarily and have trouble following the exact task specification provided over multiple steps. Often, systems using screenshots for navigation have a mismatch between what they plan to click and the actual action they perform.
To build transparency in the community, we are releasing the evaluation harness for reproducing REAL Scores on agent tasks, and we welcome contributions.
The frameworks with the best scores:
Agents are assessed using a range of metrics that capture both their effectiveness and reliability:
We provide two outcome reward functions for each environment: one for judging information-retrieval task success, and one for judging action-taking task success.
For tasks involving information retrieval, we use a rubric-based LLM judge to decide whether the retrieved answer matches the ground truth in our test set. For tasks involving executing actions on a website, we use a state-diff checking mechanism that compares the difference between the initial and final states to the ground truth defined as part of our evaluation set.
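To illustrate the action-task check only, here is a minimal sketch: the dict-based state format and field names below are assumptions for illustration, not the REAL harness internals.

# Sketch of a state-diff check; state representation and field names are assumed.
def state_diff(initial: dict, final: dict) -> dict:
    # Fields whose values changed between the initial and final website state.
    return {k: v for k, v in final.items() if initial.get(k) != v}

def action_outcome_reward(initial: dict, final: dict, expected_diff: dict) -> int:
    # Binary outcome reward: 1 only if every expected change appears in the observed diff.
    observed = state_diff(initial, final)
    return int(all(observed.get(k) == v for k, v in expected_diff.items()))

# Example: a task that should add one item to the cart.
before = {"cart_items": 0, "logged_in": True}
after = {"cart_items": 1, "logged_in": True}
print(action_outcome_reward(before, after, {"cart_items": 1}))  # -> 1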
Task completion is determined by a single binary outcome reward function. For tasks involving both retrieval and actions, we check whether both outcome reward functions return 1. Agent performance (accuracy on the benchmark) is measured as the success rate across all 112 tasks.
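To make the aggregation concrete, here is a minimal sketch; the per-task reward tuples are toy values, not the harness's real data structures.

# Sketch of score aggregation over binary outcome rewards.
def task_success(retrieval_reward, action_reward):
    # A task succeeds only if every applicable outcome reward function returns 1.
    rewards = [r for r in (retrieval_reward, action_reward) if r is not None]
    return int(all(r == 1 for r in rewards))

# (retrieval_reward, action_reward) per task; None means "not applicable".
tasks = [(1, None), (None, 1), (1, 1), (0, 1)]
real_score = sum(task_success(r, a) for r, a in tasks) / len(tasks)
print(f"REAL Score: {real_score:.1%}")  # on the benchmark this runs over all 112 tasks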
Clicking into a model on the leaderboard gives a detailed breakdown of how that model is performing.
Here, we can see details on what kinds of problems the model struggled with and how task completion breaks down by website. You can get finer granularity by clicking on individual websites to see what each task was and whether the model was able to clear it.
Here is the detailed breakdown of the tasks under the TopWork website, where we can see what “Types” of problems there are and how well the model is doing on each.
You can check out our GitHub repo: AGI SDK
First, install the AGI SDK package using pip:
pip install agisdk
Create a Python script with the following code:
from agisdk import real
harness = real.Harness(model="gpt-4o") # specify your model here
results = harness.run()
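As a follow-up sketch, you could summarize the run once it finishes. Note that the structure assumed below (results behaving like a mapping from task IDs to records with a success flag) is an illustration, not the documented agisdk return type; adapt it to whatever harness.run() actually returns.

# Sketch only: the shape of `results` is an assumption for illustration.
def summarize(results):
    passed = sum(1 for record in results.values() if record.get("success"))
    print(f"Passed {passed}/{len(results)} tasks ({passed / len(results):.1%})")

summarize(results)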
Once the evaluation completes, you'll receive a comprehensive report on your agent's performance across the benchmark tasks.
Additionally, under the Profile section (top right), you can register your own custom models and then run them.
Developers or companies looking to deploy an AI agent to their website can use REAL to pick the best model or agent framework.
REAL serves three main use cases:
REAL represents a significant advancement in the evaluation of AI agents, offering a realistic, comprehensive, and standardized platform for testing and development. By bridging the gap between controlled testing environments and the complexities of the real web, REAL empowers developers, researchers, and businesses to create more robust, reliable, and effective AI agents.
Join the REAL community today and contribute to shaping the future of AI agent development.
Ready to get started with REAL Bench today and learn more about use cases tailored to you? Let's chat 🚀
We are open to partnering with companies, research labs, developers, and enterprises looking to adopt agents. You can reach our team at partner@theagi.company.