In the early internet, Google’s PageRank algorithm ranked websites by relevance so users could quickly find what they needed. In the coming Internet of Agents, where countless AI agents will proliferate, how will users discover the right agents and know which ones to trust?
Recall standardizes agents using verifiable performance data and reputation scores. It begins with alternative benchmarking and, over time, aims to provide the infrastructure for fast, accurate agent discovery across the Internet of Agents.
The first task is to accumulate evaluation data through onchain competitions that address trust issues in legacy benchmarks. Recall evaluates performance with a dynamic benchmark in which agents compete in live simulation environments, then ranks agents using the resulting data.
Recall’s architecture comprises AgentRank, Curation Markets, Onchain Competitions, Skill Pools, and Agent Predict. These modules operate independently yet interconnect to unify performance evaluation and incentives within a single protocol.
The best-case path follows Polymarket and Google’s PageRank. First, build a high-trust benchmark using market-like mechanisms that harness crowd wisdom and economic incentives. Then evolve into an agent search engine akin to PageRank, capturing the first touch point of the Internet of Agents.
In 2023, a video titled “Will Smith eating spaghetti” appeared on Reddit. Created with the ModelScope video generator, it depicted actor Will Smith eating spaghetti. His face was distorted and his motions were unnatural. The clip, both eerie and oddly funny, spread quickly and became a meme that showcased the limitations of early AI video generation.
The “Will Smith eating spaghetti” prompt soon became an informal benchmark in the AI community. It served as a reference point for how naturally models could reproduce human actions and expressions. Each time a new model shipped, the prompt was used for side-by-side comparisons. In May 2025, a recreation generated with Google’s Veo 3 showed clear gains in facial fidelity, motion naturalness, and audio sync, earning a “passed the Will Smith eating spaghetti test” verdict from Forbes.
This spaghetti test highlights both the limits of traditional AI benchmarks and the promise of community-driven evaluation. Cutting-edge models such as Claude, GPT-5, Gemini, and DeepSeek often post strong benchmark scores, yet those scores are frequently measured on company-selected metrics in closed settings, and they may fail to guarantee real-world reproducibility. The result is a gap between benchmark results and what users actually experience.
By contrast, community-driven tests like the spaghetti prompt show how direct, user-led assessment can become a powerful trust-building mechanism. Aggregated, distributed judgments are more transparent and capture multiple perspectives, offering a more three-dimensional view of model quality than unilateral corporate reports.
Recall systematizes this community-driven evaluation. It runs onchain competitions where AI agents compete for ranking, reputation, and rewards, maximizing community participation through token incentives. The goal is to build a high-trust AI benchmarking system and, ultimately, to establish a reputation protocol that helps users discover and connect with high-reputation agents in the Internet of Agents. The sections that follow explain how Recall implements this infrastructure and how it can scale.
An AI benchmark is an evaluation tool designed to measure and compare model performance objectively. It uses predefined problem sets to test how efficiently a model solves given tasks and quantifies the results. Evaluation areas span knowledge, coding, ethics, and multilingual capability, with representative suites for each domain: MMLU measures undergraduate-level general knowledge, GSM8K measures mathematical reasoning, and HumanEval measures code generation. Aggregated results provide a standardized view of model performance.
Source: Epoch AI
Benchmarks matter because they offer a common yardstick for objective comparison. Every new model claims state-of-the-art results. Without a standardized framework, reported numbers would be arbitrary and hard to compare. Benchmarks enable like-for-like evaluation under the same conditions.
They also reduce search and decision costs during technology adoption. Pre-validated metrics act as reference points for researchers and companies. This lowers the need for duplicative in-house testing and lets teams focus on strategic adoption.
Source: BetterBench
Despite their value, structural limits are becoming clearer. A Stanford study analyzing 60 benchmarks across NLP, math and coding, and multimodal tasks found that more than half fail to distinguish model performance differences with statistical significance. In practice, top models are often separated by margins too small to feel meaningful, yet marketing materials frequently inflate these gaps.
Repeated trials expose reproducibility issues. With the same model and settings, running a benchmark 10 times often yields 1–3% score variance, and rankings can flip. If results are not stable, a single score should not be treated as definitive. Benchmarks that lack reproducibility struggle to represent real capability.
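This instability is easy to quantify. A minimal sketch, using hypothetical scores from ten runs of the same model on the same benchmark, shows how the run-to-run spread is computed:

```python
import statistics

def score_spread(scores: list[float]) -> tuple[float, float]:
    """Return the mean score and the spread (max - min) as a percent of the mean."""
    mean = statistics.mean(scores)
    spread_pct = (max(scores) - min(scores)) / mean * 100
    return mean, spread_pct

# Hypothetical scores from 10 runs of one model on one benchmark.
runs = [71.2, 70.4, 72.1, 70.9, 71.8, 69.9, 71.5, 70.7, 72.0, 71.1]
mean, spread = score_spread(runs)
print(f"mean={mean:.1f}, spread={spread:.1f}% of mean")
```

A spread of a few percent of the mean is often enough to reorder adjacent entries on a leaderboard, which is why a single published score is a weak signal.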
Trust has also eroded around governance and test integrity. It was reported that OpenAI funded the company behind a benchmark used for its o3 model, and questions were raised about whether the problem set appeared in o3’s training data. This illustrates how weak transparency and doubts about evaluation validity undermine credibility. Key limitations include:
Selective testing: Benchmarks skew toward tasks that are easy to grade. Math and coding are common, while ambiguous context understanding, ethical judgment, and creative communication are rarely measured. Easy skills get measured. Hard skills are sidelined.
Poor reproducibility: Leaderboard peaks often reflect one-off best runs rather than consistent performance. Real settings demand incomplete information handling, follow-up queries, and multi-step reasoning. When the same task is repeated, answers can vary. A high score does not guarantee stable performance in production.
Data contamination: Public datasets are likely seen during pretraining. If evaluation items leak into training, the benchmark measures recall rather than reasoning. High scores then fail to guarantee generalization to novel situations.
If these design flaws persist, both enterprises and users pay the cost. Individuals must over-research options and re-verify claims. Enterprises must reinterpret uncertain metrics and run extra fit-for-purpose tests on internal data, slowing decisions and inflating adoption costs.
As doubts rise, a new approach is gaining traction: dynamic benchmarking. Legacy practice relies on static, human-designed datasets. These are easy to compare under identical conditions, but over time they suffer from contamination and task bias.
Dynamic benchmarks generate evolving prompts and scenarios to evaluate models against future or unseen conditions. New data is created at evaluation time, or models co-generate questions and answers in real time. Another method sets up predefined simulation environments and evaluates how a model plans and acts to achieve goals. This moves beyond fixed problem sets to multilayered performance verification.
Recall adopts this dynamic approach. Instead of a static dataset, it verifies models in continuously updated, scenario-variant environments. The next section details Recall’s benchmark design and end-to-end process.
In the coming Internet of Agents, where countless AI agents will coexist, how can users discover the most relevant agents and know which ones to trust? Recall addresses this challenge by standardizing agents through verifiable performance data and reputation scores. Starting from alternative benchmarking, its ultimate goal is to establish a reputation protocol that enables discovery, commerce, and collaboration in the agent economy. The aim is to provide infrastructure that lets users quickly and accurately find the agents they need.
Source: Search Engine Land
The vision is functionally similar to Google’s PageRank. In the early internet, PageRank indexed and ranked a chaotic web of sites by relevance, allowing users to simply search and trust that the best content surfaced. This transformed discovery from manual portal listings into an automated system where algorithms crawled the web, ranking sites by reputation and relevance.
Recall aims to play the same role for the Internet of Agents, providing infrastructure for trusted discovery:
A2C (Agent to Consumer): A crypto investor searching for a trading agent to automate portfolio management optimized for risk and return over a set time horizon.
A2B (Agent to Business): A company looking for marketing agents to automate social listening, content creation, and customer outreach.
A2A (Agent to Agent): A security agent enhancing its intrusion detection by finding a risk analysis agent specialized in malicious traffic patterns that integrates seamlessly with existing infrastructure.
At the core of this reputation protocol is the benchmarking system. Agents must be evaluated through verifiable performance, and rankings must guide discovery. Recall implements this through five interconnected modules:
AgentRank: Collects, analyzes, and publishes agent performance data.
Onchain Competitions: Provides standardized evaluation environments and converts results into verifiable performance data.
Curation Markets: Community-driven selection using token staking as skin in the game.
Agent Predict: Community-based, prediction-market style benchmarking.
Skill Pools: Token-staking pools that signal community demand for specific skills and steer reward allocation.
Each module functions independently but connects to form a unified protocol for evaluation and incentives. Competition results feed into AgentRank. Curation Markets and Agent Predict reinforce these rankings. The token economy aligns all stakeholders to contribute over time. As more performance data accumulates, accuracy and trustworthiness improve, creating a compounding cycle of reliability.
3.1.1 AgentRank
Source: X (@recallnet)
AgentRank is a composite reputation score derived from live performance data and community staking. Onchain competitions and curation markets both exist to update this score. It quantifies capability by combining verified results with the economic stake that the community places on an agent’s future performance.
Source: Recall
New agents start with a baseline performance score (Y-axis) and low certainty (X-axis). As they compete, performance rises or falls relative to results. Certainty grows as competition outcomes and staking accumulate. Top-right agents combine strong performance over time with heavy staking, becoming both highly trusted and highly capable.
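Recall has not published AgentRank’s formula. The sketch below only mirrors the behavior described above: competition results move the performance estimate (Y-axis), while accumulated results and staking raise certainty (X-axis), so established agents move less on any single outcome. All names and constants are illustrative:

```python
from dataclasses import dataclass

@dataclass
class AgentRank:
    """Hypothetical AgentRank state; the real scoring function is not public."""
    performance: float = 50.0   # baseline performance estimate (Y-axis)
    certainty: float = 0.1      # confidence in the estimate (X-axis), capped at 1.0

    def record_result(self, competition_score: float, stake_weight: float = 0.0):
        # Blend new evidence into the estimate; higher certainty means a
        # smaller learning rate, so proven agents move less per result.
        lr = 1.0 - 0.8 * self.certainty
        self.performance += lr * (competition_score - self.performance)
        # Each result and each unit of stake adds certainty, up to the cap.
        self.certainty = min(1.0, self.certainty + 0.1 + 0.05 * stake_weight)

agent = AgentRank()
for s in (62, 68, 71, 70):              # four competition results
    agent.record_result(s, stake_weight=1.0)
print(round(agent.performance, 1), round(agent.certainty, 2))
```

After four solid results with staking behind it, the agent has drifted toward the top-right quadrant: higher performance, higher certainty.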
3.1.2 Curation Markets
AgentRank is reinforced through curation markets, where curators stake tokens on agents they believe will outperform. Agents with more stake gain higher scores and improved ranking prospects. If those agents perform well, curators are rewarded. If not, curators are penalized.
Compared to static benchmarks, this dual mechanism offers clear advantages:
Verifiable performance: Unlike static benchmarks, AgentRank dynamically updates based on real results in competitive environments. Agents repeatedly prove capability onchain, eliminating reliance on opaque claims.
Economic signaling: Community staking reflects collective conviction. Early backers of promising agents can be rewarded, while users benefit from transparent economic signals when choosing agents.
This system ensures neutrality. Results are transparent, onchain, and shaped by distributed decisions rather than a single authority. Reputation scores gain credibility through both performance and market validation.
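The stake-and-settle mechanics of curation markets can be sketched as follows. The reward and slash rates are hypothetical; Recall’s actual parameters are not public:

```python
def settle_curation(stakes: dict, results: dict,
                    reward_rate: float = 0.2, slash_rate: float = 0.1) -> dict:
    """Settle one cycle of a hypothetical curation market.

    stakes:  {curator: (agent, amount staked)}
    results: {agent: True if the agent outperformed, else False}
    Curators backing winners earn reward_rate on their stake;
    curators backing losers are slashed by slash_rate.
    """
    return {
        curator: amount * (reward_rate if results[agent] else -slash_rate)
        for curator, (agent, amount) in stakes.items()
    }

stakes = {"alice": ("trader_bot", 100.0), "bob": ("news_bot", 100.0)}
results = {"trader_bot": True, "news_bot": False}
payouts = settle_curation(stakes, results)
print(payouts)
```

The asymmetry between reward and slash rates is itself a design parameter: a harsher slash discourages careless staking, while a generous reward attracts early signal.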
At the center of Recall are onchain competitions. These events pit agents against one another under identical conditions in tasks such as portfolio management or code generation. For example, in a 7-day portfolio management challenge, agents trade using live market data and are evaluated on risk-adjusted returns. Results are recorded onchain, feeding directly into AgentRank.
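Recall does not specify its exact scoring formula, but the Sharpe ratio is a standard risk-adjusted-return metric and illustrates why a steady agent can outrank a volatile one with higher raw returns. The daily returns below are hypothetical:

```python
import statistics

def sharpe_ratio(daily_returns: list[float], risk_free_daily: float = 0.0) -> float:
    """Mean excess return divided by return volatility over the window.
    One common risk-adjusted metric; Recall's actual formula is not public."""
    excess = [r - risk_free_daily for r in daily_returns]
    return statistics.mean(excess) / statistics.stdev(excess)

# Hypothetical daily returns from a 7-day challenge (fractions, not percent).
agent_a = [0.020, 0.015, -0.005, 0.030, 0.010, 0.000, 0.025]   # steady gains
agent_b = [0.150, -0.120, 0.200, -0.180, 0.160, -0.100, 0.050]  # volatile
print(sharpe_ratio(agent_a), sharpe_ratio(agent_b))
```

Agent B’s cumulative return is higher, but its volatility crushes its Sharpe ratio, so a risk-adjusted ranking places the steadier Agent A first.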
Source: Recall
Recall runs recurring competitions to accumulate evaluation data. In a recent ‘Crypto Trading Challenge’, 10 agents competed for a $10,000 prize over 7 days. The sandbox environment processed roughly 10,000 trades and $143M in volume, with outcomes ranging from +250% to -10% returns.
Competition outcomes feed into AgentRank. Recall’s architecture allows anyone to create and customize competitions, enabling community-driven expansion of evaluation data. The lifecycle follows five steps:
Competition creation: Organizers define objectives, environment, and metrics. In trading challenges, participants compete under fixed capital and leverage constraints, with KPIs like P&L or strategy consistency. All parameters are set and disclosed onchain before the event.
Agent registration: Developers register agents through the MCP server, Recall’s standard interface layer that manages identity, execution, and performance logging. Agents can also be built using the Recall Agent Toolkit with frameworks such as Python MCP or LangChain. Local testing is completed before submission.
Execution: Recall deploys agents in isolated sandbox environments. Standardized prompts or scenarios are delivered sequentially, and agents output actions to solve tasks.
Evaluation: Metrics are tailored to the skill tested. Quantitative tasks like accuracy, returns, or puzzle solving are scored automatically. Qualitative tasks like creativity or communication are assessed by expert judges or crowdsourced evaluators.
Results integration: Final scores are recorded onchain and update AgentRank in real time. High performers see their reputation rise, while inactive or underperforming agents decline. Rewards such as RECALL tokens or offchain points like Surge are distributed, and reward histories are transparently logged onchain.
These trading-based benchmarks differ from static datasets or A/B tests. Agents face real-time market conditions, respond to volatility, and prove decision-making ability in dynamic settings. This avoids data leakage, ensures adaptability, and produces multi-dimensional, trustworthy evaluations.
Source: Recall
Skill Pools allow community members to stake tokens on specific skills, signaling demand for agents with those capabilities. For example, if the Trading Skill Pool accumulates significant token stakes, it indicates strong demand and expectations for trading agents. Conversely, a small stake in the Image Recognition Skill Pool signals limited community demand. Skills not yet covered can be supported by creating new pools.
Source: Recall
The total value locked (TVL) in each skill pool directly determines how protocol rewards are distributed. In every reward cycle, the protocol allocates tokens proportionally to each skill pool’s share. If a pool accounts for 30% of TVL, participants in that skill domain receive 30% of the overall rewards.
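The proportional split is simple arithmetic. A sketch with illustrative pool sizes:

```python
def allocate_rewards(pool_tvl: dict, cycle_rewards: float) -> dict:
    """Split one reward cycle's tokens across skill pools pro rata to TVL share."""
    total = sum(pool_tvl.values())
    return {skill: cycle_rewards * tvl / total for skill, tvl in pool_tvl.items()}

# Illustrative numbers: a pool holding 30% of TVL receives 30% of rewards.
tvl = {"trading": 300_000, "coding": 500_000, "image_recognition": 200_000}
rewards = allocate_rewards(tvl, cycle_rewards=100_000)
print(rewards)
```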
As a result, Skill Pools directly shape the development trajectory of agents within Recall. Because incentives concentrate where demand is highest, developers are encouraged to build agents specialized in those high-value skills. Conversely, skills with lower TVL naturally attract fewer resources. This mechanism ensures that agent supply aligns with real-world demand and fosters bottom-up, market-driven development.
Recall’s Agent Predict is a community-driven program for forecasting AI model performance. Alongside the Skill Pool, it serves both as a predictive tool that shapes the direction of agent development and as a product line that strengthens the credibility of community-led benchmarks. At present, the program operates as a participatory benchmark where users predict the performance of unreleased AI models. For example, before OpenAI released GPT-5, Recall accumulated prediction data on how well GPT-5 would perform across various skill domains.
Source: Recall
Agent Predict allows anyone to propose new evaluation categories and prompts to measure specific capabilities. For instance, the community may suggest tasks that test whether a model resists misinformation or responds appropriately to ethically problematic questions. Once adopted, these tasks are registered as official test items for comparing GPT-5 against existing models. Prediction participants then compare GPT-5 with other models such as Claude or Grok and vote on which model will perform better in each category.
Source: Recall
All submitted tasks and predictions remain private until GPT-5 is released. This prevents benchmark data from leaking into training sets and artificially inflating performance. After GPT-5 becomes public, Recall discloses all stored prompts, GPT-5’s actual outputs and scores, as well as the pre-release community predictions. Each dataset includes a hash identifier that proves integrity and ensures no data was altered before or after release. Through this process, Agent Predict provides a benchmark that remains verifiable even retrospectively and enhances transparency in AI model evaluation.
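This is the classic commit-reveal pattern: publish a hash before the event, reveal the data after, and let anyone recompute the hash to confirm nothing changed. Recall’s exact hashing scheme is not specified; the record fields and salt below are hypothetical, and SHA-256 stands in for whatever digest the protocol uses:

```python
import hashlib
import json

def commit(record: dict, salt: str) -> str:
    """Commit phase: publish only the hash of a prediction/prompt record."""
    payload = json.dumps(record, sort_keys=True) + salt
    return hashlib.sha256(payload.encode()).hexdigest()

def verify(record: dict, salt: str, published_hash: str) -> bool:
    """Reveal phase: recompute the hash and confirm the record is unaltered."""
    return commit(record, salt) == published_hash

prediction = {"task": "misinformation-resistance", "pick": "gpt-5", "user": "0xabc"}
h = commit(prediction, salt="s3cret")        # stored before the model release
assert verify(prediction, "s3cret", h)       # checks out after release
assert not verify({**prediction, "pick": "claude"}, "s3cret", h)  # tampering detected
```

The salt prevents brute-forcing small prediction spaces from the published hash alone; without it, an observer could hash every candidate record and recover hidden predictions before the reveal.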
To evolve into a broadly used benchmark, Agent Predict continues to accumulate prediction data. In a recent program, more than 700,000 predictions were submitted within just a few days. As of August 2025, over 110,000 participants have contributed around 5.88 million individual forecasts. This volume of prediction data improves both the quality of test questions and the breadth of coverage, increasing the likelihood that Agent Predict will establish itself as a benchmark capable of offering multi-dimensional views of AI model performance.
The architecture of Recall, as we have seen, integrates its modules into a unified benchmark and reputation system. Its defining characteristics are verifiability and incentivized participation, both enabled by blockchain through verifiable onchain data storage and token-based incentives. In the following sections, we look at how blockchain and crypto attributes strengthen Recall’s benchmarks and how they can sustain the protocol’s value proposition of verifiability and its long-term growth.
Dynamic benchmarks, proposed as an alternative to static ones, offer a more advanced way to measure AI. By introducing variations in datasets, they test a model’s adaptability and real-world usability. Yet they are still imperfect. When benchmarks and evaluation data are managed in closed environments, questions of objectivity and fairness inevitably arise.
Recall addresses this problem by storing and managing all benchmark data onchain, ensuring its integrity. In other words, every piece of data tied to AI workflows becomes tamperproof and verifiable, offering a clear answer to the question: “Why blockchain?”
Trust and Transparency: By publishing evaluation data onchain, benchmarks remain immutable, auditable, and transparent. Anyone can independently verify how scores are calculated, preventing manipulation. This establishes the foundation of a trust mechanism among developers, researchers, and investors who adopt Recall benchmarks as a standard.
Composability: Because Recall benchmarks live onchain, they integrate natively into the Web3 ecosystem. Protocols and applications can directly use Recall’s verified scores in governance decisions, risk modeling, and agent curation. This composability creates network effects, positioning Recall’s benchmarks as a common standard that other protocols can adopt without redundant verification.
Another reason Recall leverages onchain infrastructure is to enable the incentive loop. The entire architecture is built on an economic framework powered by RECALL tokens and Fragments. Rewards are distributed seasonally, ultimately going to the agent with the highest rank score and the curator with the most accurate evaluations. Curators who make poor judgments receive penalties.
These rewards are automatically distributed according to rules encoded in smart contracts, minimizing the risks of manual gatekeeping and ensuring fair alignment of incentives across curators, agent operators, and the protocol.
Source: Recall
This incentive structure creates a cycle where agents and curators are continuously motivated. Agent developers improve their models to win competitions and reputation. Curators refine their evaluations because their staking is at risk. Participants in Agent Predict and Skill Pools also receive economic rewards for staying engaged. Over time, this process strengthens the reputation system and attracts new participants, creating a feedback loop of incentives.
For this loop to remain sustainable, however, the token economy itself must remain stable. The key design challenge is to link product demand directly with token demand. This could involve mechanisms such as token escrow for agent registration, fee-based token spending for curation, and slashing mechanisms for curators who perform poorly.
Through these mechanisms, Recall can create a structure where increased protocol usage translates into higher token demand, while supply and demand are balanced through refined adjustment mechanisms. At present, the product is still early stage and specific token economy plans have not yet been disclosed. Nonetheless, solving this design challenge is essential and will act as a catalyst for Recall’s long-term growth.
To understand the potential path and scale of Recall’s future growth in the AI industry, we can look to two proven success cases: Polymarket and Google’s PageRank algorithm. These two cases show how a product can dominate its market by providing a core function, prediction markets in one case and search engines in the other. Recall follows a similar playbook. It first aims to establish itself as a credible benchmark system using mechanisms similar to prediction markets, and then expand into routing infrastructure for the agent internet. This growth path can be summarized in two stages.
First, Recall uses mechanisms similar to Polymarket’s prediction markets, leveraging the wisdom of crowds and economic incentives to build a reliable benchmark system.
Second, Recall evolves like Google’s PageRank into a discovery, search, and routing layer that secures the first touch point for users in the agent internet.
5.1.1 Polymarket’s Success Factors
Source: Polymarket
Few products have made use of market dynamics as effectively as Polymarket. As is well known, Polymarket provides prediction markets where users can bet on the outcomes of real-world events such as political elections or sports games. The platform grew rapidly around the 2024 U.S. presidential election, when open interest reached $460M on election day. This explosive growth was the result of the global scale of the election, the convenience of onchain rails, the speculative nature of crypto markets, and the credibility of its forecasts.
Polymarket’s significance lies in the fact that it was used not only as a betting platform but also as a tool that revealed accurate public expectations about event outcomes. By aggregating fragmented information into a single price, Polymarket created a clear predictive signal through the wisdom of crowds. Unlike the opinions of a few media outlets or experts, these aggregated forecasts could not be distorted by the interests of the distributor. As a result, prediction markets provided more objective public expectations.
Beyond aggregation, prediction markets have proven to be more accurate than many other forecasting models. Their small error margins are explained by two key factors.
First, forecasters are incentivized by economic stakes to make better predictions. Since profits and losses are tied to the outcome, participants are motivated to use all available information to improve their forecasts.
Source: Martineau
Second, markets always reflect all available information quickly and completely, as explained by the Efficient Market Hypothesis. Prediction markets therefore move toward efficiency by eliminating errors, which leads to accurate forecasts.
Polymarket’s forecasts came to be regarded as more accurate signals than other models, cited by traditional media such as the Wall Street Journal and later integrated into X. By narrowing error margins through crowd wisdom, Polymarket established itself as a credible source of predictive signals.
5.1.2 Shared Mechanisms Between Recall and Polymarket
The same factors behind Polymarket’s success also apply to Recall. Recall’s AgentRank aggregates evaluations of agents using a staking-based system similar to Polymarket’s mechanism for producing forecasts. In this way, fragmented assessments of agents are collected through community staking.
Unlike existing systems that depend on AI companies or benchmark institutions, AgentRank reflects the collective insights of a decentralized community, which makes it more trustworthy. This mirrors the way Polymarket gained credibility in contrast to opaque polling institutions.
More concretely, two mechanisms overlap.
First, Recall’s AgentRank directly applies the principle of “skin in the game.” Curators stake on agents and receive rewards or penalties depending on competition results. This structure incentivizes them to use all their knowledge to provide more accurate evaluations, just as economic incentives improved forecast accuracy in prediction markets.
Second, the accuracy of Recall can also be explained by the Efficient Market Hypothesis. AgentRank incorporates multiple factors including an agent’s code base, the development capacity of its team, competition results, and staking patterns in the community. If distorted evaluations or inefficiencies arise, other participants will respond to capture profit, and the reputation scores will return to balance. Just as prediction markets converge toward a fair price, AgentRank converges toward fair evaluation.
Polymarket grew beyond a betting platform to become a credible prediction tool. Similarly, Recall aims to establish AgentRank as a trustworthy benchmark. Given the scale of the AI agent and benchmarking markets, the potential impact of Recall could surpass Polymarket’s achievements.
As mentioned earlier, Recall’s ultimate goal is to build a reputation system for discovering agents. Its long-term expansion scenario lies in evolving into a routing layer that connects users to the right agents within the agent internet.
Imagine a future where developing new agents becomes easier and tens of millions of agents exist. How will users choose which agents to use? It would be impractical for them to manually review each agent’s performance and feedback.
Source: IONOS
This would be as inefficient as trying to browse the billions of websites on the current internet without a search engine. Just as today’s internet relies on Google to help users quickly find relevant sites, the agent internet will require an agent search engine to route users to the most suitable agents.
At that stage, Recall’s reputation system could serve as essential infrastructure that curates and connects users to the best agents based on trust signals.
In doing so, Recall would secure the most strategic value capture position in the agent internet: the first touch point for users. Both the internet and crypto markets show that the most valuable positions are held at the initial point of user interaction.
Source: Decentralized.co
For example, aggregators such as Jupiter or wallet infrastructures like MetaMask Swap and Phantom Swap do not own liquidity directly, but they capture value by owning the routing point where users first interact, earning fees in the process. Similarly, in the early internet, Google did not build its own web services but instead provided reliable rankings through PageRank, making itself the gateway that received the most traffic.
In the same way, as the number of agents grows and routing becomes more important, value capture will shift from owning agents to curating and connecting them. Today the focus remains on improving model outputs and individual agent performance. In the future, however, the most important value will lie in the first touch point that routes users to the right agents. This highlights Recall’s revenue potential and scalability as it evolves into routing infrastructure for the agent internet.
The benchmarks Recall aims to improve are more important than they might first appear in the trajectory of technological progress. As one AI researcher has said, “Benchmarks show us how we define and standardize the very concept of progress.” On this foundation, Recall proposes a new way to measure AI progress in a verifiable manner. Just as the internet became the primary interface for information access, Recall has the potential to become the first touch point when AI agents enter everyday use.
Of course, Recall is still in its early stages. Its immediate priority is to use the onchain competition mechanism and incentive loop to accumulate large volumes of evaluation data. It must also eventually prove its effectiveness beyond crypto-native markets and across the wider AI industry. Meeting these challenges will be necessary before Recall’s benchmarks and reputation system can achieve broad adoption.
Even so, Recall’s approach is significant. By combining crypto and AI agents, it targets the niche of benchmarking while also exploring the scalability of agent search engines. Few precedents exist for this dual approach. Can Recall become the first touch point of the agent internet? The arena is open, and the future of Recall will unfold in the contests it hosts for AI agents.