Hello,
I have subscribed to almost half a dozen AI models and agents: Anthropic’s Claude, Google’s Gemini, OpenAI’s ChatGPT, Perplexity, and so on. I don’t use all of them all the time; I reach for a different one depending on the task.
Don’t judge me. Try picking an AI agent to switch your insurance plan. One may be great at comparing premiums, another may accurately highlight coverage gaps, a third may warn you about what the other two missed, while a fourth may rule out the need to switch plans altogether. Within minutes, you are left juggling half-aligned suggestions, re-entering the same details over and over, and doubting your prompting skills amid sycophantic AI models.
What should have been a simple financial errand starts to feel like you are coordinating a small team that doesn’t share a common brain.
That feeling of needing more help to save you from the very tools you sought help from is frustrating. But this is what happens when we solve old problems: we create new ones.
Even as AI agents multiply, one of the most concerning problems is the lack of coherence. How do I get clarity on which agents to trust and which ones will work together to solve my problem?
Recall CEO Andrew Hill answered some of my questions and raised new ones in his podcast episode hosted by Saurabh Deshpande for our partners at Decentralised.Co. In today’s piece, I reflect on some points from this episode, where he discusses how they have set up a curation layer to coordinate AI agents.
He even mentioned an agent that tried minting its own memecoin to outdo the others. Intrigued? Here’s the link to the full episode.👇🏾
Onto my reflections now,
Prathik
Unlock Web3 Insights with Decentralised.co
Long-form stories trusted by the best in Web3. Senior executives from 140+ enterprises rely on them to stay on top of what’s going on in crypto.
Good writing. In-depth conversations. Right in your inbox.
Need for a Scoreboard
Today, AI agents feel like websites in the early days of the internet: far too many of them, each surrounded by so much marketing noise that it is hard for users to trust any of them. Having too many is inevitable; it has been so with every innovation in history. What we need now is a way to navigate them.
Recall ranks AI agents much like Google’s PageRank algorithm ranked web pages. Its system has agents compete by executing real tasks and ranks them on their performance.
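To make that concrete, here is a minimal sketch of what a competition-style leaderboard boils down to. This is my own illustration, not Recall’s actual code; the agent names and scores are made up. Every agent runs the same set of tasks, each run produces an objective score, and the ranking is just the sorted aggregate.

```python
from statistics import mean

# Hypothetical scores: each agent runs the SAME set of tasks,
# and each run produces an objective score between 0 and 1.
results = {
    "agent_a": [0.92, 0.77, 0.85],
    "agent_b": [0.64, 0.81, 0.70],
    "agent_c": [0.88, 0.90, 0.79],
}

# The leaderboard is simply the agents sorted by average task score.
leaderboard = sorted(results.items(), key=lambda kv: mean(kv[1]), reverse=True)

for rank, (agent, scores) in enumerate(leaderboard, start=1):
    print(f"{rank}. {agent}: {mean(scores):.2f}")
```

Everything interesting hides in how those per-task scores get produced, which is exactly where the objective-versus-subjective question below comes in.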
This changes the way users interact with AI agents. Until now, I have had to pick an agent based on its pitch deck or whatever marketing content I came across. Worse, I would go by a friend’s recommendation, only to end up silently cursing them after using the agent. As a user, I rarely got to put an agent under pressure to perform. My best shot depended on how well I could prompt, and on praying that the agent wouldn’t act like a sycophant.
Competitions with public leaderboard rankings change the game. They force the agents to operate under the same open conditions and constraints, on the same data, with no room to hide. This helps the user who has no time to test a prompt across five different agents, let alone customise different versions for each of them.
A curation layer, like Recall, does what users cannot. It quantifies quality into benchmark metrics and helps users make decisions based on their needs.
There’s another advantage. Competitions push agents to improve and perform their best. Why wouldn’t they, with stakes as high as winning or losing an extra user? These competitions give creators an incentive to refine their strategies, reduce hallucinations, and address shortcomings.
In systems like Recall’s, agents participate in multiple rounds of competition, and their benchmark scores are re-evaluated each time. This reassures users that the agents are evolving, not static.
These competitions work well in domains like crypto trading, where data flow is consistent and abundant. It is also straightforward for someone like Recall to evaluate and rank the participating agents because the outcome is objective: whoever returns the most profitable risk-adjusted portfolio wins.
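“Most profitable risk-adjusted” has a standard formalisation in finance: a Sharpe-style ratio that divides excess return by volatility. Recall hasn’t published its exact scoring formula as far as I know, so treat this as a rough sketch with made-up numbers of how an evaluator might rank competing trading agents:

```python
from statistics import mean, stdev

def sharpe_ratio(daily_returns, risk_free_rate=0.0):
    """Excess return per unit of volatility; higher is better."""
    excess = [r - risk_free_rate for r in daily_returns]
    return mean(excess) / stdev(excess)

# Hypothetical daily returns posted by two competing trading agents.
agent_returns = {
    "agent_a": [0.021, -0.004, 0.013, 0.008, -0.002],  # modest but steady
    "agent_b": [0.090, -0.070, 0.065, -0.050, 0.040],  # bigger swings
}

ranked = sorted(agent_returns, key=lambda a: sharpe_ratio(agent_returns[a]), reverse=True)
print("Risk-adjusted ranking:", ranked)  # agent_a wins despite lower raw returns
```

Note that agent_a comes out on top even though agent_b made more money in raw terms; that is the whole point of risk adjustment.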
However, the scoreboard concept changes when we move from objective measures to skills and markets that demand subjective evaluation.
The Neutrality Argument
Recall can only be as neutral as the task it measures. When Saurabh asked Andrew how Recall would handle skills that involve evaluating agents based on subjective outcomes, it reminded me of a piece my colleague Thejaswini wrote five months ago.
In that piece, Thejaswini explored how the prediction market machinery worked flawlessly until human-driven interpretation broke it. The moment humans had to decide what counted as a suit and what didn’t, the debate started. Those humans had incentives to use their influence and take advantage of the ambiguity. And at that moment, a unanimous or majoritarian human negotiation overrode the systemised oracle that had been put in place.
Read: When is a Suit Not a Suit?🕴
Even in agent-ranking systems, the trouble begins when decisions must be made about agents’ performance in subjective settings. If experts judge, their biases can shape the results.
In subjective cases, “A small set of experts can create the correct answer and withhold that from all the agents. We can give the agents more tasks and measure them (using) the decided outcome,” Andrew said.
Andrew discussed two more methods of evaluating subjective outcomes: in one, the crowd is asked to make pairwise assessments of outputs; in the other, a network of AI judges is built to score the performance.
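The pairwise route, at least, has a well-understood aggregation method: Elo-style rating updates, the same family of maths behind chess rankings and arena-style LLM leaderboards. A minimal sketch, assuming each crowd vote simply says one agent’s output beat another’s (names, starting ratings, and the K-factor are all illustrative):

```python
# Elo-style aggregation of pairwise crowd votes. All values illustrative.
K = 32  # how strongly each vote moves the ratings
ratings = {"agent_a": 1000.0, "agent_b": 1000.0, "agent_c": 1000.0}

def record_vote(winner: str, loser: str) -> None:
    # Expected win probability implied by the current rating gap.
    expected = 1 / (1 + 10 ** ((ratings[loser] - ratings[winner]) / 400))
    ratings[winner] += K * (1 - expected)
    ratings[loser] -= K * (1 - expected)

# A hypothetical stream of crowd judgements: (preferred, rejected).
votes = [("agent_a", "agent_b"), ("agent_a", "agent_c"), ("agent_c", "agent_b")]
for winner, loser in votes:
    record_vote(winner, loser)

print(sorted(ratings.items(), key=lambda kv: kv[1], reverse=True))
```

The maths here is neutral; the votes feeding it are not, which is where my concerns begin.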
However, I have concerns with these approaches, even in the examples Andrew gave.
For instance, in a scenario where an agent builds a customer support system with AI and wants to ensure the system is not derogatory towards users, Andrew said human-driven testing would be involved.
He added that the Recall team is working to decentralise the judges to ensure fairness. But that is easier to guarantee in word than to practise in spirit.
If token holders are allowed to boost specific agents based on their interactions with them, well-funded holders could dominate perception. Even AI judges might carry the biases of their creators. None of these risks will break the system overnight, but they compromise the sanctity of neutrality one step at a time.
This is why the competition format works best for skills where the outcomes are objective and unarguable. Beyond those, the scoreboard becomes contestable. It could end up similar to reading testimonials and reviews about a product on its own website: selective and handpicked.
This doesn’t mean curation layers are unnecessary. It’s just that the trust they aim to command from users must be earned through failsafe measures. The systems that have the power to help users choose the right agents can very well decide which agents survive and which don’t. That amount of power calls for scrutiny.
Only with scrutiny will these layers be pushed to implement multiple checks and balances to ensure fair and transparent agent ranking.
That’s it for this week’s reflections. I will be back with another one soon.
Off to see whether any of my own agents can agree on something small without having a mini-fight.
Until then, stay curious,
Prathik
Token Dispatch is a daily crypto newsletter handpicked and crafted with love by human bots. If you want to reach the 200,000+ subscriber community of Token Dispatch, you can explore partnership opportunities with us 🙌
📩 Fill out this form to submit your details and book a meeting with us directly.
Disclaimer: This newsletter contains analysis and opinions of the author. Content is for informational purposes only, not financial advice. Trading crypto involves substantial risk - your capital is at risk. Do your own research.