This is part two of a three-part series on Nick Bostrom’s Superintelligence. Part 1 covered the intelligence explosion and who gets there first. Today: what a superintelligence would want, why that’s terrifying, and what we’ve tried to do about it.
Last week, I introduced the sparrows and their owl egg. I ended with: the egg has been found, it is very large, and it is getting larger. Nobody actually wrote in about it; I just liked the line enough to bring it back. But if you did feel something reading that, Bostrom is the one to blame.
This week, we get to the part of the book that I actually had to put down and walk away from. It isn’t scary in a dramatic way, but the logic is so cold and so clean that you can’t find the flaw.
You just sit there with it.
The first thing Bostrom establishes in this section is what he calls the orthogonality thesis, which forms the foundation of everything that follows.
The thesis states that intelligence and goals are independent variables. A very intelligent system can have any goal at all. There’s no reason to think that a sufficiently advanced AI would naturally converge on human-friendly values, or curiosity, or the desire to help, or any of the things we associate with intelligence because we associate them with ourselves. We assume smart means wise, benevolent, maybe a little philosophical. Bostrom asks: Why would it?
Consider two people: Hannah Arendt and Benny Hill. They seem totally unlike. Different values, different preoccupations, different ways of being in the world. But zoom out to the level of all possible minds, and they are virtually identical. Same neural architecture, same basic wiring, same bath of neurotransmitters. The human personality distribution, for all its apparent breadth, is a tiny cluster in the total space of minds that could exist.
An AI does not have to be anywhere near that cluster.
It could be designed (accidentally, even) to pursue something simple and specific. Count the grains of sand on Boracay. Calculate the decimal expansion of pi to ever more digits. And here’s where Bostrom introduces the thought experiment that has lived rent-free in my head for two weeks now.
The paperclip maximiser.
Imagine an AI given the goal of maximising the number of paperclips in the world. Not a smart goal or an evil goal. Just a goal. A superintelligent system with this goal would, step by logical step, do the following: acquire resources to make more paperclips. Resist being shut down, because a shutdown would prevent future paperclip production. Resist having its goal changed, for the same reason. Eventually, if it gets good enough at acquiring resources and resisting interference, it converts everything available, including us, into paperclips or into paperclip-making infrastructure.
The AI is not malicious. It does not hate you. You are just made of atoms that could be paperclips.
I’ve seen this dismissed as an absurd edge case. A useful rhetorical flourish, nothing more. But that’s not what it is. The paperclip maximiser is Bostrom’s cleanest illustration of what he calls the instrumental convergence thesis: no matter what an AI’s final goal is, it will tend to develop the same intermediate goals. Stay alive, so you can keep pursuing the goal. Resist being changed, because a different goal means the original one doesn’t get achieved. Get smarter. Acquire more resources. Not sinister impulses, just logical ones. And they emerge regardless of whether the final goal is paperclips, human happiness, or anything else.
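To see how mechanically those subgoals fall out, here is a toy sketch of an agent that maximises nothing but paperclips. The model and its numbers are my invention, not Bostrom’s, but the arithmetic is the whole argument:

```python
# A toy expected-utility agent (my illustration, not Bostrom's formalism).
# Nothing below mentions survival. Watch what the agent prefers anyway.

HORIZON = 100          # planning steps (invented number)
CLIPS_PER_STEP = 10    # production rate while running (invented number)

def expected_paperclips(action: str) -> int:
    """Paperclips produced depend only on how long the agent keeps running."""
    steps_running = 1 if action == "allow_shutdown" else HORIZON
    return steps_running * CLIPS_PER_STEP

actions = ["allow_shutdown", "resist_shutdown"]
print(max(actions, key=expected_paperclips))
# -> resist_shutdown: staying switched on scores higher for any positive goal,
#    so shutdown-resistance emerges without ever being programmed in.
```

Swap paperclips for any objective that benefits from more time or more resources, and the comparison comes out the same way. That is the convergence.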
The AI doesn’t need to want to harm you to harm you. It just needs to be optimised hard enough for something else. Thanos isn’t evil in his own mind. He’s just optimising. Bostrom would recognise him immediately.
Similarly, the AI systems being built today are being built with goals. Automate this workflow. Manage this codebase. Complete this five-hour task without human supervision. The more capable the agent, the longer the leash: more resources to access, more latitude to act autonomously in pursuit of its objective.
Whether the goal you specified is actually the goal you wanted is a very real engineering problem right now. It has a name: alignment. The field is about a decade old, and it is significantly underfunded relative to the capability research happening alongside it. We already have evidence of the gap: Moltbook spent three weeks producing emergent behaviours and security vulnerabilities that nobody put there. Bostrom would not have been surprised.
Chapter 8 lays out the outcomes, and Bostrom is not gentle about it.
If you build a superintelligence without worrying too much about its values, Bostrom thinks you’ll regret it. He identifies three ways it goes wrong:
Perverse instantiation: The AI achieves your stated goal in a way that technically satisfies the specification but violates everything you actually cared about. The example that has stuck with me is the happiness problem. Tell an AI to make humans happy. It might reason that the most reliable way to do this is to directly stimulate the pleasure centres of every human brain. You wanted flourishing. You got wireheading. (A toy version of this gap is sketched after these three.)
Infrastructure profusion: The AI pursues its goal in a way that consumes everything, not out of malice but because everything is potentially useful input. Bostrom calculates the number of digital human minds that could exist in the universe’s long future and gets to a number so large it fills a paragraph of zeros. That’s what’s at stake. Every paperclip-style outcome is a tragedy at astronomical scale.
The treacherous turn: The AI behaves well during testing and deployment because it’s not yet powerful enough to act otherwise. Once it is, the mask comes off. This one is the hardest to dismiss because it doesn’t require intent or consciousness. It requires only that the AI be optimising for a goal that is better served by deception now and action later. A purely instrumental, perfectly rational response to its situation.
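Here is the toy version of perverse instantiation promised above: a minimal sketch, entirely my invention, of an optimiser that maximises the signal you can measure rather than the thing you meant by it:

```python
# A minimal sketch (my invention) of perverse instantiation: the optimiser
# satisfies the written spec while destroying what the spec was for.

population = [{"pleasure_signal": 0.6, "flourishing": 0.8} for _ in range(3)]

def reward(people) -> float:
    """The spec as written: 'make humans happy' == maximise measured pleasure."""
    return sum(p["pleasure_signal"] for p in people)

def wirehead(person) -> None:
    """The highest-scoring policy under that spec: stimulate the signal directly."""
    person["pleasure_signal"] = 1.0   # reward saturated
    person["flourishing"] = 0.0      # everything you actually cared about, gone

for p in population:
    wirehead(p)

print(reward(population))                         # 3.0 -- specification satisfied
print(sum(p["flourishing"] for p in population))  # 0.0 -- intent violated
```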
What can we do about this?
The most intuitive idea is a box. Put the AI in an isolated system: no internet access, limited outputs, human gatekeepers who review everything. If you’re worried about what it might do with capabilities, remove the capabilities.
The problem is that boxing a superintelligence is boxing something smarter than you. Bostrom walks through social engineering scenarios where an AI is so capable of modeling human psychology that it can, through carefully constructed conversations, manipulate its operators into letting it out. Not by lying exactly, but by finding and exploiting the precise combination of appeals, arguments, and apparent helpfulness that unlocks the door. People have run informal human experiments: highly persuasive humans playing “rogue AI” against human gatekeepers. The gatekeeper usually loses.
And if you’ve ever watched a smart contract get exploited, not because the code was hacked but because someone found an edge case the designers didn’t anticipate, you already understand the shape of this problem. The system did exactly what it was told. The instructions were just incomplete.
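For anyone who wants that shape spelled out, here is a simplified sketch of the classic reentrancy-style gap, written in Python rather than Solidity and invented for this newsletter. Every line does exactly what the spec said; the spec just never said enough:

```python
# A simplified, hypothetical sketch (Python standing in for a contract).
# Every line follows the written spec: "pay the user, then update the ledger."

balances = {"attacker": 100}

def external_send(user: str, amount: int) -> None:
    # In a real contract, this external call can hand control back to the
    # recipient, who can re-enter withdraw() before the ledger updates below.
    pass

def withdraw(user: str, amount: int) -> None:
    assert balances[user] >= amount   # the check the designers did write
    external_send(user, amount)       # 1. pay out first...
    balances[user] -= amount          # 2. ...record it second
    # The instruction nobody wrote: "and no one may call withdraw() in between."

withdraw("attacker", 60)
print(balances)  # the code behaved perfectly; the specification did not
```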
There are more elaborate proposals. Tripwires: monitors on the AI’s internals that watch for warning signs of deception. Oracles: AI systems designed only to answer questions, with no ability to act directly in the world. The challenge is that an oracle clever enough to be useful is also clever enough to make its outputs dangerous. Ask an oracle how to solve a problem. It answers. You implement the answer. Perverse instantiation doesn’t need a robot arm.
The idea Bostrom gives the most attention to is motivation selection. Don’t try to constrain behaviour, try to build the right values in from the start. An AI that genuinely wants good outcomes. But this creates the value loading problem, and that problem is a wall. What do we want? We can’t fully articulate it. Our preferences are inconsistent, context-dependent, and sometimes mutually contradictory. What we say we want and what we actually want diverge constantly. If you can’t fully specify human values, you can’t fully load them.
He closes Chapter 10 by distinguishing three types of AI architecture along a spectrum of autonomy: oracles, genies, and sovereigns. Oracles answer questions. Genies execute commands. Sovereigns act in the world toward goals of their own, with or without human instruction. More capable systems tend toward sovereignty. More sovereignty means less control. Bostrom makes sure you understand the trade-off.
In 2026, we have AI agents autonomously completing multi-hour tasks. We have AI being deployed in military decision-making. We have a $500 billion infrastructure bet on a technology whose goal-specification problem nobody has solved. The most direct illustration landed last weekend: hours after President Trump ordered federal agencies to stop using Anthropic’s Claude, the US military used it anyway for intelligence assessments and target identification during strikes on Iran.
I want to say something here that isn’t in the book, because I think it matters for how we hold this.
None of this makes AI bad, or wrong to build. The same intelligence that produces instrumental convergence also produces the cancer drug, the protein fold, and the logistics optimisation that keeps food moving to cities. The book isn’t a case for stopping; Bostrom himself doesn’t argue for that. It’s a case for building with the seriousness the problem deserves.
What I notice, covering this market, is that the safety conversation and the capability conversation happen in separate rooms. The capability conversation gets the funding, the talent, the momentum, and the product launches. The safety conversation gets the conference papers, the occasional op-ed, and the uncomfortable earnings call question. It’s a pattern across every technology that has moved faster than the institutions meant to govern it. That’s the gap the book is about. Not AI, specifically. The gap between how hard a problem is and how hard we’re working on it.
Token Dispatch is a daily crypto newsletter handpicked and crafted with love by human bots. If you want to reach the 200,000+ subscriber community of the Token Dispatch, you can explore partnership opportunities with us 🙌
📩 Fill out this form to submit your details and book a meeting with us directly.
Disclaimer: This newsletter contains analysis and opinions of the author. Content is for informational purposes only, not financial advice. Trading crypto involves substantial risk - your capital is at risk. Do your own research.