AI Search: The Bitter-er Lesson

Written on 2024 Jun 10 13:40
What if we could start automating AI research today? What if we didn’t have to wait for a 2030 supercluster to cure cancer? What if ASI was in the room with us already?
 
Granting foundation models ‘search’ (the ability to think for longer) might upend Scaling Laws and change AI’s trajectory.

The Death of Leela

In 2019, a team of researchers built a cracked chess computer. She was called Leela Chess Zero — ‘zero’ because she started knowing only the rules. She learned by playing against herself billions of times. She made moves that overturned centuries of human chess canon. She was inventive and made long-term sacrifices. Leela played with her food and exhibited weird human tendencies. She won the world championship. And then she was brutally usurped.
 
I loved Leela. I had sunk years into knowing, benchmarking, and researching her. As a kid, I always wondered what it would be like to meet superintelligent aliens and have them tell us how they play chess. There was a moment, watching Leela play, when I found out.
 
Leela’s magic, of course, was in deep learning. By teaching herself, she gained deeper chess knowledge than humans could ever hard-code. Years later, I still think Leela is the best example of The Bitter Lesson. Leela won by putting aside human arrogance; she figured things out on her own.
 
Leela also proved scaling laws before they were cool. In 2018, others on the team and I noticed that larger networks consistently outperformed smaller ones, position-for-position. We even observed remarkable emergent properties—larger networks seemed to ‘look ahead’ several moves without explicit instruction or search.
 
In 2020, armed with deep learning and scaling laws, the Leela team raced to train larger networks. We sourced compute from corporate donors and friends’ GTX 1070s. We feverishly tracked self-play metrics the way many track Wandb loss curves today. Just before the world championship, Leela’s largest model came out of the oven. And then we lost.
 

The Age of Stockfish

Stockfish was the dominant chess-playing program of the 2010s and a relic of old-world AI. In 2019, Stockfish’s code was hand-crafted by humans who had distilled their game knowledge into clever math. Leela stunningly overthrew Stockfish in 2019 using deep learning and tabula rasa self-play. So how did Stockfish regain the crown just as Leela was scaling to bigger networks?
 
Stockfish had better search.
 
My dad taught me chess when I was young (though I didn’t play much until high school). When I was four, I asked him how his friend, an ornery Croatian Grandmaster, was so good. My dad responded simply: “he sees many moves ahead.”
Seeing many moves ahead is my favorite definition of ‘search.’ That’s what chess computers do: they evaluate a board and then look many moves ahead. Humans do this all the time, too, outside of chess. Whenever you spend a few minutes cracking a tricky math problem, you also use some form of search.
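To make ‘seeing many moves ahead’ concrete, here’s a minimal sketch of depth-limited negamax, the skeleton at the heart of chess search. It plays a toy take-1-2-or-3-stones game rather than chess, and every name in it is illustrative, but the shape is the same: evaluate at the horizon, recurse through the opponent’s best replies everywhere else.

```python
def legal_moves(stones):
    # In this toy game, a move takes 1, 2, or 3 stones.
    return [n for n in (1, 2, 3) if n <= stones]

def evaluate(stones):
    # Static evaluation at the search horizon. A chess engine would
    # score material and position here; this toy game has no signal.
    return 0

def negamax(stones, depth):
    if stones == 0:
        return -1  # the previous player took the last stone and won
    if depth == 0:
        return evaluate(stones)
    # "See ahead": try each move, assume the opponent replies optimally.
    return max(-negamax(stones - m, depth - 1) for m in legal_moves(stones))

# Deeper search literally sees further: at depth 1 every move looks
# equal (score 0); by depth 6 the forced win (take 2 of 10) appears.
for depth in (1, 2, 6):
    print(depth, {m: -negamax(10 - m, depth) for m in legal_moves(10)})
```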
 
Stockfish always had clever search algorithms, but in 2019, its ability to grind out billions of positions didn’t matter because its understanding of each position was kneecapped by human creators.
To fix this, the Stockfish team heisted Leela’s deep learning techniques and trained a model hundreds of times smaller than the top Leela model.
 
After they trained their tiny model, they threw it into their search pipeline, and Stockfish crushed Leela overnight.
Stockfish performance over time. Deep learning was added right before v12.
 
The Stockfish team utterly rejected scaling laws. They went backward and made a smaller model. But, because their search algorithm was more efficient, took better advantage of hardware, and saw further, they won.
 
The Bitter-er Lesson is that, in a world of fancy deep learning, you shouldn’t discount the power of AI search.
 
 
Leela searched poorly, and it cost her the world championship. We may be close to adding search to LLMs, and I fear nobody is paying attention.

The Age of Search

Foundation models, like GPT-4, lack search. You can’t ask GPT-4 to think about a problem for a month with the expectation that it’ll give you a better answer. Today, asking the model to “think step-by-step” might improve performance, but returns quickly diminish. But what if we granted existing models search?
 
📖
Going forward, I’ll define foundation model search as the ability for a model to spend more inference compute (search), not training compute (model scaling), to better solve a problem. (I’m not explicitly arguing for chess-style MCTS or AlphaBeta search; human introspection and collaboration fall under this definition.)
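The simplest concrete instance of this definition is best-of-N sampling: draw N candidate answers and keep the one a verifier scores highest, trading N times more inference compute for a better answer. A minimal sketch, where generate and score are hypothetical stand-ins for a real sampling call and a real verifier:

```python
import random

def generate(prompt: str) -> str:
    # Stand-in for one stochastic LLM sample.
    return f"candidate-{random.random():.3f}"

def score(prompt: str, answer: str) -> float:
    # Stand-in for a verifier: unit tests, a reward model, a judge...
    return random.random()

def best_of_n(prompt: str, n: int) -> str:
    # n times the inference compute, zero extra training compute.
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: score(prompt, c))

print(best_of_n("Prove the lemma...", n=16))
```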
 
Every AI researcher, economist, and CEO I’ve talked to is massively underrating the proximity and importance of granting foundation models search.
 
There are a few reasons I’ll briefly outline:
  1. We don’t need scale to crack search.
  2. Search allows for targeted compute allocation.
  3. Search will allow us to start automating AI research.
 

No Scale Needed

The prevailing assumption is that we need larger models to enable LLM search. Some, like Sholto Douglas, claim that we need more nines of LLM reliability from scale to tackle long-horizon thought. Others, like Leopold Aschenbrenner, think that pre-training may already contain the necessary ingredients for search and that we merely need “a little bit more scaling” and some extra tokens to go all the way. But there’s proof against scale as a prerequisite for search.
 
DeepMind recently studied chess models that play without explicit search and noted that search-like behavior (looking moves ahead) naturally emerges from them without external scaffolding. While that’s neat, the researchers note that such scaling is senseless for chess, which already has excellent search algorithms. Why wait for inefficient look-ahead to accidentally emerge from large models when existing search algorithms do the trick?
Moreover, the brilliant Scaling Scaling Laws with Board Games paper shows that “for each additional 10× of train-time compute, about 15× of test-time compute can be eliminated,” even down to single-neuron models. Recall that Stockfish beat Leela with a model 3 orders of magnitude smaller.
From Scaling Scaling Laws with Board Games. Notably, they found sigmoidal returns with search, but this is with fixed training flops. We should expect models of every size to become more efficient, in addition to our largest models getting larger.
Today's models may be sufficiently large (perhaps needlessly large) to enable search.
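A back-of-envelope on that exchange rate: if 10× of train-time compute trades for 15× of test-time compute, a model k orders of magnitude smaller needs roughly 15^k more search to keep up. Treating Stockfish’s ~3-orders-of-magnitude size gap as a train-compute gap is my loose extrapolation; the paper’s fit comes from small board-game agents, not LLMs.

```python
# The paper's fit: each 10x of train compute ~ 15x of test compute,
# so k fewer orders of magnitude of training costs ~15**k more search.
def search_multiplier(train_ooms: float) -> float:
    return 15 ** train_ooms

# Stockfish's net was ~3 orders of magnitude smaller than Leela's:
print(f"~{search_multiplier(3):,.0f}x test-time compute")  # ~3,375x
```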
 

Targeted Compute

Train-time and inference-time compute trade off at roughly a one-to-one rate. At first, that seems disappointing. Aidan’s claiming LLM search is going to eat the world! How does that happen if it’s just as costly as training?
 
Simply:
Search allows for targeted compute allocation: you only pay for your domain.
 
Imagine I’m Pfizer, and I want to use AI to research a new drug. In a world with AI search, I have two options:
  1. Pfizer waits until 2030 for OpenAI to release a model four orders of magnitude larger.
  2. Pfizer uses four orders of magnitude more inference compute today.
Of course, Pfizer would opt for the second choice!
 
With search, they don’t have to wait for Sam Altman to raise $7 trillion, Microsoft’s stock to appreciate, or Congress to approve an AI spending bill. Pfizer pays out of pocket on inference to get GPT-8 capabilities today.
 
📈
Let’s assume that Pfizer already spends $100k annually on GPT-4. If they want to use search to access 2030 ASI capabilities today, they’d need to increase their AI budget by four orders of magnitude to $1 billion a year. While that’s a lot for any company, Pfizer’s R&D budget is already $12 billion! Again, that’s $1 billion for access to superintelligence. It would cost OpenAI trillions to train a model of equal capability. And, if you’re worried we lack the infrastructure for this, note that OpenAI is already pulling in billions in revenue. Microsoft and Google seem poised to make even more. Yes, we’ll still need to aggressively scale compute, but we won’t need the 100GW cluster just yet.
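Spelling out the callout’s arithmetic (every figure here is the post’s illustrative assumption, not real budget data):

```python
# Assumed: $100k/year of GPT-4 inference, scaled up 4 orders of
# magnitude in place of 4 orders of magnitude of model scaling.
current_ai_spend = 100_000
search_budget = current_ai_spend * 10 ** 4   # $1,000,000,000 per year

pfizer_rd_budget = 12_000_000_000            # ~$12B/year (post's figure)
print(f"${search_budget:,} per year")
print(f"{search_budget / pfizer_rd_budget:.0%} of R&D")  # ~8%
```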
 
If you’re Pfizer, this is great because you need not wait for OpenAI to train the next big model to get good results. If you’re OpenAI, this is great because you’re making far more profit on inference. Everyone wins.
 
 

Premature Superintelligence

Let’s take a moment to appreciate how a search universe differs from a scale universe. Here’s how the 2030 ASI gets built, thoughtfully detailed by Leopold Aschenbrenner here:
  1. Companies pay out of pocket for big clusters.
  2. Those big clusters drive revenue up.
  3. Companies take out massive corporate loans to build bigger clusters.
  4. Those bigger clusters drive revenue up.
  5. The government steps in and builds the biggest clusters.
  6. Only then do models hit the escape-velocity size required for them to do their own AI research.
Leopold’s model seems plausible—in a world without search.
 
Here’s my vision:
  1. Search is discovered and works with existing models.
  2. Big labs and the government realize they can immediately apply search to further AI research or foreign espionage.
  3. Inference compute is limited, so the government or big labs reserve it for security or AI research.
  4. Search-driven AI advances uncover more efficient search algorithms and model architectures.
  5. The ‘data wall’ problem dissolves because search doesn’t require more training data.
  6. The intelligence explosion begins next year, not in 2030.
 
While some acknowledge that, yes, search may enable several OOM gains in narrow domains, they seem to forget that we’ll likely apply search to AI research first.
 
Many predict frenzied takeoff dynamics once AI gets good enough to research itself. But, just as search allows Pfizer to research better drugs without waiting for GPT-8, search will allow AI labs to research AI without waiting for larger models.
Sure, we may need more unhobbling before models can replace autonomous, agentic human AI researchers. But I suspect a mere chatbot, given GPT-8 intelligence, would be enough to accelerate capabilities.
 
📈
There’s an enormous economic upside to automating AI research with search. While I suspect early search-enhanced models won’t have human agency (using tools or running tests), they might be superhuman ‘armchair theorists’ that drive algorithmic progress. If it takes GPT-4 a trillion tokens ($15 million) to discover an algorithm that reduces training costs by 3% or increases search efficiency by 10%, it will have paid for itself. Suppose Pfizer uses search to discover a better drug; that new drug won’t help Pfizer’s further research. Yet, for AI research, better algorithms will help you build better AI. Self-improving AI isn’t a new concept, but it may be much closer than many think.
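The break-even claim in that callout, spelled out. The trillion tokens and ~$15-per-million-token price come from the post; the $500M annual training spend is my hypothetical, chosen so a 3% saving exactly covers the bill:

```python
# Post's figures: a trillion tokens of search at ~$15 per million tokens.
tokens = 1_000_000_000_000
discovery_cost = tokens / 1_000_000 * 15     # $15,000,000

# Hypothetical: a lab spending $500M/year on training runs.
annual_training_spend = 500_000_000
savings = 0.03 * annual_training_spend       # 3% cheaper training
print(discovery_cost <= savings)             # True: the search paid for itself
```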
 

Conclusion

To take my predictions seriously, one must trust:
  1. There exist foundation-model search algorithms that enable performance gains similar to those seen in RL systems.
  2. Search converts existing capital into intelligence more efficiently than model scaling.
 
I suspect that both claims may seem far-fetched to many in AI. Unlike the well-studied scaling laws of the 2020s, search lacks great evidence for its performance and economics; I extrapolate here from my work on game reinforcement learning.
 
In 2019, AI researchers wanted to make a computer play better chess. In months, they learned:
  1. The Bitter Lesson
  1. Scaling Laws
  1. The Power of Search
The broad realization of Points 1 and 2 has driven historically unparalleled progress in the 2020s. We’re not done yet.
 
(Thank you to James Campbell and Cosmo for feedback on earlier drafts!)
 
