The paradox of the "scaling laws" for large models: The stronger RL (reinforcement learning) becomes, the further AGI (artificial general intelligence) seems?

In the race toward artificial general intelligence (AGI), reinforcement learning (RL), currently the most sought-after path, may be leading us astray. The stronger RL becomes, the further we may actually be from true AGI.

On December 24, the well-known tech blogger and Dwarkesh Podcast host Dwarkesh Patel released a thought-provoking video that goes straight to the pain points of current large-model development. Against Silicon Valley’s broad optimism about scaling laws and RL, Patel puts forward a sharp, counterintuitive view: heavy reliance on and investment in RL may not only fail to offer a shortcut to AGI, but may actually be a clear sign that AGI is not close.

Patel's core argument is that top AI labs are spending vast sums to "pre-fabricate" a large number of specific skills for large models through RL based on verifiable results, such as operating Excel or browsing the web. However, this approach itself creates a contradiction. He sharply points out: “If we are truly approaching a human-like learner, then this entire method of training on verifiable results is doomed to fail.”

In Patel’s view, this “pre-installed” skill strategy exposes the fundamental flaw of current models. Humans are valuable at work because we do not need a specialized, “tedious training loop” for every tiny aspect of our jobs. A truly intelligent agent should be able to learn independently from experience and feedback, rather than relying on pre-scripted rehearsals. If AI cannot do this, its generality is greatly diminished, and true AGI is that much further away.

Therefore, Patel believes the real driving force toward more powerful AI is not endless RL but “continual learning,” the ability to learn from experience as humans do. He predicts that continual learning will not be solved in one stroke but gradually, much as models’ “in-context learning” abilities have evolved. Human-level on-the-job learning may take “5 to 10 years to perfect,” which rules out any single lab gaining a “runaway advantage” by cracking the problem first.

Summary of key points:

The paradox of pre-fabricated skills: Current models rely on having skills (such as using Excel or a browser) “implanted” ahead of time, which shows that they lack the general learning ability humans have and that AGI is not imminent.

Lessons from robotics: Robotics is fundamentally an algorithm problem, not a hardware problem. With human-like learning, robots would already be ubiquitous, with no need to repeat a task a million times in specific environments.

The “technology diffusion” excuse: The notion that “tech adoption takes time” is self-comfort (“cope”). If models had human-like intelligence, enterprises would absorb them almost instantly, since they would be lower-risk than human hires and would need no training.

Gap between revenue and capabilities: Knowledge workers worldwide create tens of trillions in value each year, yet model revenue lags far behind, which shows that models have not reached the threshold for replacing humans.

Continual learning is critical: The real bottleneck to AGI is “continual learning,” not brute-force RL. True AGI may take another 10 to 20 years to achieve.

Full transcript of the video (AI-generated translation):

Dwarkesh Patel 00:00
I’m perplexed. Why do some people think AGI is coming soon, while at the same time betting big on scaling RL for top models? If we’re really close to building a human-like learner, this whole approach of “training on verifiable results” is doomed.

Currently, labs are trying to bake in masses of skills during mid-training in these models. Now there’s a whole supply chain of companies making virtual environments that teach models to browse the web or use Excel for financial modeling. The situation is: Either these models soon learn on the job in a self-directed way, making all this “pre-baked” work pointless; or they don’t, meaning AGI is not imminent. Humans don’t need a special training phase or to rehearse every piece of software they may use at work.

Dwarkesh Patel 00:45
Beren Millidge made an interesting point in a recent blog post: “When we see frontier models advancing on benchmarks, we shouldn’t just think of scaling and clever ML research, but of paying billions to PhDs, MDs, and other experts to write problems and provide example answers and reasoning for specific abilities.”

Dwarkesh Patel 01:07
You can see this tension most clearly in robotics. At its core, robotics is an algorithm problem—not hardware or data. Humans need very little training to operate current hardware usefully. So if you truly had a human-like learner, robotics would largely be solved. Instead, we have to go into 1,000 households, practice a million times how to pick up a plate or fold clothes.

Dwarkesh Patel 01:32
Now, from those predicting an AI takeoff within five years, I hear: we need all this clumsy RL work to build a superhuman AI researcher. Then a million copies of an automated “Ilya” (referring to Ilya Sutskever, OpenAI’s former chief scientist) can figure out how to learn from experience robustly and efficiently. This feels like the old joke: “We lose money on every sale, but make it up in volume.” So this automated researcher will solve the AGI algorithm, a problem humans have racked their brains over for the better part of a century, yet it doesn’t even have the basic learning abilities of a child. That seems highly unlikely.

Dwarkesh Patel 02:09
And even if you buy that, it doesn’t match how labs are currently doing RL with “verifiable rewards.” To automate “Ilya,” you don’t need to implant a consultant’s skill at making PowerPoint slides. So the labs’ actions imply a worldview in which models will continue to fall short at generalization and on-the-job learning, which makes it necessary to pre-build economically valuable skills into the models.

Dwarkesh Patel 02:36
Another argument: Even if models could learn job skills on the fly, building those skills once during training is more efficient than building them again for every user or company. Baking in fluency with common tools (such as a browser or terminal) makes sense; sharing knowledge across copies is a huge advantage of AGI. But people underestimate how many company-specific and context-specific skills most work requires. There is not yet a robust, efficient way for AI to acquire these skills. I recently had dinner with an AI researcher and a biologist, and the biologist expected AGI to take a long time. We asked why. She said, “Part of my recent lab work involves looking at slides and deciding whether a dot is a macrophage or just looks like one.” As expected, the AI researcher replied: “Look, image classification is a textbook deep learning problem. We can train models for that.”

Dwarkesh Patel 03:45
I found this exchange fascinating. It clarified a key difference between me and the people who expect transformative economic impact soon. Human workers are valuable precisely because we don’t need a tedious training loop for every tiny piece of our jobs. Building a bespoke training pipeline to identify macrophages given the way this particular lab prepares its slides, then another for the next micro-task, and so on, isn’t a net productivity gain. What you really need is an AI that can learn from semantic feedback or self-directed experience and then generalize like a person. Every day you do 100 things that require judgment, context, skills, and background knowledge learned on the job. These tasks change from person to person and from day to day. Automating even one job with a set of predefined skills is impossible, let alone all jobs.

Dwarkesh Patel 04:46
In fact, people vastly underestimate how big a deal true AGI will be, because they just imagine a continuation of current trends. They don’t envision billions of human-like intelligences on servers, able to copy and merge everything they learn. To be clear, I expect this to happen: I expect real brain-like intelligence within the next ten or twenty years, which is already crazy.

Dwarkesh Patel 05:09
Sometimes people say AI isn’t more broadly deployed in business and valuable beyond coding because technology diffusion takes time. I think this is “cope”—an excuse to mask the fact that these models just lack the capabilities needed for widespread economic value.

Dwarkesh Patel 05:28
If these models were truly like people on servers, they would spread incredibly fast. In fact, they would be even easier to integrate and onboard than human employees. They could read all your Slack logs and be up to speed in minutes. They could instantly distill all the skills of the other AI employees you have. Also, hiring humans is like a “market for lemons,” a market with asymmetric information: it is hard to know beforehand who is truly talented, and hiring the wrong person is massively costly. If you are just spinning up another instance of a proven AGI model, this is not a dynamic you need to worry about.

Dwarkesh Patel 06:05
For these reasons, I predict deploying AI labor in business would be easier than hiring people. And companies are always hiring.

Dwarkesh Patel 06:14
If capabilities really reached AGI level, people would happily spend trillions per year on these models’ tokens. Knowledge workers worldwide earn tens of trillions per year, yet the labs’ revenue is several orders of magnitude lower, because these models’ capabilities lag far behind humans’. You might say: “Why has the standard suddenly been raised to labs earning tens of trillions per year? Wasn’t the debate just about reasoning, common sense, or whether models are just doing pattern matching?” AI optimists are right to criticize the critics for moving the goalposts. That is fair; it is easy to underestimate AI’s progress over the last decade. But some goalpost moving is reasonable. If you had shown me Gemini 3 in 2020, I would have been convinced it could automate half of knowledge work. So we keep solving what we thought were the key bottlenecks to AGI. We have models with general understanding, few-shot learning, and reasoning, and still we don’t have AGI.

Dwarkesh Patel 07:24
So what’s the rational response? I think it’s fair to look at all this and say, “Oh, actually, intelligence and labor involve much more than I previously realized.” In many respects, we’re closer than I ever expected—even surpassing my previous definition of AGI.

Dwarkesh Patel 07:41
Labs not earning trillions in AGI-implied revenue shows my previous definition was too narrow—and I expect this will remain true. I expect by 2030 labs will make major strides in “continual learning,” and annual model revenue will reach hundreds of billions, but models still won’t automate all knowledge work. I’d say: “Look, we’ve made huge progress, but haven’t reached AGI. We still need other capabilities.”

If models’ capabilities improve as fast as those with short timelines predict, but practical utility rises as slowly as those with long timelines predict, the key question is: What are we scaling? In pre-training, loss improves very cleanly and generally as we scale compute, across many orders of magnitude. This is a power law (weaker than an exponential), but it works. But people are trying to transfer the reputation of pre-training scaling (almost that of a law of nature) to RL with verifiable rewards, where no such known trend exists. When researchers do try to piece together the scarce public data on RL scaling, the results look grim. For instance, Toby Ord’s excellent piece connects the dots across various o-series benchmarks.

He concludes: “We’d need to scale total RL compute by about a million-fold to get the same gain as a single GPT-level jump.” People spend a lot of time pondering a “software singularity,” in which AI models write the code for smarter successors, or a “software + hardware singularity,” in which AI also improves its own hardware. But all these scenarios overlook what I think will be the main driver of further improvement on top of AGI: continual learning. Again, how did humans grow so vastly competent? Primarily through experience in the relevant fields.
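As an editorial illustration (not part of the video), the point that power-law scaling is much weaker than exponential scaling can be sketched in a few lines of Python. The functional form L(C) = a * C^(-alpha), the constant a = 10.0, and the exponent alpha = 0.05 are assumptions chosen only for illustration; they are not fitted to any real model and are not taken from Ord’s analysis.

```python
import math

# Hypothetical constants, chosen for illustration only (not fitted to real data).
A = 10.0       # assumed loss at one unit of compute
ALPHA = 0.05   # assumed power-law exponent


def power_law_loss(compute: float, a: float = A, alpha: float = ALPHA) -> float:
    """Loss under an assumed power law: L(C) = a * C**(-alpha)."""
    return a * compute ** (-alpha)


def compute_multiplier_for(loss_ratio: float, alpha: float = ALPHA) -> float:
    """Factor by which compute must grow so the loss shrinks to `loss_ratio` of its value."""
    return loss_ratio ** (-1.0 / alpha)


if __name__ == "__main__":
    # Every 10x of compute multiplies the loss by the same factor, 10**(-alpha),
    # so progress is steady on a log scale but each fixed gain costs far more compute.
    for exponent in range(0, 7, 2):
        c = 10.0 ** exponent
        print(f"compute = 1e{exponent}: loss = {power_law_loss(c):.3f}")

    # Compute growth needed to halve the loss under the assumed exponent.
    print(f"compute multiplier to halve loss: {compute_multiplier_for(0.5):,.0f}")
```

Under these assumed numbers, halving the loss takes roughly a million times more compute (2^20 = 1,048,576). That is only a toy calculation, but it shows how a trend can hold cleanly across orders of magnitude while each fixed improvement becomes dramatically more expensive.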

In conversation, Beren Millidge made an interesting point: the future may look like a hive mind of continual-learning agents that go out to do different jobs and create value, then bring back what they have learned for mass distillation. The agents may be specialized, built from Karpathy’s “cognitive core” plus job-relevant knowledge and skills. Solving continual learning will not be a one-shot deal; rather, it will feel like the evolution of “in-context learning.” GPT-3 showed as far back as 2020 that in-context learning can be powerful; the paper was titled “Language Models are Few-Shot Learners.” Yet when GPT-3 launched, we had not solved in-context learning. From comprehension to context length, there is still much progress to make.

Dwarkesh Patel 10:50
I expect continual learning will progress similarly. Labs may launch something next year called continual learning, and it’ll count as a step forward. But human-level “learning on the job” may need another 5–10 years. That’s why I don’t expect first-to-crack continual learning to bring runaway returns—it’ll be more incremental, gradually deployed and improved.

Dwarkesh Patel 11:16
If continual learning were fully solved and dropped from the sky, then, as Satya (Microsoft CEO) told me on the podcast, that would be “Game, Set, Match.” But that probably won’t happen. Instead, one lab will figure out how to get initial traction, then through tinkering the implementation will become clear, and other labs will quickly copy and improve on it. I also have a prior that competition between model companies will remain fierce. Previous “flywheel effects,” whether chatbot engagement, synthetic data, or something else, haven’t widened the competitive gaps between model companies. Every month or so the top model companies rotate on the podium, with the others not far behind. Some force, whether talent poaching, the SF rumor mill, or plain reverse engineering, has offset any runaway advantage so far.

Dwarkesh Patel 12:14
This is a narration of my original blog post at dwarkesh.com. More posts to come; this is actually very useful for clarifying my thoughts before interviews. If you want updates, subscribe at dwarkesh.com. Or see you next podcast. Cheers.

Note: The translation may not be 100% accurate.
