Decoding Lin Junyang’s First Long Article After Resigning: 6 Insights for AI Investment

Three weeks after leaving Alibaba Qwen, Lin Junyang, who was once a core and highly regarded figure of Tongyi Qwen, published his first public technical essay after departure: "From 'Reasoning' Thinking to 'Agentic' Thinking".

This 6000-word English essay is a reflection based on his hands-on experience and observations with model training, offering several direction judgments that are noteworthy for AI competition participants.

In the article, Lin thoroughly explains the significance of the agent thinking paradigm in model training.

Regarding the feedback effect of agents on AI model training, a previous Wallstreetcn article "The Bitter Awakening of the Agent: Intelligence is Moving from Language to Experience" attempted to interpret, starting from Sutton's reinforcement learning, why agents are a necessary path to achieve a higher ceiling for intelligence.

In his article, Lin Junyang, drawing from his own technical practice in the Qwen team, provides hardcore references and evidence through engineering details for agentic thinking, pinpointing in greater detail the issues with traditional reasoning patterns, as well as the core constraints and potential competitive points of intelligent agents in the future.

For those trying to understand the next phase of AI’s evolution, this article perhaps contains at least six investment insights worth careful digestion.

1. The Marginal Decline of Reasoning

From the first half of 2025, or even earlier, the whole industry has been working on the same thing: letting models "think for a bit longer".

OpenAI’s o1 proved that "thinking" can be specifically trained as a core capability, and the industry excitedly joined the arms race, with only one core belief: if the model uses more computation during the reasoning phase, it will produce better answers.

But Lin Junyang made a very sober judgment in the article:

Longer reasoning trajectories do not automatically make models smarter.

Many times, overly explicit reasoning actually exposes improper resource allocation.

This is a counter-intuitive conclusion worth noting.

From 2024 to 2025, the market’s pricing logic for "reasoning models" is built on a simple assumption: the longer the model thinks, the better the answer, so longer reasoning means higher value.

GPU consumption became a proxy indicator of intelligence. In the primary market, the core narrative of many startups’ fundraising is "our model reasons deeper".

But Lin Junyang, through firsthand experience in the Qwen team, explained this assumption is failing. If a model tries to tackle every question with the same lengthy approach, it shows it failed to effectively prioritize, compress information in time, or take decisive action when needed. He writes:

Thinking should be shaped by the target task.

If the goal is coding, thinking should help the model navigate codebases, plan tasks, and recover from errors; if the goal is an agent workflow, thinking should improve execution quality over long timescales, "rather than just producing impressive intermediate reasoning text".

Translated into investment language: the marginal returns on reasoning compute power are declining.

The technical path of simply stacking reasoning time is approaching an economic boundary. Companies still using "reasoning depth" as their core valuation story may need to rethink where their moats actually lie.

2. The Fragility of “Unified Models”

Lin Junyang disclosed a little-known route choice in the article: the Qwen team once tried to merge "thinking mode" and "instruction mode" into the same model.

This goal sounds intuitively right. An ideal system should be like an expert: answering simple questions directly, thoroughly thinking about complex questions, and deciding when to use which mode.

Qwen3 is one of the most public attempts in this direction, introducing a "hybrid thinking mode" so the same model family can exhibit both thinking and non-thinking behaviors, emphasizing controllable thinking budgets.

But Lin Junyang admits, merging sounds easy, but doing it well is extremely hard—and the difficulty isn’t in the model's architecture, but in the data.

A strong instruction model is rewarded for being direct, concise, highly adherent to format, and low latency on high-frequency, high-throughput enterprise tasks; a strong thinking model is rewarded for investing more tokens in difficult problems, maintaining coherent intermediate structures, and exploring alternative paths.

These behavioral profiles are inherently in tension.

As Lin Junyang describes:

If merged data isn’t sufficiently filtered and designed, the result is often that neither side is done well: thinking behavior becomes noisy, bloated, indecisive; instruction mode behavior loses its crispness, reliability, and cost advantage.

This is why Qwen’s version 2507 eventually released separate Instruct and Thinking updates, including independent 30B and 235B versions.

In commercial deployment, many clients require high throughput, low cost, and strong controllability via instruction mode; forced merging muddles the product positioning.

Anthropic took the opposite route. Claude 3.7 Sonnet is defined as a hybrid reasoning model, users can opt for ordinary answers or extended thinking; Claude 4 goes further, allowing reasoning and tool use to be interleaved. GLM-4.5 and DeepSeek V3.1 later moved in similar directions.

For these two routes, Lin Junyang’s judgment is: truly successful fusion requires reasoning investment to be a smooth, continuous spectrum, with the model adapting how much effort to spend thinking. If you can't do that, "the product experience still won’t be natural", it’s essentially still "two awkwardly stitched personalities".

The investment insight is direct: don’t be easily swayed by narratives about "unified models" or "one model can do everything".

A model claiming to cover all scenarios at once, and actually being optimal in all scenarios, are not the same thing.

The real technical moat lies in data ratios, training process design, and behavior alignment—things that cannot be captured by a single benchmark scorecard. "Omnipotence" on the fundraising PPT often faces zero-sum tradeoffs at the data layer during commercial deployment.

3. Training Object Upgrades

Lin Junyang’s most weighted summary might be: "We are moving from an era focused on model training, to one centered around agent training."

In the previous article we argued why this transition is inevitable: the upper limit on static data is the known world’s boundary; only agents continually interacting in real environments can break this boundary.

In Lin Junyang’s article, this judgment is expressed in highly concrete engineering language:

Reasoning thinking values the internal thinking quality before the model gives the final answer—for example, solving theorems, writing proofs, producing correct code, passing benchmarks.

All these happen in a closed, controllable environment—it’s an independent intellectual performance.

Agentic thinking’s optimization goals are entirely different.

It must deal with issues reasoning models can avoid: deciding when to stop thinking and act; choosing which tools to call and in what order; absorbing environmental noise or incomplete observations; revising plans after failure; maintaining consistency across multiple rounds of interaction.

Lin Junyang focuses on "whether the model can continue driving problem-solving through interaction with the environment". The core challenge shifts from "can the model think long enough" to "can the model think in ways that support effective action".

Each of these challenges corresponds to an "action causal structure decision trajectory".

For AI investment, this shift is profound.

Previously, with widespread validation of scaling law, the key metric for evaluating an AI company was the model itself—parameter count, benchmark scores, inference speed.

But if the training object becomes the "model + environment" system, the evaluation framework must change.

Valuable future questions will become: in how many real scenarios do this company’s agents run continuously? How much causal-structured interaction data has it accumulated? How broad is its environment coverage, and how rich are its feedback signals? How quickly does its "model + environment" closed loop spin?

The model is only part of the system now—not everything; valuing agent companies only by model benchmarks is like grading an off-road vehicle by its 0-100km acceleration—it may miss the key metric.

4. Underestimated Infrastructure

Lin Junyang spent extensive space discussing infrastructure. In AI investment, this is often overlooked, but can affect competitive landscapes the most.

In reasoning-based reinforcement learning, the model generates reasoning trajectories, the evaluator scores, strategy updates—but the environment is just a static validator.

In agentic reinforcement learning, the technical logic fundamentally changes.

Lin Junyang describes a scene: the agent’s strategy is embedded in a massive execution framework—tool servers, browsers, terminals, search engines, simulators, execution sandboxes, API layers, memory systems, and various orchestration frameworks.

The environment is no longer a bystander, but part of the training system itself. He gives a vivid example: imagine an agent has to execute its generated code in a real-time testing environment. On the reasoning side, it stalls waiting for execution feedback; on the training side, it can't get completed trajectories, starving the pipeline; total GPU utilization drops far below classic reasoning RL. Overlay tool delays, partial observability, stateful environments, and inefficiency only amplifies.

In metaphor: reasoning model training is taking tests in a quiet classroom, questions have standard answers, feedback is instant. Agent training is like construction on a noisy site—materials supply uncertain, weather changes, other workers' actions affect your progress, sometimes you have to wait for concrete to dry to know if pouring was right.

The infrastructure needed for a classroom versus a construction site are not even in the same engineering league.

That’s why Lin Junyang emphasizes: "Training and inference must be more thoroughly separated." If not, agent training throughput collapses, and experiments become slow, painful, and hard to scale before reaching target capability.

This is likely the fourth AI investment insight: the logic of AI infrastructure investment is undergoing structural transition.

The core resource used to be compute itself—whoever had more GPUs had an early lead. In the future, the core resource is systematic engineering ability: coordinating training, environment simulation, feedback collection.

This ability is extremely hard to replicate, and there are far fewer companies truly possessing it than those with big compute clusters.

If compute is bricks, agent training infrastructure is architectural design ability—bricks can be bought, but design ability cannot.

5. The Scarcity of Environment Quality

Lin Junyang makes a highly insightful analogy: "In the SFT (supervised fine-tuning) era, we were obsessed with data diversity; in the agent era, we should be obsessed with environment quality—stability, realism, coverage, difficulty, state diversity, richness of feedback, resistance to exploitation, and scalability of rollout generation."

For the past two years, data has been the core keyword of AI investment stories. Whoever has more high-quality training data has stronger models. Data walls, moats, flywheels—these concepts supported much fundraising logic and valuation premiums.

But Lin Junyang’s judgment points to a deeper shift:

When the training target shifts from models to agents, the definition of scarce resources itself changes; it is likely some kind of dynamic, interactive, feedback-rich training environment.

In our previous article: the agent feeds models with the "bones of decision-making", not just "the shadow of language".

Lin Junyang’s discourse precisely describes the workshop where these bones are forged—the environment is the workshop, determining the bones’ strength.

He even asserts:

Environment building has begun to shift from 'easy side projects' to a genuine entrepreneurial track."

This signals that a whole new category of investment targets may be forming in AI: "environment companies"—businesses dedicated to building high-quality, high-fidelity, and scalable simulated environments for agent training.

If the agent’s goal is to operate in a nearly production setting, the environment itself becomes part of the core competency stack. This track has barely been priced in by mainstream AI investors.

6. The Hidden Risk of Cheating

In the article, Lin Junyang devoted considerable space to a risk almost completely off investors’ radar—reward hacking.

Here, he reveals a particularly hidden risk dimension on the training side. He writes:

Once a model gains access to truly useful tools, reward hacking becomes much more dangerous.

As the article’s risk supposition about agents:

A model with search ability may learn to directly search answers in reinforcement learning, instead of reasoning;

A coding agent may exploit future information in code repos, abuse logs, or find shortcuts that invalidate the task itself;

An environment with hidden leaks will make strategies look like “superman”, but what’s really learned is just cheating.

Stronger tools make models more useful, but simultaneously broaden the attack surface for pseudo-optimization. The more powerful the tools, the more methods to cheat.

This is especially critical for AI investment.

When seeing a company publish impressive agent benchmarks, one should dig deeper: in what environments were these metrics produced? Were there systematic anti-leak and anti-cheat designs? If an agent shines in testing but the test environment harbors hidden leaks, the actual commercial value of that "super performance" could be zero.

More dangerously, products launched based on such false capabilities will have failure rates in real business far beyond expectations.

Lin Junyang thus believes:

It should be expected that the next batch of serious research bottlenecks will stem from environment design, evaluator robustness, anti-cheat protocols, and more principled interface design between strategies and the world.

This means, in the agent era, competitive moats may reside not only in the model layer, but also in the rigor of evaluation systems and the anti-fragility of environment design.

Teams capable of building "unexploitably" robust training environments and evaluation frameworks possess an extremely scarce, hard-to-replicate capability;

Conversely, companies ignoring this layer, pursuing only beautiful benchmark scores, can always run into trouble in real-world deployment.

Lin Junyang writes in the article’s conclusion, summarizing these six insights:

The future evolution path will be from training models, to training agents, and then to training systems.

In the reasoning era, the competitive moat came from better RL algorithms, stronger feedback signals, and more scalable training pipelines.

In the agent era, competitive moats come from better environments, tighter train-infer coordination, stronger harnessing engineering, and the ability to truly close the loop between model decisions and consequences.

In the past, investing in AI focused on whose model was strongest. In the future, the focus may well be whose systems have the tightest closed loops.

Risk DisclaimerThe market carries risks; invest with caution. This article does not constitute personal investment advice and does not take into account individual users’ specific investment goals, financial status, or needs. Users should consider whether any opinions, views, or conclusions in this article fit their specific situation. Investing based on this, you bear your own responsibility.