NVIDIA's Jim Fan: The field of robotics is still in a state of chaos, and even its development direction may be wrong.

```

Jim Fan, head of NVIDIA's robotics business and co-head of the GEAR Lab, recently published a lengthy social media post sharply criticizing the current state of the robotics industry. He argues that despite significant advances in hardware, the industry as a whole remains confused about software iteration, standard setting, and the choice of technical roadmap.

Jim Fan said that the mainstream Vision-Language-Action (VLA) approach "doesn't feel right": its pre-training, which builds on Vision-Language Models (VLMs), is fundamentally misaligned with what robotics actually needs. He says he is betting on video world models as the alternative.

The post has drawn attention across the industry. While other areas of artificial intelligence are advancing rapidly, these fundamental problems suggest that robotics is still far from commercialization, which could affect how investors value companies in the sector.

Jim Fan summarized three lessons from robotics in 2025, covering hardware reliability, industry standards, and technical roadmaps. Together they offer a frontline view of the bottlenecks the industry currently faces.

Hardware reliability becomes the biggest obstacle to software iteration

Jim Fan pointed out that although robots such as Optimus, e-Atlas, Figure, Neo, and G1 demonstrate exquisite engineering, hardware reliability severely limits the speed of software development. He stated that today's most advanced AI has not yet made full use of this frontier hardware: "the body’s ability surpasses the brain’s command."

Unlike humans, robots cannot heal themselves after damage. Overheating, motor failures, and firmware glitches occur daily, and mistakes are irreversible and unforgiving. Keeping these robots running requires an entire operations team.

Jim Fan lamented: "The only thing that grows as we scale is my patience." This statement reveals the reality of high labor costs and low iteration efficiency in robotics R&D.

Lack of industry standards leads to a chaotic evaluation system

Jim Fan called the state of benchmarking in robotics an "epic disaster." He pointed out that, unlike the large language model field, where consensus benchmarks such as MMLU and SWE-bench have already emerged, the robotics industry has no unified standard for hardware platforms, task definitions, scoring metrics, simulators, or real-world setups.

It is common for each company to define its own ad hoc benchmark when making an announcement and then claim "state-of-the-art" (SOTA) performance on it. Worse still, demonstration videos are often the best run picked out of 100 attempts.

Jim Fan urged: "In 2026, we must do better: stop treating reproducibility and scientific discipline as second-class citizens." The criticism is aimed squarely at the industry's fundamental lack of scientific rigor.

Mainstream technical roadmap faces fundamental doubts

Jim Fan raised fundamental doubts about the dominant VLA models. The common practice is to graft an action module onto a pre-trained vision-language model, but this route has two core problems.

First, most parameters in a VLM serve language and knowledge, not physics. Second, to achieve high-level understanding, visual encoders actively discard low-level details, yet those fine details are crucial for dexterous manipulation.

Jim Fan believes that VLMs are heavily optimized for visual question answering and similar benchmarks, so their pre-training objectives are misaligned with the needs of robotics. "There’s no reason to believe VLA performance will scale with increasing VLM parameters." He says he is betting on video world models as a pre-training objective better suited to robot policies.

Jim Fan's views have sparked discussion in the industry. One commenter, Stewart Alsop, asked why, if video world models are superior, models that have actually shipped, such as Helix, GR00T N1, and π0, are still built on VLMs, while world models are currently used mainly for policy evaluation and synthetic data rather than direct motion control.

Jim Fan replied that these are 2025 models and that he looks forward to the next generation of large models in 2026.

Risk Warning and Disclaimer

Markets carry risk, and investment calls for caution. This article does not constitute individual investment advice and does not take into account the specific investment objectives, financial situation, or needs of individual users. Users should consider whether any opinions, views, or conclusions in this article fit their specific situation. Any investment made on this basis is at the user's own risk.
```