Core "Shovel Sellers" in the Era of Physical AI: Is Data Collection the Next Big Opportunity for Robots?
The foundational logic of artificial intelligence is shifting from the “language understanding” of large language models to the “world prediction” of world models. In this transition, the quality of physical data and the ability to acquire it have become central to progress. As the key to solving the “data fuel” problem for world models and embodied intelligence, embodied data collection is now setting off the next wave of data infrastructure.
The latest report from Guotai Haitong points out that the biggest barrier to the development of embodied intelligence is no longer algorithms but the data gap. The field’s data demand is growing exponentially, far surpassing that of traditional AI models. In this context, the data suppliers and infrastructure providers that move first to fill the gap will become the “shovel sellers” of the physical AI era, occupy the core value nodes of the industrial chain, and are expected to command significant valuation premiums.
In terms of technical routes, real data, simulated/synthetic data, and video data each have their pros and cons: pure real data is too expensive, while pure simulation data faces the “Sim2Real” gap. The mainstream path for the future is becoming clear: simulation/video data for large-scale pre-training + real data for fine-tuning and reinforcement learning.
As the mainstream technical route becomes clearer, capital is accelerating into data collection toolchains (motion capture, teleoperation), video data upscaling platforms, and simulation test grounds. This data collection infrastructure is becoming the real "hotspot" and the "shovel" business of the robotics industry.
Paradigm Shift: World Models Reshape the AI Foundation, Data Demand Expands to EB Level
AI is shifting from being driven by “data correlation” to being driven by “physical causality,” and 2025 has become the inaugural year for world model applications. Both the volume and the complexity of the data required for embodied intelligence are growing explosively.
Traditional neural networks and large language models essentially rely on pattern recognition and probabilistic association, while the core of “world models” lies in embedding physical laws (such as gravity and inertia) and in the ability to predict spatiotemporal evolution. Since the start of 2025, the field has seen a cluster of breakthroughs, with Meta’s V-JEPA 2, Google’s Genie, OpenAI’s Sora, and World Labs’ RTFM launched in succession.

World models will empower three core domains: gaming, autonomous driving, and embodied intelligence. Of these, the surge in embodied intelligence places unprecedentedly stringent requirements on data. Unlike large language models and autonomous driving (PB-level data, primarily text or visual), embodied intelligence must adapt to hardware platforms of many different forms, pushes data demand to the EB level, and depends heavily on physical interaction signals (force, touch, joint feedback). The industry is still in its early stages, with a severe shortage of pre-training data. “Data silos” and the challenge of fusing heterogeneous data have become the core bottlenecks holding back the industry’s takeoff.


Three Mainstream Data Collection Solutions Each With Pros and Cons, Video Data Becomes New Industry Focus
Building an efficient data closed loop is the core of the leap in embodied intelligence capabilities. Currently, capital and technology are mainly revolving around three data collection schemes:
- Real Data (High Fidelity but Extremely Expensive): Collected directly via teleoperation, wearable motion capture, etc. The advantage is that there is no Sim2Real gap; the critical drawbacks are high cost, poor scalability, and difficulty covering long-tail edge scenarios.

- Synthetic/Simulated Data (High Cost-Performance but Has Transfer Gap): Generated in virtual environments using physics engines. Cost is extremely low and comes with perfect labels, but faces a significant "Sim2Real Gap" (differences in dynamics, perception, control, etc.), causing model performance degradation in the real world.

- Video Data (Wide Source but Difficult to Apply Directly): The industry’s latest focus, which leverages massive volumes of internet video via upscaling technologies. Cost is low and scale is large, but the data lacks physical interaction attributes (such as gravity and friction), is noisy, and is missing precise 3D annotations.

Industry development trend: Generalist AI’s GEN-0 models (≥7B parameters) have shown that, given massive real-world interaction data, model performance improves following a power law. Until the cost of real data falls far enough, a hybrid scheme of “simulation/video data pre-training + real data fine-tuning/reinforcement learning” will remain the clear mainstream.
Meanwhile, foundational data infrastructure is being accelerated by national efforts and the open-source ecosystem. Shanghai has launched the country’s first national-level standardization pilot in embodied intelligence (the “1+N” model training field), and Beijing has established the first data training base built on real-world scenes. Google, Xinghaitu, Fourier, Zhiyuan, and others have successively released open-source datasets, and the China Academy of Information and Communications Technology has led the drafting of the country’s first quality evaluation standard for embodied intelligence datasets.

“Data Positioning” and Strategic Divergence Among Robot Body Manufacturers
Due to high costs of real data, the transfer gap of simulated data, and the noise in video data, mainstream domestic and foreign robot body manufacturers have shown obvious differentiation in their data strategies. This differentiation, in turn, provides the most direct industrial validation for the direction of data collection infrastructure.
- Real Data Priority: Advocates believe that only real interactions can bridge the Sim2Real gap. Zhiyuan Robot uses 100% real-machine data in large model training, with simulation used only for engineering iterations; Independent Variable Robot uses no simulation data at all in complex physical interaction scenarios; 1X Technologies likewise regards “large-scale real-world data” as its core barrier.


- Synthetic and Simulated Data Priority: Bets on cost and scale. Galaxy General uses 99% synthetic data and 1% real data for training, aiming to approach real distribution at extremely low cost.

- Video Data Strategic High Ground: Giants like Tesla and Figure AI are accelerating their deployments, on the core logic that the scale of internet video far exceeds the real data collection capacity of any single robot platform. Tesla Optimus has abandoned its early motion capture and teleoperation approach and turned to mining internet video in depth; 70% of the pre-training data for Qianxun AI Spirit v1 comes from internet videos; Figure AI has launched Project Go-Big, exploring zero-shot transfer from human videos to robots; Xingdong Jiyuan and Zhuji Dongli have adopted “video pre-training + real machine fine-tuning” and multi-source data combination strategies, respectively.


The coexistence of these three routes shows that no single data source can currently solve the data bottleneck for embodied intelligence on its own. Wherever the routes ultimately converge, data collection toolchains, simulation platforms, and video upscaling technologies, the “shovel sellers” of the physical AI era, will all be clear beneficiaries.
Panorama of Data “Shovel Sellers”
With the exponential rise in the volume and complexity of the data that embodied intelligence demands, suppliers that can effectively solve the cost and efficiency problems of data acquisition are being revalued. This covers four key directions: video data conversion, simulation platforms, multimodal hardware collection, and integrated data services.
- Video Data Conversion: The key breakthrough is in converting massive internet videos into robot-usable training data at low cost. Some solutions have already reduced overall collection costs to below one five-thousandth of the industry average.
- Simulation Platform: End-to-end synthetic data systems generate vast virtual datasets with perfect labels at extremely low cost, and are gradually narrowing the Sim2Real gap.
- Real Data Collection Hardware: Sensor fusion gloves, electronic skin, and other sensors combined with high-quality open-source datasets are building a high-fidelity foundation.
- Real Data Ecosystem & Teleoperation: Large-scale self-built collection scenarios and high-precision teleoperation devices have become important sources of mainstream fine-tuning data.
From a secondary market perspective, comprehensive data service providers are building embodied intelligence data training grounds and engineering platforms through diverse solutions (teleoperation, motion capture, synthetic data); simulation platform companies are breaking through the virtual-real data barrier through acquisitions and integration, offering full life-cycle physical AI solutions.
On the whole, whether in video conversion, simulation generation, hardware collection, or comprehensive services, suppliers that can significantly improve data “accessibility” and “cost efficiency” are moving from the edge of the industry to the center of valuation.
Risk Disclosure and Disclaimer
Markets are risky, and investments must be made cautiously. This article does not constitute personal investment advice, nor does it take into account the special investment objectives, financial situation or needs of individual users. Users should consider whether any opinions, views or conclusions in this article are suitable for their particular situation. Investing accordingly is at your own risk.