Alibaba Releases the Qwen-Robot Series of Embodied AI Models; Three Models Let Robots "Walk, See, and Think" at the Same Time
Alibaba is extending the large-model race from the digital world to the physical world. On June 16, Alibaba released the Qwen-Robot series of embodied intelligence models, launching three models at once that cover manipulation, navigation, and world modeling, and establishing the first complete embodied intelligence model system in the Qwen family. The three models give robots dexterous manipulation, autonomous navigation, and environmental cognition, respectively. They can be deployed independently or work together, enabling robots to "walk, see, and think" at the same time and providing a reliable general-purpose foundation for robots of various forms to enter real-world scenarios.

The new series achieved leading results in third-party evaluations on real robots. In the RoboChallenge Table30 v1 assessment, which spans 30 real-world tasks across 4 robot platforms, the two versions of the Qwen-Robot manipulation model took the top two spots, completing complex operations such as turning a faucet, plugging in a network cable, and pouring fries with two arms. Notably, the model was trained entirely on open-source data, breaking the industry's reliance on privately collected data.

The global embodied intelligence industry is currently at a critical turning point, shifting from laboratory R&D to real-world commercial applications. Reliably executing complex commands in unfamiliar environments remains the key barrier to commercialization in this field. The release of the Qwen-Robot series reflects an accelerating trend among China's large-model vendors to extend their technical capabilities to robotics hardware.
**Unified Representation Lets Robots Migrate Across Hardware; Relative Perception Lets Manipulation Adapt Flexibly**

The VLA (Vision-Language-Action) model is one of the core foundation models in embodied intelligence. It aims to integrate visual perception, language understanding, and action decision-making, giving robots the intelligence to "see and act." Traditional VLA models are limited mainly by poor transferability: performance often drops sharply when the hardware or the operating scenario changes.

The newly released VLA manipulation model, Qwen-RobotManip, tackles this problem on two fronts. First, the model adopts a unified 80-dimensional action representation that defines a universal "body language" for different hardware platforms, allowing the model to learn fundamental physical laws and manipulation logic rather than mechanically memorizing specific action sequences. Second, the model drops its dependence on cumbersome absolute coordinates and instead generates operation instructions directly from relative positions in the camera view, enabling faster and more accurate responses to environmental changes. When deployed on new hardware, the model needs only a small amount of interactive feedback to adapt, significantly reducing cross-platform migration costs.

Qwen-RobotManip underwent more than 38,000 hours of large-scale pre-training. In the global RoboChallenge multi-task evaluation on real robots, its two versions, named "Lira" and "Atlas," took the top two rankings.

**Adaptive Memory Strategy Keeps Robots from Getting Lost While Navigating**

If the manipulation model answers the question of "how robots act," the VLN (Vision-Language Navigation) model, Qwen-RobotNav, focuses on "how robots find their way and run errands."
The model is built on Qwen-VL and unifies five major task groups, including language-instruction navigation, target search, and autonomous driving, in a single framework, eliminating the need to switch models manually for complex tasks. Traditional VLN models generally suffer from rigid memory strategies: too little memory causes robots to get lost, while too much leads to confusion. Qwen-RobotNav introduces a task-adaptive observation mechanism that adjusts the memory strategy according to the task type. More importantly, the model has a universal interface that upper-layer models can call directly, making it one of the few VLN models in the industry that natively support multiple agent frameworks. For example, when a Unitree Go2 quadruped robot equipped with this system is given the instruction "help me find my suitcase; I can't remember where I put it," the robot can reason visually while patrolling autonomously and complete the object-search navigation task smoothly.

**Understanding Physical Laws and Simulating Action Trajectories to Teach Robots to "Think"**

Qwen-RobotWorld is the third major model in the Qwen-Robot series and is positioned as the world model for embodied intelligence. Drawing on its understanding of physical laws, it can infer and simulate the robot's next movements and states, providing a basis for rehearsing real-world actions. The model has two uses: first, it can generate video data for training, easing the data shortage in embodied intelligence; second, it can predict trajectories before an action is performed, improving manipulation precision and completion quality.

Together, the three models form the Qwen embodied intelligence system. They can be deployed independently or work together under unified language instructions, enabling robots to "walk, see, and think."

Risk Disclaimer and Exclusion Clause

Markets carry risk, and investment calls for caution.
This article does not constitute personal investment advice and does not take into account the specific investment goals, financial situation, or needs of individual users. Users should consider whether any opinions, views, or conclusions in this article are suitable for their particular circumstances. Investing accordingly is at your own risk.