The battle over voice as the entry point for Claw-style intelligent agents has quietly begun.

```

Xiaomi recently announced "miclaw", built on its MiMo large model, and its deployment on speakers and other devices, with support for voice wake-up and multi-turn conversation.

The move is expected to break through Xiao Ai's comprehension bottleneck, the familiar "cannot understand, cannot keep up" problem, and, through nearly "zero threshold" natural-language interaction, to deliver a substantial upgrade in the AI experience for the mass market.

Leveraging its massive IoT device base, Xiaomi is expected to capture high-value "decision trajectory data" at scale, giving the MiMo large model a training ground for real-world invocation.

From an industry perspective, this trend is not unique to Xiaomi. Huawei, Baidu, and others are also actively adding voice interaction to their claw products to deepen user engagement with Agent invocation.

Under the logic that entry points generate data and interaction feeds back into model optimization, a competition centered on voice entry points, execution capabilities, and closed data loops is accelerating.

The Scarcity of Trajectory Data

Smart speakers and voice assistants are nothing new.

The embarrassing reality for the industry is that, limited by earlier technology, voice assistants like Xiao Ai have mostly served as one-way command tools for tasks such as "setting alarms" or "changing songs".

Once a user's wording is vague or the request is complex, these assistants readily expose their "cannot understand, cannot keep up" shortcomings, and the "smart" experience suffers badly.

Large-model technology is now substantially changing this status quo.

Xiaomi's "miclaw", built on the MiMo large model, not only runs on PC and Mac but is also deployed on the company's smart-screen speakers.

The primary pain point the voice version of "miclaw" addresses is how intelligent the product feels in use.

The latest Xiaomi speaker to ship with miclaw lets users issue a complex task in a single sentence, supports voice wake-up and multi-turn conversation, and can call on phones and PCs to carry the task out.

This means Xiaomi speakers will no longer be mere mechanical "question-answer" command receivers. They are expected to draw on contextual memory, dig into and understand the user's "implicit meaning", and then carry out more complex tasks in messy, everyday, even colloquial contexts.

Besides Xiaomi, Baidu's Xiaodu speakers, Huawei's Xiaoyi claw, and others have also added voice interaction in various forms.

In the eyes of many industry insiders, the business logic behind big companies building a voice version of claw into hardware is that this nearly "zero threshold" interaction, which requires no learning of menus and no staring at a screen, lowers the barrier to AI interaction as far as possible and lets AI truly reach the mass market.

"This makes the entry point more natural and lowers the barrier for users, so every family member can try it and AI can quickly become part of daily life," an architect at a major company in Beijing told All Weather Technology.

In fact, to support this nearly "zero threshold" natural interaction, Xiaomi itself is also actively bringing multi-dimensional data such as audio into its underlying model training.

Back in December 2025, in an article titled "Xiaomi MiMo-VL-Miloco Technical Report", Xiaomi stated clearly that it will further leverage its hardware ecosystem, bringing audio, millimeter-wave signals, and other perception modalities into a unified multimodal learning framework. Through joint inference over these heterogeneous perception inputs, it aims ultimately to achieve comprehensive understanding of home scenarios and fine-grained spatial perception.

Putting multimodal perception and on-device deployment fully into practice requires the vast pool of data and the application environments that hardware devices provide, and this is precisely where Xiaomi has an advantage.

By the end of 2025, the number of IoT devices connected to the Xiaomi AIoT platform (excluding smartphones, tablets, and laptops) had reached 1.079 billion, up 19.3% year on year. The Mi Home app and Xiao Ai had 113 million and 160 million monthly active users, respectively.

The scale effect brought by the huge device base makes it easier for Xiaomi to capture and continuously accumulate high-value "decision trajectory data" at scale.

In the real physical world, decision trajectory data from Agent tool calls and device control execution is extremely scarce.

Traditional software systems or basic smart homes often only record the final "execution state", but what really drives AI to operate autonomously is capturing the "why do this" decision chain.

High-value decision trajectory data not only includes execution results but also covers the complete context that triggered the action.

Ideally, a system records: "Because the light sensor identified the environment as dim, and the door lock log shows the user just returned home, therefore the living room lights are turned on and the curtains drawn."

This complete information, integrating multimodal environmental input, trigger rules, and action outputs, is key material guiding Agents in complex decisions.

To obtain such data, the system must be embedded in the user's "execution path" to capture it at the moment the decision occurs.

Xiaomi's massive AIoT device network essentially forms a consumer-grade execution path through the physical world with extremely broad coverage. As huge numbers of devices coordinate day to day, individual decision trajectories accumulate continuously and are expected to weave together into a dynamic "contextual map".

Such a map can objectively capture users' living habits, temperature preferences, and cross-device usage patterns across different times and places. As the data loop keeps improving, the system gains greater predictive capability.

However, how much useful data is actually produced still depends on how people use the devices, for example whether users are motivated enough to set up complex automation scenarios.

A New War Over Entry Points

Claw products of all kinds are rolling out at an accelerating pace around interaction entry points such as voice.

Baidu claw, Huawei Xiaoyi claw, and others have added voice interaction on various hardware and are gradually evolving from single-turn command response toward multi-turn conversation and task execution.

Alibaba's Tmall Genie, though not branded "claw", deeply integrates Tongyi large model capabilities in its Whole House Smart 2.0 solution, building "spatial intelligent Agents" for smart decision-making.

As voice entry points gradually turn into Agents, sitting out means giving up a key position in the next generation of human-computer interaction.

This round of concentrated deployment is an early contest over "usage threshold and data accumulation".

As the interaction method closest to natural expression, voice lowers the cost to users and raises penetration, making device interaction more seamless.

Only when users rely on Agents frequently in daily scenarios can each company's model keep receiving real task requests and execution feedback, and so keep improving its decision and execution capabilities.

The core question at this stage is therefore whether users can "start using" the product, form a closed data loop through high-frequency use, and in turn drive capability iteration.

In this process, the entry point becomes key infrastructure connecting user behavior and model evolution, a shift already visible in some products.

At some leading manufacturers, voice no longer merely triggers a single device or function but is starting to take on continuous tasks that span devices.

For example, a user makes a fairly vague request in one sentence; the system parses the intent in the background and coordinates multiple devices to complete a whole series of actions.

In this process, what gets called is no longer a specific device but an entire execution chain organized by the system.

When interaction shifts from "point commands" to "task chains", voice is not only the entry point that lowers the usage threshold but also the starting point for actual task scheduling.

Users no longer explicitly select apps or devices, but entrust their needs to the system for unified distribution.

This also shifts the focus of entry-point competition. Manufacturers are now fighting over not just whether users issue voice commands, but who parses those requests and decides the invocation path.

Once a third party handles this step, service distribution and user decision paths may gradually move outside the original manufacturer's control, even if the hardware remains its own.

However, amid this multi-party competition, differences in manufacturers' underlying capabilities are being amplified.

Like Xiaomi, Huawei draws its main advantage from a self-developed operating system and hardware ecosystem. Back in 2024, the number of HarmonyOS devices had already reached the 900-million range, and Xiaoyi capabilities covered phones, tablets, wearables, and smart home devices, forming a unified cross-device interaction network.

This "entry point as data, device as execution" competitive logic is in turn reshaping the strategic choices of Internet companies.

For example, ByteDance has advantages in large models and applications, but is relatively weak in device-level entry points and system-level scheduling.

As Agents gradually move from "chat capabilities" toward "execution capabilities", a company that relies on apps alone is not deeply embedded in users' daily decision paths and struggles to obtain high-frequency, continuous task feedback data. Since last year, ByteDance has held frequent talks with phone manufacturers about ways to cooperate on a "Doubao phone".

In 2026, the competition in AI capabilities is moving from "interaction competition" to "execution competition".

Risk Warning and Disclaimer

The market has risks, investment needs caution. This article does not constitute personal investment advice and does not take into consideration the special investment objectives, financial situation, or needs of individual users. Users should consider whether any opinions, viewpoints, or conclusions contained herein are suitable for their particular situation. Investing accordingly is at your own risk.
```