DingTalk’s Wukong Now Has a Golden Cudgel

```

Sun Wukong only became a real headache for the Heavenly Court after he obtained the Golden Cudgel: his “life-bound magical weapon” made him even more powerful and all but unstoppable.

On March 17, DingTalk released an AI platform named "Wukong". It can take over your browser, search for things on your behalf, and operate your computer while you’re away. In other words, it has hands and feet: it can execute tasks.

And Alibaba Qwen just launched Qwen3.5-Omni, a full-modality model that can watch videos, listen to audio, and decompose audio/video into structured data ready for direct use—a lot like Sun Wukong’s Golden Cudgel.

At the moment, the monkey and the cudgel haven’t fully merged yet.

But once they do, this thing will be very powerful.


1. What Can Wukong Do

DingTalk’s Wukong is a powerful yet rule-abiding enterprise “lobster”.

(1) One-Click Price Comparison Across the Web

I asked it to search for "DJI Osmo Pocket 3" on Taobao, JD.com, and Pinduoduo, compare prices and sales, take screenshots, and organize the results into Excel.

It took over my browser—opened Taobao, entered the keyword, scrolled through, saved screenshots; then did the same for JD.com; then Pinduoduo.

After it had gone through all three platforms, an Excel file appeared on my desktop listing the 5 cheapest and best-selling items, with columns for platform, store, price, and link, and the lowest price highlighted in red.

It didn’t “tell” me which option was cheapest—it actually “did” the price comparison, screenshots, and table for me. All I did was type one sentence.

Of course, there are rough edges—you need to be logged in to all platforms; otherwise, CAPTCHAs will block it.

(2) Content Radar

The second very practical scenario doesn’t happen at your computer.

I used my phone to send Wukong a message on DingTalk: set up a daily scheduled task at 9 a.m. to automatically open the browser, search “latest AI developments, generate an AI-related topic,” extract 3 headlines with source links, and send them to my phone.

Wukong called the relevant Skill and automatically created the task. A few minutes after 9 a.m. the next day, a morning digest appeared on my phone, neatly formatted and with clickable links.

(3) Getting Clients, Building Websites

I also had Wukong build a website using skills chosen from the official marketplace. It delivered a runnable website with complete source code; the aesthetics still need work, but the ability to go from zero to one is there. Marketing departments use it to run regular competitor monitoring, and animators can produce complete data-driven videos from a single sentence.

There were more radical demos at the launch. An auto repair shop manager told Wukong, “Help me get 100 customers,” and the AI automatically completed the full chain: competitor analysis, study of top-selling products, social media posting, and steering the comment section.

If these scenarios can run stably in daily practice, it shows AI is moving from “executing orders” to “getting things fully done” for you.

Of course, instability is inevitable in the early stages. According to official figures, one user reported that making a PPT consumed about 270 million tokens. Once AI moves from conversation to execution, file handling, repeated edits, and cross-system calls drive token usage up dramatically.

DingTalk officially claims that Wukong’s RealDoc file system improves token efficiency fivefold, which is a step in the right direction. For cost-conscious SMEs, though, a more stable system and better skills are still needed before the ROI becomes clear and calculable.


2. What Does the Golden Cudgel Look Like

Wukong has hands and feet, but for now it lacks eyes and ears. It can operate the browser, read documents, and execute across platforms, but it still can’t understand what is happening in a video, or who is speaking and in what tone in an audio recording.

You’ve surely experienced this: a two-hour meeting recording sits untouched in your cloud drive, never rewatched, because reviewing it costs almost as much as holding another meeting. You see a viral product video and sense there is something to learn from its logic, but you have no time to analyze it frame by frame. English podcasts, customer service recordings in dialect: you hear them once and move on. A great deal of valuable audio and video is seen or heard once and then leads nowhere.

Qwen3.5-Omni, just released by Alibaba Qwen, aims to turn “once you’ve seen it, it’s over” into “break it down, and make it useful.”

Here's our testing experience:

We used it to break down a viral TikTok sales video from Yiwu.

After inputting the video, the model structured the analysis into Hook, selling points sequence, visual proof points, subtitle strategy, emotional pacing, CTA timing, and target audience. The core insight impressed me—“This video isn’t selling the product but certainty”: a three-level physical evidence chain builds trust, “20,000 SKUs + 20-cent average price” creates a numeric anchor, and a “babysitter” style promise reverses perceived risk.

Even more important is the transferability. When asked to write a script for a “T-shirt customization factory” using the same logic, it output a five-step actionable template: the Hook became “stretching the T-shirt to show elasticity,” the proof of quality became “close-up inkjet on the printing machine, rub-proof color,” and even the comment-section engagement was scripted.

Another test: “dictated coding.” Draw a deliberately crude app wireframe, turn on the camera, describe the needs aloud—it directly generates runnable React code. Continue with more oral modifications—sidebar, rounded corners, dark theme, press animations—after several iterations, the context is never lost. Watching, talking, and modifying as you go—this is the most natural form of human-computer interaction, and it handles it all.

What enables all this: a hybrid-attention MoE architecture; native multimodal pretraining on over 100 million hours of audio; SOTA results on 215 third-party tests, outperforming Gemini-3.1 Pro on multiple benchmarks; a 256K context window that supports over 10 hours of audio; and speech recognition for 113 languages and dialects, with TTS in 36. Pricing is less than 0.8 RMB per million tokens, under a tenth of the price of Gemini-3.1 Pro.

In short: Qwen3.5-Omni makes audio and video “decomposable”—not just “understood,” but broken into searchable, reusable, and immediately actionable data assets.


3. When Wukong Picks up the Golden Cudgel

Wukong can operate browsers, read and write files, execute across platforms, and use thousands of DingTalk functions, but without audio-visual processing it can’t be widely used in the most natural business scenarios. Qwen3.5-Omni can break videos into timestamped structured data, transcribe multilingual audio, and understand mixed visual and audio inputs, which is exactly the gap Wukong needs filled.

Suppose the two are successfully combined. You hand it a two-hour meeting recording. It doesn’t just generate a summary; it also detects who said what and when, whether the tone was firm or hesitant, and which statements were action items, then directly creates tasks in DingTalk, assigns them to the right people, and sets deadlines. From “understanding the meeting” to “executing outcomes,” no human hands are involved.

Ops teams would no longer need to manually monitor competitors’ short-video accounts every day. The AI would watch competitor videos, analyze their conversion strategies (as Qwen3.5-Omni did with the TikTok sales video), produce transferable script templates, and then automatically post adapted content to social media via Wukong, going on to attract and acquire customers. From “analyzing competitors” to “producing content” to “customer acquisition,” the whole chain is covered.

On an even more routine level, take customer service call quality inspection. Previously, people had to listen, take notes, and score each call, so only a limited number could be checked per day. With a full-modality model, the AI listens to every recording, outputs an emotional curve and a script score for each call, tags problem calls, generates improvement suggestions, and writes the results directly into the DingTalk management system.

The common logic in these examples: perception → understanding → execution, a full closed loop. Wukong solves execution, Qwen3.5-Omni solves perception, and Qwen3.5-Omni’s sub-0.8 RMB/million token pricing keeps the system affordable. All that’s left is merging the pieces.


Conclusion

In Journey to the West, Wukong was already strong when he leapt from the stone, but after getting the Golden Cudgel, recognizing his master, and starting his journey, he only grew stronger.

DingTalk’s Wukong has already leapt out. The Golden Cudgel has just been forged but hasn’t been handed over yet. The journey is long: the cost of tokens must drop, the product needs polishing, and understanding among 27 million enterprises has to be built one company at a time.

But the monkey, the cudgel, and the road are all here.


This article is from WeChat Public Account “Hard AI”.

```
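The token figures quoted in the article lend themselves to a rough back-of-envelope estimate. The sketch below is illustrative only. It assumes the roughly 270 million tokens reported for one PPT task would be billed at the article's upper-bound price of 0.8 RMB per million tokens (the article does not say which model served that task or what it actually cost), and that the claimed fivefold efficiency gain of RealDoc means one fifth of the tokens. The helper name `estimate_cost_rmb` is hypothetical.

```python
from decimal import Decimal

# Figures quoted in the article.
PPT_TASK_TOKENS = 270_000_000            # tokens one user reported for making a PPT
PRICE_RMB_PER_MILLION = Decimal("0.8")   # upper bound: "less than 0.8 RMB per million tokens"
REALDOC_EFFICIENCY_GAIN = 5              # officially claimed fivefold improvement


def estimate_cost_rmb(tokens: int,
                      price_per_million: Decimal = PRICE_RMB_PER_MILLION) -> Decimal:
    """Upper-bound cost in RMB for `tokens` (hypothetical helper, not a real API)."""
    return Decimal(tokens) / Decimal(1_000_000) * price_per_million


baseline_cost = estimate_cost_rmb(PPT_TASK_TOKENS)

# Assumption: a fivefold efficiency gain means one fifth of the tokens.
reduced_tokens = PPT_TASK_TOKENS // REALDOC_EFFICIENCY_GAIN
reduced_cost = estimate_cost_rmb(reduced_tokens)

print(f"Baseline: {PPT_TASK_TOKENS:,} tokens, at most {baseline_cost} RMB")
print(f"With 5x efficiency: {reduced_tokens:,} tokens, at most {reduced_cost} RMB")
```

Under these assumptions the PPT task tops out at about 216 RMB, or about 43 RMB after a fivefold reduction. Real costs depend on the model and pricing tier actually used, which the article does not specify.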