Hands-On with Qwen3.5-Omni: This Is What "AI Productivity" Should Look Like!

```

You must have had this experience: after a two-hour meeting, the recording file quietly sits in the cloud drive, but no one is willing to rewatch it—because the cost of rewatching is almost the same as having another meeting.

You come across a viral product-promotion video and feel that its conversion logic is worth learning, but you neither have the time to dissect it frame by frame, nor know how to turn it into your own script even if you did.

There are also English podcasts, livestreamed press conferences, and customer service recordings that mix in dialects and need to be reviewed. Such audio and video content is produced in large volumes every day, but for the vast majority of people, once it has been “seen” or “heard,” nothing further comes of it.

In our daily lives, a vast amount of highly valuable audio and video content cannot be taken apart, indexed, or summarized and reused.

But with the release of Qwen3.5-Omni by Alibaba Qianwen, we think this problem has started to be solved.

It is Qianwen’s latest-generation full-modality large model. It uses a hybrid-attention MoE architecture, was natively pre-trained on massive text and visual data plus over 100 million hours of audio, and achieves SOTA on 215 third-party benchmarks, surpassing Gemini-3.1 Pro on numerous key metrics.

More noteworthy than the benchmarks is what we saw in hands-on testing. After several rounds of extremely challenging stress tests, this full-modality model genuinely shocked us:

We asked it to analyze a Dune movie trailer—not only did it do structured analysis with timestamps, it also inferred hidden relationships between characters, and generated a storyboard script with rhythm design and color grading advice;
We gave it a viral TikTok product video—it broke down the full conversion attribution and output a 5-step script template that could be directly transferred to other industries;
We described requirements to it using a very rough hand-drawn sketch—it directly generated a runnable React page, and as we verbally requested further modifications, it iterated step by step without losing context.

This means you can throw a two-hour meeting recording at it and get back a structured summary with timestamps and an action list; throw in a competitor's viral video and directly get a reusable script template; use it for quality inspection of customer service calls and get emotional trajectories and dialogue scores.

Its significance goes far beyond another parameter upgrade in multimodal capability. It let me witness firsthand how audio and video content that you used to “just watch once and move on” from is now disassembled into immediately usable “data assets.”

And if you connect your “lobster” (AI assistant/robot) to Qwen3.5-Omni, giving it “eyes” and “ears,” then you will have a true digital employee who can understand voice commands, comprehend video content, process audio information, and even operate a computer.

This, perhaps, is the long-awaited true productivity revolution of full-modality large models.

Next, let's look at the real test details, discuss how this model is changing things, and what big moves Alibaba is planning with it.

Dissecting Movies, Reviewing Product Videos, Coding by Voice: Full-Modality Capabilities Evolve

(1) Dune: Not Just “Understanding the Story”

We chose the Dune trailer without subtitles as our first test material to stress-test Qwen3.5-Omni's multimodal abilities.

Trailers are inherently among the most challenging material for video understanding: dense shot changes, multi-threaded narrative, abundant metaphors and visual hints, and an extremely high density of audiovisual cues.

For Qwen3.5-Omni, extracting structured information in the first round was effortless: plot timeline, key shots, on-screen text, speakers and dialogue, character alignments, emotional curves—all were separated precisely by timestamp.

In the second round, we specified the dialogue at the 24th second and asked it to identify the corresponding shot, speaker, and emotion. It accurately pinpointed "She would need to be strong, like her mother," correctly recognizing it as Paul's voiceover rather than on-screen dialogue. The corresponding shot was Chani's backlit side profile in the desert, and the emotional assessment (gentleness, respect, expectation) perfectly matched the image.

The real challenge came in the third round, "deep reasoning and follow-up questions":

We asked it to analyze the "hidden relationships" between characters with evidence from shots and lines, identify "foreshadowing" shots in the trailer and how they point to future plotlines, and generate a 45-second recreated video storyboard.

It accurately identified the "mirror nemesis" relationship between Paul and Feyd-Rautha, the "broken inheritance" tension between Paul and Jessica, and Chani’s role as the "anchor of humanity," with supporting evidence from visuals and dialogue.

The storyboard it produced was not a vague summary, but a three-part rhythm design of "slow lyrical→rapid cuts→epic burst," including suggestions for color grading, sound effects, and subtitle treatment.

Honestly, at this point, it's not just “understanding video”; it's actually dissecting the material like a director. It pushes the LLM’s video understanding capability from abstract summarization to interpreting filmic language and reasoning about relationships.

(2) Product Videos: Extracting the Underlying Logic of Conversion from a Viral TikTok

For most people, the more practical question is: is it actually “useful” in the real world, at work?

We input a viral Yiwu supplier-promotion TikTok video, asking Qwen3.5-Omni to break it down and help us replicate it.

The model completed a structured breakdown across seven dimensions: hook, USP ranking, visual proof, subtitle strategy, emotional pacing, CTA timing, and target audience. Its attribution analysis was also insightful: a three-tier evidence chain built visual trust, "20,000 SKUs + $0.20 per" set up numerical anchors, and a full-service promise achieved risk reversal.

In other words, it recognized: this video is selling certainty, not just products.

To verify that it wasn’t just mechanically repeating marketing jargon, we told it, "My factory sells T-shirts, design a script using this logic," and asked it to adapt the logic to a “T-shirt customization factory” setting.

The result: it not only successfully transferred the five-step conversion template to T-shirts, but naturally changed the hook to "stretching the T-shirt to show elasticity," switched the proof of strength to "inkjet close-up + rubbing shows no fading," and even included operational suggestions for guiding private messages via the comment section.

In other words, the large model is no longer just a content understanding tool; it can already serve as a tireless e-commerce analyst and social media operations expert.

(3) Verbal App Creation: Watch, Speak, Modify

The third test is essentially “Vibe Coding” upgraded—“audio-video Vibe Coding.”

We hand-drew a deliberately crude app wireframe, turned on the camera, held the sketch to the lens, and provided spoken instructions: "This is my interface sketch... please generate a full React code, ready to run."

It recognized the hand-drawn layout and generated React code. We continued verbally: "Change the navbar to a sidebar, enlarge the main button and make it rounded," while uploading a replacement image. Later, we tested dark mode, progress bar animation, press feedback, etc., all in iterative rounds, and it kept context without losing any previous changes.

A few rounds later, the website was launched successfully.

Overall, it was able to handle the most authentic human collaboration: watch, speak, modify. It’s not like before, where “AI generates code and you debug it yourself”—it feels like a seasoned developer is sitting right next to you.

(4) Looking at the Big Picture

From Dune’s complex narrative, to product campaign analysis, to the freeform interaction of verbal app-building, if you string together the above test cases, you'll find:

Qwen3.5-Omni has successfully proven that it can turn complex, chaotic, and continuous inputs into directly usable results.

We also tested two use cases that we do not detail here. The first is generating commentary for gaming videos, where the web interface outputs the copy and the API produces TTS audio. The second is a “24-hour AI newsroom,” in which 50 minutes of international press conference audio was processed for information extraction, bilingual manuscript generation, and voice broadcast. Both gave good results, and those interested can try them out.

Fundamental Change: From “Understanding Content” to “Turning It into Assets”

The above three scenarios work not just because “abilities improved,” but because the underlying product design fundamentally changed: it forcibly disassembles continuous, mixed, hard-to-search audio and video streams into highly structured intermediates.

(1) How granular: Not summaries, but field-level structural assets

If you look at the official API documentation, you’ll find that Qwen3.5-Omni’s recommended output for audio/video is not a vague summary but a rigid three-layer structure:

Storyline (story arc with details of audio-visual events by timestamp);
Visible Text (screen text list with start/end time and visual features);
Speakers and Transcripts (verbatim transcript with speaker ID, accent, tone, emotion).

In other words, what you get is no longer a “blob of video” but a structured asset that can be invoked, searched, and executed by code. This is the underlying reason why the Dune test enables precision review, and the TikTok test outputs transferable templates.

The granularity is backed by solid foundational capabilities—a hybrid attention MoE architecture, natively multimodal pretraining with over 100 million hours of audio, intelligence on par with qwen3.5-plus, and SOTA performance on 215 third-party evaluations.

(2) How long: An ultra-long context window

The model has a 256K context window, supporting over 10 hours of audio and more than 400 seconds of 720P video.

The true challenge of long content is never just “reading it all,” but cross-span referencing and evidence tracing. Throw in a 10-hour meeting recording and ask, “What did the person mentioned at the 5-minute mark say at the 30-minute mark?” Feed in a product livestream and ask it to pinpoint exaggerated claims, citing the video and subtitles. Use it for customer service QA and have it output emotional trajectories and dialogue scores.

Such labor-intensive and error-prone information organization work is what Qwen3.5-Omni is trying to take over.

(3) Interaction: It’s a dynamic interface

In live interaction, it supports intelligent semantic interruption: it filters out meaningless background noise and won’t stop speaking just because you cough or say “um.”

It natively supports online search via FunctionCall, and can autonomously determine when to fetch search results to answer real-time questions, while developers can see precise usage information in the callback. This alleviates the “timeliness and hallucination” headaches enterprises face when using large models.

On the voice layer, it now supports 113 languages and dialects for speech recognition, 36 for speech synthesis, with 47 multilingual and 8 dialectal voices built-in.

In our tests, both the customer service persona “Tina,” who claims her voice is “like warm milk tea,” and the Sichuanese-dialect voice “Qing’er” came across with a strong character and a clear product fit.

This isn’t just about “understanding more,” but packing enough ammo for high-frequency scenarios like overseas customer service, QA, audiobooks, or podcast dubbing.

In short, Qwen3.5-Omni makes audio and video “disassemblable”: not just “understandable,” but indexable, reusable, and ready to use as working material.

A Model Is Not All Alibaba Wants to Sell

Beyond product and tech, it’s worth stepping back from the model itself to look at Alibaba’s recent series of organizational and product moves—a clear business thread emerges.

Recently, Alibaba established Alibaba Token Hub (ATH), directly overseen by CEO Wu Yongming, with the core objective of “creating, delivering, and applying Tokens.” Notably, the newly unveiled “Wukong Division” is explicitly positioned as “an AI-native workplace platform for B-end (enterprise) use, deeply integrating model capabilities into business workflows.”

In DingTalk’s latest “Wukong” release, the focus has evolved from “communication equals generation” to “communication equals execution” (CLI-style, with the AI directly calling backend APIs). AI is no longer just your chat companion; it is now expected to go online, watch competitor videos, analyze viral Xiaohongshu content, pull data across systems, and even generate data animations itself.

Note these keywords: watch videos, listen to audio, cross-platform execution. When AI Agents grow “arms and legs” and autonomously process massive amounts of audio and video, their need for full-modality understanding and their Token consumption will far exceed those of the pure-text chat era.

In this context, Qwen3.5-Omni’s ultra-low pricing (input at less than 0.8 yuan per million Tokens, less than one-tenth the price of Gemini-3.1 Pro) and powerful structured audio/video abilities appear tailored to support the large-scale rollout of enterprise-level Agents like Wukong, providing cost-effective, stable multimodal infrastructure in advance.

Bear in mind, splitting hours-long audio/video into structured data used to require enterprises to patch together an entire pipeline—ASR transcription, text LLM, vision models, TTS—high cost, long chain, and many breakpoints.

Now, an end-to-end full-modality model flattens the threshold completely.

I believe what truly makes Qwen3.5-Omni memorable is not just that it can now understand a complex movie trailer, but that from this moment, it can start turning audio and video content into actionable, reusable digital assets in enterprise workflows.

The productivity revolution driven by full-modality large models is arriving.

This article is from the WeChat public account “Hard AI”.

Risk Warning and Disclaimer: The market has risks, investment needs caution. This article does not constitute personal investment advice, nor does it consider the specific investment objectives, financial situation, or needs of individual users. Users should evaluate whether any opinions, views, or conclusions in this article are suitable for their particular situation. Invest accordingly at your own risk.
```
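
As an illustration of the three-layer output discussed under "How granular" (Storyline, Visible Text, Speakers and Transcripts), here is a minimal Python sketch of how such output might be held and queried once it has been parsed. The class and field names (`StorylineEvent`, `VisibleText`, `Utterance`, `VideoAsset`, `said_between`) and the sample data are hypothetical and chosen only for illustration; they are not taken from the official API documentation, and the real response schema may differ.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StorylineEvent:
    """One audio-visual event in the story arc (hypothetical fields)."""
    start_s: float
    end_s: float
    description: str


@dataclass
class VisibleText:
    """One piece of on-screen text with its time span (hypothetical fields)."""
    start_s: float
    end_s: float
    text: str
    visual_features: str = ""


@dataclass
class Utterance:
    """One transcript line with speaker and emotion (hypothetical fields)."""
    start_s: float
    end_s: float
    speaker_id: str
    text: str
    emotion: str = ""


@dataclass
class VideoAsset:
    """The three layers grouped together for one recording."""
    storyline: List[StorylineEvent] = field(default_factory=list)
    visible_text: List[VisibleText] = field(default_factory=list)
    transcript: List[Utterance] = field(default_factory=list)

    def said_between(self, speaker_id: str, start_s: float, end_s: float) -> List[Utterance]:
        """Return this speaker's utterances that overlap [start_s, end_s]."""
        return [
            u for u in self.transcript
            if u.speaker_id == speaker_id
            and u.start_s <= end_s
            and u.end_s >= start_s
        ]


# Made-up sample data standing in for a parsed meeting recording.
asset = VideoAsset(
    transcript=[
        Utterance(300.0, 312.0, "speaker_2", "We should ship the beta in May.", "confident"),
        Utterance(1800.0, 1815.0, "speaker_2", "May is at risk; let's plan for June.", "hesitant"),
        Utterance(1805.0, 1810.0, "speaker_1", "Agreed.", "neutral"),
    ]
)

# "What did the person who spoke around 5 min say around 30 min?"
first = asset.said_between("speaker_2", 290, 320)
later = asset.said_between("speaker_2", 1790, 1830)
print(first[0].text)
print(later[0].text)
```

Once the output is held in a shape like this, a cross-span question becomes an ordinary lookup over timestamps and speaker IDs rather than a rewatch of the recording.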