In multimodal AI's "DeepSeek moment," the major players diverge: ByteDance pursues "efficiency," Kuaishou focuses on "professionalism," and Alibaba prioritizes "e-commerce"

At the start of the year, multimodal updates arrived in quick succession: on January 31, Kuaishou pushed Kling to version 3.0; on February 7, ByteDance released Seedance 2.0; and on February 10, ByteDance's Seedream 5.0 and Alibaba's Qwen-Image-2.0 further strengthened text-to-image and image-editing capabilities.

Yao Lei of Huachuang Securities Research Institute put it bluntly in her report on February 12: video generation is no longer just a technology showcase but is evolving into a tool that can enter workflows. "AI video generation is leaping from blind-box style entertainment to precise, industrialized production."

The bottleneck that has long held back commercialization is the uncontrollable marginal cost created by the "gacha" nature of generation: the same request has to be generated and revised repeatedly, and a high rate of unusable outputs eats up both time and budget. The key upgrades in Kling 3.0 and Seedance 2.0 are not simply about image quality. They put controllability first: subject consistency across shots, semantic compliance with complex instructions, and post-generation editing, all aimed at reducing wasted outputs.

The report concludes that this technological leap lays the foundation for AI video to scale into B-end (enterprise-facing) workflows, with e-commerce advertising and short drama and animation production feeling the impact first.

Looking further ahead, the report splits the impact into two layers. The first is product-route differentiation: ByteDance leans toward "efficiency infrastructure," while Kuaishou focuses on "professional narrative." The second is a supply-side revolution that rewrites the cost structure, with the marginal cost of content creation increasingly resembling the cost of compute. As investment directions, the report names content IP, content copyright, AI video tools and models, and the cloud providers and platforms that benefit from inference-side demand.

What is really being solved is the uncontrollable cost caused by "gacha"

The report repeatedly stresses one chain of logic: AI video was hard to commercialize not because it "could not create," but because its output was inconsistent. With the same script, materials, and prompts, final quality varied widely, forcing creators to regenerate and revise many times and sending marginal costs out of control.

In the report's view, the significance of the new generation of models is that they move "generative capability" into the background and put "controllability" in the spotlight. Through native multimodal architecture, instruction alignment, and stronger subject consistency and semantic compliance, they bring down the rate of wasted outputs and so lower the overall cost of video production. The threshold for commercialization is redefined from "can it be done" to "can it be delivered reliably."

Kling 3.0 bets on "cinematic feel": prioritizing physical realism and long-form narrative logic

The report sums up Kling 3.0 in two phrases: a systematic upgrade of core capabilities, and integrated generation and editing (Omni).

On the video side, the Kling 3.0 upgrades mainly target stronger subject consistency across multiple shots and continuous movements; finer-grained parsing of complex text prompts; and less confusion in multi-person frames, with an emphasis on "precise mapping between text and visual roles" (including multi-language and dialect or accent performance with natural lip-sync and expressions). The Omni mode is the other highlighted change: generated content can be edited locally and controllably, reducing the need to start over from scratch.

Two capabilities geared toward professional creation are also mentioned. The first is video subject creation, which extracts a character's features and original voice for precise lip-sync and character driving. The second is native custom shot breakdown: the maximum duration of a single generation is raised to 15 seconds, and duration, shot type, perspective, narrative content, and camera movement can be specified at the shot level.

On the image side, Kling Images 3.0 is likewise seen as a tool that completes the workflow. It supports up to 10 reference images to lock the subject's outline, key elements, and tonal base; allows elements to be flexibly specified, added, and deleted across reference images; outputs batches of grouped images for storyboard and material-pack production; and offers enhanced HD output and detail rendering.

Seedance 2.0 makes video a "programmable" industrial tool

The report positions Seedance 2.0 more as an "industrial standard." At the foundational level it focuses on physical plausibility, natural movement, precise instruction comprehension, and consistent style. It highlights three core capabilities: consistency optimization (from faces and clothing to font details and scene transitions); controllable replication of complex camera moves and actions; and precise replication of creative templates and complex effects.

More important is the interaction paradigm. The report points out that Seedance 2.0's use of "@material name" to assign roles to images, video, and audio essentially breaks black-box generation down into a controllable production process: the model can separately take camera movement from @video, details from @image, and rhythm from @audio, which significantly lowers the rate of unusable outputs.

Practical usage and limits are also production-oriented: up to 9 image inputs; up to 3 video clips with a total duration of no more than 15 seconds; up to 3 MP3 audio files with a total duration of no more than 15 seconds; a combined limit of 12 input files; an output duration of up to 15 seconds (selectable from 4 to 15 seconds); and built-in sound effects and background music in the output. When uploading files, "first and last frame" and "general reference" are two distinct ways of organizing material.

ByteDance follows "efficiency infrastructure," Kuaishou chases "professional narrative," Alibaba leans toward e-commerce verticals

The report's view of the competitive landscape focuses less on benchmark rankings and more on strategic differentiation.

It summarizes ByteDance's approach as low-threshold, low-cost, tool-oriented, and general-purpose, similar to an advanced "CapCut," aiming to reduce the cost of content production across the Internet while feeding back into its own ecosystem. Kuaishou's Kling, by contrast, bets on physical simulation, realism in complex scenes, and character consistency, which suits professional content that demands continuity, such as film demos and movie storylines. Alibaba's Qwen is more vertical (e-commerce), with high-fidelity upgrades to its image model that strengthen product digitization.

These three paths do not point to the same business model: one seeks mass throughput, one seeks high-quality narrative delivery, and one aims for output that is "usable upon generation" in vertical industries.

Content supply-side revolution: marginal cost converges with computing cost, and IP becomes even scarcer

In its commercialization forecast, the report takes a radical view of the "supply-side revolution": now that both the image and video foundations have improved, the marginal cost of content creation will increasingly approach the cost of computing power.

In the short term, it is most optimistic about two areas: more efficient material production for marketing and e-commerce service providers, leading to better gross margins; and a potential explosion in production capacity for the animation and short drama industries.

In the medium to long term, the tension shifts to the IP side. When content becomes easier to produce, scarcity pricing concentrates even more on IP: the value of top IPs and their derivatives will rise, while mid-tier IPs may be revalued through AI video adaptation. Meanwhile, companies with strong computing infrastructure (cloud) and closed-loop traffic scenarios (platforms) will benefit directly from high-frequency inference-side calls.

Risk Warning and Disclaimer

Markets carry risk, and investment requires caution. This article does not constitute individual investment advice, nor does it take into account the particular investment goals, financial situation, or needs of any individual user. Users should consider whether any opinions, viewpoints, or conclusions in this article apply to their particular situation. Any investment made on this basis is at one's own risk.
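The Seedance 2.0 input limits quoted earlier in the article read much like a configuration spec, so they can be written down as a small pre-flight check. The sketch below is illustrative only: `GenerationRequest`, `validate_request`, and the constant names are hypothetical and are not part of any Seedance or ByteDance API. The only things taken from the article are the numeric limits themselves.

```python
from dataclasses import dataclass, field
from typing import List

# Limits as reported in the article; the constant names are hypothetical.
MAX_IMAGES = 9
MAX_VIDEO_CLIPS = 3
MAX_AUDIO_FILES = 3
MAX_VIDEO_TOTAL_SECONDS = 15.0
MAX_AUDIO_TOTAL_SECONDS = 15.0
MAX_TOTAL_FILES = 12
MIN_OUTPUT_SECONDS = 4.0
MAX_OUTPUT_SECONDS = 15.0


@dataclass
class GenerationRequest:
    """Hypothetical description of one generation job's inputs."""
    output_seconds: float                      # requested output duration
    image_count: int = 0                       # number of reference images
    video_seconds: List[float] = field(default_factory=list)  # one entry per clip
    audio_seconds: List[float] = field(default_factory=list)  # one entry per MP3


def validate_request(req: GenerationRequest) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems: List[str] = []
    if req.image_count > MAX_IMAGES:
        problems.append(f"{req.image_count} images exceeds limit of {MAX_IMAGES}")
    if len(req.video_seconds) > MAX_VIDEO_CLIPS:
        problems.append(f"{len(req.video_seconds)} video clips exceeds limit of {MAX_VIDEO_CLIPS}")
    if sum(req.video_seconds) > MAX_VIDEO_TOTAL_SECONDS:
        problems.append(f"total video duration exceeds {MAX_VIDEO_TOTAL_SECONDS} seconds")
    if len(req.audio_seconds) > MAX_AUDIO_FILES:
        problems.append(f"{len(req.audio_seconds)} audio files exceeds limit of {MAX_AUDIO_FILES}")
    if sum(req.audio_seconds) > MAX_AUDIO_TOTAL_SECONDS:
        problems.append(f"total audio duration exceeds {MAX_AUDIO_TOTAL_SECONDS} seconds")
    total_files = req.image_count + len(req.video_seconds) + len(req.audio_seconds)
    if total_files > MAX_TOTAL_FILES:
        problems.append(f"{total_files} input files exceeds combined limit of {MAX_TOTAL_FILES}")
    if not (MIN_OUTPUT_SECONDS <= req.output_seconds <= MAX_OUTPUT_SECONDS):
        problems.append(
            f"output duration must be between {MIN_OUTPUT_SECONDS} and {MAX_OUTPUT_SECONDS} seconds"
        )
    return problems
```

One detail this makes visible: each per-type limit can be satisfied while the combined limit is still violated. Nine images, three clips, and three audio files are each within their own caps, yet they add up to 15 files, which is over the 12-file total.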