The "Shadow War" of AI Video Generation Begins

```

Paid subscriptions have yet to catch on for large language models, but they are quietly taking root in the AI video generation sector.

In June of this year, annualized revenue for AI video generation startup Runway surpassed $90 million (about RMB 640 million); in the second quarter of the same year, "Koling" (marketed internationally as Kling AI), the AI video generation app from Kuaishou (1024.HK), generated over RMB 250 million in revenue.

Domestic startups are flocking to the stage.

"Vidu" from Beijing Shengshu Technology Co., Ltd. ("Shengshu Technology") and "Paiwo" from Beijing Aishi Technology Co., Ltd. ("Aishi Technology") have each surpassed 10 million users. Manycore Tech Inc. ("Qunhe Technology"), the first of the "Hangzhou AI Six Dragons" to pursue an IPO, also plans to launch a consumer-facing AI video generation product within the year.

The market's vision for AI video commercialization is not limited to individual creators generating short videos; there are also opportunities in film creation, embodied intelligence, and other fields.

However, problems such as poor spatial consistency and content stitching failures have fueled controversy over the gap between the "seller's show" (polished vendor demos) and the "buyer's show" (what users actually get) in AI video generation models.

Although the "DeepSeek moment" for the AI video generation industry has yet to arrive, with major companies stepping up investment, the market has reason to believe that the development path will become increasingly clear in the future.

Race to Longer Videos

In February 2024, OpenAI unveiled Sora 1.0, the world's first AI video generation model to support videos up to 60 seconds long. It was a breakthrough over Runway, which could previously generate only 3-4 second clips.

Since then, domestic models have gradually caught up.

Currently, domestic players include Internet giants like ByteDance, Kuaishou, and Baidu, as well as startups like Shengshu Technology and Aishi Technology, all exploring the AI video generation application space.

A product manager at a technology company in southern China told Xin Feng that the biggest change in the AI video generation field this year lies in duration, namely the ability to generate longer videos with AI.

Although AI video generation models still typically output 5-10 seconds of video at a time, generating each shot in sequence already makes it possible to assemble a coherent longer video.

The film and television industry is among the first to try it.

The 50-episode animated short series "Monday Tomorrow", launched in August this year, was generated using Shengshu Technology's Vidu AI video model.

In practice, the "Monday Tomorrow" production team had key characters hand-drawn by artists, then extended the animation by using Vidu’s image-to-video and reference-to-video functions.

Shengshu Technology told Xin Feng that about 80% of "Monday Tomorrow"'s content was generated by Vidu Q1's image-to-video and reference-to-video features, which were involved in every core step from art direction to final animation. This enabled a team of fewer than ten people to complete all 50 episodes of the first season in just 45 days, averaging less than one day per episode. Traditionally, producing a two-minute animated episode can take up to a week, so this represents a more than 7-fold increase in efficiency.

Film and TV production is also an important scenario for Kuaishou's "Koling".

According to Kuaishou management in an earnings call, "Koling" currently serves a broad customer base, including mass creators (both professionals and nonprofessionals), e-commerce and advertising industry workers, and film studios.

Length limitations continue to be broken.

Recently, Baidu upgraded its AI video generation model "Baidu Steam Engine" to allow users to generate AI videos of unlimited length, smashing the previous ceiling where AI could only generate 5-10 second short clips or extend duration via keyframe control.

Users only need to input an image and a prompt to generate a video of any length.

The aforementioned product manager said that the breakthrough in video length owes less to "stacking compute power" than to algorithmic optimization and more training data.

According to Baidu, its approach to long-video generation centers on an autoregressive diffusion model, which combines the long-sequence capability of autoregression with the strong consistency of diffusion, allowing it to generate long videos that closely follow physical laws.

Xin Feng participated in Baidu Steam Engine's internal testing, using a portrait as the first frame and giving prompts like "1-5s: camera follows, subject moves quickly; 6-10s: camera follows, subject approaches stairs; 11-15s: subject moves forward, camera follows, pans right; 16-20s: subject moves forward, camera follows, pans right, circles to front of subject," to generate a 20-second video. (See "Baidu Steam Engine" AI Video Generation Model)

In the video, the transitions between facial expressions look like a face swap and some objects appear or disappear out of nowhere, but the subject's motion is natural and the background remains stable.

The Smoke of a Price War

Although domestic large language models have yet to find a path to paying consumer users, AI video generation companies are already exploring business models.

Pricing varies widely among companies.

Comparing standard plans alone: Koling and Shengshu Technology's Vidu cost 66 yuan and 59 yuan, respectively, while Aishi Technology's Paiwo and ByteDance's Jimeng both cost 79 yuan.

However, Vidu and Jimeng follow a "more for the same price" model, allowing up to 200 and 216 videos per month, respectively, while Koling and Paiwo allow only several dozen videos per month.

All have achieved some commercial results.

Kuaishou is one of the few large companies to disclose the commercial results of its AI video generation applications; in the second quarter of 2025, Koling's revenue exceeded RMB 250 million.

Among startups, Shengshu Technology’s Vidu achieved an annual recurring revenue (ARR) of over $20 million (RMB 140 million) within 8 months of launch; Paiwo by Aishi Technology claims subscription revenue now covers costs.

However, leading companies have quietly started a price war to attract professional creators.

According to Baidu, its Steam Engine model has already been applied in search, marketing, and other scenarios, with pricing as low as 70% of the industry standard. Recently, when Koling launched its 2.5 Turbo model, one key selling point was that it was nearly 30% cheaper than the 2.1 model at the same tier, offering better value for money.

The flip side of the price war is that many more companies are eager to enter the field.

Xin Feng has learned that Qunhe Technology, which is seeking an IPO on the Hong Kong Stock Exchange, is developing a 3D technology-based AI video generation product, expected to launch this year.

An insider at Qunhe Technology told Xin Feng that the product will be opened to consumer users in the future.

Qunhe Technology's strength lies in its massive, physically correct dataset of indoor spaces.

“In the process of developing (home design tools like Kujiale), we accumulated huge amounts of data. Unlike directly AI-generated 3D models, our models are physically correct and interactable; the materials are also physically accurate, with physical coefficients on the surfaces, and there is structured information and annotation inside,” said Qunhe Technology Chairman Huang Xiaohuang.

This August, Qunhe Technology's dataset InteriorGS, billed as the world's first large-scale 3D dataset suitable for free movement by agents, topped the trending chart on Hugging Face, the world's largest open-source AI community.

These developments could put even more pressure on companies to further expand commercial boundaries.

The market already envisions uses beyond film and advertising, such as robotics training.

Robotics training has long faced a scarcity of training data, limited scene coverage, and high collection costs, but AI video generation applications can provide virtual scenes for robots to train in, allowing them to better understand how the real world operates.

Some robotics companies are developing their own algorithms. For example, in March this year, robotics company Jujidongli (LimX Dynamics) released LimX VGM, an embodied manipulation algorithm that leverages video generation technology to help advance embodied AI.

A participant in that project told Xin Feng frankly that the generalization capability of current video generation models is still limited by data volume.

However, this person remains bullish on using AI video generation models to train robots in virtual environments.

Previously, Kuaishou management stated in earnings calls that it planned to expand Koling’s application in game production, professional film, and visual production.

Buyer's Show vs. Seller's Show

Although all AI video generation companies claim to have improved spatial consistency, Xin Feng found that when the subject is in motion, facial expressions still frequently collapse and backgrounds alternate between sharp and blurry.

For example, using the image-to-video feature on Paiwo, Xin Feng generated a short video of a person dancing, but encountered problems with facial distortion and objects disappearing out of thin air. (See "Paiwo" AI Video Model Generation)

An industry insider in Hangzhou told Xin Feng that detail and background consistency issues under complex movement are a common industry challenge, mainly due to the difficulty of precisely modeling long-term motion trajectories and multi-scale semantic continuity.

Qunhe Technology product manager Long Tianze believes it's related to the source of training data.

“The key is that now AI video algorithms learn from 2D image sequences, so they cannot truly understand 3D space and rules. The model learns to make the previous frame look more like the next, but doesn’t understand real 3D spatial relationships or the logic of how the physical world operates,” said Long Tianze.

Currently, the main approaches to solving spatial consistency are algorithm optimization and building better datasets.

Shengshu Technology told Xin Feng that it is optimizing mainly along three lines: 1) using a unified spatiotemporal attention mechanism, based on its self-developed U-ViT architecture, to enhance the model's ability to predict subject motion and its relationship to the background; 2) building a massive, high-quality video training dataset to specifically reinforce semantic understanding of complex motion patterns; and 3) introducing dynamic masks and consistency compensation algorithms to repair inter-frame errors in real time at the post-generation stage.

“Our reference-to-video feature has already achieved multi-level consistency improvement from face to full subject, and the next breakthrough will focus on stability during large movements,” said Shengshu Technology.

Qunhe Technology, for its part, is developing a workflow for 3D-based video generation, which may help reduce model clipping and distortion when the scene changes.

However, such approaches require users to understand input data formats for video generation and other technicalities.

Privacy Boundaries

High-quality datasets are highly sought-after training material for many current AI video generation companies.

Some foreign tech giants have allegedly even resorted to downloading adult films as training material to improve model consistency for human subjects.

Meta, for instance, has faced such accusations.

In July this year, two American adult entertainment companies, Strike 3 Holdings and Counterlife Media, took Meta to court, claiming that Meta secretly downloaded 2,396 adult films to train its AI models.

"It is a very new case, involving copyright infringement. Meta will probably argue 'fair use,'" an intellectual property lawyer practicing in the U.S. told Xin Feng. "Currently there is no unified rule for this kind of training material; the industry is advancing amid controversy."

By comparison, domestic platforms might have more flexibility regarding training materials, especially video platforms.

Although video platforms do not have exclusive rights to user-uploaded videos, they usually have the right to use them.

For example, Kuaishou's "Basic Functions Privacy Policy" makes it clear that, to deliver and evaluate advertising more effectively, they may need to share some user information and data with third-party advertisers, service providers, and suppliers.

This may mean that video platforms like Kuaishou and Douyin will have a bigger data advantage in the AI video generation sector than other companies.

As the AI video generation sector develops, the boundaries of data use may become clearer as well.

Risk Warning and Disclaimer

The market has risks; invest with caution. This article does not constitute personal investment advice, nor does it consider the specific investment objectives, financial situations, or needs of individual users. Users should decide whether any opinions, views, or conclusions in this article suit their particular situation. Any investment based on this is at their own risk.
```