Musk's xAI joins the "world model" race—will "vision models" be the next "large language models"?

```

Author: Long Yue

Source: Hard AI

The flames of competition in the field of artificial intelligence are spreading from large language models to a more cutting-edge area—the "World Models" capable of understanding and simulating the real physical world. xAI has quietly joined this race, competing alongside tech giants like Google and Meta.

According to the UK’s Financial Times on October 12, Musk’s startup xAI hired artificial intelligence experts from chip giant NVIDIA this summer, specifically to work on World Model development. Unlike large language models that rely on text, World Models are trained on massive amounts of video and robotics data, aiming to master the physical laws of the real world.

“In the future, video models will become as intelligent as language models,” Google researchers said in a paper. NVIDIA also declared last month that the potential market size for World Models could be close to the current total size of the global economy.

Advance Troops: xAI’s “Surprise Attack” on Gaming and Ambitions for Robotics

To gain a foothold in this race, xAI is actively recruiting talent.

The company has hired two AI researchers, Zeeshan Patel and Ethan He, from NVIDIA, both of whom have extensive experience in the field of World Models. NVIDIA has long been a leader in this technology, thanks to its Omniverse platform for creating and running simulations.

According to people familiar with the matter, xAI’s first commercial application for World Models is the gaming industry, where the models would generate interactive 3D environments. The move has quickly drawn market attention: it is a clear sign of xAI’s commercialization strategy, and it highlights the potential of World Models as a next-generation AI technology.

Musk himself has also confirmed on social platform X that xAI will “release an excellent AI-generated game before the end of next year.” In the longer run, these technologies may eventually be applied to AI systems for robots.

xAI’s job postings also confirm this direction. The company is hiring technical staff in image and video generation for its “omni team,” with salaries ranging from $180,000 to $440,000. The team is dedicated to “creating magical AI experiences beyond text.”

In addition, the company is hiring “video game mentors” at hourly rates of $45 to $100 to train its AI model Grok in making video games.

Paradigm Shift: The “GPT Moment” for Vision Models

xAI’s high-profile entry coincides with a notable industry forecast: that video models will eventually become as intelligent as language models. A recent Google paper noted that its video model Veo 3 is showing “emergent abilities” similar to those of large language models (LLMs).

Just as LLMs learn extra skills such as math and creative writing from the simple task of “next token prediction”, video models, through “next frame prediction,” are also beginning to unlock a range of surprising abilities—such as object segmentation, edge detection, and simulating tool use—in zero-shot fashion, without specialized training for those tasks.

The Google researchers wrote in the paper, “We believe that, just as NLP shifted from task-specific models to general-purpose models, machine vision may experience the same transformation through video models: a ‘GPT-3 moment’ for vision.”

They liken the process of generating video frame by frame to the “chain-of-thought” in language models, calling it the “chain-of-frames,” believing this enables video models to reason across space and time.

This finding is significant, as it hints that by developing more intelligent video models, humans may be able to obtain very capable robotic “agents.”

Prospects and Reality: High Costs and Lack of “Vision”

Although the prospects are enticing, the road to World Models is far from smooth. The technology still faces major technical challenges, chief among them the extremely high cost of acquiring and processing enough training data to simulate the real world.

At the same time, the industry is also taking a sober look at the role of AI. Michael Douse, Publishing Director at Larian Studios, the developer of the popular game “Baldur’s Gate 3”, said on X this week that AI cannot solve the gaming industry’s “big problems,” namely “leadership and vision.”

He added that what the industry needs is not “more game loops generated by mathematics and trained in psychology,” but more diverse expressions of the world. This represents a common view: that purely technological breakthroughs cannot, by themselves, guarantee the creation of truly compelling commercial products.

Despite the many challenges, xAI’s entry has undoubtedly added more fuel to the World Model competition.

The focus of AI is irreversibly shifting from pure digital information processing to the simulation of, and interaction with, complex physical reality. Whether vision models can replicate the brilliance of large language models and usher in their own “GPT moment” will not only determine the next AI hegemon, but could also fundamentally reshape our relationship with both the digital and physical worlds.

 

This article is from the WeChat Official Account “Hard AI”.

Risk Warning and Disclaimer: The market has risks and investment requires caution. This article does not constitute personal investment advice, nor does it take into account the special investment objectives, financial situation, or needs of individual users. Users should consider whether any opinions, views, or conclusions in this article fit their particular circumstances. Investment is at your own risk.
```