Another shock for foreigners? Kimi's Yang Zhilin: When will K3 be released? "Before Sam Altman's trillion-scale data center is built"

The AI community has blown up again recently! As soon as Moonshot’s Kimi K2 Thinking model was released, it sent the international developer community into a frenzy.

This model has surpassed OpenAI’s GPT-5 and Anthropic’s Claude Sonnet 4.5 in several key benchmarks such as Humanity’s Last Exam and TAU-Bench, while its API call price is much lower than both.

Hugging Face co-founder Thomas Wolf exclaimed directly: "Is this another glorious DeepSeek moment?"

While developers worldwide were hotly debating the release, in the early hours of November 11 Beijing time, when most people in China were still asleep, Moonshot founder Yang Zhilin and co-founders Zhou Xinyu and Wu Yuxin hosted a multi-hour AMA (Ask Me Anything) on Reddit.

This was also the first time all three co-founders appeared together, responding to sharp questions from overseas developers.

The Q&A lasted several hours and ranged from the rumored $4.6 million training cost to the release date of K3, from open-source strategy to industry competition, and from technical approaches to AGI timelines. Yang Zhilin's team answered dozens of questions in a single session.

The $4.6 million rumor is not true; true costs are hard to quantify

The hottest topic was the rumored $4.6 million training cost. Confronted with this number that shocked Silicon Valley, Yang Zhilin responded directly:

"This is not official data. Since a large part of the training cost is research and experiments, it's very hard to quantify exact numbers."

This response tempered industry speculation about K2 Thinking's "ultra-low costs." Although no specific figures were disclosed, the model's technical design does point to careful cost control:

K2 Thinking adopts a mixture-of-experts architecture with one trillion total parameters but activates only 32 billion of them per inference, and it uses native INT4 quantization, which reportedly about doubles inference speed.
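To make these figures concrete, the short calculation below works out what fraction of the parameters is active per inference and how much memory the weights alone would need at 4 bits per parameter. The parameter counts come from the article. The 16-bit baseline is an assumption added for comparison, and the estimate ignores quantization scales, activations, and any layers left unquantized, so treat it as a rough sketch rather than a measured footprint.

```python
TOTAL_PARAMS = 1_000_000_000_000  # one trillion parameters (figure from the article)
ACTIVE_PARAMS = 32_000_000_000    # 32 billion activated per inference (figure from the article)


def active_fraction(total_params: int, active_params: int) -> float:
    """Share of the parameters that is actually used for one forward pass."""
    return active_params / total_params


def weight_bytes(n_params: int, bits_per_param: int) -> int:
    """Bytes needed to store the weights alone at the given precision."""
    return n_params * bits_per_param // 8


fraction = active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS)
int4_bytes = weight_bytes(TOTAL_PARAMS, 4)
bf16_bytes = weight_bytes(TOTAL_PARAMS, 16)  # assumed 16-bit baseline, not from the article

print(f"active share per inference: {fraction:.1%}")
print(f"INT4 weights:   {int4_bytes / 1e9:.0f} GB")
print(f"16-bit weights: {bf16_bytes / 1e9:.0f} GB")
```

Under these assumptions the INT4 weights take a quarter of the space of a 16-bit copy. That is a statement about storage only; the reported speedup is the team's claim and is not derived here.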

On hardware, Yang Zhilin revealed that the team trained on H800 GPUs connected with InfiniBand. "Although they're not as advanced as American high-end GPUs and we don't have the quantity advantage, we squeezed every bit of performance from each GPU."

Reportedly, the API price for K2 Thinking is 1 to 4 yuan per million input tokens and 16 yuan per million output tokens, about a quarter of GPT-5's price, which gives it a strong balance of performance and cost.
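As a rough illustration of what these prices mean in practice, the sketch below estimates the bill for one workload at both ends of the reported input price range. The per-million-token prices are the ones quoted above. The workload size and the function name `estimate_cost_yuan` are hypothetical and chosen only for the example.

```python
def estimate_cost_yuan(input_tokens: int, output_tokens: int,
                       input_price_per_m: float,
                       output_price_per_m: float = 16.0) -> float:
    """Cost in yuan, given prices quoted per million tokens."""
    input_cost = input_tokens / 1_000_000 * input_price_per_m
    output_cost = output_tokens / 1_000_000 * output_price_per_m
    return input_cost + output_cost


# Hypothetical workload: 2 million input tokens and 0.5 million output tokens.
INPUT_TOKENS = 2_000_000
OUTPUT_TOKENS = 500_000

low = estimate_cost_yuan(INPUT_TOKENS, OUTPUT_TOKENS, input_price_per_m=1.0)
high = estimate_cost_yuan(INPUT_TOKENS, OUTPUT_TOKENS, input_price_per_m=4.0)

print(f"estimated cost: {low:.2f} to {high:.2f} yuan")
```

In this example the output tokens cost 8 yuan at either end of the range, so for a model that writes long reasoning traces the output price tends to dominate the bill.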

This price-performance advantage is drawing more and more enterprise users to shift from closed-source models to open-source solutions.

Is K2 Thinking too “talkative”? Focused on Agent capabilities

Responding to the many developers who complained that K2 Thinking is "too verbose," the team gave a clear answer.

Yang Zhilin said: "In the current version we focus more on absolute performance than token efficiency. Later we’ll try to include efficiency into the reward system, so the model learns to compress thought processes."

This design reflects Moonshot's technical priorities: to guarantee quality on complex tasks, sacrificing some token efficiency is acceptable. K2 Thinking can make 200 to 300 consecutive tool calls to solve a complex problem while remaining stable in the alternating "think-tool-think-tool" mode.
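Yang Zhilin did not say how efficiency would be folded into the reward, so the following is only one possible shape for such a reward, not Moonshot's method. It assumes a binary correctness signal and a hypothetical token budget, and it subtracts a small length penalty only from correct answers so that a short wrong answer never beats a long correct one. The names `shaped_reward`, `token_budget`, and `penalty_weight` are invented for this sketch.

```python
def shaped_reward(correct: bool, tokens_used: int, token_budget: int,
                  penalty_weight: float = 0.2) -> float:
    """Correctness reward minus a length penalty (hypothetical formulation).

    The penalty grows linearly with token usage and is capped at
    penalty_weight once the budget is reached or exceeded.
    """
    if not correct:
        return 0.0  # wrong answers earn nothing, however short they are
    usage = min(tokens_used / token_budget, 1.0)
    return 1.0 - penalty_weight * usage


concise = shaped_reward(True, tokens_used=2_000, token_budget=8_000)
verbose = shaped_reward(True, tokens_used=8_000, token_budget=8_000)
wrong = shaped_reward(False, tokens_used=500, token_budget=8_000)

print(concise, verbose, wrong)
```

With these settings a correct answer always scores at least 0.8, so performance still comes first and brevity only breaks ties among correct solutions.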

Biggest challenge during development

For implementation, the team used end-to-end agentic reinforcement learning, so that the model stays effective across hundreds of steps of tool use, including intermediate steps such as retrieval. The core of this method is to have the AI imitate the human problem-solving process, gradually approaching a good solution through repeated iteration.

Moonshot co-founder Wu Yuxin said in a reply that supporting the interleaved "think-tool-think-tool" pattern was one of the main challenges in development: "It's a relatively new behavior in LLMs and required a lot of work to get right."
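The interleaved "think-tool-think-tool" pattern can be pictured as a simple loop in which each thought either requests a tool call or ends with an answer. The sketch below is a toy illustration under stated assumptions, not Moonshot's implementation: `run_agent`, `scripted_policy`, and the `add` tool are hypothetical stand-ins, and a real system would call the model where the scripted policy is used here. The default cap of 300 tool calls mirrors the upper figure quoted in the article.

```python
from typing import Callable, Dict, List, Optional, Tuple

Transcript = List[Tuple[str, str]]  # (kind, text) pairs, in order


def run_agent(policy: Callable[[Transcript], dict],
              tools: Dict[str, Callable[[str], str]],
              question: str,
              max_tool_calls: int = 300) -> Tuple[Optional[str], Transcript]:
    """Alternate between thinking and tool calls until the policy answers."""
    transcript: Transcript = [("user", question)]
    tool_calls = 0
    while True:
        action = policy(transcript)  # a real system would query the model here
        transcript.append(("think", action["thought"]))
        if action.get("tool") is None:
            transcript.append(("answer", action["answer"]))
            return action["answer"], transcript
        if tool_calls >= max_tool_calls:
            return None, transcript  # budget exhausted without an answer
        result = tools[action["tool"]](action["arg"])
        tool_calls += 1
        transcript.append(("tool", result))


def scripted_policy(transcript: Transcript) -> dict:
    """Toy stand-in for the model: two tool calls, then an answer."""
    results = [text for kind, text in transcript if kind == "tool"]
    if len(results) == 0:
        return {"thought": "First add 2 and 3.", "tool": "add", "arg": "2 3"}
    if len(results) == 1:
        return {"thought": "Now add 10 to that.", "tool": "add",
                "arg": results[0] + " 10"}
    return {"thought": "I have the total.", "tool": None, "answer": results[-1]}


TOOLS = {"add": lambda arg: str(sum(int(x) for x in arg.split()))}

answer, transcript = run_agent(scripted_policy, TOOLS, "What is 2 + 3 + 10?")
kinds = [kind for kind, _ in transcript]
print(answer)
print(kinds)
```

Each tool result is appended to the transcript before the next thought, so later reasoning can depend on earlier observations. Keeping that loop stable over hundreds of iterations is what the team describes as the hard part.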

When will K3 be released?

When asked about the release date for K3, Yang Zhilin gave a rather humorous answer:

"Before Sam Altman’s trillion-scale data center is built."

Someone joked: "So we'll never see it, since he'll never finish that doomed project? Just kidding~"

Why release a pure text model first?

On multimodal capability development, Yang Zhilin said: "Training vision-language models takes time for data gathering and training adjustment, so we decided to release a text model first."

You can feel the AGI vibe

On open-source motivations, Yang Zhilin answered quite idealistically: "We embrace open source because we believe AGI should be a pursuit that unites, not divides."

K2 Thinking uses a Modified MIT License, which largely preserves the freedoms of the standard MIT license but adds one key restriction: if the model is used in a commercial product with more than 100 million monthly active users or more than $20 million in monthly revenue, the use of Kimi K2 must be credited.

Asked about the AGI timeline, Yang Zhilin replied rather cautiously: "AGI is hard to define, but everyone can already sense the vibe, and there’ll be more powerful models in the future."

Will you release larger closed-source models?

When asked whether larger closed-source models would be released, Yang Zhilin gave a rather intriguing reply: "If it becomes too dangerous :)"

This hints at concern for model safety while also leaving room for future commercialization.

Less than 48 hours after launch, K2 Thinking had already been downloaded more than 50,000 times, making it the hottest open-source model on Hugging Face.

Technical divergence from DeepSeek: OCR and KDA

Confronted with different technical routes, the Moonshot team showed clear preferences. On DeepSeek’s recent OCR route, Zhou Xinyu gave a differing view:

"Personally, I think this path is somewhat heavy. I prefer to keep improving in feature space, to find more general and modality-independent methods to enhance model efficiency."

On future directions, the team revealed that KDA (Kimi Delta Attention) is their latest experimental architecture and that its ideas may be used in K3. The hybrid design interleaves KDA and MLA layers at a 3:1 ratio, letting the model learn to "grasp key information" with the aim of improving performance, speed, and memory usage over a traditional transformer.

Yang Zhilin said the team has internally tested the new Kimi Linear structure (built around KDA, a linear attention module with stronger expressiveness); initial results look promising, and it can be further combined with sparsification techniques.
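The 3:1 ratio of KDA to MLA mentioned above can be pictured as a repeating layer pattern. The sketch below only lays out layer labels; it implements neither attention mechanism. The article gives the ratio but not the ordering, so placing the MLA layer last in each group of four is an assumption, and `hybrid_layer_pattern` is a hypothetical helper.

```python
from typing import List


def hybrid_layer_pattern(n_layers: int, kda_per_mla: int = 3) -> List[str]:
    """Label each layer "KDA" or "MLA", with one MLA after every kda_per_mla KDA layers.

    The position of the MLA layer inside each group is an assumption.
    """
    group_size = kda_per_mla + 1
    return ["MLA" if (i + 1) % group_size == 0 else "KDA"
            for i in range(n_layers)]


layers = hybrid_layer_pattern(8)
print(layers)
print("KDA:", layers.count("KDA"), "MLA:", layers.count("MLA"))
```

In such a layout only a quarter of the layers use full attention, which is consistent with the speed and memory gains the team is aiming for.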
