DeepSeek V4's first round of evaluations is here!

Following the launch of the open-source DeepSeek V4 preview, the first wave of third-party evaluation results has been released. Multiple evaluations show that DeepSeek V4's performance, especially on coding tasks, has entered the top tier of open-source models, and its combination of a million-token context window and low pricing further lowers the barrier for developers.

Among third-party evaluators, Arena.ai described V4 Pro (Thinking mode) on X as "a major leap compared to DeepSeek V3.2," ranking it 3rd among open-source models and 14th overall in its code arena. Another evaluator, Vals AI, stated that V4 took the top spot among open-weight models on its Vibe Code Benchmark "with overwhelming advantage," beating closed-source models such as Gemini 3.1 Pro and achieving roughly a 10x improvement over V3.2.

In terms of pricing, V4-Flash output costs $0.28 per million tokens, roughly 99% cheaper than Claude Opus 4.7 ($25 per million output tokens); V4-Pro output costs $3.48 per million tokens, making it one of the least expensive leading models in its class. Comparison tables place Flash at the bottom of the price range for small models and Pro at the low end for large frontier models.

Opinions on real-world experience are starting to diverge. Several users on X call its cost-effectiveness "unbeatable," while at least one tester reported that V4 Flash did not surpass V3.2 in practice. DeepSeek maintains a cautious tone in its own assessment, noting that its knowledge and reasoning capabilities are close to those of closed-source systems but still trail the frontier by about 3 to 6 months. It also warns that, "limited by high-end computing power," Pro's service throughput is constrained, and it expects further price reductions ahead.

Third-party reviews: Code skills dominate, overall rankings rival the top

Shortly after OpenAI released GPT-5.5, the DeepSeek-V4 preview launched and was open-sourced at the same time. The release comprises V4-Pro, with 1.6 trillion parameters (49B active), and V4-Flash, with 284 billion parameters (13B active). Both models support a context window of 1 million tokens and are released under the MIT license.
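To put the "active" figures in perspective, the short sketch below computes what share of each model's parameters is used per token. It relies only on the parameter counts quoted above; the function name `active_fraction` is illustrative and not taken from any DeepSeek material.

```python
# Parameter counts as reported in the article: (total, active per token).
MODELS = {
    "V4-Pro": (1.6e12, 49e9),
    "V4-Flash": (284e9, 13e9),
}


def active_fraction(total, active):
    """Share of the model's parameters that is used for each token."""
    return active / total


for name, (total, active) in MODELS.items():
    print(f"{name}: {active_fraction(total, active):.1%} of parameters active per token")
```

On these figures, only a few percent of each model's weights take part in processing any single token, which is one reason a model this large can be served at comparatively low per-token cost.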

The model evaluation platform Arena.ai announced on the day of V4's release that DeepSeek V4 Pro (Thinking mode) ranked 3rd among open-source models and 14th overall in its code arena, describing it as "a major leap compared to DeepSeek V3.2." Arena.ai also tested V4 Flash; both models support a 1-million-token context window.

Vals AI's review offers even more noteworthy results. The platform stated that DeepSeek V4 became the top open-weight model on its Vibe Code Benchmark "with overwhelming advantage," finishing ahead of second-place Kimi K2.6 and beating cutting-edge closed-source models such as Gemini 3.1 Pro.

Vals AI particularly emphasized that V4 achieved about a 10x improvement over V3.2: "V3.2 only scored 5 in this benchmark, that's not a typo." In Vals' comprehensive index, V4 ranked 2nd, just 0.07% behind the leader, Kimi K2.6.

The community response was largely positive. On X, user Sigrid Jin called it a new "shocking moment," adding that "now you can run a GPT-5.4-ish model at home." He wrote:

"GPT-5.5, sorry, DeepSeek V4 is the new shocking moment, it beat GPT-5.4 high-intensity mode in the code arena."

User Ejaaz said:

"China is leading AI, they've caught up. DeepSeek V4 Flash is 99% cheaper than Opus 4.7, only $0.28 per million tokens, number one in the code arena—this is not a typo."

Some users expressed reservations. X user Michael Anti said after testing that V4 Flash did not surpass the already mature V3.2 in actual use, and that the upgrade was disappointing for longtime users.

Official self-assessment: cautious wording, smallest gap in code and Agent domains

DeepSeek maintains its usual cautious approach in evaluating its own performance. Official documents show that on knowledge and reasoning tasks, V4-Pro has surpassed mainstream open-source models and is approaching closed-source systems like Gemini, but there is still a gap of about 3 to 6 months compared to the most advanced frontier models. For Agent and code tasks, its performance approaches or even partly exceeds Claude Sonnet.

On internal usage, DeepSeek stated that V4 has become the company's main agentic coding model. Internal feedback indicates that the experience is better than with Claude Sonnet 4.5 and that delivery quality approaches Opus 4.6 (non-thinking mode), though a gap remains compared with Opus 4.6 (thinking mode).

In math, STEM, and competition-grade coding evaluations, V4-Pro surpassed all open-source models with published results, including Moonshot's Kimi K2.6 Thinking and Zhipu's GLM-5.1 Thinking, and achieved scores comparable to top closed-source models.

Blogger Simon Willison wrote in his review that V4-Pro (1.6 trillion parameters) is now the largest known open-weight model, ahead of Kimi K2.6 (1.1 trillion), GLM-5.1 (754 billion), and DeepSeek V3.2 (685 billion), giving enterprise users who want local deployment a new option.

He also showed pelican illustrations generated by the two models:

[Image: pelican generated by DeepSeek-V4-Flash]

[Image: pelican generated by DeepSeek-V4-Pro]

Pricing: As low as about 1% of competitors' prices, with further cuts possible later this year

DeepSeek's pricing is the aspect of this release drawing the most market attention. V4-Flash input/output prices are $0.14/$0.28 per million tokens, lower than OpenAI GPT-5.4 Nano ($0.20/$1.25) and Gemini 3.1 Flash-Lite ($0.25/$1.50), making it currently the lowest-priced option among small models.

V4-Pro input/output prices are $1.74/$3.48, also lower than Gemini 3.1 Pro ($2/$12), GPT-5.4 ($2.50/$15), Claude Sonnet 4.6 ($3/$15), and Claude Opus 4.7 ($5/$25).

Price comparison data compiled by Simon Willison show that V4-Pro is the lowest-cost option among large frontier models and V4-Flash is the lowest among small models, cheaper even than OpenAI's GPT-5.4 Nano.
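The comparison can be reproduced with a few lines of arithmetic. The sketch below uses only the per-million-token prices quoted in this section; the workload size (200,000 input tokens and 20,000 output tokens per request) and the helper names `request_cost` and `output_discount` are illustrative assumptions.

```python
# Prices in USD per million tokens (input, output), as quoted in the article.
PRICES = {
    "DeepSeek V4-Flash": (0.14, 0.28),
    "DeepSeek V4-Pro": (1.74, 3.48),
    "GPT-5.4 Nano": (0.20, 1.25),
    "Gemini 3.1 Flash-Lite": (0.25, 1.50),
    "Gemini 3.1 Pro": (2.00, 12.00),
    "GPT-5.4": (2.50, 15.00),
    "Claude Sonnet 4.6": (3.00, 15.00),
    "Claude Opus 4.7": (5.00, 25.00),
}


def request_cost(model, input_tokens, output_tokens):
    """Cost in USD of one request at the listed per-million-token prices."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


def output_discount(model, baseline):
    """Fraction by which `model`'s output price undercuts `baseline`'s."""
    return 1 - PRICES[model][1] / PRICES[baseline][1]


# Hypothetical workload: 200k input tokens and 20k output tokens per request.
for name in PRICES:
    print(f"{name:24s} ${request_cost(name, 200_000, 20_000):.4f}")

print(f"Flash vs Opus 4.7 (output): {output_discount('DeepSeek V4-Flash', 'Claude Opus 4.7'):.1%} cheaper")
print(f"Pro vs Opus 4.7 (output):   {output_discount('DeepSeek V4-Pro', 'Claude Opus 4.7'):.1%} cheaper")
```

On output pricing alone, V4-Flash comes out just under 99% cheaper than Claude Opus 4.7, which is where the "99% cheaper" figure circulating on X comes from. The gap for a real workload depends on its mix of input and output tokens.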

DeepSeek attributes its low prices to efficiency gains in ultra-long-context scenarios. According to official data, at a 1-million-token context V4-Pro needs only 27% of V3.2's per-token inference compute and 10% of its KV cache; for V4-Flash the figures drop to 10% and 7%, respectively.
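Expressed as reduction factors rather than percentages, the same figures look like this. The sketch below only restates the official percentages above with V3.2 normalised to 1.0; the names `RELATIVE_TO_V32` and `reduction_factor` are illustrative.

```python
# Relative cost at a 1M-token context, with V3.2 normalised to 1.0
# (fractions as quoted from DeepSeek's official data).
RELATIVE_TO_V32 = {
    "V4-Pro": {"compute_per_token": 0.27, "kv_cache": 0.10},
    "V4-Flash": {"compute_per_token": 0.10, "kv_cache": 0.07},
}


def reduction_factor(fraction):
    """Turn 'uses X of the baseline' into 'baseline is N times larger'."""
    return 1 / fraction


for model, metrics in RELATIVE_TO_V32.items():
    for metric, fraction in metrics.items():
        print(f"{model:9s} {metric:18s} {fraction:.0%} of V3.2, "
              f"about {reduction_factor(fraction):.1f}x smaller")
```

A KV cache at 10% of V3.2's size, for example, means roughly ten times as much context fits in the same memory budget, all else being equal.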

Notably, DeepSeek states in its pricing document that, "limited by high-end compute, Pro's service throughput is very limited currently. It is expected that after Ascend 950 supernodes launch in bulk later this year, Pro's price will significantly drop," which points to further price cuts.

Technical Architecture: Hybrid attention mechanism breaks long-context bottlenecks, adapts to Chinese compute platforms

The core technical innovation of DeepSeek-V4 is a hybrid attention architecture that combines CSA (Compressed Sparse Attention) with HCA (Highly Compressed Attention), described as the first of its kind. It targets a well-known industry problem: the cost of traditional attention grows quadratically with context length, so memory and compute become hard to manage at ultra-long contexts. CSA compresses every 4 tokens into one information block and retrieves the most relevant blocks through sparse search, preserving mid-range detail while greatly reducing compute; HCA condenses large amounts of information into coarse, framework-level blocks to capture global logic.
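The compress-then-retrieve idea behind CSA can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not DeepSeek's implementation: the article only says that every 4 tokens are compressed into one block and that relevant blocks are found by sparse search, so mean pooling as the compression step, dot-product scoring, a fixed `top_k`, and all function names here are hypothetical choices. HCA is not modelled.

```python
import math

BLOCK = 4  # tokens per compressed block, per the article


def compress(vectors, block=BLOCK):
    """Mean-pool consecutive groups of `block` vectors (pooling is an assumption)."""
    blocks = []
    for start in range(0, len(vectors), block):
        group = vectors[start:start + block]
        dim = len(group[0])
        blocks.append([sum(v[d] for v in group) / len(group) for d in range(dim)])
    return blocks


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sparse_block_attention(query, keys, values, top_k=2):
    """Attend only to the top_k compressed blocks most similar to the query."""
    key_blocks = compress(keys)
    value_blocks = compress(values)
    scale = math.sqrt(len(query))
    scores = [dot(query, k) / scale for k in key_blocks]
    # Sparse selection: keep the indices of the top_k highest-scoring blocks.
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    # Softmax restricted to the chosen blocks (shifted by the max for stability).
    peak = max(scores[i] for i in chosen)
    weights = [math.exp(scores[i] - peak) for i in chosen]
    total = sum(weights)
    dim = len(value_blocks[0])
    output = [
        sum(w * value_blocks[i][d] for w, i in zip(weights, chosen)) / total
        for d in range(dim)
    ]
    return output, sorted(chosen)


# Toy input: 16 tokens with 2-dimensional keys and values.
keys = [[float(t), 1.0] for t in range(16)]
values = [[float(t), 0.0] for t in range(16)]
output, chosen = sparse_block_attention([1.0, 0.0], keys, values, top_k=2)
print(len(compress(keys)), chosen, output)
```

In this toy setup the 16 tokens collapse into 4 blocks, and the query reads from only 2 of them. At a block size of 4, a 1,000,000-token context would shrink to 250,000 block keys before any sparse selection is applied, which is where the savings in compute and cache come from.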

In addition, V4 introduces manifold-constrained hyper-connections (mHC), which upgrade traditional residual connections by constraining signal propagation to stable manifolds, and the Muon optimizer, which replaces the traditional AdamW and is suited to large MoE models and low-precision training. According to official data, end-to-end engineering optimization nearly doubles inference speed.

On adaptation to Chinese domestic compute platforms, DeepSeek-V4 completed full verification of its fine-grained expert-parallel optimization scheme on the Huawei Ascend NPU platform, achieving speedups of 1.50x to 1.73x on general inference workloads. DeepSeek states that V4 is the world's first trillion-parameter model trained and served on domestic compute platforms, although the Ascend adaptation code has not yet been open-sourced. Additionally, Cambricon has completed vLLM inference framework adaptation for V4-Flash and V4-Pro, and the relevant code is now open-sourced on GitHub.
