DeepSeek V4 in Depth: A Structural Revolution in the Attention Mechanism

DeepSeek has released a preview version of V4 and open-sourced it at the same time. One sentence in the announcement stands out: "From now on, 1M (one million) context will be the standard for all DeepSeek official services."

OpenAI and Google have long supported ultra-long contexts. The issue is cost. The computation of the Transformer attention mechanism grows quadratically with sequence length: double the sequence, quadruple the computation. Processing 1 million tokens on a traditional architecture is nearly impossible to commercialize.

The technical report shows the extent of the architectural change: in the 1M-token scenario, V4-Pro’s single-token inference FLOPs are only 27% of V3.2’s, and its KV cache usage is only 10%.

Two Blades

Standard Transformer self-attention requires every token to compute relevance weights against every other token in the sequence. This quadratic complexity is structural; engineering optimizations alone cannot remove it.

Previously, there were two main approaches: either narrow the computation range (a sliding window looks only at local neighbors, so global awareness disappears), or bypass the long text itself (RAG retrieves first and then feeds the result to the model, so retrieval quality becomes the new ceiling). There is also fixed sparse attention, where sparse patterns are designed by hand to skip some calculations. These patterns are rigid, and because the distribution of information varies greatly by task, generalization is limited.

V4’s solution is the CSA + HCA hybrid attention architecture.

CSA (Compressed Sparse Attention) solves "what to compute". A lightweight indexer first makes a rough pass over all token pairs to estimate a relevance ranking, then selects the set of tokens that need full computation. The key is that this sparse structure is trainable: the model learns during training where it needs dense attention and where it can be sparse. The DSA in V3.2 was the prototype; V4 takes it further.

HCA (Heavily Compressed Attention) solves "what to store". Continuing from V3’s MLA (Multi-head Latent Attention), it maps KV vectors to a low-dimensional latent space and decompresses them during inference. Add FP4+FP8 mixed precision (MoE expert parameters use FP4, everything else uses FP8), and the GPU memory for the KV cache is halved again.

The combined effect shows up directly in the two numbers: 27% of the FLOPs and 10% of the KV cache usage. With the same computing power, the concurrent capacity for long contexts is about 3 to 4 times what it was before.

The report notes two other technical details. mHC (Manifold-Constrained Hyper-Connections) adds manifold constraints to strengthen residual connections, targeting cross-layer signal attenuation when training the ultra-deep 1.6T-parameter model. The Muon optimizer replaces Adam: it bases its updates on matrix orthogonalization and achieves faster, more stable convergence at large scale. Adam is the default for large-model training, but DeepSeek replaced it this time.

Numbers

The official report provides comprehensive comparisons with Claude Opus 4.6, GPT-5.4 xHigh, and Gemini 3.1 Pro High.

Math and competitive reasoning are V4-Pro’s standout areas. Codeforces: 3206, the highest among the four (GPT-5.4 is 3168; Gemini and V4-Flash are both 3052). Apex Shortlist: 90.2, ahead of Opus 4.6 (85.9), GPT-5.4 (78.1), and Gemini (89.1). IMOAnswerBench: 89.8, second only to GPT-5.4 (91.4).

Agent capabilities: SWE Verified 80.6 (Opus 4.6 is 80.8). Toolathlon 51.8 (Opus 4.6 is 47.2, GPT-5.4 is 54.6). The announcement includes an internal assessment: V4 is now the main model employees use for Agentic Coding, and "user experience surpasses Sonnet 4.5, delivery quality is close to Opus 4.6 non-thinking mode".

Two long-context results are worth comparing. MRCR 1M (retrieving key information from long texts): 83.5, against 76.3 for Gemini and 92.9 for Opus 4.6. CorpusQA 1M (precise Q&A over long documents): 62.0, against 71.7 for Opus 4.6. MRCR tests whether key information can be found, while CorpusQA requires pinpointing and analyzing it within a million tokens. The contrast between the two results speaks for itself.

Comprehensive knowledge and frontier scientific reasoning: SimpleQA-Verified 57.9, against 75.6 for Gemini. HLE (a set of difficult frontier-science questions): 37.7, the lowest among all four.

V4-Flash has 284B total parameters with 13B active, about 18% the size of the Pro version, but it also supports 1M context and the Think/Think Max inference modes. Officially, it is described as "comparable" to Pro on simple Agent tasks.

DeepSeek calls this release a "preview", and the technical report’s title contains the word "Towards": heading there, still on the road. The design logic of CSA and HCA is now public; how the sparse training mechanism performs across different tasks is what the open-source community will reveal next.

Data source: DeepSeek official announcement "DeepSeek-V4 Preview Version: Entering the Era of Million-Token Context for All" (April 24, 2026); technical report "DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence".
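As a rough check on the cost figures quoted in the article, the arithmetic can be worked through in a few lines of Python. This is a sketch, not anything from the report: the function names are hypothetical, and the assumption that serving capacity is capped by whichever of compute or KV cache memory runs out first is a simplification made here for illustration.

```python
def attention_pair_count(seq_len: int) -> int:
    """Token pairs scored by dense self-attention: every token against every token."""
    return seq_len * seq_len


def capacity_multiplier(flops_ratio: float, kv_cache_ratio: float) -> float:
    """Estimate how many times more long-context requests fit on the same hardware.

    Assumption (not from the report): capacity is capped by the tighter of
    the compute budget and the KV cache memory budget.
    """
    compute_bound = 1.0 / flops_ratio
    memory_bound = 1.0 / kv_cache_ratio
    return min(compute_bound, memory_bound)


if __name__ == "__main__":
    # Doubling the sequence quadruples the dense attention work.
    print(attention_pair_count(2_000_000) / attention_pair_count(1_000_000))
    # Ratios quoted for V4-Pro relative to V3.2 at 1M tokens.
    print(round(capacity_multiplier(0.27, 0.10), 1))
```

Under that assumption compute is the binding limit (1 / 0.27 is about 3.7), which is consistent with the "3 to 4 times" figure; the 10% KV cache figure on its own would allow roughly 10 times.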
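The "what to compute" idea described for CSA in the "Two Blades" section (a cheap indexer ranks tokens, then full attention runs only on the selected set) can be illustrated with a toy sketch. This is not DeepSeek's implementation: the function names are hypothetical, the low-dimensional index vectors are simply given as inputs rather than learned, and the compression side of CSA is not modeled at all.

```python
import math
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def select_top_k(index_query: Sequence[float],
                 index_keys: Sequence[Sequence[float]],
                 k: int) -> List[int]:
    """Cheap screening pass: rank every position by a low-dimensional score
    and keep the k best, returned in position order."""
    scores = [dot(index_query, key) for key in index_keys]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def sparse_attention(query: Sequence[float],
                     keys: Sequence[Sequence[float]],
                     values: Sequence[Sequence[float]],
                     selected: Sequence[int]) -> List[float]:
    """Full softmax attention, restricted to the selected positions."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [dot(query, keys[i]) * scale for i in selected]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(logit - peak) for logit in logits]
    total = sum(weights)
    out = [0.0] * len(values[0])
    for weight, i in zip(weights, selected):
        for d, v in enumerate(values[i]):
            out[d] += (weight / total) * v
    return out


if __name__ == "__main__":
    index_keys = [[0.1], [0.9], [0.5], [0.2]]  # hypothetical 1-d index vectors
    keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
    values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
    chosen = select_top_k([1.0], index_keys, k=2)
    print(chosen, sparse_attention([1.0, 0.0], keys, values, chosen))
```

The expensive step now scales with k rather than with the full sequence length; the screening pass still touches every position, which is why it has to be lightweight.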
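The article describes Muon only as "updating based on matrix orthogonalization". To give a feel for what orthogonalizing an update matrix means, here is a minimal sketch using the textbook cubic Newton-Schulz iteration. The function names are hypothetical, the choice of this particular iteration is an assumption made for illustration and is not taken from the report, and the input is assumed to be a nonzero, full-rank matrix.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def orthogonalize(update: Matrix, steps: int = 15) -> Matrix:
    """Push all singular values of `update` toward 1 while keeping its
    singular vectors, using X <- 1.5 X - 0.5 X X^T X.

    Assumes a nonzero, full-rank input (illustrative sketch only).
    """
    norm = math.sqrt(sum(v * v for row in update for v in row))
    # Scale by the Frobenius norm so every singular value is at most 1,
    # which keeps the iteration inside its convergence region.
    x = [[v / norm for v in row] for row in update]
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * x[i][j] - 0.5 * xxt_x[i][j] for j in range(len(x[0]))]
             for i in range(len(x))]
    return x


if __name__ == "__main__":
    q = orthogonalize([[2.0, 0.0], [1.0, 1.0]])
    print(matmul(q, transpose(q)))  # close to the identity matrix
```

The result keeps the direction structure of the raw update but equalizes its scale across directions, which is the intuition usually given for this family of optimizers.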