The "singularity" moment of edge computing power: a three-way resonance of demand, models, and chips

The real explosion of edge computing power may happen not in phones and computers but in moving robots.

On May 18th, the Guosheng Securities Communication Industry Research Team (analysts Song Jiaji, Huang Han, Shao Shuai) released an in-depth research report that reviews the latest progress of edge computing power along three dimensions (demand, models, and chips) and concludes that edge computing power is entering its "singularity" moment.

The starting point of this report is an honest self-review.

Two years ago, Guosheng Securities released an in-depth report on edge computing power, predicting rapid growth of local AI computing on devices like smartphones and PCs. In reality, AI features on these devices still mostly rely on cloud computing, and edge computing did not ramp up as expected.

Edge computing power (on-device computing, or edge computing) refers to data processing and computation performed directly on end-user devices (smartphones, AI glasses, PCs, smart home devices and, potentially, robots) rather than depending entirely on remote cloud servers.

The report summarizes this history in two sentences: "Don’t underestimate the ability boundaries of cloud models," and "Demand doesn’t come out of thin air."

Cloud is too strong, traditional edge-side demand is "suppressed"

In the past three years, the speed of evolution of cloud-based large models has far exceeded expectations.

The report points out that with cloud-side architectures such as "super nodes" and "PD separation" (running the prefill and decode stages of inference separately), cloud models are not only improving rapidly in capability but also seeing their per-token costs fall quickly.

Take image generation as an example. Three years ago, Qualcomm was still deploying Stable Diffusion on the edge, where it could only generate 512×512 base images with weak logic. Cloud-based models like GPT-4o and Nano Banana can now produce 4K high-resolution images in 10 seconds with much better logical detail, far surpassing the edge.

The three original reasons supporting edge AI—privacy, low cost, low latency—are also being undermined one by one by the powerful advance of the cloud. The report believes that "privacy" and "low cost" are being debunked; only "low latency" remains a truly defensible point.

But the low latency referred to here isn’t about the speed at which AI responds to humans. Tencent’s Hunyuan T1 model already outputs at 60-80 tokens/second, and gives the first word in just a second—faster than the comfortable human reaction time.

The low latency mentioned in the report refers to the device’s internal processing speed of external signals.

The human brain takes approximately 180-200 milliseconds to produce a visuomotor response. For a device to receive a signal, transmit it to the cloud for analysis, and get the result back for local execution often takes 2-5 seconds or more, and multimodal signals such as images take even longer.

This is the blind spot that cloud computing can’t reach. The report uses an analogy: if you replaced human nerves with wireless signals and the brain with cloud computing, wireless transmission would make the whole link less stable and significantly increase its latency.
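To make the gap concrete, the following minimal sketch turns the latency figures quoted above into the distance a moving device covers before it can react. The 1 m/s speed and the function names are illustrative assumptions, not figures from the report.

```python
# Latency ranges are the ones quoted in the text; the speed is an assumption.
HUMAN_REACTION_S = (0.18, 0.20)    # visuomotor response, 180-200 ms
CLOUD_ROUND_TRIP_S = (2.0, 5.0)    # device -> cloud -> device, 2-5 s
ASSUMED_SPEED_M_PER_S = 1.0        # hypothetical speed of a moving robot


def blind_distance(latency_s: float, speed_m_per_s: float) -> float:
    """Distance travelled before a reaction to a new signal can begin."""
    return latency_s * speed_m_per_s


def slowdown(cloud_s: float, local_s: float) -> float:
    """How many times slower the cloud loop is than the local loop."""
    return cloud_s / local_s


if __name__ == "__main__":
    for label, (low, high) in [("local, human-like loop", HUMAN_REACTION_S),
                               ("cloud round trip", CLOUD_ROUND_TRIP_S)]:
        d_low = blind_distance(low, ASSUMED_SPEED_M_PER_S)
        d_high = blind_distance(high, ASSUMED_SPEED_M_PER_S)
        print(f"{label}: {d_low:.2f}-{d_high:.2f} m travelled before reacting")

    best = slowdown(CLOUD_ROUND_TRIP_S[0], HUMAN_REACTION_S[1])
    worst = slowdown(CLOUD_ROUND_TRIP_S[1], HUMAN_REACTION_S[0])
    print(f"cloud loop is {best:.0f}x to {worst:.0f}x slower")
```

Under these assumptions the cloud loop is roughly 10 to 28 times slower than a human-like local loop, and the device travels 2 to 5 metres rather than about 0.2 metres before it can respond.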

Where is the demand? In moving robots

With "low latency" as the core proposition, the real demand direction becomes clear: make "human-like terminals" more human-like.

The analysts divide current human-like terminals into four categories by intelligence level:

First category: Perception hardware like cameras, needing to process more signal streams and finer recognition models
Second category: Tool robots (lawn mowers, pool cleaners, etc.), needing to recognize more scenes—for example, if a lawn-mowing robot can recognize pet droppings, stones, snow, and fallen leaves, it can evolve into an all-season "courtyard robot"
Third category: Smart cars, needing to understand irregular obstacles and extremely complex scenarios
Fourth category: Humanoid robots, needing to understand the physical world in real-time and interact, with inputs spanning vision, hearing, touch, and outputs as complex limb movements

The analysts’ core judgment is that this round of edge-side demand is not just wishful thinking from the capital market but the "closed-loop result of growing customer demand superimposed with industry capability evolution". As mowing robots, food delivery robots, and autonomous vehicles become commonplace, users who have accepted the basic functions begin to ask for more.

Model triple jump: From "image reading" to "predicting the future"

Evolution on the demand side cannot be separated from support on the model side. The report clearly maps the evolution path of edge-side visual models.

First generation: YOLO model

Before the era of large models, machine vision relied on CNN-based YOLO models. The principle is to divide an image into a grid and let each grid cell predict the objects within it, like "an experienced security guard standing high up, quickly scanning the crowd and immediately drawing a box if there’s likely a 'car' or 'person' in a grid." It is fast but flawed: it struggles with irregular shapes and 3D images, and it cannot understand logical relations between objects.
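The grid idea can be sketched in a few lines. This is a simplified illustration rather than a real detector: the 7×7 default follows the original YOLO paper, while the function names, the prediction format, and the 0.5 threshold are assumptions made for the example.

```python
def responsible_cell(cx, cy, img_w, img_h, grid=7):
    """Return (row, col) of the grid cell that contains a box centre.

    In YOLO-style training, this is the cell made responsible for
    predicting the object whose centre falls inside it.
    """
    col = min(int(cx / img_w * grid), grid - 1)  # clamp centres on the right edge
    row = min(int(cy / img_h * grid), grid - 1)  # clamp centres on the bottom edge
    return row, col


def keep_confident(predictions, threshold=0.5):
    """Keep only predictions confident enough to 'draw a box' for."""
    return [p for p in predictions if p["score"] >= threshold]


if __name__ == "__main__":
    # A 640x480 frame with an object centred in the middle of the image.
    print(responsible_cell(320, 240, 640, 480))
    print(keep_confident([
        {"label": "car", "score": 0.9},
        {"label": "person", "score": 0.2},
    ]))
```

Each cell only reasons about its own neighbourhood, which is why this style of model is fast but has no mechanism for relating distant parts of the scene.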

Second generation: Vision Transformer (ViT)

With the introduction of large-model concepts to vision, ViT raised the ceiling. It slices an image into patches and models the relationship between each patch and the rest of the image, much like reading comprehension. The report’s vivid description: "Seeing a 'cat ear' in the upper left, it instantly relates to the 'cat tail' in the lower right, even if they’re far apart."

ViT is more computationally intensive, which reinforces the logic of upgrading edge computing power: more compute now translates into greater capability, rather than "having compute power but no real capability gains."
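A toy version of the patch-and-relate step shows where the extra compute comes from. The sketch below uses pure Python on a small grayscale image and sets queries and keys equal to the raw patches; a real ViT adds learned projections, position embeddings, and many layers, so treat the names and simplifications as assumptions.

```python
import math


def patchify(image, patch):
    """Split a 2-D grayscale image (list of rows) into flattened patches."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be a multiple of the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([image[top + dy][left + dx]
                            for dy in range(patch) for dx in range(patch)])
    return patches


def attention_weights(tokens):
    """softmax(Q K^T / sqrt(d)) with Q = K = tokens (no learned projections)."""
    dim = len(tokens[0])
    weights = []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        peak = max(scores)                      # subtract max for stability
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


def attention_pairs(height, width, patch):
    """Number of patch-to-patch relations one attention layer scores."""
    n = (height // patch) * (width // patch)
    return n * n


if __name__ == "__main__":
    image = [[r * 4 + c for c in range(4)] for r in range(4)]
    tokens = patchify(image, 2)
    print(len(tokens), "patches of", len(tokens[0]), "pixels")
    print(attention_pairs(224, 224, 16), "pairs at 224x224")
    print(attention_pairs(448, 448, 16), "pairs at 448x448")
```

Every patch is compared with every other patch, so the number of relations grows with the square of the patch count: doubling the image side from 224 to 448 pixels at a 16-pixel patch size raises it from 38,416 to 614,656.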

Third generation: VLM → VLA → World model

Intelligent driving accelerated this evolution.

VLM (Vision-Language Model): Can understand images and translate them into semantic information, akin to a "co-pilot commentator" turning road conditions into machine-understandable "intelligence"
VLA (Vision-Language-Action Model): Adds an "action" dimension to VLM, issuing control commands directly from visual perception (such as "turn steering wheel left 10°" or "press accelerator 20%") and so achieving end-to-end control from eyes to hands and feet. Nvidia recently open-sourced its VLA model, Alpamayo
World model: Goes further, introducing prediction, previewing multiple possible next-second scenarios before acting—"generating future video frames to assess risks, thereby choosing the safest path out of countless 'parallel universes'"
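The "preview several futures, then pick the safest" loop described for world models can be sketched as follows. Everything here is a stand-in: a real world model generates future video frames, whereas this toy predicts only a lateral position, and the risk formula, lane width, and candidate actions are assumptions made for illustration.

```python
def rollout(lateral_m, steer_m_per_step, steps):
    """Hypothetical stand-in for a world model: predict future lateral positions."""
    return [lateral_m + steer_m_per_step * (t + 1) for t in range(steps)]


def risk(trajectory, obstacle_m, lane_half_width_m):
    """Score a predicted future: near misses and leaving the lane are penalised."""
    closest = min(abs(p - obstacle_m) for p in trajectory)
    off_lane = max(0.0, max(abs(p) for p in trajectory) - lane_half_width_m)
    return 1.0 / (closest + 1e-6) + 10.0 * off_lane


def choose_action(lateral_m, candidates, obstacle_m,
                  lane_half_width_m=1.5, steps=5):
    """Imagine each candidate's future and return the lowest-risk action."""
    scored = {
        action: risk(rollout(lateral_m, action, steps),
                     obstacle_m, lane_half_width_m)
        for action in candidates
    }
    return min(scored, key=scored.get)


if __name__ == "__main__":
    # Obstacle 0.5 m to the right of the current position (assumed scenario).
    best = choose_action(0.0, [-0.2, 0.0, 0.2], obstacle_m=0.5)
    print("chosen steering per step:", best)
```

The structure, not the numbers, is the point: prediction happens before action, so the cost of each decision includes several imagined rollouts rather than a single forward pass.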

Robot frontier: GEM model

Compared to intelligent driving, enabling robots to understand and interact with the physical world is orders of magnitude harder. The goal of smart cars is "to avoid interaction with the outside world"; robots must physically and verbally interact with the outside world in real-time.

The report believes GEM (Grounding Embedding Model) is a possible path to solve this challenge. Simply put, it can map a robot’s sensory data (camera frames, LiDAR point clouds) and high-level instructions ("hand me the blue cup") into the same feature space, so even if a robot hasn’t seen an object before, it can execute actions through semantic understanding. Google’s RT-2 model is exploring this, tokenizing images, actions, and language for alignment.
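The shared-feature-space idea can be illustrated with a toy matcher. The vectors below are hand-written stand-ins for what trained encoders would produce (the four dimensions loosely mean blue, red, cup-like, plate-like); the function names and values are assumptions, not part of any GEM or RT-2 implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors in the shared space."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def ground(instruction_vec, object_vecs):
    """Return the perceived object whose embedding best matches the instruction."""
    return max(object_vecs,
               key=lambda name: cosine(instruction_vec, object_vecs[name]))


if __name__ == "__main__":
    # Hypothetical embeddings of objects seen by the camera.
    objects = {
        "blue_cup": [1.0, 0.0, 1.0, 0.0],
        "red_cup": [0.0, 1.0, 1.0, 0.0],
        "blue_plate": [1.0, 0.0, 0.0, 1.0],
    }
    # Hypothetical embedding of the instruction "hand me the blue cup".
    instruction = [0.9, 0.0, 0.8, 0.1]
    print(ground(instruction, objects))
```

Because matching happens on meaning rather than on a fixed label set, an object never seen in training can still be selected if its embedding lands near the instruction, which is the property the report attributes to this approach.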

The report points out that the main pain points of GEM models lie in alignment of different modality signals, as well as catastrophic forgetting, modality gaps, etc. "It not only needs continual model engineering optimization, but in the future, also requires dedicated compute chip architectures for execution."

Chip competition: NPU hits its ceiling, GPGPU starts to penetrate downward

With model needs confirmed, chips are the final anchor. The report analyzes the strengths and weaknesses of NPU and GPGPU approaches in detail.

NPU: Built on YOLO, now faces architectural bottlenecks

The first wave of NPU growth came from YOLO models: security cameras and entry-level autonomous robots widely adopted NPU chips. The Rockchip RK series became mainstream thanks to its price and power advantages, with revenue growing from 1.298 billion yuan in 2016 to 4.402 billion yuan in 2025.

But in the era of large models, NPUs hit architectural constraints. In low-power settings like robot vacuums, running a ViT-based model instead of YOLO pushes compute demand toward 100 TFLOPS. More crucially, NPUs lack CUDA cores, so all commands come from the CPU, and limited power and cost budgets rule out high-performance CPUs: "if too many NPU cores are hung on a weaker CPU for AI tasks, AI instructions can hog all CPU bus traffic and crash the device."

Currently, there are two paths to break this deadlock:

Qualcomm Snapdragon IQ10: With better CPU and larger NPU cores, integrating some GPU task scheduling structures
Rockchip RK182x: Uses 3D-DRAM + coprocessor dual parallelism, stacking to increase NPU-memory bandwidth, and offloading AI inference from the main chip to relieve bus congestion

GPGPU: Inherited from the cloud, ecosystem advantage magnified

Compared with NPU, GPGPU’s edge-side path is smoother. Cloud GPGPUs are fully functional chips, so bringing them to the edge only means reducing size and core count as needed, without the overhaul that NPU faces.

Nvidia’s intelligent driving business revenue grew from $536 million in fiscal 2021 to $2.349 billion in fiscal 2026, with Orin and Thor product lines covering various price and compute segments.
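The revenue figures quoted for Rockchip and Nvidia can be put on a comparable footing as compound annual growth rates. The sketch below is plain arithmetic on the numbers in the text; counting 2016 to 2025 as 9 years and fiscal 2021 to fiscal 2026 as 5 years is an assumption about how the periods are measured.

```python
def cagr(start: float, end: float, years: int) -> float:
    """Compound annual growth rate between two revenue figures."""
    return (end / start) ** (1 / years) - 1


if __name__ == "__main__":
    # Rockchip revenue, billion yuan, 2016 -> 2025 (figures from the text).
    rockchip = cagr(1.298, 4.402, years=9)
    # Nvidia intelligent driving revenue, million USD, FY2021 -> FY2026.
    nvidia_auto = cagr(536, 2349, years=5)

    print(f"Rockchip CAGR: {rockchip:.1%}")
    print(f"Nvidia intelligent driving CAGR: {nvidia_auto:.1%}")
```

This works out to roughly 14.5% a year for Rockchip and roughly 34.4% a year for Nvidia’s intelligent driving business. The two series are in different currencies and cover different periods, so the comparison is indicative only.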

But GPGPU’s central advantage is not just hardware, but the ecosystem. The report points out that most edge-side models’ pre-training and fine-tuning rely on the CUDA ecosystem. "Using GPGPU’s compute at the edge significantly outpaces NPUs in deployment speed and effect, since the latter requires model translation." At the same time, Nvidia already has mature solutions for low precision inference like FP4, which can be pushed directly to the edge, while NPU is struggling to catch up.
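For readers unfamiliar with FP4, the sketch below shows what rounding to a 4-bit float involves, assuming the common E2M1 layout (1 sign bit, 2 exponent bits, 1 mantissa bit), whose representable magnitudes are 0, 0.5, 1, 1.5, 2, 3, 4, and 6. The single per-tensor scale is a simplification chosen for brevity; production schemes typically use finer-grained scaling, and nothing here is taken from Nvidia’s implementation.

```python
import math

# Magnitudes representable in FP4 E2M1 (sign is stored separately).
FP4_E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_fp4(values):
    """Round values to the nearest FP4 number after per-tensor scaling.

    Returns the dequantized values and the scale that maps the largest
    input magnitude onto 6.0, the largest FP4 magnitude.
    """
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 6.0 if max_abs > 0 else 1.0
    quantized = []
    for v in values:
        target = abs(v) / scale
        magnitude = min(FP4_E2M1_MAGNITUDES, key=lambda m: abs(target - m))
        quantized.append(math.copysign(magnitude, v) * scale)
    return quantized, scale


if __name__ == "__main__":
    weights = [6.0, -3.1, 0.2, 1.4]
    dequantized, scale = quantize_fp4(weights)
    print("scale:", scale)
    print("dequantized:", dequantized)
```

With only eight magnitudes available, small values collapse to zero and mid-range values snap to coarse steps, which is why mature tooling for calibrating and scaling low-precision inference counts as an ecosystem advantage.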

The analysts’ conclusion is that they are bullish on GPGPU’s increasing edge-side penetration rate. But Nvidia’s high prices mean it won’t be the only pick, which leaves room for Qualcomm (with SoC schemes fusing communications and computing) and domestic chip firms (competing on cost-effectiveness for the mass market).

Investment layout: three lines—chips, modules, storage

Analysts divide edge computing investment opportunities into three segments:

Chips: The link with the most value growth. Watch for NPU iteration and downward GPGPU penetration. The report highlights that the share of compute in edge-device cost will rise significantly, noting that "this logic is similar to cloud infrastructure."

Modules: Called "the middleman who wins come rain or shine." Edge-side clients are highly fragmented, and module firms serve as a bridge between upstream chips and millions of downstream users. No matter which chip route wins out, module makers benefit. In the IoT era, Chinese module companies already achieved the "east rises, west falls" pattern globally, and they should not miss out on this round of growth.

Storage: 3D-DRAM is a new direction that receives special mention. The inference capability of edge chips is likewise limited by memory size and bandwidth; 3D-DRAM stacks DRAM with the NPU, boosting bandwidth at low cost and low power.
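Why bandwidth matters so much can be shown with a common rule of thumb that is not taken from the report: when generating text token by token, a model must read roughly all of its weights once per token, so memory bandwidth divided by weight size gives an upper bound on tokens per second. The model size, precision, and bandwidth figures below are hypothetical.

```python
def weight_size_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate weight footprint in GB (1e9 parameters x bits / 8)."""
    return params_billion * bits_per_weight / 8


def max_decode_tokens_per_s(params_billion: float, bits_per_weight: int,
                            bandwidth_gb_per_s: float) -> float:
    """Upper bound on decode speed if every token reads all weights once."""
    return bandwidth_gb_per_s / weight_size_gb(params_billion, bits_per_weight)


if __name__ == "__main__":
    # Hypothetical 7B-parameter model stored at 4 bits per weight.
    for bandwidth in (35, 140):  # GB/s, illustrative values only
        rate = max_decode_tokens_per_s(7, 4, bandwidth)
        print(f"{bandwidth} GB/s -> at most {rate:.0f} tokens/s")
```

Under this approximation, quadrupling bandwidth quadruples the ceiling on generation speed without touching compute, which is the kind of gain stacked memory aims for.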

Risk warning and disclaimer

There are risks in the market, and investments should be made with caution. This article does not constitute personal investment advice, nor does it take into account the specific investment objectives, financial situation, or needs of individual users. Users should consider whether any opinions, views, or conclusions herein are suitable for their individual circumstances. Invest accordingly at your own risk.