Google releases its first native multimodal embedding model, Gemini Embedding 2

```

On March 10, Google DeepMind launched Gemini Embedding 2, the company's first native multimodal embedding model. It maps text, images, video, audio, and documents into a single unified embedding space, bringing all five modalities into one model rather than one model per modality.

Gemini Embedding 2 supports semantic understanding in more than 100 languages. According to Google, it surpasses mainstream models on text, image, and video benchmarks, and it adds speech processing, a capability that earlier embedding models lacked.

The model is now available in public preview through the Gemini API and Vertex AI, so developers can start using it right away.

For enterprise users, the model lowers the technical barrier to building multimodal Retrieval-Augmented Generation (RAG), semantic search, and data classification systems. It is expected to simplify data pipelines that previously needed a separate processing path for each modality.

Unified Across Modalities: Expanding from Text to Five Input Types

Gemini Embedding 2 is built on the Gemini architecture, extending embedding capabilities from pure text to five input types:

Text: up to 8192 input tokens;
Images: up to 6 per request, in PNG and JPEG formats;
Video: MP4 and MOV files of up to 120 seconds;
Audio: ingested directly and embedded without intermediate transcription to text;
Documents: PDF files of up to 6 pages, embedded directly.

Unlike traditional approaches that handle each modality separately, the model accepts interleaved inputs: a single request can combine several modalities, such as images and text. This lets the model capture complex and nuanced semantic relationships between different media types.

Gemini Embedding 2 continues to use Matryoshka Representation Learning (MRL), the technique featured in previous Google embedding models. MRL "nests" information so that the leading dimensions of a vector carry the most important signal, which allows the output size to be reduced flexibly from the default of 3072 dimensions and helps developers balance model performance against storage costs.

Benchmark Leadership, with Speech Capabilities as a New Highlight

Google stated that Gemini Embedding 2 outperforms mainstream competitors on text, image, and video benchmarks, and the company positions it as a new performance bar for multimodal embedding.

Google recommends that developers choose 3072, 1536, or 768 dimensions, depending on the application, to achieve the best embedding quality. This flexibility in output size matters most for enterprises that deploy embedding vectors at large scale, because it lets them control infrastructure costs without significantly sacrificing accuracy.

In terms of capability coverage, the model introduces native speech embedding, which earlier embedding models generally lacked. Audio is processed directly, with no speech-to-text conversion as an intermediate step.

Google points out that embedding technology is already widely used across its products, including context engineering for RAG, large-scale data management, and traditional search and analytics.

Early access partners have already begun building multimodal applications based on Gemini Embedding 2, and Google says these use cases are demonstrating the model's real potential in high-value scenarios.

```
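The per-modality input limits listed in the article can be checked on the client before a request is sent. The following is a minimal sketch under stated assumptions: `EmbeddingRequest` and `check_limits` are hypothetical names invented for illustration and are not part of the Gemini API or Vertex AI. The article also does not say whether the limits combine within one interleaved request, so each limit is checked independently.

```python
from dataclasses import dataclass, field
from typing import List

# Limits as stated in the article.
MAX_TEXT_TOKENS = 8192
MAX_IMAGES = 6
MAX_VIDEO_SECONDS = 120
MAX_PDF_PAGES = 6
IMAGE_FORMATS = {"png", "jpeg"}
VIDEO_FORMATS = {"mp4", "mov"}


@dataclass
class EmbeddingRequest:
    """Hypothetical description of one request; not an API type."""
    text_tokens: int = 0
    image_formats: List[str] = field(default_factory=list)  # one entry per image
    video_format: str = ""        # empty string means no video attached
    video_seconds: float = 0.0
    pdf_pages: int = 0


def check_limits(req: EmbeddingRequest) -> List[str]:
    """Return a list of human-readable problems; empty if the request fits."""
    problems: List[str] = []
    if req.text_tokens > MAX_TEXT_TOKENS:
        problems.append(
            f"text has {req.text_tokens} tokens, limit is {MAX_TEXT_TOKENS}")
    if len(req.image_formats) > MAX_IMAGES:
        problems.append(
            f"{len(req.image_formats)} images, limit is {MAX_IMAGES}")
    # Formats are compared by name, so "jpg" would need mapping to "jpeg" first.
    unsupported = sorted({f.lower() for f in req.image_formats} - IMAGE_FORMATS)
    if unsupported:
        problems.append("unsupported image formats: " + ", ".join(unsupported))
    if req.video_format and req.video_format.lower() not in VIDEO_FORMATS:
        problems.append(f"unsupported video format: {req.video_format}")
    if req.video_seconds > MAX_VIDEO_SECONDS:
        problems.append(
            f"video is {req.video_seconds} s, limit is {MAX_VIDEO_SECONDS} s")
    if req.pdf_pages > MAX_PDF_PAGES:
        problems.append(
            f"PDF has {req.pdf_pages} pages, limit is {MAX_PDF_PAGES}")
    return problems


if __name__ == "__main__":
    request = EmbeddingRequest(text_tokens=500, image_formats=["png"] * 7)
    for problem in check_limits(request):
        print(problem)
```

A check like this only mirrors the figures quoted in the article; the service itself remains the authority on what it accepts, and preview limits may change.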
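The trade-off between embedding dimensions and storage cost that the article attributes to Matryoshka Representation Learning can be illustrated with a short sketch. It assumes that a smaller embedding is obtained by keeping the leading components of the full vector and rescaling to unit length, which is the usual way MRL-style embeddings are shortened. The article does not say how the service itself produces reduced-size vectors. The function names are hypothetical, plain Python lists stand in for model output, and the 4 bytes per value (float32) figure is an assumption.

```python
import math
from typing import List, Sequence

# Output sizes named in the article.
RECOMMENDED_DIMS = (3072, 1536, 768)


def truncate_embedding(vec: Sequence[float], dim: int) -> List[float]:
    """Keep the first `dim` components and rescale the result to unit length."""
    if not 0 < dim <= len(vec):
        raise ValueError(f"dim must be between 1 and {len(vec)}")
    head = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # an all-zero prefix cannot be normalized
    return [x / norm for x in head]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def storage_bytes(num_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Raw vector storage, assuming float32 values unless told otherwise."""
    return num_vectors * dim * bytes_per_value


if __name__ == "__main__":
    # Raw storage for one million vectors at each recommended size.
    for dim in RECOMMENDED_DIMS:
        gigabytes = storage_bytes(1_000_000, dim) / 1e9
        print(f"{dim} dims: {gigabytes:.2f} GB")
```

Under the float32 assumption, one million vectors take about 12.3 GB at 3072 dimensions and about 3.1 GB at 768, a fourfold reduction. Whether the accuracy loss is acceptable depends on the application and should be measured on real data.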