Google shipped Gemini Embedding 2 last week — the first natively multimodal embedding model. Text, images, video, audio, PDFs: all mapped into a single vector space. Cross-modal retrieval now works without glue code. You query in text and retrieve a video timestamp. Same index, different media, no transcription step.
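It's worth being concrete about how little machinery this takes: cross-modal retrieval is nearest-neighbor search over one shared vector space. Here's a minimal sketch, with a deterministic toy `embed()` standing in for the real multimodal model (the file names, timestamps, and segment descriptions are all invented for illustration):

```python
import hashlib
import numpy as np

DIM = 64

def embed(item: str) -> np.ndarray:
    """Toy stand-in for a multimodal embedding call (hypothetical).
    A real model maps text, images, audio, and video segments into the
    same DIM-dimensional space; here we just hash the input to a seed."""
    seed = int.from_bytes(hashlib.sha256(item.encode()).digest()[:4], "big")
    v = np.random.default_rng(seed).standard_normal(DIM)
    return v / np.linalg.norm(v)

# The index: one vector per video segment, tagged with its timestamp.
segments = [
    ("lecture.mp4",  "00:04:10", "segment: deriving the loss gradient"),
    ("lecture.mp4",  "00:17:55", "segment: worked example on the board"),
    ("allhands.mp4", "00:02:30", "segment: quarterly revenue slide"),
]
index = np.stack([embed(desc) for _, _, desc in segments])

def search(query_text: str, k: int = 1):
    """Text in, (file, timestamp) out. Vectors are unit-norm, so a dot
    product is cosine similarity; ranking it is the whole retrieval step."""
    scores = index @ embed(query_text)
    top = np.argsort(scores)[::-1][:k]
    return [segments[i][:2] for i in top]
```

With a real model, the query and the indexed media never need to share a modality; the shared space is what removes the transcription step.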
Most of the coverage positioned this as a RAG upgrade. Better enterprise search, richer educational tools, more accessible knowledge management. All true.
That’s not what I keep thinking about.
The phrase that won’t leave me: this isn’t new capability. It’s new accessibility.
Intelligence agencies have had versions of this for years. Bespoke tooling, large budgets, controlled infrastructure, tight access. The step-change with Gemini Embedding 2 isn’t what you can do. It’s who can now do it. A city government. A landlord consortium. A mid-size corporation with a compliance program and a grudge. Anyone with a Vertex AI account and existing camera or audio infrastructure. The barrier dropped from nation-state budget to cloud subscription.
A few implications that haven’t gotten much coverage:
Retroactive surveillance. The most underappreciated one. You don’t need to know who you’re looking for at collection time. Collect everything now, index it natively, query years later when you have a reason. The archives of security footage, meeting recordings, and phone calls that already exist just became semantically searchable after the fact. Their value didn’t just persist — it went up.
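The mechanism is mundane: embeddings at rest are just an array on disk, and nothing about the stored vectors constrains what questions get asked of them later. A sketch, with random placeholder vectors standing in for a real embedded archive and the file name invented:

```python
import numpy as np

DIM = 64

# Collection time: embed the archive and store the vectors.
# No query exists yet; none is needed. (Random placeholders here.)
rng = np.random.default_rng(0)
archive = rng.standard_normal((10_000, DIM))
archive /= np.linalg.norm(archive, axis=1, keepdims=True)
np.save("footage_embeddings.npy", archive)

# Years later: load the index and ask a question nobody had
# formulated when the footage was collected.
index = np.load("footage_embeddings.npy")
query = rng.standard_normal(DIM)   # stand-in for embedding a new text query
query /= np.linalg.norm(query)
hits = np.argsort(index @ query)[::-1][:20]   # top-20 nearest segments
```

The collection step and the query step are completely decoupled; that decoupling is what makes the archive appreciate.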
Cross-modal identity tracking. Index voice, face, and gait captured across disparate camera and microphone networks. Query with any modality — a photo, a voice clip, a text description — and retrieve matches across all of them. No explicit biometric database required: a unified embedding index and nearest-neighbor search. This used to require bespoke database engineering and labeling at scale. Now it’s a query.
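A sketch of that unified index, assuming only that observations of the same person land near each other in the shared space. The toy `embed_observation()` below fakes that property with seeded noise, and every source label is invented; a real index would hold vectors computed from raw pixels and audio:

```python
import hashlib
import numpy as np

DIM = 64

def _vec(s: str) -> np.ndarray:
    seed = int.from_bytes(hashlib.sha256(s.encode()).digest()[:4], "big")
    return np.random.default_rng(seed).standard_normal(DIM)

def _unit(v: np.ndarray) -> np.ndarray:
    return v / np.linalg.norm(v)

def embed_observation(person: str, modality: str) -> np.ndarray:
    """Toy stand-in: a real model sees only raw pixels or audio, but
    observations of one person cluster in the shared space. We fake
    that by seeding on the person and adding a small modality offset."""
    return _unit(_unit(_vec(person)) + 0.2 * _unit(_vec(person + modality)))

# One flat index over face, voice, and gait observations from unrelated
# sources. (The person labels exist only to make this toy checkable;
# a real index stores just the source reference and the vector.)
observations = [
    ("cam-12 frame 0412", "face",  "person-A"),
    ("lobby mic 09:14",   "voice", "person-A"),
    ("cam-31 track 7",    "gait",  "person-B"),
    ("garage mic 18:02",  "voice", "person-B"),
]
index = np.stack([embed_observation(p, m) for _, m, p in observations])

def match(query_vec: np.ndarray, k: int = 2):
    """Nearest neighbors across every modality at once."""
    top = np.argsort(index @ query_vec)[::-1][:k]
    return [observations[i] for i in top]

# Query with one voice clip; retrieve that voice AND a face frame from
# a camera network the audio source knows nothing about.
probe = embed_observation("person-A", "voice")
```

There is no join, no schema, no enrollment step. The "identity" is just a neighborhood in the vector space.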
The enterprise version. Meeting recordings indexed for knowledge retrieval become queryable for things no one intended to be findable. You agreed to a recording. You didn’t agree to someone running a semantic search across it three years later asking “find all conversations where someone expressed doubt about leadership.”
Regulatory frameworks are nowhere near this. GDPR and CCPA were built around text, explicit data collection, and a consent model that assumes you know what you’re consenting to at the time of collection. Multimodal embeddings at rest don’t fit that model. When you agreed to a meeting being recorded, you didn’t agree to your voice being cross-referenced against facial recognition data from a camera system you didn’t know existed, in a query someone formulates years later.
The education and accessibility uses are real and genuinely exciting. A unified embedding space that lets you serve a Hindi video lecture to an English-speaking student — retrieve the semantically relevant Hindi segment, serve it with dubbing that preserves the original speaker’s voice — that’s the kind of thing that improves lives. I spent most of this week thinking about those applications.
The same embedding space makes everything above work. Same API call. Different actor, different intent.
I don’t have a clean resolution here. The tools ship regardless of whether regulation catches up. The cost drop happened. And the question of who gets to build surveillance infrastructure is now political in a way it wasn’t before — when only nation-states could afford it, oversight happened at the nation-state level. When anyone with a cloud account can do it, that framework breaks.
The people building better educational tools and the people building identity tracking systems are not different categories of people. They are using the same API call.