Google Unveils Gemma 3n: Powerful Multimodal AI for Mobile Devices

Google has officially launched Gemma 3n, its latest open multimodal AI model engineered specifically for mobile and edge devices. This release marks a significant milestone in bringing advanced AI capabilities directly to consumer hardware without requiring cloud processing.

Gemma 3n comes in two sizes based on effective parameters: E2B and E4B. While their raw parameter counts are 5B and 8B respectively, architectural innovations allow them to run with memory footprints comparable to traditional 2B and 4B models, operating with as little as 2GB (E2B) and 3GB (E4B) of memory. This efficiency is achieved through several technical innovations, including the MatFormer architecture and Per-Layer Embeddings.
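
To make the footprint claims concrete, here is a minimal sketch of loading the smaller E2B variant through Hugging Face Transformers, one of the launch frameworks mentioned below. The model id google/gemma-3n-E2B-it and the AutoModelForImageTextToText class are assumptions based on current Transformers conventions; verify both against the official model card.

```python
# A minimal sketch of loading the E2B variant with Hugging Face Transformers.
# The model id and auto-class below are assumptions; check the model card.
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "google/gemma-3n-E2B-it"  # assumed Hugging Face model id
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half precision keeps the footprint small
    device_map="auto",           # place weights on GPU/CPU automatically
)

inputs = processor(
    text="Explain on-device AI in one sentence.", return_tensors="pt"
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(outputs, skip_special_tokens=True)[0])
```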

The model is truly multimodal by design, natively supporting image, audio, video, and text inputs while generating text outputs. Its expanded audio capabilities enable high-quality automatic speech recognition (transcription) and translation from speech to text. Additionally, the model accepts interleaved inputs across modalities, enabling understanding of complex multimodal interactions.
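
As a sketch of what interleaved multimodal input can look like in practice, the snippet below passes an audio clip and a text instruction together using the chat-template pattern common to recent multimodal models in Transformers. The message schema, model id, and audio file are assumptions for illustration; the Gemma 3n model card documents the exact format.

```python
# Sketch: speech-to-text via interleaved audio + text input. The
# {"type": "audio", ...} message schema is an assumption modeled on the
# chat-template conventions of recent multimodal Transformers models.
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "google/gemma-3n-E2B-it"  # assumed model id
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{
    "role": "user",
    "content": [
        {"type": "audio", "audio": "meeting.wav"},  # hypothetical local file
        {"type": "text", "text": "Transcribe this recording."},
    ],
}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(outputs, skip_special_tokens=True)[0])
```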

For visual processing, Gemma 3n features a highly efficient vision encoder, MobileNet-V5-300M, delivering state-of-the-art performance for multimodal tasks on edge devices. This encoder natively supports multiple input resolutions (256x256, 512x512, and 768x768 pixels), excels at a wide range of image and video comprehension tasks, and can process up to 60 frames per second on a Google Pixel.
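
In most frameworks the processor resizes inputs automatically, but the standalone sketch below illustrates what targeting those resolutions looks like: it snaps a frame to the smallest supported square size that covers its longer edge. The selection heuristic is purely illustrative, not Gemma 3n's actual preprocessing.

```python
# Illustrative sketch: snap an image to one of the square resolutions the
# announcement lists for MobileNet-V5-300M. The selection heuristic is an
# assumption for illustration only.
from PIL import Image

SUPPORTED_SIZES = (256, 512, 768)  # per the Gemma 3n announcement

def fit_resolution(img: Image.Image) -> Image.Image:
    """Resize to the smallest supported square covering the longer edge."""
    longest = max(img.size)
    target = next((s for s in SUPPORTED_SIZES if s >= longest),
                  SUPPORTED_SIZES[-1])
    return img.resize((target, target), Image.LANCZOS)

frame = Image.open("frame.jpg").convert("RGB")  # hypothetical input file
print(fit_resolution(frame).size)  # e.g. (512, 512)
```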

The E4B version achieves an LMArena score above 1300, making it the first model under 10 billion parameters to cross that threshold. Gemma 3n also delivers quality improvements in multilinguality, supporting text in 140 languages and multimodal understanding in 35, along with stronger math, coding, and reasoning capabilities.

Privacy is a core benefit: local execution enables experiences that respect user privacy and function reliably even without an internet connection. The model was created in close collaboration with mobile hardware leaders such as Qualcomm Technologies, MediaTek, and Samsung's System LSI business, and is optimized for fast, responsive multimodal AI, enabling truly personal and private experiences directly on device.

The full release follows a preview at Google I/O in May 2025, with the model now available through popular frameworks including Hugging Face Transformers, llama.cpp, Google AI Edge, Ollama, and MLX. This comprehensive launch empowers developers to build a new generation of intelligent, on-device applications that can understand and respond to the world around them.
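
For a taste of local serving, the sketch below queries a Gemma 3n model through Ollama's standard local REST endpoint using only the Python standard library. It assumes Ollama is already running and that the model has been pulled under the tag gemma3n:e2b; the tag name is an assumption, so check Ollama's model library before use.

```python
# Sketch: querying a locally served Gemma 3n via Ollama's REST API
# (default endpoint http://localhost:11434/api/generate). The model tag
# "gemma3n:e2b" is an assumption; confirm it in the Ollama library.
import json
import urllib.request

payload = json.dumps({
    "model": "gemma3n:e2b",          # assumed tag for the E2B variant
    "prompt": "Summarize the benefits of on-device AI in two sentences.",
    "stream": False,                 # return one JSON object, not a stream
}).encode("utf-8")

request = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as response:
    print(json.loads(response.read())["response"])
```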
