Google has officially launched Gemma 3n, its latest open multimodal AI model engineered specifically for mobile and edge devices. This release marks a significant milestone in bringing advanced AI capabilities directly to consumer hardware without requiring cloud processing.
Gemma 3n comes in two sizes, named for their effective parameter counts: E2B and E4B. While their raw parameter counts are 5B and 8B respectively, architectural innovations allow them to run with memory footprints comparable to traditional 2B and 4B models, requiring as little as 2GB (E2B) and 3GB (E4B) of memory. This efficiency is achieved through several technical innovations, including the MatFormer architecture and Per-Layer Embeddings.
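The MatFormer ("Matryoshka Transformer") design nests a smaller, fully functional sub-model inside the weights of the larger one, which is how the E2B configuration can be extracted from E4B. The sketch below illustrates only that nesting idea, in plain NumPy with made-up dimensions; it is not Gemma 3n's actual implementation.

```python
import numpy as np

# Illustrative only: a feed-forward layer whose weights contain a smaller,
# fully functional sub-layer (the Matryoshka / MatFormer nesting idea).
# Dimensions are invented and unrelated to Gemma 3n's real configuration.
d_model, d_ff_large, d_ff_small = 64, 256, 128

rng = np.random.default_rng(0)
W_in = rng.standard_normal((d_model, d_ff_large)) * 0.02   # up-projection
W_out = rng.standard_normal((d_ff_large, d_model)) * 0.02  # down-projection

def ffn(x, w_in, w_out):
    """Plain transformer-style feed-forward block: GELU(x @ W_in) @ W_out."""
    h = x @ w_in
    h = 0.5 * h * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (h + 0.044715 * h**3)))  # tanh GELU
    return h @ w_out

x = rng.standard_normal((1, d_model))

# The "large" model uses the full hidden width ...
y_large = ffn(x, W_in, W_out)

# ... while the nested "small" model reuses only the first d_ff_small
# columns/rows, giving a cheaper forward pass without a second weight copy.
y_small = ffn(x, W_in[:, :d_ff_small], W_out[:d_ff_small, :])

print(y_large.shape, y_small.shape)  # both (1, 64)
```

Because the smaller model is literally a slice of the larger one's weights, a device can keep a single set of parameters and select the sub-model that fits its memory and latency budget.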
Gemma 3n is multimodal by design, natively accepting image, audio, video, and text inputs and generating text outputs. Its expanded audio capabilities enable high-quality automatic speech recognition (transcription) and speech-to-text translation. The model also accepts interleaved inputs across modalities, allowing it to understand complex multimodal interactions.
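In practice, interleaved input is expressed as a single conversation turn that mixes content types. The sketch below uses Hugging Face Transformers; the model class, Hub ID, file paths, and content-item keys follow the conventions on the Gemma 3n model card but should be treated as assumptions and checked against the official documentation.

```python
# Rough sketch of interleaved image + audio + text input via Transformers.
# Class name, Hub model ID, file paths, and content keys are assumptions
# based on the Gemma 3n model card; verify against the official docs.
import torch
from transformers import AutoProcessor, Gemma3nForConditionalGeneration

model_id = "google/gemma-3n-E2B-it"  # assumed Hub ID for the instruction-tuned E2B variant
processor = AutoProcessor.from_pretrained(model_id)
model = Gemma3nForConditionalGeneration.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.bfloat16
).eval()

# Image, audio, and text can be interleaved inside a single user turn.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "whiteboard_photo.jpg"},  # placeholder path
            {"type": "audio", "audio": "spoken_question.wav"},   # placeholder path
            {"type": "text", "text": "Answer the spoken question about this whiteboard."},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

with torch.inference_mode():
    output_ids = model.generate(**inputs, max_new_tokens=128)

# Decode only the newly generated tokens.
new_tokens = output_ids[:, inputs["input_ids"].shape[-1]:]
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])
```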
For visual processing, Gemma 3n features a highly efficient vision encoder, MobileNet-V5-300M, delivering state-of-the-art performance for multimodal tasks on edge devices. This encoder natively supports multiple input resolutions (256x256, 512x512, and 768x768 pixels), excels at a wide range of image and video comprehension tasks, and can process up to 60 frames per second on a Google Pixel.
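Framework processors normally handle image preprocessing automatically, but it can be useful to resize frames to one of the encoder's supported resolutions yourself, for example when sampling video. A minimal Pillow sketch, with a placeholder file path:

```python
from PIL import Image

# MobileNet-V5 accepts 256x256, 512x512, or 768x768 inputs; a simple
# pre-resize makes the chosen resolution explicit. Path is a placeholder.
SUPPORTED_SIZES = (256, 512, 768)

def resize_for_gemma3n(path: str, size: int = 768) -> Image.Image:
    assert size in SUPPORTED_SIZES, f"size must be one of {SUPPORTED_SIZES}"
    return Image.open(path).convert("RGB").resize((size, size), Image.Resampling.BICUBIC)

frame = resize_for_gemma3n("frame_0001.jpg", size=512)
print(frame.size)  # (512, 512)
```

Smaller inputs minimize per-frame latency, while 768x768 preserves more detail for tasks such as document or scene understanding.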
The E4B version achieves an LMArena score above 1300, making it the first model under 10 billion parameters to cross that threshold. Gemma 3n also improves multilingual quality, supporting 140 languages for text and multimodal understanding in 35 languages, alongside stronger math, coding, and reasoning capabilities.
Privacy is a central design goal: local execution lets applications respect user privacy and function reliably even without an internet connection. The model was developed in close collaboration with mobile hardware leaders such as Qualcomm Technologies, MediaTek, and Samsung's System LSI business, and is optimized for fast, multimodal AI that enables truly personal, private experiences directly on devices.
The full release follows a preview at Google I/O in May 2025, with the model now available through popular frameworks including Hugging Face Transformers, llama.cpp, Google AI Edge, Ollama, and MLX. This comprehensive launch empowers developers to build a new generation of intelligent, on-device applications that can understand and respond to the world around them.
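As a quick sanity check of local inference, the sketch below uses the Ollama Python client; it assumes Ollama is installed and that the model has been pulled under the gemma3n:e2b tag listed in Ollama's model library.

```python
# Minimal local-inference sketch via the Ollama Python client.
# Assumes Ollama is running and the model has been pulled beforehand,
# e.g. `ollama pull gemma3n:e2b`; the tag name is an assumption taken
# from Ollama's model library.
import ollama

response = ollama.chat(
    model="gemma3n:e2b",
    messages=[{"role": "user", "content": "In one sentence, why does on-device inference help privacy?"}],
)
print(response["message"]["content"])
```

Comparable quick starts exist for the other runtimes; Hugging Face Transformers, for instance, exposes the model through its standard processor and generation APIs, as shown earlier.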