New Gemma 4 AI models understand images and audio

Google DeepMind's new Gemma 4 models can process images, audio, and video as well as text, broadening what the open-weight family can handle compared with earlier Gemma releases.

The new Gemma 4 models offer expanded multimodal understanding and improved efficiency, and they target deployment on both personal devices and cloud platforms.

Google DeepMind has rolled out its latest suite of Gemma 4 models, marking a significant progression in open-weight large language model technology. The released models, including Gemma 4 31B IT Thinking, Gemma 4 26B A4B IT Thinking, Gemma 4 E4B IT Thinking, and Gemma 4 E2B IT Thinking, showcase enhanced capabilities in processing not only text but also images, video, and audio.

Expanded Multimodal Horizons

The Gemma 4 family distinguishes itself through its expanded multimodal functionality. All model sizes support images with variable aspect ratios and resolutions. The E2B and E4B variants additionally offer native audio and video processing, which broadens their range of applications. These capabilities position Gemma 4 for more complex tasks, including rich audio-visual understanding and agentic tool use, the latter measured on benchmarks such as τ2-bench.

Performance and Safety Gains

"Gemma 4 models significantly outperform Gemma 3 and 3n models in improving safety, while keeping unjustified refusals low."

Evaluations reveal substantial improvements in safety metrics compared to previous Gemma iterations. The models produced minimal policy violations in text-to-text and image-to-text tasks. This focus on safety, coupled with maintained or improved performance on diverse datasets such as MMMLU (Multilingual Q&A) and AIME 2026 Mathematics, underscores a commitment to responsible AI development.

Accessibility and Deployment

The Gemma 4 models are designed for flexible deployment. A JAX library, available on GitHub, supports using and fine-tuning the models on personal hardware, including CPUs, GPUs, and TPUs. Gemma 4 is also available on Google Cloud, integrated with services such as Vertex AI, Google Kubernetes Engine (GKE), and Google Compute Engine (GCE), giving developers several options for scaling their AI applications. Integration with Google ADK (Agent Development Kit) also lets developers build fully functional AI agents.

Development and Training

Gemma 4's capabilities are attributed to the quality and diversity of its training data, though specific details of the training dataset remain largely undisclosed. The model card also highlights the models' extensibility: native function calling support lets developers build autonomous agents that can plan, navigate applications, and complete tasks.
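To illustrate what function calling involves on the application side, here is a minimal Python sketch. It assumes, purely for illustration, that the model emits a tool call as a JSON object with "name" and "arguments" fields; the actual Gemma 4 call format may differ, and the get_weather tool and dispatch_tool_call helper are hypothetical names not taken from the model card.

```python
import json


def get_weather(city: str) -> str:
    """Toy stand-in for a real weather lookup (hypothetical tool)."""
    return f"Sunny in {city}"


# Hypothetical tool registry: maps a tool name to a plain Python callable.
TOOLS = {"get_weather": get_weather}


def dispatch_tool_call(model_output: str) -> str:
    """Parse a JSON tool call and run the matching registered function.

    Assumes the model emits {"name": ..., "arguments": {...}}.
    This format is an assumption, not Gemma 4's documented format.
    """
    call = json.loads(model_output)
    name = call["name"]
    if name not in TOOLS:
        raise ValueError(f"Unknown tool: {name}")
    return TOOLS[name](**call.get("arguments", {}))


# A hand-written string standing in for model output.
example_output = '{"name": "get_weather", "arguments": {"city": "Paris"}}'
result = dispatch_tool_call(example_output)
print(result)
```

In an agent loop, the returned string would typically be passed back to the model as the tool result so that it can plan its next step.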

Frequently Asked Questions

Q: What are the new Gemma 4 AI models?
Google DeepMind has released new Gemma 4 AI models that are better at understanding different types of information. They can now process text, images, audio, and video.
Q: How do the Gemma 4 models work with images, audio, and video?
The Gemma 4 models accept images with variable aspect ratios and resolutions. The E2B and E4B versions can also process audio and video natively, which supports more complex tasks.
Q: Are the new Gemma 4 models safer than older ones?
Yes, according to the reported evaluations. The Gemma 4 models produced fewer policy violations than the Gemma 3 and 3n models in tests while keeping unjustified refusals low.
Q: How can developers use the new Gemma 4 models?
Developers can run the Gemma 4 models on their own hardware or through Google Cloud services such as Vertex AI, GKE, and GCE. The JAX library supports using and fine-tuning the models, and Google ADK (Agent Development Kit) supports building AI agents.