Google Gemma 3 Models Now Support Images and Text for Developers in May 2025

Google's new Gemma 3 models, released in May 2025, can now understand both text and images, a big step up from older versions.

Google's Gemma series of open-weight large language models (LLMs) gives developers and researchers a broad set of options. Available in sizes from 270 million to 27 billion parameters, the models are engineered for diverse applications, including multimodal understanding and multilingual text. Gemma 1 debuted in February 2024, Gemma 2 followed in June 2024, and Gemma 3 arrived in May 2025, each generation bringing iterative improvements and expanded functionality.

The core appeal of Gemma lies in its open-weight nature: the weights are freely downloadable, so the models can run locally on personal devices such as phones, tablets, and laptops, broadening access to advanced AI capabilities. Local execution enables on-device embeddings, low-latency audio and visual understanding, and private agentic workflows. Developers can integrate Gemma models into applications through frameworks like Hugging Face and run them locally with tools such as Ollama.
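As a concrete illustration of local deployment, the sketch below builds a JSON request body for Ollama's local REST endpoint (`/api/generate`). The model tag and prompt are illustrative; actually sending the request assumes Ollama is installed and the model has been pulled (e.g. `ollama pull gemma3:4b`), so that step is left commented out.

```python
import json

def build_ollama_request(model: str, prompt: str, stream: bool = False) -> str:
    """Serialize a JSON body for Ollama's /api/generate endpoint."""
    return json.dumps({"model": model, "prompt": prompt, "stream": stream})

# Illustrative model tag and prompt (assumptions, not from the article):
body = build_ollama_request("gemma3:4b", "Explain open-weight models in one sentence.")

# With a local Ollama server running, the body can be POSTed like so:
# import urllib.request
# req = urllib.request.Request("http://localhost:11434/api/generate",
#                              data=body.encode(), method="POST")
# print(urllib.request.urlopen(req).read().decode())
```

Because the server listens on localhost, prompts and responses never leave the machine, which is the draw for the private workflows mentioned above.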



Architectural Underpinnings and Performance Metrics

Gemma's architecture draws on the technology behind Google's Gemini models. For instance, the original Gemma 7B model employs multi-head attention (MHA), while the Gemma 2B model uses multi-query attention (MQA). Technical reports detail performance metrics and model capabilities, and newer iterations like Gemma 3 emphasize more efficient attention mechanisms. Gemma 3 is available in 1B, 4B, 12B, and 27B parameter counts and supports multimodal input, processing both text and images through a dedicated vision encoder in its larger variants, such as gemma-3-4b-it, gemma-3-12b-it, and gemma-3-27b-it.
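To show what multimodal input looks like in practice, here is a minimal sketch of the chat-message structure Hugging Face transformers accepts for an image-text-to-text model such as gemma-3-4b-it. The image URL is a placeholder, and the pipeline call is commented out because it downloads model weights.

```python
# A multimodal user turn: one image plus a text instruction.
# The URL below is a hypothetical placeholder, not a real asset.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/reef.jpg"},
            {"type": "text", "text": "Describe what is in this image."},
        ],
    }
]

# Running inference requires the transformers library with Gemma 3 support
# and enough memory for the checkpoint, so it is left commented out:
# from transformers import pipeline
# pipe = pipeline("image-text-to-text", model="google/gemma-3-4b-it")
# print(pipe(text=messages, max_new_tokens=64)[0]["generated_text"])
```

The same message structure works for the 12B and 27B instruction-tuned variants; only the model identifier changes.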

Practical Implementations and Developer Ecosystem

The Gemma ecosystem is supported by official documentation, quickstarts, and developer forums, including a dedicated channel on Google Developers Discord. Developers have leveraged Gemma for specific use cases, such as building offline AI microservers for educational institutions with Lentera or improving Swahili language understanding with Crane AI Labs. Marine biologists and AI engineers have also partnered to develop specialized models, exemplified by the creation of 'DolphinGemma'.



Quantization and Efficiency Measures

Efforts to reduce computational load are evident throughout Gemma's development. Quantization techniques, including Quantization-Aware Training (QAT) for Gemma 3, produce smaller, more manageable checkpoints. For example, models can be loaded with 4-bit quantization to cut memory use and computation, as in deployments of the Gemma 7B Italian model. In addition, the input and output embedding layers are tied (shared) to compress the model, and Gemma 2 is noted for using deeper networks than its predecessor.
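A back-of-envelope calculation makes the appeal of 4-bit quantization concrete: weight storage scales linearly with bits per weight, so dropping from 16-bit to 4-bit cuts a 7B model's weight footprint roughly fourfold (ignoring activation memory and quantization overhead). The commented-out lines sketch how this is typically done with the transformers/bitsandbytes APIs; the model name there is illustrative.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight-only storage in decimal gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9

fp16_gb = weight_memory_gb(7e9, 16)  # 16-bit weights for a 7B model
int4_gb = weight_memory_gb(7e9, 4)   # the same weights quantized to 4 bits

# In practice, 4-bit loading is requested via a quantization config, e.g.:
# from transformers import AutoModelForCausalLM, BitsAndBytesConfig
# import torch
# model = AutoModelForCausalLM.from_pretrained(
#     "google/gemma-7b-it",  # illustrative checkpoint name
#     quantization_config=BitsAndBytesConfig(
#         load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16),
# )
```

The estimate explains why a 4-bit 7B model fits on consumer GPUs that cannot hold the full-precision weights.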

Specialized Variants and Broader Applications

Beyond general-purpose LLMs, Google has introduced specialized Gemma models. These include MedGemma 1.5 4B, designed for interpreting high-dimensional medical imaging, and PaliGemma, a vision-language model whose safety evaluations note low knowledge in CBRN (Chemical, Biological, Radiological, and Nuclear) domains. Other variants, such as CodeGemma, focus on code completion and generation across programming languages. The Gemma family also includes encoder-decoder architectures for stronger contextual comprehension and supports retrieval techniques that ground responses in real-world data.
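Grounding responses via retrieval usually comes down to prepending retrieved passages to the prompt before the model sees the question. The helper below is a minimal, model-agnostic sketch of that step; the function name, question, and passage are all hypothetical, and a real system would plug in an actual retriever and a Gemma inference call.

```python
def build_grounded_prompt(question: str, passages: list[str]) -> str:
    """Prepend retrieved passages so the model answers from supplied context."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

# Hypothetical retrieved passage and question for illustration:
prompt = build_grounded_prompt(
    "At what depth was feeding observed?",
    ["Survey notes: feeding observed at 10-30 m."],
)
```

The resulting string is what gets sent to the model in place of the bare question, constraining answers to the retrieved evidence.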


Development History and Iterations

The Gemma model family began with the first generation, introduced on February 21, 2024. Gemma 2 followed on June 27, 2024, focusing on improvements in practical model sizes. The most recent significant update, Gemma 3, was announced around May 2025, bringing substantial upgrades including multimodal capabilities and a wider range of parameter sizes. Each generation aims to refine the architecture, increase efficiency, and broaden the applicability of these open-weight models.

Frequently Asked Questions

Q: What new features do Google's Gemma 3 models have since their May 2025 release?
Google's Gemma 3 models, released around May 2025, now support multimodal inputs. This means they can understand and process both text and images together.
Q: How can developers use the new image and text features in Google Gemma 3 models?
Developers can use Gemma 3's multimodal abilities for tasks that need both visual and text information. This includes applications that analyze images and provide text-based answers or descriptions.
Q: What sizes are available for the new Google Gemma 3 models released in May 2025?
The Gemma 3 models released in May 2025 come in several sizes, including 1 billion, 4 billion, 12 billion, and 27 billion parameters, offering flexibility for different needs.
Q: Are Google's Gemma models like Gemma 3 available for use on personal devices?
Yes, Google's Gemma models are designed to be open-weight, meaning they can often be run on personal devices like phones, tablets, and laptops for local AI tasks.
Q: What kind of performance improvements does Gemma 3 offer compared to older Gemma models?
Gemma 3 adds more efficient attention mechanisms and multimodal capabilities, allowing it to process images alongside text, something earlier text-only versions like Gemma 1 and Gemma 2 could not do.