Ever since its launch, the Gemma family of models from Google has been warmly received by developers worldwide. Millions of downloads within months of release showcase the potential of these lightweight yet capable AI models. That momentum has spurred Google to keep upgrading the Gemma models for better performance and wider usability. In this article, we discuss the latest additions and upgrades to the Gemma family announced by Google: Gemma 2 and PaliGemma.
Gemma 2: The Next Generation Model
Gemma 2 is the next big thing in the Gemma lineup: a powerhouse boasting 27 billion parameters. What sets it apart is its efficiency: despite its size, Gemma 2 requires fewer computational resources than models of comparable capability, significantly reducing deployment costs. Still under development, it promises breakthrough performance built on a newly optimized architecture.
Key Features of Gemma 2
Some key features of Gemma 2 include:
1. Increased Size and Performance
At 27 billion parameters, Gemma 2 is designed to match the capabilities of models more than twice its size. Google positions it as comparable to Llama 3 70B while using roughly half the compute.
2. Lower Deployment Costs
Gemma 2’s efficient design allows it to run on less powerful hardware: it can operate on a single TPU host or on NVIDIA GPUs, making deployment affordable for a much wider range of users.
3. Versatile Tuning
Gemma 2 will support extensive fine-tuning across platforms and tools such as Google Cloud, Hugging Face, and NVIDIA TensorRT-LLM, and developers will be able to optimize it for their specific application needs (a hedged loading sketch follows this feature list).
4. Benchmark Performance
According to the benchmark comparison Google shared, Gemma 2 (27B) nearly matches Meta's Llama 3 (70B) on tasks like MMLU, HellaSwag, and GSM8K, and outperforms xAI's Grok-1 (314B) on MMLU and GSM8K despite being far smaller.
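Gemma 2 had not shipped at the time of writing, so any code here is necessarily speculative. The sketch below assumes it will keep the same Hugging Face Transformers interface as today's Gemma checkpoints; the model id is a placeholder, not a published repository.

```python
# Hypothetical sketch: loading and prompting Gemma 2 via Hugging Face Transformers.
# Assumes Gemma 2 reuses the existing Gemma interface; the model id is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2-27b-it"  # placeholder id, not yet published at the time of writing
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half-precision weights to reduce memory use
    device_map="auto",           # spread layers across available GPUs/TPU hosts
)

inputs = tokenizer("Explain what makes a language model efficient.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```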
Coming Soon
The official launch is slated for the coming weeks. So, stay tuned for more updates on Gemma 2’s capabilities and availability.
PaliGemma: A Powerful Vision-Language Model
Along with Gemma 2, Google is releasing PaliGemma, an advanced open-source Vision-Language Model (VLM) inspired by PaLI-3. PaliGemma is engineered to excel in a wide range of vision-language tasks. From image and short video captioning to visual question answering, understanding the text in images, object detection, and object segmentation, PaliGemma does it all.
Key Features of PaliGemma
1. Image Captioning
PaliGemma can generate natural-language descriptions of images when given a captioning prompt, producing concise summaries of visual content.
2. Visual Question Answering
It can comprehend an image and answer free-form questions about the objects, scenes, and details it contains, grounding its language understanding in visual content.
3. Object Detection
The model is trained to detect and localize objects in an image when prompted to do so, outputting bounding-box coordinates (encoded as special location tokens) for each detected object.
4. Referring Expression Segmentation
Going beyond detection, PaliGemma can also segment out specific objects referred to in an image using natural language phrases. This fine-grained understanding is useful for applications involving precise segmentation.
5. Document Understanding
Combining its image and text understanding, PaliGemma shows strong reasoning and comprehension over multimodal inputs that mix visuals with textual content, making it suitable for tasks that require joint vision-language inference. Each of these capabilities is selected purely through the text prompt, as the sketch below illustrates.
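The prompt strings below reflect the task conventions documented for the mix checkpoints; exact formats may differ between releases, so treat them as assumptions rather than a fixed API.

```python
# Illustrative PaliGemma task prompts (mix checkpoints); exact strings may vary.
prompts = {
    "caption":      "caption en",                          # COCO-style caption in English
    "describe":     "describe en",                         # longer, more detailed description
    "vqa":          "answer en what is the dog doing?",    # visual question answering
    "ocr":          "ocr",                                 # read text embedded in the image
    "detection":    "detect dog ; ball",                   # bounding boxes as <locXXXX> tokens
    "segmentation": "segment dog",                         # referring-expression segmentation
}
```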
How PaliGemma Works
PaliGemma takes an image and a text prompt as input and produces text as output. The prompt determines the task, so the same model can caption images and short videos, answer detailed, context-aware questions about an image, detect objects, and read text embedded within images.
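As a concrete illustration, here is a minimal inference sketch using the Hugging Face Transformers integration and the google/paligemma-3b-mix-224 checkpoint listed below; the image path is a placeholder, and the gated weights require accepting the license on the Hub.

```python
# Minimal PaliGemma inference sketch with Hugging Face Transformers (>= 4.41).
# Assumes you have accepted the model license and logged in to the Hub.
import torch
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma-3b-mix-224"
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
).eval()

image = Image.open("example.jpg")               # placeholder: any local RGB image
prompt = "answer en what is in this picture?"   # the task is chosen via the prompt

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
prompt_len = inputs["input_ids"].shape[-1]

with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=64, do_sample=False)

# Decode only the newly generated tokens, skipping the echoed prompt.
print(processor.decode(output[0][prompt_len:], skip_special_tokens=True))
```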
PaliGemma Models
The team at Google has released three types of models: pretrained (pt) models, mix models, and fine-tuned (ft) models, each available at different input resolutions and in multiple precisions for convenience. All of them are published in Hugging Face Hub repositories with model cards and licenses, and they have Transformers integration.
1. Pre-trained Models
The pre-trained image-text-to-text checkpoints are published in two formats: one in Transformers format and one for use with the big_vision research repository.
2. Fine-Tuned Models
Below are some of the fine-tuned PaliGemma models:
- paligemma-3b-ft-vqav2-448 (General Visual Question Answering on VQAv2)
- paligemma-3b-ft-cococap-448 (COCO Captions)
- paligemma-3b-ft-science-qa-448 (Science Question Answering)
- paligemma-3b-ft-refcoco-seg-896 (Referring Expression Segmentation)
- paligemma-3b-ft-rsvqa-hr-224 (Remote Sensing Visual Question Answering)
3. Mix Models
Below are the image-text-to-text mix models of PaliGemma:
- google/paligemma-3b-mix-224
- google/paligemma-3b-mix-448
- google/paligemma-3b-mix-224-jax
- google/paligemma-3b-mix-448-jax
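Because each checkpoint is tied to a specific input resolution and is published in several precisions, you choose both when loading. A hedged example with one of the fine-tuned repositories follows; the revision name is an assumption based on the per-precision branches used on the Hub.

```python
# Sketch: loading a fine-tuned PaliGemma checkpoint in half precision.
# The "bfloat16" revision is an assumed per-precision branch on the Hub.
import torch
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

ft_id = "google/paligemma-3b-ft-refcoco-seg-896"   # referring-expression segmentation, 896x896 input
processor = AutoProcessor.from_pretrained(ft_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    ft_id,
    torch_dtype=torch.bfloat16,   # keep weights in bfloat16 to roughly halve memory use
    revision="bfloat16",          # assumed branch holding the bfloat16 weights
    device_map="auto",
)
```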
How to Use PaliGemma
You can find PaliGemma on GitHub, Hugging Face Models, Kaggle, the Vertex AI Model Garden, and ai.nvidia.com, with easy integration through JAX and Hugging Face Transformers. Keras integration is also coming soon.
Hugging Face Demo
Additionally, an interactive demo is available as a Hugging Face Space, where you can explore PaliGemma’s image-to-text abilities.
In Conclusion
With Gemma 2 and PaliGemma, Google is raising the bar for open AI models. PaliGemma brings joint image and language understanding, while Gemma 2 pairs a larger model with greater efficiency to unlock more use cases. Together they give developers and researchers an expanded toolkit for building diverse multimodal applications.