Grok is xAI’s conversational AI assistant. With the recent release of Grok-1.5V, it has taken an important step towards multimodality by gaining the ability to interpret visual information. Let’s take a deeper look at the vision capabilities of Grok-1.5V and how it aims to connect the digital and physical worlds.
Grok-1.5 and Grok-1.5V
Grok-1.5 is xAI’s latest conversational AI model, and it improves significantly on the previous Grok-1 release. It has a much larger context window of 128,000 tokens and shows stronger reasoning and problem-solving across a range of tasks. While Grok-1.5 represents xAI’s state-of-the-art language model, Grok-1.5V builds on the same foundation and adds the ability to understand visual content.
Grok-1.5V Vision Capabilities
Grok could previously understand only text, but the latest version introduces core computer vision capabilities. Grok-1.5V can now process diverse visual inputs such as documents, diagrams, charts, screenshots, and photographs. It analyzes this visual content alongside its natural language understanding.
This multimodal capability allows Grok to have richer conversations that involve images. Additionally, it can answer questions about visual data or generate descriptions of images. Grok’s text and vision skills work in tandem to solve problems involving both modalities.
Example Vision Capabilities of Grok-1.5V
1. Writing Code From a Diagram
2. Mathematical Calculations From an Image
3. Explaining Things From a Screenshot
4. Solving a Coding Problem From a Screenshot
Grok-1.5V Performance on Different Benchmarks
To evaluate Grok-1.5V’s multidisciplinary reasoning abilities, xAI uses several benchmarks with datasets spanning multiple domains. So, let’s have a look at Grok’s multimodal performance on these notable benchmarks!
1. MMMU (multi-domain benchmark)
It evaluates understanding and reasoning across a broad range of subjects. Grok achieved a score of 53.6% here.
2. MathVista
It focuses on mathematical reasoning over visual inputs such as word problems and diagrams. Grok solved 52.8% of problems correctly, outpacing other state-of-the-art AI models like OpenAI’s GPT-4V, Anthropic’s Claude 3 Sonnet and Claude 3 Opus, and Google’s Gemini Pro 1.5.
3. AI2D
It features grade-school science diagrams with accompanying questions. Grok’s 88.3% accuracy shows its proficiency with diagrams, and it outperformed GPT-4V, Claude 3 Opus, and Gemini Pro 1.5 on this benchmark.
4. TextVQA
It assesses the ability to read and reason about text that appears in images. Grok led the pack with a 78.1% score, performing better than GPT-4V and Gemini Pro 1.5 on this task.
5. ChartQA
It evaluates the interpretation of charts. Grok solved 76.1% of questions correctly here.
6. DocVQA
It assesses comprehension of scanned documents. Grok got 85.6% of answers right, displaying sharp document intelligence.
The benchmarks indicate that Grok-1.5V is competitive with leading multimodal models. Its balanced mix of language, diagram, chart, and document skills is well suited to knowledgeable conversations involving various media.
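To compare the six results at a glance, the scores quoted in the list above can be ranked and averaged. The snippet below is only an illustrative sketch: the `summarize` helper and the unweighted mean are assumptions made here for convenience, not a metric that xAI reports, and the benchmarks differ in size and difficulty, so the mean is a rough summary at best.

```python
# Scores (in percent) exactly as quoted in the benchmark list above.
scores = {
    "Multi-domain": 53.6,
    "MathVista": 52.8,
    "AI2D": 88.3,
    "TextVQA": 78.1,
    "ChartQA": 76.1,
    "DocVQA": 85.6,
}


def summarize(scores):
    """Return the benchmarks ranked best-first and their unweighted mean.

    Hypothetical helper for illustration; not part of any xAI tooling.
    """
    ranking = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    mean = sum(scores.values()) / len(scores)
    return ranking, mean


ranking, mean = summarize(scores)
for name, pct in ranking:
    print(f"{name:<12} {pct:5.1f}%")
print(f"Unweighted mean: {mean:.1f}%")
```

The ranking makes the pattern in the numbers easy to see: the document and diagram benchmarks sit at the top, while the two reasoning-heavy benchmarks sit at the bottom.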
Grok-1.5V Performance on RealWorldQA Benchmark by xAI
While AI models excel at synthetic tasks, grasping our real world remains a challenge. So, to push the frontier, xAI created RealWorldQA – a benchmark focusing on basic physical commonsense.
RealWorldQA contains over 700 images depicting everyday scenes and a question about each image. Sample questions test spatial reasoning abilities like estimating object sizes or identifying cardinal directions.
Grok-1.5V answered 68.7% of RealWorldQA questions correctly, outdoing the peer models it was compared against. Though room for progress remains, this benchmark should help in assessing and developing assistants with stronger real-world problem-solving abilities. Overall, it aims to encourage AI that can truly comprehend our physical environment.
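A percentage like the one above is an accuracy: the share of questions whose answer was judged correct. The article does not describe how xAI scores answers, so the following is a minimal sketch that assumes simple exact-match scoring after light normalization. The `normalize` and `accuracy` helpers and the four sample questions are hypothetical and are not taken from RealWorldQA.

```python
def normalize(answer: str) -> str:
    """Lowercase, collapse whitespace, and drop trailing periods."""
    return " ".join(answer.lower().split()).rstrip(".")


def accuracy(predictions, references):
    """Fraction of predictions that match their reference answer exactly
    after normalization (assumed scoring rule, for illustration only)."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one question is required")
    correct = sum(
        normalize(pred) == normalize(ref)
        for pred, ref in zip(predictions, references)
    )
    return correct / len(references)


# Hypothetical toy data: one reference answer per image question.
references = ["left", "yes", "the red car", "north"]
predictions = ["Left", "no", "The red car.", "north"]

score = accuracy(predictions, references)
print(f"Accuracy: {score:.1%}")  # prints "Accuracy: 75.0%"
```

Exact match is strict: a correct answer phrased differently from the reference counts as wrong, which is one reason real evaluations often constrain answers to a short, fixed format.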
The Road Ahead
Looking forward, both Grok’s understanding and its generative abilities across visual and other modalities will be continually strengthened. The goal is to build beneficial general artificial intelligence with comprehensive multi-sensory skills.
xAI also aims to grow RealWorldQA to cover more challenging types of physical inference. As these models and benchmarks evolve in tandem, so will our prospects of developing broadly intelligent AI with practical uses. Exciting times ahead!