
OpenAI Drops GPT-4o: An Omnimodel with Integrated Vision and Audio Abilities, And It’s Free via ChatGPT

At its Spring Update event today, OpenAI announced the release of its most advanced model to date: GPT-4o. The new state-of-the-art model improves on previous iterations with enhanced multimodal capabilities across text, images and speech. A key highlight of the announcement was OpenAI’s decision to make GPT-4o freely available through ChatGPT, marking the first time the company has offered its frontier model to all users at no cost. Let’s get started.

OpenAI Drops GPT-4o

OpenAI’s newest generative model, GPT-4o, represents a major step toward fully immersive, human-like dialogue. What sets GPT-4o apart is its “omnimodal” architecture: it can accept and generate any combination of text, audio, images and video within a single neural network. Previous systems required separate models to handle different modalities, whereas GPT-4o processes all inputs and outputs with the same underlying network. It can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to human response time in a conversation.

Multi-Modal Reasoning Abilities of GPT-4o

GPT-4o represents a significant leap over previous models with its ability to reason across different modalities. GPT-4o can understand speech, analyze images and videos, and utilize this multi-modal knowledge to respond intelligently regardless of input type.

Some examples of its multi-modal abilities include:

1. Natural Conversational Ability Incorporating Speech

Users can chat with GPT-4o through voice, and it will respond verbally, adjusting voice attributes like pitch, speed and emotion.

2. Image recognition and description

The model can examine visual content, describe objects and extract relevant text from photographs or videos.

3. Cross-modality understanding

GPT-4o comprehends how different media types relate, allowing it to discuss topics that integrate audio, visual and text-based knowledge (see the sketch after this list).
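
The sketch below shows what such an image-plus-text request to GPT-4o could look like with OpenAI’s official Python SDK; the image URL and the prompt are placeholders for illustration, not taken from the announcement.

```python
# Minimal sketch: ask GPT-4o to describe a photo and read any text it contains.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this photo and transcribe any visible text."},
                # Placeholder URL; a base64-encoded data URL also works here.
                {"type": "image_url", "image_url": {"url": "https://example.com/storefront.jpg"}},
            ],
        }
    ],
)

print(response.choices[0].message.content)
```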

Upgraded ChatGPT Experience With GPT-4o

OpenAI plans to iteratively roll out GPT-4o across its products, starting with enhanced capabilities for the popular ChatGPT conversational assistant. Users can now access ChatGPT through a new desktop app (initially for macOS) in addition to the mobile and web versions.

Some ChatGPT upgrades include:

1. Fluid Multimodal Conversations

It seamlessly transitions between text, images and voice as input/output without breaking the dialogue flow.

2. Customized Emotional Responses

The AI assistant conveys emotions like happiness, sadness or concern through tone of voice depending on context.

3. Full Understanding Through Visual Context

It provides responses drawing on visually-derived facts rather than just text.

Demonstrations of GPT-4o Capabilities

OpenAI showcased GPT-4o’s abilities through a variety of demonstrations on their website. These included real-time translation between languages, harmonizing songs, assisting customers and translating Spanish phrases while pointing to images.

Impressively, GPT-4o was also able to hold a fluent sign language conversation with a deaf user via webcam. By taking in visual inputs, it could understand gestures directly without relying on a transcription intermediary.

GPT-4o even showed basic creativity – in one example, it collaborated with a human to develop an original bedtime story together through language, images and audio. Its omnimodality clearly unlocks new AI applications.

GPT-4o Performance Testing

OpenAI conducted extensive testing to evaluate GPT-4o’s capabilities across modalities compared to prior models. Here are some of the key results:

1. Text Performance

On standard text and reasoning benchmarks, GPT-4o matched GPT-4 Turbo. This included general language understanding and answering open-domain questions.

2. Speech Recognition

Additionally, it dramatically outperformed prior models like Whisper on automatic speech recognition (ASR) across languages. It showed especially strong gains for under-resourced languages with limited training data.

3. Speech Translation

GPT-4o set new state-of-the-art results on speech translation, surpassing Whisper on the MLS benchmark for translating spoken languages.

4. Multilingual Testing

The M3Exam evaluation – containing multilingual, multiple choice questions with diagrams – saw it outperform GPT-4 in all assessed languages, indicating stronger multilingual skills.

5. Visual Perception

Moreover, it achieved top performance on standard image understanding benchmarks, showcasing its ability to derive meaning from visual content.

Through rigorous quantitative and qualitative evaluations, OpenAI confirmed that this model not only matches but exceeds prior models in key modalities.

GPT-4o Performance on LMSys

A version of GPT-4o was tested on the LMSys conversational AI platform under the name “im-also-a-good-gpt2-chatbot”.

Its performance was evaluated against other models using the Elo rating system. Some key points:

  • It achieved an Elo rating of over 1300, significantly higher than rival models.
  • The chart shows its rating surpassing models such as GPT-4-1106-preview and gemini-1.5-pro-api, among others.
  • Elo gains tend to plateau on easier prompts, but GPT-4o pulled further ahead on harder ones.
  • On coding challenges in particular, it demonstrated roughly a +100 Elo improvement over the previous best model (an illustrative Elo update is sketched after this list).
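
For readers unfamiliar with how these scores work, here is an illustrative Elo update in Python; the K-factor and ratings below are hypothetical examples, not LMSys’s actual parameters or leaderboard values.

```python
# Illustrative Elo rating update for a head-to-head model comparison.
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_ratings(r_a: float, r_b: float, score_a: float, k: float = 32.0) -> tuple[float, float]:
    """Return new (r_a, r_b); score_a is 1.0 for a win, 0.5 for a tie, 0.0 for a loss."""
    e_a = expected_score(r_a, r_b)
    new_a = r_a + k * (score_a - e_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b

# Hypothetical example: a 1300-rated model beats a 1250-rated one and gains ~13.7 points.
print(update_ratings(1300, 1250, score_a=1.0))
```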

This quantitative user testing on LMSys provides compelling evidence of its advanced conversational abilities and problem-solving skills compared to rival systems. 

Upgraded Voice Mode

One of the most exciting developments with GPT-4o is the impact it will have on Voice Mode – ChatGPT’s voice conversation capability.

Previously, with GPT-3.5 and GPT-4, Voice Mode suffered from slow response times and lost context due to its multi-step pipeline: audio inputs were first transcribed to text for the models to understand, and the text responses then had to be converted back to audio by a separate text-to-speech model. Along the way, nuances such as tone and background sounds were lost, and the model could only reply with plain spoken text.
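
For context, that older flow looked roughly like the sketch below, written against the OpenAI Python SDK; the file names are placeholders and the exact models OpenAI chained internally are not public, so treat this as a rough approximation rather than the actual implementation.

```python
# Rough sketch of the legacy three-step Voice Mode pipeline: speech-to-text,
# text-only reasoning, then text-to-speech. Each hop adds latency and drops
# audio context such as tone, multiple speakers or background sounds.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# 1. Transcribe the user's spoken question (placeholder file name).
with open("user_question.mp3", "rb") as audio_in:
    transcript = client.audio.transcriptions.create(model="whisper-1", file=audio_in)

# 2. Answer it with a text-only chat model.
reply = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": transcript.text}],
)

# 3. Synthesize the text answer back into speech with a separate TTS model.
speech = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input=reply.choices[0].message.content,
)
speech.write_to_file("assistant_reply.mp3")
```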

However, GPT-4o’s unified handling of text, audio and vision will revolutionize Voice Mode. Because the model is trained end-to-end across modalities, the full conversation context remains intact, and response times have improved significantly as well.

Most exciting of all, GPT-4o can now understand nuances in audio inputs and generate laughter, singing, sound effects and more in its replies. New applications will emerge as Voice Mode gains a virtually seamless audio-text-visual exchange.

How to Access OpenAI GPT-4o

GPT-4o is being made widely available in two ways:

1. Through ChatGPT

All existing and new ChatGPT users can now access its omnimodel capabilities for free, subject to usage limits; Plus users get up to 5x higher message limits. OpenAI will roll out a new version of Voice Mode with GPT-4o in alpha within ChatGPT Plus in the coming weeks.

2. OpenAI API

Developers can also access GPT-4o’s abilities via the OpenAI API to build customized applications that combine different mediums. Compared with GPT-4 Turbo, it is significantly cheaper and supports higher rate limits, which should encourage innovation.
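
For developers already using the Chat Completions API, adopting GPT-4o can be as simple as changing the model name. Here is a minimal sketch with the official Python SDK; the prompt is a placeholder.

```python
# Minimal sketch: a streamed text response from GPT-4o via the Chat Completions API.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

stream = client.chat.completions.create(
    model="gpt-4o",  # previously e.g. "gpt-4-turbo"
    messages=[{"role": "user", "content": "Explain in two sentences what an omnimodel is."}],
    stream=True,  # stream tokens back as they are generated
)

for chunk in stream:
    # Each chunk carries a small delta of the response text; the final delta is empty.
    print(chunk.choices[0].delta.content or "", end="")
print()
```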

Safety as a Top Priority

Of course, with greater capabilities comes greater responsibility. OpenAI has focused intensely on ensuring GPT-4o is aligned with human values and avoids potential downsides. They applied their Preparedness Framework to evaluate risks like cybersecurity throughout development.

Extensive training refinements and human evaluations helped shape its behavior. For example, audio outputs are currently limited to a selection of preset voices to prevent misuse. And external reviews involving more than 70 experts identified additional risks to mitigate.

OpenAI has also shared its safety research to invite collaboration. Continued oversight will be vital as new versions explore GPT-4o’s full potential while safeguarding its role as an AI helper, not a hindrance.

Final Verdict

Overall, GPT-4o shows great promise for enhancing how people and AI can work together. Its seamless integration of text, audio and visual abilities could make conversation with intelligent systems as natural as talking with another person. By bringing such a powerful tool to all users for free, OpenAI aims to spread these benefits as widely as possible. If responsibly guided, OpenAI’s new omnimodel may come to serve countless valuable functions.
