In an exciting leap forward for artificial intelligence, OpenAI is proud to announce GPT-4o, our new flagship model designed to seamlessly integrate and process text, audio, and image inputs and outputs in real time. This groundbreaking advancement brings us closer to truly natural human-computer interaction.
Unveiling GPT-4o
GPT-4o, where the “o” stands for “omni,” is engineered to handle any combination of text, audio, and visual data, offering unprecedented versatility and responsiveness. The model can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is comparable to human response times in conversation. It matches GPT-4 Turbo performance on English text and code, while significantly improving on non-English languages, vision, and audio comprehension. It is also much faster and 50% cheaper in the API.
Model Capabilities
GPT-4o’s capabilities are vast and varied, demonstrating a range of sophisticated tasks:
- Interaction and Entertainment: Two GPT-4os can sing together interactively, and the model can help users prepare for interviews, play games like Rock Paper Scissors, and even understand and deliver sarcasm.
- Educational Support: It can walk students through math problems, as demonstrated with Sal Khan and his son Imran; harmonize in song; help users learn Spanish through a "point and learn" feature; and provide real-time translation.
- Daily Assistance: From crafting lullabies and birthday songs to telling dad jokes and engaging in customer service proofs of concept, GPT-4o showcases its utility in everyday scenarios.
Voice Mode Advancements
Previous iterations of Voice Mode relied on a pipeline of three separate models: one transcribed audio to text, GPT-3.5 or GPT-4 produced a text response, and a third converted that text back to audio. Because the main model only ever saw plain text, the pipeline added latency and lost information along the way. GPT-4o instead handles all audio, text, and visual inputs and outputs within a single neural network. This integration preserves nuances such as tone, the presence of multiple speakers, and background noise, enabling more natural and expressive interactions, including laughter and singing.
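To make the contrast concrete, here is an illustrative sketch of the older three-model approach built from existing API endpoints (Whisper transcription, a chat model, and text-to-speech); it is not OpenAI's actual Voice Mode implementation, and the file names are placeholders. Only plain text crosses each stage boundary, which is precisely where tone, speaker identity, and background sound get lost.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# 1) Speech -> text: a separate transcription model turns audio into plain text.
with open("user_question.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(model="whisper-1", file=audio_file)

# 2) Text -> text: the language model only ever sees the transcript.
chat = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": transcript.text}],
)
reply_text = chat.choices[0].message.content

# 3) Text -> speech: a third model re-synthesizes audio from the reply text.
speech = client.audio.speech.create(model="tts-1", voice="alloy", input=reply_text)
speech.stream_to_file("assistant_reply.mp3")
```

By contrast, GPT-4o's single end-to-end network removes these hand-offs, which is what enables its lower latency and more expressive audio output.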
Explorations of Capabilities
Visual Narratives
Imagine a robot writing journal entries that include visual and sound updates, creating rich, immersive narratives.
Model Evaluations
GPT-4o achieves GPT-4 Turbo-level performance in text, reasoning, and coding tasks, with superior performance in multilingual, audio, and vision tasks.
Benchmark Highlights
- Improved Reasoning: The model sets a new high score of 88.7% on 0-shot chain-of-thought MMLU, showcasing its enhanced reasoning capabilities.
- Audio ASR and Translation Performance: GPT-4o outperforms Whisper-v3 in both speech recognition and translation tasks.
- Vision Understanding: It achieves state-of-the-art performance on visual perception benchmarks.
Language Tokenization
The new tokenizer substantially reduces the number of tokens required for many non-English languages, which lowers cost and latency and makes interactions smoother and faster.
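For a rough sense of the difference in practice, the sketch below (assuming the `tiktoken` package, which ships both encodings) compares how many tokens the same short sentence costs under the GPT-4 Turbo tokenizer (`cl100k_base`) versus GPT-4o's newer `o200k_base` tokenizer; the sample sentences are arbitrary examples.

```python
import tiktoken

old_enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by GPT-4 / GPT-4 Turbo
new_enc = tiktoken.get_encoding("o200k_base")   # tokenizer used by GPT-4o

samples = {
    "English": "Hello, how are you today?",
    "Hindi": "नमस्ते, आप आज कैसे हैं?",
    "Gujarati": "નમસ્તે, આજે તમે કેમ છો?",
}

# Print the token count under each encoding; non-English text typically shrinks the most.
for language, text in samples.items():
    print(f"{language}: {len(old_enc.encode(text))} -> {len(new_enc.encode(text))} tokens")
```

Fewer tokens for the same text translates directly into lower cost and faster responses for speakers of those languages.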
Model Safety and Limitations
GPT-4o incorporates safety measures across all modalities, employing data filtering, post-training refinement, and external red teaming. Under OpenAI's Preparedness Framework, evaluations of cybersecurity, CBRN (chemical, biological, radiological, and nuclear) threats, persuasion, and model autonomy show no score above Medium risk. At launch, the model supports text and image inputs with text outputs, and audio outputs are limited to a selection of preset voices to maintain control and safety.
Model Availability
GPT-4o is now available, bringing its advanced capabilities and improved efficiency to users. Text and image features are rolling out in ChatGPT, including in the free tier, with up to 5x higher message limits for Plus users. A new Voice Mode powered by GPT-4o will enter alpha testing for ChatGPT Plus subscribers in the coming weeks. Developers can also access GPT-4o in the API as a text and vision model, where it is 2x faster, half the price, and has 5x higher rate limits compared to GPT-4 Turbo.
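As a minimal sketch of API access using the official `openai` Python SDK (the prompt and image URL are placeholders), a single Chat Completions call can combine text and image inputs:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is happening in this picture?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }
    ],
)

# The model replies with text describing the image.
print(response.choices[0].message.content)
```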
Explore the transformative potential of GPT-4o and stay tuned for further updates on its audio and video capabilities. This is just the beginning of a new era in real-time human-computer interaction.