New ChatGPT Capabilities: see, hear, and speak

OpenAI has introduced new voice and image capabilities in ChatGPT. Users can now hold a spoken conversation with ChatGPT and show it images, which makes the interface more interactive and widens the range of tasks it can help with.

For instance, while traveling, users can snap a photo of a landmark and have a live conversation about its significance. At home, they can photograph what is in their kitchen to brainstorm dinner ideas or get guidance while cooking. For academic support, they can photograph a math problem and receive assistance.

Over the next two weeks, OpenAI will roll out these voice and image features to ChatGPT Plus and Enterprise subscribers. Voice will be available on iOS and Android as an opt-in setting, while image input will be available on all platforms.

Engaging in Voice Conversations with ChatGPT

Users can now talk with ChatGPT and hear it reply. This is convenient on the go, and it can be used to request a narrated bedtime story or to settle a debate at the dinner table. To activate voice, users should navigate to Settings → New Features in the mobile application and opt in to voice conversations. Tapping the headphone icon at the top right of the main screen then lets users pick from five distinct voices.

The voice feature is powered by a new text-to-speech model that generates human-like audio from just text and a few seconds of sample speech. OpenAI worked with professional voice actors to create each of the voices. User speech is transcribed to text with Whisper, OpenAI’s open-source speech recognition system.
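The flow described above (speech recognized to text, a text reply generated, the reply rendered as audio) can be sketched as a three-stage pipeline. This is only an illustration under stated assumptions: the names `VoiceTurn`, `run_voice_turn`, and the three `fake_*` stand-ins are hypothetical, and nothing here calls Whisper, ChatGPT, or OpenAI's text-to-speech model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoiceTurn:
    """One round trip of a voice conversation (hypothetical structure)."""
    transcript: str     # what the speech recognizer heard
    reply_text: str     # the assistant's text reply
    reply_audio: bytes  # the reply rendered as audio


def run_voice_turn(
    audio_in: bytes,
    transcribe: Callable[[bytes], str],
    respond: Callable[[str], str],
    synthesize: Callable[[str, str], bytes],
    voice: str,
) -> VoiceTurn:
    """Chain the three stages: speech-to-text, text reply, text-to-speech."""
    transcript = transcribe(audio_in)
    reply_text = respond(transcript)
    reply_audio = synthesize(reply_text, voice)
    return VoiceTurn(transcript, reply_text, reply_audio)


# Stand-ins so the sketch runs without any model or network access.
def fake_transcribe(audio: bytes) -> str:
    # Assumption: pretend the audio bytes are already UTF-8 text.
    return audio.decode("utf-8")


def fake_respond(text: str) -> str:
    # Assumption: echo the request instead of calling a language model.
    return f"You said: {text}"


def fake_synthesize(text: str, voice: str) -> bytes:
    # Assumption: tag the text with the chosen voice instead of producing audio.
    return f"[{voice}] {text}".encode("utf-8")


turn = run_voice_turn(
    b"tell me a bedtime story",
    fake_transcribe,
    fake_respond,
    fake_synthesize,
    voice="voice-1",
)
print(turn.transcript)
print(turn.reply_audio.decode("utf-8"))
```

Keeping each stage behind a plain function makes it possible to swap in a real recognizer or synthesizer later without changing the orchestration code.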

Discussing Images with ChatGPT

Users can share pictures with ChatGPT. This can be helpful in situations like diagnosing grill issues, planning meals, or interpreting intricate work-related graphics. A drawing utility in the mobile application lets users highlight certain parts of their shared images.

To share an image, users tap the camera icon to capture or choose a photo; on iOS or Android, they tap the ‘+’ symbol first. ChatGPT can discuss multiple images in one conversation, and the drawing tool helps direct its attention to a specific part of an image. Image understanding is powered by multimodal GPT-3.5 and GPT-4, which apply their language reasoning skills to a wide range of images, such as photographs, screenshots, and documents containing both text and images.

Gradual Deployment of New Features

OpenAI’s stated goal is to build AGI that is safe and beneficial. Rolling features out gradually lets the company make improvements and refine risk mitigations over time, which matters especially for advanced models involving voice and vision.

Voice chat, developed in partnership with professional voice actors, has many potential applications. The same technology also poses risks, such as malicious actors impersonating public figures or committing fraud. For that reason, OpenAI is limiting it to a specific use case, voice chat. It is also working with selected partners in a similar way: Spotify, for example, is using the technology to pilot its Voice Translation feature.

The Vision Challenge

Image-based models carry their own set of challenges, from hallucinations about people to reliance on the model’s interpretation of images in high-stakes domains. OpenAI has tested the model extensively and gathered feedback to define the parameters of responsible use.

Like other ChatGPT features, vision is meant to assist users in their daily lives, and it works best when it can see what the user sees. This approach was informed by OpenAI’s collaboration with Be My Eyes, an application that assists blind and low-vision people. Because ChatGPT is not always accurate and should respect individuals’ privacy, OpenAI has also taken technical measures to significantly limit its ability to analyze and make direct statements about people.

Feedback will be crucial in refining these safety measures.

Model Limitations

Users should understand that while ChatGPT excels in certain domains, it has limitations. For instance, the model transcribes English text well but performs poorly with some other languages, especially those written in non-Latin scripts. Non-English users are advised not to rely on ChatGPT for this purpose.

Additional details on safety approaches and collaboration with Be My Eyes are accessible in the image input system card.

Expansion Plans

Plus and Enterprise subscribers can expect access to these voice and image tools within the next two weeks. OpenAI plans to offer them to a wider user base, including developers, soon after.

The next leap could be integrating text-to-image and text-to-video generation tools (of the kind Midjourney and Pika Labs offer) directly into the ChatGPT interface.