Gemini 3 Flash demos audio transcription, speech, and live music generation
A talk by Thor Schaeff at AI Engineer showcases Gemini's full audio stack — from rich transcription in a single API call to real-time multimodal speech and on-stage music generation via Lyria 3.
The demo illustrates that Gemini's audio stack now spans transcription, expressive speech synthesis, real-time sound-to-sound interaction, and full-song music generation — all accessible through a unified API with tool-use integration.
Thor Schaeff's talk at AI Engineer walks through Google DeepMind's audio stack in three layers. The foundation is Gemini 3 Flash Preview's audio understanding: a single API call returns speaker labels by name, timestamps, emotion tags, language detection with English translation, and a full summary. Speech generation builds on this by accepting a "director's note" that shapes the output, rather than requiring the developer to pick from a predefined catalogue of voices or styles.
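To make the "one call, many fields" idea concrete, here is a minimal sketch of how a client might describe and parse such a combined reply. The schema, the field names (`speaker`, `start`, `emotion`, `translation_en`) and the `parse_transcript` helper are all assumptions made for illustration; the talk summary does not specify the actual response format, and no real SDK call is shown.

```python
import json
from dataclasses import dataclass
from typing import List

# Hypothetical response schema: field names are illustrative assumptions,
# not the documented output of any Gemini model.
TRANSCRIPT_SCHEMA = {
    "type": "object",
    "properties": {
        "language": {"type": "string"},
        "summary": {"type": "string"},
        "segments": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "speaker": {"type": "string"},
                    "start": {"type": "string"},
                    "emotion": {"type": "string"},
                    "text": {"type": "string"},
                    "translation_en": {"type": "string"},
                },
                "required": ["speaker", "start", "text"],
            },
        },
    },
    "required": ["language", "summary", "segments"],
}


@dataclass
class Segment:
    speaker: str
    start: str
    text: str
    emotion: str = "neutral"
    translation_en: str = ""


@dataclass
class Transcript:
    language: str
    summary: str
    segments: List[Segment]


def parse_transcript(raw: str) -> Transcript:
    """Parse a JSON reply and fail loudly if a top-level field is missing."""
    data = json.loads(raw)
    for key in TRANSCRIPT_SCHEMA["required"]:
        if key not in data:
            raise ValueError(f"missing field: {key}")
    segments = [Segment(**seg) for seg in data["segments"]]
    return Transcript(data["language"], data["summary"], segments)
```

In a setup like this, the schema would typically be sent alongside the audio as a structured-output constraint, so that labels, timestamps, emotions, translation and summary arrive together in one reply rather than from separate passes.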
The second major layer is Gemini 3.1 Flash Live, described as a sound-to-sound real-time multimodal model. Its key architectural distinction is that thinking is baked directly into the model rather than cascaded through a separate LLM, enabling real-time responsiveness. The talk culminates in a live on-stage demo featuring Lyria 3, Google DeepMind's music generation model capable of producing full songs with lyrics. In the demo, the Gemini Live model calls Lyria 3 via tool use in response to a request, generating a German techno schlager about the UK startup scene in real time.
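The tool-use handoff in the demo can be pictured with a small dispatcher. This is only a sketch under stated assumptions: the `generate_song` tool name, its `genre` and `topic` parameters, and the stubbed handler are hypothetical, and nothing here calls Gemini or Lyria 3. It shows the general pattern of a model emitting a named call with arguments and the application routing it to a handler.

```python
from typing import Any, Callable, Dict

# Hypothetical tool declaration in the JSON-schema style commonly used for
# function calling. The name and parameters are assumptions for illustration.
GENERATE_SONG_TOOL = {
    "name": "generate_song",
    "description": "Generate a full song with lyrics from a text brief.",
    "parameters": {
        "type": "object",
        "properties": {
            "genre": {"type": "string"},
            "topic": {"type": "string"},
        },
        "required": ["genre", "topic"],
    },
}


def generate_song(genre: str, topic: str) -> Dict[str, Any]:
    """Stub standing in for a music-model call; returns a placeholder, no audio."""
    return {"status": "queued", "brief": f"{genre} song about {topic}"}


# Registry mapping tool names to local handlers.
TOOLS: Dict[str, Callable[..., Dict[str, Any]]] = {"generate_song": generate_song}


def handle_tool_call(call: Dict[str, Any]) -> Dict[str, Any]:
    """Route one model-issued tool call to its handler and wrap the result."""
    name = call["name"]
    handler = TOOLS.get(name)
    if handler is None:
        return {"name": name, "error": f"unknown tool: {name}"}
    return {"name": name, "response": handler(**call.get("args", {}))}
```

With this shape, a spoken request for a German techno schlager about the UK startup scene would reach the application as a call such as `{"name": "generate_song", "args": {"genre": "German techno schlager", "topic": "the UK startup scene"}}`, and the wrapped result would be passed back to the live session.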
Key facts
- 01 A single Gemini 3 Flash Preview API call returns speaker labels by name, timestamps, emotion tags, language detection with English translation, and a full summary.
- 02 Speech generation is directed by a 'director's note' rather than selected from a predefined catalogue.
- 03 Gemini 3.1 Flash Live is a sound-to-sound real-time multimodal model with thinking baked in, not cascaded through a separate LLM.
- 04 Lyria 3 is Google DeepMind's music generation model capable of producing full songs with lyrics.
- 05 A live on-stage demo showed the Gemini Live model calling Lyria 3 via tool use to generate a German techno schlager about the UK startup scene.
Summary and scoring are generated automatically from the original article. We always link back to the publisher and never republish images or paywalled content. Last processed Jun 10, 2026 · 15:34 UTC.