From Transcription to Live Music: Gemini's Audio Stack — Thor Schaeff, Google DeepMind
Jun 9, 2026 · 19:34
Thor Schaeff from Google DeepMind presents the Gemini audio stack: Gemini 3 Flash Preview for deep audio understanding, Gemini 3.1 Flash Live for real-time sound-to-sound multimodal interaction, and Lyria 3 for music generation. He shows how a single API call returns speaker labels, timestamps, emotions, the detected language, and a translation, and how speech generation uses a 'director's note' to modify a base voice's accent and tone. The talk culminates in a live demo in which the Gemini Live model uses Lyria via tool calls to generate a German techno schlager about the UK startup scene.
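
For readers who want a feel for the single-call extraction mentioned above, here is a minimal Python sketch of how the request and its result might be organized. It stops short of the network call and uses only the standard library. The `Segment` fields, the "MM:SS" timestamp format, and the helper names `build_prompt` and `parse_segments` are assumptions made for illustration; they are not taken from the talk or from the Gemini API documentation.

```python
import json
from dataclasses import dataclass


# Hypothetical response shape: these field names are illustrative
# assumptions, not the documented Gemini output schema.
@dataclass
class Segment:
    speaker: str      # speaker label, e.g. "Speaker 1"
    start: str        # assumed "MM:SS" timestamp
    end: str
    emotion: str
    language: str     # detected language code
    text: str
    translation: str


def build_prompt(target_language: str) -> str:
    """Compose one instruction that asks for every field in a single request."""
    fields = ", ".join(Segment.__dataclass_fields__)
    return (
        "Transcribe the attached audio. Return a JSON array of segments, "
        f"each with the keys: {fields}. "
        f"Translate each segment into {target_language}."
    )


def parse_segments(raw: str) -> list[Segment]:
    """Turn the model's JSON text into typed segments."""
    return [Segment(**item) for item in json.loads(raw)]
```

In this sketch, the string from `build_prompt("English")` would be sent together with the audio file in one request, and the JSON text that comes back would be passed to `parse_segments`. Keeping the field list in one dataclass means the prompt and the parser cannot drift apart.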