From Transcription to Live Music: Gemini's Audio Stack — Thor Schaeff, Google DeepMind — AI Engineer

Intro 0:00

Thor Schaeff 0:15

All right. Uh, what's new in AI audio? Uh, I'm sorry, it's a little bit misleading because the title leaves out the @googledeepmind. Um, so we're just kinda looking at, you know, what we've been working on at DeepMind. If we were to look at everything in AI audio, we'd be spending a lot of time here. But, um, yeah, I'd love to show you kinda what we're, what we're working on at DeepMind.

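Yeah, this is me. Hi, everyone. I'm Thor. I work on the developer experience at Google DeepMind, working on the Gemini API and Google AI Studio. Hallo zusammen. Herzlich willkommen. [German: "Hello everyone. A warm welcome."] My name is Thorsten. Bonjour. Je m'appelle Thor. Je suis très désolé, mon français c'est très mauvais. [French: "Hello. My name is Thor. I'm very sorry, my French is very bad."] こんにちは。 [Japanese: "Hello." The rest is garbled in the transcript; the intended meaning was "My name is Thor."] 大家好，我是住在美国的德国人。不好意思，我的中文还不好。 [Mandarin: "Hello, everyone. I'm a German living in the United States. Sorry, my Chinese is not good yet."] Okay.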

Uh, that was for the demo, and now I just need to make sure... Last time I did this demo, I recorded over it, and then it was all gone. That was very sad. But we'll, we'll come back to that in a bit. Yeah. What have we been up to at DeepMind? Um, there's been a couple releases. Uh, I, I actually joined the team in November, uh, literally the day before Gemini 3 was released.

So I joined, and they told me, "Tomorrow, we're releasing Gemini 3." And I was like, "Yay." Didn't do anything, but it was great. Um, it was, it was a great time. Uh, most recently on the open model side, we released Gemma 4, uh, I think literally last week. And, um, yeah, pretty incredible. Some cool stuff you can do there.

Recent Work 1:39

Thor Schaeff 1:59

Multimodality as well, uh, baked into Gemma 4. So, uh, there's audio understanding in, uh, the, the Gemma 4 models, and you can do that, uh, on device, on kinda edge devices as well. So, uh, that is some, some very exciting stuff. In terms of, uh, Gen Media and audio, uh, you're probably very familiar with our, um, you know, image generation models, video generation models.

Obviously, Veo has audio generation in there as well. Um, so this is kind of where the progression, uh, is there most recently with, uh, Veo 3.1 Lite on the Gen Media model side. And then on the audio models, um, we recently launched Gemini 3.1 Flash Live, which is our kind of full duplex, um, you know, sound to sound, uh, real-time conversational model.

Also multimodal, so you can ingest real-time text, voice, vision, which we'll look at in a bit. So audio is a very, very broad topic, but the baseline of everything we do is the frontier Gemini models. And so Gemini 3 is incredibly good at understanding audio, and that's not just transcribing it, but really understanding all the nuances that are in there.

Audio Understanding 2:51

Thor Schaeff 3:23

So that might be, obviously, speech, but also the context of the speech, the emotion, your pacing, anything that swings within the audio that's not just text. So on audio understanding, our goal is to build models that deeply comprehend, richly transcribe, and robustly reason through audio, seamlessly handling a large mix of different languages, dialects, accents, and modalities, anywhere and always.

EchoScript 3:58

Thor Schaeff 3:58

You know, Gemini is really good at transcribing even people who are talking over each other, which is pretty incredible, and at seamlessly switching between different languages. That was the demo we're looking at now. EchoScript uses Gemini 3 Flash Preview to analyze audio recordings and extract information from them.

It is built with Google AI Studio, so you can find it in the gallery in AI Studio and try it out. I can give you the slides later as well. So that was what I was trying to demo earlier. Different from a pure transcription model, we can extract a lot of information out of the audio within one single request to the model, or one single API request if we're using the API.

So you can see here the summary. I introduced myself by name, so we're actually able to label the section with the speaker by name. I forgot there were no hecklers in the room, otherwise we would have picked that up as well. Maybe if we have more time later, we can do that.

But so we can see here, you know, we're extracting timestamps, uh, we're labeling the speaker, kinda identifying the speaker. We're, we're identifying the language and sort of the emotion of it, right? Uh, happy, you know, to introduce myself, uh, is great. Um, now you can see this was in German. Um, normally, it would classify my German as angry.

Uh, but, uh, here, you know, I, I guess I'm very happy to be with y'all. So, uh, I just told it, you know, uh, label sort of the emotion, uh, label the language. If it's a language that is not English, um, give me an English translation as well, right? Um, in French, uh, neutral. Normally, you would say sad.

You know, French is just a bit more of a... No, I said, "I'm sorry, my French is very bad." It didn't sound sad enough, so neutral in this case.

Okay, this didn't work, so my Japanese, I've got to practice that, if anyone reads Japanese. It should actually say, "Hello, my name is Thor." Unfortunately, a bit of a miss there. Let's see if my Mandarin was any better. "Hello, everyone. I'm a German living in the United States. Sorry, my Chinese..." Yeah, that is correct.

Does anyone read Chinese in the room? No? Okay. Well, we'll just trust that that is correct. And so we can see here that this was one request to the model, where I basically just told it, "Identify the distinct speakers. If you have context, label the speakers by name.

"Give me the accurate timestamps. Give me the language. If the language is not English, give me the translation. Identify the emotion out of happy, sad, angry, neutral, and then also provide a brief summary of the entire audio at the beginning." So this was one API call to Gemini 3 Flash Preview, and we got all this information out.

I just gave it a response schema, so structured outputs, and I was able to populate that into my UI to get the structure. So this kind of audio understanding, and the base research in the Gemini 3 models, is what powers the speech generation as well as the real-time conversational generation.
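A minimal sketch of what a response schema for that single request could look like, assuming a JSON-schema-style structure. The field names, the prompt wording, and the sample reply are illustrative assumptions, not the actual EchoScript code, and the request to the model itself is left out so the example runs offline.

```python
import json

# Emotion labels named in the talk's prompt.
EMOTIONS = ["happy", "sad", "angry", "neutral"]

# Hypothetical schema: field names are assumptions, not EchoScript's real ones.
RESPONSE_SCHEMA = {
    "type": "object",
    "properties": {
        "summary": {"type": "string"},
        "segments": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "speaker": {"type": "string"},
                    "start": {"type": "string"},
                    "end": {"type": "string"},
                    "language": {"type": "string"},
                    "emotion": {"type": "string", "enum": EMOTIONS},
                    "text": {"type": "string"},
                    "translation": {"type": "string"},
                },
                "required": ["speaker", "start", "end", "language", "emotion", "text"],
            },
        },
    },
    "required": ["summary", "segments"],
}

# Paraphrase of the instructions described in the talk.
PROMPT = (
    "Identify the distinct speakers and label them by name if context allows. "
    "Give accurate timestamps and the language of each segment. "
    "If the language is not English, add an English translation. "
    "Classify the emotion as one of: " + ", ".join(EMOTIONS) + ". "
    "Start with a brief summary of the entire audio."
)


def parse_analysis(raw_json: str) -> dict:
    """Parse a JSON reply and check the fields a UI would rely on."""
    data = json.loads(raw_json)
    for key in RESPONSE_SCHEMA["required"]:
        if key not in data:
            raise ValueError(f"missing top-level field: {key}")
    required = RESPONSE_SCHEMA["properties"]["segments"]["items"]["required"]
    for i, seg in enumerate(data["segments"]):
        missing = [k for k in required if k not in seg]
        if missing:
            raise ValueError(f"segment {i} is missing fields: {missing}")
        if seg["emotion"] not in EMOTIONS:
            raise ValueError(f"segment {i} has unknown emotion: {seg['emotion']}")
    return data


# Hand-written sample reply for illustration only; not real model output.
SAMPLE_REPLY = json.dumps({
    "summary": "Thor introduces himself in several languages.",
    "segments": [{
        "speaker": "Thor", "start": "00:10", "end": "00:14",
        "language": "German", "emotion": "happy",
        "text": "Hallo zusammen. Herzlich willkommen.",
        "translation": "Hello everyone. A warm welcome.",
    }],
})

analysis = parse_analysis(SAMPLE_REPLY)
print(analysis["segments"][0]["language"], analysis["segments"][0]["emotion"])
```

Constraining the emotion to a fixed list keeps the reply easy to render, since the UI only ever has to handle four labels.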

Speech Generation 7:41

Thor Schaeff 7:41

So having that audio understanding, um, is, is really, really great in terms of, you know, knowing what certain things sound like, um, including, you know, different pacing, different accents, and, and, and scenarios like that. So, um, you know, the foundation of kind of all our models is now sort of the, the Gemini 3, um, foundational research, and then we're building kind of the, um, dedicated audio models on, on top of that.

And so with speech generation, it's a bit different. If you've used other TTS providers before, you probably have a huge library of different voices that you filter by gender, by accent, by language, what have you. In Gemini, you have just, I think, 30-odd base voices.

And, and then what you do is you, you kind of direct that voice to act in a certain way. Uh, and again, because we have that audio understanding, uh, we can, we can basically modify the voice to, you know, act in a certain way to, you know, use a certain accent. Uh, and so we can go from kind of a small set of, of base voices to a very specific, you know, kind of voice that we're looking for for our speech generation.

Again, there's a little application that you can try out. It's in the Google AI Studio gallery as well. It's called the Voice Library. Given the prompt structure that we just saw, we're building the audio profile and the scene.

We're setting the scene. We're giving director's notes, so guidance for the performance, just like how you would direct a human to act a certain way. And then some sample context and the transcript that we want. So now, say we want a high-pitched Irish male.

Voice Demos 9:47

Thor Schaeff 9:47

So basically, I just used Gemini 3 Flash here again to construct our system prompt for the speech generation. We're setting our audio profile, Finnian here, and the scene, a cozy, crowded pub on the coast of County Clare, with the direction to deliver the lines with a strong, authentic Irish accent.
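A small sketch of how the five prompt parts named in the talk (audio profile, scene, director's notes, sample context, transcript) could be assembled into one prompt string. The class, the section titles, and the sample context line are assumptions for illustration; the real Voice Library prompt format may differ.

```python
from dataclasses import dataclass


@dataclass
class VoiceDirection:
    """The five prompt parts named in the talk; field names are illustrative."""
    audio_profile: str
    scene: str
    directors_notes: str
    sample_context: str
    transcript: str

    def to_prompt(self) -> str:
        # Section titles are an assumption, not a documented format.
        sections = [
            ("AUDIO PROFILE", self.audio_profile),
            ("SCENE", self.scene),
            ("DIRECTOR'S NOTES", self.directors_notes),
            ("SAMPLE CONTEXT", self.sample_context),
            ("TRANSCRIPT", self.transcript),
        ]
        return "\n\n".join(f"## {title}\n{body.strip()}" for title, body in sections)


finnian = VoiceDirection(
    audio_profile="Finnian, a high-pitched Irish male voice.",
    scene="A cozy, crowded pub on the coast of County Clare.",
    directors_notes="Deliver the lines with a strong, authentic Irish accent.",
    # Hypothetical context line, not from the demo.
    sample_context="Finnian is telling friends a story at the bar.",
    transcript=(
        "Oh, you wouldn't believe the size of the thing until you saw it "
        "with your own two eyes, I'm telling you."
    ),
)

prompt = finnian.to_prompt()
print(prompt)
```

Keeping the direction separate from the transcript means the same base voice can be redirected to a different accent or scene by changing only the first few fields.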

And so now we hope the TPUs don't disappoint me. There we go. It failed. But I prepared it, so we can listen to it here.

Oh, you wouldn't believe the size of the thing until you saw it with your own two eyes, I'm telling you. It was a grand, old mess so it was, and we were all laughing fit to burst by the end of the night.

Thor Schaeff 10:36

So as you can see, the base voice here is this one.

What kind of problem could we solve?

Thor Schaeff 10:45

So that is a fairly standard American accent. But now, by giving it that director's note, we can then sort of give-

Oh, you wouldn't believe the size of the thing until you saw it with your own two eyes, I'm telling you.

Thor Schaeff 11:00

Takes me straight back to Dublin.

It was a grand, old mess so it was, and we were all laughing f-

Thor Schaeff 11:05

Or, you know, similarly here we have, um, Sapphire. So this voice, uh, is here.

Ready to build something awesome today?

Thor Schaeff 11:12

Again, a fairly standard American English accent here. And now we could give it a Singaporean scene.

Wow, you must try this chicken rice, love. The chili is damn shiok. Confirm plus chop you will love it. Faster queue before the uncle close shop, okay?

Thor Schaeff 11:31

Anyone spent time in Singapore? Yeah. That's something you would hear in the hawker center. Again, that is underpinned by the audio understanding. The model really understands what these different scenarios sound like and can then modify the speech generation to be like that.

Yes. And then, finally, the native audio, sound-to-sound, multimodal, real-time model. We launched Gemini 3.1 Flash Live a couple of weeks ago. It is a speech-to-speech, real-time multimodal model. You can ingest text, audio, and video in real time through a WebSocket connection, and then you get a real-time audio response back as well as the text transcript of that.

Flash Live 12:28

Thor Schaeff 12:28

Obviously, benchmarks, especially in the audio space, you can't really trust them. It's great, you can see the reasoning, the thinking. Here the thinking, the reasoning, and the intelligence are baked directly into the model. That's different from a cascading pipeline, where you would first go to text and then go through an LLM to get the intelligence.

Here, the intelligence is baked into the audio model. So that's the difference there. But obviously, for real scenarios, you can try this out at ai.studio/live. The great thing with AI Studio is that you can try it out without paying anything. It allows you to try the models without actually needing to put down a credit card.

So AI Studio is a, is a really great way to do that. Uh, and again, you know, we have the audio understanding kind of baked into the model here. So what we can do is we can give it some system instructions, you know, for example, speak in a friendly Irish accent, and then also we can ingest kind of our, you know, camera here, for example.

And then, um, we can say, "Hey, can you see me?"

Well, hello there. I can see you loud and clear. So I can. What's on your mind today?

Thor Schaeff 13:54

What do you think of my outfit?

Ah, look at you with your Gemini shirt. It looks grand altogether. And the backwards hat gives you a fierce laid back vibe, so it does. You're looking sharp.

Thor Schaeff 14:07

Ah, wunderbar. Kannst du mir ein Gedicht auf Deutsch erzählen, bitte?

Ah, a poem in German, is it? Sure, I can give that a go for you. Here's a little one. Es war einmal ein kleines Licht, das strahlte so hell und mit, es brachte Freud in jedes Herz, vertrieb den Kummer und den Schmerz. How was that?

Thor Schaeff 14:29

So what's funny here is that it actually applies the Irish accent to the German as well. So obviously you need to adjust your system instructions to not speak in an Irish accent in every language. But it's pretty funny. You can switch between the different languages there.

Again, ai.studio/live. You can try it out. You could also ingest your screen. You're basically just ingesting video frames in addition to the audio, at a maximum frame rate of one frame per second at the moment. That's the maximum you can ingest. There are a couple of examples, so if you're a developer yourself, feel free to try these out.
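A minimal sketch of how a client could respect that one-frame-per-second limit, assuming frames arrive as (timestamp in seconds, encoded bytes) pairs from a camera or screen capture. The function name and the frame representation are assumptions; sending the kept frames over the connection is not shown.

```python
from typing import Iterable, Iterator, Tuple

Frame = Tuple[float, bytes]  # (capture time in seconds, encoded image bytes)


def throttle_frames(frames: Iterable[Frame], max_fps: float = 1.0) -> Iterator[Frame]:
    """Yield at most `max_fps` frames per second, dropping the rest."""
    min_interval = 1.0 / max_fps
    last_sent = None
    for timestamp, data in frames:
        # Keep the first frame, then only frames at least min_interval later.
        if last_sent is None or timestamp - last_sent >= min_interval:
            last_sent = timestamp
            yield timestamp, data


# Simulated capture at 4 frames per second for 3 seconds.
captured = [(i * 0.25, b"frame") for i in range(12)]
kept = list(throttle_frames(captured))
print([timestamp for timestamp, _ in kept])
```

Dropping frames on the client, rather than sending everything, avoids spending bandwidth on frames that exceed the stated limit.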

Dev Resources 15:16

Thor Schaeff 15:16

They are linked from the docs as well. These are, you know, just some examples of how you would get started, say, in Python for like a server to server connection or in JavaScript with kind of a client to server connection. So you can find those here. Uh, also recommend the, uh, Gemini skills. So we have published coding agent skills for kind of all of the Gemini APIs, including the Live API.

Working with real-time audio can just be a bit more challenging. So using these agent skills and installing them in your coding agents can really help steer them in the right way and give you the result that you're looking for. Lastly, we have...

Okay. We have a bit of time. So music, uh, you know, also audio, so I put that in here. Uh, we recently released Lyria 3.

Lyria 3 15:58

Silver glow and a steady beat. Again and again.

Thor Schaeff 16:19

So yeah, it's a music generation model, and it now actually can generate music with lyrics. There are two separate models. There's Lyria 3 Clip, which is a 30-second jingle generation model, and then Lyria 3 Pro is the full-length song generation model. And so, bringing everything together, I don't know if you remember the good old days when you would actually call the radio station and wish for a song.

Um, and so I've, I've kind of built this application, um, called Live Jukebox. And so we can now try it out where, um, we actually give the, the real time Gemini Live model, uh, a tool to then generate a song using Lyria. And so let's see if we can, if we can do that here.
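A rough sketch of the tool-calling pattern described here: the live model is given a declared tool, and the application routes the model's tool call to a song generator. The tool name, its parameters, and the stub generator are all hypothetical; this is not the Live Jukebox source, and no call to Lyria or the Live API is made.

```python
from typing import Any, Callable, Dict

# Hypothetical tool declaration in a JSON-schema style; names are assumptions.
GENERATE_SONG_TOOL: Dict[str, Any] = {
    "name": "generate_song",
    "description": "Generate a song from the caller's request.",
    "parameters": {
        "type": "object",
        "properties": {
            "genre": {"type": "string"},
            "topic": {"type": "string"},
            "energy": {"type": "string", "enum": ["manic", "melodic"]},
        },
        "required": ["genre", "topic"],
    },
}


def generate_song_stub(genre: str, topic: str, energy: str = "melodic") -> Dict[str, str]:
    """Stand-in for a music model call; returns a text brief, not audio."""
    return {"status": "queued", "brief": f"{energy} {genre} about {topic}"}


HANDLERS: Dict[str, Callable[..., Dict[str, str]]] = {
    "generate_song": generate_song_stub,
}


def handle_tool_call(call: Dict[str, Any]) -> Dict[str, Any]:
    """Route a tool call of the form {'name': ..., 'args': {...}} to its handler."""
    name = call["name"]
    if name not in HANDLERS:
        return {"name": name, "response": {"error": f"unknown tool: {name}"}}
    return {"name": name, "response": HANDLERS[name](**call.get("args", {}))}


# A tool call shaped like the request made in the demo.
result = handle_tool_call({
    "name": "generate_song",
    "args": {
        "genre": "German techno schlager",
        "topic": "the UK startup scene",
        "energy": "manic",
    },
})
print(result["response"]["brief"])
```

In a setup like this, the conversational model only decides when to call the tool and with which arguments, while the application decides how the song is actually produced and played back.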

Live Jukebox 16:50

Right then. You're live on air with the Jukebox DJ. What an absolute legend. Who am I speaking to and what kind of vibe are we gonna cook up today? Is it a proper rave starter or something a bit more ch-

Thor Schaeff 17:20

Yeah, can we get, um, maybe a German techno schlager about the UK startup scene?

German techno schlager about the UK startup scene? Mate, that is revolutionary. I love it. Right. Picture this, pounding techno beat, a proper Hans Zimmer style-

Thor Schaeff 17:39

Yeah, that's great. Let's go. Let's do that.

Hold on tight. We're almost there. Just to make sure it's a certified banger, any specific buzzwords or stories from the UK startup world you want in the lyrics? And are we thinking manic energy or something a bit more melodic within that techno madness?

Thor Schaeff 17:54

No, just manic energy, and surprise me on the lyrics.

All right, you're on. Get ready to have your eardrums blasted. Manic German techno schlager with a British startup twist. Cooking up a proper banger for you. Check this out.

Zeig mir dein Pitch Deck los. Der Wachstum ist grandios. Series A ist jetzt das Ziel. Das Spiel ist ein Gewinner-Spiel. Viel Kaffee in meinem Blut. Das Startup Leben ist so gut. Das Startup Leben ist so gut. Ja, ja, ja, so gut.

Closing 18:41

Thor Schaeff 18:41

All right, I'll, I'll leave you with that. Um, if you don't speak German, sorry, but, um, thanks so much. Appreciate y'all.

No worries at-

Thor Schaeff 18:56

Oh, yeah, and if you want the slides, I can just rewind, but all the links are in there. If that's helpful, you can just grab them there. Awesome. Thank you. And yeah, enjoy the rest of the conference. And a big thank you as well to our friends in the back on the audio.

You know, wouldn't be possible without them. Cheers.
