Sovereign Escape Velocity: Ownership w Open Models — Gus Martins, & Ian Ballantyne, Google DeepMind — AI Engineer

Intro0:00

Gus Martins0:16

Hi, everyone. Uh, can you hear me? Yes, you can hear me. Hi. Sorry, sorry, one minute late. I'll try to do my best to finish earlier, so my friend can do, uh, pretty cool demos for you. I'm Gus, this is Ian. We are from the, uh, Google DeepMind, and I'm specifically I work on the Gemma product. Do any of you know what Gemma models are?

Okay. Perfect. Perfect. Thank you very much. Uh, so today we're going to talk a little bit about ownership and open models. Uh, and, uh, well, you know who we are. But the idea is, uh, last Thursday, we released our new, uh, family of models, Gemma 4, and I'm going to talk a little bit about them. Uh, there's going to be more information tomorrow in the keynote by Omar, and there's another talk by Cassidy also tomorrow that she'll go into even more details.

Two Families1:02

Gus Martins1:02

We are going to tell a little bit of the story, but the story is a little bit bigger. We'll try our best here. Uh, so why does it matter? Uh, if you ask me, I work for Google, of course, if you ask me what's the best model for you to try, the easiest one, I will answer for you, Gemini.

Gemini is the best model we have, pretty strong, multimodal, can do all kinds of things. But then there is, uh, there's more to this story than just having the strongest model possible. In some situations, you want to own the model, you want to be able to run on your own hardware, you want to customize it, you want to be able to send your proprietary data that cannot leave your infrastructure.

So there are many situations where even the best proprietary model will not be able to help you directly. That's when you might need an open model. That's where Gemma comes in. So when you think, why does Google have two family of models? Because they complement each other. So Gemini is the m-most intelligent one, can do a lot of cool stuff, but it's hosted in Google servers, you need the API to access.

If you need more control and access, you need an open model. That's why we have Gemma. That's why, uh, and, and we are very proud that the quality is very, very strong. We're going to go some details later. But the idea is you would be able to do a lot of cool stuff with it. Among the launches we did, we released four sizes.

Four Sizes2:10

Gus Martins2:27

Uh, two are target to mobile, let's say like that, or, uh, or IoT or smaller devices. It's a E2B and a E4B. Uh, these names is a little bit weird. We are the only ones that use this name. Uh, and the E stands for effective. And the idea here is, uh, the model, uh, uses as much as 2B, uh, what a 2B model would use as memory, but it's larger than that.

The 2B is around 5B, uh, parameters. But then you say, "Oh, but where is this other 3B in memory?" The, the fun fact is that they are not really parameters from the transformers. They are, uh, like mapping, uh, tokens. So you can leave them in other memory. So what you really need on your GPU memory is the two billion or the four billion.

Why do we do that? So that you can run these models on a phone, on a Pixel phone or b- any phone you have there. You can run these models, and they're very strong. The E2B and E4B, both of them have, uh, text, uh, vision, and audio input, and they do only text output. They can do thinking, they can do coding, function calling, all these kind of cool things.

These all run on your phone right now. You could download it right now, right? We also have other two models, which are the larger ones. We have a 26B and a 31B. The 26 is a mixture of experts, which means that, uh, it's as if we had many other models working together where, where each of the... where each one of those are like a 4B model.

Why does it matter? Because it, it has twenty-six billion parameters, but it needs a space of a four billion param to do the work. And this makes, uh, makes it accessible to way more hardware, to way more people, and it's still pretty strong. But our strongest model is the 31B Dense, which is thirty-one billion parameters model, and this is really, really strong.

Efficiency4:20

Gus Martins4:20

If we look into our, uh, ELO score on LM Arena, you can see that both our models are-- They are, I guess, now they're fourth and seventh as the leads on open source models, open models. And if you compare them to maybe the top twenty, thirty, all of them are at least twice, three times larger than our models.

In some cases, twenty times larger. So we are talking about a disproportionate amount of intelligence per size. So our 31B model is the one I use very regularly. It can do basically anything from coding, agentic, everything, multilingual, all of that. So I strongly recommend you try those. They are so strong that they are, uh, o- both of them are really good to use on your-- on a-- as a cloud deployed model.

They can run on your desktop, but if you use on your server as your endpoint to do s- uh, your work, they are pretty good. And you ask, "Oh, is this the most intelligent model?" No, it isn't. I'm going-- I'm, I'm very biased, and I love them, but I know the capabilities. But the question is, do you need the most intelligent model of the planet to summarize your email, to do some more menial tasks, to help you code, to do some agentic capabilities that are s- uh, searching and interacting with docs?

Probably not. That's why these models are so strong, because they're cheaper. They're very strong, but they're cheaper to run. They require way less hardware. A 31B run in one GPU. The competitors need two hundred gigabytes of memory, which would be maybe four or five GPUs. So you can see that the price here is really, really different. One easy place for you to try these models is on ai.dev where you can try Gemini models, Veo, all the other models, but Gem- Gemma are there.

Both 26 and 31B you can try right now. They're free. You can play with it. Uh, and they can do-- I was, I was going to show a demo, but I, I can't now. But they can do some cool stuff, which is vision plus thinking plus code execution all at the same time, right? Uh, I'll try to post something about this later.

But the idea is you can play with the models pretty easily and right there, r- not now. Finish. Let's finish the, the, the talk and then you play with it. Uh, and as I was saying, the in- the intelligence per parameter that these models bring is pretty good. It's very, very strong. And if we, we use the ELO score for LM Arena because it's a benchmark that's a, a person's preference, right?

We can look into academic benchmarks. They are very, very strong. But the-- how the model responds to your queries, that's very important, right? That's how your customers will see, how you will see it and, and interact with it. So this is why this is so important. And why does all this matter? One of the reasons that we care so much is because, uh, you want to, you want to-- the user to have ownership.

And more than that, we are enabling sovereignty. And sovereignty me- means in terms of you own the model, and you can adapt your use cases, and you, you are not susceptible to, I don't know, loss of service or for some kind of, uh, someone saying, "No, no, you cannot use this model anymore." It's all available to you.

Sovereignty7:13

Gus Martins7:32

And one of the changes we made last year until Gemma 3 and others, we had our specific license, a Gemma license, which is pretty good, commercial-friendly and all. But there's a problem. If you have a custom license, I don't know if you have any lawyers here. Uh, if I tell you, "Oh, we have this custom license," your lawyers will look at me with that face that, "I hate you, Gus."

And then they will spend like 18 month doing procurement process to understand the license and trying to change, and that never works. So it's pretty hard for sovereignty institutions to adopt this kind of thing. That's why we moved to Apache 2.0 for, for Gemma 4 and going forward. That makes- Thank you. And that makes our life mu- uh, your life much easier to, to convince your legal department, let's say like that, that, "Look, we own this model.

We can use." So this is pretty important, and it enables many, many, uh, sovereignty institution to use our models. We have some examples. For example, ki- for example, uh, Ukraine used Gemma to, in parts of their inf- uh, uh, services. We have a, a one version of the Gemma model that was fine-tuned for Bulgarian. It was their, uh, LLM for the country.

That was based on Gemma 2. We are working to make sure they use Gemma 4 now. We also have a Brazilian version that is based on Gemma 3. It was fine-tuned for Portuguese. And the challenge of these models today is that they-- if you want to fine-tune Gemma to a specific language, it's becoming very hard to do that.

And the problem is hard because not the f- the tooling or anything. It's because the model is pretty strong on those languages already. So any gains you try to have, you might not get there. So you might spend a lot of time to get like one percent. And then maybe, I don't know if it's the best use of your, uh, time.

So this is good and bad at the same time 'cause-- but it's good that you can automatically use in many language. You can try right now. And if you're going to, uh, the LM Arena for languages, in many language, they are top two, three. And look, it's a 31B model. It's very, very small, right? So this is pretty good.

That being said, uh, I will let my colleague continue and show some demos.

Ian Ballantyne9:39

Thank you, Gus. So, uh, one thing that I think is really important about these models is that when you think about using open models, you think about, like, using proprietary models. We're move-- We're seeing a shift now to more kind of agentic capabilities and the kind of tasks that we're trying to do. And with that comes a cost in tokens and token generation.

Agentic Shift9:39

Ian Ballantyne9:57

So one benefit of, uh, taking ownership of the models is your ability to control, or in cases where you have sunken hardware cost, uh, to be able to iterate on top of that. Uh, this graph on the right-hand side is from the State of AI report that OpenRoute did, and it shows the, uh... It's a bit small for you on this diagram, but have a look at that link.

Uh, it shows the different types of tasks that people are doing through OpenRoute at the moment. And you'll see the, the, the one that's about here, this one here is programming is right in the middle, and this is kind-- among some of the highest tasks in terms of token generation, both input and output combined. So the more we have agents work and do these kind of tasks for us that have very high token generation costs, that's when you start to get more benefit from being able to take control of that i, in itself.

So if, for instance, you have a laptop that is capable of doing a particular task that you need to be doing, like processing a document or analyzing some data or doing some research or, in the cases, uh, Gus talked about, doing some coding that's suitable for that, then you have a GPU that you can take advantage to do some of that stuff.

Now, similarly, similar to what Gus said about, you know, we don't necessarily-- We still have frontier models for doing the best possible things. I wouldn't get this model to do like a, you know, a full systems architecture and redesign of your application, right? It's not kind of for that. But what it is very good at doing is following very specific instructions about doing things like refactoring, analyzing, uh, generating code in, uh, in small modular bits, and you can offload a chunk of work in that style to these kind of models to be able to do that, whether it's on a single GPU or on your own personal hardware.

Um, and the way that we kind of think about this is like a, a set of thresholds. Like, when do we get to the point where these models are capable of doing the task, but then they also fit on the right hardware, that they also, uh, can do it with the right amount of latency, depending on the, the-- if it's a task for a user, it needs to happen in a couple seconds.

If it's a task where you're doing things like batch processing, you maybe have slightly different thresholds for, like, what needs to be done. Um, and then also what the cost of actually doing that is. So if you have a sunken cost in terms of, like, infrastructure that you already own or that you're prepared to, uh, outlay and then operating on that, or whether you're leasing, like, GPU time or something else like that.

So these are gonna be very specific to the task that you're trying to achieve. But what, uh, what you can do with open models is you can think very carefully about, like, what-- which of these tasks can I f- can I fully offload or can I fully own compared to relying just on using the best possible models to do, uh, that in the cloud?

Phone Demo12:37

Ian Ballantyne12:37

And, uh, an example, so Gus talked about the different types of hardware that can run these things now. I'm just gonna run this little demo in the side at the moment. So we now have models that will work directly on mobile and edge devices. Uh, this example here, uh, was built by Cormac's team, um, is a set of agent skills that, that the model is running on a phone.

So, uh, I'm gonna mute the microphone for that. Um, so you can talk to the model, you can show it images, uh, you can show it the world around you, and you can prompt it and chat to it. And what this one is showing is that it can look through a set of skills that it has about things on the phone.

So either it can take actions on the device itself, so trigger other applications like trigger calendar apps, trigger maps apps, or you can kind of define your own skill sets. And what's different now with the Gemma 4 models than we saw for the previous generation is that it's able to reason about what actions it needs to take and reliably, uh, make those function calls defined.

So what this app will allow you to do is it, is kind of acts as like a playground. Uh, so this is Google AI Edge Gallery, and you can find it on iOS and Android. And you can experiment to see what the models of this size are actually able to do. So I think this is the two billion parameter model, but there's also the four billion parameter model, depending on, uh, the size of your hardware.

Scaling13:56

Ian Ballantyne13:56

And when we get to desktops and single GPUs, a-as Gus mentioned, that's where you can use the, uh, the 26 and the 31B models, again, on your local hardware, and I'll show you how to do that in a minute. But there's one kind of key point here is, well, whereas we're not paying for, uh, the price of these agents or models within tokens, we're actually paying for them in terms of energy cost if we think about it.

Because now you're thinking about utilization of GPUs, you're thinking about utilization of MPUs on the hardware itself. Uh, when are you gonna do these tasks? Does the user need to get a response right now when you're taking a picture of something, or is it something that you can process offline as a background task when they plug their phone in at night?

So the-- what I'm trying to say here is that the thresholds and how you think about the usage of these models kind of shifts when you come to on-device or ownership because you think more about how they're being executed and why they're being executed. Um, yeah, perfect. Uh, and similarly, on the enterprise side, if you don't have a piece of hardware that can, you know, run the thirty-one billion parameter model, you can now be thinking about scaling that down.

So maybe if you wanted to use a three hundred plus, uh, billion parameter model before, you might have need multiple GPUs. Now you can think about using a single, uh, H100 or a A100, or even in some cases like an L4. And then the costs obviously related to that also kinda go down. So again, it's a calculation that you'll have to do depending on your use cases, but there's ways that you could scale, for instance, running one of these models to serve, you know, a small team or to serve a company, um, depending on, uh, what you're trying to do.

And the f-- and the f-- and the final point is that you also have the fine-tuning component too, which is that because these models can be customized, you can deploy your own version of it. So for instance, we have a variant of Gemma models called MedGemma, which is specialized for medical use cases. So if you wanted to have something that would operate on private data that you can control yourself, you can now d-- re-- feasibly deploy this to, like, one or maybe two GPUs, uh, to run that for, I dunno, like a whole hospital, for instance.

So these are kind of worth considering for the enterprise case. Uh, I'm gonna jump straight to demos now. Um, I've shown you some demos on the phone. I'm gonna show you a quick demo, uh, here. Who-- Quick show of hands, who's ever used, uh, a, a tool called LM Studio? Okay. Uh, just under half people. So LM Studio is a way that you can play around with local models.

And I have here-- I have-- This is the 26B model, so this is our faster of the, the two larger models with four billion activated parameters. And I'm-- At the moment, including the context, it's probably about twenty-six gigabytes in RAM. And this is an M4 Mac. So I've got unified memory. I've got up to about forty-eight gigabytes.

LM Studio16:24

Ian Ballantyne16:43

Uh, so I can run it on this machine. And I'm just gonna run this terminal right here. Let's give that a go. Oops. Pre-showing my demo. Let's try that again. There we go. So I'm just gonna run a little process where I'm going to do some quick-- a trick-- quick translation on my device. So what it's gonna do is I've got an orchestrator on this side here, uh, which is going to hopefully kick off my agent in a minute.

Let's make sure we are loaded. Let's see what LM Studio is doing. Yeah, it's just processing at the moment. And then it's gonna farm out, uh, this translation to all of these different, uh, uh, windows, and each one of them represents a different sub-agent. So this is running on my device, uh, and it's going to basically execute all these translations in one go.

So I've given it, like, the Gemma 4 announcement, and I just wanna s-- wanna translate to all these different languages. So you'll see in a second, it should hopefully send it over there. Three, two, one. And hopefully we should be generating translation in a second. There we go. So, so you can imagine doing any kind of agentic task on your local machine.

Uh, you could have it, like, processing files. You could have it doing, uh, additional analysis. And hopefully, what you'll see in a minute is it will be able to compile all these back, and then it will generate me a quick webpage, and then you can see the results of your translation. There you go. So there's the multilinguality of the model there as well.

Thank you. Uh-

Right. Uh, so in the interest of time, uh, I just wanna say that the, the main next step for exploring and trying out these models is as simple as this code on the right-hand side. Um, you can take any OpenAI-compatible interface that you've got, and you can point it at a service like Ollama or LM, uh, LM, uh, LM Studio, and you can just pick out the Gemma model, and that's all you need to change code-wise to at least, uh, try it out.

Try It Out18:47

Ian Ballantyne18:47

So the first thing we recommend you do is to drop it into existing workflows that you have to then see what the model can handle. Like, what is it working well at? What would it need tuning for? What is kind of out of its depth in terms of, like, the complexity of the task? Um, next is to kind of bolster your evaluation suites because, you know, benchmarks are great and everything for just saying what general capabilities are.

But the reality is that how good the model is depends on how well it does on your task and not anybody else's task. Um, the other thing I mentioned very briefly is thinking about how you actually serve these models in the end. So if you need to run your own GPU, you need to host it. Yes, you're in control of, like, uptown-- uptime and downtime, but then there's, like, maintenance costs and there's stuff like that.

So you have to be... You have to consider that as, like, one of the factors, like the ongoing costs as well, as well as any upfront CapEx costs if you buy infrastructure or hardware to do that too. Um, on mobile devices, for instance, you have to think about, like, if I'm gonna offload stuff to a phone, like, what am I supporting?

What accelerators do they have? What size RAM do they have? So the conversation becomes a little bit more complex, but then there's a whole heap of things that you can unlock, like working offline or working on users' private data that never leaves their device. Um, and finally, if you wanna scale this up to enterprise levels, you have to think then again about, like, the kind of infrastructure that you're running on and what the ongoing costs are of that as well.

But it does kind of unlock that. So, uh, with that, the summary is that you can use these models in pretty much any way you can think about. Experiment what kind of tasks are possible with it. Use some of the benchmarks to kind of give you an indication of, like, what's feasible. But, uh, really, we want to hear your feedback and how you get on with these and, and how you fine-tune them and what kind of things you run into and, uh, we wanna help you on that journey as well.

So with that, thank you very much.

Sovereign Escape Velocity: Ownership w Open Models — Gus Martins, & Ian Ballantyne, Google DeepMind

Topics

Mentioned

Transcript