Intro 0:00
Okay. Hope everyone is awake after lunch, and nice to meet you all. I'm Ziv. I'm on the AI Labs team at NVIDIA, based out of Paris, working with different frontier model builders across a lot of domains. Of course, diffusion is one of them, and we'll hear about a couple of examples of the work we do with them.
We have only twenty minutes, so obviously it's a mix between going deep and going very high level. Each of these topics could probably fill a full day or a full conference. So I'll try to cover everything I can within this time frame, but feel free to reach out afterwards, either through LinkedIn or in person, as I'll stay here a few minutes afterwards.
So without further ado, diffusion models. I assume everyone here knows about them. Is there anyone who doesn't know how video generation and image generation work at a high level, denoising? Perfect. Okay. The idea is that, unlike autoregressive architectures such as LLMs, you run a lot of iterations to denoise the image or the video, usually between twenty and fifty steps.
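As a minimal sketch of why the step count matters, the toy loop below calls a stand-in "network" once per denoising step. Everything here is an assumption for illustration: `predict_noise` is a hypothetical placeholder (the clean signal is fixed at zero so the noise can be recovered exactly), and the update is a plain Euler step rather than any particular model's sampler.

```python
import random


def predict_noise(x, t):
    """Hypothetical stand-in for the denoising network.

    In this toy setup the clean signal is all zeros and x = t * noise,
    so the noise can be recovered exactly as x / t.
    """
    return [v / t for v in x]


def sample(num_steps, size=4, seed=0):
    """Run num_steps denoising iterations, one network call per step."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(size)]  # start from pure noise
    dt = 1.0 / num_steps
    calls = 0
    for i in range(num_steps):
        t = 1.0 - i * dt  # noise level goes from 1.0 down to dt
        noise = predict_noise(x, t)
        calls += 1
        x = [v - dt * n for v, n in zip(x, noise)]  # Euler step toward t - dt
    return x, calls


result_50, calls_50 = sample(50)
result_20, calls_20 = sample(20)
print(calls_50, calls_20)
```

The number of network calls equals the number of steps, which is why latency grows roughly linearly with the step count in a real pipeline.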
In the last year, I think, we've seen an influx of very good, high-quality models, whether it's Flux 2 for image generation, LTX 2.1 for video generation, or Google with Nano Banana and the later generations. And we do see a lot more practical use cases for them. The main challenge, once we have some interesting use cases, is how to make it actually usable, right?
We know that it's cool to generate videos or images, but if we talk about a developer context or an enterprise context, this should be fast, okay? We want it to be mature, we want it to be scalable. And these are usually the challenges that are hard to solve, as this ecosystem is not as mature as the autoregressive LLM and VLM ecosystem.
So we try to borrow a lot of the concepts we see working very well for LLMs, and we gradually kind of distill them, if I can steal this terminology, into the world of diffusion models. We'll cover a few of the topics here, but every day we see more and more research in this domain, and I expect this world to be even more mature by the next AI Engineer.
Use case enablement: real-time image and real-time video generation is obviously the Holy Grail, okay? Imagine how many new use cases there are, whether it's world models for robotics, computer games, or content generation. It opens a lot of new avenues for companies and developers. And the big challenge that comes with it is, of course, the latency.
It takes a lot of time to get a first image, and then obviously to get high quality, if we talk about 1080p or 720p content out there. To bridge this gap, I'll talk about three concepts. Of course, they don't cover all the ways you can optimize your video and image generation models, but I'll touch on quantization, caching and distillation.
It's not necessarily the order you'll deploy them yourself, okay? Usually, you'll start with distillation, then do some quantization, then some caching. But I ordered them from the simple to the more complex. Simple is usually quantization. For those of you who have tried it with LLMs, the concepts are quite similar. Then we'll talk about caching and distillation.
When we talk about quantization, we have two approaches: post-training quantization (PTQ) and quantization-aware training. In many cases, of course, we want to use the simpler approach, PTQ, but we know that maintaining the image quality and the video quality is a little more complex with diffusion models.
Quantization 3:53
We also know that these types of models are more attention heavy, which means that quantization is not as impactful as it is for LLMs and VLMs. But it is still quite a low-hanging fruit when we talk about taking advantage of the more advanced features of Blackwell, for example, and more modern compute.
This example is from the work we did with Black Forest Labs on Flux 2, where we used dynamic quantization. You can use static quantization, which means that you compute the ranges of all the different parameters upfront, deploy, and use these static ranges for the quantization. In this case, we used the dynamic approach, which means that some of the ranges are computed on the fly.
Again, this is to make sure that the ranges are in line with the data distribution you'll actually see when running these models. It's something that you can do yourself. We recently released a good example in our TensorRT-LLM VisualGen repository. It's open source, so you can start using it and see how it goes.
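The difference between static and dynamic ranges can be sketched in a few lines. This is an illustration under stated assumptions, not the recipe used for any real checkpoint: it uses symmetric int8 quantization for simplicity (not FP8 or FP4), and the calibration and runtime values are made up.

```python
def absmax_scale(values):
    """Scale that maps the largest magnitude in values onto the int8 limit 127."""
    return max(abs(v) for v in values) / 127.0


def quantize(values, scale):
    """Symmetric int8 quantization: round to the grid, then clamp to [-127, 127]."""
    return [max(-127, min(127, round(v / scale))) for v in values]


def dequantize(quantized, scale):
    return [q * scale for q in quantized]


def max_error(values, scale):
    """Largest absolute round-trip error for the given scale."""
    restored = dequantize(quantize(values, scale), scale)
    return max(abs(a - b) for a, b in zip(values, restored))


# Static: the range is computed once from (hypothetical) calibration data.
calibration = [0.5, -1.0, 0.25, 0.75]
static_scale = absmax_scale(calibration)

# Dynamic: the range is computed on the fly from the values actually seen.
runtime = [0.1, -4.0, 2.0, 3.5]
dynamic_scale = absmax_scale(runtime)

static_err = max_error(runtime, static_scale)    # large: outliers get clipped
dynamic_err = max_error(runtime, dynamic_scale)  # small: at most half a grid step
print(static_err, dynamic_err)
```

When runtime values fall outside the calibrated range, the static scale clips them, while the dynamic scale adapts at the cost of computing a range during inference.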
What we also try to do, again to help the community adopt it, is to help our partners publish pre-quantized checkpoints. So you can just go to Hugging Face, load the quantized checkpoint and start using it. If you don't need to fine-tune or add LoRA adapters afterwards, it's quite handy, and you can already see the impact.
Of course, when we talk about quantization, the impact is first on the memory. It will require less memory, which means you can run it on lower-end GPUs, whether consumer GPUs or lower-end data center GPUs. But it will also help performance. So this is one part of the toolkit. Again, there is a whole world behind it to make sure that it's effective.
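The memory side is simple arithmetic on bits per weight. The sketch below assumes a hypothetical 20-billion-parameter model and counts weights only, ignoring quantization scales, activations and any layers kept in higher precision.

```python
BITS_PER_WEIGHT = {"bf16": 16, "fp8": 8, "fp4": 4}


def weight_memory_gib(num_params, fmt):
    """Weight storage in GiB for the given format (weights only, no overhead)."""
    return num_params * BITS_PER_WEIGHT[fmt] / 8 / 2**30


NUM_PARAMS = 20_000_000_000  # hypothetical 20B-parameter model

footprint = {fmt: weight_memory_gib(NUM_PARAMS, fmt) for fmt in BITS_PER_WEIGHT}
for fmt, gib in footprint.items():
    print(f"{fmt}: {gib:.1f} GiB")
```

Halving the bits per weight halves the weight footprint, which is what moves a model from a large data center GPU toward smaller cards.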
Just today, I saw some of the latest research coming from Hao AI Lab about Attention FP4, and, as I mentioned, attention is quite heavy for this kind of model. So we do try to follow the latest research and make sure it's accessible to you as a community. Okay. On to the second stage, caching. KV cache is something that anyone who has worked a little with LLMs, with autoregressive models, knows: everyone talks about how to use it efficiently, how to offload it, et cetera.
Again, it's a whole world. With the characteristics of diffusion models, it doesn't work the same way, right? We don't generate a token every time, so it's harder to use these kinds of techniques when we talk about denoising steps, or to make sure that we reuse computation we've already done when we generate future images or future videos.
Caching 6:52
There are some techniques. TeaCache is one example. It's not a very strong example, but it's a good one for understanding the concept, okay? While we are doing the denoising steps, and we talked about twenty to fifty steps, there are parts that stay pretty much the same from one denoising step to the next, so we don't necessarily need to recompute them.
What TeaCache does is compare consecutive denoising steps, and if there was only a minimal change between them, it concludes that it doesn't need to recompute for the next denoising step. So it's more general: it makes that decision for the entire pixel space or latent space.
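The step-level idea above can be sketched as a small wrapper that skips the model call while the accumulated input change stays under a threshold. This is a simplified illustration, not the published algorithm: the class name, the relative L1 metric on raw inputs, and the toy model are all assumptions.

```python
def relative_l1(current, previous):
    """Relative L1 change between two equally sized vectors."""
    num = sum(abs(c - p) for c, p in zip(current, previous))
    den = sum(abs(p) for p in previous) or 1.0
    return num / den


class StepCache:
    """Reuse the last model output while the input has barely changed."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.prev_input = None
        self.cached_output = None
        self.accumulated = 0.0
        self.computed = 0
        self.skipped = 0

    def __call__(self, model, x):
        if self.prev_input is not None:
            self.accumulated += relative_l1(x, self.prev_input)
        self.prev_input = list(x)
        if self.cached_output is not None and self.accumulated < self.threshold:
            self.skipped += 1
            return self.cached_output  # skip the expensive call
        self.cached_output = model(x)
        self.accumulated = 0.0  # reset after a real computation
        self.computed += 1
        return self.cached_output


def toy_model(x):
    """Hypothetical stand-in for one expensive denoising call."""
    return [2.0 * v for v in x]


cache = StepCache(threshold=0.05)
inputs = [[1.0, 1.0], [1.001, 1.0], [1.002, 1.0], [2.0, 1.0]]
outputs = [cache(toy_model, x) for x in inputs]
print(cache.computed, cache.skipped)
```

Raising the threshold skips more steps and runs faster, but the reused outputs drift further from what the model would have produced, which is the quality trade-off to evaluate.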
More modern caching techniques do this in a more chunk-based way. Imagine we are in the classroom here. Most of you in the audience are sitting, staring at the screen, so nothing much changes. But I try to be a little dynamic, so you stay awake and follow me, which means that your chunk of the video doesn't necessarily need to be recomputed.
You guys don't need to be recomputed. I do. So we'll isolate just this chunk and recalculate that. Of course, you can define the threshold, and this is something that actually has a lot of impact. We've provided here some good examples of the expected boost you can get from using this. But make sure that you try it, of course, and that you maintain the quality, okay?
Caching is something that, if you don't do it the right way, can have quite a significant impact on the quality of the image. And for content creators, world models, et cetera, quality is something you want to make sure you maintain while you get the boost. So that's caching. Again, I encourage you to read more about the different techniques.
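The chunk-based variant from the classroom example can be sketched as a per-chunk threshold check. This is a toy under strong assumptions: the function name and data are hypothetical, and each chunk is processed independently, whereas in a real transformer chunks interact through attention.

```python
def chunk_update(model, prev_input, prev_output, new_input, chunk_size, threshold):
    """Recompute only the chunks whose input changed by more than threshold."""
    output = list(prev_output)  # start from the cached result
    recomputed = []
    for start in range(0, len(new_input), chunk_size):
        end = start + chunk_size
        change = max(
            abs(n - p) for n, p in zip(new_input[start:end], prev_input[start:end])
        )
        if change > threshold:
            output[start:end] = model(new_input[start:end])
            recomputed.append(start // chunk_size)
    return output, recomputed


def toy_model(chunk):
    """Hypothetical stand-in for the expensive per-chunk computation."""
    return [10.0 * v for v in chunk]


# Chunks 0, 1 and 3 are the static audience; chunk 2 is the moving speaker.
prev_input = [1.0] * 8
prev_output = toy_model(prev_input)
new_input = [1.0, 1.0, 1.0, 1.0, 3.0, 2.0, 1.0, 1.0]

new_output, recomputed = chunk_update(
    toy_model, prev_input, prev_output, new_input, chunk_size=2, threshold=0.1
)
print(recomputed, new_output)
```

Only one of the four chunks is recomputed here, so the saving scales with how much of the frame stays static.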
This is already available in the TensorRT-LLM VisualGen I mentioned. It's just a flag you enable, and you set the threshold. You can experiment with it, but it's also available in vLLM-Omni, SGLang Diffusion and other serving libraries. Now, distillation. This goes back to the point that you don't necessarily need fifty steps.
Distillation is something we've seen before. I'd say the big bang for distillation was probably the first DeepSeek release, how they managed to distill from a very big model into much smaller models and get, I would say, acceptable quality with a much smaller model. In diffusion, the goal is not to get to a smaller model, okay?
Distillation 9:28
You'll still have the same number of parameters. This is more about step distillation: training the student model to generate images or videos of equally good quality using far fewer steps. Instead of fifty steps, going to four steps, eight steps, in some cases one step, while maintaining the quality. And this is the big challenge.
Imagine you are able to reduce the number of steps this significantly while maintaining the quality: it can give you a 10x, even 200x improvement in performance. And if you go back to real-time generation, today this is probably the only way to get there with good quality. Next, I think, there's a demo we did at the last GTC conference, a couple of weeks ago in San Jose, with two different distillation techniques, where we got to real-time generation.
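As a back-of-the-envelope check on what step reduction alone buys, the sketch below compares model evaluations before and after. The cost model is an assumption: every step costs the same, and `fixed_cost` is a hypothetical allowance for work that does not shrink with the step count, such as text encoding or decoding the final latent.

```python
def step_speedup(teacher_steps, student_steps, step_cost=1.0, fixed_cost=0.0):
    """Latency ratio when only the number of denoising steps changes."""
    before = fixed_cost + teacher_steps * step_cost
    after = fixed_cost + student_steps * step_cost
    return before / after


four_step = step_speedup(50, 4)                          # 12.5
one_step = step_speedup(50, 1)                           # 50.0
four_step_with_overhead = step_speedup(50, 4, fixed_cost=2.0)
print(four_step, one_step, round(four_step_with_overhead, 2))
```

Step reduction by itself gives 12.5x for fifty to four steps, so larger end-to-end figures have to come from stacking it with other optimizations, and any fixed overhead pulls the realized number below the step ratio.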
And this is something that everyone is looking for, all the AI labs, and I'm sure the bigger players too, because it means that we can actually get to streaming, which will open a lot of new use cases. Okay, so how do we get there? When we talk about distillation, we always have a teacher model and a student model.
Currently, we have two main approaches to distillation. One is trajectory-based, which means we teach the student to follow the trajectory of the denoising steps the way the teacher does. The second is distribution-based, which means we only look at the output distribution. We want the student to get to the same point at the end, but we let the student figure out how to get there, and not by following the exact trajectory.
The more common and, I would say, better-quality technique these days is distribution-based, and we also see a lot of ways the two can be combined. In the last FastVideo release, they actually managed a kind of hybrid approach that maintained the quality but also made training more stable.
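The contrast between the two families can be illustrated with two toy loss functions. This is a deliberately crude sketch: the trajectory loss compares intermediate states sample by sample, while the distribution loss here only matches the mean and variance of the final outputs, a stand-in for the much richer distribution comparisons that practical methods use. All names and numbers are hypothetical.

```python
def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def trajectory_loss(student_states, teacher_states):
    """Penalize the student for leaving the teacher's path at each checkpoint."""
    per_state = [mse(s, t) for s, t in zip(student_states, teacher_states)]
    return sum(per_state) / len(per_state)


def moments(samples):
    mean = sum(samples) / len(samples)
    var = sum((s - mean) ** 2 for s in samples) / len(samples)
    return mean, var


def distribution_loss(student_outputs, teacher_outputs):
    """Compare only the statistics of the final outputs, not individual samples."""
    s_mean, s_var = moments(student_outputs)
    t_mean, t_var = moments(teacher_outputs)
    return (s_mean - t_mean) ** 2 + (s_var - t_var) ** 2


# Each inner list is a batch of three scalar "samples" at one checkpoint.
teacher_states = [[0.5, 1.0, 1.5], [1.0, 2.0, 3.0]]
student_states = [[1.5, 0.5, 1.0], [3.0, 1.0, 2.0]]  # same values, different samples

traj = trajectory_loss(student_states, teacher_states)
dist = distribution_loss(student_states[-1], teacher_states[-1])
print(traj, dist)
```

The student here produces the same set of outputs as the teacher but maps different samples to them, so the distribution loss is zero while the trajectory loss is not. That is the sense in which a distribution-based objective leaves the student free to find its own path.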
The challenge, and the reason I kept it for last, is that distillation is usually a post-training technique. Which means that if you want it to work with your data, you'll need to use some data for it, and you want it to converge in a good way, right? Because otherwise it's just garbage in, garbage out.
So it will require more compute, more time, and also more proficiency. As it's still an exploratory, research-driven domain, there are a lot of different techniques out there, and we expect more to come. But we are starting to see more mature techniques, and some very good examples in the latest open source models.
Of course, closed source model builders are also using this approach. FastGen is something that came out of our NVIDIA Research group. It's an open source repository, and there are a lot of different techniques in there. It's not a distillation technique or method in itself. The idea is that this is complex when we talk about large models: a lot of the new video diffusion models have twenty, thirty, forty billion parameters, and we expect them to get to hundreds of billions of parameters.
FastGen 12:51
It requires post-training, and it requires sharding at scale across different GPUs. So to manage all of this, we came up with FastGen as a way to structure this process for you and let you focus only on the quality and, of course, on fine-tuning the exact recipe you want to use. As you can see, there's optional training data here.
You can always use open source data, and it will work up to a point, right? And we're actually happy with the results there. But if your use case has a very specific data distribution, then we recommend using your own data for the fine-tuning.
As for some of the results quoted here, the speedup doesn't come only in time; it also means using much less compute to get to real time. At GTC, as I mentioned, we got a single Blackwell B200 GPU to generate near real-time or real-time video, depending on the quality of the output.
So it's something we highly recommend you look into if you want to get to this point. We do expect a lot of the other autoregressive techniques to come and gradually become relevant for video generation and image generation. We also see a lot of new model builders working on a kind of transfusion or autoregressive diffusion approach.
So you use diffusion to generate a frame, but then you generate frame after frame in an autoregressive manner. So again, we expect a lot more of these techniques to get into this domain, but it's still very research-driven. So again, this is one very good example you can take a look at.
And I think the best thing about it is that all of it is incremental. You can use this, plus this, plus this. You don't need to decide that you're doing only distillation, or only quantization, or only context parallelism. There are a lot of different techniques out there, and they're all incremental. So you can start with quantization, as I mentioned, which is the easier approach.
If it's good enough for you, stay there. If not, move to multi-GPU. Maybe do some context parallelism, maybe add some caching techniques. And then, last and most impactful, there's distillation. I hope to see a lot of you trying it and getting to real-time performance. Now, try it yourself.
Try it 15:39
All of these are open source resources that you can use. We have added support for the open source models as well, whether it's the Wan family, the Flux 2 family, or the LTX2 family, with others ongoing. So hopefully we'll also see you contributing to this and making video diffusion as good as what we see with LLMs and VLMs.
I think I'm almost at time, so if there are maybe one or two questions, I'm happy to try and answer. If not, we can give you one minute to breathe.
Q&A 16:48
On average, what would you say are the requirements to fine-tune this model? Because access to GB200s is not that easy right now. And in terms of datasets, how big are the datasets that you've seen work well with some of these models?
Okay, so the question was about the compute needed for that, and then the dataset needed for that, just so everyone can hear. What's good about distillation is that you don't need GB200, right? You can do it with Hoppers, H200 or H100, or with B200 or B300 if you have them. So you don't necessarily need very big compute as you do for pre-training, but you still need compute. It's not something where you just take one instance and start doing it.
Of course, it depends on the size of the model, right? There are video generation models that are very small, two or four billion parameters, and those will obviously require much less compute. On the data front, I think it's very important, first, to know how to evaluate, so you can understand what differs if you use a general-purpose dataset versus the specific data your use case requires.
And in such cases, we have seen differences. For the more general demos, we don't use any special dataset, and it works well. But if it's something like, I don't know, protein generation, it will require something more specific. I'm at time, I think. But yeah, until they kick me out.
Any other question there? Okay. We can continue afterwards, I think. Thanks, everyone.
