Stop Making Models Bigger, Make Them Behave — Kobie Crawford, Snorkel — AI Engineer

Intro0:00

Kobie Crawford0:16

This is the last presentation I have to give at this conference, so I'm already feeling a little of the euphoria of, ah, it's all done. I know we're also close to the end of the whole sequence. I keep finding these conferences to be some of the highest signal I get, wherever I go.

So, generally, do people feel like they're getting what they came for here? I'm just curious because we're at Snorkel, we put in our sponsorship, and we wanna know that people are getting what they want and that they're gonna come back, because we wanna sponsor next time.

We wanna know people are happy about it. So did you see what you wanted to see? Yeah? Really good. Brilliant. So now that it is three forty-five, I'm gonna go ahead and start the official thing. My name is Kobie Crawford. I'm a developer advocate at Snorkel.

We call ourselves the frontier AI data lab. Our main thing right now starts from the research-backed work that Snorkel's been doing since its inception. We've been working on a variety of things around data quality, and at this point where we're focused is providing datasets where we assure a certain level of quality.

Snorkel1:08

Kobie Crawford1:30

We're very motivated about making sure the data is high quality. Part of how we get to high quality is that we always have an expert in the loop as part of the process. We have expert contributors that we work with, and we bring people in to provide their expertise to make sure that the data we generate is of top quality.

And then we provide that to the top labs that wanna use our data to improve their models and get the hill climbing done in the right way. That's what we do at Snorkel. Because of that, a lot of what goes on is still research, and this talk is about some of the work that our research team did.

One of the keys in this research is that we're looking at how the best quality data can be best applied, and where the opportunities are to get that done. So in this particular case, talking about "stop making models bigger," I mean, it's a nice, punchy title.

Why Smaller?2:11

Kobie Crawford2:31

Of course, we don't really mean that models shouldn't be large intrinsically. The point, broadly speaking, is that sometimes there are great wins to be had with the right data applied to the right problem statement. So we're gonna talk about a specific use case that our research team discovered in partnership with the RLLM team, a research group that's part of UC Berkeley.

The UC Berkeley team over there, RLLM, the Agentica project, their lab, partnered with us on this particular work. The goal, as it says, was making a four billion parameter model outperform a two hundred and thirty-five billion parameter model on tool use tasks for financial analysis. We'll start with the research objective, then go through the approach that was used, and then talk about the results.

And I'm happy to report that we got what we were looking for. So, a couple of quick level-setting points on what we're talking about here. First, we see enterprise use cases taking on greater complexity. Obviously, we've got the massive explosion of what people are doing in terms of personal assistants.

When people are working in the context of enterprise, a lot of times you still need a more constrained choice about how to implement something and make sure that it's reliable. When you're looking at enterprise production use cases, you have to make sure a lot of safety and security things are done.

So those other priorities fold into what people typically wanna do, and we're looking at these things and saying, "Okay, these are the enterprise use cases that people have." And as people try to get the models to perform at a level that's acceptable for actually being deployed as a production service, we very often see them say, "Well, okay, we didn't get the performance that we wanted with this right now.

We'll just drop in a larger model. It'll be smarter. It has greater reasoning skills, and we'll just expect that the performance will improve," commensurate with the additional size of the model and the greater inference cost that goes along with that. And that might not always be the right thing.

So we see people saying, "Let's just get a bigger model. That'll solve the problem." And sometimes that isn't quite the answer.

In this case, what we're trying to do is ask, "Can we take a smaller model and then use RL with the right data to yield the kind of performance gains that we're looking for and to deliver the kind of application functionality that we want?" That's the target here. And again, it's for these various reasons: cost, speed, security, and then the idea that, in general, you start with a really big model, make your POC, make it work, and everybody's happy that it works.

And then it's like, okay, now what do we do to productionize it? You wanna roll it to production. You wanna think about how you're able to deploy that. Do you need to keep everything on premise? Do you have the ability to deploy and run your service yourself, so that you don't have external dependencies and don't have to worry about data export and data control?

Especially in the context of financial data, healthcare, and other domains like that, people have to be concerned about those aspects as well. So for getting a smaller model to perform as well as larger models, we feel that, in the particular case of tool use for financial analysis, RL is the right kind of training to be doing.

You're talking about changing the behavior, and RL is better for changing behavior than for, say, changing the core data and knowledge that's inside the model. So that's an intuition about how we have approached it, and that's part of what's going on here.

Tool Discipline5:59

Kobie Crawford6:13

So a larger model is sometimes like taking a sledgehammer to crack a walnut. You're just adding all of this capability. The RLLM team that we worked with talked about this, and their description of it was the Terence Tao effect. Terence Tao, the famous mathematician, I forget what awards he's won and whatnot, but he's well known for being generally brilliant about mathematics across the board, and therefore could approach and manage any kind of mathematical problem.

But that much brilliance might not necessarily be what a financial analyst actually has to have. They don't have to know all the kinds of math. They don't have to do latent digital algorithm stuff to do a SQL query, get some data back, and then do some addition and subtraction, right?

So the idea that you must always get a much smarter model, or deeper reasoning, to get something done well is the thing we're challenging here. Here is that two hundred and thirty-five billion parameter Qwen3 model responding to a question in this environment that we built. I'm gonna talk about the environment in a little more detail later.

But I point this out to show: here's a reasoning model, a smarter model, and its response in the context of needing to actually use tools. The response that it generated to the question, "What is the year-over-year growth rate of YouTube ads revenue from '23 to '24?", began with making a query to find some values.

But the query it chose was to a non-existent table. It didn't inspect the environment and the tools to find out what tables it could query. It just threw a query out without doing that. So the table wasn't there, and it didn't get anything back. It guesses again, still doesn't get anything back, and then, having gotten nothing back from either of those two attempts, it falls back to just hallucinating an answer.

And so out comes this hallucinated answer. I don't know what the weights told it to say, but that's what came out, and it's not very useful. So even though the model is much better at reasoning than a much smaller model would be, that greater reasoning did not help it when it needed to use the tools.

We're gonna come back to the same question against the model that we fine-tuned, which is only four billion parameters, and you're gonna see the difference, and we'll talk a little more about those differences later. So put a pin in that; we'll come back to that year-over-year question. So this is what we're talking about, summarizing it again.

No discipline in tool use, even though it has all the abilities to reason that it has.

Moving forward, then, to what we did in this attempt to use RL to make the smaller model work well.

Our Approach9:02

Kobie Crawford9:11

The first thing is to generate a high-quality dataset. At Snorkel, our general approach is, again, to have experts in the loop for the data that we do. The way we generate data and work on it is that we have a platform that we've used internally for interacting with things.

We solicit the work and support of experts on various tasks and topics. So if we need somebody who is in the financial analysis space already, we pull them in. We work with people at the PhD level in their domains of expertise, and also, of course, people who are deep in the industry, have been working for some time, and know their space well.

That process is one of the things we've put an emphasis on in how we work at Snorkel for our data generation. Broadly speaking, that can naturally be augmented with other kinds of things, but that's what's really key about what we wanna do in terms of emphasizing quality as a core element.

So we have the dataset, and then we go through a verification step to make sure that the tasks defined from that dataset are actually a good fit: that the data can be queried, that you know you're gonna get the results you need from it, and that there's a verifiable answer we're looking for.

We do all the verification steps to make sure that everything's correct on that front, and that's another part of what it means to put together the dataset and have it ready for use. Data quality, again, is a big emphasis for us, so we make sure that's key.
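One way to picture a "verifiable answer" is a final answer that a program can check against a reference value. The following is a minimal sketch under that assumption; the function names, the number-parsing rule, and the 1% tolerance are illustrative choices and not the checker the FinQA environment actually uses.

```python
import math
import re
from typing import Optional


def parse_number(text: str) -> Optional[float]:
    """Pull the first number out of a free-form answer such as '14.7%' or '$1,234.5'.

    Limitation of this sketch: it takes the FIRST number it sees, so an answer
    that mentions a year before the value would be parsed incorrectly.
    """
    match = re.search(r"-?\d[\d,]*\.?\d*", text)
    if match is None:
        return None
    return float(match.group().replace(",", ""))


def answers_match(predicted: str, expected: str, rel_tol: float = 1e-2) -> bool:
    """True when both answers contain a number and the two agree within rel_tol."""
    p, e = parse_number(predicted), parse_number(expected)
    if p is None or e is None:
        return False
    return math.isclose(p, e, rel_tol=rel_tol)


def reward(predicted: str, expected: str) -> float:
    """Binary outcome reward: 1.0 for a verified answer, 0.0 otherwise."""
    return 1.0 if answers_match(predicted, expected) else 0.0
```

A check like this gives each task a single pass/fail signal, which is the kind of value an RL loop can consume directly.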

And then it was time to do RL with it. There are very few surprises here if you've seen the state of the art in this space: GRPO. Again, we started with a four billion parameter model. The framework that we used is the RLLM framework, again through the UC Berkeley partnership; they're the developers of that framework. And we have our FinQA environment that we've built, and we're gonna talk about the details of that environment in just a moment.

This was able to be done in a twenty-four, twenty-one-hour job, and the total cost of running that job was under five hundred dollars per run. So RL does not have to be a very expensive thing to get non-trivial performance gains. If you're already working with models that you wanna host yourself, if you're already thinking about on-premise kinds of solutions or smaller models, and you don't yet think that you can improve the models the way you want, then this is a call to action: you actually can. It's a very tractable thing to get a model that you want to work with up to the performance levels that you need using RL.

Even if Karpathy doesn't like it.
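For readers who have not looked at GRPO: as it is commonly described, several rollouts are sampled for the same question, each gets a scalar reward, and each rollout's advantage is its reward standardized against the rest of its group. The sketch below shows only that advantage step. The function name and the `eps` value are assumptions for illustration, and the exact settings used in this study are not stated in the talk.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Standardize each rollout's reward against its own group (same prompt)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts for one question, with a binary "answer verified" reward.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))

# If every rollout fails (or every rollout succeeds), all advantages are zero,
# so that question contributes no learning signal on this step.
print(group_relative_advantages([0.0, 0.0, 0.0, 0.0]))
```

This is one reason task quality matters for this kind of training: questions the model always gets wrong, or always gets right, produce zero advantages.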

So our FinQA environment is something that we built. It's set up to host the kinds of questions being asked here. It provides a specific set of tools. Everything is built into the environment, so there are no external dependencies that might be in some remote data center that you don't have access to.

So when you deploy the environment, it's fully self-contained. It's the kind of rollout where, if you've worked with something like Harbor before, or with OpenEnv, you're familiar with using an environment like this. And this is an environment that we've actually built and published. It's available on Prime Intellect's infrastructure as well.

So it's something you can load up right there at Prime Intellect. It's also on OpenEnv, actually saved into the OpenEnv repo on GitHub. And the OpenEnv folks, the PyTorch folks and Hugging Face folks, team up and host these in Hugging Face Spaces. So these things are accessible and easy to find if you wanna take a look at them and see how you might apply them to your needs.

And again, getting started with RL is easier and easier these days. We have the FinQA set, with two hundred and ninety samples, and we have our more advanced set of seventy-nine samples, called FinQA reasoning, that requires multi-table queries. There's enough reasoning that has to be done across tables that we've identified these as harder tasks.

So we have essentially two benchmarks built inside of this environment.

So that's the setup of how we get this done. Now we're gonna talk about the evals and the results that we got, given the RL that we just did on this four billion parameter model.

Performance Leap13:55

Kobie Crawford13:55

So we did it. With that RL training loop, it performs better than the two hundred and thirty-five billion parameter model. And the performance in terms of pass@1 was essentially double what it had been, percentage-wise, in terms of solving problems. So it's a very significant uplift from this five hundred dollar loop.
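For reference, pass@1 is the probability that a single sampled attempt solves a task, averaged over tasks. A minimal sketch of how it can be estimated follows; the function name is an assumption and the outcomes in the example are made up to show the arithmetic, not results from the study.

```python
def pass_at_1(outcomes_per_task: list[list[bool]]) -> float:
    """Estimate pass@1 as the mean, over tasks, of the fraction of correct samples."""
    if not outcomes_per_task or any(not o for o in outcomes_per_task):
        raise ValueError("need at least one task and at least one sample per task")
    per_task = [sum(o) / len(o) for o in outcomes_per_task]
    return sum(per_task) / len(per_task)


# Made-up outcomes for four tasks, one sample each, before and after training.
before = pass_at_1([[False], [True], [False], [False]])
after = pass_at_1([[True], [True], [False], [False]])
print(before, after, after / before)
```

With one sample per task this reduces to plain accuracy; with several samples per task it averages out sampling noise.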

And again, the right dataset is really key. You want the questions and answers to be things that are really gonna help the model learn. But what's also interesting is what the model needed to learn. So, to give you a little flavor of how that four billion parameter model behaves: if you recall what we talked about earlier, the two hundred and thirty-five billion parameter model tried some queries without knowing what the tables were, didn't find anything, and then hallucinated an answer.

This four billion parameter model, having been fine-tuned on this dataset, first discovers the tables by using the tool get_table_names. The tool existed for the other model as well; it just didn't choose to try it. So the first thing it did was query to find out what tables it had available.

So that's already a win. The second thing is that, from there, it went on to inspect the schema: let me find out what's in that table so I know how to make the right SQL query. So it calls get_table_info to get the information it needs to know what to query. Following that, it runs a query.

It actually ran into an error. It asked for the revenue column, but that column was not part of the data in the table. Given that error, it self-corrected. It observed the error and responded by finding the actual column that it needed. So you're seeing both the error correction that it had learned to do and the use of the tools to discover the right information in the first place.
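The sequence just described (list the tables, inspect the schema, query, read the error, retry) can be replayed against a toy database. This is a sketch using Python's built-in sqlite3 module. The tool names get_table_names and get_table_info come from the talk; `sql_query`, the table name, the column names, and the revenue figures are all made up for illustration and are not the environment's real schema or data.

```python
import sqlite3

# Toy in-memory database. Table, columns, and numbers are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE segment_revenue (segment TEXT, fiscal_year INTEGER, amount_musd REAL)"
)
conn.executemany(
    "INSERT INTO segment_revenue VALUES (?, ?, ?)",
    [("YouTube ads", 2023, 100.0), ("YouTube ads", 2024, 115.0)],
)


def get_table_names() -> list[str]:
    rows = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'").fetchall()
    return [r[0] for r in rows]


def get_table_info(table: str) -> list[str]:
    # PRAGMA cannot take bound parameters, so only accept known table names.
    if table not in get_table_names():
        raise ValueError(f"unknown table: {table}")
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return [r[1] for r in rows]  # column names


def sql_query(query: str):
    """Return rows, or the error text so the caller can react to it."""
    try:
        return conn.execute(query).fetchall()
    except sqlite3.Error as exc:
        return f"error: {exc}"


# Scripted version of the disciplined trajectory.
tables = get_table_names()                      # step 1: discover tables
columns = get_table_info(tables[0])             # step 2: inspect schema
first_try = sql_query(f"SELECT fiscal_year, revenue FROM {tables[0]}")  # wrong column
value_col = next(c for c in columns if c not in ("segment", "fiscal_year"))
rows = sql_query(                               # step 3: retry with a real column
    f"SELECT fiscal_year, {value_col} FROM {tables[0]} WHERE segment = 'YouTube ads'"
)
by_year = dict(rows)
growth_pct = (by_year[2024] - by_year[2023]) / by_year[2023] * 100
print(first_try)
print(f"{growth_pct:.1f}%")
```

The year-over-year figure at the end is just (later value minus earlier value) divided by the earlier value, so the arithmetic is the easy part once the right column has been found.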

Those two behaviors are the real keys to succeeding at these questions. And this is maybe not quite intuitive about where the model was failing. The reality, and here it is getting the correct answer, is that what it needed to do was learn how to use tools.

A couple of interesting things go along with that, which are more fun and also really useful for our situation here. In the training data that we talked about at the beginning, there were single-table questions and multi-table questions included in the overall dataset. As part of the ablation study, one of the things they said was, "Let's take a look and see if we train with single-table only; train with multi-table mixed in, so the full dataset across both types; or try some curriculum learning and start with single-table, let the model climb a bit, and then progressively add multi-table."
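The three regimes in that ablation differ only in which tasks each training step is allowed to draw from. A rough sketch of the idea follows. The function name, the half-way warmup point, the linear ramp, and uniform sampling are all assumptions for illustration, since the talk does not give the actual schedule.

```python
import random


def make_schedule(regime, single, multi, steps, batch_size, seed=0):
    """Return one batch of task ids per training step for the given regime."""
    rng = random.Random(seed)
    schedule = []
    for step in range(steps):
        if regime == "single_only":
            pool = single
        elif regime == "mixed":
            pool = single + multi
        elif regime == "curriculum":
            warmup = steps // 2  # assumed: first half is single-table only
            if step < warmup:
                pool = single
            else:
                # Assumed: multi-table tasks are phased in linearly after warmup.
                progress = (step - warmup + 1) / (steps - warmup)
                pool = single + multi[: max(1, round(progress * len(multi)))]
        else:
            raise ValueError(f"unknown regime: {regime}")
        schedule.append(rng.sample(pool, min(batch_size, len(pool))))
    return schedule


single_tasks = [f"s{i}" for i in range(8)]  # hypothetical single-table task ids
multi_tasks = [f"m{i}" for i in range(4)]   # hypothetical multi-table task ids
for regime in ("single_only", "mixed", "curriculum"):
    print(regime, make_schedule(regime, single_tasks, multi_tasks, steps=4, batch_size=3))
```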

Ablation16:24

Kobie Crawford16:58

And it turned out the single-table only training was the one that yielded the greatest uplift for these kinds of questions. So that was a nice, pleasant surprise. The other surprising thing was that even though single-table only was the best training regime, the uplift we see in the model's performance on the harder benchmark, the one that has multi-table questions, was a similar doubling in percentage improvement.

So the harder multi-table Q&A in the FinQA reasoning question set also saw a jump from thirteen point nine to twenty-six point six percent after this training. So, interestingly enough, the tool discipline, just knowing how to use the tools that are in the environment, turned out to be a bigger deal than anything else in terms of making these models actually get better at what they need to do in this space.
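As a quick check on "a similar doubling," using only the two figures quoted for the FinQA reasoning set:

```latex
\[
\frac{26.6}{13.9} \approx 1.91,
\qquad
26.6 - 13.9 = 12.7 \ \text{percentage points}
\]
```

So the score on the harder set is a little under twice its starting value, a gain of 12.7 points.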

So it turned out it wasn't the reasoning that was the issue; it was the tool use. We focused only on single step for the best performance, and were able to fix that core failure mode. And with that core failure mode fixed, the improvement generalized to other question sets.

So that's the key thing to take away from this: sometimes the idea is to find the specific behavior that's really the problem. And to go back to what we do at Snorkel, one of the things that our research team has been talking about a lot lately is building rubrics as part of our evals.

Rubric Evals18:31

Kobie Crawford18:49

Those rubrics break down the rightness or wrongness of a model's response into a full list of different questions that can be answered. By looking at each of those individual questions, you can start to use the rubric as a way to find where the actual problem is among all the possible areas.

So instead of simply knowing yes or no at the end, which is good for the RL part, you can use the rubric to help you analyze which behaviors you wanna generate datasets for. You make decisions about which datasets you need, or which data you wanna work with, based on what you see coming out of the richer feedback that the rubric gives you.

And the RL still gets your single value, as GRPO usually just works with a single value. That's part of how it works. So you use that for the actual RL cycle.
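To make the split between rubric and reward concrete, here is a small sketch in which a rubric is a set of yes/no checks over a tool-call trace, used for diagnosis, while the RL loop still receives one scalar. The criterion names, the trace format, and the helper functions are hypothetical and are not Snorkel's actual rubric; only the tool names get_table_names and get_table_info come from the talk.

```python
from typing import Callable

Step = tuple[str, bool]  # (tool name, whether the call succeeded)


def called_before_first_query(steps: list[Step], tool: str) -> bool:
    """True if `tool` was called before the first sql_query call."""
    for name, _ok in steps:
        if name == tool:
            return True
        if name == "sql_query":
            return False
    return False


def recovered_from_errors(steps: list[Step]) -> bool:
    """True if there were no failed calls, or a success follows the last failure."""
    failed = [i for i, (_name, ok) in enumerate(steps) if not ok]
    if not failed:
        return True
    return any(ok for _name, ok in steps[failed[-1] + 1:])


RUBRIC: dict[str, Callable[[list[Step]], bool]] = {
    "listed_tables_first": lambda s: called_before_first_query(s, "get_table_names"),
    "inspected_schema_first": lambda s: called_before_first_query(s, "get_table_info"),
    "recovered_from_errors": recovered_from_errors,
}


def score_rubric(steps: list[Step]) -> dict[str, bool]:
    return {name: check(steps) for name, check in RUBRIC.items()}


def failure_rates(scored: list[dict[str, bool]]) -> dict[str, float]:
    """Per-criterion failure rate across many traces: the diagnostic view."""
    return {name: sum(not s[name] for s in scored) / len(scored) for name in RUBRIC}


def rl_reward(final_answer_correct: bool) -> float:
    """The single value the RL loop sees."""
    return 1.0 if final_answer_correct else 0.0


good = [("get_table_names", True), ("get_table_info", True),
        ("sql_query", False), ("sql_query", True)]
bad = [("sql_query", False), ("sql_query", False)]
rates = failure_rates([score_rubric(good), score_rubric(bad)])
print(rates)
```

Aggregated over many traces, the per-criterion failure rates point at which behavior to target with new data, while training itself continues to use only the scalar from `rl_reward`.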

So that's the summary of what we did with that. We think it's a really interesting result to know. And, again, solving the right questions, the right problems, really helps. This link here is to a blog post that we have about this.

Closing19:46

Kobie Crawford20:01

So if you have questions about the details of this particular study and wanna see more about it, you can drill down there. It also links to a partner post from the Agentica team over at UC Berkeley, so their post has additional information as well. And that's the significant thing we wanted to talk about.

So thank you for your time. I don't know how much time we have left for questions. How are we doing? Are we already at time? Looks like it. Okay, so I'm sorry we don't have time for questions. I'll hang out right outside if anybody has any follow-up questions they wanna ask.

And thank you very much. Appreciate it.
