RAG is dead, right?? — Kuba Rogut, Turbopuffer — AI Engineer

Intro0:00

Kuba Rogut0:15

Hi, welcome everyone. Thanks for coming out. I see it's a full room, so I appreciate everyone being here. Welcome to the talk, "RAG is Dead, right?" My name is Kuba. I'm a deployed engineer at Turbopuffer. For those who don't know what Turbopuffer is, we are a full-text search and vector search database, built from first principles on top of object storage.

If you'd love to learn more, or if you have any questions, just come find me after the talk. So let's get started. This talk is about how RAG is dead, and how hybrid, tool-driven retrieval is becoming the default for serious agentic search. If you've been on Twitter, or X as they call it now, or other social media platforms, you might have seen a lot of tweets like this about how RAG is dead.

RAG Discourse0:57

Kuba Rogut0:57

You can see there are lots of tweets, especially at the end of 2025 and early this year, about how RAG is dead and agentic file search is all we need, and there's a lot of content about this now. But interestingly, if we look at Google search volume over the last couple of years, you can see that in 2023, as AI takes off, we have this increase. It caps out a little in 2024, settles down for about a year, and then about midway through 2025 we hit a new inflection point where search volume just goes through the roof.

Definitions1:12

Kuba Rogut1:35

So take that, Twitter. Let's clarify first: what is RAG, and what is Agentic Search? These are the two terms a lot of people are throwing around these days. So, RAG. What a lot of people think RAG is, is just simple vector search: embedding a corpus of content, passing in an embedding vector, getting results back, and passing them through your LLM.

At Turbopuffer, what we think this actually means, if we break RAG down into retrieval-augmented generation, is that retrieval is not just vector search. It's a lot of different things. It could be vector search, full-text search using something like BM25, grepping, globbing, regex, or other basic filters. And the augmented generation is just passing the results into your LLM of choice.
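To make the point that retrieval covers more than vector search, here is a minimal sketch of two of the non-vector retrievers mentioned above: BM25 full-text ranking and grep-style regex matching. The corpus, file names, and function names are invented for illustration, and this is a toy version of the standard Okapi BM25 formula, not how Turbopuffer or any particular product implements it.

```python
import math
import re
from collections import Counter

# Toy corpus; file names and contents are made up for illustration.
DOCS = {
    "filters.py": "apply metadata filter to query results before ranking",
    "index.py": "build vector index from embedded chunks",
    "README.md": "how to query the index and filter by metadata",
}


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_search(query, docs, k1=1.5, b=0.75):
    """Rank document names by Okapi BM25 score; zero-score docs are dropped."""
    tokens = {name: tokenize(body) for name, body in docs.items()}
    n_docs = len(tokens)
    avg_len = sum(len(t) for t in tokens.values()) / n_docs
    scores = {}
    for name, toks in tokens.items():
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            df = sum(1 for t in tokens.values() if term in t)
            if df == 0:
                continue
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        if score > 0:
            scores[name] = score
    return sorted(scores, key=scores.get, reverse=True)


def regex_search(pattern, docs):
    """Grep-style retrieval: names of documents whose text matches the regex."""
    rx = re.compile(pattern)
    return [name for name, body in docs.items() if rx.search(body)]


print(bm25_search("metadata filter", DOCS))
print(regex_search(r"\bvector\b", DOCS))
```

Either result list could be passed to the model as context; a hybrid setup would simply expose several such retrievers side by side.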

And then Agentic Search. This is the term people are throwing around a lot these days. Generally, when people talk about Agentic Search, what they mean is essentially just file system grep. If you're familiar with something like Claude Code or Codex, a lot of people call this Agentic Search, and it is essentially grepping through your file system, which is why these terms are so correlated.

What we actually believe it is, and the definition we want to give it, is giving agents a set of tools to progressively and iteratively find and reason over context. With Claude Code, if you're familiar with it, it can start grepping through your file system, read a file, decide it hasn't found what it needs to complete the task, and search again. It keeps doing this until it reaches a state where it's happy to continue with the task.
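The grep, read, assess, repeat loop described here can be sketched roughly as follows. Everything in this example is hypothetical: the in-memory FILES dictionary stands in for a real file system, and the is_enough and next_pattern callbacks stand in for the model's judgment. It is not how Claude Code or any other agent is actually implemented.

```python
import re

# In-memory stand-in for a repository; paths and contents are invented.
FILES = {
    "api/search.py": (
        "from core.filters import apply_filters\n\n"
        "def search(q, flt):\n"
        "    return apply_filters(run(q), flt)\n"
    ),
    "core/filters.py": (
        "def apply_filters(rows, flt):\n"
        "    # metadata filtering happens here\n"
        "    return [r for r in rows if flt(r)]\n"
    ),
}


def grep(pattern):
    """Tool: return paths whose content matches the regex."""
    return [p for p, text in FILES.items() if re.search(pattern, text)]


def read_file(path):
    """Tool: return the contents of one file."""
    return FILES[path]


def agentic_search(start_pattern, is_enough, next_pattern, max_steps=5):
    """Grep, read, assess, and repeat until is_enough() accepts the context.

    is_enough and next_pattern are placeholders for decisions a model would make.
    """
    context, seen, pattern = [], set(), start_pattern
    for _ in range(max_steps):
        for path in grep(pattern):
            if path not in seen:
                seen.add(path)
                context.append((path, read_file(path)))
        if is_enough(context):
            break
        pattern = next_pattern(context)
    return [path for path, _ in context]


visited = agentic_search(
    start_pattern=r"def search",
    is_enough=lambda ctx: any("metadata filtering" in text for _, text in ctx),
    next_pattern=lambda ctx: r"def apply_filters",
)
print(visited)  # ['api/search.py', 'core/filters.py']
```

The first grep finds only the entry point; the assessment step decides that is not enough, so a second, refined grep follows the call into the file that actually does the filtering.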

Cursor Indexing3:15

Kuba Rogut3:16

So we're going to take a step back and talk about one of the companies that uses Turbopuffer, one that we believe is doing an excellent job with Agentic Search. This is a company called Cursor. You might have heard of them. Fun fact: they're actually one of Turbopuffer's very first customers. They have an excellent blog post that came out at the beginning of 2026 about how they index code bases.

For those unaware, when you open a new code base or a new branch in Cursor, Cursor starts embedding your code base. It will parse, chunk, and embed your code base and make it available for semantic search. The blog post goes into excellent technical detail on how they do this.

Just to give you the gist, the cool thing they found is that for most people working on a team, say a hundred engineers, the code bases they open are the same code base 99 percent of the time, because a team of a hundred people is mostly working on one, two, maybe a few code bases, right?

And it's really expensive to re-chunk, re-embed, and re-upload these code bases every single time. So they use Merkle trees, which are essentially cryptographic hash trees, to calculate similarities between the code bases people on the same team open. If they're similar enough, they copy over the data and then re-chunk and re-embed only the files that have changed, and they use Turbopuffer to make sure this is done securely.
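As a rough illustration of why a Merkle tree helps here, the sketch below hashes a nested directory structure and then walks two trees together, skipping any subtree whose hash is unchanged. The data layout and function names are assumptions made for this example; it is not Cursor's implementation, and it only reports new or modified files, not deletions.

```python
import hashlib


def sha(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def build(node):
    """Return (hash, children) for a nested dict whose leaves are file bytes.

    A directory's hash covers its children's names and hashes, so any change
    below a directory changes that directory's hash as well.
    """
    if isinstance(node, bytes):
        return (sha(node), None)
    children = {name: build(child) for name, child in sorted(node.items())}
    listing = "".join(f"{name}:{child[0]};" for name, child in children.items())
    return (sha(listing.encode()), children)


def diff(old, new, prefix=""):
    """List files in `new` that differ from `old`, skipping equal subtrees."""
    if old is not None and old[0] == new[0]:
        return []  # identical hash: nothing below here needs re-embedding
    if new[1] is None:
        return [prefix]  # a changed or newly added file
    old_children = old[1] if old is not None and old[1] is not None else {}
    changed = []
    for name, child in new[1].items():
        path = f"{prefix}/{name}" if prefix else name
        changed += diff(old_children.get(name), child, path)
    return changed


teammate_copy = {"src": {"a.py": b"x = 1", "b.py": b"y = 2"}, "README.md": b"hi"}
my_copy = {"src": {"a.py": b"x = 1", "b.py": b"y = 3"}, "README.md": b"hi"}

print(diff(build(teammate_copy), build(my_copy)))  # ['src/b.py']
```

Only the files returned by diff would need to be re-chunked and re-embedded; everything else can be reused from the existing index.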

It's just an excellent blog post. They do some really cool stuff. You may think this is a lot of work, so why do they do it? The reason is covered in a different blog post about how they use semantic search. Again, they use Turbopuffer for this.

What they find is that, on average across models, there's something like a 12.5 or 13.5 percent increase in answer accuracy. This is on their internal Cursor context benchmark, so not a public benchmark, but you can trust the numbers they give us. And you can see on the right side that their Composer model, which is before Composer 2, had almost a 24 percent increase in answer accuracy.

Performance Gains4:55

Kuba Rogut5:20

So giving semantic search to these tools and models really can drive real performance gains. And you can see on the bottom right, this is from an online A/B test they did, which is also covered in their blog post: there's almost a 2.6 percent increase in code retention in large code bases, and a 2.2 percent decrease in dissatisfied user requests.

And you may be thinking, "Oh, well, these numbers aren't that big. 2.6 percent, 2.2 percent, not that large." But they also point out that semantic search isn't used in every single query. In their online A/B test, if you give these tools to a hundred random queries, not every one of those hundred queries will actually benefit from the existence of a semantic search tool.

Cached Compute6:05

Kuba Rogut6:05

So that's why these numbers look kind of small. Now let's talk a little bit about Claude Code. Claude Code doesn't use vector search, as covered by this tweet from Boris Cherny. For those unfamiliar with Boris, he's essentially the founding father of Claude Code. He says that in early iterations of Claude Code they actually did use RAG and a local vector DB, but they found it just didn't really work out for them.

But there's something important to understand here, something we've taken on a lot internally at Turbopuffer: the idea that embeddings and semantic search are a kind of cached compute. And you may be thinking, cached compute, you're throwing a lot of terms at me right now.

Like, I don't know exactly what that means. I think it's best to walk through an example of a Claude Code-looking trace and a Cursor-looking trace of how these agents come to understand your code base. On the left is the per-session discovery of Claude Code.

For example, if we ask the agent to understand how metadata filtering works, it has to grep, read, assess, and repeat to find the files it needs, and it gains this understanding on a per-session basis. This means you could have ten agents on ten different days, across ten developers, asking the same question multiple times every day.

And every time, the agent has to repeat these exact same steps to gain the same understanding of the code base. This can cost quite a few tokens. Six thousand doesn't seem like a lot here, but remember this is just one sub-step of an agent.

And on the right is a more Cursor-looking trace, where there's an upfront cost of indexing, but then a lightweight tool helps the agent retrieve this information at runtime. Obviously there's the upfront cost of parsing the code base, embedding it, and making it available, but this is a one-time cost.

Then at runtime, the agent can just query something like "how is metadata filtered," get some simple results, and save a lot of tokens, a lot of time, and a lot of money. This helps the agent become a lot faster. A lot of people on the team here at Turbopuffer who were big Claude Code users have actually started switching to Cursor just because of how fast it's becoming, especially with the Composer 2 models and the semantic understanding.
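The "cached compute" trade-off can be put into a back-of-the-envelope formula. The sketch below compares repeating discovery in every session with paying once for an index and then issuing cheap queries. The 6,000-token discovery figure comes from the example in the talk; the index cost and per-query cost used at the bottom are purely hypothetical placeholders, and treating indexing cost as if it were measured in the same tokens is a simplification.

```python
def per_session_tokens(sessions, discovery_tokens=6_000):
    """Total tokens when every session repeats grep/read/assess discovery."""
    return sessions * discovery_tokens


def indexed_tokens(sessions, index_cost, query_tokens):
    """Total tokens when an index is built once and queried each session."""
    return index_cost + sessions * query_tokens


def break_even_sessions(index_cost, query_tokens, discovery_tokens=6_000):
    """Smallest number of sessions at which the indexed path is strictly cheaper."""
    saving = discovery_tokens - query_tokens
    if saving <= 0:
        raise ValueError("a query must cost less than per-session discovery")
    return index_cost // saving + 1


# Hypothetical numbers, chosen only to show the shape of the trade-off.
INDEX_COST = 100_000   # one-time cost, expressed in token-equivalents
QUERY_TOKENS = 500     # cost of one semantic-search call and its results

n = break_even_sessions(INDEX_COST, QUERY_TOKENS)
print(n, per_session_tokens(n), indexed_tokens(n, INDEX_COST, QUERY_TOKENS))
```

The one-time cost is amortized across every later session, which is the sense in which an index behaves like a cache of work the agent would otherwise redo.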

Agentic Retrieval8:34

Kuba Rogut8:34

We're finding it's just become really, really good. So, from RAG to Agentic Retrieval. What we're finding now is that a lot of people are no longer doing simple RAG, the Twitter "RAG" of doing a vector search once and throwing the results into the context window. That worked back in 2023 or early 2024, in the early days of AI.

But a lot of the more sophisticated customers are doing Agentic Search, and it's giving really big performance gains and unlocking new products. What we're finding is that they're doing a ton of calls. These agents are reasoning through several steps. They're searching semantically or through full text as needed, and they're fetching only what's needed for that specific use case.

The important thing to know is that retrieval is no longer a simple one-time call to a vector DB. It's becoming super iterative: these agents understand what they're searching for, and they search in order to understand more. It's an interesting loop. Google's Jeff Dean went on a show or a podcast, I forget which, and he had this really good quote that we like to use and that we thought was super interesting.

Stage Retrieval9:44

Kuba Rogut9:55

He was talking, I believe, about how Gemini's models have these really big context windows. I forget the exact question the host asked him, but he was saying that even if you get to a trillion-token context window, what you really need is staged retrieval: a lightweight mechanism to narrow those trillion tokens down to essentially millions at a time.

The exact quote is, "You don't need a trillion at once. You need the right million." This is something we think about a lot here at Turbopuffer. We have customers with trillions of tokens inside Turbopuffer, and as we see it, the really important part is getting down to the right hundred thousand, the right ten thousand, the right million, to pass into these context windows.
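One way to picture staged retrieval is as a funnel: a cheap filter first, a more expensive ranking on the survivors, and finally packing results into a fixed token budget. The sketch below is a toy version under assumed inputs; the chunk fields, the keyword filter, and the two-dimensional vectors are all invented for illustration and do not represent Gemini's or Turbopuffer's actual pipeline.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def staged_retrieve(chunks, query_terms, query_vec, token_budget, shortlist=100):
    """Narrow a large corpus in stages, then pack results into a token budget."""
    # Stage 1: cheap keyword filter over the whole corpus.
    stage1 = [c for c in chunks if query_terms & set(c["text"].lower().split())]
    # Stage 2: costlier vector ranking, run only on the survivors.
    ranked = sorted(stage1, key=lambda c: cosine(c["vec"], query_vec), reverse=True)
    # Stage 3: greedily pack the best chunks that still fit the context budget.
    picked, used = [], 0
    for c in ranked[:shortlist]:
        if used + c["tokens"] <= token_budget:
            picked.append(c["id"])
            used += c["tokens"]
    return picked


# Invented example chunks with tiny 2-d "embeddings".
CHUNKS = [
    {"id": "a", "text": "metadata filter applied before ranking", "vec": [1.0, 0.0], "tokens": 40},
    {"id": "b", "text": "filter syntax reference", "vec": [0.6, 0.8], "tokens": 50},
    {"id": "c", "text": "billing and invoices", "vec": [1.0, 0.0], "tokens": 10},
    {"id": "d", "text": "metadata schema overview", "vec": [0.0, 1.0], "tokens": 30},
]

print(staged_retrieve(CHUNKS, {"metadata", "filter"}, [1.0, 0.0], token_budget=100))
# ['a', 'b']
```

Chunk "c" never reaches the ranking stage, and chunk "d" is ranked but does not fit the budget, which mirrors the idea of getting down to the right subset rather than passing everything in.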

Outro10:43

Kuba Rogut10:43

That's about it for the talk. If you have questions about any specifics, I'd love to take them now, or you can find me after the talk. But I appreciate you guys coming out.
