Intro 0:00
Good afternoon, everyone. Know it's the last day. Hopefully you're still holding on, not too tired of talking about AI yet. Um, yeah, today's gonna be a little different. Um, so I work for WorkOS. My name's Garrett. I run product for the team. Um, I'm not exactly talking about our product today, so I'll do, like, 10 seconds about WorkOS just to, you know, get that out of the way since, uh, my company cares about that.
We do enterprise platform features. We're a developer platform. Uh, the quick and easy is if you've ever logged into Cursor, you've used WorkOS, whether that was, like, username, password, or you went through your enterprise IDP. We power enterprise platform features for the likes of Cursor, Anthropic, OpenAI. Um, today though, I'm gonna talk about something a little bit different.
The Problem 0:57
I'm gonna talk about how we operate internally and things that we've built to make ourselves more productive. So, um, I imagine most of you are probably engineers, uh, or on the technical side. Um, you might have questions about things about your company, about how customers are using the product, how things are working. Uh, your go-to-market teams or your support teams definitely have questions about how customers are using your product, trying to figure out, uh, answers to questions.
You might have, like, Retool inside your company. You might have built, like, dashboards and things like that. Uh, of course, those can be fairly rigid, right? You build a very specific thing. Someone comes and says, "Oh, actually, but I need this extra bit of data. I need to find out- answer this different question that the dashboard doesn't answer."
And so either you go and build that, you change it, right? We kind of see, like, a workflow that looks kinda like this where someone has a question, often about the business. They may not be technical enough to go answer it themselves. Uh, they often need something like SQL or someone that has access to the data. They have to explain their question, why they need it answered, the qu- you know, context to answer it.
They wait. Someone like you has to go answer the question, provide that data back to them. Did you actually answer the question? Did you provide enough detail? Oh, no, no, that's great, but I actually need the next layer deeper. Gotta go back and forth. Uh, you probably share that in Slack, sort of a one-off. Doesn't really scale very well.
Um, we have this problem. Even if you don't, we have this problem every day. And so we built a tool called Studio, uh, that serves as kind of an internal workspace where people can answer questions and build these kind of apps or dashboards themselves. So I'm gonna show you both kind of what it looks like to build this out.
Studio 2:20
I'll also show you a few examples of some of the tools that we use every day inside of Studio. And then I'll talk a little bit about how it works under the covers. So, um, so I don't, uh, get this completely wrong, I have a little prompt here already. But so, um, you know, a common thing we have, we do a lot of marketing.
You know, we're doing podcast advertisements. We're doing Google ads. We're getting people to come to the WorkOS site, whether that's our blog, our docs, or our marketing site. And then we wanna know what content they're reading and what's effective, right? What is someone reading on our site before converting to actually using the app?
So I wanna know, hey, what content leads to the most new teams? We call all customers teams internally, so leads to the most new team creations. So I can fire this off and we will, um... Studio starts operating. It basically says, "Okay, I wanna find this data. I need to kinda look at, like, what, you know, internal resources do I have access to?"
So it knows that it has access to, like, my Linear, my Notion, my Snowflake. So we have these data sources that we connect to. Um, and then basically understands how to use these tools and starts to run queries. So in this case, it's gonna run a bunch of Snowflake queries, which is our internal database. It's where we store a lot of this data.
Um, and it's gonna go through, figure out the schemas, look at the tables that it needs to, and do it. While it's doing this, since it might take just a minute, I'm gonna talk a little bit about, uh, how it works, uh, kind of under the hood. So, uh, you can either go to s- our internal studio dashboard or in Slack, we have a Slack bot, so you can ask questions of Studio.
Architecture 3:56
That kicks off the process. We run a little API behind the scenes that takes that, parses it, and then runs it through an agent built on LangGraph. The agent is tied to both the LLM, which in this case is Opus, and the tools and the guidance layer for how we should interact with these systems. So we have this integration proxy, uh, to the data sources. Primarily Snowflake, Linear, and Notion are the tools that we use.
Um, and this guidance layer basically defines rules around how you should query this data, the context you need to successfully query the data, right? Our Snowflake is a pretty sprawling set of databases, so it needs to, like, get context around how do I... what's the representation of a customer inside of Snowflake? How do I join tables in a way that's effective?
So the agent drives all of this, uh, makes queries, the LLM runs, and then, of course, it provides back answers or, uh, updates widgets, which I'll show you in a minute. And then we store a lot of that state today in Convex as a way to locally store this information so it's preserved over sessions. So we go back.
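As a rough illustration of the flow described here (request in, tool chosen, data source called through a proxy, state stored for the session), here is a minimal sketch. Every name in it is hypothetical: `choose_tool` is a keyword rule standing in for the model, the `TOOLS` dict stands in for the integration proxy, and `Session` stands in for the persisted state. It does not use LangGraph, Convex, or any real connector.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Session:
    """Hypothetical stand-in for state that is preserved across sessions."""
    messages: list[str] = field(default_factory=list)


# Hypothetical integration proxy: tool name -> callable that reaches the source.
TOOLS: dict[str, Callable[[str], str]] = {
    "snowflake": lambda q: f"rows for: {q}",
    "linear": lambda q: f"issues for: {q}",
    "notion": lambda q: f"pages for: {q}",
}


def choose_tool(question: str) -> str:
    """Placeholder for the model's tool choice (a keyword rule, not an LLM)."""
    lowered = question.lower()
    if "ticket" in lowered:
        return "linear"
    if "doc" in lowered:
        return "notion"
    return "snowflake"


def handle_request(question: str, session: Session) -> str:
    """API entry point: record the question, pick a tool, call it, store the answer."""
    session.messages.append(f"user: {question}")
    tool = choose_tool(question)
    answer = TOOLS[tool](question)
    session.messages.append(f"{tool}: {answer}")
    return answer
```

In a real agent the tool choice and the answer would come from the model; the point of the sketch is only the shape of the loop and where the state gets written.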
Um, cool. Looks like we've actually gotten a bunch of data here. Let me make it a little bit bigger for you. So we can see obviously, like, you know, people go to our homepage, people look at the pricing page. Um, we can see, like, the blog posts that are most effective for driving team signups, um, change logs and docs, and we kinda get the summary.
Okay. But this is great, so this, like, answers my kinda question, but I want this to be, like, a long-standing thing that I can reuse. So can you build a table of this that lets me see, see this data over various time slices?
Building Widgets 5:44
And so here it's not just, you know, run the queries, get the answer, but I actually want it to build like a reusable tool that I can, that I can share with my teammates, that I can use in like our weekly syncs. And so it's gonna kind of think through how to do this, and then it's gonna go and build what we call a widget.
A widget is, in this case, uh, basically sandboxed code that runs. Um, and it's the UI, the APIs, and the query necessary to power, uh, a fully usable tool. Um, so this is gonna think for a minute, uh, as it actually creates the widget. Uh, I have another version of it that I can show you.
We'll see if they look the same, uh, across instances. But this is like a one that I had pre-built before this. Um, so it basically gives me this data of like teams over different time spans, you know, what content is driving those signups. Um, and it, it's live, right? If I, if I run this, it's gonna rerun that query.
It's gonna give me the data for different time slices. Uh, I can make it, you know, full screen here, show it. Um, and so it, you know, it's gonna think through this. Um, look, and we get like a pretty similar, slightly different view, but, um, this one actually has like category filters, so I can look at, you know, based on the kind of content that it's running, um, and see like, okay, what are the most effective, you know...
Radar Example 7:26
We got a lot of blog posts, which are the most effective ones, um, that drive traffic to the platform so that we can tailor our content effectively. Um, but this isn't, you know, this is useful for a lot of things. Like for example, uh, Radar is one of our internal products. It's like a security product that blocks bots and bad actors, and, you know, sometimes customers say, "Hey, why did this user get blocked by Radar?"
Right? "Can you help me understand?" And so, you know, we've built some of these dashboards and stuff ourselves, but typically that involves like having to go through. A lot of our GSCs are like sharing SQL queries that'll run to answer these questions. But instead of having to do all that, I can just, you know, uh, I've already built this widget that has the APIs or the queries hooked up, and I can do a search for myself in this case, my personal email.
Uh, and, you know, in this case, it's running a real query against our database to actually pull this data and look at it. And so you can kind of see the conversation history here of me talking with it, "Hey, can you build me this dashboard?" Uh, it runs a bunch of queries. Uh, it actually kind of messed up at first, but I was, you know, saying like, "Hey, there seems to be an issue. Can you keep going?"
Um, it did it, and sort of the last thing was it had a visual UI bug in the type column, so I was like, "Hey, can you fix that visual bug? I'd like it to be one nice little column." And so here we can see, uh, for Cursor, who's one of our customers that uses Radar, here's all the times that I logged into Cursor with my personal email and whether I was blocked or not.
Uh, I had a test here where, uh, you know, I blocked myself in one of our test environments. And so like this becomes like a self-serve tool that our support team can use to look this stuff up. Um, and so this has been really, really powerful for our support team. They use this in Slack all the time because they don't need, um...
You know, different customers have different, you know, specific issues, and they can say like, "Hey, can you go find me all the sessions that this customer has so I can find out like what went wrong?" Right. And so we're not trying to build, you know, we don't need to have some sort of like platform team or data team building these dashboards that are gonna be used and need to be constantly modified.
Uh, our support team can basically, if it's a one-off, get the question answered themselves. And if they're finding that they're actually asking the same question a lot, they can build these and then, you know, we can share them internally to other folks. And so we kind of build out our own dashboard and tooling, um, in a self-serve manner.
Three Techniques 9:55
So I'll kind of, uh, wrap up with a little bit of like what did we have to do here to make it, uh, like useful and reliable. So there's kind of three things, um, that became really important in building this. The first is sequencing. So this is, uh, how should the agent like approach when it gets a new question, when it gets a prompt, how should it do this?
So we make it run a lot of pre-flight checks, um, so this is, you know, are the tools connected correctly? Do you have enough context to answer the question? If not, ask clarifying questions, um, and then run through a checklist to determine the tools that it should actually call.
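A pre-flight step like this could look roughly as follows. This is a sketch under assumptions, not Studio's code: the `Preflight` type, the `preflight` function, and the "fewer than four words means too little context" rule are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Preflight:
    ready: bool                       # True only when nothing needs clarifying
    clarifying_questions: list[str]   # what to ask the user before proceeding
    tools: list[str]                  # tools that are connected and will be used


def preflight(question: str, connected: set[str], needed: list[str]) -> Preflight:
    """Check tool connections and context before the agent starts querying."""
    questions: list[str] = []

    # 1. Are the tools this question needs actually connected?
    for tool in needed:
        if tool not in connected:
            questions.append(f"Please connect {tool} before I continue.")

    # 2. Is there enough context? (Hypothetical heuristic: very short prompts.)
    if len(question.split()) < 4:
        questions.append("Can you say more about what you want to know?")

    # 3. Settle on the tools to call.
    usable = [tool for tool in needed if tool in connected]
    return Preflight(ready=not questions, clarifying_questions=questions, tools=usable)
```

A real agent would let the model judge whether the context is sufficient; the word-count rule only marks where that judgment would sit.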
Um, at the time it decides to invoke a tool, that's when we inject context around how to use the tool. So, um, for example, if I show some of the, uh, tooling that we use, like for Snowflake, we have this context that we embed, and it's not trivial. It's fairly long 'cause it basically encodes the schema of our internal database and how to connect teams to the environments and to the resources that they're using.
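The idea of injecting a tool's guidance only when that tool is invoked can be sketched like this. The prompt strings and the `build_prompt` helper are hypothetical; the real Snowflake context is described as much longer.

```python
from typing import Optional

BASE_PROMPT = "You are an internal data assistant."

# Hypothetical per-tool guidance, kept out of the prompt until it is needed.
TOOL_CONTEXT = {
    "snowflake": "Schema notes: how teams join to environments and resources.",
    "linear": "Linear notes: how issues map to teams.",
}


def build_prompt(history: list[str], invoked_tool: Optional[str] = None) -> str:
    """Assemble the prompt; add a tool's guidance only when that tool is invoked."""
    parts = [BASE_PROMPT, *history]
    if invoked_tool is not None:
        parts.append(TOOL_CONTEXT[invoked_tool])
    return "\n".join(parts)
```

With no tool selected, none of the tool guidance is in the prompt; once `invoked_tool="snowflake"` is passed, only the Snowflake notes are added, so the context window carries one tool's guidance at a time.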
And so, for example, this gets injected, uh, at runtime when Snowflake is being invoked. I saw someone earlier in a talk talking about how you, you know, you don't wanna preload all that context of all your tools 'cause it blows out your context window. Second is layering. Um, so we have sort of the base prompt that Studio uses to start off with.
We have sort of the defaults, and then we have org rules: in the setup for a given tool, there might be a specific context. If someone's going and editing a tool, we want that context to be maintained. Um, and then actually last, um, we tell the LLM to specifically distrust, uh, its own knowledge about our product, just because sometimes the model's training data is outdated.
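Layered guidance of this kind can be pictured as a simple ordered composition. The layer names follow the talk (base prompt, defaults, org rules, tool-specific context), but the function, the ordering, and the example strings are assumptions made for illustration.

```python
def compose_guidance(base: str, defaults: str, org_rules: str = "", tool_context: str = "") -> str:
    """Join the guidance layers in order, skipping any layer that is empty."""
    layers = [base, defaults, org_rules, tool_context]
    return "\n\n".join(layer for layer in layers if layer)


# Hypothetical example layers.
guidance = compose_guidance(
    base="You are an internal data assistant.",
    defaults="Prefer primary sources such as the docs over what you remember about the product.",
    org_rules="Only count entities that are active and not deleted.",
    tool_context="This widget filters sign-ups by content category; keep that filter when editing.",
)
```

Putting the more specific layers later is a design choice in this sketch, on the assumption that later instructions should refine earlier ones.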
Our product changes very quickly. Things are moving all the time. And so we actually tell it, "No, no, no, go to primary sources. Look up data in our docs and things like that. Don't just rely on what the, you know, model knows about WorkOS necessarily." And then last, validation. So, um, if it's gonna write a query to our Snowflake instance, we have it always run the query and validate that it gets data back.
Uh, you know, many times it can write a valid SQL query that returns zero data. If it doesn't notice that, it's not very useful. So it actually runs queries and validates them before it hard-codes them into widgets and things like that. So it's basically pre-validating its work before it's deploying it to, like, a dashboard. Um, and then, yeah, we obviously run evals when we're developing the product.
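The "run it and check that data comes back" rule can be sketched with an in-memory SQLite database standing in for Snowflake. The table, the rows, and the `validate_query` helper are hypothetical.

```python
import sqlite3


def validate_query(conn: sqlite3.Connection, sql: str, params: tuple = ()) -> list[tuple]:
    """Run the query and refuse it if it is invalid or returns no rows."""
    rows = conn.execute(sql, params).fetchall()  # invalid SQL raises sqlite3.Error here
    if not rows:
        raise ValueError("query returned no rows; refusing to embed it in a widget")
    return rows


def demo_connection() -> sqlite3.Connection:
    """Build a tiny stand-in database (not the real schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE signups (page TEXT, team_id INTEGER)")
    conn.executemany(
        "INSERT INTO signups VALUES (?, ?)",
        [("/pricing", 1), ("/blog/sso", 2)],
    )
    return conn
```

A query that is syntactically valid but filters on a page that does not exist raises `ValueError` instead of being silently baked into a dashboard.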
Evals are very useful. I don't have time to go into how to develop and design evals, um, but we use evals, uh, in both our staging and production instances. We treat all of that the same, so that way we get the same experience when we're developing Studio versus when our teammates are using it. So yeah, that's Studio.
It, uh, is our way of basically being able to answer any question about the business. Anything I can answer for y'all?
Q&A 12:49
One quick question.
Yep.
Sorry.
Sorry, go ahead.
Um, did you have to do a lot of cleanup on your Snowflake, uh, data estate before-
No. Actually, and, like, we have... There's this one specific problem where, like, the connection between a customer entity and, like, you know, the users that they have or whatever, is like four joins deep because of reasons. And, like, every new employee has to learn, like, if you wanna do that, you have to copy and paste this join block. With Studio, we tell it about it once, right?
It knows how to do that every time. Uh, LLMs are quite good at interpreting table schemas. Um, and so there's a lot of stuff they can figure out if you have, like, pretty self-descriptive, you know, column names and stuff. Um, but again, we do have that, like, context block that we provide because it does matter. For example, um, in Radar we have attempts, which is people trying to log in.
We have detections for when, like, things... and it's like, oh, you know, these tables join this way. And just by telling it that, it can basically run effective queries for that data. So there is some, uh, information you wanna provide. Um, but it's surprisingly good. Like, you don't need to, like, RAG all this stuff, you know.
Like, we... There's no RAG database for us in all of this. Like, we're just invoking tools directly, um, with just context on top. Question?
Uh, yeah. Um, and a- interestingly, that's exactly what we do. We've just got a, like, a context thing that tells it how to do all the joins-
Yeah
... and then it knows, and it knows those quirks. Um, one of the things we've been looking at, though, we, we do something similar-ish-
Yeah
... but is, so those queries get generated, so it, I mean, those, those are widgets. You call them widgets. Those coming out.
Yeah.
Do those get audited or governed by anyone? Because our, our concern was that, like, someone generates a query, their skill gets it wrong, and then it becomes a truth, and everyone thinks it's true, and no one's ever checked it. So do you have anything like that?
Yeah. I mean, there's definitely a little bit of, like, you know, there's always some trust but verify. You know, um, I've actually been pretty impressed that, uh, the hit rate on the queries is very, very high. I think there's, like, sort of a category of, um... things which you can actually embed into the context, like, you know, make sure you only pull non-deleted entities, right?
Make sure you pull things in an active status, right? There's kind of things in your data that you probably have consistency around, like a status column, right? And, like, those are the kinds of things that I've seen that LLMs will miss if they don't know. It's like, you know, how many users have this resource?
And it's just doing a count group by, you know, customer ID, right?
Yeah.
And it's like, oh, no, you actually need these filter columns. Um, but if you have that in your context, uh, that's the kind of thing that protects against a lot of those issues. So I find that that has removed a lot of the, the problems for us. Um, and then, you know, if, if it misses from there, it's, it's, you know, it's, like, making a bigger error that's pretty obvious.
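To make the failure mode concrete, here is a small SQLite sketch of the naive count versus the filtered count. The `users` table and its `status` and `deleted_at` columns are hypothetical stand-ins, not the real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (customer_id TEXT, status TEXT, deleted_at TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [
        ("acme", "active", None),
        ("acme", "active", "2024-01-01"),  # soft-deleted row
        ("acme", "inactive", None),        # not in an active status
        ("globex", "active", None),
    ],
)

# What a model tends to write without guidance: a plain count per customer.
NAIVE = (
    "SELECT customer_id, COUNT(*) FROM users "
    "GROUP BY customer_id ORDER BY customer_id"
)

# What the context block asks for: active, non-deleted rows only.
FILTERED = (
    "SELECT customer_id, COUNT(*) FROM users "
    "WHERE status = 'active' AND deleted_at IS NULL "
    "GROUP BY customer_id ORDER BY customer_id"
)

naive_counts = dict(conn.execute(NAIVE).fetchall())
filtered_counts = dict(conn.execute(FILTERED).fetchall())
```

Both queries are valid and both return rows, so a non-empty check alone would not catch the difference; here the naive query reports three users for `acme` where only one is active and not deleted.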
Um, can a single widget have, uh, data from multiple tools, so you mix and match it?
Yeah. Yeah, so, so we have, uh, you know, just those few. We're adding more of those connections. But yeah, it can pull from different tools and combine that data into one interface.
And how would you refresh the data afterwards? Would you need to, like, know how to replay those tools, uh, sequentially, and-
So it's actually, the widgets are actually code. So it's writing, um, JavaScript that is-
Oh, it's writing JavaScript
... that is making the underlying API calls to that service through the tools. So once the widget is created, it is reliable. It is not... The, it's not the LLM running the tool every... So when I hit, like, refresh here, this is actually just re-querying, uh, data from those tools. So the LLM's not involved once the widget is developed, until I go back and say, "Hey, can you make an adjustment to this widget?
Can you add this column?" Or whatever. So, so the actual final product is very reliable in that regard.
Right. And if you need to pass different data, it's just, uh, input arguments to that, uh-
Yeah, and, uh... You know, here I'm giving an input, and that's just being fed into the query like any sort of JavaScript would do it, right? So, like, when you're doing these user input things, again, right, you're not relying on the LLM to be able to, like, parse that correctly.
It's writing, you know, declarative code.
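The widgets in the talk are JavaScript; the sketch below uses Python and SQLite only to show the shape of the idea: once generated, a widget is ordinary code with a fixed, parameterized query, and refreshing it never calls the model. The table, columns, and `make_login_widget` are hypothetical.

```python
import sqlite3


def make_login_widget(conn: sqlite3.Connection):
    """Return a refresh function; the query is fixed, only the input varies."""
    def refresh(email: str) -> list[tuple]:
        # The user input is bound as a parameter, never spliced into the SQL text.
        return conn.execute(
            "SELECT app, blocked FROM login_attempts WHERE email = ? ORDER BY id",
            (email,),
        ).fetchall()
    return refresh


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE login_attempts "
    "(id INTEGER PRIMARY KEY, email TEXT, app TEXT, blocked INTEGER)"
)
conn.executemany(
    "INSERT INTO login_attempts (email, app, blocked) VALUES (?, ?, ?)",
    [
        ("me@example.com", "cursor", 0),
        ("me@example.com", "cursor", 1),
        ("other@example.com", "cursor", 0),
    ],
)
widget = make_login_widget(conn)
```

Calling `widget("me@example.com")` re-runs the same query every time, which is why the finished product behaves like normal software rather than like a model call.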
Mm-hmm.
Yep.
And how did you respect user access to the data?
How do you what?
Respect user access to the data?
Oh, yeah. That's a great question. Today, uh, the integrations are user-based, so, like, I'm connecting Snowflake and Linear and Notion myself. Um, that's something we're actually working on changing, 'cause that's kind of annoying. Like, you don't want every employee to have to necessarily do that. And there are cases where, like, you know, maybe you don't have a Salesforce account, but you probably should be able to read certain Salesforce data.
So we're actually working on... The thing that drives these integrations, we have a product called Pipes, which does third-party integrations. So we're actually using our own product under the hood here. We're building out what we call org connectors. So, like, one person sets up the connection, and then can set rules about, you know, what's the default level of access when people are querying that.
So, you know, for example, you could say, by default in, you know, Linear you get read-only access. But certain people, based on, like, roles in the Studio application, they get, uh, you know, admin or, or edit access or something like that. So we're kind of building that permissioning layer on top of it, because, yeah, doing the per user login is kind of annoying.
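The permission model described here (one shared connection, a default access level, and role-based upgrades) might be sketched as below. This is not the Pipes API; `OrgConnector`, the access levels, and the "highest level wins" rule are assumptions for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical access levels, ordered from least to most privileged.
ACCESS_ORDER = ["none", "read", "edit", "admin"]


@dataclass
class OrgConnector:
    """One shared connection to a tool, with a default and per-role access levels."""
    tool: str
    default_access: str = "read"
    role_access: dict[str, str] = field(default_factory=dict)

    def access_for(self, roles: list[str]) -> str:
        """Return the highest level granted by the default or any of the user's roles."""
        levels = [self.default_access]
        levels += [self.role_access[r] for r in roles if r in self.role_access]
        return max(levels, key=ACCESS_ORDER.index)


linear = OrgConnector("linear", role_access={"support_lead": "edit"})
```

Someone with no special role gets the read-only default from `linear.access_for([])`, while `linear.access_for(["support_lead"])` resolves to `"edit"`.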
One more.
Cool. Uh, yeah, sure.
Um, yeah. How do you, uh, keep a handle on costs? So obviously you're using Opus. Is there any, like, caching that you do?
Um, I mean, the widgets themselves, once they're generated, are declarative, so you're not paying the LLM cost every time. But, uh, honestly, for us, we're willing to pay the cost for the questions being answered. Like, Opus outperforms other models by so much that, uh, trading off the cost would trade off quality in a way that we wouldn't deem acceptable, so, um, I think we're pretty willing to spend the money.
Cool. All right. Thanks so much.