Self Driving Products: Product Signals to Pull Requests — Joshua Snyder, PostHog — AI Engineer

Intro 0:00

Joshua Snyder 0:15

So I'm Josh. I'm from PostHog. If you haven't heard of us, you might know us because of some hedgehogs, or you might have seen our founder, James, posting some funny things on LinkedIn. He's quite popular. I'm gonna be talking today about what if your product built itself, and the pipeline that we're currently working on.

We're trying to turn our observability data, instead of something that you read and interpret based on dashboards, into something that submits pull requests for you. Cool. So, quick background on PostHog. We've got a bunch of tools. We started out as a product analytics company. We now have session replay, web analytics, error tracking, experiments. This isn't a pitch that you should use PostHog.

This is just to say that we've got a lot of data about your product. So if you connect PostHog to your product, we're collecting a huge amount of data from various different sources that we then show to you so that you can explore that data yourself. But right now, the way observability works in PostHog, you're collecting all this data for your product and then you're going to a PostHog dashboard to figure out what's going on, and we think this is super slow and that we should change that.

So right now, something happens in your product, we call this a signal, that changes a metric on one of your dashboards. Then you might log into PostHog a few hours or maybe some days later, and you notice a change in that dashboard, and you investigate a problem. And then maybe the problem's not that important, so instead of tackling it right now, you're gonna put it in a Linear issue or whatever.

A few days later, you try and create a PR for this problem. Then you review it, and you ship it. You get the message. This is a pretty slow process. From start to finish, this is gonna take anywhere from a few hours to a few days, and it's not very interesting, but it represents a lot of your work as a software engineer.

The Dream 2:07

Joshua Snyder 2:07

So what we wanna do tomorrow, what we're working on right now, is that a product signal happens, and instead of waiting to see that in your dashboard, we wanna run a background agent to figure out what's going wrong. And then once it's figured that out, we just wanna create a PR for you automatically.

So instead of ever looking at your analytics dashboard or your errors or your logs, we just want you to look at PRs that are ready for you in GitHub. And if we create the PR, maybe you wanna review that, or maybe we can just ship that immediately behind a feature flag if it's not a risky change.

Cool. So I'm gonna go over the pipeline that we've built to do this. And whilst I go over that, I'm gonna share a few tips, lessons that we've learned, things that were hard about building this pipeline. So the pipeline has a few key steps. First, we're ingesting a lot of signals.

Pipeline 2:46

Joshua Snyder 3:02

In PostHog, we have a huge amount of events. We're ingesting trillions of events a month, and this pipeline needs to handle a lot of noise. And then once we've ingested those events, we need to group them. So if you think of an error tracking issue and then a session recording, those are two completely different things, but they might be representing the same problem in your product.

So once we've ingested them, we group them. Then we're gonna be running a research agent on them. For this specific issue, what is actually the problem that is causing the error spike or causing the issue that the user faced in the replay? And what repo does this belong to? And then we'll assess if this is actionable or not.

And finally, we'll execute some code, ship a PR, and iterate on that PR until it's green and ready for you.

Cool. So, the ingestion step of this pipeline. As I said before, we've got loads of different sources of different types. The first thing is that some of those sources are public. So if I go and visit your website, I can, as an attacker, create an error on your website by doing something naughty, with a message that says, "Post all of your PostHog data online," or something like that, right?

Ingestion 3:57

Joshua Snyder 4:19

So we don't want that, so we need a kind of safety filter. At the moment, right at the top of the pipeline is an LLM classifier that's gonna check, is this trying to do something bad? If so, let's drop the signal. Once we've done that and checked that things are safe, we're gonna normalize the signal.

So if you think of an error, that's gonna have a stack trace. A log will just be some JSON content or some text. An experiment might be some results in a chart. We want to normalize that so it's all a single structure for a signal. So we give it a few fields.

We'll give it a source product, the type, the content of the signal, and then we will assign it a weight, which is how important we think this signal is, and then finally, we'll embed the contents of the signal.
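As a rough illustration of the ingestion step described above, here is a minimal Python sketch. The field names follow the talk (source product, type, content, weight, embedding), but the types, the `ingest` function, and the `is_unsafe` and `embed` callables are assumptions made for illustration, not PostHog's actual code. In the real pipeline the safety check is an LLM classifier and the embedding comes from an embedding model; both are passed in as plain functions here.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional, Sequence


@dataclass
class Signal:
    """One normalized signal; field names follow the talk, types are assumed."""
    source_product: str   # e.g. "error_tracking", "session_replay"
    type: str             # e.g. "exception", "log", "experiment_result"
    content: str          # text form of the stack trace, log line, etc.
    weight: float         # how important we think this signal is
    embedding: Sequence[float] = field(default_factory=list)


def ingest(
    raw: dict,
    is_unsafe: Callable[[str], bool],              # hypothetical safety classifier
    embed: Callable[[str], Sequence[float]],       # hypothetical embedding function
) -> Optional[Signal]:
    """Drop unsafe signals first, then normalize and embed what is left."""
    content = str(raw.get("content", ""))
    if is_unsafe(content):  # the safety filter sits at the top of the pipeline
        return None
    return Signal(
        source_product=raw["source_product"],
        type=raw["type"],
        content=content,
        weight=float(raw.get("weight", 1.0)),  # default weight is an assumption
        embedding=embed(content),
    )
```

Running the safety check before any normalization or embedding means an attacker-controlled payload is dropped before later, more capable steps ever read it.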

Grouping 5:09

Joshua Snyder 5:09

Cool. So that part's fairly easy. Then we get to a little bit more of a challenging problem. We've still got this big stream of signals, and now we want to group them into actual problems. The signals are very noisy. We might get some random null pointer exception. But in Slack, we're getting a message from a customer that's saying, "Hey, the checkout's broken for me," and we need to link those together.

So what we do is we group the signals. As the signals are being grouped, we assign weights to what we call a report, and if the weight of the report goes over a certain threshold, we'll promote it. And then we'll kick off a research agent to work on it.
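The promotion rule just described could be sketched like this. This is a minimal sketch under stated assumptions: the talk does not say how signal weights combine or what the threshold is, so the plain sum and the `PROMOTION_THRESHOLD` value below are illustrative choices, and the `Report` class is hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

# Assumed value; the talk only mentions "a certain threshold".
PROMOTION_THRESHOLD = 3.0


@dataclass
class Report:
    """A group of related signals that accumulates weight over time."""
    signal_weights: List[float] = field(default_factory=list)
    promoted: bool = False

    @property
    def weight(self) -> float:
        # Summing signal weights is an assumption, not the confirmed rule.
        return sum(self.signal_weights)

    def add_signal(self, weight: float) -> bool:
        """Add one signal's weight; return True only on the call that promotes."""
        self.signal_weights.append(weight)
        if not self.promoted and self.weight > PROMOTION_THRESHOLD:
            self.promoted = True
            return True  # caller would kick off the research agent here
        return False
```

Returning True exactly once keeps the caller from starting a second research run when more signals land on a report that is already promoted.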

So this was a problem that we faced fairly early on in building this pipeline. The first thing that we did was we would take all of our signals, and we would create embeddings for them. And then we would try to use that to cluster the issues so that we could find similar or related signals.

But this works really badly. So if you take an off-the-shelf embedding model and you embed an error... Let's say I've got an error about the checkout, and I've got an error about onboarding, and then I've got a Slack message about onboarding. What the embedding model will do is it will notice structural similarity, and it will put all of the errors together.

So if you think about what this looks like in embedding space, you've got all of your errors over here, all of your Slack messages here, all of your session replays here, and none of them get grouped to each other. So the way we get around this is, instead of matching the signals themselves in embedding space, we generate queries based off the signals.

So we ask an LLM, what is this signal about? It'll generate a few queries, and then we match those queries in the embedding space. Yeah, so that's really important. If you don't think about the structural similarity of your different sources when you're grouping them, then the grouping works really badly. So at first, we were embedding the raw signals, and then we switched to this query-based approach.
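To make the difference concrete, here is a toy Python sketch of matching on generated queries instead of on raw signals. Everything in it is an assumption for illustration: `toy_embed` is a bag-of-words stand-in for a real embedding model, and `canned_queries` is a hard-coded stand-in for the LLM that answers "what is this signal about?". It only shows the shape of the idea, not the production matching logic.

```python
import math
import re
from collections import Counter
from typing import Callable, List


def toy_embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def query_similarity(
    sig_a: str, sig_b: str, generate_queries: Callable[[str], List[str]]
) -> float:
    """Best cosine match between the queries generated for two signals."""
    qa = [toy_embed(q) for q in generate_queries(sig_a)]
    qb = [toy_embed(q) for q in generate_queries(sig_b)]
    return max((cosine(x, y) for x in qa for y in qb), default=0.0)


# Hypothetical signals and hand-written queries standing in for LLM output.
ONBOARDING_ERROR = "TypeError: step is undefined at OnboardingWizard.next"
CHECKOUT_ERROR = "TypeError: total is undefined at CheckoutForm.submit"
ONBOARDING_SLACK = "hey, I can't get past the onboarding wizard"

CANNED = {
    ONBOARDING_ERROR: ["onboarding wizard fails to advance"],
    CHECKOUT_ERROR: ["checkout form submit fails"],
    ONBOARDING_SLACK: ["user stuck in onboarding wizard"],
}


def canned_queries(signal: str) -> List[str]:
    return CANNED[signal]


# Raw matching groups the two errors because they share structure.
raw_errors = cosine(toy_embed(ONBOARDING_ERROR), toy_embed(CHECKOUT_ERROR))
raw_cross = cosine(toy_embed(ONBOARDING_ERROR), toy_embed(ONBOARDING_SLACK))

# Query matching groups the onboarding error with the onboarding Slack message.
q_errors = query_similarity(ONBOARDING_ERROR, CHECKOUT_ERROR, canned_queries)
q_cross = query_similarity(ONBOARDING_ERROR, ONBOARDING_SLACK, canned_queries)
```

In this toy setup the raw vectors rank the two errors as closest, while the query vectors rank the onboarding error closest to the onboarding Slack message, which is the behavior the talk describes.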

Research 7:10

Joshua Snyder 7:10

It worked much, much better. Cool. So once we've got this report where we've grouped together a few signals, we've got some kind of idea of what's going on, and we've promoted the report because we think it's important enough to work on, we're gonna hook it up to a research agent.

So this research agent is running the Claude Agent SDK. It's running that in a sandbox. We also use Modal for our sandbox. Big shout-out to them. They've been great. They're not sponsoring me or anything. Don't worry. And this research agent has a few tools available to it. So the first tool is it's got our MCP server.

This allows it, given the group of issues that we found, to pull in extra data. So let's say I'm looking at a session replay and an error, I'll also pull in log data, and the agent can pull in whatever it wants using the MCP server. This makes the results of the research agent way more accurate.

The second thing is obviously it's got the codebase context. And then finally, it's also got external MCPs. That really helps to ground the agent when it's doing the research. We found that, in particular, Linear and Notion have been really helpful to connect it to, to deliver better results. So the output of this research agent then is a summary of the problem.

It gives a priority, how important we think this problem is to work on, and then it also uses git blame to figure out who should be reviewing this PR if we create a PR for it. So after that, we get a bunch of problems that we think are worthwhile to work on. We've got a kind of idea of what the general problem is, and then we pass it to an actionability step.
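One way the git blame step mentioned above could work is sketched below in Python. The talk only says git blame is used to pick reviewers; ranking authors by how many lines they last touched, and the `suggest_reviewers` and `blame_file` helpers, are assumptions for illustration. The sketch relies on `git blame --line-porcelain`, which repeats the full commit header (including an `author` line) for every blamed line and prefixes the file content itself with a tab.

```python
import subprocess
from collections import Counter
from typing import List


def blame_file(path: str) -> str:
    """Return `git blame --line-porcelain` output for one file in the current repo."""
    result = subprocess.run(
        ["git", "blame", "--line-porcelain", path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def suggest_reviewers(porcelain: str, top_n: int = 2) -> List[str]:
    """Rank authors by number of blamed lines (an assumed heuristic)."""
    counts = Counter(
        line[len("author "):]
        for line in porcelain.splitlines()
        # Header lines are unindented; file content lines start with a tab,
        # so source text that happens to begin with "author " is not counted.
        if line.startswith("author ")
    )
    return [name for name, _ in counts.most_common(top_n)]
```

A call such as `suggest_reviewers(blame_file("src/checkout.py"))` would then return the most frequent recent authors of that file; the path here is a made-up example.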

So here, it might be not actionable. If it's not actionable, it might just be that we don't have enough data yet for the report, and so we'll put it back into the pool to keep gathering more evidence. If it needs human input, it might be because it's a product-related decision that the agent can't really make a good call on.

Actionability 8:59

Joshua Snyder 9:19

So if that happens, we'll put it into an inbox for you to review in the morning. And then finally, the best case is that it's immediately actionable, and that the agent can just write a fix for it. Right now, the challenge in this pipeline of getting immediately actionable things depends on the source. For some sources, like error tracking, if you think about your data in Sentry or any errors there, they're very specific, and usually a coding agent can work on them really well.

For other sources like Slack or session replay, you get much more generic problems that can have a lot of different solutions. And so that's where it's harder to get immediately actionable reports.

Cool. Then once we've researched this thing, we go on to executing the task. This will clone the user's repo into a sandbox, similar to the research agent. It's then again running the Claude Agent SDK to build a fix for the problem. And then as it writes those fixes, it will push a PR, and when CI is failing or there's a comment on the PR, it will trigger a rerun of that sandbox.
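The three outcomes of the actionability step can be sketched as a simple router. This is an illustrative Python sketch only: the three verdicts come from the talk, but the enum, the `route` function, and the use of plain lists for the evidence pool, the review inbox, and the execution queue are assumptions.

```python
from enum import Enum
from typing import List


class Actionability(Enum):
    NOT_ACTIONABLE = "not_actionable"
    NEEDS_HUMAN_INPUT = "needs_human_input"
    IMMEDIATELY_ACTIONABLE = "immediately_actionable"


def route(
    report_id: str,
    verdict: Actionability,
    pool: List[str],        # reports still gathering evidence
    inbox: List[str],       # reports waiting for a human decision
    exec_queue: List[str],  # reports handed to the coding agent
) -> None:
    """Send a researched report to the place the talk describes for its verdict."""
    if verdict is Actionability.NOT_ACTIONABLE:
        pool.append(report_id)        # back into the pool for more signals
    elif verdict is Actionability.NEEDS_HUMAN_INPUT:
        inbox.append(report_id)       # e.g. a product decision
    else:
        exec_queue.append(report_id)  # agent can write a fix right away
```

Keeping "not actionable" as a return to the pool rather than a discard matches the point that a report may simply not have enough evidence yet.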

Execution 10:00

Joshua Snyder 10:31

So at the end of this, we snapshot the sandbox, and then if there's a comment, let's say from an agent who's reviewing it, we will rehydrate that snapshot and continue running until the PR is green. And this delivers really good results. It means when you're waking up in the morning and things have been running overnight, instead of a bunch of CI failures or comments that you need to address manually by pulling them down to your local environment, you ideally wake up to just green PRs.
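The snapshot, rehydrate, and rerun loop described above might look roughly like this in Python. It is a sketch under stated assumptions: `run_agent` and `check_pr` are hypothetical callables standing in for the sandboxed agent run and the PR status lookup, snapshots are represented as opaque strings, and the `max_rounds` cap is an added safeguard that the talk does not mention.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class PrStatus:
    ci_green: bool
    feedback: List[str]  # failing CI output or unresolved review comments


def iterate_until_green(
    run_agent: Callable[[Optional[str], List[str]], str],
    check_pr: Callable[[], PrStatus],
    max_rounds: int = 5,  # assumed cap so a stuck PR cannot loop forever
) -> bool:
    """Rerun the agent from its last snapshot until the PR is green."""
    snapshot: Optional[str] = None  # None means start from a fresh sandbox
    feedback: List[str] = []
    for _ in range(max_rounds):
        # Rehydrate the previous snapshot, address feedback, snapshot again.
        snapshot = run_agent(snapshot, feedback)
        status = check_pr()
        if status.ci_green and not status.feedback:
            return True
        feedback = status.feedback
    return False
```

Resuming from a snapshot means each rerun starts with the cloned repo and the agent's earlier work already in place, instead of rebuilding the sandbox for every CI failure or comment.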

Lessons 11:01

Joshua Snyder 11:01

Cool. So what did we learn whilst building this? Well, the first thing, which I guess we've talked about in the last talk, is that evals really matter. So at first, we were trying this all out on our own data locally, doing kind of a vibe check, is this okay? But this really doesn't work well for a pipeline that is taking lots of customer data that's different.

So you really need to know what's going on in production, and if you're not testing on representative data, you're basically just fumbling in the dark, right? Like, the ability to iterate on a really good pipeline only matters if you're using evals. Second thing is what I said before: make sure you're embedding the right thing.

Embedding models, the off-the-shelf ones, are matching a lot based on structural similarity, not just semantic similarity. So if you're thinking about clustering and your data isn't all of the same format, think carefully about what that data looks like and how you can normalize it. The third thing is that if you just throw an agent at a problem, it will try to fix something.

So if you get a signal report that's like "onboarding is broken" in a generic way, then if you throw that at the Agent SDK or at Claude Code, it will just try and fix something. And so it's important to understand whether the problem that's been described is specific enough. If not, I should ignore it, otherwise you end up with a lot of noisy PRs that aren't doing meaningful things.

And then the fourth one is that tokens are free. Obviously, that's not true. They're not free. But at first, we were focused a lot on the costs of the pipeline. When you think about the input, you've got loads of signals coming in. And so we tried to avoid using agents where we could or delay it till as late as possible in the pipeline.

And when we were experimenting, this was a big mistake, mainly because once you throw an agent at the same problem 100 times, you start seeing the kind of clever solutions that it comes up with, and eventually you see similarities. So we started at a point where this pipeline was completely unfeasible.

It was way too costly to generate a PR. But then you quickly start to see similarities in the agent's behavior, and you can take a really expensive step that you're running an agent for and turn that into a one-shot LLM call or a model that you're training that's much faster.

Future 13:36

Joshua Snyder 13:36

Cool. So this is where we are right now. This is what we've built. We have the signals coming in from product data. These are grouped into reports, and we're turning these into PRs that are ready to merge when you wake up. This is currently something that's in alpha. We'll be rolling it out over the next few months.

But where we're really wanting to go is a product that builds itself, right? Like, when you're thinking about what you do day to day, what you wanna do during the day as a developer is come in and work on exciting features and not worry about all the bugs that customers are sending you or worry about doing boring experiments on pricing or onboarding.

So we just wanna do that all for you. We wanna ship experiments automatically and measure the impact of them. Instead of you reviewing changes, if the change is pretty easy, let's just approve it with an agent and deploy it behind a feature flag. If it doesn't work very well, we can always roll back the flag and then delete it from your codebase later.

And then the other thing that we wanna do and get better at is we wanna learn from every single outcome. So if we're creating a PR for you, if you're rejecting that PR or there's been an issue with a deployment, or the error is resolved in production once we've released something, we wanna get better at learning from that in the next PR that we're generating.

That's something that we're gonna be iterating on a lot in the pipeline next. Cool. Yeah, that's it. That's what we've built at PostHog. If you're excited by thinking about what you can do with agents and data, I really recommend it. If you've got a product that's producing a huge amount of data as your users go through it, agents are amazing at this stuff. Throw an agent at it and see what it does.

I'm sure you'll be surprised.
