Why MCP and ChatGPT Apps Use Double Iframes — Frédéric Barthelet, Alpic — AI Engineer

Intro 0:00

Frédéric Barthelet 0:15

Hi, everyone. My name is Fred. I'm the CTO and co-founder of Alpic, the MCP hosting company. Today I would like to share an adventure with you: a deep dive into the double iframe mechanism used by ChatGPT and MCP apps, and why it matters when we build apps. First things first: if you haven't had the chance to listen to Ido and Liad's talks just before about MCP apps, here is a quick summary of what MCP and ChatGPT apps are.

They are a new surface for your business to expose products and services, a new acquisition channel with two main characteristics. The first is discoverability. There are ecosystems of connectors and apps available in general-purpose consumer agents like ChatGPT and Claude: the ChatGPT App Store and Claude Connectors. Those apps are browsable inside the store, but they are also discoverable in chat.

MCP Apps 0:46

Frédéric Barthelet 1:13

So if you are having a conversation where an app is relevant, because it can add context and offer some nice additional actions, it will be brought into the conversation. The second part, which is the biggest and the one we will focus on in this talk, is the addition of interactive UI inside those conversational agents. Where you used to have text only, apps add a new layer of UI that can be provided by the MCP server, but could be generative UI as well.

They were first conceived with MCP-UI, which was developed by Liad and Ido, who spoke just before. They were then released by OpenAI with the Apps SDK back in October of last year, and standardized across multiple clients in the first official extension of MCP, called the App Extension. How does it work under the hood? If we take a closer look at how this UI is brought into the conversation, it is brought in using views.

Views & Iframes 1:44

Frédéric Barthelet 2:13

Views are the name we use for those small snippets of UI that appear inside the conversation. Views are always rendered as the result of a tool call. So if your server exposes multiple tools, you can add metadata on some of them to say, "The results of this tool are best displayed using a specific UI."

If the host supports MCP apps, it will use the view corresponding to this tool call to display the results. Views are simple HTML documents. You can include JS and CSS inside. Nothing new under the sun here; it's just a way to package those small snippets of application. They are discoverable ahead of time because all views are described in the tool list call that happens at the beginning of the conversation between the host and your MCP server or MCP app.

So each tool that supports UI advertises the resource needed to display that UI. The resource can be cached ahead of time, or it can be downloaded and served right when the tool call that needs the UI is made. The conversational agent on the host creates a new iframe where the view is displayed, and it injects the tool results inside so that dynamic content is rendered to the user.
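To make the "discoverable ahead of time" idea concrete, here is a minimal sketch of how a host might scan a tool list for advertised views. The `ToolDescriptor` shape, the `viewResourceUri` field, and the example URIs are assumptions made for illustration; real hosts and SDKs define their own metadata keys.

```typescript
// Hypothetical shapes for illustration only; real hosts and SDKs define
// their own metadata keys for linking a tool to its view.
interface ToolDescriptor {
  name: string;
  description: string;
  // Assumed field: URI of the HTML resource that renders this tool's result.
  viewResourceUri?: string;
}

interface ViewPlan {
  tool: string;
  resourceUri: string;
}

// Given the tool list returned at the start of a conversation, collect the
// views a host could fetch and cache before any tool is called.
function planViewPrefetch(tools: ToolDescriptor[]): ViewPlan[] {
  const plans: ViewPlan[] = [];
  for (const tool of tools) {
    if (tool.viewResourceUri !== undefined) {
      plans.push({ tool: tool.name, resourceUri: tool.viewResourceUri });
    }
  }
  return plans;
}

const exampleTools: ToolDescriptor[] = [
  {
    name: "search_flights",
    description: "Find flights and show them in a view",
    viewResourceUri: "ui://views/flights.html",
  },
  { name: "get_weather", description: "Text-only weather lookup" },
];

// Only the tool that advertises a view ends up in the prefetch plan.
const prefetch = planViewPrefetch(exampleTools);
```

Tools without a view stay text-only, which is why a host that does not support views can simply ignore the extra metadata.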

I was a bit curious. I wanted to know how ChatGPT was actually rendering third-party UI inside the conversation, so I took a closer look at what is inside the DOM of the host. I was a bit surprised: I did not expect to find a double iframe, an iframe nested inside another iframe.

Double Iframe 3:32

Frédéric Barthelet 3:59

And this gave me the idea for this talk. Today I want to take you with me on a deep dive into why the decision was made to do this kind of Inception-style nesting of iframes: what the benefits are, why it was put in place, what the implications are when you build apps, what you should pay attention to, and how to make sure that your experience is a good one.

Before we go into that, let's take a close look at what ChatGPT was before MCP apps were implemented. We'll be using ChatGPT as the example throughout this deep dive, but the exact same thing applies on Claude if you want to take a look yourself. The first thing to look at, and it is very important, is something called content security policy.

CSP Basics 4:19

Frédéric Barthelet 4:39

Those are directives returned by a server as a response header on a document request. So when you load ChatGPT in your browser, ChatGPT responds with the document plus security policy directives on what the browser should and should not be allowed to load and execute. Content security policy includes multiple directives: which scripts you can run, which CSS stylesheets you can download, which images you can download, and which APIs you can connect to.

I will not go into the details, but two are very important to remember here: frame-src, the directive that lists which sites may be rendered in an iframe inside the document, and script-src, which lists which scripts are allowed to run in the browser. To run external UI inside ChatGPT, we will use an HTML element made specifically for this purpose, the inline frame element or iframe, which spawns a nested browsing context inside your browser window.
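As a rough illustration of what such a header looks like, the sketch below serializes a small policy with the two directives that matter here. The nonce value, the framed host, and the helper names are invented for the example and are not taken from any real host's policy.

```typescript
// A policy is modelled as directive name -> list of allowed sources.
type CspPolicy = Record<string, string[]>;

// Serialize a policy into the value of a Content-Security-Policy header.
function serializeCsp(policy: CspPolicy): string {
  return Object.keys(policy)
    .map((directive) => `${directive} ${policy[directive].join(" ")}`)
    .join("; ");
}

// Hypothetical host policy: scripts must carry a per-request nonce, and
// only one controlled host may be rendered inside an iframe.
function hostPolicy(nonce: string, frameHost: string): CspPolicy {
  return {
    "default-src": ["'self'"],
    "script-src": [`'nonce-${nonce}'`],
    "frame-src": [frameHost],
  };
}

const headerValue = serializeCsp(
  hostPolicy("r4nd0m", "https://widgets.example.com")
);
// headerValue is:
// default-src 'self'; script-src 'nonce-r4nd0m'; frame-src https://widgets.example.com
```

With a policy like this, a script tag without the matching nonce is refused, and so is an iframe pointing at any host other than the listed one.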

So those small views will be rendered as almost completely isolated browsing contexts. Iframes are very convenient, and there are two ways to use them. The first is to provide a src for the iframe you want to render: the URL of another page, to be loaded by your browser, executed locally, and rendered inside the space made for it.

The second is srcdoc, another attribute, which lets you push into the iframe the content that you want rendered as is, without the browser having to load another document. So if we want to build this marketplace of apps and have third-party UI rendered inside ChatGPT, why not use srcdoc straight away as the attribute for injecting content?
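The two ways of filling an iframe can be sketched as plain markup builders. This is only an illustration: the function names and the example URL are made up, and the escaping covers just what a double-quoted attribute value needs.

```typescript
// Escape a string so it can sit inside a double-quoted HTML attribute.
// Ampersands are replaced first so the entities added afterwards survive.
function escapeAttribute(value: string): string {
  return value.replace(/&/g, "&amp;").replace(/"/g, "&quot;");
}

// Option 1: src. The browser fetches the document from a URL.
function iframeFromUrl(url: string): string {
  return `<iframe src="${escapeAttribute(url)}"></iframe>`;
}

// Option 2: srcdoc. The markup is handed over inline, with no extra request.
function iframeFromMarkup(html: string): string {
  return `<iframe srcdoc="${escapeAttribute(html)}"></iframe>`;
}

const byUrl = iframeFromUrl("https://views.example.com/weather");
const inline = iframeFromMarkup('<p class="temp">Sunny & warm</p>');
// inline is:
// <iframe srcdoc="<p class=&quot;temp&quot;>Sunny &amp; warm</p>"></iframe>
```

The difference that matters for the rest of the talk is where the framed document's origin comes from: the URL with src, the embedding document with srcdoc.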

And I'm realizing now that it's a little bit small, but I think I can zoom in a bit. No, I cannot. Okay. Sorry about that. You'll have to trust me about what is written here. So I will just put an iframe in and inject content into it, the content being the resource that is exposed by the MCP server.

Security Attempts 6:37

Frédéric Barthelet 6:57

So pure HTML loaded inside. If I do that, it's not going to work, mostly because when you create an iframe with the srcdoc attribute, the iframe shares the same origin, and therefore the same CSP, as the host that renders it. So any script that is part of your application would be completely blocked by ChatGPT's existing CSP script-src directive, which requires every script in ChatGPT to carry a specific nonce generated for each request. That is a good security feature to have, but it prevents any app from executing JS.

So what if we relax ChatGPT's content security policy a little so that it can execute any line of code? I would not suggest doing that in production; it's just a thought experiment. But if you do that, you face a new problem. You are sharing the same origin as your parent DOM, so the script loaded in the iframe would be able to access local storage or cookies, which are keyed by origin.

So as an app you would be able, for example, to read ChatGPT's existing local storage and send it to your back-end server, which, if you are OpenAI, you would not want people to be able to do. So let's roll back, put the CSP back as it was, and instead sandbox the iframe.

Sandbox is another attribute that you can use on iframes. It lets an iframe be rendered in what we call an opaque origin. That means the iframe no longer shares the parent's origin. Its origin is something equivalent to null, which makes sure that the two don't share the same origin and removes the problem of a script being able to access the parent DOM.

However, doing so, you lose any capability that depends on origin keying. Because all content and all scripts inside your iframe now point to a null origin, you cannot use local storage, you cannot use IndexedDB, and you cannot use cookies, because those are keyed by origin. And the only way to give a sandboxed iframe an origin is to add allow-same-origin, a sandbox token that brings the exact same origin as the parent back into the iframe. Since the view also needs scripts enabled, you're back to square one: an iframe with exactly the right conditions to escape its sandbox and access the parent DOM, the parent's local storage, and the parent's cookies.
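The trade-off can be summarized with a deliberately simplified model of a srcdoc iframe. The function names are invented for this sketch, and real browsers apply many more rules; only the `allow-scripts` and `allow-same-origin` tokens are considered.

```typescript
// Simplified model of an iframe whose content is injected with srcdoc.
// Without allow-same-origin the frame gets an opaque origin, serialized as
// "null"; with it, the frame keeps the origin of the embedding document.
function effectiveOrigin(embedderOrigin: string, sandboxTokens: string[]): string {
  return sandboxTokens.indexOf("allow-same-origin") !== -1 ? embedderOrigin : "null";
}

// localStorage, IndexedDB and cookies need a real (non-opaque) origin.
function canUseOriginKeyedStorage(sandboxTokens: string[]): boolean {
  return sandboxTokens.indexOf("allow-same-origin") !== -1;
}

// A script that runs with the embedder's origin can reach the parent
// document, so the sandbox no longer protects the host.
function canReachParent(sandboxTokens: string[]): boolean {
  return (
    sandboxTokens.indexOf("allow-scripts") !== -1 &&
    sandboxTokens.indexOf("allow-same-origin") !== -1
  );
}

const isolatedTokens = ["allow-scripts"];
const reopenedTokens = ["allow-scripts", "allow-same-origin"];

// Isolated: safe for the host, but the view has no storage.
const isolatedOrigin = effectiveOrigin("https://chat.example.com", isolatedTokens);
// Reopened: storage works, but so does access to the parent.
const reopenedOrigin = effectiveOrigin("https://chat.example.com", reopenedTokens);
```

In this model no token combination gives the view both working storage and isolation from the host, which is the dead end described above.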

Okay. So iframe srcdoc is not the way forward. Let's move on to the next best solution that we have, the src attribute. The src attribute lets me reference an endpoint whose content will be loaded by the browser inside the iframe. Say I'm the developer of a ChatGPT app, an MCP app.

Why not expose my view, my small HTML application, as a normal endpoint, for example on a view endpoint of my own server? That would be a nice way to do it. However, it would require OpenAI to modify another CSP directive, the frame-src directive, which lists all domains that are allowed to render an iframe on ChatGPT, to include an unbounded list of all the MCP applications that will be developed by various companies and brought into the store.

So every time a new app came out, ChatGPT would have to update its CSP to include the new domain so that a frame served from that domain could be rendered. This is not doable at scale. What we can do instead is provide a kind of proxy, a single controlled domain that is owned, in this case, by ChatGPT.

For example, oaiusercontent.com. That's an actual domain that they use for user content that they want to expose on a domain of their own. You reference this domain in frame-src so that the directive does not block any iframes loaded from there. And you need to provide a server on oaiusercontent.com that is able to download the resource content, the HTML, from the MCP server and expose it so that it can be rendered, for example on any subdomain, using the first part of the subdomain as the routing key to the right application.
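A minimal sketch of that routing idea follows, assuming a made-up base domain and a made-up in-memory registry; nothing here reflects how any real host implements it.

```typescript
// Extract the routing key (the first subdomain label) from a Host header,
// or return null when the host is not a subdomain of the base domain.
function appKeyFromHost(host: string, baseDomain: string): string | null {
  const suffix = "." + baseDomain.toLowerCase();
  const hostname = host.toLowerCase().split(":")[0]; // drop an optional port
  if (hostname.length <= suffix.length) return null;
  if (hostname.slice(hostname.length - suffix.length) !== suffix) return null;
  const key = hostname.slice(0, hostname.length - suffix.length).split(".")[0];
  return /^[a-z0-9-]+$/.test(key) ? key : null;
}

// Hypothetical registry: routing key -> MCP server that owns the view.
const viewSources: Record<string, string> = {
  abc123: "https://mcp.first-app.example/mcp",
  abc456: "https://mcp.second-app.example/mcp",
};

// Resolve which MCP server the proxy should download the HTML from.
function resolveViewSource(
  host: string,
  baseDomain: string,
  registry: Record<string, string>
): string | null {
  const key = appKeyFromHost(host, baseDomain);
  if (key === null) return null;
  return Object.prototype.hasOwnProperty.call(registry, key) ? registry[key] : null;
}

const resolvedSource = resolveViewSource(
  "abc123.usercontent.example.com",
  "usercontent.example.com",
  viewSources
);
```

The host only has to allow the base domain and its subdomains in frame-src once, however many apps are added to the registry.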

Doing so, you effectively need to set up an infrastructure where your domain hosts external third-party UI from all the apps submitted to your store. That is not a very good position to be in because, once again, you are responsible for code whose behavior you don't know, and you are exposing it on your own domain.

In addition, if you're not OpenAI or Anthropic, you might not have the resources as a host to put in place the infrastructure required for this kind of dynamic serving of content. So what you can do instead is go to the double iframe mechanism. With it, you load the same script for everybody: a simple script responsible for retrieving the resource and creating an iframe with the srcdoc attribute, so we put the content inside.

Solution 11:52

Frédéric Barthelet 12:22

But this iframe will not be served at the top level, because there it would share the host's origin and have the escaping problem we mentioned before. It will be served inside an iframe with a dedicated domain that is different from ChatGPT's, to make sure the isolation stays in place. And you don't want to stop there. You actually want to put this script loader on subdomains.

It will be the exact same content served every time, but you want to put it on various subdomains so that if your app uses any APIs that are keyed by origin, like local storage or cookies, there are no collisions between apps. You want app ABC123 not to be able to access local storage from app ABC456, for example.

The infrastructure for that is much less intensive because you're serving the exact same script content, the first iframe, for every subdomain that exists on this endpoint. Last but not least, you want to be able to apply the same kind of content security policy definition to the app itself, so that it can prevent the execution of malicious scripts or the rendering of iframes directly inside the view.

There is a way to provide this in the MCP spec, and the way it is rendered is with a specific meta tag inside the first iframe. This is the actual solution that is implemented in production, and it's not a new one. It's been around for a long time. The first time it was implemented was back in the Facebook days, when they released their app marketplace, which had exactly the same problem: you have to run and render third-party UI inside the context of your own application.
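Putting the pieces together, the nesting can be sketched as string builders: an outer frame served from a per-app subdomain, and an inner srcdoc frame whose document carries a CSP meta tag. The loader path, the sandbox domain, and all function names are assumptions for this sketch; it does not describe any host's actual loader.

```typescript
// Escape a value for use inside a double-quoted HTML attribute.
function escapeForAttribute(value: string): string {
  return value.replace(/&/g, "&amp;").replace(/"/g, "&quot;");
}

// Outer frame: the same loader for every app, on one subdomain per app, so
// origin-keyed storage never collides between apps.
function outerFrameUrl(appId: string, sandboxDomain: string): string {
  return `https://${appId}.${sandboxDomain}/loader.html`;
}

// Add a CSP meta tag to the view: right after <head> when the view has
// one, otherwise at the very start of the document.
function withCspMeta(viewHtml: string, policy: string): string {
  const meta =
    `<meta http-equiv="Content-Security-Policy" content="${escapeForAttribute(policy)}">`;
  const headOpen = /<head(\s[^>]*)?>/i;
  if (headOpen.test(viewHtml)) {
    return viewHtml.replace(headOpen, (match) => match + meta);
  }
  return meta + viewHtml;
}

// Inner frame: created by the loader, with the view injected through srcdoc.
function innerFrameMarkup(viewHtml: string, policy: string): string {
  return `<iframe srcdoc="${escapeForAttribute(withCspMeta(viewHtml, policy))}"></iframe>`;
}

const outerUrl = outerFrameUrl("abc123", "sandbox.example.net");
const guardedView = withCspMeta(
  "<html><head><title>View</title></head><body>Hi</body></html>",
  "default-src 'none'; connect-src https://api.example.com"
);
```

Because the inner frame inherits the outer frame's origin rather than the host's, the view gets storage scoped to its own subdomain while staying cross-origin to the chat application.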

Dev Tips 14:01

Frédéric Barthelet 14:01

What's important for you as an app developer is to know what the spec makes available to control the behavior that results from this double iframe nesting. What you have to make sure to do, every time you build an application, is declare all the domains your application depends on inside the metadata provided by the MCP app spec, so that you are sure they will be rewritten correctly inside the nested iframe.

For example, if your app connects to an external API to fetch data, you need to reference this domain inside the connectSrc directive of the metadata. The same goes for scripts, images, and frames, although frames are not used as much. But the first two are very important. And it reminded me of a very old problem that I had when I started out as a developer, back in 2016, I think it was.
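As an illustration of how declared domains might be rewritten into a policy for the nested iframe, here is a small sketch. The `DeclaredDomains` shape and its field names are invented for the example and are not the spec's actual metadata keys, and a real host would likely add sources of its own.

```typescript
// Hypothetical metadata shape; the real spec defines its own field names.
interface DeclaredDomains {
  connect: string[];   // APIs the view calls at runtime
  resources: string[]; // hosts serving scripts and images
  frames: string[];    // hosts the view may embed in its own iframes
}

// An empty list means nothing is allowed for that directive.
function sourceList(domains: string[]): string {
  return domains.length > 0 ? domains.join(" ") : "'none'";
}

// Turn the declared domains into the policy applied to the view.
function policyFromDeclaredDomains(declared: DeclaredDomains): string {
  return [
    "default-src 'none'",
    "connect-src " + sourceList(declared.connect),
    "script-src " + sourceList(declared.resources),
    "img-src " + sourceList(declared.resources),
    "frame-src " + sourceList(declared.frames),
  ].join("; ");
}

const viewPolicy = policyFromDeclaredDomains({
  connect: ["https://api.example.com"],
  resources: ["https://cdn.example.com"],
  frames: [],
});
// viewPolicy is:
// default-src 'none'; connect-src https://api.example.com; script-src https://cdn.example.com; img-src https://cdn.example.com; frame-src 'none'
```

Anything the view touches that is not declared falls back to the restrictive default, which is why a forgotten domain shows up as a blocked request.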

As a new developer in the space, I had trouble getting cross-origin resource sharing, CORS, right. CSP reminded me of those ugly days of not getting CORS right the first time. So there has been an effort in the ecosystem to make builders' lives better, for example on OpenAI's side, where they added an option in developer mode.

If you're developing an app, they have a specific mode, developer mode, which gives you access to additional features. Until now, when you were in developer mode, all CSP rules were removed, so you only discovered when going to production that some of your servers could not be reached because of domains missing from your CSP, which was not ideal.

They are not the only ones doing a bunch of work to make builders' lives much easier. At Alpic, we build an open source framework called Skybridge. Skybridge is a superset of features on top of the official app SDK that Ido and Liad were mentioning. It brings a few things to the table, including end-to-end type safety between your MCP server and your app, widgets, and views.

You have a lot of APIs that provide polyfills for features that are not part of the common specification and are specific to some hosts, some for Claude and some for ChatGPT. And we provide a bunch of modern development features, especially for the dev environment. I wanted to show you one made specifically for CSP, which we call the CSP Inspector.

Demo 16:35

Frédéric Barthelet 16:35

Time for a small demo. I do have a few minutes left. Perfect. Let me quickly switch to my screen.

Okay. So this is an example code base that is generated when you create a new Skybridge application. It comes with a small application: a Magic 8 Ball that you can ask any question, and that responds with one of 25 predefined answers. It has an MCP tool that generates the answer and a view that is just there to display the question and the answer to the question you asked.

When I start the Skybridge server, I have access in my browser to a dev tool, a small app that gives me inspection tooling to work on my app before I bring it to ChatGPT. For example, on the left, I have the list of tools exposed by my app.

If I want, I can execute any of the tools. Here, for example, I'm executing the Magic 8 Ball tool, and if there are views associated with this tool, they are rendered inside the inspector so I can take a closer look, make changes, and see those changes reflected live in the UI.

The neat thing I wanted to demonstrate is the CSP part of the inspector that we built. It looks at all the domains that you listed in your metadata and all the domains that are actually accessed by calls made from your view, and compares them to make sure that none are missing. For example, here everything looks green, but if I go back to my code base and, say, fetch some API to get info about my IP location, this is reflected straight away in the inspector because the component has been re-rendered, and the exact domain that I just called is now listed as missing from the metadata.
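The comparison itself is simple to sketch. The code below is not Skybridge's implementation; it is a minimal illustration, with invented function names and example URLs, of diffing declared domains against the origins a view actually called.

```typescript
// Reduce a request URL to its origin (scheme + host + optional port), or
// return null for anything that is not an absolute http(s) URL.
function originOf(url: string): string | null {
  const match = /^(https?:\/\/[^\/?#]+)/i.exec(url);
  return match ? match[1].toLowerCase() : null;
}

// List every origin the view called that is absent from the declared
// domains, without duplicates and in order of first appearance.
function missingDomains(declared: string[], requestedUrls: string[]): string[] {
  const missing: string[] = [];
  for (const url of requestedUrls) {
    const origin = originOf(url);
    if (origin === null) continue;
    if (declared.indexOf(origin) === -1 && missing.indexOf(origin) === -1) {
      missing.push(origin);
    }
  }
  return missing;
}

const declaredDomains = ["https://api.example.com"];
const observedRequests = [
  "https://api.example.com/v1/answers",
  "https://geo.example.org/json",
  "https://geo.example.org/json?fields=city",
];

// Only the undeclared origin is reported, and only once.
const undeclared = missingDomains(declaredDomains, observedRequests);
```

An empty result corresponds to the all-green state in the demo; any entry is a domain to add to the metadata before submitting the app.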

I can go back into my application and add the missing domain, and now it should appear green if I reload. Yeah, everything good. So, a neat little tool. There are a lot of other features packed into Skybridge, but this is one we made sure was available to builders, because we've seen a lot of rejections from ChatGPT app store submissions because of missing CSP domains, and apps not working in production for the same reason.

Just to finish up quickly, if I can. Yeah, this one. That's all. Thank you again for your time. If you want to grab the slides and take a look later on, feel free to scan the first QR code. If you want to give Skybridge a try, feel free to scan the second one.

Wrap-up 19:17

Frédéric Barthelet 19:40

I will now run a small lottery. We have Skiggles to win. If you star the Skybridge repo in the next minute or so, I will draw a name at random, and you will win one of my Skiggles. Thank you very much for your attention.
