Mike Cann
14 days ago

Resilient AI End-to-End Tests with Stagehand and Convex

I've spent years writing Playwright tests that pass on Monday and break on Wednesday because someone renamed a button. So when I sat down to add AI end-to-end testing to a Convex app, I tried something different: a natural-language browser agent, Stagehand, driving a real browser against an ephemeral Convex backend, orchestrated by Vitest. This article walks through what I built, what it cost, where it broke, and what an AI end-to-end testing setup looks like on a reactive backend.

The result: per-run isolated backends, type-safe routing, test-mode auth that bypasses OAuth, and full CI integration in GitHub Actions. Each test run costs less than a cent on the model I settled on.

Why traditional E2E tests break

AI end-to-end testing exists because selector-based E2E testing has a maintenance problem. Tests written against CSS classes, XPath, or data-testid attributes break whenever the UI is refactored, and the breakage is often unrelated to whether the feature still works for a human user. By the time a deadline approaches, many teams comment the brittle tests out and ship anyway.

The root issue is that selectors describe how the DOM is structured rather than what the user is trying to do. A button renamed from #buy-now to #purchase is functionally identical from a user's perspective, but it breaks every test that referenced the old id. The test suite isn't testing the feature; it's testing one snapshot of the implementation, which is why selector-heavy suites rot faster than the code they cover.

The testing-pyramid trap

The classic testing pyramid puts unit tests at the wide base, integration tests in the middle, and E2E tests at the narrow top. The shape exists because E2E tests are slow, flaky, and expensive to maintain, so you write as few as possible. That's reasonable advice given the cost, but it leaves the highest-value tests, the ones that actually exercise what a user does, perpetually underinvested. The uncomfortable compromise is that the tests closest to real user behaviour are the ones you end up shipping the fewest of.

What self-directed QA in a box would actually require

If I imagine the version of this I want, it's something close to a junior QA engineer who reads the feature, opens the app, tries the obvious paths, and reports back. That requires three things:

  • A browser the agent can drive
  • A way to express intent in natural language rather than selectors
  • A backend that can be reset to a known state per run

Stagehand covers the first two and Convex covers the third, which is why those are the pieces I reached for.

What Stagehand is and why it pairs well with Convex

Stagehand is a Playwright-based library from Browserbase that lets you drive a browser using natural-language instructions interpreted by an LLM, with a small, explicit API surface rather than open-ended prompting. It pairs well with Convex because Convex runs as a standalone open-source binary, which means each test run can have its own clean backend instead of fighting over shared dev data.

The combination matters because most AI-driven browser tooling assumes the backend is a black box you can't reset. With Convex, the backend itself is part of the test fixture, so you can seed exactly the state a test needs and tear it all down at the end of the run.

Act, Extract, Observe, Agent: the four APIs

Stagehand exposes four primitives:

  • act() performs a single action described in natural language, like "click the buy button."
  • extract() pulls structured data out of the page using a schema you provide.
  • observe() asks the model what's on the page or whether a condition holds.
  • agent() hands a high-level goal to an autonomous loop that plans and executes its own steps.

The first three are deterministic enough to write assertions against, because the surface area is small and the model is doing one bounded thing per call. The fourth isn't, and I'll come back to that.

Why a reactive backend matters for repeatable tests

Convex queries are reactive, so any data a test mutates is immediately visible to the frontend without manual refresh logic. For E2E tests this matters because the model never has to "wait and retry" on stale UI: the moment a test-only mutation seeds a record, the page reflects it, and the agent can proceed. Selector-based suites spend a surprising amount of code on polling, sleeping, and waiting for hydration, and most of that disappears when the backend pushes updates to the client. Convex mutations also run as transactions, so a seeding mutation applies atomically and there's never partial state between your setup step and the agent's first action.

Splitting unit and E2E tests with Vitest projects

Unit tests and E2E tests have different runtime characteristics, so I split them into two Vitest projects inside the same repo:

  • Unit tests run in milliseconds against pure functions
  • E2E tests boot a backend, launch a browser, and call an LLM

Mixing them in one config means every vitest run pays the E2E setup cost, which is the kind of small friction that quietly stops you from running tests during development.

Configuring two projects in one repo

In vitest.config.ts I define a projects array with two entries: one named unit that globs src/**/*.test.ts, and one named e2e that globs e2e/**/*.test.ts and includes the setup file that boots Convex and Stagehand. From the CLI:

bunx vitest run --project unit
bunx vitest run --project e2e

The VS Code Vitest extension picks both up and lets me run either side from the gutter, so the unit-test feedback loop stays as tight as it would be in a single-project setup.
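For reference, the two-project layout described above could be sketched roughly as follows. This is an illustration rather than the config from the repo: it assumes a Vitest version that supports `test.projects`, and the `setupFiles` wiring and the timeout value are assumptions.

```typescript
// vitest.config.ts (sketch). Vitest accepts a plain object as the default
// export; wrapping it in defineConfig from "vitest/config" adds type hints.
const config = {
  test: {
    projects: [
      {
        test: {
          name: "unit",
          include: ["src/**/*.test.ts"],
        },
      },
      {
        test: {
          name: "e2e",
          include: ["e2e/**/*.test.ts"],
          // Assumption: the lifecycle file is wired in through setupFiles.
          setupFiles: ["e2e/setupE2E.ts"],
          // Assumption: a generous timeout, since every step waits on an LLM.
          testTimeout: 120_000,
        },
      },
    ],
  },
};

export default config;
```

With a layout like this, `--project unit` never loads the E2E setup file, which is what keeps the unit loop fast.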

Where to put E2E tests in a Convex codebase

I keep E2E tests in a top-level e2e/ directory, separate from convex/ and src/. The setup file lives at e2e/setupE2E.ts and is responsible for the lifecycle: download the Convex binary if needed, start it on a free port, deploy functions, launch Stagehand, and tear everything down after the suite. Keeping all of that in one file means the lifecycle is visible in one place, which matters when something goes wrong and you need to know which step failed.
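One way to keep that lifecycle readable is to model each stage as a named step that returns its own teardown. The sketch below is a hypothetical helper, not the contents of e2e/setupE2E.ts; the names `Step` and `startAll` are made up for illustration.

```typescript
// A started resource hands back the function that stops it.
type Stop = () => Promise<void>;

interface Step {
  name: string;
  start: () => Promise<Stop>;
}

// Runs the steps in order. If one fails, the steps that already started are
// stopped in reverse order and the error names the step that broke.
async function startAll(steps: Step[]): Promise<Stop> {
  const stops: Stop[] = [];
  const stopAll: Stop = async () => {
    // pop() walks the list from the end, so teardown mirrors startup.
    while (stops.length > 0) {
      await stops.pop()!();
    }
  };
  for (const step of steps) {
    try {
      stops.push(await step.start());
    } catch (err) {
      await stopAll();
      const reason = err instanceof Error ? err.message : String(err);
      throw new Error(`E2E setup failed at step "${step.name}": ${reason}`);
    }
  }
  return stopAll;
}
```

A setup file built this way would pass steps such as "download binary", "start backend", "deploy functions", and "launch Stagehand", then call the returned function once the suite finishes.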

Spinning up an isolated Convex backend per test run

The cleanest pattern I found is to run a fresh Convex backend per test run as a child process of the test suite, using the same standalone binary you'd use for self-hosting Convex. This gives each run a clean database, no shared state with the dev deployment, and parallelism-safety if you ever want to fan tests out across multiple workers.

Why the dev deployment isn't safe to test against

If tests hit your dev deployment, two things go wrong. First, tests pollute the data you're using for manual development, so the UI you're poking at in another tab keeps changing under you. Second, concurrent test runs (in CI, on a teammate's machine, in a watch loop) clobber each other. Running against an ephemeral instance solves both problems and makes test runs reproducible because the starting state is always empty.

Downloading and booting the standalone Convex binary

The setup script downloads the Convex backend binary for the current platform on first run and caches it, so subsequent runs skip the download entirely. It then spawns the binary on an available port, pointing it at a temporary working directory so the SQLite file is throwaway. The script waits for the health endpoint to respond before continuing, since starting tests against a not-yet-listening backend produces the kind of confusing connection error that wastes an hour of debugging.
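The wait-for-health step can be sketched as a small polling loop. This is a generic illustration under stated assumptions: the probe is injected, so the actual health URL and the timeout values are left to the caller and are not taken from the original setup script.

```typescript
// Polls `probe` until it reports healthy or the deadline passes.
async function waitForHealthy(
  probe: () => Promise<boolean>,
  { timeoutMs = 30_000, intervalMs = 250 } = {},
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    // A rejected probe (for example, connection refused) means "not ready yet".
    const healthy = await probe().catch(() => false);
    if (healthy) return;
    if (Date.now() >= deadline) {
      throw new Error(`Backend not healthy after ${timeoutMs} ms`);
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

In practice the probe would be a fetch against whatever health endpoint the binary exposes, returning true on a successful response. Failing with an explicit timeout message is the point: it replaces the confusing connection error with one that says which stage never came up.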

Deploying functions to the ephemeral instance

Once the binary is up, the script runs bunx convex deploy against the local URL with the ephemeral admin key, which pushes the schema and all functions in convex/ to the new instance. The frontend, served by Vite, is configured via VITE_CONVEX_URL to point at the same local backend, so the app the agent drives is the real app, talking to a real Convex instance, just one that didn't exist five seconds ago.
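A rough sketch of how the deploy command and the frontend environment might be derived from one description of the ephemeral backend. The `--url` and `--admin-key` flags are assumed from the Convex CLI's self-hosting options (worth confirming with `convex deploy --help`), and `EphemeralBackend` is a hypothetical type.

```typescript
interface EphemeralBackend {
  url: string; // local address of the backend started for this run
  adminKey: string; // admin key generated for this run only
}

// Command line for pushing the schema and functions in convex/ to the instance.
function deployCommand(backend: EphemeralBackend): string[] {
  return [
    "bunx", "convex", "deploy",
    "--url", backend.url,
    "--admin-key", backend.adminKey,
  ];
}

// Environment for the Vite dev server so the app talks to the same backend.
function frontendEnv(backend: EphemeralBackend): Record<string, string> {
  return { VITE_CONVEX_URL: backend.url };
}
```

Deriving both from the same object keeps the deploy target and the frontend from drifting onto different ports.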

Writing the first test with a public-user smoke flow

The first test I wrote was a smoke test for the public ticket-purchase flow: open the homepage, navigate to the tickets page, click buy, and confirm the success state. Stagehand drove the whole thing in about a dozen lines, which is the first moment I started believing this approach could replace meaningful chunks of my old Playwright suite.

Type-safe routing with a goTo wrapper

The app uses type-route for routing, so I wrote a small goTo helper that takes a typed route object and navigates the Stagehand page to its URL. That keeps tests refactor-safe, because if a route's parameters change, the test fails at compile time instead of at runtime with a 404. The test reads almost like a sentence:

await goTo(page, routes.tickets({ tourId: seededTourId }));

The route function is the one declared in the app, so the test and the app share a single source of truth for URLs. Renaming a route in the app produces a TypeScript error in the test, not a flaky failure three weeks later.
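A helper along these lines could be as small as the sketch below. It is not the helper from the repo: `RouteLike` and `PageLike` are stand-ins for the type-route route object (which exposes `href`) and the Stagehand page (which exposes Playwright's `goto`), and `APP_ORIGIN` is an assumed dev-server address.

```typescript
// Minimal stand-ins for the real types.
interface RouteLike {
  href: string; // app-relative URL, e.g. "/tickets/abc"
}
interface PageLike {
  goto(url: string): Promise<unknown>;
}

// Assumption: the origin the Vite dev server listens on during the run.
const APP_ORIGIN = "http://localhost:5173";

// Navigates the page to a typed route by resolving its href against the origin.
async function goTo(page: PageLike, route: RouteLike, origin: string = APP_ORIGIN): Promise<void> {
  await page.goto(origin + route.href);
}
```

The compile-time safety comes from the route builder, not from the helper: `routes.tickets({ tourId })` stops type-checking when the route's parameters change, and the helper only turns the resulting object into a URL.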

Using act and observe to assert behaviour

For the buy flow I called act("click the buy ticket button") and then observe("is there a success confirmation visible on the page?"). The observe call returns a structured boolean that I assert on. No selectors, no waits, no flakiness when a designer changes the button colour, and the test reads like a description of what the user is doing rather than a description of the DOM.

I stopped thinking about elements and started thinking about intents, which is the level the test should have been at the whole time.

Watching Playwright video recordings

Stagehand inherits Playwright's video recording, so I enabled it in the launch config and dropped the videos into test-results/. When a test fails, I watch the recording and see exactly what the agent saw, which is more useful than any stack trace I've ever had. A 12-second clip of the agent clicking the wrong tab tells me more in one viewing than half an hour of log archaeology used to.

Seeding data and extracting structured results with Zod

Most useful tests need preconditions: a tour to buy tickets for, votes already cast, a user with a specific role. I expose those preconditions through test-only Convex mutations, then use Stagehand's extract() with a Zod schema to assert on whatever the UI ends up showing.

Test-only mutations gated by an environment variable

I add functions like _testSeedTour to convex/_testing.ts and guard them with a runtime check on an IS_TEST environment variable set only on the ephemeral backend. If the variable is missing, the mutation throws. This pattern keeps the dangerous-by-default helpers from running in production, and the custom functions helpers make the guard a one-line wrapper that I apply uniformly to every test helper.

The naming convention (_test prefix) is a second line of defence. Even if someone managed to call a test mutation from the frontend in production, the function name is loud enough that it would surface in code review. The point isn't to make this impossible, just to make it implausible.
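The guard itself can be sketched in plain TypeScript. This is an illustration of the pattern, not the wrapper from the repo: the environment is passed in explicitly so the example runs anywhere, the expected value "true" is an assumption, and `testOnly` is a hypothetical name standing in for a custom-function wrapper.

```typescript
type Env = Record<string, string | undefined>;

// Throws unless the backend was started with the test flag.
function assertTestEnv(env: Env): void {
  if (env.IS_TEST !== "true") {
    throw new Error("Test-only function called outside a test backend");
  }
}

// Wraps a handler so the guard runs before the body of any test helper.
function testOnly<Args, Result>(
  env: Env,
  handler: (args: Args) => Result,
): (args: Args) => Result {
  return (args) => {
    assertTestEnv(env);
    return handler(args);
  };
}
```

Inside a Convex function the `env` argument would be `process.env`. Checking at call time rather than at module load means a misconfigured deployment fails loudly on the first call instead of silently exposing the helper.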

Asking Stagehand for a typed result set

For a test that needs to verify the leaderboard, I define a Zod schema for the row shape and call:

const result = await page.extract({
  instruction: "extract the top five entries from the leaderboard",
  schema: z.object({
    entries: z.array(z.object({ name: z.string(), votes: z.number() })),
  }),
});

The result is typed, and I can run normal assertions on the array. No DOM traversal, no fragile text-content matching, no regex over innerText. The schema is the contract: if the page doesn't contain extractable data shaped like that, the call fails with a useful error rather than returning a half-parsed mess.
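As an example of the kind of ordinary assertion that becomes possible once the data is typed, the sketch below checks that the extracted rows are ordered by votes. The `Entry` shape mirrors the schema above; the helper name is hypothetical.

```typescript
interface Entry {
  name: string;
  votes: number;
}

// True when each entry has at least as many votes as the one after it.
function isRankedByVotes(entries: Entry[]): boolean {
  return entries.every((entry, i) => i === 0 || entries[i - 1].votes >= entry.votes);
}
```

A test would typically combine a check like this with a length check on `result.entries` and a comparison against the seeded vote counts.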

What's interesting about extract() is that it does roughly the same job as a custom DOM scraper, but I get to specify the output shape instead of the input shape, which is the whole point. The test says what it needs and the model figures out how to find it, which is exactly backwards from the selector-first approach and exactly right.

Going fully autonomous with the agent API

The agent API hands a single high-level goal to Stagehand and lets it plan its own steps, which is the part of this stack that feels like a genuine shift in what's possible. Instead of scripting "click X, then click Y, then verify Z," you tell the agent "vote for the boat with the most lights and confirm your vote was recorded," and it figures out the path.

From step-by-step to high-level goals

I rewrote a multi-step voting test as a single agent() call. The scripted version was 40 lines. The agent version was three. When it worked, it worked exactly the way a human tester would have, including handling a mid-flow confirmation dialog I hadn't anticipated. That last part is the surprising one: the scripted version would have failed on the dialog because I hadn't written the click for it, whereas the agent just dismissed it and kept going.

What watching the agent struggle reveals about UX

The interesting failure mode wasn't the agent breaking, it was the agent taking too long to find the obvious next step, which usually meant the UI was hiding it. In one run the agent couldn't figure out how to enter the competition because the CTA was buried below the fold and styled like body text. That's a UX bug, and the agent surfaced it before any user did.

I started watching agent traces the way I'd watch a user-testing session, looking for the moments of hesitation. Each pause was a small piece of evidence that something on the page was harder to find than it should have been. None of my selector-based tests had ever produced that signal, because selectors short-circuit exactly the discovery process that the signal lives in.

The non-determinism trade-off

The same agent call can take three different paths across three runs, and occasionally one of those paths fails for reasons that are hard to attribute. I haven't fully separated model issues from prompt issues from genuine UI ambiguity. For now I run agent-mode tests with retries and treat their failures as a signal to investigate rather than a hard build-break.
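The retry policy can be sketched as a generic wrapper. This is one possible shape, not necessarily how the suite does it; Vitest also has a built-in `retry` test option that covers the same ground.

```typescript
// Runs `attempt` up to `maxAttempts` times and resolves with the first success.
// If every attempt fails, the last error is rethrown.
async function withRetries<T>(
  attempt: (attemptNumber: number) => Promise<T>,
  maxAttempts = 3,
): Promise<T> {
  let lastError: unknown;
  for (let n = 1; n <= maxAttempts; n++) {
    try {
      return await attempt(n);
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}
```

Passing the attempt number to the callback makes it easy to log which attempt passed, and a test that only passes on its third try is itself worth investigating.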

The honest framing is that agent-mode tests are a different kind of test, closer to monitoring than to assertions. A scripted test fails when the code is broken, but an agent test failing means something is off, which could be the code, the model, or the page, and the value is in the investigation rather than in the red dot itself.

Picking a model for cost versus reliability

Model choice is the biggest variable in this whole setup, both for reliability and for per-run cost. I tried several before settling on GPT-5 mini as the default, and the difference between models was larger than I expected going in.

What Gemini Flash, GPT-5 mini, and others cost per run

Gemini Flash is the cheapest option I tried and worked for simple act() calls, but it stumbled on multi-step flows and on extract() with non-trivial schemas. GPT-5 mini was reliable across the suite and came in at less than a cent per test run for the test sizes I was using. A frontier model would have been more reliable still but multiplied the per-run cost by an order of magnitude, which adds up fast in CI when every PR triggers a run.

If you're prototyping and want cheapest-possible, start with a small model. If you're running this in CI on every PR, the small-but-not-tiny tier is the sweet spot. If your tests must never flake and budget isn't the constraint, jump to the frontier tier and revisit the bill at the end of the month.

Per-step model overrides

Stagehand lets you override the model per call, so the long-term optimization is to use a cheap model for simple act() calls and reserve a stronger model for agent() and complex extract(). I haven't tuned this fully, but the API supports it and it's the obvious next lever. The mental model is the same as picking instance types in a cloud setup: don't pay for capability you don't need on the cheap parts of the workload.
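The routing decision could be as simple as the sketch below. The model identifiers are placeholders, not real model names, and how the chosen value is handed to Stagehand depends on the library version, so that part is left out.

```typescript
type CallKind = "act" | "observe" | "extract" | "agent";

// Hypothetical identifiers; substitute whatever your provider exposes.
const CHEAP_MODEL = "provider/cheap-model";
const STRONG_MODEL = "provider/strong-model";

// Picks the stronger model only where the cheap one tends to stumble.
function modelFor(kind: CallKind, opts: { complexSchema?: boolean } = {}): string {
  if (kind === "agent") return STRONG_MODEL;
  if (kind === "extract" && opts.complexSchema) return STRONG_MODEL;
  return CHEAP_MODEL;
}
```

Keeping the policy in one function means a change of mind about which calls deserve the expensive model is a one-line edit rather than a sweep through every test.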

Test-mode authentication with Convex Auth

For tests that need a signed-in user I added a test-only credentials provider to Convex Auth that accepts a username and returns a session, bypassing the production Google OAuth flow entirely. The provider is registered only when an IS_TEST_MODE environment variable is set on the backend, so it can't be reached on a production deployment.

Bypassing Google OAuth for tests

Driving a real OAuth flow from an agent is possible but brittle and slow, since Google actively detects and blocks automated sign-ins. A test-only provider sidesteps the problem by issuing a session for a known test user directly, which is faster, more reliable, and doesn't depend on Google's bot-detection mood that day.

A test-only credentials provider

In convex/auth.ts, when IS_TEST_MODE is true, I register a credentials provider that takes { username } and looks up or creates the corresponding user. The frontend, also gated on VITE_IS_TEST_MODE, renders a small "sign in as test user" form that the agent can interact with. This doesn't exactly mirror the production OAuth code path, which is the honest trade-off. I accept that gap because the alternative (driving real OAuth in CI) would cost more in flakiness than the coverage gap costs in confidence.

The provider isn't registered without the env var, and the UI isn't rendered without it, so two independent gates mean a single misconfiguration can't expose the test path in production.
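The two gates can be modelled as two independent checks. The sketch below uses plain strings in place of real Convex Auth provider objects, and the expected value "true" for both flags is an assumption.

```typescript
type EnvVars = Record<string, string | undefined>;

// Backend gate (convex/auth.ts): which providers get registered.
function enabledProviders(backendEnv: EnvVars): string[] {
  const providers = ["google"];
  if (backendEnv.IS_TEST_MODE === "true") providers.push("test-credentials");
  return providers;
}

// Frontend gate: whether the "sign in as test user" form is rendered.
function showTestSignIn(viteEnv: EnvVars): boolean {
  return viteEnv.VITE_IS_TEST_MODE === "true";
}

// The test sign-in works through the UI only when both gates are open.
function testLoginUsable(backendEnv: EnvVars, viteEnv: EnvVars): boolean {
  return enabledProviders(backendEnv).includes("test-credentials") && showTestSignIn(viteEnv);
}
```

Because the flags live in different places (backend environment and frontend build), setting one of them by mistake still leaves the other gate closed.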

Running AI E2E tests in GitHub Actions

The whole setup runs in GitHub Actions with one workflow file: checkout, install Bun, cache the Playwright browser download, run the E2E project, and upload videos on failure. The agent's API key is supplied as a repository secret, and the standalone Convex binary boots inside the runner just as it does on my laptop.

- uses: actions/checkout@v4
- uses: oven-sh/setup-bun@v2
- uses: actions/cache@v4
  with:
    path: ~/.cache/ms-playwright
    key: playwright-${{ runner.os }}-${{ hashFiles('bun.lockb') }}
- run: bun install
- run: bunx playwright install chromium
- run: bunx vitest run --project e2e
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
- uses: actions/upload-artifact@v4
  if: failure()
  with:
    name: e2e-videos
    path: test-results/

Caching the Playwright browser

Playwright downloads Chromium on first install, which is slow in CI. Caching ~/.cache/ms-playwright keyed on the lockfile cuts cold-run time substantially and only invalidates when dependencies change, so the cache stays warm across most PRs and the cold-cache penalty hits only when someone touches the lockfile.

Uploading run videos as artifacts

The if: failure() upload is the single best debugging feature in this whole setup. When a test fails on CI, I download the artifact, watch the agent's session, and almost always see the cause within seconds. No log archaeology required, no guessing at what state the page was in when the assertion blew up. The video shows the agent hovering over the wrong element, or the page rendering an unexpected modal, or a network request silently failing, and the fix is usually obvious from there.

Limitations and honest trade-offs

This is an early-stage approach, and there are things it doesn't do well yet. I want to be specific about them rather than wave them away.

When agents get stuck

The agent occasionally fixates on a wrong interpretation of the page and burns tokens trying variations of the same failed action. I've capped step counts to bound the cost, but a confused agent still produces a test that's slow and uninformative. The fallback is to break that test back down into scripted act() calls, which is annoying but recovers the determinism I gave up by going autonomous in the first place.

There's also a subtler failure mode where the agent succeeds for the wrong reason, like clicking a different button that happens to produce a success-looking page. I've only seen this once, but it's the kind of false-pass that's worse than a false-fail because it doesn't show up until something downstream breaks. Schema-typed extract() assertions after the agent call are a partial defence: they make the test verify the actual outcome rather than just "did anything succeed-looking happen."

What this approach doesn't replace

This isn't a replacement for unit tests, contract tests on Convex functions, scheduled function validation, or hand-written Playwright tests for flows where determinism matters more than expressive intent. The right shape, in my experience, is to use scripted Playwright (or scripted Stagehand act()) for the critical-path flows you must never let regress, and use agent() for exploratory coverage and UX-smell detection. I expect this balance will shift as models improve, but it's where I'd start today.

The other thing it doesn't replace is taste. Watching an agent run through your app is a useful design review tool because it exposes the parts that are hard to navigate, but it can't tell you whether the feature is the right feature or whether the copy lands. Those are still human judgements, just now informed by a slightly weirder set of data points than before.

Frequently asked questions

Q: What is AI end-to-end testing? A: AI end-to-end testing means driving a real browser against your app using natural-language instructions interpreted by an LLM, rather than CSS or XPath selectors. A library like Stagehand exposes primitives (act, extract, observe, agent) that translate intent into browser actions, so tests describe what a user does instead of how the DOM is structured.

Q: How much does it cost to run AI E2E tests? A: On the test suite described above, using GPT-5 mini, a single test run cost less than a cent. Costs scale with the number of model calls per test, the model tier, and how much page content the model has to reason over, so an agent()-heavy suite on a frontier model can be substantially more expensive.

Q: Can AI E2E tests replace traditional Playwright tests entirely? A: Not yet. Scripted tests are still more deterministic and cheaper to run, so they remain the right choice for critical paths you can't afford to have flake. AI-driven tests complement them by covering flows that change often and by surfacing UX problems through agent behaviour.

Q: How do you handle login in AI E2E tests? A: Add a test-only credentials provider to Convex Auth that's only registered when a test-mode environment flag is set, then have the agent sign in through a simple test-mode form instead of the production OAuth flow. This avoids the brittleness of driving real OAuth in automation.

Q: Does this work in CI? A: Yes. A standard GitHub Actions workflow can install Bun, cache the Playwright browser, boot the standalone Convex backend, run the Vitest E2E project, and upload run videos as artifacts on failure.

Q: Is Convex open source? A: Yes, the Convex backend is available as a standalone binary and as a container, which is what makes it practical to spin up an ephemeral, fully isolated backend per test run.

Where this approach goes next

AI end-to-end testing on a Convex stack is workable today for exploratory coverage, smoke tests, and UX-smell detection, and the per-run cost is low enough that I'd run it on every PR. The honest gap is non-determinism at the agent() layer, which I expect to narrow as models and the Stagehand API mature.

If you want to try this on your own project, the working example, including the Vitest setup, ephemeral backend boot, test-mode auth, and the GitHub Actions workflow, is on GitHub at mikecann/port-geo-christmas-lights-cruise at the video-release tag, and a fresh project can be bootstrapped from the Convex quickstart.

If you're using Convex's Agent component for AI workflows in your app, the same ephemeral backend pattern lets you test those flows end-to-end without touching production threads or message history. If you've built something similar or want Convex to invest in a first-class E2E testing story, I'd like to hear about it.
