Debugging & Test Observability

A Playwright suite that is green on your laptop and red in CI is not a testing problem — it is an observability problem. When a run fails, the only question that matters is how fast you can answer "what did the browser actually do?" without rerunning the test, adding console.log, or guessing. This guide treats every failure as a signal that should arrive pre-diagnosed: a trace you can step through, an artifact attached to the report, and enough context to separate a real regression from environmental noise. Below, each area links to a focused guide — capturing and reading traces, taming flaky tests, and configuring reporters and artifacts — so you can build a feedback loop where a CI failure tells you the cause before you open your editor.

Observability turns a bare failure into an artifact bundle that names the root cause before you reopen the code.

Why observability beats re-running

The default debugging loop — see red, add logging, push, wait for CI, repeat — costs minutes per iteration and fails entirely for failures that do not reproduce locally. Observability inverts the loop: you instrument the suite once so that the first failure already carries its own explanation. A captured trace records every action, network call, and DOM snapshot; a screenshot freezes the page at the moment of failure; a reporter aggregates these into a browsable report your whole team can open. The investment is small and the payoff compounds, because the same instrumentation that diagnoses today's failure diagnoses every future one. This discipline pairs directly with Reliable Selector Strategies for Playwright: a flaky locator and a flaky test look identical in a log but obvious in a trace.

The rest of this guide is organized around the three foundations of a diagnosable suite. Read them in order if you are starting from scratch, or jump to the one that matches your current pain.

Trace-first debugging

A Playwright trace is a self-contained recording of a test run: a time-travel timeline of every action, before-and-after DOM snapshots, the full network log, console output, and source mapping back to the line of test code that triggered each step. When a test fails, the trace answers "what did the browser see?" with pixel-accurate fidelity, which is why trace-first debugging is the single highest-leverage habit in this entire guide. Instead of reproducing a failure, you open the recording of the failure that already happened.

The Trace Viewer & Debugging guide covers turning capture on (--trace on, or trace: 'on-first-retry' in config), where the trace.zip lands, and how to open it with npx playwright show-trace. From there it walks through reading the timeline: scrubbing through action snapshots, inspecting the network panel to see which request returned a 500, and using the DOM snapshot to confirm a locator pointed at the element you intended.

import { defineConfig } from '@playwright/test';

export default defineConfig({
  use: {
    // Record a trace only on the first retry of a failed test, keeping green runs cheap.
    trace: 'on-first-retry',
  },
});

Two focused walkthroughs live beneath that guide. One shows how to read a real failed-run trace end to end and pin the exact action that broke; the other covers the interactive tooling — Inspector, UI mode, and page.pause() — for stepping through a test live while you author it. Trace capture also intersects with Network Interception Basics: a trace's network panel is where you confirm a mocked or intercepted request actually fired the way you expected.

Flaky-test management

A flaky test passes and fails without any code change, and it is the most corrosive failure mode a suite can have, because it trains the team to ignore red. The fix is not "add a retry and move on" — it is to detect flakiness systematically, classify each instance by root cause (timing, shared state, network, animation), and either fix the underlying race or quarantine the test with a tracking ticket. Retries are a measurement tool and a safety net, not a cure.

The Flaky Test Management guide explains how to surface flaky tests from CI history, how retries and trace: 'on-first-retry' work together to record a trace as soon as a failed test is retried, and how to budget timeouts so a slow-but-correct test is not mistaken for a broken one. Most flakiness traces back to implicit waits or unstable selectors, so this area leans heavily on Reliable Selector Strategies for Playwright and on the auto-waiting assertions documented under Advanced Interactions & Test Assertions.

import { defineConfig } from '@playwright/test';

export default defineConfig({
  // Retry only in CI; a retry that turns red green is a flake to investigate, not ignore.
  retries: process.env.CI ? 2 : 0,
  use: { trace: 'on-first-retry' },
});

Beneath it, one walkthrough shows the detection-and-fix loop for a specific flaky test, and another covers tuning retries and timeouts so CI is stable without masking real defects. Flaky-test work is where observability and configuration meet: the right Playwright Config & Fixtures settings decide how much evidence each failure leaves behind.
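
The distinction between a clean pass, a flake, and a real failure can be sketched as a small pure function over a test's attempts. This is a minimal illustration, not Playwright's own implementation: the type names and status values are assumptions modeled loosely on what a reporter sees for each attempt.

```typescript
// Hypothetical per-attempt statuses (names are assumptions for illustration).
type AttemptStatus = 'passed' | 'failed' | 'timedOut';
type Verdict = 'passed' | 'flaky' | 'failed';

// Classify one test from its ordered attempts: first run, then any retries.
function classifyAttempts(attempts: AttemptStatus[]): Verdict {
  if (attempts.length === 0) {
    throw new Error('a test needs at least one attempt');
  }
  const last = attempts[attempts.length - 1];
  // The final attempt never went green: treat it as a real failure.
  if (last !== 'passed') return 'failed';
  // Green at the end, but an earlier attempt was red: a flake to triage.
  const hadEarlierFailure = attempts
    .slice(0, -1)
    .some((status) => status !== 'passed');
  return hadEarlierFailure ? 'flaky' : 'passed';
}

console.log(classifyAttempts(['failed', 'passed'])); // flaky
```

A 'flaky' verdict is the case the retry settings above exist to surface: the build stays green, but the test still belongs in triage.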

Reporters and artifacts

Reporters decide what a finished run looks like to a human. The built-in HTML reporter renders a browsable summary with per-test status, attached traces, screenshots, and video; the list and dot reporters are for terminals; JSON and JUnit reporters feed dashboards and CI integrations. Artifacts — screenshots on failure, video of the run, and the trace itself — are the evidence a reporter links to. Configured well, the report your CI publishes is the only place anyone needs to look after a failed run.

The Reporters & Test Artifacts guide covers choosing reporters per environment, wiring screenshot: 'only-on-failure' and video: 'retain-on-failure', and publishing the HTML report as a CI artifact so a failed pipeline links straight to the evidence.

import { defineConfig } from '@playwright/test';

export default defineConfig({
  // A terminal-friendly reporter plus a rich HTML report published as a CI artifact.
  reporter: [['list'], ['html', { open: 'never' }]],
  use: {
    screenshot: 'only-on-failure',
    video: 'retain-on-failure',
  },
});

Reporters are where observability connects to your pipeline. Publishing artifacts is a CI concern, so this area sits next to CI/CD Integration: the same workflow that shards your tests should upload the HTML report and traces so every failed job is one click from its evidence.
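
Choosing reporters per environment can be expressed as a small helper that the config calls. The sketch below is one possible arrangement, not a recommended default: the helper name, the tuple type, and the JUnit output path are hypothetical, while the reporter names and the open and outputFile options are the ones Playwright's built-in reporters accept.

```typescript
// Reporter entries as [name] or [name, options] tuples, the shape the
// `reporter` config option accepts. ReporterEntry is a local stand-in type.
type ReporterEntry = [string] | [string, Record<string, unknown>];

function reportersFor(isCI: boolean): ReporterEntry[] {
  if (isCI) {
    return [
      ['dot'], // compact output for CI logs
      ['junit', { outputFile: 'results/junit.xml' }], // hypothetical path for dashboards
      ['html', { open: 'never' }], // published later as a job artifact
    ];
  }
  return [
    ['list'], // readable per-test lines in a local terminal
    ['html', { open: 'on-failure' }], // open the report only when something failed
  ];
}

// In playwright.config.ts (assumed usage):
//   reporter: reportersFor(!!process.env.CI),
console.log(reportersFor(false).map((entry) => entry[0])); // [ 'list', 'html' ]
```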

Reading a failure: the diagnosis order

When a run fails, the order in which you look at evidence matters as much as having it. A disciplined diagnosis order keeps you from guessing and from fixing symptoms instead of causes. Work outward from the most specific signal to the most general:

1. The error message and the failing action. Playwright's error already names the action, the locator, and what it expected. Read it literally: "toBeVisible() failed" is a different problem from "strict mode violation: locator resolved to 3 elements". The second is a selector defect; the first could be timing, layout, or data.
2. The DOM snapshot at the moment of failure. Open the trace and scrub to the failing action. Was the element present? Present but covered? Absent entirely? This single observation collapses most of the possibility space.
3. The network panel. If the element was absent, the data that feeds it probably did not arrive in time, or arrived as an error. The waterfall shows you the request, its status, and its timing relative to the assertion.
4. The console panel. A thrown exception in app code, a failed module load, or a CORS error surfaces here and explains an otherwise mysterious blank page.
5. The source and call panels. Finally, map the failure back to the line of test code and the resolved arguments to confirm the test asked for what you intended.

Following this order turns "the test is flaky" into "the POST /api/orders resolved 400ms after the assertion's timeout, so the banner had not rendered yet." That sentence is a fix; "it's flaky" is not.
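
The arithmetic behind a sentence like that is simple enough to write down. The sketch below assumes you have read three timestamps off the trace timeline by hand; the interface and field names are hypothetical and the numbers are invented to match the 400ms example.

```typescript
// All values are milliseconds on the trace timeline. Field names are
// hypothetical; the numbers come from reading the Trace Viewer, not an API.
interface TimelineReading {
  assertionStartMs: number; // when the assertion began retrying
  expectTimeoutMs: number; // the configured expect timeout
  responseEndMs: number; // when the response the UI depends on finished
}

// How many ms the response missed the assertion deadline by.
// Zero or negative means the response arrived in time.
function lateByMs(r: TimelineReading): number {
  const deadlineMs = r.assertionStartMs + r.expectTimeoutMs;
  return r.responseEndMs - deadlineMs;
}

const reading: TimelineReading = {
  assertionStartMs: 1_200,
  expectTimeoutMs: 5_000,
  responseEndMs: 6_600,
};

console.log(lateByMs(reading)); // 6600 - (1200 + 5000) = 400
```

A positive result points at the data source, not the locator: the fix is to wait on the response or speed it up, not to rewrite the selector.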

The same order applies whether you are looking at a one-off local failure or triaging a backlog of intermittent CI failures. What changes is the source of evidence: locally you might step through with the Inspector, while in CI you read the artifacts the run left behind. Both lead to the same five questions.

Common failure signatures and where they point

Most Playwright failures fall into a handful of signatures. Recognizing the signature short-circuits the diagnosis.

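toBeVisible() / toHaveText() timeout, element absent in snapshot. A data race. The UI had not rendered the element when the assertion's timeout expired. Fix by waiting on the underlying signal, such as a waitForResponse() for the request that feeds the element or an explicit assertion on a precursor element, rather than on the wall clock. Treat a network-idle wait as a last resort; Playwright's documentation discourages it for tests.
Assertion timeout, element present in snapshot. A retry-window mismatch or a state mismatch. Auto-waiting assertions retry until timeout, so if the element was in the DOM the whole time, suspect that it was hidden (display: none, visibility: hidden, or a zero-size box), that its text differed from the expected value, or that the locator resolved to a different element than the one you are looking at. A covering overlay usually surfaces as an action timeout instead, because toBeVisible() does not check whether the element is obscured. This is selector territory, covered in Reliable Selector Strategies for Playwright.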
strict mode violation. The locator matched more than one element. Tighten it with a role, an accessible name, or getByRole scoping rather than reaching for .first(), which hides the ambiguity.
Action fails with element is not stable or not attached. The DOM was mutating during the action — common with animations and React re-renders. The fix is to target a settled element; the dynamic-content patterns under Advanced Interactions & Test Assertions address this directly.
Passes locally, fails only in CI. Almost always timing or environment: CI is slower, headless, and differently sized. The trace from on-first-retry is the only reliable way to see what CI actually rendered. This signature is the central concern of Flaky Test Management.
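Passes on retry, fails on first attempt. A textbook flake. The retry masks it, but the evidence survives: the first attempt's error and screenshot, plus the trace recorded on the retry. Note that trace: 'on-first-retry' records the retry rather than the first attempt; if you need a trace of the failing attempt itself, 'retain-on-failure' keeps one for each failed run. Never let a retry-green test pass triage silently.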

Building a shared vocabulary of these signatures across the team is itself an observability investment: a failure described as "strict mode violation in the nav" is already half-diagnosed before anyone opens the trace.
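
A shared vocabulary can also be encoded, for example to label failures in a triage script. The classifier below is a rough sketch: the signature names are invented for this example, and the substrings it matches are assumptions about typical Playwright error text whose exact wording varies between versions.

```typescript
type Signature =
  | 'strict-mode'
  | 'unstable-element'
  | 'assertion-timeout'
  | 'unknown';

// Map raw error text to one of the signatures described above.
// The matched substrings are assumptions; adjust them to your own logs.
function classifyFailure(message: string): Signature {
  const text = message.toLowerCase();
  // Check strict mode first: its message can also mention the assertion name.
  if (text.includes('strict mode violation')) return 'strict-mode';
  if (text.includes('not stable') || text.includes('not attached')) {
    return 'unstable-element';
  }
  if (text.includes('tobevisible') || text.includes('tohavetext')) {
    return 'assertion-timeout';
  }
  return 'unknown';
}

console.log(
  classifyFailure('strict mode violation: locator resolved to 3 elements'),
); // strict-mode
```

Text matching only narrows the search; whether an assertion timeout is a data race or a hidden element still has to be read from the trace snapshot.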

Observability and the test configuration

Every observability behavior in this guide is ultimately a configuration decision. Where traces, screenshots, and videos come from, how many retries you allow, what timeouts bound an action, and which reporters publish the results all live in playwright.config.ts or are derived from fixtures. Getting these defaults right once means every test inherits diagnosability for free.

import { defineConfig, devices } from '@playwright/test';

export default defineConfig({
  // Bound how long a whole test may run before it is killed.
  timeout: 30_000,
  expect: {
    // Auto-waiting assertions retry within this window before failing.
    timeout: 5_000,
  },
  // Retry in CI so flakes are surfaced (and traced), never in local dev.
  retries: process.env.CI ? 2 : 0,
  // Fail the build if someone commits test.only.
  forbidOnly: !!process.env.CI,
  reporter: [['list'], ['html', { open: 'never' }]],
  use: {
    trace: 'on-first-retry',
    screenshot: 'only-on-failure',
    video: 'retain-on-failure',
  },
  projects: [
    { name: 'chromium', use: { ...devices['Desktop Chrome'] } },
  ],
});

This single block encodes a complete observability policy: bounded timeouts so a hang fails fast, retries that surface flakes with traces attached, artifacts captured precisely when a test fails, and a reporter that publishes them. The deeper rationale for each field belongs to Playwright Config & Fixtures, and the per-area guides below explain how to tune each setting for its job.

Observability in CI

A failure you cannot see is a failure you cannot fix, and in CI the browser is invisible by default — headless, on a machine you do not have a shell into, on a run that has already finished. CI observability is therefore entirely about what the run leaves behind. Three practices make CI failures as diagnosable as local ones.

First, capture evidence on retry, not always. trace: 'on-first-retry' plus retries: 2 means a run that passes first time pays no tracing cost, while any test that fails is traced on its first retry. Pair it with screenshot: 'only-on-failure' and video: 'retain-on-failure' so failed tests carry a screenshot and a video without bloating green runs.

Second, publish the artifacts. The HTML report and the test-results directory must be uploaded as job artifacts so a developer can open them without checking out the branch. This is a pipeline concern handled in CI/CD Integration; the reporter produces the bundle, the workflow exposes it.

// In CI, the html reporter writes to playwright-report/ and traces to
// test-results/. Your workflow uploads both as artifacts. Reference only —
// the upload step lives in the pipeline YAML, not in test code.
import { defineConfig } from '@playwright/test';

export default defineConfig({
  outputDir: 'test-results',
  reporter: [['html', { outputFolder: 'playwright-report', open: 'never' }]],
  use: { trace: 'on-first-retry' },
});

Third, shard without losing evidence. Splitting the suite across parallel jobs speeds CI, but each shard produces its own artifacts that must be merged for a single report. Playwright handles this with the blob reporter: each shard writes a blob report, and the npx playwright merge-reports command combines them into one HTML report, so a 10-way shard still yields one browsable report. Sharding is covered alongside the rest of pipeline setup in CI/CD Integration.
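
As a rough sketch of the sharded setup, the config can switch to the blob reporter in CI and leave merging to a later pipeline step. The reporter names and CLI commands are Playwright's; the job layout and the all-blob-reports directory name are assumptions about your pipeline.

```typescript
import { defineConfig } from '@playwright/test';

export default defineConfig({
  // Each CI shard writes a blob report; local runs keep the HTML report.
  reporter: process.env.CI ? 'blob' : 'html',
  use: { trace: 'on-first-retry' },
});

// Pipeline commands, shown as comments because they live in the workflow:
//   npx playwright test --shard=1/10    (one invocation per shard job)
//   npx playwright merge-reports --reporter html ./all-blob-reports
// The merge step assumes every shard's blob report was downloaded into
// ./all-blob-reports first; that directory name is a choice, not a default.
```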

The payoff is a pipeline where a red check links to a report, the report links to the failing test, and the test links to its trace, screenshot, and video — no rerun, no SSH, no guessing.

How the pieces fit together

These three areas are not independent — they form one pipeline. Capture decides what evidence exists (traces, screenshots, video). Flaky-test management decides when that evidence is collected (on retry, on failure) and how you act on it. Reporters decide how the evidence is presented and where it is published. A suite that gets all three right has a property worth naming: any failure, anywhere, can be diagnosed from the report alone. Start with trace capture because it has the highest payoff per minute of setup, then layer retries-with-trace for flakiness, then publish everything through a reporter your CI exposes.

If you are still standing up the suite itself, begin one level up with Playwright Setup & Core Architecture, which establishes the config, fixtures, and project structure these observability settings plug into.

Building the diagnosis habit across a team

Tooling alone does not produce a diagnosable suite — habits do. A team can have traces, screenshots, and a beautiful HTML report and still waste hours per failure if no one knows the diagnosis order or trusts the evidence. The cultural side of observability matters as much as the configuration.

The first habit is never re-run to make red go away. A green re-run without an explanation is not a fix; it is a deferral. When a test passes on retry, the retry's trace exists precisely so someone can answer why it failed the first time. Treat every retry-green as a small bug report that ships with its own evidence attached.

The second habit is diagnose from the artifact, not the branch. If the first instinct on a failure is to check out the branch and reproduce locally, the observability investment is being wasted. The trace, screenshot, and video from the failed run already contain the answer for the overwhelming majority of failures. Reproducing locally should be the fallback for the rare case the artifacts cannot explain — usually genuine environment differences — not the default move.

The third habit is describe failures by signature. A failure reported as "the checkout test is broken" forces every reader to start from zero. The same failure reported as "strict mode violation: the Save locator matches two buttons after the modal opens" is already diagnosed and routed. Shared vocabulary for the common signatures — race, overlay, strict-mode, late response — compresses triage from minutes to seconds.

The fourth habit is keep the config honest. Timeouts that are too generous hide slow regressions; retries that are too high mask flakes; an artifact policy that records nothing leaves failures undiagnosable. Review these settings periodically the way you review any other production configuration, because for a test suite they are production configuration. The reference defaults in Playwright Config & Fixtures are a starting point, not a destination.
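
A periodic review can be partly automated. The check below is only a sketch of the idea: the interface is a plain-object view of the settings discussed in this guide, and every threshold is an illustrative assumption, not a Playwright recommendation.

```typescript
// A plain-object view of the observability settings. Names are hypothetical.
interface ObservabilityPolicy {
  timeoutMs: number;
  expectTimeoutMs: number;
  retries: number;
  trace: string;
  screenshot: string;
  video: string;
}

// Return human-readable warnings for settings that may hide problems.
// Thresholds (60s, 10s, 2 retries) are assumptions; tune them per suite.
function auditPolicy(p: ObservabilityPolicy): string[] {
  const warnings: string[] = [];
  if (p.timeoutMs > 60_000) {
    warnings.push('test timeout is generous enough to hide slow regressions');
  }
  if (p.expectTimeoutMs > 10_000) {
    warnings.push('expect timeout is generous enough to hide slow rendering');
  }
  if (p.retries > 2) {
    warnings.push('retry count is high enough to mask flakes');
  }
  if ([p.trace, p.screenshot, p.video].every((mode) => mode === 'off')) {
    warnings.push('no artifacts are recorded, so failures are undiagnosable');
  }
  return warnings;
}

const warnings = auditPolicy({
  timeoutMs: 300_000,
  expectTimeoutMs: 5_000,
  retries: 5,
  trace: 'off',
  screenshot: 'off',
  video: 'off',
});
console.log(warnings.length); // 3
```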

Together these habits turn observability from a set of files into a feedback loop: a failure arrives pre-diagnosed, the team reads it in the same language, the fix targets the root cause, and the next run proves it. That loop is the entire point of this guide, and every area it covers (traces, flaky-test work, reporters) exists to make one part of it faster.

Frequently Asked Questions

What is the single most useful debugging feature in Playwright?

The Trace Viewer. A trace is a complete recording of a test run (actions, DOM snapshots, network, and console), so you can step through a failure after the fact instead of reproducing it. Enable trace: 'on-first-retry' and a trace is recorded automatically the first time a failed test is retried.

How do I tell a flaky test from a real failure?

A real failure reproduces consistently; a flaky test passes on retry without any code change. Configure retries in CI plus trace: 'on-first-retry' so any test that only passes on retry is flagged with a trace attached, then triage by root cause rather than silently accepting the green.

Where should test artifacts live so the whole team can see them?

Publish the HTML reporter output and trace files as CI artifacts from your pipeline. That makes every failed job link directly to its screenshots, video, and trace, so a developer can diagnose the failure without checking out the branch or rerunning anything.