Flaky Test Management
A flaky test passes and fails against the same commit with no code change between runs. It is the most expensive failure mode in an automated suite because it trains engineers to ignore red builds, and once a real regression hides behind that noise, it ships. Playwright removes many sources of flakiness by design through auto-waiting and web-first assertions, and the remaining cases trace back to a small set of root causes: race conditions, animation timing, shared state between tests, and network non-determinism. This guide, part of Debugging & Test Observability, explains how to detect flakiness early, quarantine it so it stops blocking releases, and eliminate the underlying cause rather than papering over it with retries.
What flakiness actually is
A test is flaky when its outcome is not a pure function of the code under test. Some hidden input — wall-clock timing, scheduler order, leftover data from a previous run, a slow network hop — changes between executions and flips the result. The fix is never "add a longer wait"; that only widens the window in which the hidden input usually settles. The fix is to remove the dependency on that input, or to wait on the precise condition the test cares about. Playwright's auto-waiting already does this for most actions, so a flaky Playwright test almost always means an explicit waitForTimeout(), an assertion against a value that has not arrived yet, or state bleeding in from outside the test.
Root cause 1: race conditions
A race condition is an assertion that runs before the application has reached the state it asserts. The classic shape is reading a value immediately after triggering an async action:
import { test, expect } from '@playwright/test';

test('flaky: reads count before it updates', async ({ page }) => {
  await page.goto('/cart');
  await page.getByRole('button', { name: 'Add to cart' }).click();
  // BAD: textContent() reads the badge once, right after the click,
  // without waiting for the request to resolve and the badge to update.
  const text = await page.getByTestId('cart-count').textContent();
  expect(text).toBe('1'); // fails whenever the network is a few ms slower
});
The remedy is a web-first assertion, which polls until the expectation holds or the timeout expires:
import { test, expect } from '@playwright/test';

test('stable: waits for the rendered count', async ({ page }) => {
  await page.goto('/cart');
  await page.getByRole('button', { name: 'Add to cart' }).click();
  // GOOD: expect() retries until the badge shows 1 or the timeout is hit.
  await expect(page.getByTestId('cart-count')).toHaveText('1');
});
Whenever you assert against content that loads asynchronously, lean on the patterns in Handling Dynamic Content instead of guessing a delay.
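To make "polls until the expectation holds or the timeout expires" concrete, here is a simplified model of a retrying assertion. It is a conceptual sketch, not Playwright's implementation: pollUntil, its parameters, the fixed 100 ms interval, and the simulated readBadge are hypothetical names and values chosen for illustration.

```typescript
// Conceptual model of a retrying assertion (NOT Playwright's actual code).
// `read` fetches the current value; `matches` decides whether it is acceptable.
async function pollUntil<T>(
  read: () => Promise<T>,
  matches: (value: T) => boolean,
  timeoutMs = 5_000,
  intervalMs = 100, // assumption: fixed interval, chosen for simplicity
): Promise<T> {
  const deadline = Date.now() + timeoutMs;
  let last = await read();
  while (!matches(last)) {
    if (Date.now() >= deadline) {
      throw new Error(`Timed out after ${timeoutMs} ms; last value: ${String(last)}`);
    }
    // Wait a little, then read again instead of failing on the first stale value.
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
    last = await read();
  }
  return last;
}

// Simulated badge that only shows "1" on the third read, like a slow response.
let reads = 0;
const readBadge = async (): Promise<string> => (++reads >= 3 ? '1' : '0');
```

Calling pollUntil(readBadge, (text) => text === '1') succeeds on the third read, whereas a single read followed by a plain comparison would have failed on the first. That difference is what separates the flaky and stable versions of the cart test.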
Root cause 2: animation and transition timing
Elements that slide, fade, or expand are mid-flight for a few hundred milliseconds. A click dispatched during a CSS transition can land on the element's old position, and a screenshot taken mid-animation differs pixel-for-pixel between runs. Playwright's actionability checks already wait for an element to be stable, meaning its bounding box has not changed for at least two consecutive animation frames, before clicking, which covers most cases. For visual comparisons, disable animations so the rendered frame is identical every time:
import { test, expect } from '@playwright/test';

test('stable screenshot with animations frozen', async ({ page }) => {
  await page.goto('/modal');
  await page.getByRole('button', { name: 'Open' }).click();
  // 'disabled' fast-forwards finite CSS animations/transitions to their final state
  // and cancels infinite ones. It is the default for toHaveScreenshot(); it is
  // passed explicitly here to document the intent.
  await expect(page.getByRole('dialog')).toHaveScreenshot({ animations: 'disabled' });
});
Root cause 3: shared state between tests
If two tests touch the same user account, database row, or browser storage, the order they run in changes the result, and Playwright runs test files in parallel by default. The cure is isolation. Each test should create the data it needs and never assume a clean global. Playwright already gives every test a fresh browser context, which isolates cookies and local storage. Combine that with per-test fixtures for server-side data, both covered in Playwright Config & Fixtures, so no two tests can collide:
import { test, expect } from '@playwright/test';

// A fixture that provisions a unique account per test removes cross-test coupling.
const it = test.extend<{ account: string }>({
  account: async ({}, use) => {
    const id = `user-${Date.now()}-${Math.random().toString(16).slice(2)}`;
    await use(id); // each test gets its own id; nothing is shared
  },
});

it('signs in with an isolated account', async ({ page, account }) => {
  await page.goto(`/login?seed=${account}`);
  await expect(page.getByRole('heading', { name: 'Dashboard' })).toBeVisible();
});
Root cause 4: network timing
Tests that hit a live backend inherit its latency, rate limits, and shifting seed data. The same spec can pass when the API is warm and fail when it is cold. Mock the unstable dependency so the response is fixed, and reserve live calls for a dedicated contract suite. The interception patterns under Network Interception Basics make a request deterministic with page.route() and route.fulfill().
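As a minimal sketch of that idea, the helper below pins one endpoint to a fixed payload with page.route() and route.fulfill(). The endpoint pattern **/api/inventory, the payload, and the name stubInventory are hypothetical. PageLike and RouteLike are a hand-written subset of Playwright's Page and Route, declared locally only so the sketch has no imports; in a real spec the page fixture would be passed in directly.

```typescript
// Hand-written subset of Playwright's Route and Page (assumption: declared
// locally so this sketch can be read and run without '@playwright/test').
interface RouteLike {
  fulfill(options: { status?: number; contentType?: string; body?: string }): Promise<void>;
}
interface PageLike {
  route(url: string, handler: (route: RouteLike) => Promise<void>): Promise<void>;
}

// Hypothetical fixed payload: the same response on every run.
const FIXED_INVENTORY = { sku: 'demo-1', inStock: 3 };

// Register the stub before navigating so the matching request is answered
// locally instead of reaching the live backend.
async function stubInventory(page: PageLike): Promise<void> {
  await page.route('**/api/inventory', async (route) => {
    await route.fulfill({
      status: 200,
      contentType: 'application/json',
      body: JSON.stringify(FIXED_INVENTORY),
    });
  });
}
```

In a spec this would be called as await stubInventory(page) before page.goto(), after which assertions on the rendered stock count no longer depend on backend latency, rate limits, or seed data.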
Detecting flakiness before it reaches main
Flakiness that appears only once in fifty runs is invisible to a single CI pass. Force it into the open by running a test many times in a row with --repeat-each, and let CI surface intermittent failures by reporting any test that passes only on retry as flaky. The full workflow (running under load, reading the flaky markers in the report, and converting hard waits to web-first assertions) is covered in Detecting and Fixing Flaky Playwright Tests, which walks through a numbered diagnosis-to-fix procedure.
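A little arithmetic shows why a single pass misses rare flakes and how large a --repeat-each value needs to be. The sketch below assumes each run fails independently with the same probability, which real suites only approximate; the function names are hypothetical.

```typescript
// Probability of seeing at least one failure in `runs` executions,
// assuming each run fails independently with probability `failureRate`.
function detectionProbability(failureRate: number, runs: number): number {
  return 1 - Math.pow(1 - failureRate, runs);
}

// Smallest number of runs that exposes the flake with the given confidence.
function runsNeeded(failureRate: number, confidence: number): number {
  return Math.ceil(Math.log(1 - confidence) / Math.log(1 - failureRate));
}

const oneInFifty = 1 / 50;
console.log(detectionProbability(oneInFifty, 1).toFixed(2));  // "0.02"
console.log(detectionProbability(oneInFifty, 50).toFixed(2)); // "0.64"
console.log(runsNeeded(oneInFifty, 0.95));                    // 149
```

Under those assumptions a one-in-fifty flake shows up in a single run only 2% of the time, and even fifty repeats catch it a little under two times in three. Roughly 150 repeats are needed for a 95% chance, and retries should be set to 0 for that run so the failures are not absorbed.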
Quarantine: protect the build without hiding the problem
When a flaky test blocks an urgent release, do not delete it and do not silently retry forever. Tag it so it runs but cannot fail the build, and track every quarantined test as a bug with an owner and a deadline. A common pattern is a @flaky tag filtered into a non-blocking job:
import { test, expect } from '@playwright/test';

test('checkout total updates', { tag: '@flaky' }, async ({ page }) => {
  await page.goto('/checkout');
  await page.getByRole('button', { name: 'Apply coupon' }).click();
  await expect(page.getByTestId('total')).toHaveText('$90.00');
});
Run the blocking suite with --grep-invert @flaky and the quarantine suite separately with --grep @flaky. Quarantine is a holding pen, not a graveyard: a test that sits there for a month should be fixed or deleted.
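One way to wire that split is with two projects in playwright.config.ts, sketched below. The project names blocking and quarantine are hypothetical, and the object is exported without the usual defineConfig() wrapper only to keep the sketch free of imports. grep and grepInvert are the project-level counterparts of the command-line flags.

```typescript
// playwright.config.ts (sketch). Normally wrapped in defineConfig() from
// '@playwright/test' for type checking; left bare here to avoid imports.
const quarantineTag = /@flaky/;

const config = {
  projects: [
    // Gates the merge: everything except quarantined tests.
    { name: 'blocking', grepInvert: quarantineTag },
    // Runs in a separate, non-blocking CI job so the signal is still recorded.
    { name: 'quarantine', grep: quarantineTag },
  ],
};

export default config;
```

CI would then run npx playwright test --project=blocking as the required check and --project=quarantine as an optional job. How a job is marked non-blocking depends on the CI system, and a suite that already defines per-browser projects would need to combine the two schemes.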
Tuning retries and timeouts
Retries keep flakiness from failing the build, but they also hide it from you, so a test that passes only on retry must always be reported as flaky, never silently green. The retry count, the suite-wide test timeout, and the per-assertion expect.timeout all belong in playwright.config.ts, with per-test overrides for the rare slow path. The complete configuration reference lives in Configuring Retries and Timeouts for Stable CI.
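A minimal sketch of how those three settings sit together in playwright.config.ts. The numbers are illustrative assumptions rather than recommendations (30 s and 5 s are Playwright's defaults for the test and assertion timeouts), and buildConfig is a hypothetical helper used here so the CI branch is explicit.

```typescript
// playwright.config.ts (sketch). buildConfig is a hypothetical helper; a real
// config would usually wrap the object in defineConfig() from '@playwright/test'.
function buildConfig(isCI: boolean) {
  return {
    // Retry only in CI; locally a flaky test should fail loudly.
    retries: isCI ? 2 : 0,
    // Budget for one whole test, in milliseconds.
    timeout: 30_000,
    // Budget for each web-first assertion, in milliseconds.
    expect: { timeout: 5_000 },
  };
}

// Assumption: the CI flag is hard-coded here; a real config would typically
// derive it from an environment variable such as process.env.CI.
export default buildConfig(false);
```

For the rare slow path, a single test can raise its own budget with test.setTimeout(60_000) or triple it with test.slow(), which keeps the suite-wide defaults tight.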
Confirming a fix with the trace
Once you believe a test is fixed, prove it. Reproduce the original failure, open the recorded trace, and step through to the exact action whose outcome flipped to determine whether the application or the assertion was at fault. The Trace Viewer & Debugging guide shows how to read the action timeline, network panel, and DOM snapshots. Pair that with a tuned CI pipeline (see CI/CD Integration) so the same test runs identically on every machine and a fix on your laptop is a fix everywhere.
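Reading a trace requires one to have been recorded. The sketch below shows the relevant setting, assuming it lives in playwright.config.ts; 'retain-on-failure' is one of several valid values, and 'on-first-retry' is a lighter alternative when retries are enabled.

```typescript
// playwright.config.ts (sketch, shown without the usual defineConfig() wrapper
// from '@playwright/test' so that it has no imports).
const config = {
  use: {
    // Record a trace for every test but keep it only when the test fails.
    trace: 'retain-on-failure' as const,
  },
};

export default config;
```

A saved trace can then be opened with npx playwright show-trace followed by the path to the trace.zip file from the failed run.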
Frequently Asked Questions
Is adding retries enough to fix a flaky test?
No. Retries keep a flaky test from blocking the build, but they do not remove the underlying race or shared-state bug, and the test will still fail intermittently. Use retries as a safety net while you find and eliminate the root cause, and always surface retried tests as flaky in the report rather than treating them as passes.
Why does my test pass locally but fail in CI?
CI machines are usually slower and more contended, which widens timing windows that your faster laptop hides. The failure is almost always a race condition or a hard waitForTimeout() that happened to be long enough locally. Replace timed waits with web-first assertions and inspect the CI trace to see the exact action that ran too early.
Should I delete a flaky test I cannot fix quickly?
Quarantine it first so it runs without blocking the build, and track it as a bug with an owner. Only delete a test if the behavior it covers is no longer relevant or is verified elsewhere; a deleted test is lost coverage, while a quarantined one still reports signal.