Web Scraping & Data Extraction

A request library fetches the HTML the server first emits; a great deal of the modern web only exists after JavaScript runs. Single-page apps hydrate their content from XHR calls, lists grow as you scroll, and the data you want lives in a DOM that never appears in view-source. Playwright drives a real browser, so it sees exactly what a user sees — the rendered DOM, the resolved network responses, the authenticated session. That makes it a precise instrument for extracting structured data from pages that defeat plain HTTP clients. This guide covers the problems every durable extraction job must solve: rendering JavaScript reliably, mapping the DOM to clean typed records, walking through paginated and infinite-scroll result sets, surviving anti-bot defenses through respectful pacing, running the whole job in CI, and keeping it alive as the target site evolves.

Why naive HTTP scraping fails

The fastest way to understand the problem is to compare what two tools receive for the same URL. A curl or fetch call against a server-rendered blog gets the full article in the response body, because the server assembled the HTML before sending it. The same call against a modern e-commerce listing, a dashboard, or a social feed gets a near-empty shell: a div with an id like root, a bundle of script tags, and nothing resembling the products, rows, or posts you want. The content does not exist yet. It will exist only after the browser downloads the JavaScript bundle, executes it, fires one or more XHR or fetch calls to a backend API, receives JSON, and renders that JSON into DOM nodes. A plain HTTP client performs none of those steps.

This is why request-based scrapers are so brittle on client-rendered sites. People work around it by reverse-engineering the private API the page calls — which works until the endpoint changes auth, adds a signed header, or moves behind a gateway — or by parsing fragments of JSON embedded in script tags, which breaks the moment the build tool reshuffles its output. Both approaches fight the application instead of using it. Playwright takes the opposite stance: run the same engine the application was built for, let it render exactly as a user's browser would, and read the result. You trade raw speed for correctness and durability. For a JS-rendered or single-page-application target, that trade is almost always worth it, because a scraper that returns wrong or empty data quickly is worse than no scraper at all.

The extraction mental model

A durable scraper is not a script that runs once; it is a pipeline you can reason about stage by stage. Hold this sequence in your head and every later decision becomes local to one stage: browser context → render → wait → extract → paginate → persist, with a rate-limit and retry layer wrapping the whole loop.

A browser context is the isolated session — its own cookies, storage, and cache — established once per identity. Render is the navigation that downloads and executes the page's JavaScript. Wait is the synchronization step where you block on a concrete signal that the data has actually arrived, never on a fixed sleep. Extract maps the rendered DOM (or an intercepted API response) into typed records. Paginate feeds the next page or scroll batch back to the render stage and repeats. Persist writes shaped JSON to disk, a queue, or a database. The cross-cutting rate limiter governs how fast the loop turns, and the retry layer decides what happens when a navigation returns a 429 or a transient 503. The diagram below traces that flow; the rest of this guide treats one stage per section.

[Figure: A durable scraper is a loop, not a one-shot: render, extract, map, and emit, with a rate limiter and retry layer wrapping every navigation.]
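
The stage sequence above can be sketched as a small driver loop. This is a minimal illustration, not a Playwright API: the `Stages` interface, the `runPipeline` function, and the string cursor are all hypothetical names chosen for this example, and the rate-limit and retry layer is assumed to live inside `render`.

```typescript
// Hypothetical stage signatures for illustration; none of these names come from Playwright.
interface Stages<P, R> {
  render: (cursor: string) => Promise<P>;      // navigate and wait for the data signal
  extract: (page: P) => R[];                   // map the rendered page to typed records
  nextCursor: (page: P) => string | null;      // null means there is nothing left to fetch
  persist: (records: R[]) => Promise<void>;    // write to disk, a queue, or a database
}

// Turns the loop until nextCursor returns null or the iteration guard trips.
async function runPipeline<P, R>(
  start: string,
  stages: Stages<P, R>,
  maxPages = 100,
): Promise<number> {
  let cursor: string | null = start;
  let total = 0;
  for (let i = 0; cursor !== null && i < maxPages; i++) {
    const page = await stages.render(cursor);
    const records = stages.extract(page);
    await stages.persist(records);
    total += records.length;
    cursor = stages.nextCursor(page);
  }
  return total; // number of records persisted across all pages
}
```

Keeping each stage behind its own function is what makes a failure local: a selector change touches only `extract`, and a pagination change touches only `nextCursor`.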

Rendering JavaScript-heavy pages

The first decision is whether you even need a browser. If the data arrives in the initial HTML, a plain HTTP fetch is faster and cheaper. The moment content depends on client-side rendering — React, Vue, Svelte, or any framework that hydrates from an API — you need a real engine, and that is what Playwright supplies. Navigate with page.goto(url) and then wait for the data, not for an arbitrary timeout. Auto-waiting locators resolve when the element you target is actually present, which is the synchronization strategy detailed in Handling Dynamic Content. Prefer waiting for a concrete signal — a row count, a specific text node, or a settled network response — over waitForTimeout().

import { chromium } from 'playwright';

const browser = await chromium.launch();
const page = await browser.newPage();

// Navigate, then wait for the rendered data — not a fixed sleep.
await page.goto('https://example.com/products', { waitUntil: 'domcontentloaded' });

// Wait for the first product card to attach, proving the SPA has hydrated.
await page.getByRole('listitem').first().waitFor();

const titles = await page.getByRole('heading', { level: 3 }).allTextContents();
console.log(titles);
await browser.close();

waitUntil: 'domcontentloaded' returns as soon as the HTML parses, which is usually correct for SPAs because the meaningful work happens afterward in scripts you then wait on explicitly. Reserve 'networkidle' for pages where you genuinely cannot name a DOM signal — it is slower and can hang on sites that poll continuously. The synchronization principle is the same one that keeps a test suite stable: assert on a state the application reaches, not on a clock. Auto-waiting locators such as getByRole() and the explicit locator.waitFor() call both encode that idea, and you should reach for them before any timeout.

Reading the API instead of the DOM

When a list view fetches /api/items?page=1, the cleanest extraction often bypasses the rendered HTML. Listen for the response with page.waitForResponse(), parse its JSON, and you get the data in its native shape with no selectors to maintain. This is the most resilient extraction path of all because it does not break when a designer reshuffles the markup — the markup is downstream of the JSON you are already reading. The DOM-parsing techniques in the Structured Data Extraction guide remain the right tool when no clean API exists or when the rendered view merges data from several calls. Reading responses directly is the core idea behind Network Interception Basics, and it pairs naturally with extraction.

import { chromium } from 'playwright';

interface Item { id: number; title: string; }

const browser = await chromium.launch();
const page = await browser.newPage();

// Start listening BEFORE the action that triggers the request, then navigate.
const responsePromise = page.waitForResponse(
  (r) => r.url().includes('/api/items') && r.status() === 200,
);
await page.goto('https://example.com/products');

const response = await responsePromise;
const payload = (await response.json()) as { items: Item[] };

// Shape the raw payload into the exact records you want to keep.
const items: Item[] = payload.items.map((i) => ({ id: i.id, title: i.title.trim() }));
console.log(items);
await browser.close();

The ordering matters: register waitForResponse() before the navigation or click that triggers the request, or you will race the network and miss the response. Once you hold the Response object you can call response.json() for an API or response.body() for binary payloads, then map the result into your own typed shape so downstream code never depends on the server's field names.
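
Because the payload comes from a server you do not control, a cast alone proves nothing about its shape. The sketch below shows one way to validate it before mapping; `isItem` and `parseItems` are hypothetical helper names, and the `{ items: [...] }` envelope is the same assumption the example above makes.

```typescript
interface Item { id: number; title: string; }

// Type guard: true only when the value carries the two fields we keep.
function isItem(value: unknown): value is Item {
  if (typeof value !== 'object' || value === null) return false;
  const v = value as Record<string, unknown>;
  return typeof v.id === 'number' && typeof v.title === 'string';
}

// Accepts whatever response.json() returned and yields only well-formed items.
function parseItems(payload: unknown): Item[] {
  if (typeof payload !== 'object' || payload === null) return [];
  const items = (payload as { items?: unknown }).items;
  if (!Array.isArray(items)) return [];
  return items.filter(isItem).map((i) => ({ id: i.id, title: i.title.trim() }));
}
```

Passing the result of `response.json()` through `parseItems` means a renamed or retyped field shows up as missing records rather than as a runtime error deep in downstream code.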

Structured Data Extraction

Raw text scraped from a page is nearly worthless until it is shaped into records with stable field names and types. The Structured Data Extraction guide is dedicated to this step: using locator.evaluateAll() to run a mapping function over every matching node in a single round trip, reading textContent() and getAttribute(), normalizing whitespace and currency, and producing an array of typed objects ready to serialize. Two patterns dominate.

The first runs entirely in the page context with evaluateAll, returning plain data across the bridge in one call — ideal for large lists because it avoids a round trip per field. The second uses Playwright locators in Node, which is more readable and lets you reuse accessibility-first selectors like those in getByRole & Accessibility Selectors, at the cost of more cross-process calls. Choose evaluateAll for throughput and locators for clarity and stability.

import { chromium } from 'playwright';

interface Product { name: string; price: number; url: string; }

const browser = await chromium.launch();
const page = await browser.newPage();
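await page.goto('https://example.com/products');
// Wait for the first card so extraction does not run against an unhydrated page.
await page.locator('.card').first().waitFor();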

// One evaluateAll call maps every card into a typed record set.
const products: Product[] = await page.locator('.card').evaluateAll((cards) =>
  cards.map((c) => ({
    name: c.querySelector('h3')?.textContent?.trim() ?? '',
    // Strip the currency symbol and parse to a number for clean output.
    price: Number(c.querySelector('.price')?.textContent?.replace(/[^0-9.]/g, '') ?? 0),
    url: c.querySelector('a')?.getAttribute('href') ?? '',
  })),
);
console.log(products);
await browser.close();

Two long-form walkthroughs build on this: Extracting Tables and Lists to JSON with Playwright handles tabular layouts and writes a file, and Scraping Data Behind Login Sessions covers reaching data that requires authentication.

Shaping and validating records

The extraction call gets you raw strings; a record set you can trust needs one more pass. Trim each value and collapse internal runs of whitespace, parse numbers and dates out of their display formats, resolve relative URLs against the page origin, and drop or flag rows that fail a basic shape check. Doing this at the extraction boundary means every downstream consumer (your file writer, your database loader, your diff against yesterday's run) receives clean, typed objects rather than presentation text.

import { chromium } from 'playwright';
import { mkdir, writeFile } from 'node:fs/promises';

interface Row { sku: string; price: number; inStock: boolean; }

const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto('https://example.com/inventory');
await page.getByRole('row').first().waitFor();

// Pull raw cells in one round trip, then validate and coerce in Node.
const raw = await page.getByRole('row').evaluateAll((rows) =>
  rows.map((r) => ({
    sku: r.querySelector('[data-sku]')?.textContent ?? '',
    price: r.querySelector('.price')?.textContent ?? '',
    stock: r.querySelector('.stock')?.textContent ?? '',
  })),
);

const records: Row[] = raw
  .map((r) => ({
    sku: r.sku.trim(),
    // parseFloat yields NaN for an empty string, so a missing price is filtered out below.
    price: Number.parseFloat(r.price.replace(/[^0-9.]/g, '')),
    inStock: /in stock/i.test(r.stock),
  }))
  .filter((r) => r.sku && Number.isFinite(r.price)); // discard malformed rows

// Create the output directory if needed, then persist as pretty JSON so diffs stay readable.
await mkdir('out', { recursive: true });
await writeFile('out/inventory.json', JSON.stringify(records, null, 2));
await browser.close();

That filter step is your last line of defense against silent corruption: when a layout change turns half your prices into NaN, you would rather drop those rows and notice the gap than write garbage. Pair the validation with a logged count so a sudden drop in record volume becomes visible immediately.
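
The normalization steps described in this section can be sketched as a few small pure functions. The helper names (`cleanText`, `parsePrice`, `absoluteUrl`, `partition`) are hypothetical, and `parsePrice` assumes a period as the decimal separator and a comma as the thousands separator, which will not hold for every locale.

```typescript
// Collapse internal runs of whitespace (including newlines) and trim the ends.
function cleanText(raw: string): string {
  return raw.replace(/\s+/g, ' ').trim();
}

// Parse a display price such as "$1,299.00"; returns NaN when no number remains.
// Assumes "." is the decimal separator and "," groups thousands.
function parsePrice(raw: string): number {
  return Number.parseFloat(raw.replace(/[^0-9.]/g, ''));
}

// Resolve an href against the page URL; returns null when either part is unparseable.
function absoluteUrl(href: string, pageUrl: string): string | null {
  try {
    return new URL(href, pageUrl).toString();
  } catch {
    return null;
  }
}

// Split rows into kept and rejected so the rejected count can be logged.
function partition<T>(rows: T[], ok: (row: T) => boolean): { kept: T[]; rejected: T[] } {
  const kept: T[] = [];
  const rejected: T[] = [];
  for (const row of rows) (ok(row) ? kept : rejected).push(row);
  return { kept, rejected };
}
```

Logging `rejected.length` next to `kept.length` on every run gives you the record-volume signal mentioned above without any extra bookkeeping.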

Pagination & Infinite Scroll

Most datasets do not fit on one screen. The Pagination & Infinite Scroll guide covers the two shapes this takes. Classic pagination exposes numbered links or a "next" control; you extract the current page, follow the next link, and stop when it disappears. Infinite scroll appends rows as the viewport approaches the bottom; you scroll, wait for the new batch to attach, and repeat until the count stops growing.

Both are loops with a clear termination condition, and the cardinal rule is the same: never trust a fixed sleep to mean "the next batch loaded." Wait for the row count to increase or for a sentinel element to appear. A count that stops growing is your signal to stop, and a guard on maximum iterations protects you against feeds that never end.

import { chromium } from 'playwright';

const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto('https://example.com/feed');

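const rows = page.getByRole('article');
// Wait for the first batch so the loop does not exit before the feed has rendered.
await rows.first().waitFor();
let previous = 0;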

// Scroll until the article count stops changing — the natural end of the feed.
for (let i = 0; i < 50; i++) {
  const count = await rows.count();
  if (count === previous) break;          // no growth means we reached the end
  previous = count;
  await rows.nth(count - 1).scrollIntoViewIfNeeded();
  await rows.nth(count).waitFor({ timeout: 5000 }).catch(() => {}); // wait for next batch
}
console.log(`Collected ${await rows.count()} items`);
await browser.close();

Classic numbered pagination follows the same loop shape with a different stop condition: extract the current page, click or navigate to the next control, wait for the new page's content to attach, and stop when the next control disappears or is disabled. Track which page you are on so a crash can resume rather than restart, and accumulate records as you go.

import { chromium } from 'playwright';

const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto('https://example.com/results?page=1');

const all: string[] = [];
const next = page.getByRole('link', { name: /next/i });

// Walk pages until the "next" control is gone or disabled.
for (let guard = 0; guard < 200; guard++) {
  await page.getByRole('listitem').first().waitFor();
  all.push(...(await page.getByRole('listitem').allTextContents()));

  if ((await next.count()) === 0 || (await next.isDisabled())) break;
  await next.click();
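  // Assumes "next" is a real link that triggers a navigation; for client-side
  // pagination, wait for the list content to change instead.
  await page.waitForLoadState('domcontentloaded');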
}
console.log(`Collected ${all.length} records across pages`);
await browser.close();

A dedicated walkthrough, Scraping Infinite Scroll Pages with Playwright, takes the scroll pattern to production with deduplication and progress checkpoints.
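
As a rough sketch of those two ideas, the helpers below drop repeated records and save progress between pages. All names here (`dedupeBy`, `Checkpoint`, `saveCheckpoint`, `loadCheckpoint`, `clearCheckpoint`) and the checkpoint file layout are assumptions made for illustration, not the approach the walkthrough necessarily takes.

```typescript
import { existsSync, readFileSync, renameSync, rmSync, writeFileSync } from 'node:fs';

// Hypothetical checkpoint shape: the next page to fetch and how many records exist so far.
interface Checkpoint { nextPage: number; collected: number; }

// Keep the first record seen for each key; later duplicates are dropped.
function dedupeBy<T>(records: T[], key: (record: T) => string): T[] {
  const seen = new Set<string>();
  const out: T[] = [];
  for (const record of records) {
    const k = key(record);
    if (seen.has(k)) continue;
    seen.add(k);
    out.push(record);
  }
  return out;
}

// Write to a temp file and rename, so a crash mid-write cannot leave a torn checkpoint.
function saveCheckpoint(path: string, checkpoint: Checkpoint): void {
  const tmp = `${path}.tmp`;
  writeFileSync(tmp, JSON.stringify(checkpoint));
  renameSync(tmp, path);
}

// Start from page 1 when no checkpoint exists yet.
function loadCheckpoint(path: string): Checkpoint {
  if (!existsSync(path)) return { nextPage: 1, collected: 0 };
  return JSON.parse(readFileSync(path, 'utf8')) as Checkpoint;
}

// Remove the checkpoint once a run completes so the next run starts fresh.
function clearCheckpoint(path: string): void {
  rmSync(path, { force: true });
}
```

A paginated loop would call `loadCheckpoint` before its first navigation, `saveCheckpoint` after each page is persisted, and `clearCheckpoint` once the stop condition is reached.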

Sessions and authentication

Data behind a login is reachable, but logging in on every run is slow and increases load on the target. Playwright's storageState serializes cookies and local storage to a file after one successful login, and every later run loads that file to start already authenticated. Sessions are scoped to a BrowserContext, so the isolation model in Browser Contexts & Isolation is the foundation here — one context per identity, no leakage between jobs.

import { chromium } from 'playwright';

const browser = await chromium.launch();

// Reuse a saved session so each run starts logged in.
const context = await browser.newContext({ storageState: 'auth/session.json' });
const page = await context.newPage();
await page.goto('https://example.com/account/orders');

await page.getByRole('table').waitFor();
await context.close();
await browser.close();

Capturing the session is a one-time bootstrap: log in interactively or programmatically once, then call context.storageState() to serialize cookies and local storage to disk. Every later run loads that file and starts authenticated, which removes the login from the hot path and sharply reduces the load you place on the target's auth system.

import { chromium } from 'playwright';

const browser = await chromium.launch();
const context = await browser.newContext();
const page = await context.newPage();

// Log in once, interactively or via filled fields, then save the session.
await page.goto('https://example.com/login');
await page.getByLabel('Email').fill(process.env.SCRAPE_USER ?? '');
await page.getByLabel('Password').fill(process.env.SCRAPE_PASS ?? '');
await page.getByRole('button', { name: 'Sign in' }).click();
await page.getByRole('heading', { name: /dashboard/i }).waitFor();

// Persist cookies + storage so future runs skip the login entirely.
await context.storageState({ path: 'auth/session.json' });
await browser.close();

Read credentials from environment variables, never from the source — the same secret-handling discipline a CI pipeline expects. The full lifecycle, including refreshing state when it expires and detecting a logged-out redirect mid-run, is the subject of Scraping Data Behind Login Sessions.
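
One cheap way to notice an expired session before navigating is to inspect the saved state file. The sketch below assumes the file holds a `cookies` array whose entries carry `name` and `expires` (Unix time in seconds, with -1 for session cookies), which is the shape `context.storageState()` writes; the function name `sessionLooksFresh` and the cookie name you pass in are hypothetical and site-specific.

```typescript
// Subset of the saved storage state; only the fields this check reads.
interface StoredCookie { name: string; expires: number; }
interface StoredState { cookies: StoredCookie[]; }

// True when the named cookie exists and will outlive the safety margin.
// A session cookie (expires === -1) carries no expiry, so it is assumed usable.
function sessionLooksFresh(
  state: StoredState,
  cookieName: string,
  nowMs: number,
  marginSeconds = 300,
): boolean {
  const cookie = state.cookies.find((c) => c.name === cookieName);
  if (!cookie) return false;
  if (cookie.expires === -1) return true;
  return cookie.expires - marginSeconds > nowMs / 1000;
}
```

This is only a local hint: the server can invalidate a session at any time, so a logged-out redirect during the run still needs to be detected and handled.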

Anti-Bot Defenses & Rate Limiting

Extraction at scale is a courtesy negotiation with the site you depend on. The Anti-Bot Defenses & Rate Limiting guide covers pacing requests so you never overwhelm a server, honoring robots.txt and a site's terms of service, identifying yourself with an honest user agent, and backing off when you receive a 429 Too Many Requests. The goal is robust, considerate automation — not circumventing security controls. If a site signals that automated access is unwelcome, the correct response is to stop and seek a permitted data source or an official API, not to disguise your traffic.

Resilience and respect are the same engineering work. A request rate a human could plausibly produce keeps you below the thresholds that trigger defensive responses in the first place, and exponential backoff on 429 and 503 responses both protects the host and keeps your job alive when you do cross a limit. The retry mechanics overlap heavily with the test stability patterns covered under Flaky Test Management: a transient failure is a transient failure whether it comes from a CI runner or a busy origin server. A practical treatment lives in Handling Rate Limits and Retries When Scraping.

import { chromium, type Page } from 'playwright';

// Navigate with backoff so a 429/503 pauses us instead of hammering the host.
async function politeGoto(page: Page, url: string, attempt = 0): Promise<void> {
  const response = await page.goto(url);
  const status = response?.status() ?? 0;
  if (status !== 429 && status !== 503) return;
  // Give up loudly after five retries rather than extracting from an error page.
  if (attempt >= 5) throw new Error(`Still receiving ${status} from ${url} after ${attempt} retries`);
  // Honor a numeric Retry-After header when present, else back off exponentially.
  const header = Number(response?.headers()['retry-after']);
  const wait = Number.isFinite(header) && header > 0 ? header * 1000 : 2 ** attempt * 1000;
  await page.waitForTimeout(wait); // 1s, 2s, 4s, 8s, 16s, or the server's wait
  return politeGoto(page, url, attempt + 1);
}

const browser = await chromium.launch();
const page = await browser.newPage();
await politeGoto(page, 'https://example.com/listing');
await browser.close();

When a server sends a Retry-After header it is telling you exactly how long to wait; honoring it is both more polite and more effective than a guess. Identify your crawler with an honest, descriptive user agent and a contact URL so an administrator can reach you rather than block you blind. These are the behaviors of a good network citizen, and they are also the ones least likely to get your traffic flagged.
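
Retry-After can arrive either as a number of seconds or as an HTTP date, so a delay calculation that handles both forms is worth isolating. The following is a minimal sketch; `retryDelayMs`, its parameters, and the 60-second cap on the fallback are assumptions for illustration.

```typescript
// Compute how long to wait before retrying, in milliseconds.
// A usable Retry-After value is honored as given; the cap applies only to the
// exponential fallback used when the header is absent or unparseable.
function retryDelayMs(
  retryAfter: string | undefined,
  attempt: number,
  nowMs: number,
  capMs = 60_000,
): number {
  const fallback = Math.min(2 ** attempt * 1000, capMs);
  if (retryAfter === undefined) return fallback;
  const value = retryAfter.trim();
  if (value === '') return fallback;
  // Delta-seconds form, for example "120".
  if (/^\d+$/.test(value)) return Number(value) * 1000;
  // HTTP-date form, for example "Tue, 01 Jan 2030 00:00:30 GMT".
  const date = Date.parse(value);
  if (Number.isNaN(date) || date <= nowMs) return fallback;
  return date - nowMs;
}
```

Passing the clock in as `nowMs` keeps the function pure, which makes the date branch straightforward to test without waiting on real time.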

A note on ethics and the law

Just because a browser can load a page does not mean every use of that data is permitted. Always read and obey robots.txt, respect the site's terms of service, and avoid collecting personal data without a lawful basis. Throttle to a rate a human could plausibly produce, cache aggressively so you never re-fetch what you already have, and prefer official APIs whenever one exists. Treat the target's infrastructure as a shared resource you are borrowing. Robust automation and good citizenship are not in tension — a well-behaved scraper is also the one least likely to be blocked.

Cross-cutting integration

Extraction reuses the same primitives that make a Playwright test suite stable, and reading the sibling guides will save you from reinventing them. The single most important is synchronization: the rules for waiting on asynchronously rendered content are identical whether you are asserting on it or scraping it, and Handling Dynamic Content is the reference for waiting on React and other client-rendered components without flaky sleeps.

The second is network awareness. Reading an API response directly with waitForResponse(), or stubbing a noisy third-party call so it does not slow your run, is the territory of Network Interception Basics. You can also use route() to abort requests for images, fonts, and analytics beacons you do not need — a scraper that never downloads a megabyte of hero imagery is both faster and lighter on the host.

import { chromium } from 'playwright';

const browser = await chromium.launch();
const context = await browser.newContext();

// Block heavy, irrelevant resources so each page render is fast and cheap.
await context.route('**/*', (route) => {
  const type = route.request().resourceType();
  if (type === 'image' || type === 'font' || type === 'media') return route.abort();
  return route.continue();
});

const page = await context.newPage();
await page.goto('https://example.com/products');
await page.getByRole('listitem').first().waitFor();
await browser.close();

The third is isolation. Each extraction identity — each logged-in account, each set of cookies — belongs in its own BrowserContext so sessions never bleed together, which is exactly the model laid out in Browser Contexts & Isolation. One context per identity also lets you run several extractions in parallel inside a single browser process without them sharing state.

Running extraction at scale in CI

A scraper that works on your laptop is not yet a scraper you can depend on. The next step is running it unattended: headless, on a schedule, with the work split across parallel workers and the output collected as an artifact. The mechanics are the same ones used to run a test suite in continuous integration, so CI/CD Integration is the companion guide here.

Run headless in CI — it is the default and the only sensible mode on a runner with no display. Split a large job by giving each worker a slice of the input, the same idea as test sharding: if you have 10,000 product URLs and four workers, worker i handles every URL where the index modulo four equals i. Each worker writes its own output file, and a final step merges them.

import { chromium } from 'playwright';
import { mkdir, writeFile } from 'node:fs/promises';

// CI sets these so each parallel worker scrapes a disjoint slice of the URLs.
const shard = Number(process.env.SHARD_INDEX ?? 0);
const total = Number(process.env.SHARD_TOTAL ?? 1);

const allUrls: string[] = JSON.parse(process.env.URL_LIST ?? '[]');
const mine = allUrls.filter((_, i) => i % total === shard); // this worker's share

const browser = await chromium.launch(); // headless by default on CI
const page = await browser.newPage();
const out: { url: string; title: string }[] = [];

for (const url of mine) {
  await page.goto(url, { waitUntil: 'domcontentloaded' });
  out.push({ url, title: await page.title() });
}

// Create the output directory if it does not exist yet, then write this shard's file.
await mkdir('out', { recursive: true });
await writeFile(`out/shard-${shard}.json`, JSON.stringify(out, null, 2));
await browser.close();

Schedule the job with whatever your CI offers — a cron trigger in GitHub Actions, a scheduled pipeline elsewhere — and keep runs short enough to fit comfortably inside the runner's time limit; if they do not, add workers rather than lengthening a single run. Upload the merged JSON as a build artifact so every run leaves an auditable record of exactly what was collected and when.
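
The merge step can be as small as the sketch below, which combines the parsed contents of each shard file into one sorted, de-duplicated list. `PageRecord` and `mergeShards` are hypothetical names, reading the shard files from disk is left to the caller, and keying on `url` assumes each URL should appear once in the final output.

```typescript
interface PageRecord { url: string; title: string; }

// Combine per-shard arrays, keeping the first record seen for each URL,
// and sort by URL so consecutive runs produce stable, diffable output.
function mergeShards(shards: PageRecord[][]): PageRecord[] {
  const byUrl = new Map<string, PageRecord>();
  for (const shard of shards) {
    for (const record of shard) {
      if (!byUrl.has(record.url)) byUrl.set(record.url, record);
    }
  }
  return Array.from(byUrl.values()).sort((a, b) =>
    a.url < b.url ? -1 : a.url > b.url ? 1 : 0,
  );
}
```

Comparing the merged length against the size of the input URL list is a quick check that no shard silently failed to report.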

Reliability & maintenance

Scrapers rot. The target site is outside your control, so a layout change, a renamed CSS class, or a new consent banner can break extraction overnight with no warning. The defense is the same engineering hygiene that keeps a flaky test suite healthy, and Flaky Test Management is the reference for the failure-triage mindset.

The biggest source of breakage is selector drift: brittle CSS chains like div > div:nth-child(3) > span shatter when the markup shifts. Prefer the accessibility-first and attribute-based selectors from Reliable Selector Strategies for Playwright — a getByRole() call or a data- attribute survives a visual refactor that a positional CSS chain does not. Where you must use CSS, anchor on stable, semantic attributes rather than presentational classes.

The second defense is monitoring. A scraper that returns zero rows should fail loudly, not write an empty file and exit cleanly. Assert a minimum record count, compare today's volume against the recent baseline, and alert when extraction falls off a cliff — a sudden drop almost always means the page changed, not that the data vanished. Log per-page timings and HTTP status distributions so a creeping rise in 429s is visible before it becomes a block.

import { chromium } from 'playwright';

const MIN_EXPECTED = 20; // a healthy run always returns at least this many rows

const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto('https://example.com/products');

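// count() does not auto-wait, so give the list a chance to render first.
await page.getByRole('listitem').first().waitFor().catch(() => {});
const count = await page.getByRole('listitem').count();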
if (count < MIN_EXPECTED) {
  // Fail the run so monitoring catches the breakage instead of writing junk.
  throw new Error(`Extraction returned ${count} rows, expected >= ${MIN_EXPECTED}`);
}
console.log(`Healthy run: ${count} rows`);
await browser.close();
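
A fixed minimum catches total failure, while comparing against recent runs catches a partial drop. The sketch below shows one possible baseline check; `median`, `checkVolume`, the `Verdict` type, and the 0.5 ratio are all assumptions for illustration, and where the history of counts is stored is left to you.

```typescript
// Median of a non-empty list; less sensitive to one odd run than the mean.
function median(values: number[]): number {
  const sorted = values.slice().sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

type Verdict = { ok: true } | { ok: false; reason: string };

// Flag a run whose volume is zero or falls below minRatio of the recent median.
function checkVolume(today: number, recentCounts: number[], minRatio = 0.5): Verdict {
  if (today === 0) return { ok: false, reason: 'zero rows extracted' };
  if (recentCounts.length === 0) return { ok: true }; // no baseline to compare against yet
  const baseline = median(recentCounts);
  if (today < baseline * minRatio) {
    return { ok: false, reason: `${today} rows is below ${minRatio} x baseline ${baseline}` };
  }
  return { ok: true };
}
```

A run would throw when `checkVolume` reports `ok: false`, and append its own count to the history otherwise.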

The third is respectful pacing as a maintenance practice, not just an ethical one. A scraper that hammers a host gets blocked, and a blocked scraper is a broken scraper. Keep the rate modest, cache what you have already fetched, and treat a rising 429 rate as a signal to slow down rather than retry harder. The cheapest scraper to maintain is the one the target barely notices.
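
Pacing and caching can both be expressed as small helpers that sit in front of every navigation. This is a sketch under stated assumptions: `paceDelayMs` and `PageCache` are hypothetical names, the cache lives in memory only, and the clock is passed in explicitly so the logic stays deterministic.

```typescript
// How long to sleep before the next request so that consecutive requests
// are at least minIntervalMs apart. Returns 0 for the first request.
function paceDelayMs(lastRequestAtMs: number | null, nowMs: number, minIntervalMs: number): number {
  if (lastRequestAtMs === null) return 0;
  const elapsed = nowMs - lastRequestAtMs;
  return Math.max(minIntervalMs - elapsed, 0);
}

// In-memory cache keyed by URL; entries older than ttlMs are treated as absent.
class PageCache<T> {
  private entries = new Map<string, { value: T; storedAtMs: number }>();
  private ttlMs: number;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  get(url: string, nowMs: number): T | undefined {
    const entry = this.entries.get(url);
    if (!entry) return undefined;
    if (nowMs - entry.storedAtMs > this.ttlMs) {
      this.entries.delete(url); // expired: forget it so the caller re-fetches
      return undefined;
    }
    return entry.value;
  }

  set(url: string, value: T, nowMs: number): void {
    this.entries.set(url, { value, storedAtMs: nowMs });
  }
}
```

A loop would consult the cache first, and only on a miss wait for `paceDelayMs(...)` milliseconds, navigate, and store the result.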

Where to go next

Each section above has its own guide. Start with Structured Data Extraction to nail the DOM-to-record mapping, move to Pagination & Infinite Scroll once a single page works, and layer in Anti-Bot Defenses & Rate Limiting before you run anything at volume. For the browser fundamentals underneath all of it, see Playwright Setup & Core Architecture.

Frequently Asked Questions

When should I use Playwright instead of a plain HTTP request for scraping?

Use a plain HTTP client when the data you need is present in the initial server-rendered HTML, because it is faster and uses far fewer resources. Reach for Playwright when content is rendered by client-side JavaScript, when you need an authenticated session, or when the page only assembles its data after XHR calls resolve. A real browser sees the final rendered DOM that a request library never receives.

How do I scrape responsibly without overloading a site?

Read and honor robots.txt and the site's terms of service, pace your requests with delays a human could plausibly produce, and back off with increasing waits whenever you receive a 429 or 503 response. Cache results so you never re-fetch the same page, and prefer an official API whenever one is available. If a site signals that automated access is unwelcome, stop rather than trying to evade its controls.

What is the most resilient way to extract data from a single-page app?

When the page fetches its data from an API, read that response directly with page.waitForResponse() and parse the JSON, because it does not break when the markup changes. When no clean API exists, wait for a concrete DOM signal such as a row count or specific element before extracting, and prefer accessibility-first locators over brittle CSS chains so the scraper survives layout refactors.