Structured Data Extraction

Scraping is only half the job. The other half — the half that decides whether your output is usable — is turning a messy, presentational DOM into clean records with stable field names and correct types. A page renders a price as "$1,299.00" inside a nested span; your dataset needs the number 1299. A date shows as "3 days ago"; you want an ISO string. This guide, part of the broader Web Scraping & Data Extraction discipline, covers the mechanics of mapping the rendered DOM to typed objects: choosing between locator.evaluateAll() and Node-side locators, reading textContent() and getAttribute(), normalizing values, and serializing the result to JSON you can trust downstream.

Extraction is a mapping: many repeated elements collapse into one typed array, with normalization happening in the map step.

Two extraction strategies

There are two ways to pull data out of a Playwright page, and the right choice depends on volume and complexity.

The first is locator.evaluateAll(), which serializes a function, runs it once inside the browser with every matching node passed in as an array, and returns plain, serializable data across the bridge in a single round trip. For a list of a thousand cards with five fields each, that is one call instead of five thousand individual locator calls, because nothing crosses the Node-to-browser boundary per field. The trade-off is that the callback runs in the page, so you write DOM APIs (querySelector, textContent) rather than Playwright locators, you cannot reference helpers or variables from your Node scope unless you pass them in as an argument, and you get none of the auto-waiting that locators provide: evaluateAll() works on whatever matches at the moment it runs, and the callback receives an empty array if nothing has rendered yet.

The second is Node-side locators: you iterate elements and call locator.textContent() or locator.getAttribute() from your Node script. This is more readable, lets you reuse accessibility-first selectors such as those in getByRole & Accessibility Selectors, and inherits auto-waiting on each read, but it costs one cross-process call per field. Use it for small or irregular result sets and where selector stability matters more than raw throughput.

import { chromium } from 'playwright';

interface Article { title: string; author: string; href: string; }

const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto('https://example.com/blog');

// Strategy A: evaluateAll — one bridge call maps every <article> to a record.
const articles: Article[] = await page.locator('article').evaluateAll((nodes) =>
  nodes.map((n) => ({
    title: n.querySelector('h2')?.textContent?.trim() ?? '',
    author: n.querySelector('.byline')?.textContent?.trim() ?? '',
    href: n.querySelector('a')?.getAttribute('href') ?? '',
  })),
);

console.log(articles.length, 'articles extracted');
await browser.close();
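
For comparison, Strategy B can be sketched as a function that reads one card through Node-side calls. To keep the sketch runnable without a browser, it is typed against two small structural interfaces (the hypothetical FieldReader and CardLike) rather than Playwright's Locator; a real Locator is expected to satisfy them, since it exposes locator(), textContent(), and getAttribute() with compatible signatures. The selectors mirror Strategy A and are assumptions about the page.

```typescript
interface Article { title: string; author: string; href: string; }

// Hypothetical structural slices of Playwright's Locator, declared locally so
// the sketch runs without a browser. A real Locator is expected to satisfy them.
interface FieldReader {
  textContent(): Promise<string | null>;
  getAttribute(name: string): Promise<string | null>;
}
interface CardLike {
  locator(selector: string): FieldReader;
}

// Strategy B: one awaited call per field, so three bridge calls per card.
async function extractArticle(card: CardLike): Promise<Article> {
  const title = (await card.locator('h2').textContent())?.trim() ?? '';
  const author = (await card.locator('.byline').textContent())?.trim() ?? '';
  const href = (await card.locator('a').getAttribute('href')) ?? '';
  return { title, author, href };
}
```

With Playwright, you would loop over the result of await page.locator('article').all() and call extractArticle(card) for each entry. Two caveats apply: all() does not wait for matches, so make sure the list has rendered first, and a card containing several links would need a narrower selector or .first(), because locator reads are strict about matching a single element.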

Reading text and attributes cleanly

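The two workhorses are textContent() for the text of a node and getAttribute() for attribute values like href, src, or data-*. Two pitfalls recur. First, textContent returns all descendant text, including hidden helper spans and whitespace from the markup's indentation, so always .trim() and collapse internal whitespace with a regex when a field spans multiple inline elements. Second, missing data surfaces as null: inside the page, querySelector() returns null when nothing matches, and getAttribute() returns null when the attribute is absent, so guard every read with a nullish fallback to avoid a thrown TypeError or a record full of nulls. (A Node-side locator.textContent() behaves differently when the element is missing: it waits and eventually times out rather than returning null.)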
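
A minimal sketch of the trim-and-collapse step, assuming a helper named cleanText (a hypothetical name, not part of Playwright). It accepts null so it can sit directly behind a read that might come back empty.

```typescript
// Hypothetical helper: turns null/undefined into '', collapses every run of
// whitespace (spaces, newlines, tabs, non-breaking spaces) into one space,
// and trims the ends.
function cleanText(raw: string | null | undefined): string {
  return (raw ?? '').replace(/\s+/g, ' ').trim();
}

// Text as it might come back from markup indented across several lines.
const byline = cleanText('\n      Jane\n      Doe\u00a0 ');
console.log(byline);
```

Inside an evaluateAll() callback this helper is not in scope, because the callback is serialized and runs in the page. Either inline the same regex there, or return the raw strings and clean them in Node after the call.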

For text as a user would see it, which usually excludes hidden helper markup, locator.innerText() respects CSS visibility and is closer to what is actually rendered. Use textContent() when you want the raw DOM text regardless of styling, and innerText() when you want only what is visually rendered.

import { chromium } from 'playwright';

const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto('https://example.com/product/42');

// textContent for raw text; getAttribute for structured values.
const name = (await page.locator('h1').textContent())?.trim() ?? '';
const rawPrice = (await page.locator('.price').textContent()) ?? '';
const sku = await page.locator('[data-sku]').getAttribute('data-sku');

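// Normalize the price string into a number; keep it undefined when no digits
// were found, so a missing price is never reported as 0.
const digits = rawPrice.replace(/[^0-9.]/g, '');
const price = digits === '' ? undefined : Number(digits);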

console.log({ name, price, sku });
await browser.close();

Mapping to typed objects

A record is only useful if its shape is predictable. Define a TypeScript interface for the record up front, then make the mapping function return that type. This forces you to handle every field, turns a field the mapping forgets to produce into a compile error rather than a silent undefined, and documents the dataset's schema in one place. The types check only your mapping code, not what the page actually contained, so bad input still has to be handled at runtime. Normalize at the boundary: parse numbers, convert relative dates to ISO strings, and resolve relative URLs to absolute ones with new URL(href, page.url()) so the output is portable.
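
The three normalization steps named above can be sketched as small pure functions that run in Node after extraction. The function names, the US-style price format, and the narrow set of "N units ago" phrases are all assumptions made for illustration; real sites will need their own rules.

```typescript
// Hypothetical helpers; names and accepted formats are assumptions.

// Assumes US-style formatting: "," groups thousands and "." marks decimals.
function parsePrice(raw: string): number | undefined {
  const cleaned = raw.replace(/[^0-9.]/g, '');
  if (cleaned === '') return undefined;
  const value = Number(cleaned);
  return Number.isFinite(value) ? value : undefined;
}

const UNIT_MS: Record<string, number> = {
  minute: 60000,
  hour: 3600000,
  day: 86400000,
  week: 604800000,
};

// Handles only phrases like "3 days ago"; anything else yields undefined.
// `now` is passed in so the result is reproducible.
function relativeDateToIso(text: string, now: Date): string | undefined {
  const match = /^(\d+)\s+(minute|hour|day|week)s?\s+ago$/i.exec(text.trim());
  if (!match) return undefined;
  const unitMs = UNIT_MS[(match[2] ?? '').toLowerCase()];
  if (unitMs === undefined) return undefined;
  return new Date(now.getTime() - Number(match[1]) * unitMs).toISOString();
}

// Resolves a possibly relative href against the page URL.
function toAbsoluteUrl(href: string, base: string): string {
  return new URL(href, base).href;
}
```

Under these assumptions, parsePrice('$1,299.00') gives 1299, relativeDateToIso('3 days ago', new Date('2024-05-10T12:00:00Z')) gives '2024-05-07T12:00:00.000Z', and toAbsoluteUrl('/post/1', 'https://example.com/blog') gives 'https://example.com/post/1'. Returning undefined for unparseable input keeps the failure visible instead of hiding it behind a default.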

When extraction spans repeated structures with optional fields, model the optionality explicitly (price?: number) rather than coercing missing values to zero, which would silently corrupt aggregates downstream. A record that honestly says a field is absent is more valuable than one that lies with a default.
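
A small sketch of why explicit optionality matters for aggregates. The Product shape and the helper names toProduct and averagePrice are hypothetical; the point is only that an absent key is skipped, whereas a zero would be averaged in.

```typescript
interface Product { name: string; price?: number; }

// Attaches `price` only when a real number was parsed from the raw text.
function toProduct(name: string, rawPrice: string | null): Product {
  const cleaned = (rawPrice ?? '').replace(/[^0-9.]/g, '');
  const value = cleaned === '' ? NaN : Number(cleaned);
  return Number.isFinite(value) ? { name, price: value } : { name };
}

// Averages only the prices that are present; undefined if there are none.
function averagePrice(products: Product[]): number | undefined {
  const prices = products
    .map((p) => p.price)
    .filter((v): v is number => v !== undefined);
  if (prices.length === 0) return undefined;
  return prices.reduce((sum, p) => sum + p, 0) / prices.length;
}

const products = [
  toProduct('A', '$10.00'),
  toProduct('B', null), // no price on the page
  toProduct('C', '$20.00'),
];
console.log(averagePrice(products), JSON.stringify(products[1]));
```

Here the average over the two known prices is 15. Had the missing price been coerced to 0, the same data would average to 10, and nothing in the output would reveal why. The record for B serializes as {"name":"B"}, so the absence also survives into the JSON.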

Serializing to JSON

Once you hold an array of typed records, serialization is a single step. JSON.stringify(records, null, 2) produces readable output; write it with the Node fs module. For large runs, write newline-delimited JSON (one object per line) so the file can be streamed and appended incrementally rather than held entirely in memory, and so a crash mid-run does not lose everything collected so far.
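
A minimal sketch of the newline-delimited approach, assuming hypothetical helper names (toLine, appendRecord, parseNdjson) and a caller-chosen output path. It relies only on the Node fs and path modules.

```typescript
import { appendFile, mkdir } from 'node:fs/promises';
import { dirname } from 'node:path';

// One record per line. JSON.stringify without an indent argument escapes any
// newlines inside strings, so each record stays on a single line.
function toLine(record: unknown): string {
  return JSON.stringify(record) + '\n';
}

// Appends a single record, creating the parent directory if it is missing.
// Records already written stay on disk if a later step crashes.
async function appendRecord(path: string, record: unknown): Promise<void> {
  await mkdir(dirname(path), { recursive: true });
  await appendFile(path, toLine(record), 'utf-8');
}

// Parses NDJSON text back into records, skipping blank lines.
function parseNdjson<T>(text: string): T[] {
  return text
    .split('\n')
    .filter((line) => line.trim() !== '')
    .map((line) => JSON.parse(line) as T);
}
```

In a scraping loop you would call appendRecord('out/catalog.ndjson', row) as each row is extracted (the path is only an example). Creating the directory on every call is simple but redundant; for long runs it is reasonable to do it once up front.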

import { chromium } from 'playwright';
import { mkdir, writeFile } from 'node:fs/promises';

// price is null when the page shows none, so a missing value never reads as 0.
interface Row { name: string; price: number | null; }

const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto('https://example.com/catalog');

const rows: Row[] = await page.locator('.item').evaluateAll((items) =>
  items.map((i) => {
    const digits = i.querySelector('.price')?.textContent?.replace(/[^0-9.]/g, '') ?? '';
    return {
      name: i.querySelector('.name')?.textContent?.trim() ?? '',
      price: digits === '' ? null : Number(digits),
    };
  }),
);

// Pretty-print to a file the rest of the pipeline can consume.
await mkdir('out', { recursive: true }); // writeFile does not create missing directories
await writeFile('out/catalog.json', JSON.stringify(rows, null, 2), 'utf-8');
await browser.close();

Putting it into practice

Two focused walkthroughs apply these mechanics to the cases you will meet most often. Extracting Tables and Lists to JSON with Playwright handles tabular markup — reading a header row to derive keys, then mapping each body row to an object — and writes a clean file. Scraping Data Behind Login Sessions covers reaching records that require authentication, using a saved session so you extract without logging in on every run. Both build directly on the mapping and normalization patterns above, and both fit inside the larger pipeline described in Web Scraping & Data Extraction.
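
As a taste of the table case, the header-to-keys step can be sketched as a pure function over text that has already been extracted. This is an illustrative assumption, not the walkthrough's own implementation; rowsToObjects and its snake_case key rule are hypothetical.

```typescript
// Derives keys from header cells, then maps each body row to an object.
// Short rows are padded with '' so every record has every key.
function rowsToObjects(headers: string[], rows: string[][]): Record<string, string>[] {
  const keys = headers.map((h) => h.trim().toLowerCase().replace(/\s+/g, '_'));
  return rows.map((cells) =>
    Object.fromEntries(keys.map((key, i) => [key, (cells[i] ?? '').trim()])),
  );
}

const records = rowsToObjects(
  ['Name', 'Unit Price'],
  [['Widget', ' 9.99 '], ['Gadget']],
);
console.log(JSON.stringify(records));
```

The header and cell arrays would come from an earlier extraction step, for example one evaluateAll() call over the header cells and one over the body rows.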

Frequently Asked Questions

When should I use evaluateAll() instead of looping over locators?

Use evaluateAll() for large result sets where throughput matters, because it runs the mapping function inside the browser and returns all the data in a single bridge call instead of one cross-process call per field. Loop over Node-side locators for small or irregular sets where readability and locator auto-waiting matter more than raw speed, and where you want to reuse accessibility-first selectors.

Why is my extracted text full of extra whitespace and hidden content?

textContent() returns all descendant text, including whitespace from the markup's indentation and any visually hidden helper elements. Always trim the result and collapse internal whitespace with a regex, or use innerText() instead, which respects CSS visibility and returns only what is actually rendered to the user.

How do I keep my JSON output schema stable?

Define a TypeScript interface for the record before you write the mapping function and make the function return that type. This forces you to handle every field, surfaces missing data as a compile error rather than a silent undefined, and documents the dataset schema in one place. Normalize values such as prices, dates, and relative URLs at the mapping boundary so the serialized output is consistent.