Extracting Tables and Lists to JSON with Playwright
HTML tables and lists are the most common shape of structured data on the web — pricing grids, sortable reports, leaderboards, search results. The markup is regular, which makes it ideal for extraction, but the naive approach (grab every <td> into a flat array) throws away the column meaning and leaves you with positional data nobody can use. The goal is a JSON array where each row is an object keyed by its column header: { "rank": 1, "team": "Acme", "points": 42 }. This page walks through deriving those keys from the header row, mapping each body row into a typed record with locator.evaluateAll(), and writing the result to a file. It is a focused application of the techniques in Structured Data Extraction, within the broader Web Scraping & Data Extraction workflow.
Why positional extraction fails
Reading cells by index — cells[0], cells[1] — works until the day a column is added, removed, or reordered, at which point every record is silently mislabeled and your downstream consumers ingest corrupt data without an error. Binding values to header names instead makes the extraction self-describing and resilient: a reordered column still lands under the right key, and a renamed header surfaces as a visible change in the output rather than a quiet data swap. The same principle applies to definition lists and card grids — derive the field name from the markup, never from position alone.
Step-by-step extraction
- Navigate and wait for the table to render. Call
page.goto(url)and then wait for the table body to have rows withpage.locator('table tbody tr').first().waitFor(). On a single-page app the table is populated by JavaScript after load, so waiting for a concrete row — not a fixed sleep — is what the synchronization guidance in Handling Dynamic Content calls for. - Derive the keys from the header row. Read every
<th>withallTextContents(), then trim and normalize each into a clean key (lowercase, spaces to underscores). These become the object property names for every record, so the dataset is self-describing. - Map each body row to an object. Use
locator.evaluateAll()overtbody trso the whole mapping runs in the browser in a single bridge call. Inside the callback, read each row's cells, trim the text, and zip them against the keys you derived to build one object per row. - Normalize the cell values. Parse numeric columns into numbers, strip currency symbols and thousands separators, and convert relative dates to ISO strings. Do this in the map step so the JSON holds clean typed values rather than display strings.
- Handle lists the same way. For a
<ul>/<ol>or a grid of cards, replace the row/cell logic with a per-item map: read the item's text and anygetAttribute()values you need, and produce the same shape of object. The keying-by-meaning principle is identical. - Serialize and write the file. Pass the array to
JSON.stringify(rows, null, 2)and write it with the Nodefsmodule. For very large tables, write newline-delimited JSON so the file streams and a mid-run crash does not lose collected rows.
import { chromium } from 'playwright';
import { writeFile } from 'node:fs/promises';
interface Row { [key: string]: string | number; }
const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto('https://example.com/leaderboard');
// Step 1: wait for a real row so we know the table has rendered.
await page.locator('table tbody tr').first().waitFor();
// Step 2: derive keys from the header cells.
const rawHeaders = await page.locator('table thead th').allTextContents();
const keys = rawHeaders.map((h) => h.trim().toLowerCase().replace(/\s+/g, '_'));
// Step 3 + 4: map each body row to a keyed object inside the browser.
const rows: Row[] = await page.locator('table tbody tr').evaluateAll((trs, headerKeys) =>
trs.map((tr) => {
const cells = Array.from(tr.querySelectorAll('td'));
const record: Record<string, string | number> = {};
cells.forEach((cell, i) => {
const key = headerKeys[i] ?? `col_${i}`; // fall back if header is short
const text = cell.textContent?.trim() ?? '';
// Numeric columns become numbers; everything else stays a string.
const asNumber = Number(text.replace(/[^0-9.-]/g, ''));
record[key] = text !== '' && !Number.isNaN(asNumber) && /\d/.test(text) ? asNumber : text;
});
return record;
}, keys,
);
// Step 6: serialize the keyed records to a file.
await writeFile('out/leaderboard.json', JSON.stringify(rows, null, 2), 'utf-8');
console.log(`Wrote ${rows.length} rows`);
await browser.close();
Extracting a plain list
A list needs no header step — each item is already a record. Map every <li> (or card) directly, reading its text and any attributes such as a link target. Resolve relative URLs against the page URL so the output is portable.
import { chromium } from 'playwright';
import { writeFile } from 'node:fs/promises';
interface Link { label: string; href: string; }
const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto('https://example.com/sitemap');
const base = page.url();
// Map each list item to a record, resolving relative hrefs to absolute.
const links: Link[] = await page.locator('ul.links li a').evaluateAll((anchors, baseUrl) =>
anchors.map((a) => ({
label: a.textContent?.trim() ?? '',
href: new URL(a.getAttribute('href') ?? '', baseUrl).toString(),
})), base,
);
await writeFile('out/links.json', JSON.stringify(links, null, 2), 'utf-8');
await browser.close();
Verification
Confirm the extraction three ways. First, assert the row count matches the visible table: expect(rows.length).toBe(await page.locator('table tbody tr').count()). Second, spot-check the first and last records against the page so off-by-one errors in the header/body split surface immediately. Third, validate the JSON shape — every record should have the same keys, and numeric columns should be numbers, not strings; a quick zod schema or a manual key check catches a malformed header row. When a table spans multiple pages, combine this with the loop in Pagination & Infinite Scroll before writing the final file.
Frequently Asked Questions
How do I keep the JSON keyed by column even when columns are reordered?
Derive the keys from the table's header cells at runtime and bind each value to the header name rather than to a fixed index. A reordered column then still lands under the correct key, and a renamed header shows up as a visible change in the output instead of silently mislabeling data. Reading cells positionally is the source of most table-extraction bugs.
Should I use evaluateAll() or loop over row locators for a table?
Use evaluateAll() for tables of any meaningful size, because it runs the entire row-mapping function inside the browser and returns all records in a single call rather than making a cross-process call per cell. Looping over locators is only worth it for a handful of rows where you also need locator auto-waiting on individual cells.
How do I convert price or number cells into real numbers?
Strip non-numeric characters such as currency symbols, commas, and units with a regex, then parse with Number(). Do this inside the mapping step so the serialized JSON holds typed numeric values rather than display strings, which keeps downstream aggregation and sorting correct. Leave genuinely textual columns as strings.