Scraping Data Behind Login Sessions

Much of the most valuable data lives behind a login — account dashboards, order histories, members-only reports. Logging in on every extraction run is wasteful: it is slow, it adds avoidable load to the authentication endpoint, and repeated logins are exactly the pattern that gets an account flagged. Playwright solves this with storageState, which serializes a context's cookies and local storage to a JSON file after one successful login. Every later run loads that file and starts already authenticated, so the scraper goes straight to the data. This page shows how to capture the session, reuse it across runs, and refresh it when it expires. It applies the mapping work from Structured Data Extraction to authenticated pages and fits the wider Web Scraping & Data Extraction pipeline.

Authenticate once, persist the session, and every later run starts logged in — no repeated logins.

How storageState works

A browser keeps you logged in through cookies and tokens held in local storage. context.storageState({ path }) reads both out of a live, authenticated context and writes them to a JSON file. Later, browser.newContext({ storageState: path }) seeds a fresh context with that exact state, so the first navigation already carries valid session cookies. Because state is scoped to a BrowserContext, this is built directly on the isolation model described in Browser Contexts & Isolation — one saved session per identity, with no leakage between jobs. In a test suite you would wire the same mechanism through a setup project as covered in Playwright Config & Fixtures, so the login runs once and every spec reuses the result.
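
To make the file format concrete, the sketch below models the saved JSON as a TypeScript type and lists what a state file would restore. The `cookies` and `origins` fields follow the layout Playwright writes; the `summarizeState` helper and the sample values are hypothetical, made up for this illustration.

```typescript
// Partial model of the JSON written by context.storageState().
// Only the fields this example reads are declared.
interface SavedCookie {
  name: string;
  value: string;
  domain: string;
  path: string;
  expires: number; // Unix time in seconds; -1 marks a session cookie
}

interface SavedOrigin {
  origin: string;
  localStorage: { name: string; value: string }[];
}

interface SavedState {
  cookies: SavedCookie[];
  origins: SavedOrigin[];
}

// Hypothetical helper: describe what a state file restores without printing secret values.
function summarizeState(state: SavedState): string[] {
  const lines: string[] = [];
  for (const cookie of state.cookies) {
    lines.push(`cookie ${cookie.name} @ ${cookie.domain}`);
  }
  for (const entry of state.origins) {
    for (const item of entry.localStorage) {
      lines.push(`localStorage ${item.name} @ ${entry.origin}`);
    }
  }
  return lines;
}

// Made-up sample standing in for the contents of auth/session.json.
const sampleState: SavedState = {
  cookies: [{ name: 'sid', value: 'abc123', domain: 'example.com', path: '/', expires: -1 }],
  origins: [
    { origin: 'https://example.com', localStorage: [{ name: 'token', value: 'xyz' }] },
  ],
};

console.log(summarizeState(sampleState));
```

A summary like this is a quick way to confirm the login actually stored something before relying on the file in later runs.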

Step-by-step session reuse

1. Log in once in a setup script. Launch a browser, navigate to the login page, fill the credentials with getByLabel() and submit. Pull secrets from environment variables, never hard-coded strings.
2. Wait for proof of authentication. Do not save state the instant submit returns. Wait for a post-login signal, such as a redirect to the dashboard or a visible account menu, so you only persist a genuinely authenticated session.
3. Save the session to a file. Call context.storageState({ path: 'auth/session.json' }) and keep the file out of version control by adding it to .gitignore; it contains live credentials.
4. Load the session in each extraction run. Create the context with browser.newContext({ storageState: 'auth/session.json' }). The first page.goto() to a protected URL now lands on real data instead of the login screen.
5. Detect expiry and re-authenticate. Sessions expire. After navigating, check for a login redirect or a missing authenticated element; if the session is dead, run the login script again to refresh the file, then retry.
6. Extract and serialize. With an authenticated page, apply the mapping and JSON-writing patterns from the Structured Data Extraction guide exactly as you would on a public page.

import { chromium } from 'playwright';

// Step 1-3: run this once to capture an authenticated session.
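async function saveSession(): Promise<void> {
  // Fail fast on missing credentials rather than submitting empty fields.
  const user = process.env.SCRAPE_USER;
  const pass = process.env.SCRAPE_PASS;
  if (!user || !pass) throw new Error('Set SCRAPE_USER and SCRAPE_PASS before running.');

  const browser = await chromium.launch();
  const context = await browser.newContext();
  const page = await context.newPage();

  await page.goto('https://example.com/login');
  await page.getByLabel('Email').fill(user);
  await page.getByLabel('Password').fill(pass);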
  await page.getByRole('button', { name: 'Sign in' }).click();

  // Step 2: wait for proof of login before persisting anything.
  await page.getByRole('navigation', { name: 'Account' }).waitFor();

  // Step 3: serialize cookies + local storage to disk (gitignored).
  await context.storageState({ path: 'auth/session.json' });
  await browser.close();
}

await saveSession();

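import { chromium, type BrowserContext } from 'playwright';

// saveSession() is the login function from the setup script above; keep both in
// the same module, or export it from there and import it here.

// Step 4-5: load the saved session, refreshing it if it has expired.
// The caller owns cleanup: close the browser via context.browser()?.close().
async function authedContext(): Promise<BrowserContext> {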
  const browser = await chromium.launch();
  let context = await browser.newContext({ storageState: 'auth/session.json' });
  const page = await context.newPage();

  await page.goto('https://example.com/account/orders');

  // If we were bounced to the login page, the session is dead — refresh it.
  if (page.url().includes('/login')) {
    await context.close();
    await saveSession();                                  // re-run the login flow
    context = await browser.newContext({ storageState: 'auth/session.json' });
  }
  return context;
}
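
The login-redirect check above costs a browser launch and a navigation. As a cheap pre-check, one option is to read the cookie expiry timestamps from the saved file before launching anything. The sketch below assumes the session is carried by a cookie whose name you know (`sid` is a placeholder), and `hasLiveCookie` is a hypothetical helper, not part of Playwright. A server can revoke a session before the cookie expires, so this complements the redirect check rather than replacing it.

```typescript
interface StoredCookie {
  name: string;
  expires: number; // Unix time in seconds; -1 means a session cookie with no fixed expiry
}

// Hypothetical helper: true if the named cookie exists and has not passed its expiry.
function hasLiveCookie(
  cookies: StoredCookie[],
  cookieName: string,
  nowMs: number = Date.now(),
): boolean {
  const cookie = cookies.find((c) => c.name === cookieName);
  if (!cookie) return false;
  if (cookie.expires === -1) return true; // no timestamp to compare against
  return cookie.expires * 1000 > nowMs;
}

// Made-up cookies standing in for the "cookies" array of auth/session.json.
const checkTime = Date.UTC(2024, 0, 1);
const storedCookies: StoredCookie[] = [
  { name: 'sid', expires: checkTime / 1000 + 3600 }, // expires in one hour
  { name: 'old', expires: checkTime / 1000 - 60 },   // expired a minute ago
];

console.log(hasLiveCookie(storedCookies, 'sid', checkTime));     // true
console.log(hasLiveCookie(storedCookies, 'old', checkTime));     // false
console.log(hasLiveCookie(storedCookies, 'missing', checkTime)); // false
```

In a real run the array would come from parsing the session file, for example `JSON.parse(await readFile('auth/session.json', 'utf-8')).cookies`.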

Extracting from the authenticated page

Once the context is authenticated, extraction is identical to a public page. Navigate, wait for the data to render, and map it to typed records.

import { chromium } from 'playwright';
import { mkdir, writeFile } from 'node:fs/promises';

interface Order { id: string; total: number; date: string; }

const browser = await chromium.launch();
const context = await browser.newContext({ storageState: 'auth/session.json' });
const page = await context.newPage();
await page.goto('https://example.com/account/orders');
await page.getByRole('row').first().waitFor();

// Map authenticated rows the same way as any other table.
const orders: Order[] = await page.locator('tbody tr').evaluateAll((rows) =>
  rows.map((r) => ({
    id: r.querySelector('.id')?.textContent?.trim() ?? '',
    total: Number(r.querySelector('.total')?.textContent?.replace(/[^0-9.]/g, '') ?? 0),
    date: r.querySelector('time')?.getAttribute('datetime') ?? '',
  })),
);

// Create the output directory first; writeFile does not create missing folders.
await mkdir('out', { recursive: true });
await writeFile('out/orders.json', JSON.stringify(orders, null, 2), 'utf-8');
await context.close();
await browser.close();
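
The inline `Number(...)` expression above turns a missing or unparseable total into 0, which is hard to tell apart from a genuinely free order. One possible refinement, sketched here as a hypothetical `parseTotal` helper, returns `null` instead. It assumes totals use a dot as the decimal separator and commas only for grouping.

```typescript
// Hypothetical helper: parse a displayed price, or return null when nothing usable is found.
function parseTotal(text: string | null | undefined): number | null {
  if (!text) return null;
  const cleaned = text.replace(/[^0-9.]/g, ''); // drop currency symbols, commas, spaces
  if (cleaned === '') return null;
  const value = Number(cleaned);
  return Number.isFinite(value) ? value : null; // e.g. "1.2.3" parses to NaN
}

console.log(parseTotal('$1,234.50')); // 1234.5
console.log(parseTotal('  £12 '));    // 12
console.log(parseTotal('n/a'));       // null
console.log(parseTotal(undefined));   // null
```

Because the `evaluateAll` callback runs in the browser, a Node-side helper like this is not visible inside it. One way to use it is to return the raw text from the callback and parse it afterwards in Node.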

Security and ethics

A storageState file is a live credential — anyone who has it is logged in as you. Keep it out of version control, restrict its file permissions, and store secrets in environment variables or a secret manager rather than in code. Only scrape accounts and data you are authorized to access, honor the site's terms of service, and never share or repurpose another user's session. Reusing your own session is good engineering hygiene; using session reuse to disguise unauthorized access is not, and is out of scope here.
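
Restricting file permissions can be done from the same Node process that writes the session. The sketch below is one way to do it with the standard `node:fs/promises` API; `lockDown` and `demo` are hypothetical names, and the temporary file stands in for `auth/session.json`. Permission bits behave this way on Linux and macOS; Windows handles `chmod` differently, so treat this as a POSIX-only illustration.

```typescript
import { chmod, mkdtemp, stat, writeFile } from 'node:fs/promises';
import { tmpdir } from 'node:os';
import { join } from 'node:path';

// Hypothetical helper: make a file readable and writable by its owner only,
// then report the resulting permission bits.
async function lockDown(path: string): Promise<number> {
  await chmod(path, 0o600);
  const info = await stat(path);
  return info.mode & 0o777;
}

// Demonstrate on a throwaway file in the system temp directory.
async function demo(): Promise<number> {
  const dir = await mkdtemp(join(tmpdir(), 'session-'));
  const file = join(dir, 'session.json');
  await writeFile(file, '{"cookies":[],"origins":[]}', 'utf-8');
  return lockDown(file);
}

demo().then((mode) => console.log(mode.toString(8))); // prints 600 on Linux and macOS
```

Calling a helper like this right after `context.storageState({ path })` keeps the window in which the file is world-readable as short as possible.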

Verification

Confirm reuse works three ways. First, look for login network calls in a normal run: a correctly seeded context never hits the login endpoint, which you can verify in the Network panel. Second, take the session file to a clean machine and confirm the run still lands on data, proving the state is self-contained. Third, deliberately expire the session (invalidate the session cookie in the file, or wait it out) and confirm the expiry branch re-authenticates and recovers rather than silently scraping the login page. For data that spans many pages once authenticated, layer in the loop from Pagination & Infinite Scroll.
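
The first check can also be automated instead of read off the Network panel. A minimal sketch, assuming you collect request URLs during the run (for instance with `page.on('request', (req) => urls.push(req.url()))`) and that the login endpoint lives under `/login`; `findLoginRequests` is a hypothetical helper and the URLs are made up.

```typescript
// Hypothetical helper: return every recorded request URL whose path starts with loginPath.
function findLoginRequests(urls: string[], loginPath: string): string[] {
  return urls.filter((u) => new URL(u).pathname.startsWith(loginPath));
}

// Made-up request log from a run that reused a valid session.
const seenUrls = [
  'https://example.com/account/orders',
  'https://example.com/api/orders?page=1',
  'https://example.com/static/app.js',
];

// A healthy run produces an empty list.
console.log(findLoginRequests(seenUrls, '/login'));

// A run that was bounced to the login page shows up immediately.
console.log(findLoginRequests([...seenUrls, 'https://example.com/login?next=/account'], '/login'));
```

Failing the run when the list is non-empty turns the manual inspection into a repeatable guard.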

Frequently Asked Questions

What does storageState actually save?

storageState serializes the cookies and local storage of a browser context to a JSON file. Loading that file into a new context restores the session, so the first navigation already carries valid authentication and lands on protected data instead of the login page. It does not capture sessionStorage, so a site that keeps its token there will not be restored from the file alone; always verify against a real authenticated element.

How do I handle a session that has expired?

After navigating to a protected page, check whether you were redirected to the login URL or whether an authenticated element is missing. If the session is dead, re-run the login flow to regenerate the storageState file and then retry the navigation with a fresh context. Building this check into every run makes the scraper self-healing rather than failing silently on the login screen.
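
The check-then-retry logic can be factored into a small wrapper so that every scraping task gets it for free. This is a sketch under assumptions, not a Playwright API: `withReauth` and the `Attempt` result type are hypothetical, the task is expected to report an expired session itself (for example after seeing a login redirect), and the fake functions below only simulate a login.

```typescript
// Result of one scraping attempt: either data, or a signal that the session is dead.
type Attempt<T> = { ok: true; value: T } | { ok: false; reason: 'expired' };

// Run the attempt; if the session has expired, re-authenticate once and try again.
async function withReauth<T>(
  attempt: () => Promise<Attempt<T>>,
  reauth: () => Promise<void>,
): Promise<T> {
  const first = await attempt();
  if (first.ok) return first.value;

  await reauth();
  const second = await attempt();
  if (second.ok) return second.value;

  // Do not loop forever: a second failure usually means bad credentials.
  throw new Error('Session still expired after re-authentication');
}

// Simulated session state standing in for a real browser and login flow.
let loggedIn = false;
let logins = 0;

const fakeAttempt = async (): Promise<Attempt<string>> =>
  loggedIn ? { ok: true, value: 'orders' } : { ok: false, reason: 'expired' };

const fakeReauth = async (): Promise<void> => {
  loggedIn = true;
  logins += 1;
};

withReauth(fakeAttempt, fakeReauth).then((value) => console.log(value, logins)); // orders 1
```

Capping the retry at one re-authentication keeps a broken login from turning into the repeated-login pattern the page warns about.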

Is it safe to commit the session file to my repository?

No. The storageState file contains live credentials, and anyone with it is effectively logged in as you. Add it to .gitignore, restrict its file permissions, and keep the underlying username and password in environment variables or a secret manager rather than in source code. Only ever reuse sessions for accounts and data you are authorized to access.