Scraping Infinite Scroll Pages with Playwright

An infinite-scroll feed never exposes its full dataset in the DOM at once. New items appear only when you scroll near the bottom, and a virtualized list quietly removes off-screen rows to keep the page fast. A scraper that reads the page once captures the first batch and nothing else; one that scrolls blindly with fixed sleeps either stops early or loops forever. This page shows a deterministic approach: scroll to trigger loading, wait for the actual data response with waitForResponse(), read each batch as it arrives, and stop on a stable, observable signal. It is the detailed walkthrough beneath Pagination & Infinite Scroll, itself part of Web Scraping & Data Extraction.

Figure: Infinite scroll loop with response-gated waiting. A scroll triggers an IntersectionObserver, which fires a paged API request; the scraper waits for that response, reads the batch, then loops until the response reports no more data. Steps shown: scroll to bottom, observer fires, GET ?page=n, waitForResponse, read batch, then loop while hasMore or stop.
The loop is gated on the data response, not a timer, so each scroll reads exactly the batch it triggered.

Root cause: lazy loading via IntersectionObserver and virtualization

Two front-end techniques make these pages hard to read. The first is lazy loading: the app attaches an IntersectionObserver to a sentinel element near the end of the list, and when that sentinel scrolls into view the observer fires a request for the next page of data. Nothing loads until you scroll, so a one-shot read sees only the initial batch. The second is virtualization: to render tens of thousands of rows without choking the browser, libraries keep only the visible window in the DOM and recycle nodes as you scroll. That means the final DOM state does not contain the rows you scrolled past — you must capture each batch while it is on screen, or read the data straight from the API responses. Synchronizing with content that arrives asynchronously is the broader skill taught in Handling Dynamic Content.
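
To make the virtualization point concrete, here is a minimal sketch of the windowing arithmetic a virtualized list might use. The function name renderedRange, the fixed row height, and the overscan parameter are illustrative assumptions, not any particular library's API.

```typescript
// Hypothetical windowing math for a fixed-row-height virtualized list.
// Real libraries differ in detail; this only illustrates the idea.
type RowRange = { first: number; last: number }; // inclusive row indices

function renderedRange(
  scrollTop: number,
  viewportHeight: number,
  rowHeight: number,
  totalRows: number,
  overscan = 2, // extra rows kept above and below the viewport (assumed)
): RowRange {
  const firstVisible = Math.floor(scrollTop / rowHeight);
  const lastVisible = Math.ceil((scrollTop + viewportHeight) / rowHeight) - 1;
  return {
    first: Math.max(0, firstVisible - overscan),
    last: Math.min(totalRows - 1, lastVisible + overscan),
  };
}

// A 10,000-row list with 40px rows in a 600px viewport.
const atTop = renderedRange(0, 600, 40, 10000); // { first: 0, last: 16 }
const scrolled = renderedRange(4000, 600, 40, 10000); // { first: 98, last: 116 }
```

Under these assumed numbers, none of the rows rendered at the top are still in the DOM after scrolling 4,000px, which is why a read of the final DOM misses them.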

Minimal reproducible example

The cleanest approach intercepts the paged API the feed calls. Each scroll triggers a GET /api/items?page=n; you wait for that response, accumulate its JSON, and stop when the server says there is no more. This reads the data at the source, so virtualization is irrelevant.

import { test, expect } from '@playwright/test';

type Item = { id: number; title: string };

test('scrapes an infinite feed via its paged API', async ({ page }) => {
  const collected: Item[] = [];
  await page.goto('/infinite');

  const maxPages = 500; // safety cap (arbitrary) so a misread flag cannot loop forever
  let hasMore = true;
  for (let pageCount = 0; hasMore && pageCount < maxPages; pageCount++) {
    // Register the waiter BEFORE scrolling so the triggered response is caught.
    const responsePromise = page.waitForResponse(
      (r) => r.url().includes('/api/items') && r.status() === 200,
    );
    // Scroll the sentinel into view to trip the IntersectionObserver.
    await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));

    const response = await responsePromise;
    // Assumed response shape for this example feed.
    const body = (await response.json()) as { items: Item[]; hasMore: boolean };
    collected.push(...body.items); // accumulate this batch
    hasMore = body.hasMore === true; // authoritative stop signal from the API
  }

  expect(collected.length).toBeGreaterThan(0);
});

Step-by-step fix

  1. Identify the data request. Open the page, scroll once, and watch the Network panel (or log page.on('response', r => console.log(r.url()))). Find the request the scroll triggers and note its URL shape and the field that signals more pages — commonly hasMore, a nextCursor, or an empty items array.
  2. Register the response waiter before you scroll. Create the page.waitForResponse() promise first, then perform the scroll. A waiter registered after the response has already arrived will never see it and will time out, exactly as with any interception pattern.
  3. Scroll to trip the observer. Use window.scrollTo(0, document.body.scrollHeight) to push the sentinel into view, or scroll the specific scroll container if the list is not the document body. This is what causes the next request to fire.
  4. Read the batch from the response, not the DOM. Await the response promise, parse the body with await response.json(), then append its rows to your accumulator. Reading from the response sidesteps virtualization entirely, since recycled DOM nodes never held the full set.
  5. Stop on the API's own completion flag. Loop while the response reports more data (hasMore === true, a non-null cursor, or a non-empty page). This is a positive signal, far more reliable than waiting for the DOM to stop changing.
  6. Add a safety cap and idle fallback. Bound the loop with a maximum iteration count, and if no API exists, fall back to counting rendered items and stopping after several scroll rounds with no growth. Persist batches to disk as you go so a long run stays memory-bounded and resumable.
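
The steps above can be sketched independently of the browser. In the version below, fetchNextPage is a hypothetical stand-in for "register the waiter, scroll, await the response, parse the JSON"; the PageBody shape and the default cap of 500 are assumptions chosen for illustration.

```typescript
type Item = { id: number; title: string };
type PageBody = { items: Item[]; hasMore: boolean }; // assumed response shape
type CollectResult = { items: Item[]; pagesRead: number; hitCap: boolean };

// fetchNextPage is a hypothetical callback. In a Playwright test it would
// create the waitForResponse() promise, scroll, then return response.json().
async function collectPages(
  fetchNextPage: (pageIndex: number) => Promise<PageBody>,
  maxPages = 500, // arbitrary safety cap; tune for the feed being scraped
): Promise<CollectResult> {
  const items: Item[] = [];
  let pagesRead = 0;
  let hasMore = true;

  while (hasMore && pagesRead < maxPages) {
    const body = await fetchNextPage(pagesRead);
    items.push(...body.items); // read the batch from the response
    pagesRead += 1;
    hasMore = body.hasMore === true; // stop on the API's own flag
  }

  // hasMore is still true here only if the cap ended the loop early.
  return { items, pagesRead, hitCap: hasMore };
}
```

Returning hitCap lets the caller distinguish a feed that finished from one that was cut off by the cap, which is worth logging or failing on.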

Troubleshooting variants

The loop never stops

The completion field is being misread. Log the parsed body each iteration and confirm the exact key and value that means "done" — some APIs return hasMore: false, others a null cursor, others simply an empty items array. Break on whichever the API actually uses, and keep the iteration cap as a backstop.
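
One way to avoid misreading the flag is to centralize the check in a single function. The sketch below covers the three conventions named above; the field names hasMore, nextCursor, and items are assumptions about the response and should be replaced with whatever the logged body actually contains.

```typescript
// Assumed response fields; a real API will typically use only one convention.
type FeedBody = {
  items?: unknown[];
  hasMore?: boolean;
  nextCursor?: string | null;
};

// Returns true when the body signals that no further pages exist.
function isLastPage(body: FeedBody): boolean {
  if (typeof body.hasMore === 'boolean') return !body.hasMore; // explicit flag wins
  if ('nextCursor' in body) return body.nextCursor == null; // null or undefined cursor
  return !Array.isArray(body.items) || body.items.length === 0; // empty page
}

console.log(isLastPage({ items: [1], hasMore: false })); // true
console.log(isLastPage({ items: [1], nextCursor: 'abc' })); // false
console.log(isLastPage({ items: [] })); // true
```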

The response waiter times out

The scroll did not trigger a request. The list may live in an inner scroll container rather than the document, in which case scrolling the window to document.body.scrollHeight never moves the sentinel. Scroll the container directly: call locator.scrollIntoViewIfNeeded() on the last rendered item, or set element.scrollTop = element.scrollHeight inside page.evaluate(). Then confirm the sentinel actually enters the viewport.
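
A minimal sketch of the container scroll follows. The ScrollBox type is a structural stand-in for the two DOM properties involved, and the .feed selector in the usage comment is hypothetical.

```typescript
// Structural stand-in for the two element properties this needs.
type ScrollBox = { scrollTop: number; scrollHeight: number };

// Moves a scroll container to its bottom edge and returns the resulting offset.
// In a real browser the offset is clamped to scrollHeight minus the visible height.
function scrollToEnd(el: ScrollBox): number {
  el.scrollTop = el.scrollHeight;
  return el.scrollTop;
}

// Hypothetical Playwright usage (the selector is an assumption):
//   await page.locator('.feed').evaluate((el) => {
//     el.scrollTop = el.scrollHeight;
//   });
```

Logging the returned offset before and after each round is a quick way to see whether the container moved at all.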

Rows are missing when reading from the DOM

The list is virtualized and recycled the rows you scrolled past. Switch to reading from the API responses as shown above, or capture each batch immediately after it renders and before the next scroll recycles it. Never rely on the final DOM holding every row. When you also need to assert on the outgoing requests, combine this with Intercepting and Modifying Network Requests.

Verification

Confirm completeness three ways. First, compare your collected count against a known total if the page exposes one (a "1,248 results" header), or against the sum of every API page's length. Second, re-run with npx playwright test --repeat-each=5 and confirm the count is identical every time; on a feed whose data is not changing between runs, a stable count is strong evidence that the stop condition is deterministic rather than timing-dependent. Third, capture a trace with --trace on and review the Network tab in the Playwright Trace Viewer to verify one data response was read per scroll and the final response carried the completion flag.
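
The first check can be automated with a small parser for the results header. This sketch assumes an English-formatted header such as "1,248 results" with comma thousands separators; other locales would need a different pattern. The function names are illustrative.

```typescript
// Extracts the first integer from a header like "1,248 results".
// Assumes comma thousands separators; returns null when no number is found.
function parseTotal(header: string): number | null {
  const match = header.match(/\d[\d,]*/);
  if (!match) return null;
  return Number(match[0].replace(/,/g, ''));
}

// Compares the unique scraped ids against the advertised total.
function checkComplete(ids: number[], header: string) {
  const expected = parseTotal(header);
  const unique = new Set(ids).size;
  return { expected, unique, complete: expected !== null && unique === expected };
}

console.log(parseTotal('1,248 results')); // 1248
```

Counting unique ids rather than raw rows keeps a duplicated batch from masking a missing one.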

Frequently Asked Questions

Why not just scroll with a fixed delay and read the DOM at the end?

Fixed delays are guesses: too short and you stop before the last batch loads, too long and the scrape crawls. Worse, virtualized lists recycle off-screen rows, so the final DOM never contains everything you scrolled past. Gating on the data response and reading each batch as it arrives is both faster and complete.

How do I scrape an infinite list that has no JSON API?

Fall back to a DOM strategy: scroll, wait for the rendered item count to increase, read the newly visible rows into a deduplicated set keyed by a stable id, and stop after several scroll rounds produce no new items. Deduplicate because virtualization can re-render the same row, and keep an iteration cap as a hard backstop.
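
The bookkeeping for this fallback can be sketched as a small state update that runs after every scroll round. Row, absorbRound, shouldStop, and the default thresholds are hypothetical names and values; reading the visible rows from the page (for example with locator.evaluateAll()) is left to the caller.

```typescript
type Row = { id: string; text: string };

type ScrapeState = {
  seen: Map<string, Row>; // deduplicated rows keyed by stable id
  stalledRounds: number; // consecutive rounds that added nothing new
  rounds: number; // total scroll rounds so far
};

function newState(): ScrapeState {
  return { seen: new Map<string, Row>(), stalledRounds: 0, rounds: 0 };
}

// Merge the rows visible after one scroll round into the state.
function absorbRound(state: ScrapeState, visible: Row[]): ScrapeState {
  const before = state.seen.size;
  for (const row of visible) {
    if (!state.seen.has(row.id)) state.seen.set(row.id, row);
  }
  const grew = state.seen.size > before;
  return {
    seen: state.seen,
    stalledRounds: grew ? 0 : state.stalledRounds + 1,
    rounds: state.rounds + 1,
  };
}

// Stop after maxStalls empty rounds in a row, or at the hard round cap.
function shouldStop(state: ScrapeState, maxStalls = 3, maxRounds = 1000): boolean {
  return state.stalledRounds >= maxStalls || state.rounds >= maxRounds;
}
```

Resetting stalledRounds whenever a round adds a row means one slow response does not end the scrape; only several consecutive empty rounds do.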

How do I keep memory under control on a very long feed?

Do not accumulate everything in one array. Flush each batch to a file or database as you read it, keep only a set of seen ids in memory for deduplication, and persist a cursor or page number so a crash resumes from the last completed batch instead of the top.
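
A minimal sketch of that pattern, with the storage layer injected so the logic stays independent of any particular file or database API. BatchStore, createBatchWriter, and the one-JSON-object-per-line format are assumptions made for illustration.

```typescript
type Item = { id: number; title: string };
type Checkpoint = { lastPage: number };

// Hypothetical storage hooks supplied by the caller. In Node they might
// wrap fs.appendFileSync and fs.writeFileSync.
type BatchStore = {
  appendLine: (line: string) => void;
  saveCheckpoint: (cp: Checkpoint) => void;
};

function createBatchWriter(store: BatchStore) {
  const seenIds = new Set<number>(); // only ids stay in memory
  let written = 0;

  return {
    // Write the unseen items of one page, one JSON object per line,
    // then record the page as completed.
    flush(page: number, items: Item[]): number {
      let added = 0;
      for (const item of items) {
        if (seenIds.has(item.id)) continue;
        seenIds.add(item.id);
        store.appendLine(JSON.stringify(item) + '\n');
        added += 1;
      }
      written += added;
      // Checkpoint only after the batch is written, so a crash never skips a page.
      store.saveCheckpoint({ lastPage: page });
      return added;
    },
    totalWritten: (): number => written,
  };
}
```

On restart the caller would read the saved checkpoint and request the page after lastPage. The id set starts empty after a restart, so a resumed run would need to reload ids from the output if duplicates across the restart matter.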
