Pagination & Infinite Scroll

A listing page rarely shows all of its records at once. It splits them across numbered pages, hides them behind a "load more" button, or streams them in as you scroll. Extracting the full dataset means driving that mechanism to its end and knowing, with certainty, when there is nothing left to fetch. The hard part is not reading rows — it is the loop control: advancing deterministically, waiting for the new batch to actually arrive, and stopping on a stable signal rather than a guessed timeout. This guide covers the three traversal patterns you will meet, the stop conditions that make each one reliable, and how to keep memory bounded when a list runs to thousands of items. It sits under Web Scraping & Data Extraction and feeds the detailed walkthrough in Scraping Infinite Scroll Pages with Playwright.

Each traversal pattern pairs with a distinct, observable stop condition — never a fixed sleep.

Pattern one: numbered pages

Numbered pagination is the most extraction-friendly because the page set is addressable. A listing at /products?page=1 continues at ?page=2 and so on, and the total is usually discoverable from a "last" link, a result count, or a 404 or empty body past the end. The reliable loop is to navigate to a page, wait for the row container, read it, then either follow the "next" link or increment the query parameter, and stop as soon as there is no sign of a further page.

Two stop conditions are robust. The first is the presence of a "next" control: when getByRole('link', { name: 'Next' }) is no longer present or is disabled, you are on the last page. The second is content identity: if the first row of the new page equals the first row of the previous page, the server clamped you to the last valid page and you should halt to avoid an infinite loop.

import { test, expect, type Page } from '@playwright/test';

// Walk numbered pages until the "Next" control disappears.
async function scrapeAllPages(page: Page): Promise<string[]> {
  const titles: string[] = [];
  for (let pageNum = 1; ; pageNum++) {
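    // Relative URL: assumes baseURL is set in the Playwright config.
    await page.goto(`/products?page=${pageNum}`); // addressable URL per page
    // Wait for the row container, not a timeout, before reading.
    await page.getByRole('row').first().waitFor();
    // getByRole('row') also matches header rows; filter them if only data rows are wanted.
    const rows = await page.getByRole('row').allInnerTexts();
    titles.push(...rows);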
    // Stop signal: the Next link is gone on the last page.
    const next = page.getByRole('link', { name: 'Next' });
    if (await next.count() === 0) break;
  }
  return titles;
}

test('collects every product across pages', async ({ page }) => {
  const all = await scrapeAllPages(page);
  expect(all.length).toBeGreaterThan(0);
});
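
The content-identity stop condition described above is not exercised by the loop as written. A minimal sketch of it follows, assuming each page's rows have already been read into a string array (for example with allInnerTexts()). The helper name samePage is hypothetical. It compares the whole row set rather than only the first row, a small variation on the rule stated above, so that a header row repeated on every page does not trigger a false stop.

```typescript
// Hypothetical helper: true when the page just read repeats the previous one,
// which suggests the server clamped the request to the last valid page.
function samePage(
  previous: readonly string[] | undefined,
  current: readonly string[],
): boolean {
  if (previous === undefined || previous.length !== current.length) return false;
  return current.every((row, i) => row === previous[i]);
}

// Sketch of where it could sit inside a page loop:
//   const rows = await page.getByRole('row').allInnerTexts();
//   if (samePage(previousRows, rows)) break; // clamped: stop before re-adding rows
//   previousRows = rows;
```

Checking before pushing the rows keeps the clamped page from being recorded twice.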

Prefer driving the URL directly over clicking "next" when the parameter is stable: it survives a crashed run (you can resume from the last completed page), it parallelizes across browser contexts (see Browser Contexts & Isolation), and it sidesteps client-side state that a click would mutate.

Pattern two: load-more buttons

A "load more" button appends the next batch to the existing list instead of replacing it. The loop is: locate the button, click it, wait for the row count to grow, and repeat until the button is hidden, disabled, or removed from the DOM. The mistake that produces flaky scrapes is clicking again before the previous batch lands — so the wait must key off an observable change, either the network response that delivers the batch or the increased item count.

import { test, expect, type Page } from '@playwright/test';

async function clickThroughLoadMore(page: Page): Promise<number> {
  const items = page.getByTestId('list-item');
  await items.first().waitFor();
  while (true) {
    const before = await items.count();
    const loadMore = page.getByRole('button', { name: 'Load more' });
    // Stop when the control is gone or disabled — the list is exhausted.
    if ((await loadMore.count()) === 0 || (await loadMore.isDisabled())) break;
    await loadMore.click();
    // Wait for the count to grow rather than guessing with a sleep.
    await expect.poll(() => items.count()).toBeGreaterThan(before);
  }
  return items.count();
}

test('exhausts a load-more list', async ({ page }) => {
  await page.goto('/feed');
  const total = await clickThroughLoadMore(page);
  expect(total).toBeGreaterThan(0);
});

When the button triggers an XHR or fetch request whose URL you can identify, pairing the click with waitForResponse() is even tighter than counting, because it confirms the data arrived before you re-read the DOM. That technique, registering the waiter before the action, is covered under Network Interception Basics and is the same discipline used throughout reliable automation.
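
The ordering matters more than the specific calls, so it can be sketched without any browser dependency. The helper below is a hypothetical wrapper, not part of Playwright: it takes one callback that starts waiting for the batch and one that triggers it, and guarantees the waiter is created first. The '/api/items' URL fragment in the wiring comment is an assumption about the site under test.

```typescript
// Hypothetical helper: start waiting for the batch before triggering it, so a
// fast response cannot slip past the waiter.
async function clickAndAwaitBatch<T>(
  waitForBatch: () => Promise<T>,
  click: () => Promise<void>,
): Promise<T> {
  // Array elements evaluate left to right, so the waiter is registered first.
  const [batch] = await Promise.all([waitForBatch(), click()]);
  return batch;
}

// Possible wiring with Playwright (URL fragment is an assumption):
//   const response = await clickAndAwaitBatch(
//     () => page.waitForResponse((r) => r.url().includes('/api/items') && r.ok()),
//     () => loadMore.click(),
//   );
```

Using Promise.all also means a failed click does not leave the waiter's eventual rejection unhandled.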

Pattern three: infinite scroll

Infinite scroll fires new requests when a sentinel element near the bottom enters the viewport, usually via an IntersectionObserver. There is no button and often no page parameter — you advance by scrolling and you stop when the item count stops changing across consecutive scrolls. Because the list may be virtualized (the DOM only holds the visible window while off-screen rows are recycled), you generally cannot read everything from the final DOM state; you read each batch as it appears, or you intercept the data responses directly.

The robust stop condition is a stable count: scroll, wait for either a new data response or a count increase, and break after a fixed number of consecutive scrolls that yield no growth. The full numbered walkthrough (scroll loop, waitForResponse(), and a debounced stop) lives in Scraping Infinite Scroll Pages with Playwright.

import { test, expect, type Page } from '@playwright/test';

async function scrollToEnd(page: Page, maxIdleRounds = 3): Promise<number> {
  const items = page.getByTestId('card');
  await items.first().waitFor();
  let idle = 0;
  while (idle < maxIdleRounds) {
    const before = await items.count();
    // Scroll the document to the bottom to trip the IntersectionObserver.
    await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
    // Give the observer a chance; count growth is the real signal.
    await page.waitForTimeout(500);
    const after = await items.count();
    idle = after > before ? 0 : idle + 1; // reset idle streak on growth
  }
  return items.count();
}

test('reaches the end of an infinite feed', async ({ page }) => {
  await page.goto('/infinite');
  const total = await scrollToEnd(page);
  expect(total).toBeGreaterThan(0);
});

The waitForTimeout here is a deliberate, bounded settle window between scrolls, not a load guess — and it is replaced by waitForResponse() in the detailed guide when the feed exposes a paged API. Synchronizing reads with asynchronously arriving rows is the same problem covered in Handling Dynamic Content, which underpins every pattern on this page.
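
For a virtualized list, the DOM count in the loop above can plateau while new rows are still arriving, because off-screen nodes are recycled. One way to handle that, sketched here under the assumption that every row exposes some stable key, is to merge each rendered window into a keyed map and treat "no new keys" as the idle signal. The names mergeVisible, readVisibleCards, and the id field are hypothetical.

```typescript
// Hypothetical helper: merge the rows currently rendered into a keyed map and
// report how many were new. Re-reading rows already seen adds nothing.
function mergeVisible<T>(
  seen: Map<string, T>,
  visible: readonly T[],
  keyOf: (row: T) => string,
): number {
  let added = 0;
  for (const row of visible) {
    const key = keyOf(row);
    if (!seen.has(key)) {
      seen.set(key, row);
      added++;
    }
  }
  return added;
}

// Inside a scroll loop, the idle streak would key off new rows, not DOM size:
//   const added = mergeVisible(seen, await readVisibleCards(page), (c) => c.id);
//   idle = added > 0 ? 0 : idle + 1;
// readVisibleCards and the id field are assumptions about the page under test.
```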

Knowing when extraction is complete

Every pattern needs an explicit completion signal, not just a timeout that happened to expire. Numbered pages are done when the "next" control vanishes or content repeats. A load-more list is done when the button leaves the DOM or becomes disabled. An infinite feed is done when the item count holds steady across several consecutive scroll rounds, or when the backing API returns an empty page or hasMore: false. Where the listing is backed by a JSON endpoint, that API field is the most authoritative stop condition of all, so intercept it rather than inferring completion from the DOM.
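
As a rough illustration of the API-driven stop, the sketch below tracks intercepted response bodies and flips to "done" on an empty page or hasMore: false. The FeedPage shape, the createEndDetector name, and the '/api/feed' URL fragment are all assumptions. A real feed may name or nest these fields differently.

```typescript
// Assumed response shape; real feeds may name or nest these fields differently.
interface FeedPage<T> {
  items: T[];
  hasMore?: boolean;
}

// Hypothetical tracker: feed it each intercepted JSON body and ask whether the
// API has signalled the end (an empty page or hasMore: false).
function createEndDetector<T>(onBatch: (items: T[]) => void) {
  let done = false;
  return {
    accept(body: FeedPage<T>): void {
      if (body.items.length > 0) onBatch(body.items);
      if (body.items.length === 0 || body.hasMore === false) done = true;
    },
    isDone: (): boolean => done,
  };
}

// Possible wiring with Playwright (URL fragment is an assumption):
//   const detector = createEndDetector<Card>((items) => sink.push(...items));
//   page.on('response', async (res) => {
//     if (res.url().includes('/api/feed')) detector.accept(await res.json());
//   });
//   // ...then keep scrolling while !detector.isDone()
```

Passing batches to a callback rather than storing them keeps the detector itself from growing with the list.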

Keeping memory and state bounded

Long lists punish naive loops. Three habits keep a run healthy. Flush each batch to disk or a database as you read it instead of accumulating everything in one array, so a 50,000-row scrape does not grow unbounded in memory. For virtualized lists, read rows per batch as they render rather than expecting the final DOM to hold them all. And make the loop resumable: persist the last completed page number or scroll cursor so a crash restarts from there instead of the top. For large scrapes you should also pace requests politely — that is the subject of Anti-Bot Defenses & Rate Limiting.
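
The flush-and-checkpoint habits can be sketched as a loop that is independent of how a page is read or where rows are stored. Everything named here (Checkpoint, scrapeResumable, the readPage and flush callbacks) is hypothetical. In a real run readPage might wrap a Playwright page read and flush might append to a JSONL file or a database table.

```typescript
// Hypothetical storage for the last completed page; a real run might back this
// with a small JSON file or a database row.
interface Checkpoint {
  load(): Promise<number>; // 0 when nothing has been completed yet
  save(pageNum: number): Promise<void>;
}

async function scrapeResumable<T>(
  readPage: (pageNum: number) => Promise<T[]>, // resolves to [] past the last page
  flush: (rows: T[]) => Promise<void>, // e.g. append the batch to a file
  checkpoint: Checkpoint,
): Promise<number> {
  let flushed = 0;
  for (let pageNum = (await checkpoint.load()) + 1; ; pageNum++) {
    const rows = await readPage(pageNum);
    if (rows.length === 0) break; // stop condition: empty page
    await flush(rows); // write the batch out instead of holding it in memory
    await checkpoint.save(pageNum); // record progress only after the flush succeeded
    flushed += rows.length;
  }
  return flushed;
}
```

Because the checkpoint is saved after the flush, a crash between the two steps replays one batch on resume, so the sink should tolerate duplicates (for example by keying rows on a stable id).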

Frequently Asked Questions

How do I know when an infinite scroll list has reached the end?

Track the rendered item count after each scroll and stop when it stays the same across several consecutive rounds. If the feed is backed by a JSON API, intercept the response and stop on an empty page or a hasMore: false flag, which is more reliable than inferring completion from the DOM.

Should I click the next button or change the page URL directly?

Prefer changing the URL when the page parameter is stable, because it is resumable after a crash, parallelizable across contexts, and free of client-side state side effects. Click the control only when the URL does not encode the page, such as cursor-based or token-based pagination.

Why does my load-more loop sometimes skip rows?

It is clicking again before the previous batch finished loading. Wait for an observable change — either the network response that delivers the batch via waitForResponse() or a confirmed increase in the item count with expect.poll() — before clicking the button again.