The CSS Class Lottery: Scraping Websites with Randomized Class Names


Modern websites increasingly ship obfuscated front ends, hashed selectors, scoped styling systems, and frequent layout churn. If your scraper depends on concrete class names, it will eventually break. The page may still render correctly for users, the underlying data may still be present, and the workflow may still function in the browser, yet the scraper starts failing because the extraction logic was built on styling details rather than durable page meaning.

That is why scraping websites with randomized class names requires a different mindset. When classes are generated, hashed, rotated, or bundled dynamically, the safest extraction logic is usually not based on classes at all. Instead, it should rely on stable semantics, meaningful attributes, visible text, DOM relationships, embedded data, and, when possible, the actual network endpoints powering the page.

This guide explains how to handle dynamic and randomized CSS classes using more resilient strategies such as XPath, attribute selectors, text-based matching, DOM anchoring, script-layer extraction, and network-first approaches. It also covers the production failure modes that matter most, including hidden duplicates, silent wrong matches, and selector drift that does not throw obvious errors.

Why randomized CSS classes are such a persistent scraper problem

Randomized class names are not just a minor front-end nuisance. They are a structural mismatch between how front ends are built and how brittle scrapers are often written.

Modern class generation comes from systems such as:

  • CSS Modules
  • CSS-in-JS runtimes
  • component-scoped style compilation
  • minification and tree-shaking pipelines
  • utility class composition
  • experiment frameworks and A/B tests
  • anti-bot systems that inject unstable markup or decoys

From the front end’s point of view, classes are implementation details. They exist for styling, scoping, and payload management.

From the scraper’s point of view, that means a selector like:

.product-title-2A9Qk

or:

._3rA9c

may work perfectly today and disappear in the next build, even though the title remains in the same visual place.

This is why randomized classes are so disruptive. They break the scraper without breaking the page.

The real failure is not class churn. It is extraction design.

A lot of teams react to randomized classes by looking for the next class name to target. That is usually the wrong response.

The real problem is that the scraper was anchored to the least stable surface of the page.

A stronger question is:

What about this field is likely to stay true even after the next deployment?

Useful sources of stability often include:

  • IDs, names, and stable data attributes
  • visible text and labels
  • ARIA roles and accessible names
  • structural relationships in the DOM
  • nearby anchors such as headings or section titles
  • stable URL patterns
  • embedded JSON-LD or hydration state
  • API responses behind the page

The strongest selector strategy is usually built around meaning, not presentation.

Why brittle selectors survive just long enough to become expensive

Class-based scraping often feels productive in the short term.

You inspect the page, copy a selector, verify it works, and move on. The scraper passes testing and may even run successfully for days or weeks.

The problem is that brittle selectors often fail unpredictably. They may:

  • stop matching anything after a deploy
  • start matching a duplicate hidden node
  • match the wrong element after a layout variation
  • continue extracting a value, but from the wrong component

That last case is especially dangerous. A hard failure, such as a selector that matches nothing and raises an error, is easy to notice. A selector that still returns a value, but the wrong one, is often much worse.

This is why randomized classes are not only a maintenance problem. They are a silent data quality problem.

What expert scrapers optimize for instead of class names

An experienced scraper engineer is usually not asking, “Which class contains the price?”

The better question is:

What is the most durable representation of price across templates, locales, and front-end releases?

That may be:

  • a JSON-LD offers.price field
  • a product API response
  • a value next to a stable label
  • a data-testid or data-qa attribute
  • a value inside a known card anchored by a stable product link

The best extractor is often the one closest to the source of truth, not the one closest to what devtools highlighted first.

A practical selector confidence hierarchy

When dealing with randomized classes, it helps to think in terms of selector confidence.

A useful hierarchy looks like this:

Confidence Level | Extraction Surface | Typical Stability
Highest | API response or structured backend payload | Usually strongest
High | JSON-LD or embedded state blobs | Strong when present
Medium-high | Stable data attributes, IDs, ARIA labels | Strong if semantically meaningful
Medium | Scoped text-plus-structure selectors | Good with careful design
Lower | Broad XPath or partial attribute fallbacks | Useful, but higher risk
Lowest | Full hashed class selectors | Usually fragile

This framework helps teams stop treating all selectors as equal.

Strategies for scraping websites with randomized class names

The strongest scrapers use multiple strategies in layers. Start at the most durable surface and only move deeper if needed.

1. Prefer non-class selectors

One of the best alternatives is to target attributes that are more likely to exist for application logic rather than styling.

Useful examples include:

  • id
  • name
  • data-* attributes
  • data-testid
  • data-qa
  • data-sku
  • aria-* attributes
  • role
  • href
  • src
  • stable form field types

Examples:

button[data-testid="add-to-cart"]
[data-qa="price"]
a[href*="/product/"]

These selectors are often more durable because they support testing, accessibility, routing, or business logic, which tends to change less often than styling.
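As a minimal sketch, the three selectors above can be reproduced with Python's standard library. The markup is hypothetical and deliberately well-formed so that xml.etree.ElementTree can parse it; for real-world HTML, a lenient parser such as lxml (used later in this guide) is the more practical choice.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the hashed class names stand in for build output.
SNIPPET = """
<main>
  <a class="_3rA9c" href="/product/123">Trail Shoe</a>
  <span class="x9f2" data-qa="price">$89.00</span>
  <button class="btn_7Hq" data-testid="add-to-cart">Add to cart</button>
  <a class="_3rA9c" href="/help">Help</a>
</main>
"""

root = ET.fromstring(SNIPPET)

# Equivalent of button[data-testid="add-to-cart"]
add_button = root.find(".//button[@data-testid='add-to-cart']")

# Equivalent of [data-qa="price"]
price_node = root.find(".//*[@data-qa='price']")

# Equivalent of a[href*="/product/"]. ElementTree's XPath subset has no
# contains(), so the substring test is done in Python instead.
product_links = [a for a in root.iter("a") if "/product/" in a.get("href", "")]

print(add_button.text, price_node.text, [a.get("href") for a in product_links])
```

None of these lookups mention a class name, so a rebuild that changes _3rA9c or btn_7Hq would leave them untouched.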

2. Use text anchors, but use them carefully

Visible text can be one of the strongest anchors when it reflects a stable user-facing label.

Useful targets include:

  • button text
  • headings
  • field labels
  • breadcrumb text
  • section titles
  • stable visible product or article names

Examples:

  • a button whose text is “Add to cart”
  • an input associated with the label “Email”
  • a value under the heading “Specifications”

However, text-based matching has real risks:

  • localization changes labels across markets
  • A/B tests may alter wording
  • repeated text can create ambiguous matches
  • hidden duplicates can still contain the same text

That is why text should rarely be used alone. It is much stronger when combined with container scoping, visible-state checks, or role constraints.
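One way to combine text matching with scoping is to search only inside a chosen container and to refuse ambiguous results. The sketch below uses the standard library on hypothetical, well-formed markup; the helper name find_unique_by_text is an illustration, not an API from any scraping library.

```python
import xml.etree.ElementTree as ET

def norm(text):
    """Collapse runs of whitespace, similar to XPath normalize-space()."""
    return " ".join((text or "").split())

def find_unique_by_text(scope, tag, label):
    """Return the single <tag> inside scope whose own text equals label.

    Raises LookupError on zero or multiple matches, so an ambiguous
    text match never silently picks the wrong element.
    """
    matches = [el for el in scope.iter(tag) if norm(el.text) == label]
    if len(matches) != 1:
        raise LookupError(
            f"expected 1 {tag!r} with text {label!r}, found {len(matches)}"
        )
    return matches[0]

# Hypothetical page: the same button text appears in the nav and in main.
PAGE = ET.fromstring("""
<body>
  <nav><button>Add to cart</button></nav>
  <main><button>  Add to
     cart </button></main>
</body>
""")

main_region = PAGE.find("main")
button = find_unique_by_text(main_region, "button", "Add to cart")
print(norm(button.text))
```

Running the same lookup against the whole page would raise LookupError, because the text appears twice. That is the intended behavior: a loud failure is easier to fix than a quiet wrong match.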

3. Use XPath for relational logic

XPath is especially useful when you need to express relationships that plain CSS handles poorly.

Good XPath use cases include:

  • find a label, then select the related value
  • find a heading, then extract the next section
  • anchor to a known title, then move inside the same card
  • match partial attributes inside a structural path
  • traverse sibling or ancestor relationships deliberately

Examples:

//div[normalize-space()='Price']/following-sibling::div[1]
//h2[text()='Specifications']/following::table[1]

The reason XPath works well here is that randomized classes often change, but relationships like “the value next to this label” or “the button inside this card” often survive much longer.
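To make the "value next to this label" idea concrete without extra dependencies, here is a small sketch that emulates the following-sibling step in plain Python. ElementTree's XPath subset does not include the following-sibling axis, so the sibling walk is done by hand; with lxml, the XPath expressions above can be used directly. The markup and the helper name value_after_label are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

def full_text(el):
    """Whitespace-normalized text of an element including its descendants."""
    return " ".join("".join(el.itertext()).split())

def value_after_label(scope, label):
    """Return the text of the element right after the one whose text equals label.

    Roughly mirrors //*[normalize-space()='<label>']/following-sibling::*[1].
    Returns None when the label is not found or has no following sibling.
    """
    for parent in scope.iter():
        children = list(parent)
        for index, child in enumerate(children[:-1]):
            if full_text(child) == label:
                return full_text(children[index + 1])
    return None

# Hypothetical product block: every class is hashed, but the label/value
# pairing is stable.
CARD = ET.fromstring("""
<section>
  <div class="a1x"><div class="k_92">Price</div><div class="zz3"><span>$</span>89.00</div></div>
  <div class="a1x"><div class="k_92">Availability</div><div class="zz3">In stock</div></div>
</section>
""")

print(value_after_label(CARD, "Price"))
print(value_after_label(CARD, "Availability"))
```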

4. Scope to the right container before selecting anything inside it

One of the most effective ways to survive front-end churn is to stop thinking about elements globally.

A global selector may find matches across:

  • hidden templates
  • duplicate desktop and mobile trees
  • inactive tabs
  • repeated cards
  • modals
  • collapsed sections
  • cloned responsive components

A safer pattern is:

  1. identify the correct container
  2. confirm it is the active or visible one
  3. extract only within that scope

For example:

  1. find the correct product card by product link or title
  2. stay inside that card
  3. extract price, rating, and button from within the same block

This reduces the risk of matching the wrong duplicate node.
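The container-first pattern can be sketched as two small functions: one that picks the card, and one that only ever looks inside it. The listing markup, the data-qa attribute, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical listing: both cards share the same hashed class.
LISTING = ET.fromstring("""
<ul>
  <li class="c_1a"><a href="/product/123">Trail Shoe</a><span data-qa="price">$89.00</span></li>
  <li class="c_1a"><a href="/product/456">Road Shoe</a><span data-qa="price">$120.00</span></li>
</ul>
""")

def find_card(listing, product_path):
    """Return the one <li> that contains a link to product_path."""
    cards = [
        li for li in listing.iter("li")
        if li.find(f".//a[@href='{product_path}']") is not None
    ]
    if len(cards) != 1:
        raise LookupError(f"expected 1 card for {product_path}, found {len(cards)}")
    return cards[0]

def extract_card(card):
    """Extract fields using lookups that are relative to the card only."""
    return {
        "title": card.findtext(".//a"),
        "price": card.findtext(".//*[@data-qa='price']"),
    }

record = extract_card(find_card(LISTING, "/product/456"))
print(record)
```

Because every field lookup starts from the card, the price of the first product cannot leak into the record for the second one, even though both price nodes match the same attribute selector.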

5. Treat hidden duplicates and templates as first-class risks

This is one of the most common expert-level failure modes that simpler articles skip.

Even when you stop using class names, the scraper can still fail if the DOM contains:

  • hidden templates
  • dormant component trees
  • mobile and desktop duplicates
  • inactive tab content
  • placeholders for hydration
  • experiment variants hidden behind CSS

A selector may still return a result, but from the wrong branch.

That means production scrapers should verify things like:

  • is the node visible or active
  • is the container currently rendered for the user path you care about
  • is the element inside the live interaction region rather than a dormant template
  • does the extracted value match nearby visible context

This is a major reason why “wrong-but-present” data is often more dangerous than missing data.
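For static HTML, a rough first filter is to discard matches that sit inside a template element or under an ancestor that is marked hidden inline. The sketch below checks only markup-level signals (the template tag, hidden, aria-hidden, inline display:none). It cannot see visibility applied through stylesheets, which requires a real browser, so treat it as an assumption-laden heuristic rather than a complete visibility test.

```python
import xml.etree.ElementTree as ET

def is_probably_hidden(el, parents):
    """Walk up from el and report whether it or an ancestor looks hidden."""
    node = el
    while node is not None:
        style = node.get("style", "").replace(" ", "").lower()
        if (
            node.tag == "template"
            or node.get("hidden") is not None
            or node.get("aria-hidden") == "true"
            or "display:none" in style
        ):
            return True
        node = parents.get(node)
    return False

def visible_matches(root, path):
    """Return elements matching path that do not look hidden or templated."""
    # ElementTree has no parent pointers, so build a child -> parent map once.
    parents = {child: parent for parent in root.iter() for child in parent}
    return [el for el in root.findall(path) if not is_probably_hidden(el, parents)]

# Hypothetical page with four nodes that all match the same selector.
DOC = ET.fromstring("""
<body>
  <template><span data-qa="price">$0.00</span></template>
  <div style="display: none"><span data-qa="price">$79.00</span></div>
  <div aria-hidden="true"><span data-qa="price">$99.00</span></div>
  <main><span data-qa="price">$89.00</span></main>
</body>
""")

prices = [el.text for el in visible_matches(DOC, ".//span[@data-qa='price']")]
print(prices)
```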

6. Use partial attribute matching carefully

Sometimes the full class or attribute is unstable, but part of it is meaningful.

Examples include:

  • stable prefixes in classes
  • route fragments in URLs
  • partially stable data attribute values

Examples:

[class*="product-card"]
//a[contains(@href, '/product/')]

This can work well, but only if the stable fragment is narrow enough to avoid noisy matches.
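A quick way to see how noisy a fragment is: compare a plain substring match with a stricter match on class tokens. In this sketch the markup and the prefix product-card_ are hypothetical, and has_class_prefix is an illustrative helper.

```python
import xml.etree.ElementTree as ET

def has_class_prefix(el, prefix):
    """True if any whitespace-separated class token starts with prefix."""
    return any(token.startswith(prefix) for token in el.get("class", "").split())

# Hypothetical markup: two real cards and one unrelated sidebar list.
DOC = ET.fromstring("""
<main>
  <div class="product-card_x7Gh2 u-flex">A</div>
  <div class="product-card_Qp31z u-flex">B</div>
  <aside class="related-product-card-list_9dK">C</aside>
</main>
""")

# Equivalent of [class*="product-card"]: also catches the sidebar.
substring_hits = [el.text for el in DOC.iter() if "product-card" in el.get("class", "")]

# Token-prefix match: only elements with a class token starting "product-card_".
prefix_hits = [el.text for el in DOC.iter() if has_class_prefix(el, "product-card_")]

print(substring_hits, prefix_hits)
```

Counting matches for each candidate fragment during development is a cheap way to decide whether it is narrow enough to trust.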

7. Parse embedded data before scraping the visual DOM deeply

Many pages include machine-readable data that is much more stable than the visible DOM.

Useful sources include:

  • JSON-LD schema blocks
  • window.__INITIAL_STATE__
  • window.__NEXT_DATA__
  • window.__NUXT__
  • window.dataLayer
  • inline state assignments
  • meta tags such as Open Graph and product metadata

These sources are often less likely to be obfuscated and may expose:

  • product ID
  • SKU
  • price
  • availability
  • canonical URLs
  • breadcrumb data
  • result counts
  • internal entity identifiers

This is one of the strongest escape hatches when the CSS layer becomes a lottery.
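As an illustration, JSON-LD blocks can be collected with nothing more than the standard library's HTML parser. This is a minimal sketch: the sample page is made up, the function name product_from_jsonld is an assumption, and it handles only a top-level Product object or list (not, for example, objects nested under @graph).

```python
import json
from html.parser import HTMLParser

class JsonLdCollector(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.blocks.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed blocks instead of failing the whole page

def product_from_jsonld(html_text):
    """Return name, sku, price, and availability from the first Product block."""
    collector = JsonLdCollector()
    collector.feed(html_text)
    for block in collector.blocks:
        items = block if isinstance(block, list) else [block]
        for item in items:
            if isinstance(item, dict) and item.get("@type") == "Product":
                offers = item.get("offers") or {}
                if isinstance(offers, list):
                    offers = offers[0] if offers else {}
                return {
                    "name": item.get("name"),
                    "sku": item.get("sku"),
                    "price": offers.get("price"),
                    "availability": offers.get("availability"),
                }
    return None

# Hypothetical page: the visible heading uses a hashed class, the JSON-LD does not.
SAMPLE = """<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Product", "name": "Trail Shoe",
 "sku": "TS-123",
 "offers": {"@type": "Offer", "price": "89.00", "priceCurrency": "USD",
            "availability": "https://schema.org/InStock"}}
</script></head><body><h1 class="_3rA9c">Trail Shoe</h1></body></html>"""

print(product_from_jsonld(SAMPLE))
```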

8. Discover and use the underlying APIs when possible

Open DevTools and inspect the Network tab.

Look for:

  • XHR requests
  • fetch calls
  • GraphQL queries
  • JSON endpoints powering cards, listings, or search results

If allowed by the site’s rules and your workflow, calling the actual data endpoint can be much more stable than scraping the rendered interface.

However, API-first extraction also needs caution. Those endpoints may:

  • require auth headers or cookies
  • depend on anti-CSRF tokens
  • paginate differently than the UI
  • omit fields shown only in the rendered experience
  • change independently from the visible page

That means API extraction is often the best first choice, but it still needs validation against what the page actually shows.
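One lightweight form of that validation is to compare the API value with the value shown on the page for a sample of records. The sketch below assumes prices use a dot as the decimal separator and commas as thousands separators; other locales would need different parsing. The function names are illustrative.

```python
import re
from decimal import Decimal, InvalidOperation

def parse_price(raw):
    """Parse values like '$1,089.00' or 1089.0 into a Decimal, or return None."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", str(raw))
    if not match:
        return None
    try:
        return Decimal(match.group().replace(",", ""))
    except InvalidOperation:
        return None

def prices_agree(api_value, page_value, tolerance=Decimal("0.01")):
    """True when both values parse and differ by no more than tolerance."""
    api_price = parse_price(api_value)
    page_price = parse_price(page_value)
    if api_price is None or page_price is None:
        return False
    return abs(api_price - page_price) <= tolerance

print(prices_agree(1089.0, "$1,089.00"))
print(prices_agree("89.00", "$98.00"))
```

A mismatch does not say which side is wrong, only that the record deserves a closer look before it is accepted.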

9. Render only when you need the browser

If data is not available through structured scripts or network endpoints, browser automation becomes useful.

Browser automation frameworks such as Playwright and Puppeteer provide more durable locator options than raw class selectors. The exact set depends on the framework, but it commonly includes:

  • text locators
  • role-based locators
  • label-based selectors
  • accessible name matching
  • scoped locators inside visible containers
  • XPath for relational selection

Browser automation should not be the default for every target. It should be used when:

  • the data appears only after complex client-side rendering
  • the workflow requires clicks, expands, or scrolls
  • the page depends on dynamic user-like interaction

Extraction locators and interaction locators are not always the same

This distinction matters.

A locator that is acceptable for reading a value is not always safe for clicking.

For example:

  • a text match may be enough to extract a nearby value
  • but clicking a button may require a visible, enabled, top-layer, interactable control

That means interaction logic should usually add more checks than extraction logic, such as:

  • visible state
  • enabled state
  • overlay checks
  • active container validation
  • expected page-state transitions after the action

This is especially important on modern sites with duplicates, hidden templates, and layered UI states.

Robust selector patterns you can reuse

Here is a practical comparison of selector strategies and when they tend to work best:

Strategy | When it shines | Example
Data attribute | Apps with testing or analytics tags | [data-testid='price']
Text anchor plus relative path | Labels are stable but layout varies | //div[normalize-space()='Price']/following-sibling::*[1]
ARIA role or accessible name | Accessible sites or browser automation | role=button[name='Add to cart']
JSON-LD | Product and article pages with schema markup | Extract offers.price from JSON-LD
Network API | SPAs or headless CMS pages | Call the product or GraphQL endpoint directly
Scoped container extraction | Repeated cards or duplicated layouts | Find the card, then extract fields inside it

The strongest systems often chain multiple strategies as fallbacks.

For example:

  • try API first
  • then JSON-LD or script state
  • then stable attributes
  • then text-anchored XPath inside a container
  • then a browser-rendered fallback if needed
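A fallback chain like this can be expressed as an ordered list of extractors. The sketch below uses stub functions in place of real API, JSON-LD, and DOM extractors; all names are hypothetical. It returns the layer that produced the value so that fallback usage can be logged.

```python
def first_valid(extractors, is_valid):
    """Try (layer_name, callable) pairs in order.

    Returns (value, layer_name) for the first value that passes is_valid,
    or (None, None) if every layer fails.
    """
    for layer, extract in extractors:
        try:
            value = extract()
        except Exception:
            continue  # a failing layer falls through instead of aborting the record
        if value is not None and is_valid(value):
            return value, layer
    return None, None

# Stub extractors standing in for real implementations.
def from_api():
    raise ConnectionError("endpoint unavailable")

def from_jsonld():
    return None  # no JSON-LD block on this page

def from_attributes():
    return "89.00"

def from_text_xpath():
    return "88.00"

def looks_like_price(value):
    return value.replace(".", "", 1).isdigit()

price, layer = first_valid(
    [
        ("api", from_api),
        ("json-ld", from_jsonld),
        ("attributes", from_attributes),
        ("text-xpath", from_text_xpath),
    ],
    is_valid=looks_like_price,
)
print(price, layer)
```

Recording the layer alongside the value matters: a field that suddenly starts resolving from a lower layer is often the earliest sign that a higher layer has broken.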

Code examples

Playwright: scoped, semantic selection instead of hashed classes

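from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.firefox.launch(headless=True)
    context = browser.new_context()
    page = context.new_page()

    # 'networkidle' waits for network activity to go quiet; it can be slow
    # on pages that poll in the background.
    page.goto('https://example.com/product/123', wait_until='networkidle')

    # Scope everything to the main region instead of searching the whole page.
    product_region = page.locator("main").first
    title = product_region.get_by_role('heading').first.inner_text()

    # Anchor on the visible label, then step to the element right after it.
    # Locators are strict: inner_text() raises if more than one element matches.
    price_label = product_region.get_by_text('Price', exact=True)
    price_el = price_label.locator('xpath=following-sibling::*[1]')
    price = price_el.inner_text()

    print({'title': title, 'price': price})

    browser.close()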

This is still simplified, but it is safer than relying on a global class match.

Requests plus lxml: text-anchored XPath without hashed classes

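import requests
from lxml import html

url = 'https://example.com/product/123'
resp = requests.get(url, timeout=20, headers={'User-Agent': 'Mozilla/5.0'})
resp.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page

doc = html.fromstring(resp.text)

# Find the div whose text is exactly "Price", then take the element right after it.
# Wrapping the path in normalize-space() returns a trimmed string ('' if nothing matches).
price = doc.xpath("normalize-space(//div[normalize-space()='Price']/following-sibling::*[1])")
title = doc.xpath("normalize-space(//h1)")

print({'title': title, 'price': price})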

Network-first: using the API behind the page

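import requests

# Reuse one session so headers and cookies stay consistent across calls.
session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0'})

# Placeholder endpoint: use the URL you observed in the Network tab.
api_url = 'https://example.com/api/products/123'
r = session.get(api_url, timeout=20)
r.raise_for_status()

# The field names 'name' and 'price' are illustrative; real payloads vary.
payload = r.json()
print({'title': payload['name'], 'price': payload['price']})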

Build around source-of-truth layers, not one selector per field

A resilient scraper usually does not say:

  • title comes from this one selector forever

It says:

  • try API field first
  • if unavailable, parse JSON-LD
  • if unavailable, use a scoped DOM extractor
  • validate the value before accepting it

That layered model is much more robust than treating the DOM as the only source.

Add validation so selectors do not fail silently

A dangerous property of randomized-class environments is that selectors can keep returning values after they become wrong.

This means validation is essential.

Useful checks include:

  • price must be numeric and within expected range
  • extracted title should roughly match page title or JSON-LD name
  • product link should match the canonical route pattern
  • duplicate matches inside one container should be treated as suspicious
  • hidden or empty values should not be accepted just because the selector returned a node

This is one of the most important upgrades from a merely working scraper to a production scraper.
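Several of these checks fit into one small function that returns a list of problems instead of a yes/no answer. This is a sketch under stated assumptions: the price range, the 0.6 similarity threshold, and the /product/ route prefix are placeholder values to tune per site, and the record layout is hypothetical.

```python
from difflib import SequenceMatcher

def titles_roughly_match(title, reference, threshold=0.6):
    """True if one title contains the other or they are similar enough."""
    a, b = title.strip().lower(), reference.strip().lower()
    if not a or not b:
        return False
    if a in b or b in a:
        return True
    return SequenceMatcher(None, a, b).ratio() >= threshold

def validate_record(record, reference_title, price_range=(0.5, 10000.0)):
    """Return a list of problems; an empty list means the record passed."""
    problems = []

    try:
        price = float(record.get("price"))
        if not price_range[0] <= price <= price_range[1]:
            problems.append("price out of range")
    except (TypeError, ValueError):
        problems.append("price not numeric")

    title = (record.get("title") or "").strip()
    if not title:
        problems.append("title empty")
    elif not titles_roughly_match(title, reference_title):
        problems.append("title does not match reference")

    if not str(record.get("url", "")).startswith("/product/"):
        problems.append("url off canonical pattern")

    return problems

good = {"title": "Trail Shoe", "price": "89.00", "url": "/product/123"}
bad = {"title": "Trail Shoe", "price": "0", "url": "/help"}

print(validate_record(good, "Trail Shoe | Example Store"))
print(validate_record(bad, "Sign in to continue"))
```

Returning the individual reasons, rather than a single boolean, makes it easier to tell a broken selector apart from a page that served a login wall.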

Choose the right proxy and session strategy

Even strong selectors can fail if the target starts treating the session as suspicious.

Match your proxy and session strategy to the target:

  • datacenter proxies for lower-friction targets and speed-sensitive collection
  • residential proxies for harder targets with stronger IP reputation checks
  • rotating sessions to distribute load
  • sticky sessions for carts, logins, or stateful journeys

Good extraction logic still depends on good operational behavior:

  • control request rate
  • tune concurrency by domain
  • avoid mechanical timing in browser flows
  • keep headers and browser identity consistent
  • use realistic session continuity where the target expects it
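Rate control per domain can start as something very small. The class below is a minimal sketch that enforces a minimum gap between requests to the same host; the name DomainThrottle and the 1.5 second interval are assumptions, and the clock and sleep functions are injectable only so the behavior can be tested without waiting.

```python
import time
from urllib.parse import urlsplit

class DomainThrottle:
    """Enforce a minimum interval between requests to the same host."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_request = {}  # host -> time of the most recent request

    def wait(self, url):
        """Sleep if needed before requesting url; return the delay applied."""
        host = urlsplit(url).netloc
        last = self._last_request.get(host)
        delay = 0.0
        if last is not None:
            elapsed = self._clock() - last
            delay = max(0.0, self.min_interval - elapsed)
        if delay:
            self._sleep(delay)
        self._last_request[host] = self._clock()
        return delay

throttle = DomainThrottle(min_interval=1.5)
# The first request to a host is never delayed.
first_delay = throttle.wait("https://example.com/product/123")
print(first_delay)
```

Calling throttle.wait(url) immediately before each request keeps the pacing decision in one place, separate from the extraction code.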

A dependable proxy layer such as InstantProxies can help with IP diversity, session control, and geography without adding unnecessary operational complexity.

Building scrapers that tolerate layout churn

Selectors will age. Plan for that explicitly.

A stronger scraper design usually includes:

  • layered extractors such as API → JSON-LD → attributes → text-anchored DOM
  • ordered fallback paths per field
  • field-level success metrics
  • site-specific extractor versions
  • canary URLs checked regularly
  • monitors for missing fields, null spikes, or implausible values
  • saved HTML or DOM snapshots for debugging

A practical test pipeline looks like this:

  1. run canary URLs on a schedule
  2. compare extracted fields against expected thresholds
  3. alert on anomalies such as missing price, wrong title, or suspicious fallback usage
  4. attach selector path, fallback layer, or HTML snippet to the alert
  5. record the fix in a short playbook entry

This turns selector maintenance into an operational process instead of an emergency ritual.
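Steps 2 to 4 of that pipeline can be sketched as a single comparison function. The result layout (a fields dict plus a layers dict), the rule format, and the thresholds are all hypothetical choices for this example.

```python
def check_canary(url, extracted, expected):
    """Compare one canary result with its expectations; return a list of alerts."""
    alerts = []
    fields = extracted.get("fields", {})
    layers = extracted.get("layers", {})
    for field, rule in expected.items():
        value = fields.get(field)
        layer = layers.get(field)
        if value in (None, ""):
            reason = "missing value"
        elif not rule["check"](value):
            reason = "implausible value"
        elif layer not in rule["allowed_layers"]:
            reason = "unexpected fallback layer"
        else:
            continue
        # Attach enough context to debug without re-running the scrape.
        alerts.append(
            {"url": url, "field": field, "reason": reason, "layer": layer, "value": value}
        )
    return alerts

EXPECTED = {
    "price": {
        "check": lambda v: 1 <= float(v) <= 500,
        "allowed_layers": {"api", "json-ld"},
    },
    "title": {
        "check": lambda v: "shoe" in v.lower(),
        "allowed_layers": {"api", "json-ld", "attributes"},
    },
}

# The price is plausible, but it came from the lowest-confidence layer.
result = {
    "fields": {"price": "89.00", "title": "Trail Shoe"},
    "layers": {"price": "text-xpath", "title": "json-ld"},
}

alerts = check_canary("https://example.com/product/123", result, EXPECTED)
print(alerts)
```

Here the value itself would pass a plain range check; the alert fires only because the layer changed, which is the kind of early warning a canary is meant to provide.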

Common mistakes that make randomized class problems worse

Copying the most specific class from devtools

Specificity is not the same as stability.

Matching long hashed class chains

These are often the first selectors to break in the next deploy.

Using global selectors with no scoping

This increases the chance of matching hidden, duplicated, or unrelated elements.

Ignoring semantic or accessibility attributes

Many pages expose better anchors than classes, but scrapers never use them.

Extracting from the DOM when APIs or script data exist

That creates unnecessary fragility.

Not testing across variants

A selector that works on one page may fail across locale, template, or experiment differences.

Treating a present value as proof of correctness

Wrong-but-present extraction is one of the most expensive failure modes.

A pragmatic workflow you can use now

Use this checklist whenever you tackle a new target:

  1. Inspect the DOM for stable anchors such as IDs, data attributes, labels, headings, and ARIA roles
  2. Search for JSON-LD and embedded state objects before writing DOM selectors
  3. Check the Network tab for JSON or GraphQL endpoints
  4. Build layered extractors in this order: API → JSON-LD → structured DOM → text-anchored DOM
  5. Add fallbacks and sanity checks for each critical field
  6. Choose a proxy strategy that matches the target’s difficulty and session needs
  7. Add canary URLs, validation rules, and alerting
  8. Document the site’s quirks, duplicate-node risks, and change patterns

Example: resilient product detail extraction plan

Imagine a retailer page that renames classes on each release.

A stronger extraction plan would be:

  • first parse JSON-LD for name, SKU, price, and availability
  • if missing, call the product API observed in the Network tab
  • if that path is unavailable, render the page and anchor on visible labels such as “Price” and “Availability” inside the active product container
  • extract the title from the h1 or main heading region
  • validate numeric ranges and currency codes
  • compare the DOM value against script-layer or API values where available
  • cache results to reduce load and improve consistency

This is much more resilient than rebuilding class selectors after every deployment.

Maintenance metrics that actually matter

Track metrics that reveal whether the scraper is really tolerating layout churn:

  • selector success rate per field and extraction layer
  • fallback usage by field
  • null rate by page type
  • wrong-value anomaly rate
  • mean time to repair after layout changes
  • requests per successful record
  • 429 and block rates by target
  • canary failure rate

These metrics help you distinguish a robust scraper from one that is only surviving temporarily.
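A few of these metrics can be computed directly from per-record extraction logs. The sketch below assumes each record maps a field name to a (value, layer) pair, with a value of None when extraction failed, and that "api" is the primary layer; both are assumptions about how a pipeline might log its results.

```python
from collections import Counter

def field_metrics(records, primary_layer="api"):
    """Compute success, null, and fallback rates per field.

    records: iterable of {field: (value, layer)} dicts.
    """
    totals = Counter()
    nulls = Counter()
    fallbacks = Counter()
    for record in records:
        for field, (value, layer) in record.items():
            totals[field] += 1
            if value is None:
                nulls[field] += 1
            elif layer != primary_layer:
                fallbacks[field] += 1
    return {
        field: {
            "success_rate": (totals[field] - nulls[field]) / totals[field],
            "null_rate": nulls[field] / totals[field],
            "fallback_rate": fallbacks[field] / totals[field],
        }
        for field in totals
    }

RUN = [
    {"price": ("89.00", "api")},
    {"price": ("120.00", "api")},
    {"price": ("45.00", "json-ld")},
    {"price": (None, None)},
]

metrics = field_metrics(RUN)
print(metrics)
```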

When to use a headless browser versus pure HTTP

Use a headless browser when:

  • critical data appears only after complex client-side rendering
  • you need user-like interactions such as click, expand, or scroll
  • the page uses dynamic state that is hard to reproduce with raw HTTP

Prefer pure HTTP when:

  • JSON endpoints are available
  • server-rendered HTML exposes JSON-LD or semantic markup
  • you want speed, scale, and fewer moving parts

A useful decision flow is:

  • Is the data in JSON-LD or script state?
  • If not, is there an API?
  • If not, can semantic DOM extraction work safely?
  • If not, render with a headless browser and use semantic locators

Frequently asked questions about scraping websites with randomized CSS classes

Why do randomized class names break scrapers so often?

Because they are often generated automatically during front-end builds and are not intended to stay stable across deployments. They support styling, not durable extraction.

Are CSS selectors useless on modern websites?

No. CSS selectors are still useful, but class names are often a weak foundation when they are generated dynamically. Attribute selectors, scoped selectors, and semantic anchors are usually much stronger.

When should I use XPath instead of CSS?

Use XPath when you need relational logic such as matching by text, moving between siblings, anchoring to headings, or selecting elements based on nearby structure that plain CSS handles poorly.

Is text-based matching always safe?

No. Text can vary across languages, experiments, and repeated components. It is strongest when combined with container scoping, visible-state checks, and structural context.

What is the most durable fallback when selectors keep breaking?

If the data is exposed in JSON-LD, framework state, or the APIs behind the page, script-based or network-based extraction is often more durable than trying to chase front-end class changes.

Better selector strategy starts with ignoring the CSS lottery

Randomized classes are frustrating because they make the scraper feel fragile even when the page itself is stable. But the class names are often not the real source of truth. They are just styling details produced by the front-end build.

The strongest scrapers survive that churn by anchoring to meaning instead of implementation. They use stable attributes, visible text, structural relationships, scoped containers, embedded data, and network endpoints instead of betting everything on whatever hashed class happened to exist during the last inspection.

If your scraper is breaking every time the front end deploys, that is usually a signal to redesign the locator strategy rather than keep replacing one brittle class with another. For production scraping infrastructure, pair that selector strategy with the right network layer from InstantProxies, compare current plans on the pricing page, and review available proxy types on the proxies page so your extraction logic and proxy layer fail less often together.