April 13, 2026
The Data Acquisition Stack Scraping Your Site
By Q4 2025, publishers on the TollBit network were seeing one AI bot visit for every 31 human visits. At the start of that same year, the ratio was 1 in 200. The growth is obvious enough. What's less visible is how the scraping actually works, and why the tools most publishers rely on to stop it are structurally inadequate.
The short answer is that there isn't just a bot hitting your site. There's a tech stack behind the scenes powering it.
The data acquisition stack
AI developers don't typically run one crawler and call it done. They combine multiple data acquisition layers, escalating to more sophisticated approaches wherever they encounter resistance. Based on our analysis across nearly 40 scraping vendors, a complete stack can include any of the following:
- First-party crawlers - run directly by the AI developer, either identified or disguised
- Secondary-source scraping - targeting Google's search results pages to access indexed content without touching the publisher's site directly
- Residential IP proxies - routing requests through real home internet connections so traffic looks like human visitors
- Circumvention services - layered on top of proxy networks to bypass detection systems
- Cloud-based headless browsers - loading full pages in remote browser environments, often paired with a residential proxy so the request appears to come from a human
- Third-party scraping services - handling the end-to-end process as a managed service
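The escalation pattern described above can be sketched as a simple fallback loop. This is an illustrative simulation only: the layer names are from the list above, but the success rates are invented for demonstration, and no real vendor's logic is shown.

```python
import random

# Stand-ins for real acquisition layers; success rates are invented
# for illustration, not measured values.
LAYERS = [
    ("identified first-party crawler", 0.2),
    ("disguised crawler", 0.5),
    ("residential proxy", 0.8),
    ("headless browser + residential proxy", 0.95),
]

def fetch_with_escalation(url, rng=random.random):
    """Try each layer in order, escalating when one is blocked."""
    for name, success_rate in LAYERS:
        if rng() < success_rate:
            return name  # this layer got through
    return None  # every layer was blocked
```

The structural point falls out of the sketch: blocking the cheap layers doesn't stop the scraper, it just pushes the request up the stack to a more expensive, harder-to-detect method.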
The structural problem this creates is asymmetric. Most publisher websites deploy a single cybersecurity tool. The scraping side can cycle through dozens of services interchangeably, switching approaches when one gets blocked. Defenders have to respond to each new tactic as it emerges. Attackers just need to stay one step ahead.
Reddit and Google vs. the scrapers
The Reddit lawsuit, filed in October 2025 against Perplexity and three vendors most publishers won't have heard of — Oxylabs, AWMProxy, and SerpApi — is the clearest public window we have on how the stack operates in practice. Two of the defendants provide IP proxies, disguising scrapers as ordinary human traffic. SerpApi takes a different route: it scrapes Google's search results pages to access Reddit content, never touching Reddit's servers directly.
Google filed its own lawsuit against SerpApi in December, describing the company's methods in its filing: "SerpApi uses shady back doors — like cloaking themselves, bombarding websites with massive networks of bots and giving their crawlers fake and constantly changing names — circumventing our security measures to take websites' content wholesale." Google's internal tool built to prevent this, called SearchGuard, was the product of tens of thousands of person-hours and millions of dollars of investment. SerpApi got around it anyway.
If that's the situation for Google, media companies without comparable engineering resources are at a steep disadvantage.
What we observed in testing
When we built the Scraper Audit — a tool that lets publishers run commercial scrapers against their own URLs — we used it ourselves first, across 30 high-authority publisher sites split between paywalled and non-paywalled properties. We purchased top-tier enterprise subscriptions for each service to test their full capabilities.
The paywall finding was the most striking: some scrapers retrieved full versions of paywalled articles. Non-paywalled sites showed no meaningful advantage in protection. In the three cases where scrapers failed completely, the differentiating factor was server-side rendering of the paywall. Meanwhile, client-side paywall implementations were consistently bypassed.
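The server-side finding makes sense once you look at what each approach actually sends over the wire. A minimal sketch of the contrast (the markup and variable names here are hypothetical, not any publisher's implementation):

```python
FULL_TEXT = "lorem ipsum dolor sit amet " * 40  # stand-in for an article body
TEASER = FULL_TEXT[:60]

def render_server_side(is_subscriber: bool) -> str:
    # Server-side paywall: non-subscribers never receive the full text,
    # so a scraper fetching the page gets only the teaser.
    body = FULL_TEXT if is_subscriber else TEASER
    return f"<article>{body}</article>"

def render_client_side(is_subscriber: bool) -> str:
    # Client-side paywall: the full text ships in the HTML and is merely
    # hidden by an overlay. A scraper ignores the overlay and reads it all.
    overlay = "" if is_subscriber else "<div class='paywall-overlay'></div>"
    return f"<article>{FULL_TEXT}</article>{overlay}"
```

In the client-side version, "bypassing the paywall" requires no cleverness at all: the content is already in the response body.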
Running the scrapers against our own test server revealed specific behavioral patterns in the logs. None of the tested scrapers fetched the robots.txt file before accessing the target page. All tested scrapers used current Chrome user-agents to blend into legitimate traffic rather than identifying themselves. One tool hit our server from at least ten different IP addresses within a single second. Others switched to Firefox user-agents when Chromium-based detection was active. Several combined modern user-agents with spoofed Google search referrers to simulate organic traffic arriving from a search result. Some used older, static browser profiles to avoid the behavioral signatures of modern headless frameworks.
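Two of those signatures, the skipped robots.txt fetch and the burst of distinct IPs within one second, are straightforward to check for in your own access logs. A minimal sketch, assuming logs have been parsed into `(timestamp_second, ip, path)` tuples (a simplification, not any specific server's log format):

```python
from collections import defaultdict

def burst_seconds(log_entries, threshold=5):
    """Return seconds in which >= threshold distinct IPs hit the server."""
    ips_per_second = defaultdict(set)
    for second, ip, path in log_entries:
        ips_per_second[second].add(ip)
    return {s: ips for s, ips in ips_per_second.items() if len(ips) >= threshold}

def fetched_robots_txt(log_entries):
    """True if any request in the log asked for robots.txt first."""
    return any(path == "/robots.txt" for _, _, path in log_entries)
```

A client that never requests `/robots.txt` yet bursts from many IPs in the same second is behaving like the tools we tested, not like a compliant crawler or a human reader.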
This is the unfortunate reality websites are currently dealing with. A defense goes up, and scrapers find a way around it. It’s an endless cat-and-mouse game.
The cost of the arms race
For AI developers, this infrastructure isn't cheap. Advanced scraping services charge up to $22.50 per 1,000 pages. At the volumes required to serve a consumer AI product, data acquisition costs likely run into the tens of millions of dollars annually, before factoring in legal fees from suits like Reddit's and Google's.
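The arithmetic behind that estimate is simple. The $22.50-per-1,000-pages rate is from the figures above; the annual volume below is an illustrative assumption, not a measured number:

```python
RATE_PER_1K_PAGES = 22.50  # top-end rate cited above, in USD

def annual_cost(pages_per_year: int, rate_per_1k: float = RATE_PER_1K_PAGES) -> float:
    """Data acquisition spend at a flat per-1,000-pages rate."""
    return pages_per_year / 1_000 * rate_per_1k

# At an assumed 1 billion pages per year:
# annual_cost(1_000_000_000) -> 22_500_000.0, i.e. USD 22.5M
```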
For publishers, the cost is in defense: CDN configuration, bot detection tooling, engineering time, and the ongoing monitoring required to keep pace with new evasion techniques.
Neither side is getting much out of the arrangement. Publishers aren't getting paid, and AI developers are spending heavily on infrastructure that delivers unreliable access. Our Q3 & Q4 State of the Bots report found that click-through rates from AI platforms fell from 0.8% to 0.27% across publishers without licensing deals, while AI applications as a whole drove just 0.12% of referral traffic compared to Google's 80.5%.
The volume of scraping is up. The return to publishers is effectively zero. The cost of maintaining the infrastructure to support that dynamic is high on both sides.
Using the Scraper Index and Scraper Audit
The Scraper Index catalogs the vendors driving this ecosystem. It outlines what they advertise, what techniques they use, whether they default to robots.txt compliance (many don't), and what output formats they produce. It's a reference for understanding who is likely hitting your site and what they're capable of.
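For contrast with the non-compliant behavior documented above, here is what honoring robots.txt looks like using only Python's standard library. The "ExampleBot" token and the rules are hypothetical:

```python
import urllib.robotparser

# Hypothetical robots.txt rules for a hypothetical crawler token.
rules = """
User-agent: ExampleBot
Disallow: /articles/
""".splitlines()

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)

# A compliant crawler runs this check before every fetch; the scrapers
# in our tests never even requested the robots.txt file.
print(rp.can_fetch("ExampleBot", "https://example.com/articles/story.html"))  # False
print(rp.can_fetch("ExampleBot", "https://example.com/about"))                # True
```

The check costs one request and a few lines of code, which underlines that skipping it is a choice, not a technical limitation.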
The Scraper Audit lets you test your website's actual exposure. Use it to see whether scrapers are bypassing your security measures or paywalls: paste any article URL, select which scrapers to run against it, and see what gets extracted. Email team@tollbit.com to request access.
The full data behind this blog is in our Q3 & Q4 2025 State of the Bots report: https://tollbit.com/state-of-the-bots/q3-q4-2025/.
Further reading: https://digiday.com/media/in-graphic-detail-new-data-shows-publishers-face-growing-ai-bot-third-party-scraper-activity/