
April 13, 2026

The Data Acquisition Stack Scraping Your Site

By Q4 2025, publishers on the TollBit network were seeing one AI bot visit for every 31 human visits. At the start of that same year, the ratio was 1 in 200. The growth is obvious enough. What's less visible is how the scraping actually works, and why the tools most publishers rely on to stop it are structurally inadequate.

The short answer is that there isn't just a bot hitting your site. There's a tech stack behind the scenes powering it.

The data acquisition stack
AI developers don't typically run one crawler and call it done. They combine multiple data acquisition layers, escalating to more sophisticated approaches wherever they encounter resistance. Based on our analysis across nearly 40 scraping vendors, a complete stack can include any of the following:

  • First-party crawlers - run by the AI developer itself, either identified or disguised
  • Secondary-source scraping - targeting Google's search results pages to access indexed content without touching the publisher's site directly
  • Residential IP proxies - routing requests through real home internet connections so the traffic looks like human visitors
  • Circumvention services - layered on top of proxy networks to bypass detection systems
  • Cloud-based headless browsers - loading full pages in remote browser environments, often paired with a residential proxy so the request appears to come from a human
  • Third-party scraping services - handling the end-to-end process as a managed service
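
To make the escalation concrete, here is a minimal Python sketch of how such a stack might be wired together, trying each layer in turn until one returns usable HTML. The proxy endpoint, user-agent string, and the headless-browser stub are illustrative assumptions, not configurations we observed from any particular vendor.

```python
import requests

# A common desktop Chrome user-agent (illustrative); disguised crawlers
# typically present something like this instead of a bot name.
BROWSER_UA = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/120.0 Safari/537.36")

def fetch_direct(url):
    """Layer 1: a first-party crawler making a plain HTTP request."""
    resp = requests.get(url, headers={"User-Agent": BROWSER_UA}, timeout=10)
    return resp.text if resp.status_code == 200 else None

def fetch_via_residential_proxy(url):
    """Layer 2: the same request routed through a residential IP proxy,
    so the origin sees a home-ISP address instead of a datacenter one."""
    proxies = {"https": "http://user:pass@residential-proxy.example:8000"}  # placeholder
    resp = requests.get(url, headers={"User-Agent": BROWSER_UA},
                        proxies=proxies, timeout=15)
    return resp.text if resp.status_code == 200 else None

def fetch_with_headless_browser(url):
    """Layer 3: hand off to a cloud headless-browser or managed scraping
    service that renders the full page (stubbed here; real services
    expose their own APIs)."""
    raise NotImplementedError("delegate to a managed rendering service")

def acquire(url):
    """Escalate through the layers, moving on whenever one is blocked."""
    for layer in (fetch_direct, fetch_via_residential_proxy, fetch_with_headless_browser):
        try:
            html = layer(url)
            if html:
                return html
        except Exception:
            continue  # blocked, timed out, or unavailable: try the next layer
    return None
```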

The structural problem this creates is asymmetric. Most publisher websites deploy a single cybersecurity tool. The scraping side can cycle through dozens of services interchangeably, switching approaches when one gets blocked. Defenders have to respond to each new tactic as it emerges. Attackers just need to stay one step ahead.

Reddit and Google vs. the scrapers

The Reddit lawsuit, filed in October 2025 against Perplexity and three vendors most publishers won't have heard of — Oxylabs, AWMProxy, and SerpApi — is the clearest public window we have on how the stack operates in practice. Two of the defendants provide IP proxies, disguising scrapers as ordinary human traffic. SerpApi takes a different route: it scrapes Google's search results pages to access Reddit content, never touching Reddit's servers directly.

Google filed its own lawsuit against SerpApi in December, describing the company's methods in its filing: "SerpApi uses shady back doors — like cloaking themselves, bombarding websites with massive networks of bots and giving their crawlers fake and constantly changing names — circumventing our security measures to take websites' content wholesale." Google's internal tool built to prevent this, called SearchGuard, was the product of tens of thousands of person-hours and millions of dollars of investment. SerpApi got around it anyway.

If that's the situation for Google, media companies without comparable engineering resources are at a steep disadvantage.

What we observed in testing

When we built the Scraper Audit — a tool that lets publishers run commercial scrapers against their own URLs — we used it ourselves first, across 30 high-authority publisher sites split between paywalled and non-paywalled properties. We purchased top-tier enterprise subscriptions for each service to test their full capabilities.

The paywall finding was the most striking: some scrapers retrieved full versions of paywalled articles. Non-paywalled sites showed no meaningful advantage in protection. In the three cases where scrapers failed completely, the differentiating factor was server-side rendering of the paywall. Meanwhile, client-side paywall implementations were consistently bypassed.
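
The distinction is easier to see in code. Below is a hypothetical Flask sketch contrasting the two approaches: the server-side route never sends the full article body to an unauthenticated request, so there is nothing for a scraper to recover, while the client-side route ships the whole text and merely hides it with a script, which a scraper can simply ignore. The routes, cookie name, and entitlement check are placeholders, not any publisher's actual implementation.

```python
from flask import Flask, request

app = Flask(__name__)

FULL_ARTICLE = "...full article text..."
TEASER = FULL_ARTICLE[:200]

def is_subscriber(req):
    # Placeholder for a real session / entitlement check.
    return req.cookies.get("subscriber") == "1"

@app.route("/server-side/<slug>")
def server_side_paywall(slug):
    # Gating happens before the response is built: non-subscribers
    # only ever receive the teaser.
    body = FULL_ARTICLE if is_subscriber(request) else TEASER
    return f"<article>{body}</article>"

@app.route("/client-side/<slug>")
def client_side_paywall(slug):
    # The full text is present in the HTML; a script hides it for
    # non-subscribers, but the content is already in the payload.
    return (
        f'<article id="content">{FULL_ARTICLE}</article>'
        '<script>'
        'if (!document.cookie.includes("subscriber=1")) {'
        '  document.getElementById("content").style.display = "none";'
        '}'
        '</script>'
    )
```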

Running the scrapers against our own test server revealed specific behavioral patterns in the logs. None of the tested scrapers fetched the robots.txt file before accessing the target page. All tested scrapers used current Chrome user-agents to blend into legitimate traffic rather than identifying themselves. One tool hit our server from at least ten different IP addresses within a single second. Others switched to Firefox user-agents when Chromium-based detection was active. Several combined modern user-agents with spoofed Google search referrers to simulate organic traffic arriving from a search result. Some used older, static browser profiles to avoid the behavioral signatures of modern headless frameworks.
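
Publishers who want to look for similar signals in their own logs can start with something like the sketch below, which works over standard combined-format access logs: it flags a single path served to an implausible number of distinct IPs within one second, and requests that claim a Google search referrer from IPs that never fetched robots.txt. The regex, field names, and thresholds are assumptions chosen to illustrate the idea, not a hardened detection pipeline.

```python
import re
from collections import defaultdict

# Apache/Nginx combined log format: ip, identity, user, [timestamp],
# "request line", status, size, "referrer", "user-agent".
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d+) \S+ "(?P<referrer>[^"]*)" "(?P<ua>[^"]*)"'
)

def burst_report(log_lines, ip_threshold=10):
    """Group requests by (second, path) and flag groups served to more
    distinct IPs than a single human session could plausibly use."""
    by_second_path = defaultdict(set)
    for line in log_lines:
        m = LOG_LINE.match(line)
        if m:
            by_second_path[(m.group("ts"), m.group("path"))].add(m.group("ip"))
    return {key: len(ips) for key, ips in by_second_path.items()
            if len(ips) >= ip_threshold}

def spoofed_referrer_candidates(log_lines):
    """Flag requests claiming to arrive from Google search while the same
    IP never requested robots.txt -- one combination seen in our test logs."""
    parsed = [m for m in map(LOG_LINE.match, log_lines) if m]
    fetched_robots = {m.group("ip") for m in parsed if m.group("path") == "/robots.txt"}
    return [(m.group("ip"), m.group("path"), m.group("ua"))
            for m in parsed
            if "google." in m.group("referrer") and m.group("ip") not in fetched_robots]
```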

This is the unfortunate reality websites are currently dealing with. A defense goes up, and scrapers find a way around it. It’s an endless cat-and-mouse game.

The cost of the arms race

For AI developers, this infrastructure isn't cheap. Advanced scraping services charge up to $22.50 per 1,000 pages. At the volumes required to serve a consumer AI product, data acquisition costs likely run into the tens of millions of dollars annually, before factoring in legal fees from suits like Reddit's and Google's.
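
As a rough, back-of-the-envelope illustration of how those numbers compound (the annual page volume below is an assumption for the sake of arithmetic, not a figure from our data):

```python
rate_per_1k_pages = 22.50         # USD, upper end of advertised scraping prices
pages_per_year = 1_000_000_000    # assumed volume for a consumer AI product

annual_cost = pages_per_year / 1_000 * rate_per_1k_pages
print(f"${annual_cost:,.0f} per year")   # -> $22,500,000 per year
```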

For publishers, the cost is in defense: CDN configuration, bot detection tooling, engineering time, and the ongoing monitoring required to keep pace with new evasion techniques.

Neither side is getting much out of the arrangement. Publishers aren't getting paid, and AI developers are spending heavily on infrastructure that delivers unreliable access. Our Q3 & Q4 State of the Bots report found that click-through rates from AI platforms fell from 0.8% to 0.27% across publishers without licensing deals, while AI applications as a whole drove just 0.12% of referral traffic compared to Google's 80.5%.

The volume of scraping is up. The return to publishers is effectively zero. The cost of maintaining the infrastructure to support that dynamic is high on both sides.

Using the Scraper Index and Scraper Audit

The Scraper Index catalogs the vendors driving this ecosystem. It outlines what they advertise, what techniques they use, whether they default to robots.txt compliance (many don't), and what output formats they produce. It's a reference for understanding who is likely hitting your site and what they're capable of.
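
For contrast, compliant behavior is simple to express. Python's standard library includes a robots.txt parser, and honoring it amounts to one check before each fetch; the hostname and bot name below are placeholders.

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://publisher.example/robots.txt")
rp.read()  # a compliant crawler fetches and parses robots.txt before anything else

article = "https://publisher.example/articles/some-story"
if rp.can_fetch("ExampleAIBot", article):
    pass  # directives allow it: proceed to request the article
else:
    pass  # directives disallow it: a compliant crawler stops here
```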

The Scraper Audit lets you test your site's actual exposure to these scrapers. Use it to see whether scrapers are bypassing your security measures or your paywall. Paste any article URL, select which scrapers to run against it, and see what gets extracted. Please email team@tollbit.com to request access.

The full data behind this post is in our Q3 & Q4 2025 State of the Bots report: https://tollbit.com/state-of-the-bots/q3-q4-2025/.

Further reading: https://digiday.com/media/in-graphic-detail-new-data-shows-publishers-face-growing-ai-bot-third-party-scraper-activity/
