
State of the Bots

2025 Q1, The Rise of RAG Bots

Executive Summary

1. RAG bot scrapes now exceed Training bot scrapes across the TollBit network. From Q4 2024 to Q1 2025, RAG bot scrapes per site grew 49%, nearly 2.5X the rate of Training bot scrapes (which grew by 18%). This is a clear signal that AI tools require continuous access to content and data for RAG rather than for training.

2. Among websites with TollBit Analytics set up before January 2025, AI bot traffic volume nearly doubled in Q1, rising by 87%.

3. Publishers attempted to block 4x more AI bots between January 2024 and January 2025 by disallowing them in their robots.txt files. However, AI bots increasingly ignore robots.txt: the share of AI bot scrapes that bypassed robots.txt surged from 3.3% in Q4 2024 to 12.9% by the end of Q1 2025 (March). In March 2025 alone, over 26M scrapes from AI bots bypassed robots.txt on sites using TollBit.

4. Bot traffic directed to the TollBit Bot Paywall increased by 732% from Q4 2024 to Q1 2025 as publishers explore more proactive defenses that don't depend solely on the honor system of robots.txt.

5. Some AI companies are now attempting to set a problematic precedent of treating bot traffic the same as human visitors. Recent updates to several major AI companies' terms of service state that their AI bots can act on behalf of user requests, explicitly stating that they may ignore robots.txt guidelines when being used for RAG.

6. On average across TollBit sites, Bing returns one human visit for every 11 scrapes. Scrape-to-referral ratios for AI-only apps are: OpenAI 179:1, Perplexity 369:1, and Anthropic 8,692:1.

7. AI apps and agents continue to drive limited traffic back to publishers. Across the TollBit network, AI apps drove just 0.04% of total external referral traffic to sites, while Google drove 85%.


Section 1: The Scale of AI Scraping

RAG Scrapes Surpass Training Bot Scrapes

Key Definitions:

Retrieval-augmented generation (RAG) agent — These bots retrieve information in real-time, by searching the web, to respond to user prompts in AI tools such as Perplexity or ChatGPT. The responses they generate often include links or citations to the original sources of the information.

Training Data Crawler — These bots collect data to train large language models (LLMs), such as Claude 3.7 Sonnet or Meta's Llama. These bots move around the web — following links, parsing sitemaps, and downloading content — to build massive datasets. The collected data is then used to "teach" the model how to generate language.
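To make the distinction concrete, here is a minimal Python sketch of how an analytics pipeline might bucket requests by user-agent string. The bot names come from Table 1.4 and the appendix of this report; the classification logic itself is hypothetical, not TollBit's actual implementation.

```python
# Illustrative sketch only: bucketing AI bot requests by user-agent substring.
# Bot names are taken from this report; the logic is hypothetical.

RAG_AGENTS = {
    "ChatGPT-User",          # OpenAI, real-time retrieval
    "Perplexity-User",       # Perplexity, real-time retrieval
    "Claude-User",           # Anthropic, real-time retrieval
    "Meta-ExternalFetcher",  # Meta, user-initiated fetches
    "DuckAssistBot",         # DuckDuckGo, real-time answers
}

TRAINING_CRAWLERS = {
    "GPTBot",                # OpenAI, training data
    "ClaudeBot",             # Anthropic, training data
    "Bytespider",            # ByteDance, undisclosed
    "CCBot",                 # Common Crawl
}

def classify(user_agent: str) -> str:
    """Return a coarse category for an AI bot user-agent string."""
    if any(name in user_agent for name in RAG_AGENTS):
        return "rag"
    if any(name in user_agent for name in TRAINING_CRAWLERS):
        return "training"
    return "other"

print(classify("Mozilla/5.0 (compatible; ChatGPT-User/1.0)"))  # -> rag
```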

RAG bot scraping activity has now surpassed that of training bots. Monetizing RAG, i.e. real-time access, is the recurring revenue opportunity for publishers, as AI applications need ongoing access to maintain the utility of their services.

By analyzing the functions of the bots driving this overall increase in AI web traffic, we see that RAG bot traffic now exceeds Training bot traffic across the TollBit network. From Q4 2024 to Q1 2025, RAG bot traffic per site grew 49%, nearly 2.5X the rate of Training bot traffic (18%).

This expansion of server requests from RAG agents is a signal of continued AI adoption, particularly in search-like settings. As more users turn to these services to retrieve information from the real-time web, even when they don't explicitly instruct the AI applications to do so, we would expect this figure to keep increasing. Some of these journeys would previously have ended with users landing on publisher websites, probably via Google, with monetization opportunities for content creators that have now been lost.

Figure 1.1. AI bot traffic by user-agent type (average, per domain)


AI User Agent Traffic Continues to Increase

AI bot traffic continues to grow rapidly, making up an ever-larger proportion of overall publisher traffic. OpenAI, Perplexity, and Meta own the most active AI-specific bots.

AI user agent traffic has grown steadily throughout the first quarter of 2025. Among websites that had Analytics set up before January 2025, traffic volume nearly doubled in Q1, rising by 87% during the quarter.

Figure 1.2. Daily Total AI Bot Traffic — Cohort of websites that onboarded before Jan'25


Activity levels vary widely across AI user agents and between quarters. Average scraping activity across the top 6 AI bots increased by 56% from Q4 to Q1, though individual bot trends varied significantly. Notably, the scraping activity of PerplexityBot, the hybrid RAG agent and AI search indexing crawler for Perplexity, increased markedly (+359%) from the previous quarter.

Figure 1.3. Monthly Average Scrapes per site by AI Bots — All sites from Q4 '24 vs Q1 '25


Among the top 12 AI bots' scraping activity, ChatGPT-User, Meta-ExternalAgent, and PerplexityBot are the most active, together making up about 70% of top-12 AI bot scrapes. Looking at the ownership of the most active bots, OpenAI's user agents are three of the top five most active crawlers and collectively account for over 46% of top-12 AI bot scraping across TollBit partner websites in Q1 2025.

Table 1.4. A breakdown of the top 12 bots

| AI Bot | Q1 Share |
| --- | --- |
| ChatGPT-User | 28.50% |
| Meta-ExternalAgent | 25.45% |
| PerplexityBot | 16.71% |
| OAI-SearchBot | 11.31% |
| GPTBot | 6.92% |
| ClaudeBot | 5.04% |
| Bytespider | 3.14% |
| Timpibot | 1.02% |
| DuckAssistBot | 0.94% |
| Perplexity-User | 0.42% |
| CCBot | 0.37% |
| ai2bot-dolma | 0.18% |

Note: Other players, such as Google's AI Overviews, Microsoft Copilot, and Apple's AI tools, may have larger consumer reach, but because they do not separate their AI user agents from their general crawlers, it is not possible to include them in analyses specific to AI bot traffic.

Scraping Trends by Content Category

Analyzing scraping trends across different content categories largely shows consistent growth across the board. This analysis focuses on a cohort of websites that onboarded before Q4 2024 and compares how bot scraping has increased by content category.

Notably, the one decline in scraping was in the "Deals & shopping" category, which had spiked ahead of holiday shopping at the end of 2024. This likely reflects user queries: AI bots fetch more content in categories that align with what users are searching for.

Figure 1.5 Average scrapes per page for each category — cohort — Q4 '24 vs Q1 '25


AI Bot Scraping as a Percentage of Googlebot and Bingbot Activity

In Q2 2024, the scraping activity of the top 6 AI bots was roughly 10% of Googlebot's scraping activity. By Q1 2025, AI bot access to sites had reached 60.29% of Bingbot's activity and 30.55% of Googlebot's total scraping.

| Quarter | Top AI bot scrapes vs Googlebot | Top AI bot scrapes vs Bingbot |
| --- | --- | --- |
| 2024 Q2 | 9.89% | 10.86% |
| 2024 Q3 | 17.85% | 17.36% |
| 2024 Q4 | 21.40% | 42.44% |
| 2025 Q1 | 30.55% | 60.29% |

This data shows the dramatic increase in AI bots' share of scraping activity compared to Googlebot and Bingbot. Google and Microsoft are companies that publishers have had relationships with for 20+ years. Within a single year, publishers are being hammered by new crawlers for which the value exchange is no longer clear, especially when they don't drive traffic back to sites.


Section 2: AI App Click-through Rates Remain Extremely Low

Click-through rates from AI applications remain a small fraction of the click-through rate of Google's organic search results.

Click-through rate is calculated by dividing the number of site visits from an AI application by the number of accesses from that AI app's agents for which we have a strong signal that they act in real-time to fetch content on behalf of an end user.

This is a proxy and in some cases may overstate the actual click-through rate from AI applications as a) content may be held in an offline cache, or b) third-party or masked user agents may be in use. Both of these would have the effect of artificially inflating the click-through rate by obfuscating the true number of times content was accessed.

The true number is, of course, with the AI companies.
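As a minimal illustration of the calculation described above, the Python sketch below computes the proxy click-through rate from referral and agent-access counts. The counts shown are made up for the example, not TollBit data.

```python
# Hypothetical sketch of the proxy click-through rate described above.
# visits: human referrals attributed to an AI app
# agent_accesses: real-time fetches by that app's RAG agent(s)

def click_through_rate(visits: int, agent_accesses: int) -> float:
    """Proxy CTR = referred visits / real-time agent accesses."""
    if agent_accesses == 0:
        return 0.0
    return visits / agent_accesses

# Example with made-up counts: 50 visits from 10,000 agent fetches.
ctr = click_through_rate(visits=50, agent_accesses=10_000)
print(f"{ctr:.2%}")  # 0.50%

# Caveats from the text: cached content or masked user agents reduce the
# observed access count, so this proxy can overstate the true CTR.
```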

Figure 2.1. AI application click-through rates vs Google for Q1 2025

Figure 2.2. Average Google click-through rates, 1-10 organic positions


Referral Traffic from AI Apps Is a Tiny Fraction of External Referrals

Looking in aggregate across TollBit's publishing partners, traffic from AI applications represented just 0.04% of all external referrals to sites in Q1 2025. This is insignificant when compared with Google, which delivers around 85% of traceable external visits.

Even as AI usage accelerates — often in settings which have the potential to substitute engagement on a website — the number of referrals from AI applications remains minuscule. The data reinforces what is intuitively the case — these interfaces are simply not designed to require the user to click through to a website. As a consequence, unless some other form of a fair exchange of value occurs, it is hard to make the case for any course of action other than taking all possible measures to prevent AI bots from scraping publisher content.

In comparison, Google continues to contribute the bulk of a publisher's total external referral traffic, at 85%. "Other" platforms account for the remaining 15%, comprising search engines like Bing, Yahoo, and DuckDuckGo as well as social media platforms like LinkedIn and TikTok.

There are two considerations for the industry as this continues to unfold:

  1. Referrals from AI apps seem to have doubled from Q4 2024 to Q1 2025. However, this likely reflects broader adoption of those platforms, not an indication that the platforms are driving more traffic back. This number should be compared with the drop in Google referrals from 90.75% in Q2 2024 to 85.06% in Q1 2025. If users are shifting their behavior toward AI tools, the drop in Google referrals is not nearly made up for by referrals from AI search platforms. Part of this drop may also be due to the impact of AI summaries in search.

  2. As this trend continues, the web will see increasing amounts of traffic from bots, which imposes growing costs on publisher infrastructure. In the past, the value exchange for crawling was that the platforms would drive traffic back. This value exchange needs to be reexamined as AI visitors grow in number while sending little traffic back to sites.

Figure 2.3. Sources of referral traffic across publisher sites


Note: The blue line is not representative of 0.04%; this line would not even appear on the chart if accurately displayed.

Figure 2.4. Sources of external referral traffic across publisher sites (Q2 2024 — Q1 2025)


A key reminder is that Google referral traffic is also falling each quarter. Referral traffic from AI Bots is still minuscule and nowhere near enough to offset the broader decline.

Crawl-to-Referral Ratio

In Q1 2025, on average across TollBit sites, Bing returned one human visitor for every 11 crawls. This means that Bing's crawl-to-referral ratio is 11:1, up from 8:1 in Q4 2024 and 6:1 in Q3. Crawl-to-referral ratios for AI-only apps are as follows (a conversion to implied click-through rates appears after the list):

  • OpenAI's ratio is 179:1, improved slightly from 286:1 in Q4 2024, but still far exceeding Bing's.
  • Perplexity's ratio is 369:1, with crawling activity surging; the ratio worsened from 136:1 in just a quarter.
  • Anthropic's ratio is 8,692:1, the widest gap: it scraped content 8,692 times for every visit it sent, up from 5,880:1 in Q4 2024.
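Since the crawl-to-referral ratio is simply the inverse of the proxy click-through rate from the previous subsection, the figures above can be restated as implied CTRs. A quick Python sketch using the ratios reported in this section:

```python
# Crawl-to-referral ratio is the inverse of the proxy click-through rate.
# Ratios below are the Q1 2025 figures reported in this section.

ratios = {"Bing": 11, "OpenAI": 179, "Perplexity": 369, "Anthropic": 8_692}

for source, crawls_per_visit in ratios.items():
    implied_ctr = 1 / crawls_per_visit
    print(f"{source}: {crawls_per_visit}:1 -> implied CTR {implied_ctr:.4%}")

# Bing: 11:1 -> implied CTR 9.0909%
# OpenAI: 179:1 -> implied CTR 0.5587%
# Perplexity: 369:1 -> implied CTR 0.2710%
# Anthropic: 8692:1 -> implied CTR 0.0115%
```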

Each AI platform behaves differently and has a unique UI, and the crawl-to-referral ratio shows how strongly those design choices affect click-throughs from each platform. It is also worth commending the proper use of user agents by platforms that separate out crawlers for indexing, training, and retrieval. Many of the up-and-coming "agent" platforms do not discernibly identify themselves and, worse, actively try to pass themselves off as humans by using residential IP addresses or headless browsers.

Figure 2.5. Number of crawls for each referral by source


Section 3: Robots Exclusion Protocol — Adoption and Compliance

Disallowing RAG Agents on Robots.txt Isn't Enough to Protect Sites

Our data demonstrates that, for certain AI applications, disallowing real-time scraping by RAG bots in robots.txt has zero impact on the referrals the AI apps deliver. This is likely due to third-party scrapers or masked user agents that continue to scrape sites under the radar.

In our conversations with publishers, many tell us that they are opting to allow AI bots to scrape their sites in order to ensure that they do not miss out on any referrals that might flow from AI applications. In our first State of the Bots report we demonstrated that the click-through rate from AI platforms is 95.7% lower than for organic results on the first page of Google's search engine results page (we revisit this data for Q1 2025 above).

In addition to the very low traffic, we also see that, for some AI applications, external referrals stay consistent even after bots stop scraping sites upon being disallowed in robots.txt. This strongly suggests that some AI apps continue to obtain access to the content even while appearing to abide by robots.txt disallows, whether by using a third-party scraper or a masked user agent.

Figure 3.1. PerplexityBot scrape count, cohort of sites that disallow Perplexity on Robots.txt


Figure 3.2. Perplexity referrals for a cohort of sites that disallow PerplexityBot on Robots.txt


Publishers Are Trying to Control AI Use of Their Intellectual Property

Over the last year, the number of AI bots that publishers have attempted to block by disallowing them in their robots.txt files has increased by 4x. Website owners are increasingly active in their attempts to restrict unauthorized use of their content. Notably, AI companies have recently begun changing the descriptions of their user agents to specify that they do not always have to abide by robots.txt restrictions.

TollBit has analysed the robots.txt files of a large cohort of its publishing partners and whether, and how, they have changed from January 2024 to January 2025. We can see that the number of explicit disallow requests for AI bots (i.e. a single site disallowing a single bot would count as one) has increased from 559 to 2,165 (+287%). The average number of AI bots explicitly disallowed per website has grown from 2.2 to 8.6, a 4x increase.
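As an illustration of this evolution (the directives below are hypothetical, not drawn from any specific TollBit partner), a publisher's robots.txt might have grown from a couple of AI disallows to many:

```
# January 2024 (excerpt): two explicit AI bot disallows
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

# January 2025 (excerpt): the same file now disallows many more AI bots
User-agent: ChatGPT-User
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: Bytespider
Disallow: /
```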

Figure 3.3. Robots.txt analysis: example of the evolution of a publisher's robots.txt file from January 2024 to January 2025

Figure 3.4. Total AI bot disallow requests


When we look at the specific disallow requests, we can see widespread attempts by publishers to restrict AI bots' access to their sites. Interestingly, the number of disallow requests for OpenAI's training data crawler has decreased; this could be due to a combination of OpenAI's partnership program and publishers hoping that referrals from ChatGPT will follow. As we explore in section 2 of this report, the data suggests referrals from AI apps remain extremely low.

Figure 3.5. Change in Disallowing Bots via Robots.txt Across Websites (2024 vs 2025)


AI Companies Changing User Agent Descriptions

Seemingly in response to the increases in websites' attempts to block bots by disallowing them on robots.txt, AI companies are changing the descriptions of their user agents. Some bot descriptions are explicitly stating that, particularly in circumstances where the agent is fetching information in real-time to service a user request, the crawlers will ignore robots.txt instructions.

Since the first State of the Bots report in Q4 2024, Perplexity has added a second user agent, Perplexity-User, which operates as a RAG bot. While this separation is beneficial in that it could help publishers control how their content is used, Perplexity-User is among the bots that do not appear to respect robots.txt instructions.

Robots.txt has been a guideline, a handshake agreement that bots agreed to follow, since the early days of the web. It sufficed in a world where humans were the primary drivers of economic value on the internet. AI visitors will vastly outnumber human visitors in the future, and we need governance on how agentic visitors identify themselves on the internet.

Agents and bots should have unique identities on the Internet, separate from human visitors. Bots that mimic humans, or that take actions as an extension of a human without identification connecting them to that human, set a dangerous precedent. This is poised to have massive impacts on advertiser trust, site analytics, and the traffic load on sites, all with limited benefit to the sites being scraped.

Screenshot 1. Perplexity's developer notes (taken 13th April 2025)


Screenshot 2. Google's developer notes (taken 13th April 2025)


Screenshot 3. Meta's developer notes (taken 13th April 2025)


Bot Paywall as an Alternative Defense

Instead of relying solely on disallowing bots in robots.txt, some websites are forcing AI companies off their sites by deploying bot-blocking solutions. A cybersecurity tool can detect AI bot traffic and redirect it away from the human version of a website. TollBit provides basic cybersecurity measures to its sites and also partners with cybersecurity vendors (like HUMAN Security) to provide a more robust solution to websites on the platform.

This allows us to build towards the future. Instead of simply blocking bots, websites on TollBit can forward their traffic to the tollbit subdomain (e.g. tollbit.time.com), exposing parallel infrastructure that is optimized for AI traffic. The number of bots directed to the TollBit Bot Paywall increased by 732% in Q1 2025 versus Q4 2024.
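A minimal sketch of this routing pattern, assuming a simple user-agent check (the bot list and subdomain naming follow this report's examples; this is illustrative, not TollBit's actual implementation):

```python
# Illustrative sketch: routing identified AI bot traffic to a
# bot-specific subdomain instead of serving the human site.
AI_BOT_MARKERS = ("GPTBot", "ChatGPT-User", "PerplexityBot", "ClaudeBot")

def route_request(user_agent: str, host: str, path: str) -> tuple[int, str]:
    """Return an (HTTP status, location) pair for an incoming request."""
    if any(marker in user_agent for marker in AI_BOT_MARKERS):
        # Send the bot to the parallel, bot-optimized infrastructure
        # (e.g. tollbit.time.com), where paid access can be offered.
        return 302, f"https://tollbit.{host}{path}"
    return 200, f"https://{host}{path}"

print(route_request("Mozilla/5.0 (compatible; GPTBot/1.1)", "time.com", "/article"))
# -> (302, 'https://tollbit.time.com/article')
```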

This alternative gateway of sanctioned web access helps to:

  • Prevent any ads from being inadvertently served to AI visitors
  • Ensure human visitor site metrics don't get impacted by bot traffic
  • Allow AI bots to be presented with an option to pay for access with added benefits

Figure 3.6. AI bots directed to TollBit Bot Paywall


Appendix: AI User Agent Profiles

New User Agents in Q1 2025

Perplexity-User

Announced at the start of March, Perplexity's new user agent acts in real-time to 'support user actions' within Perplexity. We have therefore categorised Perplexity-User as a RAG agent; however, in testing at the start of April 2025, TollBit found that both PerplexityBot and Perplexity-User will scrape a page in real-time when Perplexity is prompted.

Claude-User

Also announced in March — and coinciding with Claude's new web-access functionality — Claude-User is Anthropic's new RAG agent.

Claude-SearchBot

With details published at the same time as Claude-User, Claude-SearchBot is the search indexing crawler to power Claude's web access.


AI Bot Profiles by Operating Organization

This index refers throughout to the Robots Exclusion Protocol, or 'robots.txt'. This mechanism allows website owners to give instructions to bots about accessing a website. It uses a machine-readable file (named robots.txt) which specifies, for individual bots or all bots collectively, which pages or sections they can or cannot crawl. Robots.txt operates simply as a signal, though, and does not actively block access. Not all developers program their bots to comply with these instructions.
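As a brief illustration of the file format (the directives shown are a hypothetical example, not a recommendation), a robots.txt file might look like this:

```
# All bots may crawl by default, while two AI user agents
# are disallowed site-wide.
User-agent: *
Disallow:

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /
```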

OpenAI

OpenAI is the developer of ChatGPT, the chatbot that was first to market and still holds a ~60% share, with 400M weekly active users as of April 2025. Its foundation models, and therefore its user agents, power both ChatGPT and an array of third-party applications, including Microsoft's Copilot and the Bing search engine.

User agents:

  • ChatGPT-User — Accesses websites in real-time and on-demand so that ChatGPT can formulate responses to user prompts based on the live web. It visits websites to gather information then processes, summarizes and synthesizes this to provide the output. It is not used to gather data for model training.
  • OAI-SearchBot — An indexing crawler used to power ChatGPT's search capabilities.
  • GPTBot — Gathers data from the web for the training of OpenAI's large language models. It operates continuously in the background and the data it gathers is collected and used offline for model development, rather than real-time responses to user prompts.

Robots.txt policy: OpenAI respects the signals provided by content owners via robots.txt, allowing them to disallow any or all of its crawlers.

Publisher partnerships: OpenAI has an extensive publisher partnerships program, having signed bilateral deals with ~35 publishers at the time of writing.



Perplexity

Perplexity has developed an AI answer engine: effectively AI-powered search that provides users with a natural language response to a prompt alongside a list of links and sources. It has 15 million active users. The standard free product primarily uses OpenAI's GPT-3.5 model, whereas the premium, paid-for service includes access to a range of models, including Claude 3.5 from Anthropic, Llama 3 from Meta, and Grok-2 from xAI.

User agents:

  • PerplexityBot — Gathers data from across the web to index it for Perplexity's search function. It also acts in real-time, gathering data to respond to specific user queries as they are placed.
  • Perplexity-User — Announced in March 2025, acts in real-time to 'support user actions' within Perplexity.

Robots.txt policy: Perplexity claims that its web crawler respects robots.txt. However, there have been widely-reported complaints from publishers that it has ignored their signals, leading to an investigation from Amazon, its cloud provider. In its FAQs, Perplexity explains that 'if a page is blocked, we may still index the domain, headline, and a brief factual summary'.

Publisher partnerships: In 2024 Perplexity launched a publisher partner program under which it provides a share of advertising revenues generated by responses that are based upon the content of media partners. Around 20 deals have been signed so far, mostly (although not exclusively) with smaller or niche publishers. The revenue shared with publishers is reported to be capped at 25%.


Anthropic

Founded by seven former OpenAI employees, Anthropic is an AI developer with a focus on privacy, safety and alignment (with human values). Its foundation models power its Claude chatbot — which has a free and premium version — and a multitude of third-party applications.

User agents:

  • ClaudeBot — Used to gather data from the internet for AI training. The data ClaudeBot retrieves is only used offline for the development of new large language models, not for real-time responses to user queries.
  • Claude-User — Announced in March — and coinciding with Claude's new web-access functionality — Claude-User is Anthropic's RAG agent.
  • Claude-SearchBot — The search indexing crawler to power Claude's web access.

Robots.txt policy: Anthropic's bots respect publisher signals in robots.txt files. Notably they also respond to any disallows for Common Crawl's CCBot.


Google

Google is one of the foremost AI developers, with proprietary models that have been integrated extensively into its consumer and enterprise applications, including search. It also operates Gemini as a standalone AI chatbot with real-time web access; Gemini holds around 14% of the chatbot market, with an estimated 42 million active users.

User agents:

  • Google-Extended — Used to gather data to train and improve Google's AI models. It operates independently of the crawlers that power Google's search product. Note that neither Gemini, when it requires data from the live web, nor AI Overviews (the natural-language AI responses to search queries) relies on Google-Extended for real-time data retrieval, so disallowing this bot does not control whether a publisher's content is used to inform the outputs of these products.

Robots.txt policy: Google's bots respect robots.txt signals. However, publishers do not have granular control over the use of their content in real-time for AI Overviews or Gemini's responses, as these products appear to use the data collected for Google's general search product rather than a discrete user agent. To prevent content from being used in these applications, publishers need to use the nosnippet directive or signal Google to stop indexing a page for search entirely. Both of these would have negative effects on prominence and referral traffic.
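As an illustration of the snippet controls mentioned above (the page content is hypothetical; nosnippet and data-nosnippet are standard Google robots directives), a publisher can restrict snippeting at the page or passage level:

```html
<!-- Page-level: ask Google not to show any snippet for this page -->
<meta name="robots" content="nosnippet">

<!-- Passage-level: exclude only a specific section from snippets -->
<p data-nosnippet>
  This paragraph will not be shown in Google search snippets.
</p>
```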

Publisher partnerships: Whilst Google has signed a $60M AI content licensing deal with Reddit, the large number of partnerships it has with publishing businesses are built around its News Showcase product, rather than content for AI. As competition authorities examine its conduct around Gemini and search, it may soon have to start securing explicit authorization for access to content to fuel these products.


Meta

Meta has developed the Llama family of open-source AI models. These are used in Meta's own products and extensively across third-party applications. Whilst historically it has relied on external datasets for model training, it has recently launched a new web crawler to collect data for its LLMs.

User agents:

  • Meta-ExternalAgent — Described as crawling the web for 'use cases such as training AI models or improving products by indexing content directly'. Launched in summer 2024.
  • Meta-ExternalFetcher — Accesses websites in real-time in response to actions by users. Meta is transparent that this crawler may bypass robots.txt on account of its crawls being user-initiated.
  • FacebookBot — Previously described as crawling public web pages to improve language models for speech recognition technology. It has since removed this description from its developer site and it is not known whether the bot is now being used for different purposes.

Robots.txt policy: Meta's crawlers respect robots.txt signals, although the Meta-ExternalFetcher bot may bypass the protocol because it performs crawls that were user-initiated.

Publisher partnerships: Meta has signed an AI licensing deal with just one publisher — Reuters.


Apple

Apple has been investing in AI — including the development of its own foundation models — to enhance its products, particularly in privacy-centric applications and on-device AI capabilities.

User agents:

  • Applebot-Extended — Apple's primary user agent is AppleBot. This is used to collect data to feed into a variety of user products in the Apple ecosystem, including Spotlight, Siri and Safari. Applebot-Extended is a secondary user agent that allows publishers to opt-out of their content being used to train Apple's foundation models. Applebot-Extended does not crawl webpages; it is only used to determine how Apple can use the data crawled by the primary Applebot user agent.

Robots.txt policy: Applebot-Extended respects robots.txt directives, allowing website owners to control the use of their content for AI training.

Publisher partnerships: Apple has not publicly announced any AI content licensing deals at time of writing but is reportedly in negotiations with a number of publishers including Condé Nast and NBC News.


Amazon

As well as a strategic partnership with Anthropic, Amazon has developed its own 'Nova' family of AI models which emphasize speed and value.

User agents:

  • AmazonBot — Described as being used to improve services, 'such as enabling Alexa to answer even more questions for customers'. There are no published details of how it operates, or what the data it captures is used for.

Robots.txt policy: AmazonBot respects standard robots.txt rules.

Publisher partnerships: Amazon is reportedly in licensing negotiations with a number of news outlets for access to content that will give a revamped Alexa the ability to answer questions about current events.


ByteDance

At the time of writing ByteDance, owner of TikTok, has no live AI applications but is widely reported to be developing its own foundation models.

User agents:

  • Bytespider — Has been scraping the web at a high rate since it first appeared in early 2024. ByteDance has published no information on the function it serves or what the data collected is being used for.

Robots.txt policy: ByteDance has no published robots.txt policy. There are widespread reports of Bytespider ignoring robots.txt signals.


Other Notable AI User Agents

  • DuckAssistBot — DuckDuckGo's web crawler that crawls pages in real-time to source information for answers by DuckAssist. Data collected is not used to train AI models and it respects robots.txt signals.
  • Timpibot — Web crawler for Timpi, a decentralized search index, accessible for a cost to businesses. Data is also available to AI developers for model training.
  • YouBot — Crawler for You.com, an AI-powered search engine that integrates AI query responses alongside conventional search links.
  • Diffbot — Web crawler focused on extracting data from web pages which it then converts into structured datasets for businesses and developers. Fully respects robots.txt directives.