State of the Bots
2024 Q4, The First Issue
Executive Summary
Despite AI search companies' claims, TollBit data shows that AI bots send, on average, 95.7% less referral traffic than traditional Google search.
AI bot scraping as a percentage of all traffic to sites more than doubled (increased by approximately 117%) from Q3 to Q4 2024.
When sites block Perplexity, we see that it continues to send referrals, which means it appears to be continuing to scrape those sites under the radar.
AI bot scrapes that bypassed robots.txt grew by over 40% between Q3 and Q4. Blocking AI bots via robots.txt remains an insufficient mechanism to prevent unwanted scraping.
Unidentified user agent scraping (aka hidden scraping) was, per website, almost equivalent to the average scrapes from identified AI bots in Q4.
Foreword
We are excited to be releasing this, the first TollBit State of the Bots Report.
The adoption of artificial intelligence (AI) over the last two years has occurred at a lightning pace. Consumer habits are reforming around this technology which, we believe, carries the potential to make our lives better in a multitude of ways.
One such area of adoption relates to how we retrieve information. The advent of the internet revolutionized this; at the click of a button we could access practically all the knowledge humankind has created. Artificial intelligence is taking this further. We can now task an AI system with a highly-specific information request and it will do the work for us, sifting through limitless sources and providing us with what we're looking for.
These information-based AI use cases have the potential to disrupt the current business models of publishers. It looks highly likely that in future we will all spend more time interacting with AI applications and that, in some instances, this will supplant our current interactions with search engines and publisher websites.
We are optimistic about what this means for content creators though. Our thirst for knowledge is not going to be diminished as a result of this technology. If anything, we believe that the ease with which AI can allow us to access information will create new avenues for inquiry. Unconstrained by time, screen space and human reading speed, as AI consumes knowledge on our behalf the total volume of information consumed will far eclipse today's levels. These systems are nothing without the materials they were trained on and the content they access in real time for retrieval-augmented generation (RAG). Professional media will be more important than ever; the rate at which AI systems are accessing the digital properties of our publishing partners — as is seen in the first TollBit Artificial Intelligence User Agent Index — is evidence of this.
Much like previous platform shifts — such as that from print to digital, or web 1 to the social internet — in order to succeed in the age of AI, publishers will need to adapt their strategy. As well as delivering deeply engaging experiences on their owned-and-operated platforms, an AI licensing revenue line will allow content businesses to monetize their intellectual property when it is used in an AI setting.
This is our mission at TollBit. We believe that a vibrant, liquid and fair market for AI access to content needs to develop. And we have a product in market where transactions are flowing between both sides. Our vision is for a world in which professional content businesses can set the price and terms on which they will provide content to AI applications and a frictionless transaction takes place.
The starting point for this is transparency. We're in the nascent days of AI and the race to build and release models has led to practices that undermine rights holders and threaten the economics of content creation. We're launching this index in an effort to begin creating transparency and, ultimately, to help this licensing market develop. We also want publishers to understand the value of their content to AI systems and the scale of the demand for it.
Toshit Panigrahi and Olivia Joslin
Co-founders, TollBit
Section 1: The AI User Agent Landscape
Since TollBit's technology went live at the beginning of April 2024, it has been gathering data on the rise of user agents serving artificial intelligence (AI) applications.
In this, the first TollBit Artificial Intelligence User Agent Index, we are providing an overview of the AI user agents that are currently active, what it is believed they do and a short explanation of the user-facing tools they serve.
Note that this only includes those user agents that access websites on behalf of AI applications, not the many other bots that are active across the web serving other purposes, e.g. search indexing. It has also been reported that some AI applications are served by 'unofficial' user agents. These operate opaquely, so they have not been included individually, although they are discussed at the end of this section.
User agent — A user agent is any piece of software that sends requests to websites. This includes browsers (which do so on behalf of the user operating them) and bots, which send requests automatically, typically for indexing or data-mining. Each user agent identifies itself to the website via a 'string' (a piece of metadata which is read by the site's servers), called a 'User-Agent Header'.
Bot — A bot is any type of user agent that is not directly instructed by a human to access that website. Bots include AI user agents that operate independently to retrieve information from a website following a human prompt to an AI application. They also include crawlers.
Crawler — A crawler is a type of bot that navigates websites in the background to collect data. These are used for indexing by search engines and data mining for research, analysis, or more recently to gather content for training large language models. Crawlers operate autonomously, typically following links across the web to discover and retrieve content at scale.
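To make this concrete, below is a minimal Python sketch, using an illustrative log line rather than real TollBit data, of how a site's servers can read the User-Agent header a visitor presents and check it against a list of known AI bot names:

```python
# A User-Agent header is a self-declared string; servers read it to decide
# how to classify (or whether to trust) the visitor. Log line is illustrative.
log_line = ('203.0.113.7 - - [12/Dec/2024:10:01:22 +0000] '
            '"GET /article HTTP/1.1" 200 '
            '"Mozilla/5.0 (compatible; GPTBot/1.2; +https://openai.com/gptbot)"')

KNOWN_AI_AGENTS = ("GPTBot", "ChatGPT-User", "PerplexityBot", "ClaudeBot")

user_agent = log_line.rsplit('"', 2)[-2]  # last quoted field in this log format
is_ai_bot = any(name in user_agent for name in KNOWN_AI_AGENTS)
print(user_agent, "->", "AI bot" if is_ai_bot else "other")
```

Because the string is self-declared, this kind of matching only identifies bots that choose to announce themselves; section 1.2 discusses those that do not.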
1.1 User Agent Profiles
The table below lists the AI user agents currently active, together with what their purpose is believed to be. There are three primary types of agent, with distinct functions in the development and operation of AI applications:
Retrieval augmented generation (RAG) agent — These bots retrieve information in real-time to respond to user prompts. They use an index of the web, gathered by an indexing crawler (either proprietary or third-party such as Bing or Google) to locate the relevant content which is then retrieved and synthesized into a response.
Training data crawling agent — Large language models — such as Llama from Meta or GPT-4o from OpenAI, both of which power a multitude of consumer applications — are trained on vast quantities of data. Training data crawlers move around the web — following links from website to website or working through sitemaps — downloading content which is then processed and stored for offline use.
AI search indexing agent — AI systems with access to the real-time web need an index of the internet. This is used to direct RAG agents to the right sources when collecting data needed for responses to prompts. AI search indexing crawlers build these indexes by systematically navigating the web, collecting and organizing content and metadata.
| User agents | Operated by | Type |
|---|---|---|
| ChatGPT-User | OpenAI | RAG agent |
| OAI-SearchBot | OpenAI | RAG agent |
| GPTBot | OpenAI | Training data crawling agent |
| PerplexityBot | Perplexity | Hybrid RAG and AI search indexing agent |
| ClaudeBot | Anthropic | Training data crawling agent |
| Claude-Web | Anthropic | Unknown |
| anthropic-ai | Anthropic | Unknown |
| Google-Extended | Google | Control agent for training data |
| Meta-ExternalAgent | Meta | Training data crawling agent |
| Meta-ExternalFetcher | Meta | RAG agent |
| FacebookBot | Meta | Unknown |
| Applebot-Extended | Apple | Control agent for training data |
| AmazonBot | Amazon | Training data crawling agent |
| Bytespider | ByteDance | Training data crawling agent |
| DuckAssistBot | DuckDuckGo | RAG agent |
| Timpibot | Timpi | Training data crawling agent |
| YouBot | You.com | RAG agent |
| Diffbot | Diffbot | Training data crawling agent |
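Operationally, distinguishing these agents comes down to matching the user agent string against a known list. A minimal sketch using a few entries from the table above (a simplified illustration; production bot detection also verifies IP ranges, as discussed in section 1.2):

```python
# Agent types taken from Table 1.1; real headers wrap the bot name in a
# longer string, so a substring match is used.
AGENT_TYPE = {
    "ChatGPT-User": "RAG agent",
    "OAI-SearchBot": "RAG agent",
    "GPTBot": "training data crawling agent",
    "PerplexityBot": "hybrid RAG and AI search indexing agent",
    "ClaudeBot": "training data crawling agent",
    "Bytespider": "training data crawling agent",
}

def classify(user_agent: str) -> str:
    for name, kind in AGENT_TYPE.items():
        if name.lower() in user_agent.lower():
            return kind
    return "unknown"

print(classify("Mozilla/5.0 (compatible; OAI-SearchBot/1.0; +https://openai.com/searchbot)"))
```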
1.2 Third-party and Hidden User Agents
This index only covers the official, first-party user agents operated by AI developers. There are alternative automated means of accessing information from the internet which are harder for a website publisher to uncover.
Firstly, it is possible for AI developers to obtain data crawled by third-party user agents. Some of these datasets are free and publicly available, such as Common Crawl, whilst others are available for a fee, such as Timpi's search index. In these instances the end-user is unknown to the website owner at the time of scraping and indeed may only access the data some time after the bot crawled their site.
It is also possible for AI developers to use hidden or masked user agents which do not make their real operators known through the user agent string. This practice is technically challenging to uncover, but it has been reported to be deployed at scale by major AI companies. Cybersecurity tools can assist with detecting such activity; TollBit works with cybersecurity firms to bring these capabilities to its publishing partners.
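One common verification technique, sketched below in Python purely for illustration (it is not a description of TollBit's or its partners' proprietary methods), is the reverse-then-forward DNS check: an impostor can copy a bot's user agent string, but it cannot fake the operator's DNS records.

```python
import socket

def verify_bot_ip(ip: str, operator_suffixes: tuple[str, ...]) -> bool:
    """Reverse-DNS the connecting IP, then forward-confirm the hostname."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)        # reverse lookup (PTR)
    except socket.herror:
        return False                                     # no PTR record at all
    if not hostname.endswith(operator_suffixes):
        return False                                     # not the operator's domain
    try:
        confirmed_ips = socket.gethostbyname_ex(hostname)[2]  # forward lookup
    except socket.gaierror:
        return False
    return ip in confirmed_ips                           # must round-trip to same IP

# IP and domain suffix below are illustrative; check each operator's
# published documentation for its verified hostnames and IP ranges.
print(verify_bot_ip("203.0.113.7", (".openai.com",)))
```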
Section 2: Scale of AI Scraping
As AI models and applications proliferate — and users turn to AI technologies in growing numbers for a wide range of use cases — the scale of AI user agent web access will increase. The level of AI scraping can be seen as a proxy for the rise of AI itself. The TollBit Artificial Intelligence User Agent Index will chart the rise of this activity across publisher properties.
For this analysis, we looked at the scale of AI scraping across all onboarded TollBit sites in Q4 (section 2.A.) and we looked at a smaller select cohort to see the change in AI bot traffic from Q3 to Q4 (section 2.B.).
2.A All Sites Aggregate Scraping Levels
We analyzed all sites that had fully set up TollBit analytics prior to Q4. On average, these sites were scraped over 2 million times in Q4, at an average rate of approximately 7 scrapes per page.
Table 2.A.1. Average AI user agent scraping rates
| Metric | Q4 2024 |
|---|---|
| Average scraping rate per website | 2,056,658 |
| Average scraping rate per page | 7.199 |
Deals and shopping content were scraped most times per page in Q4, followed by national news, consumer tech, and lifestyle content.
Figure 2.A.2. Scraping levels per page by content category in Q4
2.A.3 User Agent Per Page Scraping Levels
The scale of scraping can also be analyzed at an individual user agent level. A note of caution on this data, however; the use of third-party or hidden user agents is likely to mask the true scale of scraping by some AI developers.
Table 2.A.3. User agent per page scraping levels for Q4
| User agent | Q4 total scrapes / page |
|---|---|
| ChatGPT-User | 64.63 |
| FacebookBot | 16.79 |
| PerplexityBot | 16.75 |
| meta-externalagent | 6.86 |
| Timpibot | 5.51 |
| OAI-SearchBot | 5.36 |
| omgili | 5.05 |
| Bytespider | 4.43 |
| DuckAssistBot | 4.42 |
| GPTBot | 4.24 |
| meta-externalfetcher | 4.00 |
| cohere-ai | 2.80 |
| CCBot | 2.66 |
| ClaudeBot | 2.59 |
| anthropic-ai | 2.25 |
| Amazonbot | 1.05 |
2.A.4 AI Bot Traffic Breakdown
The table below breaks down the distribution of AI bot traffic, showing which bots contribute the most to total AI activity. ChatGPT-User accounts for the largest share at 15.60% of all AI bot traffic, followed by Bytespider (12.44%) and Meta-ExternalAgent (11.34%).
Table 2.A.4. AI bot share of total AI traffic
| AI Bot | Q4 Share |
|---|---|
| ChatGPT-User | 15.60% |
| Bytespider | 12.44% |
| Meta-ExternalAgent | 11.34% |
| OAI-SearchBot | 10.81% |
| GPTBot | 10.32% |
| DuckAssistBot | 9.37% |
| ClaudeBot | 8.62% |
| PerplexityBot | 7.79% |
| CCBot | 4.82% |
| Timpibot | 4.39% |
| AmazonBot | 3.83% |
| omgili | 0.60% |
Figure 2.A.5. AI Bot Traffic as a Percentage of Total Traffic to Sites
AI bot traffic as a percentage of total web traffic shows a clear upward trend over time.
Evidence of the use of third-party or unidentified AI user agents can be clearly seen in TollBit data. To give an example, the referral rate from Perplexity to multiple domains exceeds the scraping rate. The most plausible explanation for this is that the RAG scrapes that surface the links which generate the referral traffic came from bots that did not identify as 'PerplexityBot' in their user agent string.
Table 2.A.6. Linkless traffic from Perplexity
| Domain | Total scrapes in Q3 and Q4 2024 | Total referrals in Q3 and Q4 2024 |
|---|---|---|
| Domain A | 937 | 7,203 |
| Domain B | 545 | 10,382 |
| Domain C | 577 | 1,920 |
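The inference can be made explicit: when an application's referrals to a domain outnumber the scrapes made under its declared user agent, the retrieval that surfaced those links must have happened under other identities. A minimal sketch using the figures from Table 2.A.6:

```python
# Figures from Table 2.A.6; a ratio above 1.0 means more referrals arrived
# than PerplexityBot-identified scrapes occurred, so the links must have
# been gathered under other user agent strings.
domains = {
    "Domain A": {"scrapes": 937, "referrals": 7_203},
    "Domain B": {"scrapes": 545, "referrals": 10_382},
    "Domain C": {"scrapes": 577, "referrals": 1_920},
}
for name, d in domains.items():
    ratio = d["referrals"] / d["scrapes"]
    verdict = "unidentified scraping likely" if ratio > 1 else "consistent"
    print(f"{name}: {ratio:.1f} referrals per declared scrape -> {verdict}")
```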
Many scrapers do not announce their identity via the user agent string when scraping content from websites. The average number of hidden scrapes per website in Q4 was 1.886 million (excluding SEO bots), almost equivalent to the 2 million average scrapes per website from identified AI bots. With hidden scrapers operating at nearly the same scale as identified AI bots, a monetization model is needed that works as a true ecosystem solution: one that incentivizes AI bots to identify their user agents and to pay in scenarios where their scraping directly results in a loss of traffic.
2.B Cohort Analysis of AI Bot Scraping Levels
We analyzed a cohort of websites that joined TollBit before Q3. These websites were selected because their bot-blocking strategies remained constant during Q3 and Q4 so that we could see the cleanest view of how AI bot traffic evolved over the two quarters. Our analysis found that AI bot scraping activity increased significantly between these quarters in 2024.
On average, these sites were scraped 5.05 million times per website in Q4, up from 3.13 million in Q3 — an increase of more than 61%. The average number of scrapes per page also rose from 4.2 to 5.0, reflecting a 19% increase in per-page scraping.
Table 2.B.1. Average AI user agent scraping rates Q3 vs Q4
| Metric | Q3 2024 | Q4 2024 |
|---|---|---|
| Average scraping rate per website | 3,127,608 | 5,050,539 |
| Average scraping rate per page | 4.2 | 5.0 |
Figure 2.B.1. Average per-page, per-website AI user agent scraping rates
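These headline growth figures follow directly from Table 2.B.1; a quick arithmetic check:

```python
# Quarter-over-quarter growth from Table 2.B.1
q3_site, q4_site = 3_127_608, 5_050_539
q3_page, q4_page = 4.2, 5.0
print(f"per website: {q4_site / q3_site - 1:+.1%}")  # -> +61.5%
print(f"per page:    {q4_page / q3_page - 1:+.1%}")  # -> +19.0%
```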
2.B.2 Scraping Levels by Bots
A closer look at AI bot activity within the cohort analysis reveals substantial differences in scraping patterns among various AI user agents. ChatGPT-User exhibited the most aggressive growth, increasing its scraping activity by 6,767.60% quarter-over-quarter. Other notable increases include Timpibot (+1,323.58%), ClaudeBot (+566.79%), and DuckAssistBot (+248.64%).
Figure 2.B.2. Q3 vs Q4 percentage change in total scraping by AI bots
2.B.3 User Agent Per Page Scraping Levels (Q3 vs Q4)
Table 2.B.3. User agent per page scraping levels for Q3 vs Q4
| User agent | Q3 total scrapes / page | Q4 total scrapes / page | Percentage Change |
|---|---|---|---|
| OAI-SearchBot | 3.2 | 18.0 | 461.65% |
| FacebookBot | 1.3 | 4.0 | 210.32% |
| PerplexityBot | 2.9 | 5.0 | 72.91% |
| ChatGPT-User | 26.6 | 35.5 | 33.63% |
| Timpibot | 2.2 | 2.8 | 30.75% |
| omgili | 4.2 | 5.1 | 22.93% |
| DuckAssistBot | 2.1 | 2.6 | 22.64% |
| Ai2Bot | 1.6 | 1.9 | 22.62% |
| ClaudeBot | 1.9 | 2.2 | 17.99% |
| Diffbot | 2.2 | 2.5 | 14.02% |
| CCBot | 1.2 | 1.3 | 8.01% |
| Bytespider | 9.8 | 8.3 | -15.38% |
| Amazonbot | 3.3 | 2.6 | -22.08% |
| meta-externalagent | 5.3 | 3.2 | -39.72% |
| GPTBot | 16.7 | 3.0 | -82.15% |
This user-agent level data suggests that bots operating in real time to retrieve information in response to user prompts are growing the fastest, a clear signal of the increasing use of AI for information retrieval.
This real-time user agent activity is likely to substitute for human visits to publisher websites. For example, an AI agent may access five media websites in order to respond to a prompt, satisfying the user's need without them visiting another online property. Prior to the advent of AI, this may have resulted in multiple Google searches and visits to websites.
Licensing revenues aside, there are no means of monetizing non-human visitors to a website; advertisements are not served and robots do not purchase subscriptions.
2.B.4 AI Bot Traffic as a Percentage of Total Traffic
AI bot traffic from the top 15 AI bots as a percentage of total traffic grew by 117% between Q3 and Q4 2024 for this cohort of websites.
Figure 2.B.4 Growth in AI Bot Traffic as a Percentage of Total Traffic
Section 3: Referrals from AI Applications
AI applications with access to the real-time web, such as Perplexity, ChatGPT's browsing mode and DuckDuckGo's AI assistant — as well as new entrants into this market, such as ProRata with their Gist product — incorporate a natural language response to a query alongside a list of sources and links which the user can click on if they wish to explore further.
The level of traffic that publishers can expect from these interfaces is a topic of debate. TollBit has been collecting data on traffic to its publisher partners' properties from AI applications since its technology went live at the beginning of April 2024.
Whilst the total volume of referrals will vary based upon the number of TollBit publisher partners and the usage of AI applications, the referral rate — how many referrals occur each time a piece of content is scraped in real-time to answer a user query — gives a signal of the effectiveness of AI applications in driving traffic to publishers.
The TollBit AI User Agent Index includes an averaged referral rate across the AI applications that access publisher data in real time and provide traceable traffic back to publishers. In Q4 2024 this referral rate was 0.37%.
This figure should be considered an absolute maximum as some AI developers use third-party or hidden user agents and each scrape may result in multiple presentations of content and links.
Table 3.1. Average AI application referral rate
| Q4 Real-time scrapes | Q4 Total referrals | Q4 Referrals per real-time scrape |
|---|---|---|
| 163,365,722 | 605,179 | 0.37% |
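The referral rate is simply total referrals divided by real-time scrapes; the Q4 figure can be reproduced from Table 3.1:

```python
# Referral rate = total referrals / real-time scrapes (Table 3.1)
scrapes, referrals = 163_365_722, 605_179
print(f"{referrals / scrapes:.2%}")  # -> 0.37%
```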
3.2 Comparing Referral Rates of AI Interfaces and Conventional Search
The AI applications with access to the real-time web, and therefore capable of driving traffic to a publisher website, fall into two broad categories of user interface: general-purpose chatbot and AI search engine. TollBit data finds that AI search engines deliver a superior referral rate of 0.74% per scrape, more than double the 0.33% rate for chatbots. Again, these figures should be considered a maximum; the use of third-party and hidden user agents may be artificially inflating the rate for AI search products.
These rates remain extremely low when compared to referrals from a conventional (non-AI) Google search engine results page (SERP). Even against the average click-through rate across the top 10 organic search results (8.63%), AI search engine interfaces deliver 91% fewer referrals and chatbots 96% fewer.
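The 'fewer referrals' figures follow from comparing each interface's per-scrape rate against the 8.63% SERP click-through benchmark:

```python
# Relative shortfall vs the top-10 organic Google SERP CTR of 8.63%
serp_ctr = 8.63
for label, rate in {"AI search engine": 0.74, "Chatbot": 0.33}.items():
    print(f"{label}: {1 - rate / serp_ctr:.0%} fewer referrals")
# -> AI search engine: 91% fewer; Chatbot: 96% fewer
```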
Figure 3.2. Comparing AI interface and Google SERP referral rates
These figures give an indication of the change in traffic levels publishers can expect if user queries move from conventional search to AI applications. They also starkly highlight the difference in the value for publishers when providing their content for use by AI applications versus conventional search. The return cannot be expected to come in the form of traffic from AI applications. The alternative is that it comes from licensing revenues. Our mission at TollBit is to help publishers realize this opportunity.
Section 4: Compliance with the Robots Exclusion Protocol
Whilst every AI developer with a published policy claims its crawlers respect the robots exclusion protocol, TollBit data finds that in many instances bots continue scraping despite explicit disallow requests for those user agents in publishers' robots.txt files.
Some of this activity may be accounted for by the delay between a robots.txt file being updated by the publisher and the user agent adapting its activity accordingly. Most AI developers' documentation states that it takes 24-72 hours for changes to be registered and reflected.
Table 4.1. Aggregate unauthorized scraping levels — all user agents
| Metric | Q4 bypassed scrapes | Q4 total scrapes | Q4 percent of scrapes that bypass |
|---|---|---|---|
| Aggregate scraping levels | 9,713,100 | 294,102,061 | 3.3% |
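Publishers who want to confirm what their own robots.txt file actually tells a given bot can use the parser in Python's standard library; a minimal sketch (domain and path are illustrative):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://example.com/robots.txt")
rp.read()  # fetch and parse the live file

# True only if robots.txt permits this agent to fetch this path. Note that
# robots.txt is advisory: a False here expresses the publisher's intent,
# it does not technically block the bot.
print(rp.can_fetch("GPTBot", "https://example.com/articles/some-story"))
```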
To understand AI demand for your content, TollBit's analytics solution provides granular data on AI user agent activity across your websites. It costs nothing and is simple to implement. To find out more, please visit www.tollbit.com or email us at team@tollbit.com.
Appendix: AI User Agent Profiles
AI Bot Profiles by Operating Organization
This index also refers to the Robots Exclusion Protocol, or 'robots.txt'. This mechanism allows website owners to give instructions to bots about accessing a website. It uses a machine-readable file (named robots.txt) which specifies — for individual bots or all bots collectively — which pages or sections they can or cannot crawl. Robots.txt operates simply as a signal, though, and does not actively block access. Not all developers program their bots to comply with these instructions.
OpenAI
OpenAI's ChatGPT chatbot was first to market and still holds a ~60% share of the chatbot market, with 300M weekly active users as of December 2024. Its foundation models — and therefore its user agents — power both ChatGPT and an array of third-party applications, including Microsoft's Copilot and the Bing search engine.
User agents:
- ChatGPT-User — Accesses websites in real-time and on-demand so that ChatGPT can formulate responses to user prompts based on the live web. It visits websites to gather information, then processes, summarizes and synthesizes this to provide the output. It is not used to gather data for model training.
- OAI-SearchBot — Used to power ChatGPT's search capabilities. Similar to ChatGPT-User, it accesses the internet in real-time but is optimized for search scenarios, delivering the raw links and search results alongside summaries.
- GPTBot — Gathers data from the web for the training of OpenAI's large language models. It operates continuously in the background and the data it gathers is collected and used offline for model development, rather than real-time responses to user prompts.
Robots.txt policy: OpenAI respects the signals provided by content owners via robots.txt, allowing them to disallow any or all of its crawlers.
Publisher partnerships: OpenAI has an extensive publisher partnerships program, having signed bilateral deals with ~35 publishers at the time of writing.
Perplexity
Perplexity has developed an AI answer engine, effectively AI-powered search that provides users with a natural language response to a prompt alongside a list of links and sources. It has 15 million active users. The standard free product primarily uses OpenAI's GPT-3.5 model, whereas the premium, paid-for service includes access to a range of models including Claude 3.5 from Anthropic, Llama 3 from Meta and Grok-2 from xAI.
User agents:
- PerplexityBot — Gathers data from across the web to index it for Perplexity's search function. It also acts in real-time, gathering data to respond to specific user queries as they are placed.
Robots.txt policy: Perplexity claims that its web crawler respects robots.txt. However, there have been widely-reported complaints from publishers that it has ignored their signals, leading to an investigation from Amazon, its cloud provider. In its FAQs, Perplexity explains that 'if a page is blocked, we may still index the domain, headline, and a brief factual summary'.
Publisher partnerships: In 2024 Perplexity launched a publisher partner program under which it provides a share of advertising revenues generated by responses that are based upon the content of media partners. Around 20 deals have been signed so far, mostly (although not exclusively) with smaller or niche publishers. The revenue shared with publishers is reported to be capped at 25%.
Anthropic
Founded by seven former OpenAI employees, Anthropic is an AI developer with a focus on privacy, safety and alignment (with human values). Its foundation models power its Claude chatbot — which has a free and premium version — and a multitude of third-party applications.
User agents:
- ClaudeBot — Used to gather data from the internet for AI training. The Claude chatbot does not have access to the real-time web and has no search functionality; the data ClaudeBot retrieves is only used offline for the development of new large language models.
- Claude-Web — There is no published information on the function of this bot. It may be experimental or another identifier — used in specific circumstances — for Anthropic's primary user agent ClaudeBot.
- anthropic-ai — As with Claude-Web, there is no published information on this bot or its function.
Robots.txt policy: Anthropic's bots respect publisher signals in robots.txt files. Notably they also respond to any disallows for Common Crawl's CCBot.
Google
Google is one of the foremost AI developers, with proprietary models that have been integrated extensively into its consumer and enterprise applications, including search. It also operates Gemini as a standalone AI chatbot with real-time web access. Gemini has around 14% of the chatbot market, with an estimated 42 million active users.
User agents:
- Google-Extended — Used to gather data to train and improve Google's AI models. It operates independently of the crawlers used to power Google's search product. Note that neither Gemini (when it requires data from the live web) nor AI Overviews (the natural language AI response to search queries) relies on Google-Extended for real-time data retrieval; disallowing this bot therefore does not control whether a publisher's content is used to inform the outputs of these products.
Robots.txt policy: Google's bots respect robots.txt signals. However, publishers do not have granular controls over the use of their content in real-time for AI Overviews or Gemini's responses as these products appear to use the data collected for Google's general search product rather than a discrete user agent. In order to prevent content being used for these applications publishers need to use the no-snippet directive or signal Google to stop indexing a page for search entirely. Both of these would have negative effects on prominence and referral traffic.
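For illustration, the training opt-out described above is expressed as a standard robots.txt entry; note again that this is a signal only and does not affect AI Overviews or Gemini's live retrieval:

```
# Illustrative robots.txt entry: opts content out of training Google's AI
# models, but does not control AI Overviews or Gemini's live retrieval.
User-agent: Google-Extended
Disallow: /
```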
Publisher partnerships: Whilst Google has signed a $60M AI content licensing deal with Reddit, the large number of partnerships it has with publishing businesses are built around its News Showcase product, rather than content for AI.
Meta
Meta has developed the Llama family of open-source AI models. These are used in Meta's own products and extensively across third-party applications. Whilst historically it has relied on external datasets for model training, it has recently launched a new web crawler to collect data for its LLMs.
User agents:
- Meta-ExternalAgent — Described as crawling the web for 'use cases such as training AI models or improving products by indexing content directly'. Launched in summer 2024.
- Meta-ExternalFetcher — Accesses websites in real-time in response to user actions. Meta is transparent that this crawler may bypass robots.txt on account of being user-initiated.
- FacebookBot — Previously described as crawling public web pages to improve language models for speech recognition technology. It has since removed this description from its developer site and it is not known whether the bot is now being used for different purposes.
Robots.txt policy: Meta's crawlers respect robots.txt signals although Meta-ExternalFetcher may bypass the protocol because it performs crawls that were user-initiated.
Publisher partnerships: Meta has signed an AI licensing deal with just one publisher — Reuters.
Apple
Apple has been investing in AI — including the development of its own foundation models — to enhance its products, particularly in privacy-centric applications and on-device AI capabilities.
User agents:
- Applebot-Extended — Apple's primary user agent is Applebot, which is used to collect data to feed into a variety of user products in the Apple ecosystem, including Spotlight, Siri and Safari. Applebot-Extended is a secondary user agent that allows publishers to opt out of their content being used to train Apple's foundation models. Applebot-Extended does not crawl webpages; it is only used to determine how Apple can use the data crawled by the primary Applebot user agent.
Robots.txt policy: Applebot-Extended respects robots.txt directives, allowing website owners to control the use of their content for AI training.
Publisher partnerships: Apple has not publicly announced any AI content licensing deals at the time of writing but is reportedly in negotiations with a number of publishers, including Condé Nast and NBC News.
Amazon
As well as a strategic partnership with Anthropic, Amazon has developed its own 'Nova' family of AI models which emphasize speed and value.
User agents:
- AmazonBot — Described as being used to improve services, 'such as enabling Alexa to answer even more questions for customers'. There are no published details of how it operates, or what the data it captures is used for.
Robots.txt policy: AmazonBot respects standard robots.txt rules.
Publisher partnerships: Amazon is reportedly in licensing negotiations with a number of news outlets for access to content that will give a revamped Alexa the ability to answer questions about current events.
ByteDance
At the time of writing ByteDance, owner of TikTok, has no live AI applications but is widely reported to be developing its own foundation models.
User agents:
- Bytespider — Has been scraping the web at a high rate since it first appeared in early 2024. ByteDance has published no information on the function it serves or what the data collected is being used for.
Robots.txt policy: ByteDance has no published robots.txt policy. There are widespread reports of Bytespider ignoring robots.txt signals.
Other Notable AI User Agents
- DuckAssistBot — DuckDuckGo's web crawler, which fetches pages in real-time to source information for DuckAssist answers. Data collected is not used to train AI models and it respects robots.txt signals.
- Timpibot — Web crawler for Timpi, a decentralized search index available to businesses for a fee. Data is also available to AI developers for model training.
- YouBot — Crawler for You.com, an AI-powered search engine that integrates AI query responses alongside conventional search links.
- Diffbot — Web crawler focused on extracting data from web pages which it then converts into structured datasets for businesses and developers. Fully respects robots.txt directives.