Two governments, one message: publishers need more control

For decades, the web ran on an honor system. A file called robots.txt sat at the root of your site and asked "disallowed" bots to stay out. Bots that respected it were welcome, and bots that ignored it faced no consequences. It was a handshake, and a handshake only works when both parties take part.
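To make the honor system concrete, here is a minimal sketch using Python's standard urllib.robotparser. The robots.txt contents and the crawler name ExampleBot are made up for illustration. The point is that the check runs on the crawler's side: nothing happens if a bot skips it or announces a different name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration; "ExampleBot" is a made-up crawler name.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Allow: /
"""


def may_fetch(user_agent: str, url: str) -> bool:
    """Return what robots.txt asks of this user agent. Nothing enforces the answer."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


# A well-behaved crawler checks first and stays out when told to.
print(may_fetch("ExampleBot", "https://example.com/article"))   # False
# The same crawler posing as a browser falls under the wildcard rule instead.
print(may_fetch("Mozilla/5.0", "https://example.com/article"))  # True
```

The second call shows why the file alone cannot deter a stealth crawler: the rules key off a name the visitor chooses to report.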

Governments are now stepping in to ensure publishers and websites get more control.

What happened

Within days of each other, policymakers in New York and the UK took aim at different parts of the same problem. One measure governs how AI accesses publisher content, and the other governs how AI uses it.

In New York, the State Senate passed the Stealth Crawler Prohibition Act (A.11292 / S.9934A). It targets a specific bad actor: crawlers that hide their identity to pull news and broadcast content without permission or payment. The bill requires any individual or company operating a crawler, including AI systems, to disclose in the user-agent the product and company behind the crawler and its stated purpose. It also grants news organizations a private right of action, allowing publishers to sue.

Any publisher qualifies as long as they produce original journalism with real people and real investment behind it, publish or update monthly with a corrections process, and have at least 1,000 monthly active viewers, listeners, users, or subscribers in New York. The crawler does not have to be based in New York either. The bill now heads toward the governor.

So even though it is a New York state bill, its reach could be broad: any publisher with at least 1,000 monthly active users in New York could pursue stealth scrapers and bots. In that regard, it is reminiscent of California's Prop 65 warnings: a state-specific law that changed packaging in every state because retail is so widely distributed.

In the UK, the Competition and Markets Authority (CMA) imposed what it calls a publisher conduct requirement on Google, the first use of the strategic market status powers it gained over general search. Issued June 3, it does far more than let publishers say no. Google now has to give them working controls over how their search content feeds its AI overviews and answers, publish plain explanations of how that content gets used, attribute it clearly with a real path back to the source, and give publishers detailed metrics on how users engage with their work inside AI features.

Why the Stealth Crawler Prohibition Act can work

New York's bill is a monumental step forward for publishers, and an easy ask for bots and agents. Stop hiding, disclose that you are a crawler, and do not mask your identity to access a publisher's content. Identification via user-agent string is a trivial technical task and is already standard practice for many of today's "well-behaved" bots. We believe this is a simpler, globally scalable form of regulation that can work across jurisdictions without relitigating copyright. TollBit called for this form of regulation in its State of the Bots report, as did IAB CEO David Cohen.
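As a rough illustration of how small the technical ask is, the sketch below builds and parses a self-identifying user-agent that carries a product, a company, and a purpose. The string format, the CrawlerIdentity class, and the parse_identity helper are all hypothetical; the bill's exact disclosure format is not described here, so treat this as one possible shape rather than the required one.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CrawlerIdentity:
    """Hypothetical disclosure fields: product, company, and crawl purpose."""
    product: str
    company: str
    purpose: str

    def user_agent(self) -> str:
        # Assumed format for illustration only, not one mandated by the bill.
        return f"{self.product} (company={self.company}; purpose={self.purpose})"


_PATTERN = re.compile(
    r"^(?P<product>[^\s(]+) \(company=(?P<company>[^;]+); purpose=(?P<purpose>[^)]+)\)$"
)


def parse_identity(user_agent: str) -> Optional[CrawlerIdentity]:
    """Return the disclosed identity, or None if the string does not disclose one."""
    match = _PATTERN.match(user_agent)
    if match is None:
        return None
    return CrawlerIdentity(match["product"], match["company"], match["purpose"])


bot = CrawlerIdentity("ExampleBot/1.0", "Example Corp", "ai-training")
print(bot.user_agent())
print(parse_identity(bot.user_agent()) == bot)
# A browser-style string discloses nothing the site can act on.
print(parse_identity("Mozilla/5.0 (Windows NT 10.0; Win64; x64)"))
```

On the crawler side this amounts to setting one request header; on the site side it gives a stable name to key decisions on.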

It's simple, and that's why it can work. The field is tilted against website owners today: without legislation requiring bots to identify themselves correctly, site owners incur enormous costs trying to identify them. You cannot track, allow, block, or charge a visitor you cannot identify. The bill fixes this by turning the burden of identifying bad bots from a cost borne by the site owner into a liability for the bot owner.

Identification is exactly what today's scrapers are built to prevent. When TollBit ran commercial scraping services against 30 high-authority publisher sites, none fetched robots.txt before taking the page, and every one wore a current Chrome user-agent to pass as a human reader. Several faked a Google search referrer so the request looked like organic traffic. These companies are hiding their identity because identity creates leverage for sites.

Unfortunately, this affects more than traditional publishers. Any website whose value depends on proprietary content or data faces the same problem. In a recent TechCrunch article, Strava, the fitness and social running app, described the effects AI scraping had on its site. Its CEO specifically mentioned that one AI company routed its scraping through aggregator services to hide its identity. Across all sites, bots and agents should have to identify themselves before visiting.

If you force these bots to identify themselves, the publishers and websites will regain control of their own content. Once a bot has to say who it is, sites can make real decisions on what to do next.

With infrastructure like TollBit, sites get granular control over identified bot and agent traffic: allow the crawlers you want, block the ones you don't, and route the rest through a bot paywall that charges for access. All AI traffic gets rerouted to an agent version of your site where those controls are enforced.
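The allow, block, or charge decision described above can be sketched in a few lines once a bot has a name. This is an illustrative sketch and not TollBit's implementation: the bot names, the ALLOWED and BLOCKED sets, and the choice of HTTP 402 (Payment Required) for the paywall case are all assumptions.

```python
from enum import Enum
from http import HTTPStatus


class Decision(Enum):
    """Possible outcomes for an identified bot, paired with an assumed HTTP status."""
    ALLOW = HTTPStatus.OK                 # 200: serve the content
    BLOCK = HTTPStatus.FORBIDDEN          # 403: refuse access
    CHARGE = HTTPStatus.PAYMENT_REQUIRED  # 402: route to the paywall


# Hypothetical site policy; the names are placeholders, not real crawlers.
ALLOWED = {"examplesearchbot"}
BLOCKED = {"examplescraper"}


def decide(bot_name: str) -> Decision:
    """Allow the bots you want, block the ones you don't, charge the rest."""
    name = bot_name.lower()
    if name in ALLOWED:
        return Decision.ALLOW
    if name in BLOCKED:
        return Decision.BLOCK
    return Decision.CHARGE


for bot in ("ExampleSearchBot", "ExampleScraper", "UnknownAgent"):
    decision = decide(bot)
    print(bot, decision.name, int(decision.value))
```

The lookup only works on a name the bot has honestly declared, which is why identification has to come first.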

If you own a site

Between these two developments, it's clear that the wide-open crawl era is taking a hit. Regulation has set the principle that consent matters and publishers deserve more visibility and control.

We anticipate regulators will catch on to the harm the long tail of agentic traffic is doing to sites, harm that falls not just on traditional publishers but also on companies like Strava. Ultimately, bots and agents should not be allowed to mimic humans across the Internet.

The part that makes this actionable, setting the terms on which bots come in, is something sites can put in place now via infrastructure like TollBit. Curious to understand your AI traffic? Want to test your site's defenses against stealth web scrapers? Want to start preparing rate cards for AI accessing your content? Sign up for TollBit for free.
