AI preferences and opt-out signals

Publishers have at least ten independent technical mechanisms for signalling that they do not want their images and articles used to train generative-AI models. They sit at different layers (site-wide files, HTTP headers, embedded image metadata) and were proposed by different bodies over a short period — so adoption is uneven and the signals do not always agree. Article 4 of the EU DSM Directive requires opt-outs to be expressed in a machine-readable form; this page measures which of the candidate machine-readable forms publishers actually use.

The IPTC's Generative AI Opt-Out Best Practice Recommendations (v2.0, March 2026) sets out the thirteen techniques the IPTC currently recommends to publishers wishing to express a data-mining opt-out. Most are measured here; the remainder (plain-language rights statements, firewall-level blocking, and opt-outs embedded in epub/PDF) are noted in the methodology section below.

Based on the most recent crawl: 502 publishers probed, 6,803 images analysed.

Adoption snapshot

| Mechanism | Detail | Sites | Share of sites (per site) |
| --- | --- | --- | --- |
| robots.txt (any AI bot blocked) | Blocks at least one known AI/scraper UA | 238 of 502 | 47.4% |
| noarchive / nosnippet meta robots | Honoured as AI-opt-out by Bing/Copilot (noarchive) and Google (nosnippet) | 79 of 502 | 15.7% |
| robots.txt (Content-Signal) | Cloudflare Content Signals (contentsignals.org) | 21 of 502 | 4.2% |
| IPTC PLUS:DataMining | XMP-plus:DataMining on ≥1 sampled image | 11 of 502 | 2.2% |
| TDMRep (per-page <meta>) | <meta name="tdm-reservation" content="1"> on ≥1 article | 7 of 502 | 1.4% |
| /.well-known/tdmrep.json | TDM Reservation Protocol (EU DSM Art. 4) | 6 of 502 | 1.2% |
| /ai.txt | Spawning AI consent proposal | 4 of 502 | 0.8% |
| RSL (License: in robots.txt) | Really Simple Licensing (rslstandard.org) | 2 of 502 | 0.4% |
| trust.txt (datatrainingallowed=no) | JournalList trust.txt opt-out directive | 2 of 502 | 0.4% |
| CAWG Training and Data Mining Assertion | cawg.training-mining assertion in ≥1 sampled image's C2PA manifest | 1 of 502 | 0.2% |
| noai / noimageai meta robots | AI-specific tokens on X-Robots-Tag or <meta name="robots"> | 1 of 502 | 0.2% |
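
As an illustration of how the per-page signals listed above might be detected, here is a minimal sketch using only the Python standard library. The class and function names (`OptOutMetaParser`, `page_signals`) are hypothetical and are not the crawler's actual code; the sketch only looks at `<meta>` tags and ignores the `X-Robots-Tag` HTTP header.

```python
from html.parser import HTMLParser

# Tokens named in the snapshot above; matching is case-insensitive.
ROBOTS_TOKENS = {"noai", "noimageai", "noarchive", "nosnippet"}


class OptOutMetaParser(HTMLParser):
    """Collects opt-out signals from the <meta> tags of one HTML page."""

    def __init__(self):
        super().__init__()
        self.tdm_reserved = False
        self.robots_tokens = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {k.lower(): (v or "") for k, v in attrs}
        name = attr.get("name", "").strip().lower()
        content = attr.get("content", "").strip().lower()
        if name == "tdm-reservation" and content == "1":
            self.tdm_reserved = True
        elif name == "robots":
            tokens = {t.strip() for t in content.split(",")}
            self.robots_tokens |= tokens & ROBOTS_TOKENS


def page_signals(html):
    """Return the opt-out signals found in one page's markup."""
    parser = OptOutMetaParser()
    parser.feed(html)
    return {
        "tdm_reservation": parser.tdm_reserved,
        "robots": sorted(parser.robots_tokens),
    }
```

A site would then count towards a per-site figure if at least one of its sampled pages returns a positive signal.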

robots.txt — AI bot block matrix

For each known AI/scraper user-agent, the share of crawled sites whose robots.txt disallows that UA at the root. List is versioned in known_uas.yaml; we welcome pull requests adding bots we have missed.

| User-agent | Operator | % sites blocking |
| --- | --- | --- |
| CCBot | Common Crawl | 39.6% |
| GPTBot | OpenAI | 36.9% |
| Bytespider | ByteDance | 36.3% |
| ClaudeBot | Anthropic | 35.5% |
| Google-Extended | Google | 31.6% |
| anthropic-ai | Anthropic | 31.4% |
| PerplexityBot | Perplexity | 29.6% |
| omgilibot | Webhose | 28.2% |
| Amazonbot | Amazon | 28.0% |
| Claude-Web | Anthropic | 27.6% |
| cohere-ai | Cohere | 27.6% |
| Applebot-Extended | Apple | 27.0% |
| Diffbot | Diffbot | 26.6% |
| Meta-ExternalAgent | Meta | 26.4% |
| omgili | Webhose | 25.4% |
| ChatGPT-User | OpenAI | 24.3% |
| FacebookBot | Meta | 21.9% |
| Meta-ExternalFetcher | Meta | 19.9% |
| YouBot | You.com | 19.7% |
| OAI-SearchBot | OpenAI | 19.3% |
| Timpibot | Timpi | 18.5% |
| Scrapy | Open-source scraper | 16.6% |
| ImagesiftBot | ImageSift | 16.4% |
| Claude-SearchBot | Anthropic | 16.2% |
| magpie-crawler | Brandwatch | 15.8% |
| TurnitinBot | Turnitin | 15.8% |
| Perplexity-User | Perplexity | 15.2% |
| Claude-User | Anthropic | 14.8% |
| cohere-training-data-crawler | Cohere | 13.6% |
| DuckAssistBot | DuckDuckGo | 13.6% |
| AI2Bot | AI2 | 13.0% |
| DataForSeoBot | DataForSEO | 12.6% |
| FriendlyCrawler | Webis | 12.4% |
| PanguBot | Huawei | 12.2% |
| img2dataset | Open-source scraper | 11.8% |
| AwarioRssBot | Awario | 11.6% |
| AwarioSmartBot | Awario | 11.6% |
| MistralAI-User | Mistral | 11.2% |
| Google-CloudVertexBot | Google | 11.0% |
| ia_archiver | Internet Archive | 11.0% |
| DeepSeekBot | DeepSeek | 10.8% |
| NewsNow | NewsNow | 10.3% |
| BLEXBot | WebMeUp | 9.9% |
| archive.org_bot | Internet Archive | 8.9% |
| peer39_crawler | Peer39 | 8.1% |
| SeekrBot | Seekr | 7.9% |
| news-please | news-please (open-source) | 7.7% |
| Feedfetcher-Google | Google | 7.5% |
| Gemini-Deep-Research | Google | 6.5% |
| Quora-Bot | Quora | 6.3% |
| MyCentralAIScraperBot | MyCentral | 6.1% |
| quillbot.com | QuillBot | 6.1% |
| EchoboxBot | Echobox | 5.9% |
| Poseidon Research Crawler | Poseidon Research | 5.9% |
| SeznamHomepageCrawler | Seznam | 5.3% |
| AliyunSecBot | Alibaba Cloud | 5.1% |
| TaraGroup Intelligent Bot | TaraGroup | 5.1% |
| Grok | xAI | 4.9% |
| AudigentAdBot | Audigent | 4.7% |
| ViennaTinyBot | Vienna Tiny | 4.7% |
| Jetslide | Jetslide | 4.3% |
| GoogleOther | Google | 3.9% |
| bingbot | Microsoft | 0.8% |
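
The "disallows that UA at the root" test behind the matrix above could be sketched as follows, using Python's standard `urllib.robotparser`. This is an assumption about the approach, not the crawler's own parser; `blocked_at_root` and the sample robots.txt are made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def blocked_at_root(robots_txt, user_agents):
    """Return the user-agents that may not fetch "/" under this robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [ua for ua in user_agents if not parser.can_fetch(ua, "/")]


# Hypothetical robots.txt: two AI bots refused outright, everyone else
# only kept out of /private/.
sample = """\
User-agent: GPTBot
User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /private/
"""

print(blocked_at_root(sample, ["GPTBot", "ClaudeBot", "CCBot"]))
# prints ['GPTBot', 'CCBot']
```

One caveat: the standard-library parser matches user-agent names by case-insensitive substring, so a group written for `omgili` would also apply to `omgilibot`. A stricter matcher may give slightly different counts.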

How many AI UAs does each site block?

A site can block zero, one, or many of the 63 tracked AI UAs. The distribution reveals whether publishers are picking individual bots to refuse or applying a blanket policy.

| UAs blocked | Sites | % of sites |
| --- | --- | --- |
| 0 (no AI blocks) | 269 | 53.1% |
| 1–5 | 38 | 7.5% |
| 6–10 | 35 | 6.9% |
| 11–20 | 58 | 11.4% |
| 21–63 | 107 | 21.1% |
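
For readers who want to reproduce this kind of distribution from their own data, the bucketing can be sketched as below. The bucket boundaries follow the table above; the function names and the open-ended "21+" label for the top bucket are illustrative choices, not taken from the crawler.

```python
from collections import Counter

# (low, high, label) for the closed buckets; anything above falls into TOP_LABEL.
BUCKETS = [
    (0, 0, "0 (no AI blocks)"),
    (1, 5, "1-5"),
    (6, 10, "6-10"),
    (11, 20, "11-20"),
]
TOP_LABEL = "21+"


def bucket_label(n_blocked):
    """Map a per-site count of blocked UAs to its bucket label."""
    for low, high, label in BUCKETS:
        if low <= n_blocked <= high:
            return label
    return TOP_LABEL


def distribution(blocked_counts):
    """Return {label: (sites, percent of sites)} for a list of per-site counts."""
    total = len(blocked_counts)
    counts = Counter(bucket_label(n) for n in blocked_counts)
    labels = [label for _, _, label in BUCKETS] + [TOP_LABEL]
    return {
        label: (counts[label], round(100 * counts[label] / total, 1) if total else 0.0)
        for label in labels
    }
```

A large top bucket next to a small "1-5" bucket would suggest blanket policies rather than bot-by-bot refusals.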

Image-level signals

Three of the mechanisms measured here live inside the image file (or its HTTP response) rather than in a site-wide file. They travel with the image when it is copied or shared.
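
To give a feel for what an embedded signal looks like, the sketch below scans raw image bytes for a `plus:DataMining` property in the XMP packet. It is a deliberately crude illustration under several assumptions: the namespace is bound to the conventional `plus:` prefix, the value is serialised as element text or as an attribute, and any value containing `DMI-PROHIBITED` is treated as an opt-out (my reading of the PLUS vocabulary, worth verifying). A real pipeline would more likely use a proper XMP reader; the function names here are hypothetical.

```python
import re

# Matches plus:DataMining written as <plus:DataMining>value</...>
# or as an attribute plus:DataMining="value".
DATA_MINING_RE = re.compile(
    rb'plus:DataMining(?:>\s*|\s*=\s*["\'])([^<"\']+)'
)


def data_mining_value(image_bytes):
    """Return the first plus:DataMining value found in the bytes, or None."""
    match = DATA_MINING_RE.search(image_bytes)
    if match is None:
        return None
    return match.group(1).strip().decode("utf-8", "replace")


def is_opt_out(value):
    """Assumption: any DMI-PROHIBITED* value expresses a data-mining opt-out."""
    return value is not None and "DMI-PROHIBITED" in value
```

Because the property sits inside the file, this check works on a copy of the image saved anywhere, provided the metadata has not been stripped along the way.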

Do the signals agree?

A site that takes AI opt-out seriously might use several mechanisms together. In practice the overlap is patchy — partly because the signals serve different audiences (robots.txt for crawlers, TDMRep for the EU legal framework, IPTC PLUS:DataMining for downstream image consumers).

| If a site… | …does it also… | % overlap | N |
| --- | --- | --- | --- |
| blocks GPTBot | blocks ClaudeBot | 83% | 187 |
| blocks GPTBot | has tdmrep.json | 3% | 187 |
| has tdmrep.json | blocks ≥1 AI UA | 83% | 6 |
| has ai.txt | has tdmrep.json | 0% | 4 |
| has RSL License | blocks ≥1 AI UA | 100% | 2 |
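
Each row above is a conditional share: of the N sites with the first signal, what percentage also has the second. A minimal sketch of that calculation, with made-up site names and a hypothetical `overlap` helper:

```python
def overlap(sites_with_a, sites_with_b):
    """Return (percent of A sites that are also in B, N = number of A sites)."""
    n = len(sites_with_a)
    if n == 0:
        return None, 0
    both = len(sites_with_a & sites_with_b)
    return round(100 * both / n), n


# Hypothetical site sets, for illustration only.
has_tdmrep = {"a.example", "b.example", "c.example",
              "d.example", "e.example", "f.example"}
blocks_any_ai = {"a.example", "b.example", "c.example",
                 "d.example", "e.example", "z.example"}

print(overlap(has_tdmrep, blocks_any_ai))
# prints (83, 6)
```

The measure is not symmetric: swapping the arguments changes both the percentage and N, which is why the table states the condition ("If a site…") explicitly.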

What we measure (methodology)

The "publishers probed" denominator on this page (502) is wider than the "Sites crawled" figure on the homepage. The homepage counts publishers where the crawler completed a full pass through to fetching images. The denominator here also includes publishers whose robots.txt disallowed our user-agent at the root, whose sitemap or feed discovery was blocked, or who returned no articles. For those publishers we still read robots.txt, probed the well-known files, and so on, so we can legitimately answer "did they express an AI preference". Only publishers we couldn't reach at all (DNS / TCP failures) are excluded. Narrowing the denominator to the homepage's figure would drop precisely the publishers most likely to opt out, biasing every percentage downward.
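
The effect of the denominator choice can be seen in a toy calculation. All numbers below are invented for illustration and are not crawl results; the field names (`reachable`, `fully_crawled`, `opted_out`) and the `adoption_rate` helper are hypothetical.

```python
def adoption_rate(publishers, include):
    """Return (percent opted out, pool size) over publishers passing `include`."""
    pool = [p for p in publishers if include(p)]
    opted = sum(1 for p in pool if p["opted_out"])
    return round(100 * opted / len(pool), 1), len(pool)


# Toy population: 6 fully crawled (1 opted out), 3 that blocked the crawler
# at the root (all opted out), 1 unreachable (preference unknown).
publishers = (
    [{"reachable": True, "fully_crawled": True, "opted_out": i == 0} for i in range(6)]
    + [{"reachable": True, "fully_crawled": False, "opted_out": True} for _ in range(3)]
    + [{"reachable": False, "fully_crawled": False, "opted_out": None}]
)

wide = adoption_rate(publishers, lambda p: p["reachable"])
narrow = adoption_rate(publishers, lambda p: p["fully_crawled"])
print(wide, narrow)
```

Running it shows how far apart the two percentages can land when the publishers missing from the narrower pool are mostly ones that opted out.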