AI preferences and opt-out signals

Publishers have at least ten independent technical mechanisms for signalling that they do not want their images and articles used to train generative-AI models. They sit at different layers (site-wide files, HTTP headers, embedded image metadata) and were proposed by different bodies over a short period — so adoption is uneven and the signals do not always agree. Article 4 of the EU DSM Directive requires opt-outs to be expressed in a machine-readable form; this page measures which of the candidate machine-readable forms publishers actually use.

The IPTC's Generative AI Opt-Out Best Practice Recommendations (v2.0, March 2026) sets out the thirteen techniques the IPTC currently recommends to publishers wishing to express a data-mining opt-out. Most are measured here; the remainder (plain-language rights statements, firewall-level blocking, and opt-outs embedded in epub/PDF) are noted in the methodology section below.

Based on the most recent crawl: 502 publishers probed, 6,803 images analysed.

Adoption snapshot

per site

47.4%

robots.txt — any AI bot blocked

Blocks at least one known AI/scraper UA

238 of 502 →

per site

15.7%

noarchive / nosnippet meta robots

Honoured as AI-opt-out by Bing/Copilot (noarchive) and Google (nosnippet)

79 of 502 →

per site

4.2%

robots.txt — Content-Signal

Cloudflare Content Signals (contentsignals.org)

21 of 502 →

per site

2.2%

IPTC PLUS:DataMining

XMP-plus:DataMining on ≥1 sampled image

11 of 502 →

per site

1.4%

TDMRep — per-page <meta>

7 of 502 →

per site

1.2%

/.well-known/tdmrep.json

TDM Reservation Protocol (EU DSM Art. 4)

6 of 502 →

per site

0.8%

ai.txt

Spawning AI consent proposal

4 of 502 →

per site

0.4%

RSL — License: in robots.txt

Really Simple Licensing (rslstandard.org)

2 of 502 →

per site

0.4%

trust.txt — datatrainingallowed=no

JournalList trust.txt opt-out directive

2 of 502 →

per site

0.2%

CAWG Training and Data Mining Assertion

cawg.training-mining assertion in ≥1 sampled image's C2PA manifest

1 of 502 →

per site

0.2%

noai / noimageai meta robots

AI-specific tokens on X-Robots-Tag or <meta name="robots">

1 of 502 →

robots.txt — AI bot block matrix

For each known AI/scraper user-agent, the table shows the share of crawled sites whose robots.txt disallows that UA at the root. The list is versioned in known_uas.yaml; we welcome pull requests adding bots we have missed.

User-agent	Operator	% sites blocking
`CCBot`	Common Crawl	39.6%
`GPTBot`	OpenAI	36.9%
`Bytespider`	ByteDance	36.3%
`ClaudeBot`	Anthropic	35.5%
`Google-Extended`	Google	31.6%
`anthropic-ai`	Anthropic	31.4%
`PerplexityBot`	Perplexity	29.6%
`omgilibot`	Webhose	28.2%
`Amazonbot`	Amazon	28.0%
`Claude-Web`	Anthropic	27.6%
`cohere-ai`	Cohere	27.6%
`Applebot-Extended`	Apple	27.0%
`Diffbot`	Diffbot	26.6%
`Meta-ExternalAgent`	Meta	26.4%
`omgili`	Webhose	25.4%
`ChatGPT-User`	OpenAI	24.3%
`FacebookBot`	Meta	21.9%
`Meta-ExternalFetcher`	Meta	19.9%
`YouBot`	You.com	19.7%
`OAI-SearchBot`	OpenAI	19.3%
`Timpibot`	Timpi	18.5%
`Scrapy`	Open-source scraper	16.6%
`ImagesiftBot`	ImageSift	16.4%
`Claude-SearchBot`	Anthropic	16.2%
`magpie-crawler`	Brandwatch	15.8%
`TurnitinBot`	Turnitin	15.8%
`Perplexity-User`	Perplexity	15.2%
`Claude-User`	Anthropic	14.8%
`cohere-training-data-crawler`	Cohere	13.6%
`DuckAssistBot`	DuckDuckGo	13.6%
`AI2Bot`	AI2	13.0%
`DataForSeoBot`	DataForSEO	12.6%
`FriendlyCrawler`	Webis	12.4%
`PanguBot`	Huawei	12.2%
`img2dataset`	Open-source scraper	11.8%
`AwarioRssBot`	Awario	11.6%
`AwarioSmartBot`	Awario	11.6%
`MistralAI-User`	Mistral	11.2%
`Google-CloudVertexBot`	Google	11.0%
`ia_archiver`	Internet Archive	11.0%
`DeepSeekBot`	DeepSeek	10.8%
`NewsNow`	NewsNow	10.3%
`BLEXBot`	WebMeUp	9.9%
`archive.org_bot`	Internet Archive	8.9%
`peer39_crawler`	Peer39	8.1%
`SeekrBot`	Seekr	7.9%
`news-please`	news-please (open-source)	7.7%
`Feedfetcher-Google`	Google	7.5%
`Gemini-Deep-Research`	Google	6.5%
`Quora-Bot`	Quora	6.3%
`MyCentralAIScraperBot`	MyCentral	6.1%
`quillbot.com`	QuillBot	6.1%
`EchoboxBot`	Echobox	5.9%
`Poseidon Research Crawler`	Poseidon Research	5.9%
`SeznamHomepageCrawler`	Seznam	5.3%
`AliyunSecBot`	Alibaba Cloud	5.1%
`TaraGroup Intelligent Bot`	TaraGroup	5.1%
`Grok`	xAI	4.9%
`AudigentAdBot`	Audigent	4.7%
`ViennaTinyBot`	Vienna Tiny	4.7%
`Jetslide`	Jetslide	4.3%
`GoogleOther`	Google	3.9%
`bingbot`	Microsoft	0.8%
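
As a rough illustration of how one row of the matrix could be derived, the sketch below checks whether a robots.txt body disallows a given UA at the root and aggregates the result across sites. It swaps in Python's standard-library urllib.robotparser for the protego parser the crawl actually uses, so edge-case matching may differ. The site names, robots.txt bodies, and the four-UA list are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical subset of the tracked user-agents, for illustration only.
TRACKED_UAS = ["GPTBot", "ClaudeBot", "CCBot", "bingbot"]


def blocked_at_root(robots_txt: str, user_agent: str) -> bool:
    """Return True if robots_txt disallows user_agent from fetching '/'."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, "/")


def block_matrix(robots_by_site: dict[str, str]) -> dict[str, float]:
    """Percentage of sites blocking each tracked UA at the root."""
    total = len(robots_by_site)
    matrix = {}
    for ua in TRACKED_UAS:
        blocked = sum(blocked_at_root(body, ua) for body in robots_by_site.values())
        matrix[ua] = round(100 * blocked / total, 1)
    return matrix


# Made-up robots.txt bodies; an empty body means nothing is disallowed.
SAMPLE = {
    "a.example": "User-agent: GPTBot\nDisallow: /\n\nUser-agent: *\nDisallow:\n",
    "b.example": "User-agent: GPTBot\nUser-agent: ClaudeBot\nDisallow: /\n",
    "c.example": "User-agent: *\nAllow: /\n",
    "d.example": "",
}

matrix = block_matrix(SAMPLE)
for ua, pct in matrix.items():
    print(f"{ua}\t{pct}%")
```

With these four made-up sites, GPTBot comes out at 50.0% and ClaudeBot at 25.0%. Note that the standard-library parser matches user-agent names by substring, which is one of the places where a stricter parser could give a different answer.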

How many AI UAs does each site block?

A site can block zero, one, or many of the 63 tracked AI UAs. The distribution reveals whether publishers are picking individual bots to refuse or applying a blanket policy.

UAs blocked	Sites	% of sites
0 (no AI blocks)	269	53.1%
1–5	38	7.5%
6–10	35	6.9%
11–20	58	11.4%
21–63	107	21.1%
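
The binning behind a table like this can be sketched in a few lines. This is a minimal illustration, assuming each site has already been reduced to a count of blocked UAs; the bin edges follow the table, the open-ended top bin is labelled "21+" for simplicity, and the sample counts are hypothetical.

```python
from collections import Counter

# (low, high, label) for the closed bins; anything above falls into "21+".
BINS = [
    (0, 0, "0 (no AI blocks)"),
    (1, 5, "1–5"),
    (6, 10, "6–10"),
    (11, 20, "11–20"),
]


def bin_label(n_blocked: int) -> str:
    """Map a per-site count of blocked UAs to its bin label."""
    for low, high, label in BINS:
        if low <= n_blocked <= high:
            return label
    return "21+"


def distribution(blocked_counts: list[int]) -> dict[str, tuple[int, float]]:
    """Return {label: (sites, percent of sites)} for the given counts."""
    counts = Counter(bin_label(n) for n in blocked_counts)
    total = len(blocked_counts)
    return {label: (k, round(100 * k / total, 1)) for label, k in counts.items()}


# Hypothetical per-site counts, not taken from the crawl.
sample_counts = [0, 0, 0, 3, 8, 15, 40, 63]
dist = distribution(sample_counts)
for label, (sites, pct) in dist.items():
    print(f"{label}\t{sites}\t{pct}%")
```

A large top bin relative to the middle bins is what a blanket policy looks like in this view: sites that block at all tend to block many UAs at once.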

Image-level signals

Three of the mechanisms measured here are attached to the image itself (inside the file or on its HTTP response) rather than to a site-wide file. The two embedded in the file travel with the image when it is copied or shared.

IPTC PLUS:DataMining — a controlled-vocabulary value in the XMP-plus namespace expressing whether data mining is permitted. We parse this from every image we fetch.
CAWG Training and Data Mining Assertion — a structured assertion inside a C2PA manifest declaring training/mining permissions (the label is cawg.training-mining). The assertion can also declare that training is allowed, so a hit here is a preference signal, not necessarily an opt-out. Spec at cawg.io.
noai / noimageai directives — informal but in-the-wild; expressed via X-Robots-Tag response headers on the image, or <meta name="robots"> on the host article page. Originated in DeviantArt's 2022 policy; not standardised.
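
For the third signal, a minimal sketch of the token scan might look like the following. It assumes the response headers and the host article's HTML are already in hand, uses only the standard library, and ignores the optional user-agent prefix that X-Robots-Tag values may carry. The function and class names are hypothetical.

```python
from html.parser import HTMLParser

AI_TOKENS = {"noai", "noimageai"}


def split_directives(value: str) -> set[str]:
    """Split a comma-separated robots directive string into lowercase tokens."""
    return {tok.strip().lower() for tok in value.split(",") if tok.strip()}


class RobotsMetaParser(HTMLParser):
    """Collect directive tokens from every <meta name="robots"> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.tokens: set[str] = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            self.tokens |= split_directives(attr.get("content") or "")


def ai_opt_out_tokens(headers: dict[str, str], article_html: str) -> set[str]:
    """Return the AI-specific tokens found in the headers or the article's meta robots."""
    found: set[str] = set()
    for name, value in headers.items():
        if name.lower() == "x-robots-tag":
            found |= split_directives(value)
    parser = RobotsMetaParser()
    parser.feed(article_html)
    found |= parser.tokens
    return found & AI_TOKENS


example = ai_opt_out_tokens(
    {"X-Robots-Tag": "noimageai"},
    '<html><head><meta name="robots" content="noindex, NoAI"></head></html>',
)
print(sorted(example))
```

Intersecting with AI_TOKENS at the end keeps ordinary directives such as noindex out of the result, so only the AI-specific tokens count as a hit.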

Do the signals agree?

A site that takes AI opt-out seriously might use several mechanisms together. In practice the overlap is patchy — partly because the signals serve different audiences (robots.txt for crawlers, TDMRep for the EU legal framework, IPTC PLUS:DataMining for downstream image consumers).

If a site…	…does it also…	% overlap	N
blocks GPTBot	blocks ClaudeBot	83%	187
blocks GPTBot	has tdmrep.json	3%	187
has tdmrep.json	blocks ≥1 AI UA	83%	6
has ai.txt	has tdmrep.json	0%	4
has RSL License	blocks ≥1 AI UA	100%	2
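
Each row is a conditional share: among the N sites that satisfy the first condition, the percentage that also satisfy the second. A minimal sketch, assuming each site has been reduced to a set of signal labels (the label strings and site names below are hypothetical):

```python
def overlap(sites: dict[str, set[str]], condition: str, also: str) -> tuple[int, int]:
    """Return (percent, N): among sites having `condition`, the share that also have `also`."""
    base = [signals for signals in sites.values() if condition in signals]
    n = len(base)
    if n == 0:
        return (0, 0)
    hits = sum(1 for signals in base if also in signals)
    return (round(100 * hits / n), n)


# Hypothetical data: six sites publish tdmrep.json, five of them also block an AI UA.
SITES = {f"site{i}.example": {"tdmrep.json", "blocks_ai_ua"} for i in range(5)}
SITES["site5.example"] = {"tdmrep.json"}
SITES["site6.example"] = {"blocks_ai_ua"}
SITES["site7.example"] = set()

pct, n = overlap(SITES, "tdmrep.json", "blocks_ai_ua")
print(f"{pct}%\t{n}")
```

The relation is not symmetric: swapping the two arguments changes the denominator, which is why the table lists the condition and the N alongside each percentage. With an N as small as 6 or 2, a single site moves the figure by many points.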

What we measure (methodology)

The "publishers probed" denominator on this page (502) is wider than the "Sites crawled" figure on the homepage. The homepage counts only publishers where the crawler completed a full pass through to fetching images. The denominator here also includes publishers whose robots.txt disallowed our user-agent at the root, whose sitemap or feed discovery was blocked, or who returned no articles. For those publishers we still read robots.txt and probed the well-known files, so we can legitimately answer "did they express an AI preference". Only publishers we could not reach at all (DNS / TCP failures) are excluded. Narrowing the denominator to the homepage's figure would drop precisely the publishers most likely to opt out, biasing every percentage downward.

robots.txt: fetched once per run, parsed with protego. For each tracked UA we record allowed or disallowed at the site root.
tdmrep.json: HTTP probe of /.well-known/tdmrep.json; we record a presence flag and save the response body verbatim. Parsing of TDM-policy URLs and the tdm-reservation flag is planned.
TDMRep per-page meta: each sampled article's HTML head is scanned for <meta name="tdm-reservation" content="1">. A site is flagged when at least one of its sampled articles carries the tag (a site can publish per-page TDMRep without having a site-wide tdmrep.json, and vice versa).
ai.txt: HTTP probe of /ai.txt; presence flag only at first.
RSL: scan robots.txt for a License: directive linking to an RSL XML file (spec at rslstandard.org). We record a presence flag and the licence URL; we will not parse the XML at first.
Content-Signal: scan robots.txt for Cloudflare Content-Signal: directives (contentsignals.org). Each line is a comma-separated list of <signal>=yes|no pairs where <signal> is one of search, ai-input, ai-train; merged across lines with last-wins. We record the presence flag plus each signal's value.
trust.txt: HTTP probe of /.well-known/trust.txt and /trust.txt; parse for the datatrainingallowed=no directive. Presence flag and the directive value are recorded.
noai / noimageai meta robots: scanned in HTTP response headers (image and host article) and in the host article's <meta name="robots">. The noai, noimageai, and noml tokens were coined by DeviantArt's 2022 policy specifically as AI-opt-out directives — unambiguous in intent.
noarchive / nosnippet meta robots: scanned in the same places. Both predate AI by decades as classic search-engine directives, but individual vendors have reinterpreted them as AI-opt-out signals: noarchive is honoured by Bing/Copilot, nosnippet by Google (per Google's documentation, it applies to AI Overviews and AI Mode). The IPTC best-practices document recommends combining them with the noai / noimageai tokens; we track them as a separate signal so the headline AI-opt-out figure isn't dominated by cache and snippet settings that publishers may not have intended as AI opt-outs at all.
IPTC PLUS:DataMining: already extracted via ExifTool on every image.
CAWG Training and Data Mining Assertion: parsed from C2PA manifest assertions when the label matches cawg.training-mining.
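
The Content-Signal rule in the list above (comma-separated signal=yes|no pairs, merged across lines with last-wins) can be sketched as follows. This is a minimal illustration rather than the crawler's actual code: the function name is hypothetical, and it does not track which User-agent group a directive belongs to.

```python
VALID_SIGNALS = {"search", "ai-input", "ai-train"}
VALID_VALUES = {"yes", "no"}


def parse_content_signals(robots_txt: str) -> dict[str, str]:
    """Collect Content-Signal values from a robots.txt body; later lines override earlier ones."""
    signals: dict[str, str] = {}
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        key, sep, value = line.partition(":")
        if not sep or key.strip().lower() != "content-signal":
            continue
        for pair in value.split(","):
            name, eq, setting = pair.partition("=")
            name, setting = name.strip().lower(), setting.strip().lower()
            if eq and name in VALID_SIGNALS and setting in VALID_VALUES:
                signals[name] = setting  # last-wins merge
    return signals


ROBOTS = """\
User-agent: *
Content-Signal: search=yes, ai-train=no
Allow: /
Content-Signal: ai-train=yes, ai-input=no  # later line overrides ai-train
"""

result = parse_content_signals(ROBOTS)
print(result)
```

The presence flag is then simply whether the returned dict is non-empty, and each signal's recorded value is whatever survived the merge. Unknown signal names and values other than yes or no are silently ignored in this sketch.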