AI preferences and opt-out signals
Publishers have at least ten independent technical mechanisms for signalling that they do not want their images and articles used to train generative-AI models. They sit at different layers (site-wide files, HTTP headers, embedded image metadata) and were proposed by different bodies over a short period, so adoption is uneven and the signals do not always agree. Article 4(3) of the EU DSM Directive requires opt-outs for content made publicly available online to be expressed in an appropriate manner, such as machine-readable means; this page measures which of the candidate machine-readable forms publishers actually use.
The IPTC's Generative AI Opt-Out Best Practice Recommendations (v2.0, March 2026) sets out the thirteen techniques the IPTC currently recommends to publishers wishing to express a data-mining opt-out. Most are measured here; the remainder (plain-language rights statements, per-page TDMRep meta tags, firewall-level blocking, and opt-outs embedded in epub/PDF) are noted in the methodology section below.
Adoption snapshot
robots.txt — AI bot block matrix
For each known AI or scraper user-agent, the table shows the share of crawled sites whose robots.txt disallows that UA at the root. The list is versioned in known_uas.yaml; we welcome pull requests adding bots we have missed.
| User-agent | Operator | % sites blocking |
|---|---|---|
| CCBot | Common Crawl | 39.6% |
| GPTBot | OpenAI | 36.9% |
| Bytespider | ByteDance | 36.3% |
| ClaudeBot | Anthropic | 35.5% |
| Google-Extended | Google | 31.6% |
| anthropic-ai | Anthropic | 31.4% |
| PerplexityBot | Perplexity | 29.6% |
| omgilibot | Webhose | 28.2% |
| Amazonbot | Amazon | 28.0% |
| Claude-Web | Anthropic | 27.6% |
| cohere-ai | Cohere | 27.6% |
| Applebot-Extended | Apple | 27.0% |
| Diffbot | Diffbot | 26.6% |
| Meta-ExternalAgent | Meta | 26.4% |
| omgili | Webhose | 25.4% |
| ChatGPT-User | OpenAI | 24.3% |
| FacebookBot | Meta | 21.9% |
| Meta-ExternalFetcher | Meta | 19.9% |
| YouBot | You.com | 19.7% |
| OAI-SearchBot | OpenAI | 19.3% |
| Timpibot | Timpi | 18.5% |
| Scrapy | Open-source scraper | 16.6% |
| ImagesiftBot | ImageSift | 16.4% |
| Claude-SearchBot | Anthropic | 16.2% |
| magpie-crawler | Brandwatch | 15.8% |
| TurnitinBot | Turnitin | 15.8% |
| Perplexity-User | Perplexity | 15.2% |
| Claude-User | Anthropic | 14.8% |
| cohere-training-data-crawler | Cohere | 13.6% |
| DuckAssistBot | DuckDuckGo | 13.6% |
| AI2Bot | AI2 | 13.0% |
| DataForSeoBot | DataForSEO | 12.6% |
| FriendlyCrawler | Webis | 12.4% |
| PanguBot | Huawei | 12.2% |
| img2dataset | Open-source scraper | 11.8% |
| AwarioRssBot | Awario | 11.6% |
| AwarioSmartBot | Awario | 11.6% |
| MistralAI-User | Mistral | 11.2% |
| Google-CloudVertexBot | Google | 11.0% |
| ia_archiver | Internet Archive | 11.0% |
| DeepSeekBot | DeepSeek | 10.8% |
| NewsNow | NewsNow | 10.3% |
| BLEXBot | WebMeUp | 9.9% |
| archive.org_bot | Internet Archive | 8.9% |
| peer39_crawler | Peer39 | 8.1% |
| SeekrBot | Seekr | 7.9% |
| news-please | news-please (open-source) | 7.7% |
| Feedfetcher-Google | Google | 7.5% |
| Gemini-Deep-Research | Google | 6.5% |
| Quora-Bot | Quora | 6.3% |
| MyCentralAIScraperBot | MyCentral | 6.1% |
| quillbot.com | QuillBot | 6.1% |
| EchoboxBot | Echobox | 5.9% |
| Poseidon Research Crawler | Poseidon Research | 5.9% |
| SeznamHomepageCrawler | Seznam | 5.3% |
| AliyunSecBot | Alibaba Cloud | 5.1% |
| TaraGroup Intelligent Bot | TaraGroup | 5.1% |
| Grok | xAI | 4.9% |
| AudigentAdBot | Audigent | 4.7% |
| ViennaTinyBot | Vienna Tiny | 4.7% |
| Jetslide | Jetslide | 4.3% |
| GoogleOther | Google | 3.9% |
| bingbot | Microsoft | 0.8% |
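The per-UA check behind this matrix can be sketched roughly as follows. This is an illustration only, not the crawler's code: it uses Python's standard-library urllib.robotparser instead of protego (the parser named in the methodology), and the function names blocked_at_root and share_blocking are hypothetical. Parsers differ in how they match user-agents and rules, so this sketch would not necessarily reproduce the percentages above.

```python
from urllib.robotparser import RobotFileParser


def blocked_at_root(robots_txt: str, user_agents: list[str]) -> dict[str, bool]:
    """Return {ua: True} when robots.txt disallows that UA from fetching "/"."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # can_fetch() falls back to the "*" group when no group names the UA,
    # so a blanket "Disallow: /" counts as a block for every tracked UA here.
    return {ua: not parser.can_fetch(ua, "/") for ua in user_agents}


def share_blocking(all_robots: list[str], user_agents: list[str]) -> dict[str, float]:
    """Percentage of sites (one robots.txt body each) that block each UA at the root."""
    counts = dict.fromkeys(user_agents, 0)
    for body in all_robots:
        for ua, blocked in blocked_at_root(body, user_agents).items():
            counts[ua] += blocked
    total = len(all_robots)
    if total == 0:
        return {}
    return {ua: round(100 * n / total, 1) for ua, n in counts.items()}


# Two made-up robots.txt bodies, purely for illustration.
sites = [
    "User-agent: GPTBot\nDisallow: /\n",
    "User-agent: *\nDisallow: /private/\n",
]
print(share_blocking(sites, ["GPTBot", "ClaudeBot"]))
```

Counting a wildcard Disallow as a block for each UA is an assumption of this sketch; the page does not say how the crawler treats that case.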
How many AI UAs does each site block?
A site can block zero, one, or many of the 63 tracked AI UAs. The distribution reveals whether publishers are picking individual bots to refuse or applying a blanket policy.
| UAs blocked | Sites | % of sites |
|---|---|---|
| 0 (no AI blocks) | 269 | 53.1% |
| 1–5 | 38 | 7.5% |
| 6–10 | 35 | 6.9% |
| 11–20 | 58 | 11.4% |
| 21–63 | 107 | 21.1% |
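As a rough sketch of how such a distribution could be tallied, the snippet below buckets a per-site count of blocked UAs and converts the tallies to percentages. The names BUCKETS, bucket_label and distribution are hypothetical, and the open-ended "21+" label for the top bucket is this sketch's own choice.

```python
from collections import Counter

# Bucket edges follow the table above; anything above 20 falls in "21+".
BUCKETS = [(0, 0, "0"), (1, 5, "1-5"), (6, 10, "6-10"), (11, 20, "11-20")]


def bucket_label(n_blocked: int) -> str:
    """Map a site's number of blocked UAs to its bucket label."""
    for low, high, label in BUCKETS:
        if low <= n_blocked <= high:
            return label
    return "21+"


def distribution(blocked_counts: list[int]) -> dict[str, tuple[int, float]]:
    """Map each bucket label to (number of sites, percentage of all sites)."""
    tally = Counter(bucket_label(n) for n in blocked_counts)
    total = len(blocked_counts)
    return {label: (count, round(100 * count / total, 1)) for label, count in tally.items()}


# Cross-check of the published counts: they sum to 507 sites, and each
# percentage in the table is that bucket's count divided by the total.
published = {"0": 269, "1-5": 38, "6-10": 35, "11-20": 58, "21+": 107}
total = sum(published.values())
print(total, {k: round(100 * v / total, 1) for k, v in published.items()})
```

The cross-check at the end uses only the counts from the table and shows that the percentages are consistent with a denominator of 507 sites.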
Image-level signals
Three of the mechanisms attach to the image file itself (or to its HTTP response) rather than to a site-wide file. The two embedded in the file travel with the image when it is copied or shared; the header-based one does not.
- IPTC PLUS:DataMining: a controlled-vocabulary value in the XMP-plus namespace expressing whether data mining is permitted. We parse this from every image we fetch.
- CAWG Training and Data Mining Assertion: a structured assertion inside a C2PA manifest declaring training/mining permissions (the label is cawg.training-mining). The assertion can also declare that training is allowed, so a hit here is a preference signal, not necessarily an opt-out. Spec at cawg.io.
- noai / noimageai directives: informal but in-the-wild; expressed via X-Robots-Tag response headers on the image, or <meta name="robots"> on the host article page. Originated in DeviantArt's 2022 policy; not standardised.
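For the header-based variant, detecting the noai / noimageai tokens comes down to tokenising the X-Robots-Tag value. The following is a minimal sketch under stated assumptions: the function name ai_tokens_in_header is hypothetical, and a crawler-specific prefix such as "googlebot: noindex" is simply stripped rather than honoured as a scope.

```python
AI_OPT_OUT_TOKENS = {"noai", "noimageai"}


def ai_tokens_in_header(x_robots_tag: str) -> set[str]:
    """Return the noai / noimageai tokens present in one X-Robots-Tag value."""
    found = set()
    for part in x_robots_tag.split(","):
        token = part.strip().lower()
        # A directive may be prefixed with a crawler name ("googlebot: noindex");
        # this sketch keeps only the text after the first colon.
        if ":" in token:
            token = token.split(":", 1)[1].strip()
        if token in AI_OPT_OUT_TOKENS:
            found.add(token)
    return found


print(ai_tokens_in_header("noindex, NoAI, noimageai"))
```

The same function can be applied to the content attribute of a robots meta tag, since both carry a comma-separated list of directives.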
Do the signals agree?
A site that takes AI opt-out seriously might use several mechanisms together. In practice the overlap is patchy — partly because the signals serve different audiences (robots.txt for crawlers, TDMRep for the EU legal framework, IPTC PLUS:DataMining for downstream image consumers).
| If a site… | …does it also… | % overlap | N |
|---|---|---|---|
| blocks GPTBot | blocks ClaudeBot | 83% | 187 |
| blocks GPTBot | has tdmrep.json | 3% | 187 |
| has tdmrep.json | blocks ≥1 AI UA | 83% | 6 |
| has ai.txt | has tdmrep.json | 0% | 4 |
| has RSL License | blocks ≥1 AI UA | 100% | 2 |
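Each row of the overlap table is a conditional share: among sites showing the first signal (N), the percentage that also show the second. A minimal sketch follows, assuming each site is represented as a set of signal names; the function name overlap and the signal labels are hypothetical, and the sample data is invented to mirror the shape of one row.

```python
def overlap(sites: list[set[str]], condition: str, also: str) -> tuple[int, int]:
    """Among sites with `condition`, the percentage that also have `also`, plus N."""
    base = [signals for signals in sites if condition in signals]
    n = len(base)
    if n == 0:
        return 0, 0  # no site meets the condition, so the overlap is undefined
    hits = sum(also in signals for signals in base)
    return round(100 * hits / n), n


# Invented data: six sites with tdmrep.json, five of which block at least one AI UA.
sample = [{"tdmrep", "blocks_ai_ua"}] * 5 + [{"tdmrep"}] + [{"blocks_ai_ua"}] * 3
print(overlap(sample, "tdmrep", "blocks_ai_ua"))
```

Because the measure is conditional, it is not symmetric: swapping the two signals changes both the percentage and N, which is why the table lists each direction separately.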
What we measure (methodology)
The "publishers probed" denominator on this page (507) is wider
than the "Sites crawled" figure on the homepage. The homepage counts publishers where the
crawler completed a full pass through to fetching images. The denominator here also includes
publishers whose robots.txt disallowed our user-agent at the root, whose sitemap
or feed discovery was blocked, or who returned no articles — because for those publishers we
still read robots.txt, probed the well-known files, and so on, so we can
legitimately answer "did they express an AI preference". Only publishers we couldn't reach at
all (DNS / TCP failures) are excluded. Narrowing the denominator to the homepage's figure
would drop precisely the publishers most likely to opt out, biasing every percentage upward.
- robots.txt: fetched once per run, parsed with protego. For each tracked UA we record allowed or disallowed at the site root.
- tdmrep.json: HTTP probe of /.well-known/tdmrep.json; recorded as present and saved verbatim. Parsing of TDM-policy URLs and the tdm-reservation flag is planned.
- TDMRep per-page meta: each sampled article's HTML head is scanned for <meta name="tdm-reservation" content="1">. A site is flagged when at least one of its sampled articles carries the tag (a site can publish per-page TDMRep without having a site-wide tdmrep.json, and vice versa).
- ai.txt: HTTP probe of /ai.txt; presence flag only at first.
- RSL: scan robots.txt for a License: directive linking to an RSL XML file (spec). Presence flag and the licence URL; we will not parse the XML at first.
- Content-Signal: scan robots.txt for Cloudflare Content-Signal: directives (contentsignals.org). Each line is a comma-separated list of <signal>=yes|no pairs where <signal> is one of search, ai-input, ai-train; merged across lines with last-wins. We record the presence flag plus each signal's value.
- trust.txt: HTTP probe of /.well-known/trust.txt and /trust.txt; parse for the datatrainingallowed=no directive. Presence flag and the directive value are recorded.
- noai / noimageai meta robots: scanned in HTTP response headers (image and host article) and in the host article's <meta name="robots">. The noai, noimageai, and noml tokens were coined by DeviantArt's 2022 policy specifically as AI-opt-out directives, so their intent is unambiguous.
- noarchive / nosnippet meta robots: scanned in the same places. Both predate AI by decades as classic search-snippet directives, but individual vendors have reinterpreted them as AI-opt-out signals: noarchive is honoured by Bing/Copilot, nosnippet by Google (per Google's documentation, it applies to AI Overviews and AI Mode). The IPTC best-practices document recommends combining them with the noai / noimageai tokens; we track them as a separate signal so the headline AI-opt-out figure isn't dominated by cache-control settings that publishers may not have intended as AI opt-outs at all.
- IPTC PLUS:DataMining: already extracted via ExifTool on every image.
- CAWG Training and Data Mining Assertion: parsed from C2PA manifest assertions when the label matches cawg.training-mining.
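Of the steps above, the Content-Signal merge is the most algorithmic, so here is a minimal sketch of it. It follows only the description in the list (comma-separated <signal>=yes|no pairs, three known signals, last value wins across lines); the function name parse_content_signals is hypothetical, and stripping "#" comments and ignoring unknown signals or values are assumptions of this sketch.

```python
import re

KNOWN_SIGNALS = {"search", "ai-input", "ai-train"}
_LINE = re.compile(r"^\s*content-signal\s*:(.*)$", re.IGNORECASE)


def parse_content_signals(robots_txt: str) -> dict[str, str]:
    """Merge all Content-Signal lines into {signal: "yes" | "no"}, last value wins."""
    merged: dict[str, str] = {}
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0]  # assumption: drop trailing robots.txt comments
        match = _LINE.match(line)
        if not match:
            continue
        for pair in match.group(1).split(","):
            name, sep, value = pair.partition("=")
            name, value = name.strip().lower(), value.strip().lower()
            if sep and name in KNOWN_SIGNALS and value in {"yes", "no"}:
                merged[name] = value  # a later line overwrites an earlier one
    return merged


robots = (
    "User-agent: *\n"
    "Content-Signal: search=yes, ai-train=yes\n"
    "Allow: /\n"
    "Content-Signal: ai-train=no, ai-input=no\n"
)
print(parse_content_signals(robots))
```

In this sketch a non-empty result stands in for the presence flag; a file whose Content-Signal lines carry only unrecognised pairs would need a separate flag.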