About Metawatch

Metawatch is a project of the IPTC. It periodically scans a curated list of major news publishers worldwide, samples the photographs they publish, and reports how much embedded metadata survives the journey from the photographer's camera to the rendered article page. We check each image for Exif, IPTC (in both IIM and XMP formats) and C2PA Content Credentials.

Why this matters

Photographs carry information in their files: who took them, when and where, the licensing terms, captions, credits, alt text for accessibility and more. The IPTC Photo Metadata Standard exists so that this information travels with the image, anywhere the image goes. In practice much of it is stripped before readers see it: sometimes by publishers, sometimes by Content Delivery Networks (CDNs) as part of automatic resizing, sometimes by intermediate image-processing pipelines.

The cost falls on photographers (whose attribution is lost), agencies (whose licensing terms become invisible), and readers (who lose the provenance signals that would help them judge what they're looking at). Metawatch is meant to be a long-running, public, repeatable way to measure the problem and watch whether it improves.

The score

For every image we sample, we check the "Four Cs" of news photo provenance: four fields, drawn from IPTC IIM and IPTC XMP, that long-standing wire-service training treats as universally applicable. A field counts as present if it appears in either family (IIM or XMP) with a non-empty value. The image's score is the sum of the weights of the fields present, out of a maximum of 100. A site's score is the mean across all images we sampled for it; a country's score is the equal-weighted mean of that country's site scores, so each site counts the same however many images we sampled from it.

Scored field	Weight
`Creator`	25
`Copyright`	25
`CaptionDescription`	25
`CreditLine`	25
Total	100
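The arithmetic can be sketched in a few lines of Python. This is an illustration of the rules described above, not the crawler's actual code: the function names and the dictionary-per-family input shape are assumptions made for the example.

```python
from statistics import mean

# Weights copied from the table above.
WEIGHTS = {
    "Creator": 25,
    "Copyright": 25,
    "CaptionDescription": 25,
    "CreditLine": 25,
}


def image_score(iim: dict, xmp: dict) -> int:
    """Sum the weights of scored fields that have a non-empty value in IIM or XMP."""
    def present(field: str) -> bool:
        # Whitespace-only and missing values both count as absent.
        return any(str(family.get(field) or "").strip() for family in (iim, xmp))

    return sum(weight for field, weight in WEIGHTS.items() if present(field))


def site_score(image_scores):
    """Mean over every image sampled for one site."""
    return mean(image_scores)


def country_score(site_scores):
    """Equal-weighted mean over sites, regardless of how many images each had."""
    return mean(site_scores)


# Hypothetical data: Creator only in IIM, Copyright only in XMP, blank CreditLine.
score = image_score(
    {"Creator": "Jane Doe"},
    {"Copyright": "(c) Example News", "CreditLine": "  "},
)
print(score)
```

Because the country score averages site scores rather than pooling images, a site with 2 sampled images moves the country figure as much as a site with 20.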

A further 19 IPTC fields are tracked but do not affect the score: we record their presence and report it on the per-field breakdown, but absence isn't penalised. Many of these are legitimately omitted: LocationCreated can endanger sources, while studio shoots and archival images often genuinely lack DateCreated or Keywords. We don't want to confuse "didn't supply" with "supplied badly".

Tracked fields: AIPromptInformation, AIPromptWriterName, AISystemUsed, AISystemVersionUsed, AltTextAccessibility, DataMining, DateCreated, DigitalSourceType, ExtendedDescriptionAccessibility, Genre, Keywords, LicensorName, LicensorURL, LocationCreated, LocationShown, ObjectName, Source, UsageTerms, WebStatement.

How a site is sampled

For each site in our list we try a small chain of discovery sources, in order, and use the first one that yields articles within the last 30 days:

RSS feed(s) found on previous attempts.
RSS feed(s) we auto-discover via <link rel="alternate"> on the homepage.
Common RSS paths we guess (/feed, /rss.xml, etc.).
XML sitemap declared in robots.txt — preferring news sitemaps, then any non-image sitemap. Image-only sitemaps are a last resort because they carry no article URLs.
Common sitemap paths we guess (/sitemap-news.xml, /sitemap.xml, etc.). Gzipped sitemaps (.xml.gz) are supported.
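The "first source that yields recent articles wins" logic can be sketched as below. The `discover` function, the `(name, fetch)` pair shape, and the use of `OSError` to signal a failed source are all assumptions for illustration; the real crawler may structure this differently.

```python
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(days=30)  # the 30-day window described above


def discover(sources, now=None):
    """Try each (name, fetch) pair in order; return the first with recent articles.

    fetch() is a hypothetical callable returning a list of
    (article_url, timezone-aware datetime) tuples, or raising OSError on failure.
    """
    now = now or datetime.now(timezone.utc)
    for name, fetch in sources:
        try:
            articles = fetch()
        except OSError:
            continue  # this source failed; fall through to the next one
        recent = [(url, when) for url, when in articles if now - when <= WINDOW]
        if recent:
            # Newest first, so a caller can take the most recent N.
            recent.sort(key=lambda item: item[1], reverse=True)
            return name, recent
    return None, []
```

A source that responds but only lists stale articles is treated like one that found nothing, so the chain moves on to the next candidate.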

From the chosen source we take up to 20 of the most recent articles, fetch each one, and extract a single "lead" image per article, taken either from JSON-LD NewsArticle / Article markup or from <meta property="og:image">. Inline body images and related-story thumbnails are deliberately skipped: they often pick up site chrome, ads, or lazy-load placeholders rather than the photographs the article was built around. We also encourage site owners to keep metadata on higher-resolution lead images, and we understand if they prefer to strip it from smaller images to save bandwidth.
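As a rough illustration of lead-image extraction, the sketch below uses only the Python standard library. It assumes JSON-LD is tried before og:image (the text above does not state an order of preference), and the helper names are hypothetical.

```python
import json
from html.parser import HTMLParser

ARTICLE_TYPES = {"NewsArticle", "Article"}


class _LeadImageParser(HTMLParser):
    """Collects JSON-LD script bodies and the first og:image meta tag."""

    def __init__(self):
        super().__init__()
        self.in_jsonld = False
        self.jsonld_blocks = []
        self.og_image = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "script" and (a.get("type") or "").lower() == "application/ld+json":
            self.in_jsonld = True
            self.jsonld_blocks.append("")
        elif tag == "meta" and a.get("property") == "og:image" and self.og_image is None:
            self.og_image = a.get("content")

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_jsonld = False

    def handle_data(self, data):
        if self.in_jsonld:
            self.jsonld_blocks[-1] += data


def _image_url(image):
    """JSON-LD 'image' may be a URL string, an ImageObject dict, or a list of either."""
    if isinstance(image, str):
        return image
    if isinstance(image, dict):
        return image.get("url")
    if isinstance(image, list) and image:
        return _image_url(image[0])
    return None


def _types(node):
    t = node.get("@type")
    return set(t) if isinstance(t, list) else {t}


def lead_image(html):
    """Return one lead image URL for an article page, or None."""
    parser = _LeadImageParser()
    parser.feed(html)
    for block in parser.jsonld_blocks:
        try:
            data = json.loads(block)
        except ValueError:
            continue  # malformed JSON-LD is skipped
        for node in data if isinstance(data, list) else [data]:
            if isinstance(node, dict) and _types(node) & ARTICLE_TYPES:
                url = _image_url(node.get("image"))
                if url:
                    return url
    return parser.og_image  # fallback; assumed order of preference
```

This sketch ignores `<img>` tags entirely, which matches the decision above to skip inline body images and thumbnails. It does not handle JSON-LD nested under `@graph`.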

Each image is then fetched and analysed with ExifTool; the field presence and CDN attribution flow into the per-image, per-article and per-site database tables.
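A minimal sketch of the ExifTool step follows. It assumes `exiftool` is installed and on the PATH, and calls it with the `-j` (JSON output) and `-G1` (group-prefixed tag names) options. The mapping from ExifTool tag names to the four scored fields is our assumption for this example, not a statement of which tags the crawler actually reads.

```python
import json
import subprocess

# Assumed ExifTool tag names (as printed with -G1) for each scored field.
IIM_TAGS = {
    "Creator": "IPTC:By-line",
    "Copyright": "IPTC:CopyrightNotice",
    "CaptionDescription": "IPTC:Caption-Abstract",
    "CreditLine": "IPTC:Credit",
}
XMP_TAGS = {
    "Creator": "XMP-dc:Creator",
    "Copyright": "XMP-dc:Rights",
    "CaptionDescription": "XMP-dc:Description",
    "CreditLine": "XMP-photoshop:Credit",
}


def read_tags(path):
    """Run ExifTool on one file and return its tags as a dict."""
    result = subprocess.run(
        ["exiftool", "-j", "-G1", path],
        capture_output=True,
        text=True,
        check=True,
    )
    # -j prints a JSON array with one object per input file.
    return json.loads(result.stdout)[0]


def split_families(tags):
    """Split a tag dict into (iim, xmp) dicts keyed by scored field name."""
    iim = {field: tags[tag] for field, tag in IIM_TAGS.items() if tag in tags}
    xmp = {field: tags[tag] for field, tag in XMP_TAGS.items() if tag in tags}
    return iim, xmp
```

Keeping the two families separate until the last step makes it possible to report, per field, whether a value survived in IIM, in XMP, or in both.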

What each site status means

ok — we found articles, fetched them, and analysed at least one image.
robots_disallow — the site's robots.txt disallows our user-agent. We record the entry but don't crawl.
no_articles_found — discovery succeeded but no articles in the 30-day window were returned (often a sitemap that points at an empty news feed).
unreachable — every feed and sitemap we tried failed at the network layer. This usually means that a CDN or WAF is rejecting our requests. We make no assumptions about whether the site is healthy for human readers.
timeout — the crawl of that one site exceeded a 5-minute internal cap (rare; usually means a stuck CDN downstream).

Politeness

Metawatch identifies itself as Metawatch/2.0 with a From: metadata-crawler@iptc.org header. It always reads robots.txt first, respects Crawl-delay, and waits at least one second between requests to the same domain. If a publisher declares a long crawl delay we shrink the article sample so the whole site finishes within a fixed time budget, rather than hammering them at the edge of their allowance. We do not bypass paywalls, fingerprinting checks, or CDN bot-management rules.
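The delay and sample-shrinking behaviour can be sketched with Python's standard `urllib.robotparser`. The 300-second budget and the estimate of two requests per article (page plus lead image) are placeholder assumptions, since the text above only says the budget is fixed; `plan_crawl` is a hypothetical name.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "Metawatch/2.0"   # as stated above
MIN_DELAY = 1.0                # at least one second between requests
MAX_ARTICLES = 20              # normal sample size
TIME_BUDGET = 300.0            # seconds; placeholder value, not from the source
REQUESTS_PER_ARTICLE = 2       # assumption: article page + lead image


def plan_crawl(robots_txt, url):
    """Return (delay_seconds, sample_size), or None if robots.txt disallows us."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    if not parser.can_fetch(USER_AGENT, url):
        return None
    # crawl_delay() returns None when no Crawl-delay applies to our user-agent.
    declared = parser.crawl_delay(USER_AGENT) or 0
    delay = max(MIN_DELAY, float(declared))
    # Shrink the sample so the whole site fits inside the time budget.
    affordable = int(TIME_BUDGET // (REQUESTS_PER_ARTICLE * delay))
    return delay, max(1, min(MAX_ARTICLES, affordable))


robots = "User-agent: *\nCrawl-delay: 30\nDisallow: /private/\n"
print(plan_crawl(robots, "https://example.org/news/1"))
```

With a declared delay of 30 seconds and these placeholder numbers, the sample drops from 20 articles to 5, which keeps the crawl inside the budget without shortening the delay the publisher asked for.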

How often we run

Metawatch runs automatically once per month on the first of the month at 02:00 UTC. Each run commits its Parquet output and the exported JSON for this site to the project repository. We may also trigger ad-hoc runs while we are iterating on the crawler — those appear as additional run directories in data/runs/.

Data and source code

Every run's full Parquet output is published under data/runs/ in the project repository. The Dataset page lists every run with file sizes, direct download links, and a quick-start Python snippet. The data is licensed CC BY 4.0; the crawler source code is MIT-licensed. Citations and academic use are welcome — please credit "IPTC Metawatch" with a link back.

For publishers

If you would like to be removed from the crawled list, or to correct your site's entry — a better feed URL, a missing sitemap, a name fix — please email office@iptc.org. We honour removal requests without question.

Caveats and limitations

We sample, we don't crawl exhaustively. A 20-article window may be unrepresentative for very large publications, and the sample is necessarily skewed towards whatever a site happens to publish in any given month.
Scoring rewards presence, not correctness. A photo with a Creator field of "AP" scores the same as one naming the actual photographer.
"Stripped by CDN" versus "never embedded by the publisher" is genuinely hard to distinguish from a single end-of-pipeline fetch. The CDN page gives one view of this; Phase 2 will add a cross-check using publishers who distribute the same image through more than one CDN.
Some publishers serve content differently to crawlers than to human readers (paywalls, soft blocks, geolocation rules). A low score for those sites may reflect what their bot-facing surface looks like, not what their readers see.
C2PA detection is presence-first: we report whether a manifest is embedded, along with the signing certificate's issuer, and we look for DigitalSourceType in c2pa.actions assertions. We do not yet parse individual c2pa.metadata or cawg.metadata assertions.

Latest run: 2026-07-01T031746Z — 455 of 508 sites successfully crawled, 6,845 images analysed.