Dataset

Every Metawatch run is published as a directory of Parquet files in the project repository. The data is licensed CC BY 4.0 — please credit "IPTC Metawatch" with a link back.

How to read it

Parquet is a columnar format readable by most mainstream data tools, including pandas, polars, DuckDB, R/arrow, and Apache Spark. A quick start in Python:

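import pyarrow.parquet as pq

# Load the sites table of one run (to_pandas() needs pandas installed).
sites = pq.read_table("sites.parquet").to_pandas()
# Show the five highest-scoring sites among those with status "ok".
print(sites[sites["status"] == "ok"].sort_values("mean_iptc_score", ascending=False).head())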

Or with DuckDB directly against the raw URL (no download needed):

duckdb -c "SELECT site_id, mean_iptc_score FROM 'https://github.com/iptc/metawatch/raw/main/data/runs/2026-05-13/sites.parquet' WHERE status = 'ok' ORDER BY mean_iptc_score DESC LIMIT 10;"
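To point the same query at a different run or table, the raw URL can be assembled from the run date and the file name. The following is a minimal sketch, assuming every run follows the path pattern of the URL above; `run_file_url`, `BASE_URL`, and `TABLES` are hypothetical names for illustration, not part of Metawatch.

```python
from datetime import date

# Base of the raw-file URL used in the DuckDB example above.
BASE_URL = "https://github.com/iptc/metawatch/raw/main/data/runs"

# The five tables listed under "File layout".
TABLES = {"runs", "sites", "articles", "images", "metadata_fields"}


def run_file_url(run_date: str, table: str) -> str:
    """Build the raw URL of one Parquet file (hypothetical helper).

    Assumes every run is stored as <BASE_URL>/<YYYY-MM-DD>/<table>.parquet.
    """
    date.fromisoformat(run_date)  # raises ValueError on a malformed date
    if table not in TABLES:
        raise ValueError(f"unknown table: {table!r}")
    return f"{BASE_URL}/{run_date}/{table}.parquet"


print(run_file_url("2026-05-13", "sites"))
```

The returned string can be pasted into the DuckDB query above or handed to any Parquet reader that accepts HTTPS URLs.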

File layout

File                      What's in it
runs.parquet              1 row per run: start/end times, totals.
sites.parquet             1 row per site per run: discovery strategy, status, articles & images sampled, mean score.
articles.parquet          1 row per article fetched: URL, publication date, HTTP status, JSON-LD payload.
images.parquet            1 row per image: URL, dimensions, CDN attribution, per-image score, presence flags.
metadata_fields.parquet   1 row per (image, field) tuple: long-form table for per-field aggregation.
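The long-form layout of metadata_fields.parquet lends itself to per-field aggregation, which might look like the sketch below. The column names `field` and `present` are assumptions made for illustration (the real names are in SPEC.md), and the rows are synthetic, not Metawatch data.

```python
from collections import defaultdict


def presence_rate(rows, field_key="field", present_key="present"):
    """Return {field: share of rows where the field is present}.

    `rows` is an iterable of dicts, one per (image, field) pair. The key
    names are placeholders; check SPEC.md for the real column names.
    """
    seen = defaultdict(int)  # rows counted per field
    hits = defaultdict(int)  # rows per field where the field is present
    for row in rows:
        seen[row[field_key]] += 1
        hits[row[field_key]] += bool(row[present_key])
    return {name: hits[name] / seen[name] for name in seen}


# Synthetic stand-in for metadata_fields.parquet (not real Metawatch data).
demo_rows = [
    {"image_id": 1, "field": "creator", "present": True},
    {"image_id": 1, "field": "copyright_notice", "present": False},
    {"image_id": 2, "field": "creator", "present": True},
    {"image_id": 2, "field": "copyright_notice", "present": True},
    {"image_id": 3, "field": "creator", "present": False},
    {"image_id": 3, "field": "copyright_notice", "present": False},
]
rates = presence_rate(demo_rows)
for name, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {rate:.0%}")
```

With real data, the rows could come from `pq.read_table("metadata_fields.parquet").to_pylist()`; in pandas, the same aggregation is a `groupby` on the field column followed by `mean`.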

Runs

Most recent first. Click a file to download it directly.

2026-07-01

2026-07-01 · run took 72m 25s · 455/508 sites, 6,845 images

2026-06-15

2026-06-15 · run took 66m 23s · 452/508 sites, 6,765 images

2026-06-03

2026-06-03 · run took 50m 55s · 339/380 sites, 5,315 images

2026-05-19

2026-05-19 · run took 41m 42s · 294/341 sites, 4,399 images
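The run summaries above can be turned into numbers for a quick comparison across runs. This is a rough sketch that assumes "455/508 sites" means sites completed successfully out of sites attempted; `parse_summary` is a hypothetical helper, and runs.parquet remains the authoritative source for these totals.

```python
import re

# Pattern for lines such as
# "2026-07-01 · run took 72m 25s · 455/508 sites, 6,845 images".
SUMMARY = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2}) · run took (?P<min>\d+)m (?P<sec>\d+)s"
    r" · (?P<ok>\d+)/(?P<total>\d+) sites, (?P<images>[\d,]+) images"
)


def parse_summary(line: str) -> dict:
    """Parse one run summary line into numeric fields (hypothetical helper)."""
    match = SUMMARY.fullmatch(line.strip())
    if match is None:
        raise ValueError(f"unrecognised summary line: {line!r}")
    ok, total = int(match["ok"]), int(match["total"])
    return {
        "date": match["date"],
        "duration_s": int(match["min"]) * 60 + int(match["sec"]),
        "sites_ok": ok,
        "sites_total": total,
        "images": int(match["images"].replace(",", "")),
        "coverage": ok / total,  # assumed meaning: successful / attempted
    }


summaries = [
    "2026-07-01 · run took 72m 25s · 455/508 sites, 6,845 images",
    "2026-06-15 · run took 66m 23s · 452/508 sites, 6,765 images",
    "2026-06-03 · run took 50m 55s · 339/380 sites, 5,315 images",
    "2026-05-19 · run took 41m 42s · 294/341 sites, 4,399 images",
]
for run in map(parse_summary, summaries):
    print(f"{run['date']}: {run['coverage']:.1%} of sites, "
          f"{run['images'] / run['sites_ok']:.1f} images per site")
```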

The schema is described in SPEC.md §7. If you build something interesting on this data, we'd love to hear about it — office@iptc.org.