Dataset
Every Metawatch run is published as a directory of Parquet files in the project repository. The data is licensed CC BY 4.0 — please credit "IPTC Metawatch" with a link back.
How to read it
Parquet is a columnar format readable by every mainstream data tool — pandas, polars, DuckDB, R/arrow, Apache Spark. A quick start in Python:
import pyarrow.parquet as pq
# Load the per-site table into a pandas DataFrame
sites = pq.read_table("sites.parquet").to_pandas()
# Print the five highest-scoring sites whose status is "ok"
print(sites[sites.status == "ok"].sort_values("mean_iptc_score", ascending=False).head())
Or with DuckDB directly against the raw URL (no download needed):
duckdb -c "SELECT site_id, mean_iptc_score FROM 'https://github.com/iptc/metawatch/raw/main/data/runs/2026-05-13/sites.parquet' WHERE status = 'ok' ORDER BY mean_iptc_score DESC LIMIT 10;"
File layout
| File | What's in it |
|---|---|
| runs.parquet | 1 row per run: start/end times, totals. |
| sites.parquet | 1 row per site per run: discovery strategy, status, articles & images sampled, mean score. |
| articles.parquet | 1 row per article fetched: URL, publication date, HTTP status, JSON-LD payload. |
| images.parquet | 1 row per image: URL, dimensions, CDN attribution, per-image score, presence flags. |
| metadata_fields.parquet | 1 row per (image, field) tuple: long-form table for per-field aggregation. |
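The long-form layout of metadata_fields.parquet lends itself to a simple group-by. The sketch below shows one way to compute, for each field, the share of images that carry it. The column names `field` and `present` are placeholders chosen for illustration and are not taken from the schema, so check SPEC.md §7 for the real names and pass them in. The demo frame is synthetic, not Metawatch data.

```python
import pandas as pd


def field_coverage(
    fields: pd.DataFrame,
    field_col: str = "field",      # hypothetical column name
    present_col: str = "present",  # hypothetical column name
) -> pd.Series:
    """Return the share of rows per field in which the field is present."""
    return (
        fields.groupby(field_col)[present_col]
        .mean()                      # mean of a boolean column = fraction True
        .sort_values(ascending=False)
    )


# Synthetic stand-in for metadata_fields.parquet: one row per (image, field).
demo = pd.DataFrame(
    {
        "image_id": [1, 1, 2, 2, 3, 3],
        "field": ["creator", "copyright"] * 3,
        "present": [True, False, True, True, False, False],
    }
)

coverage = field_coverage(demo)
print(coverage)
```

With real data, the frame would come from `pd.read_parquet("metadata_fields.parquet")` instead of the synthetic `demo` frame.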
Runs
Most recent first. Click a file to download it directly.
2026-07-01
Run took 72m 25s · 455/508 sites, 6,845 images
- articles.parquet (5.1 MB)
- images.parquet (891.2 KB)
- metadata_fields.parquet (164.6 KB)
- robots_analysis.parquet (10.3 KB)
- runs.parquet (3.2 KB)
- sites.parquet (32.5 KB)
2026-06-15
Run took 66m 23s · 452/508 sites, 6,765 images
- articles.parquet (5.1 MB)
- images.parquet (933.4 KB)
- metadata_fields.parquet (163.3 KB)
- robots_analysis.parquet (10.3 KB)
- runs.parquet (3.2 KB)
- sites.parquet (32.4 KB)
2026-06-03
Run took 50m 55s · 339/380 sites, 5,315 images
- articles.parquet (4.2 MB)
- images.parquet (673.0 KB)
- metadata_fields.parquet (124.5 KB)
- robots_analysis.parquet (9.0 KB)
- runs.parquet (3.2 KB)
- sites.parquet (26.5 KB)
2026-05-19
Run took 41m 42s · 294/341 sites, 4,399 images
- articles.parquet (3.7 MB)
- images.parquet (576.8 KB)
- metadata_fields.parquet (103.5 KB)
- runs.parquet (3.2 KB)
- sites.parquet (21.0 KB)
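Since each run is published in its own dated directory, scores can be compared across runs. The following is a minimal sketch, assuming every run follows the URL pattern used in the DuckDB example (`.../data/runs/<date>/<file>`) and that sites.parquet has the `status` and `mean_iptc_score` columns used in the quick start. The helper names `run_file_url`, `load_run_file` and `mean_score_by_run` are made up for this example.

```python
import io
from urllib.request import urlopen

import pandas as pd

# Assumption: base path inferred from the DuckDB example URL above.
BASE_URL = "https://github.com/iptc/metawatch/raw/main/data/runs"


def run_file_url(run_date: str, filename: str = "sites.parquet") -> str:
    """Build the download URL for one file of one run (hypothetical helper)."""
    return f"{BASE_URL}/{run_date}/{filename}"


def load_run_file(run_date: str, filename: str = "sites.parquet") -> pd.DataFrame:
    """Download one Parquet file into memory and parse it with pandas."""
    with urlopen(run_file_url(run_date, filename)) as resp:
        return pd.read_parquet(io.BytesIO(resp.read()))


def mean_score_by_run(frames: dict[str, pd.DataFrame]) -> pd.Series:
    """Average mean_iptc_score over sites with status 'ok', per run date."""
    means = {
        run_date: sites.loc[sites["status"] == "ok", "mean_iptc_score"].mean()
        for run_date, sites in frames.items()
    }
    return pd.Series(means, name="mean_iptc_score").sort_index()


if __name__ == "__main__":
    # Run dates as listed above; requires network access.
    run_dates = ["2026-05-19", "2026-06-03", "2026-06-15", "2026-07-01"]
    frames = {d: load_run_file(d) for d in run_dates}
    print(mean_score_by_run(frames))
```

The number of sites differs between runs (341 in the earliest listed run, 508 in the latest), so a change in the average may reflect the changed site list as much as any change in publisher behaviour. Restricting the comparison to `site_id` values present in every run is one way to control for that.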
The schema is described in SPEC.md §7. If you build something interesting on this data, we'd love to hear about it — office@iptc.org.