A 2.1B-document curated web corpus with per-document provenance and quality scores. Cleaned, deduplicated, and license-annotated.
corpus manifest (JSONL index of all shards)