Who Trained Your AI, and Can You Prove It?
The web was built on an implicit contract: you publish content, you receive traffic. That contract is quietly breaking — and almost nobody has a receipt to prove it.
Bots now rival humans on the web
Today, automated bots account for roughly 49% of all internet traffic, according to Cloudflare’s 2024 bot traffic report. A growing share of those bots are not indexing content for human readers. They are consuming it as raw material for AI training pipelines.
Major crawlers currently active include:
- GPTBot — OpenAI’s training crawler
- ClaudeBot — Anthropic’s crawler
- Meta-ExternalAgent — Meta’s data collection agent
- CCBot — Common Crawl, source of many open training datasets
Publishers pay for bandwidth, editorial labour, and infrastructure. AI companies ingest the output, distil it into model weights, and monetise the result — often without attribution, compensation, or disclosure. The crawl leaves no receipt.
robots.txt: a gentleman’s agreement
The robots exclusion standard — formalised in the mid-1990s — allows webmasters to declare crawl permissions in a plain-text file at yourdomain.com/robots.txt. It is entirely voluntary. There is no cryptographic binding. No audit trail. No penalty for violation. A crawler that ignores robots.txt faces no technical barrier — only potential legal exposure, which varies by jurisdiction and is rarely enforced against large technology companies.
More fundamentally, robots.txt was designed for indexing, not training.
Its vocabulary — Allow, Disallow, Crawl-delay — has no semantics for
questions like: “May this content be used to fine-tune a language model?”
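For concreteness, the strongest signal a publisher can send today is a per-agent blanket opt-out. A robots.txt covering the crawlers listed above reads as follows, and it says nothing about indexing versus training:

```
# Blanket opt-outs, honoured only if each crawler chooses to comply
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: CCBot
Disallow: /
```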
Emerging proposals attempt to bridge this gap:
- Cloudflare’s AI Audit controls — allowing site owners to block AI crawlers via their dashboard
- HTTP 402 Payment Required — a dormant status code being revisited for micropayment-gated data access
- ai.txt proposals — a proposed extension to robots.txt specifically for AI training opt-outs
These solve the access problem — not the proof problem. A model trained on data collected before these controls existed carries no verifiable record of what it consumed.
The lawsuits are piling up
The legal landscape reflects the epistemic crisis:
- The New York Times v. OpenAI and Microsoft (December 2023) — alleging copyright infringement at scale
- Andersen v. Stability AI — artists suing over image generation models trained on their work
- Kadrey v. Meta — authors challenging use of books in LLaMA training
- Getty Images v. Stability AI — stock photography licensing conflict
Several publishers have moved toward licensing rather than litigation:
- Reddit’s $60M data deal with Google (February 2024)
- Stack Overflow’s partnership with OpenAI (May 2024)
- The Financial Times’ deal with OpenAI (April 2024)
The majority of publishers, however, have neither the leverage to negotiate nor the tools to detect whether their content was used at all.
The scale problem: human audits cannot work
When regulators attempt to mandate AI auditing — as the EU AI Act does for high-risk systems — they implicitly assume that auditors can inspect training data. This assumption does not survive contact with reality.
Modern foundation models are trained on datasets of extraordinary scale:
| Dataset | Scale | Source |
|---|---|---|
| The Pile | ~300 billion tokens | EleutherAI |
| FineWeb | ~15 trillion tokens | Hugging Face |
| LAION-5B | ~5 billion image-text pairs | LAION |
| RedPajama | ~1.2 trillion tokens | Together AI |
A human auditor tasked with verifying the provenance of a 15-trillion-token dataset is not doing an audit. They are doing archaeology with a teaspoon. Even automated tooling — deduplication pipelines, content classifiers, domain filters — can only sample, and sampling cannot provide mathematical guarantees.
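The arithmetic behind that claim is worth making explicit. Under an assumed (deliberately generous) reading speed, a full manual pass over a FineWeb-scale corpus is measured in geological time:

```python
# Rough arithmetic on audit scale. The reading speed is an
# illustrative assumption, not a measured figure.
TOKENS = 15_000_000_000_000   # FineWeb-scale corpus
TOKENS_PER_MINUTE = 300       # assumed fast technical reader

minutes = TOKENS / TOKENS_PER_MINUTE
years = minutes / (60 * 24 * 365)
print(f"{years:,.0f} years of continuous reading")  # prints "95,129 years of continuous reading"
```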
This is the central gap: compliance frameworks demand accountability, but the infrastructure to deliver it does not yet exist.
The core question nobody can answer
The problem of AI training audits reduces to statements that need to be proven, not merely claimed:
- “This model was trained on dataset D and no other.”
- “Dataset D does not contain documents from domain X.”
- “The training algorithm followed specification S on committed dataset D.”
- “This model’s outputs are consistent with training exclusively on licensed content.”
Today, for virtually every AI system in production, none of these statements can be verified by any external party. Model providers can assert them. They cannot prove them.
The EU AI Act’s conformity assessment requirements for high-risk AI systems, the US Executive Order on Safe, Secure, and Trustworthy AI, and national AI strategies across the UK, Singapore, and Canada all demand accountability that self-attestation cannot provide.
What cryptography can offer
This is not unsolvable. Mathematics has relevant answers — though they come with significant engineering cost.
Dataset commitments and Merkle trees
A Merkle tree is a binary hash tree over a corpus. Each leaf is the hash of a document. The root — a single 32-byte value — is a commitment to the entire dataset. A model provider who publishes a Merkle root before training begins has made a verifiable commitment: they cannot later claim the dataset was different without invalidating the root.
This solves dataset identity. It does not yet prove that training actually used the committed dataset.
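A minimal sketch of this commitment scheme in Python. The odd-level pairing rule here is one common convention, and real systems additionally fix a leaf encoding to rule out ambiguity:

```python
import hashlib

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(documents: list[bytes]) -> bytes:
    """Commit to a corpus: the root changes if any document is
    added, removed, or modified."""
    assert documents, "cannot commit to an empty corpus"
    level = [sha256(doc) for doc in documents]
    while len(level) > 1:
        if len(level) % 2:                 # odd level: duplicate the last hash
            level.append(level[-1])
        level = [sha256(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]

corpus = [b"doc-1", b"doc-2", b"doc-3"]
root = merkle_root(corpus)
print(root.hex())                          # the 32-byte commitment, hex-encoded

# Changing a single document invalidates the commitment:
assert merkle_root([b"doc-1", b"doc-2", b"doc-X"]) != root
```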
Zero-knowledge proofs
Zero-knowledge proofs (ZKPs) allow a prover to convince a verifier that a statement is true, without revealing any information beyond the truth of the statement itself.
Applied to AI training, the goal is a proof of training: a cryptographic object proving that model parameters θ were produced by running algorithm A on dataset D, without revealing D or intermediate computation.
Recent research demonstrates this is tractable for specific model classes. The ZKBoost paper (2026) shows verifiable training for gradient-boosted trees. Systems like zkML are extending these techniques toward neural network inference verification.
The floating-point problem
Here is the central technical friction: modern neural networks operate in floating-point arithmetic (IEEE 754 — fp32, bf16, fp16). Zero-knowledge proof systems operate over finite fields — exact integer arithmetic modulo a prime. These are fundamentally incompatible.
The bridge is fixed-point arithmetic: representing real-valued parameters as scaled integers. A weight of 0.375 becomes 384 at scale factor 1024. Additions and multiplications become integer operations, directly expressible in a ZK circuit. The training algorithm is re-implemented in fixed-point, and this re-implementation is what gets encoded as an arithmetic circuit.
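A minimal sketch of that encoding, using the scale factor from the example above (production systems choose the scale per layer and manage overflow explicitly):

```python
SCALE = 1024  # 2**10: ten fractional bits, as in the example above

def to_fixed(x: float) -> int:
    """Encode a real value as a scaled integer."""
    return round(x * SCALE)

def fx_mul(a: int, b: int) -> int:
    """Multiply two fixed-point values; the product carries the
    scale factor twice, so divide it out once."""
    return (a * b) // SCALE

w = to_fixed(0.375)   # 384, matching the example in the text
x = to_fixed(0.5)     # 512
y = fx_mul(w, x)      # 192, which represents 192 / 1024 = 0.1875
assert y / SCALE == 0.375 * 0.5
```

Every operation on the last three lines is pure integer arithmetic, which is exactly what a finite-field circuit can express.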
The costs are real: quantisation error, range management, and circuit size. Current ZK proving systems — including Plonky3, HyperPlonk, and folding schemes like Nova — are advancing rapidly. Hardware acceleration for ZK proving is an active research and commercial area.
The statements we cannot currently verify
To be concrete about what is missing, here is a non-exhaustive list of claims routinely made by AI companies that have no current mechanism for cryptographic verification:
- “Our model was not trained on this copyrighted content.” — OpenAI’s position in the NYT lawsuit
- “We honour robots.txt opt-outs.” — stated by multiple labs, unverifiable
- “Personal data was removed from our training set.” — standard GDPR compliance claim
- “Our model does not exhibit bias against protected classes.” — common in enterprise AI marketing
- “We trained only on licensed data.” — Adobe Firefly’s positioning
Each of these is an assertion. None is a proof. This is the problem statement.
Further reading
Legislation and regulation
- EU AI Act — Official Text
- EU AI Act Explorer — searchable, annotated version
- US Executive Order on AI (October 2023)
- NIST AI Risk Management Framework
- UK AI Safety Institute
- Canada’s Artificial Intelligence and Data Act (AIDA)
- China’s Interim Measures for Generative AI
- OECD AI Policy Observatory — tracking AI policies across 70+ countries
- UNESCO Recommendation on the Ethics of AI
Lawsuits and legal proceedings
- NYT v. OpenAI — complaint (December 2023)
- Andersen v. Stability AI — class action tracker
- Getty Images v. Stability AI (UK High Court)
- Thomson Reuters v. ROSS Intelligence — first AI copyright ruling
- AI Copyright Litigation Tracker — Duke University
Data licensing and industry responses
- Reddit’s $60M data deal with Google (February 2024)
- Stack Overflow’s partnership with OpenAI (May 2024)
- The Financial Times’ deal with OpenAI (April 2024)
Data provenance and transparency
- The Foundation Model Transparency Index
- Data Provenance Initiative — auditing popular AI datasets
- Common Crawl — open dataset used across AI training
- Spawning.ai — opt-out infrastructure for AI training
- Have I Been Trained? — check if your images are in AI datasets
This article is part of an ongoing effort at badaas.be to track verifiable statements at the intersection of AI, data governance, and cryptographic accountability. Claims cited here link to primary sources. If a source is missing or a claim has been updated, open an issue.