Who Trained Your AI, and Can You Prove It?
The web was built on an implicit contract: you publish content, you receive traffic. That contract is quietly breaking — and almost nobody has a receipt to prove it.
Bots now rival humans on the web
Today, automated bots account for roughly 49% of all internet traffic, according to Cloudflare’s 2024 bot traffic report. A growing share of those bots are not indexing content for human readers. They are consuming it as raw material for AI training pipelines.
Major crawlers currently active include:
- GPTBot — OpenAI’s training crawler
- ClaudeBot — Anthropic’s crawler
- Meta-ExternalAgent — Meta’s data collection agent
- CCBot — Common Crawl, source of many open training datasets
Publishers pay for bandwidth, editorial labour, and infrastructure. AI companies ingest the output, distil it into model weights, and monetise the result — often without attribution, compensation, or disclosure. The crawl leaves no receipt.
robots.txt: a gentleman’s agreement
The robots exclusion standard — formalised in the mid-1990s — allows webmasters to declare crawl permissions in a plain-text file at yourdomain.com/robots.txt. It is entirely voluntary. There is no cryptographic binding. No audit trail. No penalty for violation. A crawler that ignores robots.txt faces no technical barrier — only potential legal exposure, which varies by jurisdiction and is rarely enforced against large technology companies.
More fundamentally, robots.txt was designed for indexing, not training.
Its vocabulary — Allow, Disallow, Crawl-delay — has no semantics for
questions like: “May this content be used to fine-tune a language model?”
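For concreteness, the strongest signal a publisher can send today is a per-agent blanket opt-out. A robots.txt covering the crawlers listed above reads as follows, and it says nothing about indexing versus training:

```
# Blanket opt-outs, honoured only if each crawler chooses to comply
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: CCBot
Disallow: /
```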
Emerging proposals attempt to bridge this gap:
- Cloudflare’s AI Audit controls — allowing site owners to block AI crawlers via their dashboard
- HTTP 402 Payment Required — a dormant status code being revisited for micropayment-gated data access
- ai.txt proposals — a proposed extension to robots.txt specifically for AI training opt-outs
These solve the access problem — not the proof problem. A model trained on data collected before these controls existed carries no verifiable record of what it consumed.
The lawsuits are piling up
The legal landscape reflects the epistemic crisis:
- The New York Times v. OpenAI and Microsoft (December 2023) — alleging copyright infringement at scale
- Andersen v. Stability AI — artists suing over image generation models trained on their work
- Kadrey v. Meta — authors challenging use of books in LLaMA training
- Getty Images v. Stability AI — stock photography licensing conflict
Several publishers have moved toward licensing rather than litigation:
- Reddit’s $60M data deal with Google (February 2024)
- Stack Overflow’s partnership with OpenAI (May 2024)
- The Financial Times’ deal with OpenAI (April 2024)
The majority of publishers, however, have neither the leverage to negotiate nor the tools to detect whether their content was used at all.
The scale problem: human audits cannot work
When regulators attempt to mandate AI auditing — as the EU AI Act does for high-risk systems — they implicitly assume that auditors can inspect training data. This assumption does not survive contact with reality.
Modern foundation models are trained on datasets of extraordinary scale:
| Dataset | Scale | Source |
|---|---|---|
| The Pile | ~300 billion tokens | EleutherAI |
| FineWeb | ~15 trillion tokens | Hugging Face |
| LAION-5B | ~5 billion image-text pairs | LAION |
| RedPajama | ~1.2 trillion tokens | Together AI |
A human auditor tasked with verifying the provenance of a 15-trillion-token dataset is not doing an audit. They are doing archaeology with a teaspoon. Even automated tooling — deduplication pipelines, content classifiers, domain filters — can only sample, and sampling cannot provide mathematical guarantees.
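The arithmetic behind that claim is worth making explicit. Under an assumed (deliberately generous) reading speed, a full manual pass over a FineWeb-scale corpus is measured in geological time:

```python
# Rough arithmetic on audit scale. The reading speed is an
# illustrative assumption, not a measured figure.
TOKENS = 15_000_000_000_000   # FineWeb-scale corpus
TOKENS_PER_MINUTE = 300       # assumed fast technical reader

minutes = TOKENS / TOKENS_PER_MINUTE
years = minutes / (60 * 24 * 365)
print(f"{years:,.0f} years of continuous reading")  # prints "95,129 years of continuous reading"
```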
This is the central gap: compliance frameworks demand accountability, but the infrastructure to deliver it does not yet exist.
The core question nobody can answer
The problem of AI training audits reduces to statements that need to be proven, not merely claimed:
- “This model was trained on dataset D and no other.”
- “Dataset D does not contain documents from domain X.”
- “The training algorithm followed specification S on committed dataset D.”
- “This model’s outputs are consistent with training exclusively on licensed content.”
Today, for virtually every AI system in production, none of these statements can be verified by any external party. Model providers can assert them. They cannot prove them.
The EU AI Act’s conformity assessment requirements for high-risk AI systems, the US Executive Order on Safe, Secure, and Trustworthy AI, and national AI strategies across the UK, Singapore, and Canada all demand accountability that self-attestation cannot provide.
What cryptography can offer
This is not unsolvable. Mathematics has relevant answers — though they come with significant engineering cost.
Dataset commitments and Merkle trees
A Merkle tree is a binary hash tree over a corpus. Each leaf is the hash of a document. The root — a single 32-byte value — is a commitment to the entire dataset. A model provider who publishes a Merkle root before training begins has made a verifiable commitment: they cannot later claim the dataset was different without invalidating the root.
This solves dataset identity. It does not yet prove that training actually used the committed dataset.
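A minimal sketch of this commitment scheme in Python. The odd-level pairing rule here is one common convention, and real systems additionally fix a leaf encoding to rule out ambiguity:

```python
import hashlib

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(documents: list[bytes]) -> bytes:
    """Commit to a corpus: the root changes if any document is
    added, removed, or modified."""
    assert documents, "cannot commit to an empty corpus"
    level = [sha256(doc) for doc in documents]
    while len(level) > 1:
        if len(level) % 2:                 # odd level: duplicate the last hash
            level.append(level[-1])
        level = [sha256(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]

corpus = [b"doc-1", b"doc-2", b"doc-3"]
root = merkle_root(corpus)
print(root.hex())                          # the 32-byte commitment, hex-encoded

# Changing a single document invalidates the commitment:
assert merkle_root([b"doc-1", b"doc-2", b"doc-X"]) != root
```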
Zero-knowledge proofs
Zero-knowledge proofs (ZKPs) allow a prover to convince a verifier that a statement is true, without revealing any information beyond the truth of the statement itself.
Applied to AI training, the goal is a proof of training: a cryptographic object proving that model parameters θ were produced by running algorithm A on dataset D, without revealing D or intermediate computation.
Recent research demonstrates this is tractable for specific model classes. The ZKBoost paper (2026) shows verifiable training for gradient-boosted trees. Systems like zkML are extending these techniques toward neural network inference verification.
The floating-point problem
Here is the central technical friction: modern neural networks operate in floating-point arithmetic (IEEE 754 — fp32, bf16, fp16). Zero-knowledge proof systems operate over finite fields — exact integer arithmetic modulo a prime. These are fundamentally incompatible.
The bridge is fixed-point arithmetic: representing real-valued parameters as scaled integers. A weight of 0.375 becomes 384 at scale factor 1024. Additions and multiplications become integer operations, directly expressible in a ZK circuit. The training algorithm is re-implemented in fixed-point, and this re-implementation is what gets encoded as an arithmetic circuit.
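A minimal sketch of that encoding, using the scale factor from the example above (production systems choose the scale per layer and manage overflow explicitly):

```python
SCALE = 1024  # 2**10: ten fractional bits, as in the example above

def to_fixed(x: float) -> int:
    """Encode a real value as a scaled integer."""
    return round(x * SCALE)

def fx_mul(a: int, b: int) -> int:
    """Multiply two fixed-point values; the product carries the
    scale factor twice, so divide it out once."""
    return (a * b) // SCALE

w = to_fixed(0.375)   # 384, matching the example in the text
x = to_fixed(0.5)     # 512
y = fx_mul(w, x)      # 192, which represents 192 / 1024 = 0.1875
assert y / SCALE == 0.375 * 0.5
```

Every operation on the last three lines is pure integer arithmetic, which is exactly what a finite-field circuit can express.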
The costs are real: quantisation error, range management, and circuit size. Current ZK proving systems — including Plonky3, HyperPlonk, and folding schemes like Nova — are advancing rapidly. Hardware acceleration for ZK proving is an active research and commercial area.
The statements we cannot currently verify
To be concrete about what is missing, here is a non-exhaustive list of claims routinely made by AI companies that have no current mechanism for cryptographic verification:
- “Our model was not trained on this copyrighted content.” — OpenAI’s position in the NYT lawsuit
- “We honour robots.txt opt-outs.” — stated by multiple labs, unverifiable
- “Personal data was removed from our training set.” — standard GDPR compliance claim
- “Our model does not exhibit bias against protected classes.” — common in enterprise AI marketing
- “We trained only on licensed data.” — Adobe Firefly’s positioning
Each of these is an assertion. None is a proof. This is the problem statement.
Further reading
Legislation and regulation
- EU AI Act — Official Text
- EU AI Act Explorer — searchable, annotated version
- US Executive Order on AI (October 2023)
- NIST AI Risk Management Framework
- UK AI Safety Institute
- Canada’s Artificial Intelligence and Data Act (AIDA)
- China’s Interim Measures for Generative AI
- OECD AI Policy Observatory — tracking AI policies across 70+ countries
- UNESCO Recommendation on the Ethics of AI
Lawsuits and legal proceedings
- NYT v. OpenAI — complaint (December 2023)
- Andersen v. Stability AI — class action tracker
- Getty Images v. Stability AI (UK High Court)
- Thomson Reuters v. ROSS Intelligence — first AI copyright ruling
- AI Copyright Litigation Tracker — Duke University
Data licensing and industry responses
- Reddit’s $60M data deal with Google (February 2024)
- Stack Overflow’s partnership with OpenAI (May 2024)
- The Financial Times’ deal with OpenAI (April 2024)
Data provenance and transparency
- The Foundation Model Transparency Index
- Data Provenance Initiative — auditing popular AI datasets
- Common Crawl — open dataset used across AI training
- Spawning.ai — opt-out infrastructure for AI training
- Have I Been Trained? — check if your images are in AI datasets
This article is part of an ongoing effort at badaas.be to track verifiable statements at the intersection of AI, data governance, and cryptographic accountability. Claims cited here link to primary sources. If a source is missing or a claim has been updated, open an issue.