Who Trained Your AI, and Can You Prove It?

By Danny Willems

The web was built on an implicit contract: you publish content, you receive traffic. That contract is quietly breaking — and almost nobody has a receipt to prove it.


Bots now rival humans on the web

Today, automated bots account for roughly 49% of all internet traffic, according to Cloudflare’s 2024 bot traffic report. A growing share of those bots are not indexing content for human readers. They are consuming it as raw material for AI training pipelines.

Major crawlers currently active include:

- GPTBot (OpenAI)
- ClaudeBot (Anthropic)
- CCBot (Common Crawl)
- Google-Extended (Google's user-agent token for AI training)
- Bytespider (ByteDance)

Publishers pay for bandwidth, editorial labour, and infrastructure. AI companies ingest the output, distil it into model weights, and monetise the result — often without attribution, compensation, or disclosure. The crawl leaves no receipt.


robots.txt: a gentleman’s agreement

The robots exclusion standard — formalised in the mid-1990s — allows webmasters to declare crawl permissions in a plain-text file at yourdomain.com/robots.txt. It is entirely voluntary.

There is no cryptographic binding. No audit trail. No penalty for violation. A crawler that ignores robots.txt faces no technical barrier — only potential legal exposure, which varies by jurisdiction and is rarely enforced against large technology companies.

More fundamentally, robots.txt was designed for indexing, not training. Its vocabulary — Allow, Disallow, Crawl-delay — has no semantics for questions like: “May this content be used to fine-tune a language model?”
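A minimal example makes the vocabulary gap concrete. GPTBot is OpenAI's documented crawler token; the rest is standard robots.txt syntax:

```
# Block OpenAI's training crawler entirely; allow everyone else.
User-agent: GPTBot
Disallow: /

User-agent: *
Crawl-delay: 10
Allow: /
```

Nothing in this grammar can express "index this page for search, but do not train on it" — access and purpose are conflated.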

Emerging proposals attempt to bridge this gap:

- The IETF AI Preferences working group, standardising machine-readable signals for AI-training consent
- The W3C TDM Reservation Protocol (TDMRep), which lets rights holders reserve text-and-data-mining rights in machine-readable form
- Per-purpose user-agent tokens such as Google-Extended, which separate AI-training access from search indexing

These solve the access problem — not the proof problem. A model trained on data collected before these controls existed carries no verifiable record of what it consumed.


The lawsuits are piling up

The legal landscape reflects the epistemic crisis:

- The New York Times v. OpenAI and Microsoft (filed December 2023), alleging large-scale copying of Times content for training
- Getty Images v. Stability AI, over millions of Getty photographs allegedly used to train Stable Diffusion
- Authors Guild v. OpenAI, alongside parallel suits by individual authors, over books ingested into training corpora

Several publishers have moved toward licensing rather than litigation:

- OpenAI has signed content-licensing deals with the Associated Press, Axel Springer, the Financial Times, and News Corp
- Reddit has licensed its data to Google for model training

The majority of publishers, however, have neither the leverage to negotiate nor the tools to detect whether their content was used at all.


The scale problem: human audits cannot work

When regulators attempt to mandate AI auditing — as the EU AI Act does for high-risk systems — they implicitly assume that auditors can inspect training data. This assumption does not survive contact with reality.

Modern foundation models are trained on datasets of extraordinary scale:

| Dataset | Scale | Source |
| --- | --- | --- |
| The Pile | ~300 billion tokens | EleutherAI |
| FineWeb | ~15 trillion tokens | Hugging Face |
| LAION-5B | ~5 billion image-text pairs | LAION |
| RedPajama | ~1.2 trillion tokens | Together AI |

A human auditor tasked with verifying the provenance of a 15-trillion-token dataset is not doing an audit. They are doing archaeology with a teaspoon. Even automated tooling — deduplication pipelines, content classifiers, domain filters — can only sample; it cannot provide mathematical guarantees.

This is the central gap: compliance frameworks demand accountability, but the infrastructure to deliver it does not yet exist.


The core question nobody can answer

The problem of AI training audits reduces to statements that need to be proven, not merely claimed:

- "This model was trained on exactly this dataset, and nothing else."
- "That dataset contains no content from these sources."
- "These published parameters are the output of this training algorithm run on that committed data."

Today, for virtually every AI system in production, none of these statements can be verified by any external party. Model providers can assert them. They cannot prove them.

The EU AI Act’s conformity assessment requirements for high-risk AI systems, the US Executive Order on Safe, Secure, and Trustworthy AI, and national AI strategies across the UK, Singapore, and Canada all demand accountability that self-attestation cannot provide.


What cryptography can offer

This is not unsolvable. Mathematics has relevant answers — though they come with significant engineering cost.

Dataset commitments and Merkle trees

A Merkle tree constructs a binary hash tree over a corpus. Each leaf is the hash of a document. The root — a single 32-byte value — is a commitment to the entire dataset. A model provider who publishes a Merkle root before training begins has made a verifiable commitment: they cannot later claim the dataset was different without invalidating the root.

This solves dataset identity. It does not yet prove that training actually used the committed dataset.
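The commitment step can be sketched in a few lines of Python. The pairwise SHA-256 scheme below is a minimal illustration, not a production tree format:

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(documents: list[bytes]) -> bytes:
    """Commit to a corpus: hash each document, then fold pairs upward."""
    level = [h(doc) for doc in documents]
    while len(level) > 1:
        if len(level) % 2:  # duplicate the last node on odd-sized levels
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

corpus = [b"doc-a", b"doc-b", b"doc-c"]
root = merkle_root(corpus)
print(root.hex())  # a single 32-byte commitment to the whole corpus

# Changing any document changes the root, so a pre-published root
# pins the dataset's identity.
assert merkle_root([b"doc-a", b"doc-b", b"doc-X"]) != root
```

Publishing `root` before training is the cheap half of the protocol; the expensive half, proving the training run actually consumed the committed leaves, is what the next section is about.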

Zero-knowledge proofs

Zero-knowledge proofs (ZKPs) allow a prover to convince a verifier that a statement is true, without revealing any information beyond the truth of the statement itself.

Applied to AI training, the goal is a proof of training: a cryptographic object proving that model parameters theta were produced by running algorithm A on dataset D, without revealing D or intermediate computation.
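Written out (notation ours, with the Merkle commitment from the previous section), the statement such a proof must establish is roughly:

```latex
% Public inputs: dataset commitment root, model parameters \theta.
% Witness (kept secret by the prover): the dataset D.
\exists\, D :\quad \mathsf{root} = \mathrm{MerkleRoot}(D) \;\land\; \theta = A(D)
```

The verifier learns that some dataset matching the public commitment produced the public weights, and nothing else about its contents.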

Recent research demonstrates this is tractable for specific model classes. The ZKBoost paper (2026) shows verifiable training for gradient-boosted trees. Systems like zkML are extending these techniques toward neural network inference verification.

The floating-point problem

Here is the central technical friction: modern neural networks operate in floating-point arithmetic (IEEE 754 — fp32, bf16, fp16). Zero-knowledge proof systems operate over finite fields — exact integer arithmetic modulo a prime. These are fundamentally incompatible.

The bridge is fixed-point arithmetic: representing real-valued parameters as scaled integers. A weight of 0.375 becomes 384 at scale factor 1024. Additions and multiplications become integer operations, directly expressible in a ZK circuit. The training algorithm is re-implemented in fixed-point, and this re-implementation is what gets encoded as an arithmetic circuit.
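As a toy illustration of the scaling arithmetic (the scale factor 1024 matches the example above; real systems choose the scale from field size and error budget):

```python
SCALE = 1024  # fixed-point scale: x is stored as round(x * SCALE)

def encode(x: float) -> int:
    return round(x * SCALE)

def decode(q: int) -> float:
    return q / SCALE

def fx_add(a: int, b: int) -> int:
    # Addition preserves the scale: the sum is still at scale SCALE.
    return a + b

def fx_mul(a: int, b: int) -> int:
    # The raw product sits at scale SCALE**2; one division restores it.
    # In a ZK circuit this division is enforced via range checks rather
    # than computed natively, which is part of the circuit-size cost.
    return (a * b) // SCALE

w = encode(0.375)    # 384, as in the text
x = encode(0.5)      # 512
y = fx_mul(w, x)
print(y, decode(y))  # 192 0.1875
```

This example divides evenly; in general the truncating division is exactly where quantisation error enters, which is why the scale factor and the ranges of intermediate values have to be budgeted across the whole training computation.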

The costs are real: quantisation error, range management, and circuit size. Current ZK proving systems — including Plonky3, HyperPlonk, and folding schemes like Nova — are advancing rapidly. Hardware acceleration for ZK proving is an active research and commercial area.


The statements we cannot currently verify

To be concrete about what is missing, here is a non-exhaustive list of claims routinely made by AI companies that have no current mechanism for cryptographic verification:

  1. “Our model was not trained on this copyrighted content.” — OpenAI’s position in the NYT lawsuit
  2. “We honour robots.txt opt-outs.” — stated by multiple labs, unverifiable
  3. “Personal data was removed from our training set.” — standard GDPR compliance claim
  4. “Our model does not exhibit bias against protected classes.” — common in enterprise AI marketing
  5. “We trained only on licensed data.” — Adobe Firefly’s positioning

Each of these is an assertion. None is a proof. This is the problem statement.




This article is part of an ongoing effort at badaas.be to track verifiable statements at the intersection of AI, data governance, and cryptographic accountability. Claims cited here link to primary sources. If a source is missing or a claim has been updated, open an issue.