CPU Efficiency
While AO hash functions are designed for efficient evaluation inside arithmetic circuits, their native CPU performance also matters. In practice, a hash function is evaluated both inside a proof system (the prover must compute the witness) and outside it (e.g., for Merkle tree construction, commitment schemes, or standalone hashing). This section analyses the CPU cost of each primitive.
Why CPU efficiency matters for AO primitives
The total cost of using an AO hash function in a proof system includes:
- Circuit cost: number of constraints/rows (covered in Arithmetization)
- Prover native cost: the prover must evaluate the hash function to compute the witness before encoding it as constraints
- Standalone cost: applications that use the hash function outside proofs (e.g., building Merkle trees for data availability)
A function that is extremely circuit-efficient but slow on CPUs creates a bottleneck at step 2 and 3. The ideal AO hash function is fast in both settings.
Cost breakdown per operation
The primitive operations in AO hash functions and their approximate CPU costs:
| Operation | CPU cost | Notes |
|---|---|---|
| Field addition | ~1 ns (64-bit limb) | Cheap on all architectures |
| Field multiplication | ~5-20 ns | Depends on field size and Montgomery |
| Field squaring | ~4-15 ns | Slightly cheaper than general multiply |
| Field inversion | ~200-500 ns | Expensive; uses extended Euclidean |
| (square-and-multiply) | ~15-50 ns | 2 squarings + 1 multiply |
| (Tonelli-Shanks) | ~300-1000 ns | Much more expensive than forward |
| MDS matrix-vector product | multiplications | Optimised to in some cases |
| Lookup table (8-bit) | ~1 ns | Cache-friendly on CPU |
Comparison across primitives
The table below estimates the total number of field multiplications for one permutation call. This is the dominant factor in CPU performance.
| Primitive | Muls per round | Rounds | Total muls (approx) | Notes |
|---|---|---|---|---|
| Poseidon () | ~12 (full), ~6 (partial) | 8F+57P | ~400 | Partial rounds reduce mul count |
| Poseidon2 () | ~8 (ext), ~4 (int) | 4E+56I | ~260 | Cheaper linear layer |
| Rescue-Prime () | ~6 + inv | 14 | ~700+ | Inverse S-box is CPU-expensive |
| Anemoi () | ~10 | 14 | ~140 | Flystel avoids full inversions |
| Griffin () | ~8 + 1 inv | 12 | ~500+ | One inverse per round |
| Reinforced Concrete | ~20 (table lookups) | 8 | ~160 | Lookup-based S-box is fast on CPU |
| Tip5 | ~15 (table + power) | 5 | ~75 | Fewest rounds; fast on CPU |
Field choice impact
The CPU cost of field arithmetic varies significantly with the field:
Inverse S-box: the CPU bottleneck
The inverse power map deserves special attention. While it is essentially free in AIR and PlonK-ish arithmetizations (the verifier checks ), on CPU the prover must actually compute .
There are two approaches:
- Exponentiation: compute . The inverse exponent is typically close to in magnitude, requiring ~ squarings and multiplications.
- Tonelli-Shanks variant: for specific values and field shapes, specialised algorithms exist (e.g., cube roots via the Cipolla-Lehmer method).
For Rescue-Prime (which uses every other round), this means the CPU cost per round is dominated by the inverse S-box computation. This is the primary reason why Rescue is less popular for applications where native performance matters, even though its circuit efficiency in AIR is excellent.
Parallelism and vectorisation
Modern CPUs can exploit parallelism at multiple levels:
- SIMD: field operations on BabyBear or Goldilocks can be vectorised using AVX2/AVX-512 (4-8 elements per instruction)
- S-box parallelism: full rounds apply the S-box to all state elements independently, enabling parallel evaluation
- Multi-hash: when computing many hashes (e.g., Merkle tree layers), multiple permutation calls can run in parallel
Primitives with wider states (larger ) expose more S-box parallelism per permutation call. However, they also require larger MDS matrices, which can offset the gains.
References
- Grassi, Khovratovich, Rechberger, Roy, Schofnegger. "Poseidon: A New Hash Function for Zero-Knowledge Proof Systems" (USENIX Security 2021) ePrint 2019/458 - includes CPU benchmarks
- Grassi, Hao, Rechberger, Tribastone, Yun. "Poseidon2: A Faster Version of the Poseidon Hash Function" (2023) ePrint 2023/323 - CPU performance comparison
- Polygon Zero Team. "Plonky2: Fast Recursive Arguments with PLONK and FRI" (2022) - benchmarks over Goldilocks field