Performance

Benchmark Results

All numbers are measured at production context lengths on H100 SXM5 hardware. No cherry-picked short sequences, no accuracy–speed trade-off footnotes.

5×

Faster Inference

At 500K+ context length

Measured against standard (unoptimised) attention on an H100 SXM5 with a 70B-class model. Gains grow with context length: the longer the context, the wider the gap.

10×

KV Memory Reduction

Measured at 100K tokens

The key-value (KV) cache is the primary memory bottleneck at long context. Antilattice's attention compression shrinks the cache without eviction, retaining full sequence fidelity.
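To make the scale concrete, here is a rough sketch of how KV cache size grows with context. The model shape (80 layers, 8 KV heads, head dimension 128) is a hypothetical 70B-class configuration chosen for illustration, not a published Antilattice test setup, and the final step simply applies the 10× figure quoted above rather than modelling the compression itself.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence.

    The leading factor of 2 accounts for storing both K and V.
    bytes_per_value=2 corresponds to FP16.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# Hypothetical 70B-class shape (assumption): 80 layers, 8 KV heads, head dim 128, FP16.
baseline = kv_cache_bytes(tokens=100_000, layers=80, kv_heads=8, head_dim=128)

# Apply the 10x reduction quoted above; this does not model how the compression works.
compressed = baseline / 10

print(f"baseline KV cache:   {baseline / 1e9:.2f} GB")
print(f"compressed KV cache: {compressed / 1e9:.2f} GB")
```

Under these assumptions a single 100K-token sequence needs tens of gigabytes of cache before compression, which is why the cache rather than the weights limits batch size at long context.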

10×

Concurrent Users per GPU

Validated on H100

Reduced per-request memory lets the runtime batch more sequences at once. Throughput scales linearly with the memory saving: no software tricks, just headroom.
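The headroom argument can be sketched as simple division: the number of sequences that fit is the memory left for KV cache divided by the cache each sequence needs. The budget and per-sequence sizes below are hypothetical round numbers picked for illustration, not measured values.

```python
def max_concurrent_sequences(kv_budget_bytes, per_sequence_bytes):
    """Whole number of sequences whose KV cache fits in the given budget."""
    if per_sequence_bytes <= 0:
        raise ValueError("per_sequence_bytes must be positive")
    return kv_budget_bytes // per_sequence_bytes


# Hypothetical figures (assumptions): 40 GB left for KV cache,
# 4 GB per sequence uncompressed, 0.4 GB per sequence at 10x compression.
kv_budget = 40_000_000_000
before = max_concurrent_sequences(kv_budget, 4_000_000_000)
after = max_concurrent_sequences(kv_budget, 400_000_000)

print(f"sequences before: {before}, after: {after}")
```

Because the result is rounded down to whole sequences, the ratio is exactly 10× only when the budget divides evenly; in general it is approximately proportional to the memory saving.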

Inference Speedup vs Context Length

Throughput (tokens/sec) relative to standard attention on the same hardware, with the baseline normalised to 1×. 70B-class model, H100 SXM5, FP16.

Context Length	Baseline (Standard Attn)	Antilattice
8K	1×	1.4×
32K	1×	2.1×
128K	1×	3.8×
256K	1×	4.6×
500K+	1×	5×+

Methodology

How we measure

Benchmarks run on isolated H100 SXM5 instances with no competing workloads. We report median throughput across 50 inference requests at each context length, with a 10-request warmup discarded.
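A minimal sketch of that aggregation, assuming the 10 warmup requests are issued first and are in addition to the 50 measured ones (the text does not say which). The function name and the sample values are illustrative only.

```python
from statistics import median


def median_throughput(samples, warmup=10, measured=50):
    """Median tokens/sec over the measured window, ignoring warmup requests.

    samples: per-request throughput in arrival order.
    Assumption: warmup requests come first and are not counted in `measured`.
    """
    if len(samples) < warmup + measured:
        raise ValueError("not enough samples for warmup plus measured window")
    window = samples[warmup:warmup + measured]
    return median(window)


# Illustrative data: 10 slow warmup requests followed by 50 steady-state ones.
samples = [1.0] * 10 + [100.0 + i for i in range(50)]
result = median_throughput(samples)
print(result)
```

Using the median rather than the mean keeps a single slow or fast request from shifting the reported figure.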

Accuracy is verified by comparing output logit distributions against unoptimised attention: the KL divergence between the two must be below 0.001. Any configuration that fails this threshold is excluded from results.
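One way such a check could look for a single token position is sketched below. The direction of the divergence (reference relative to optimised), the use of natural logarithms, and the per-position comparison are all assumptions; the text does not specify them or how results are aggregated across positions.

```python
import math


def log_softmax(logits):
    """Numerically stable log-probabilities from raw logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def kl_divergence(ref_logits, test_logits):
    """KL(P_ref || P_test) in nats for one token position."""
    log_p = log_softmax(ref_logits)
    log_q = log_softmax(test_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def passes_accuracy_check(ref_logits, test_logits, threshold=1e-3):
    """True when the divergence is under the threshold quoted above."""
    return kl_divergence(ref_logits, test_logits) < threshold


# Illustrative logits only: an identical pair passes, a reversed pair fails.
print(passes_accuracy_check([2.0, 1.0, 0.0], [2.0, 1.0, 0.0]))
print(passes_accuracy_check([2.0, 1.0, 0.0], [0.0, 1.0, 2.0]))
```

Working in log space avoids dividing by probabilities that underflow to zero when the vocabulary is large.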

Independent replication packets are available on request at hello@antilattice.com.