Performance
Benchmark Results
All numbers are measured at production context lengths on standard hardware. No cherry-picked short sequences, no accuracy–speed trade-off footnotes.
Faster Inference
At 500K+ context length
Measured against unoptimised attention on an H100 SXM5 with a 70B-class model. The gap widens as context grows, from 1.4× at 8K tokens to 5×+ at 500K+.
KV Memory Reduction
Measured at 100K tokens
The key-value (KV) cache is the primary memory bottleneck at long context. Antilattice's attention compression shrinks the cache without evicting tokens, retaining full sequence fidelity.
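To see why the cache dominates memory at this length, here is a rough sketch of the uncompressed KV cache size for one sequence. The model shape (80 layers, 8 KV heads, head dimension 128) is an assumed, typical 70B-class configuration and is not taken from the benchmark setup. The function name `kv_cache_bytes` is illustrative only.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Uncompressed KV cache size in bytes for a single sequence."""
    # Factor of 2: each layer stores one key tensor and one value tensor.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Assumed 70B-class shape (hypothetical, not from the benchmark):
# 80 layers, 8 KV heads (grouped-query attention), head dim 128, FP16 (2 bytes).
baseline = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128, seq_len=100_000)
print(f"{baseline / 1e9:.1f} GB per 100K-token sequence")  # prints "32.8 GB per 100K-token sequence"
```

Under these assumptions a single 100K-token request already needs tens of gigabytes for its cache alone, which is why any reduction here translates directly into usable headroom.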
Concurrent Users per GPU
Validated on H100
Reduced per-request memory allows the runtime to batch more sequences simultaneously. Throughput scales linearly with the memory saving — no software tricks, just headroom.
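The linear relationship can be illustrated with a back-of-the-envelope capacity estimate. This is a simplified sketch: the memory budget and per-sequence figures below are placeholder values, not measured results, and the model ignores activations, fragmentation, and scheduler overhead.

```python
def max_concurrent_sequences(kv_budget_gb, kv_gb_per_sequence):
    """Number of sequences whose KV caches fit in the memory left for caching."""
    if kv_budget_gb <= 0 or kv_gb_per_sequence <= 0:
        return 0
    return int(kv_budget_gb // kv_gb_per_sequence)


# Placeholder numbers for illustration only.
budget_gb = 40.0                       # memory left after weights and activations
before = max_concurrent_sequences(budget_gb, kv_gb_per_sequence=8.0)
after = max_concurrent_sequences(budget_gb, kv_gb_per_sequence=2.0)
print(before, after)                   # a 4x smaller cache fits 4x the sequences
```

With a fixed budget, the batch size the runtime can hold is inversely proportional to per-request cache size, so a given memory saving yields the same factor in concurrent sequences.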
Inference Speedup vs Context Length
Relative throughput (tokens/sec) compared to standard attention on the same hardware. 70B-class model, H100 SXM5, FP16.
| Context Length | Baseline (Standard Attn) | Antilattice |
|---|---|---|
| 8K | 1× | 1.4× |
| 32K | 1× | 2.1× |
| 128K | 1× | 3.8× |
| 256K | 1× | 4.6× |
| 500K+ | 1× | 5×+ |
Methodology
How we measure
Benchmarks run on isolated H100 SXM5 instances with no competing workloads. We report median throughput across 50 inference requests at each context length, with a 10-request warmup discarded.
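A minimal sketch of that aggregation step is shown below. It assumes the warmup requests come first and are additional to the 50 measured requests; the function names are illustrative and not part of any Antilattice tooling.

```python
import statistics


def median_throughput(tokens_per_sec, warmup=10):
    """Median per-request throughput after discarding the leading warmup requests."""
    measured = tokens_per_sec[warmup:]
    if not measured:
        raise ValueError("no measurements left after discarding warmup")
    return statistics.median(measured)


def relative_speedup(optimised, baseline, warmup=10):
    """Ratio of median throughputs, as reported in the speedup table."""
    return median_throughput(optimised, warmup) / median_throughput(baseline, warmup)
```

The median is used rather than the mean so that a few slow outlier requests do not skew the reported figure.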
Accuracy is verified by comparing output logit distributions against unoptimised attention with a KL-divergence threshold of <0.001. Any configuration that fails this threshold is excluded from results.
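The check can be sketched as follows. Several details are assumptions, since the text does not specify them: the divergence is computed per token position as KL(reference || optimised) in nats, and a configuration passes only if the worst position stays under the threshold. The function names are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-probabilities from raw logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def kl_divergence(ref_logits, test_logits):
    """KL(P_ref || P_test) in nats for one token position."""
    log_p = log_softmax(ref_logits)
    log_q = log_softmax(test_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def passes_accuracy_check(ref_positions, test_positions, threshold=1e-3):
    """True if the worst per-position KL divergence is below the threshold."""
    worst = max(kl_divergence(r, t) for r, t in zip(ref_positions, test_positions))
    return worst < threshold
```

Working in log space avoids division by zero when the optimised path assigns a vanishingly small probability to a token.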
Replication packets for independent verification are available on request at hello@antilattice.com.