KV cache memory cost model: group-wise low-rank compression with quantization

Model & sequence parameters
4096
32768
4
32
512
fp16
Low-rank ranks
1.50
Ranks are derived from target α and β, or you can set r_k directly:
auto
3.0×
Quantization — per matrix type
A_K shared token basis
128
B_K per-layer K factors
128
A_V shared value basis
128
B_V per-layer V factors
128
K_decode dense decode keys
none
V_decode dense decode values
none

Quantization overhead consists of one scale (fp16) per group and, for asymmetric quantization only, one base (fp16) per group, amortised over the elements of that group. group_size = 0 means per-row quantization (one scale per row).

Results

Effective α (full model): vs. dense fp16 baseline
Prefill-only α: ignoring decode buffers
Compressed (MB): shown alongside the baseline size
r_k / r_v: resolved ranks

Memory breakdown per group (one group, elements → bytes)
Effective bits-per-element per matrix

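Effective bits = data bits + scale overhead + base overhead (if asymmetric). The scale is one fp16 value (16 bits) per group_size elements, so it adds 16 / group_size bits per element. For asymmetric quantization, the base is an additional fp16 value per group_size elements and adds another 16 / group_size bits per element.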
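
The effective-bits rule above can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's own code: the names `effective_bits`, `matrix_bytes`, and `row_len` are hypothetical, and treating group_size = 0 as "one group spanning a row of `row_len` elements" is an assumption based on the per-row note in the quantization section.

```python
from typing import Optional

SCALE_BITS = 16  # one fp16 scale per group
BASE_BITS = 16   # one fp16 base per group, asymmetric quantization only


def effective_bits(data_bits: int,
                   group_size: int,
                   asymmetric: bool = False,
                   row_len: Optional[int] = None) -> float:
    """Bits stored per element, including amortised scale/base overhead.

    group_size == 0 is taken to mean per-row quantization, in which case
    row_len (an assumed, caller-supplied row length) acts as the group size.
    """
    if group_size < 0:
        raise ValueError("group_size must be >= 0")
    if group_size == 0:
        if row_len is None or row_len <= 0:
            raise ValueError("row_len is required when group_size == 0")
        group_size = row_len
    overhead_bits = SCALE_BITS + (BASE_BITS if asymmetric else 0)
    return data_bits + overhead_bits / group_size


def matrix_bytes(num_elements: int, bits_per_element: float) -> float:
    """Convert an element count to bytes at a given effective bit width."""
    return num_elements * bits_per_element / 8


if __name__ == "__main__":
    # 4-bit symmetric quantization with groups of 128 elements
    print(effective_bits(4, 128))                    # 4.125
    # same, asymmetric: scale and base each add 16/128 bits
    print(effective_bits(4, 128, asymmetric=True))   # 4.25
```

With a group size of 128 the overhead is small (0.125 bits per element for the scale alone), which is why larger groups bring the effective width closer to the nominal data bits.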

Formula reference