KV cache memory cost model: group-wise low-rank compression with quantization
Model & sequence parameters
4096
32768
4
32
512
fp16
Low-rank ranks
1.50
Ranks are derived from target α and β, or you can set r_k directly:
auto
3.0×
Quantization — per matrix type
A_K (shared token basis): 128
B_K (per-layer K factors): 128
A_V (shared value basis): 128
B_V (per-layer V factors): 128
K_decode (dense decode keys): none
V_decode (dense decode values): none
Quantization overhead consists of one scale (fp16) per group and, for asymmetric quantization only, one base (fp16) per group, each amortised over the elements of that group. group_size = 0 means per-row quantization (one scale per row).
Results
Effective α (full model): vs. dense fp16 baseline
Prefill-only α: ignoring decode buffers
Compressed (MB): reported against the baseline size
r_k / r_v: resolved ranks
Memory breakdown per group (one group, elements → bytes)
Effective bits-per-element per matrix
Effective bits = data bits + scale overhead + base overhead (if asymmetric). Scale is fp16 (16 bits) per group_size elements. Base is an additional fp16 per group_size elements for asymmetric quantization.
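The rule above can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's own code: the function names `effective_bits` and `matrix_bytes`, the `row_len` argument (needed to amortise the per-row case when group_size = 0), and the `meta_bits` parameter are all assumptions introduced here.

```python
def effective_bits(data_bits, group_size, asymmetric=False, row_len=None, meta_bits=16):
    """Effective bits per element: data bits + scale overhead (+ base overhead).

    Hypothetical helper. group_size == 0 is treated as per-row quantization,
    so the caller must supply row_len (elements per row) to amortise over.
    meta_bits is the width of the scale and base, 16 for fp16.
    """
    if group_size == 0:
        if row_len is None:
            raise ValueError("row_len is required when group_size == 0 (per-row)")
        group = row_len
    else:
        group = group_size
    scale_overhead = meta_bits / group                          # one scale per group
    base_overhead = meta_bits / group if asymmetric else 0.0    # asymmetric only
    return data_bits + scale_overhead + base_overhead


def matrix_bytes(n_elements, bits_per_element):
    """Convert an element count to bytes at a given effective bit width."""
    return n_elements * bits_per_element / 8


# 4-bit symmetric, group_size = 128: 4 + 16/128
print(effective_bits(4, 128))                    # 4.125
# 4-bit asymmetric, group_size = 128: 4 + 16/128 + 16/128
print(effective_bits(4, 128, asymmetric=True))   # 4.25
# 4-bit symmetric, per-row with 512 elements per row: 4 + 16/512
print(effective_bits(4, 0, row_len=512))         # 4.03125
# 1024 elements at 4.125 effective bits
print(matrix_bytes(1024, effective_bits(4, 128)))  # 528.0
```

With group_size = 128 the scale adds only 0.125 bits per element, so at this group size the overhead is small compared with the data bits; smaller groups raise it proportionally.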