Estimate the total GPU memory required for LLM inference: model weights, KV cache, and activation memory. Supports weight, KV cache, and activation quantization, and fetches the model config directly from Hugging Face.
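The tool fetches the config itself; outside the tool, the same fields can be pulled with the Hugging Face `transformers` library. A minimal sketch, with the model ID and field selection chosen purely for illustration:

```python
# Minimal sketch: fetch an LLM config from the Hugging Face Hub and extract
# the architecture fields a memory estimate needs. Model ID is illustrative.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("Qwen/Qwen2.5-7B")

arch = {
    "hidden_size": config.hidden_size,
    "num_layers": config.num_hidden_layers,
    "num_attention_heads": config.num_attention_heads,
    # GQA models expose fewer KV heads; fall back to full MHA if the field is absent.
    "num_kv_heads": getattr(config, "num_key_value_heads", config.num_attention_heads),
    "intermediate_size": config.intermediate_size,
    "vocab_size": config.vocab_size,
}
print(arch)
```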
The estimator takes the following inputs:
- Model Architecture
- MoE (set Num Experts = 1 for dense)
- Quantization
- Inference Config
- GPU
Memory Breakdown (per GPU)
| Component | Memory | Formula |
|---|---|---|
| Model Weights | — | |
| KV Cache | — | |
| Activation Memory (1 layer peak) | — | |
| Total | — | |
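For reference, the sketch below shows one way these rows can be combined, taking the parameter count from the Model Parameters section and the KV cache size from the KV Cache Formula section further down. The function name, the tensor-parallel sharding assumption, and the per-layer activation term are illustrative, not the tool's exact implementation.

```python
def memory_breakdown_gib(
    num_params: float,        # total parameters (see Model Parameters below)
    weight_bits: int,         # e.g. 16 for FP16/BF16, 8 or 4 after weight quantization
    kv_cache_bytes: float,    # see the KV Cache Formula section below
    activation_bytes: float,  # peak activation memory of one layer
    num_gpus: int = 1,        # tensor-parallel degree: weights and KV heads are sharded
) -> dict[str, float]:
    """Per-GPU memory estimate mirroring the breakdown table above."""
    gib = 1024 ** 3
    weights = num_params * weight_bits / 8 / num_gpus
    kv = kv_cache_bytes / num_gpus
    act = activation_bytes    # kept unsharded here for simplicity
    return {
        "model_weights": weights / gib,
        "kv_cache": kv / gib,
        "activation_1_layer_peak": act / gib,
        "total": (weights + kv + act) / gib,
    }

# Example: 8e9 params in 4-bit weights, 2 GiB of KV cache, 1 GiB activation peak, 1 GPU
print(memory_breakdown_gib(8e9, 4, 2 * 1024**3, 1 * 1024**3))
```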
Model Parameters
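The tool derives this count from the fetched config. The sketch below is a rough count for a dense decoder-only model with SwiGLU FFN and untied embeddings (the architecture named in the Note), ignoring small terms such as RMSNorm weights; it is an approximation rather than the tool's exact formula, and for MoE the FFN term would be multiplied by the number of experts.

```python
def count_params_dense_swiglu(
    vocab_size: int,
    hidden_size: int,
    num_layers: int,
    num_attention_heads: int,
    num_kv_heads: int,
    intermediate_size: int,
    tie_embeddings: bool = False,
) -> int:
    """Approximate parameter count for a decoder-only SwiGLU Transformer.
    Ignores RMSNorm weights and biases (negligible at LLM scale)."""
    head_dim = hidden_size // num_attention_heads
    # Attention: Q and O projections are hidden x hidden; K and V shrink with GQA.
    attn = 2 * hidden_size * hidden_size + 2 * hidden_size * num_kv_heads * head_dim
    # SwiGLU FFN: gate, up, and down projections.
    ffn = 3 * hidden_size * intermediate_size
    per_layer = attn + ffn
    embeddings = vocab_size * hidden_size * (1 if tie_embeddings else 2)
    return num_layers * per_layer + embeddings

# Llama-3.1-8B-like config values (illustrative):
print(count_params_dense_swiglu(128256, 4096, 32, 32, 8, 14336) / 1e9)  # ~8.03 billion
```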
KV Cache Formula
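The tool renders the formula from the selected config; a commonly used form, assuming grouped-query attention and a factor of 2 for the K and V tensors, is:

$$
M_{\mathrm{KV}} \;=\; 2 \times n_{\mathrm{layers}} \times n_{\mathrm{kv}} \times d_{\mathrm{head}} \times s \times b \times \mathrm{bytes}_{\mathrm{elem}}
$$

where $n_{\mathrm{kv}}$ is the number of KV heads, $s$ the max sequence length, $b$ the batch size, and $\mathrm{bytes}_{\mathrm{elem}}$ is 2 for FP16/BF16 or 1 with 8-bit KV cache quantization.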
GPU Fit Check
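The fit check is a comparison of the estimated per-GPU total against the selected GPU's memory. A minimal sketch, where the reserve for framework overhead (see the Note below) is an assumed value:

```python
def fits_on_gpu(total_per_gpu_gib: float,
                gpu_memory_gib: float,
                reserve_gib: float = 1.0) -> bool:
    """True if the estimated per-GPU footprint fits, leaving `reserve_gib`
    for CUDA context and framework buffers (see the Note below)."""
    return total_per_gpu_gib <= gpu_memory_gib - reserve_gib

# Example: an 18.4 GiB estimate on a 24 GiB GPU
print(fits_on_gpu(18.4, 24.0))  # True
```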
Roofline Analysis (per layer, decode)
Arithmetic Intensity (AI) = FLOPs / Bytes. If AI is below the GPU's ops:byte ratio (peak FLOP/s divided by memory bandwidth), the operation is Memory Bound; otherwise it is Compute Bound.
| Operation | FLOPs | Bytes | AI (F/B) | Bound |
|---|---|---|---|---|
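As a worked example of the classification rule, the sketch below computes the AI of a single weight matrix-vector multiply during decode at batch size 1, where each weight byte is read once and used for roughly one multiply-add. The GPU peak numbers are illustrative (A100-class), not tied to any specific selection in the tool.

```python
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    return flops / bytes_moved

def classify(ai: float, peak_flops: float, mem_bw_bytes_per_s: float) -> str:
    """Memory bound if AI is below the GPU's ops:byte ratio, else compute bound."""
    ops_per_byte = peak_flops / mem_bw_bytes_per_s
    return "Memory Bound" if ai < ops_per_byte else "Compute Bound"

# Decode-step GEMV through one 4096x4096 FP16 weight matrix at batch size 1:
# ~2*N*K FLOPs, and the dominant traffic is reading the N*K 2-byte weights once.
n, k = 4096, 4096
flops = 2 * n * k
bytes_moved = n * k * 2
ai = arithmetic_intensity(flops, bytes_moved)   # = 1.0 FLOP/byte

# Illustrative GPU peaks: ~312 TFLOP/s FP16 and ~2 TB/s HBM bandwidth.
print(ai, classify(ai, 312e12, 2.0e12))         # Memory Bound
```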
References
- Fujii et al., “Accelerating Large Language Model Training with 4D Parallelism and Memory Consumption Estimator” (arXiv:2411.06465), by the author of this tool
- Williams et al., “Roofline: An Insightful Visual Performance Model” (2009)
- JAX Scaling Book, “Rooflines”
- kipply, “Transformer Inference Arithmetic”
Note
This estimator targets Transformer-based LLMs (decoder-only, with SwiGLU FFN and RMSNorm). It does not support SSM-based models (Mamba, RWKV, etc.), Gated Linear Networks, or Diffusion models. Additionally:
- KV cache assumes the full max sequence length is allocated. With paged attention (e.g., vLLM, TGI), pages are allocated on demand, so actual usage may be lower.
- Inference frameworks have their own memory overhead (CUDA context ~300–500 MB, framework buffers, etc.), which is not included here.
Citation
@misc{fujii2024acceleratinglargelanguagemodel,
title={Accelerating Large Language Model Training with 4D Parallelism and Memory Consumption Estimator},
author={Kazuki Fujii and Kohei Watanabe and Rio Yokota},
year={2024},
eprint={2411.06465},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2411.06465},
}