Estimates the total GPU memory required for LLM inference: model weights, KV cache, and activation memory. Supports weight quantization, KV cache quantization, and activation quantization, and fetches the model config directly from the Hugging Face Hub.
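As a minimal sketch of the config-fetch step, the fields the estimator needs can be read with transformers.AutoConfig. The field names below follow the common Llama-style config, and the function name and the commented model ID are illustrative, not the tool's actual API.

```python
from transformers import AutoConfig


def fetch_model_dims(model_id: str) -> dict:
    """Pull the config fields the estimator needs from the Hugging Face Hub."""
    cfg = AutoConfig.from_pretrained(model_id)
    return {
        "vocab_size": cfg.vocab_size,
        "hidden_size": cfg.hidden_size,
        "num_layers": cfg.num_hidden_layers,
        "num_heads": cfg.num_attention_heads,
        # GQA models expose fewer KV heads; fall back to MHA if the field is absent.
        "num_kv_heads": getattr(cfg, "num_key_value_heads", cfg.num_attention_heads),
        "intermediate_size": cfg.intermediate_size,
        "head_dim": getattr(cfg, "head_dim", cfg.hidden_size // cfg.num_attention_heads),
    }


# Example (model ID is illustrative):
# dims = fetch_model_dims('meta-llama/Llama-3.1-8B')
```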

Memory Breakdown (per GPU)
  • Model Weights: num_params × bytes_per_weight, where bytes_per_weight follows the weight quantization setting (e.g. 2 for FP16/BF16, 1 for INT8, 0.5 for INT4).
  • KV Cache: 2 × batch_size × max_seq_len × num_layers × num_kv_heads × head_dim × bytes_per_kv_element (the factor 2 covers keys and values).
  • Activation Memory (1 layer peak): the largest intermediate tensor alive within a single Transformer layer (typically the SwiGLU gate/up projection or the attention score matrix), which scales with batch size, sequence length, and the activation precision.
  • Total: Model Weights + KV Cache + Activation Memory (1 layer peak).
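The two closed-form components above can be written down in a few lines. This is a minimal sketch assuming each quantization setting is expressed as bytes per element; the function and argument names are illustrative, not the estimator's actual API.

```python
def weight_memory_bytes(num_params: int, bytes_per_weight: float) -> float:
    """Model weights: parameter count times bytes per element
    (2.0 for FP16/BF16, 1.0 for INT8, 0.5 for INT4)."""
    return num_params * bytes_per_weight


def kv_cache_bytes(batch_size: int, max_seq_len: int, num_layers: int,
                   num_kv_heads: int, head_dim: int, bytes_per_kv: float) -> float:
    """KV cache: keys and values (factor 2) for every layer, KV head, and token,
    allocated up front for the full max sequence length."""
    return 2 * batch_size * max_seq_len * num_layers * num_kv_heads * head_dim * bytes_per_kv


# Example with Llama-3.1-8B-like dims, FP16 weights and KV cache, batch 1, 8k context:
#   weights ≈ 8e9 * 2 bytes            ≈ 16 GB
#   kv      ≈ 2 * 1 * 8192 * 32 * 8 * 128 * 2 bytes ≈ 1.07 GB
```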
Model Parameters
For a decoder-only Transformer with SwiGLU FFN and RMSNorm, the parameter count is approximately
  P ≈ vocab_size × hidden_size (embeddings)
      + num_layers × [ hidden_size × (hidden_size + 2 × num_kv_heads × head_dim) (Q/K/V projections)
                       + hidden_size × hidden_size (attention output projection)
                       + 3 × hidden_size × intermediate_size (SwiGLU gate/up/down)
                       + 2 × hidden_size (per-layer RMSNorms) ]
      + hidden_size (final RMSNorm) + vocab_size × hidden_size (LM head, if untied from the embeddings)
KV Cache Formula
KV cache bytes = 2 × batch_size × max_seq_len × num_layers × num_kv_heads × head_dim × bytes_per_kv_element, where bytes_per_kv_element follows the KV cache quantization setting (e.g. 2 for FP16, 1 for FP8/INT8).
GPU Fit Check
A configuration fits when Total (per GPU) does not exceed the GPU's VRAM capacity; framework overhead (see Note below) is not part of Total.
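As an illustration of the parameter count and the fit check, here is a hedged Python sketch. It assumes `dims` is the dictionary returned by the config-fetch sketch above, implements the standard SwiGLU/RMSNorm parameter count rather than the tool's exact code, and uses illustrative names throughout.

```python
def param_count(dims: dict, tied_embeddings: bool = True) -> int:
    """Approximate parameter count for a decoder-only model with SwiGLU FFN and RMSNorm."""
    h, v, n_layers = dims["hidden_size"], dims["vocab_size"], dims["num_layers"]
    d_ff, n_kv, d_head = dims["intermediate_size"], dims["num_kv_heads"], dims["head_dim"]
    attn = h * (h + 2 * n_kv * d_head) + h * h   # Q/K/V projections + output projection
    ffn = 3 * h * d_ff                           # SwiGLU gate, up, down
    norms = 2 * h                                # two RMSNorm weight vectors per layer
    lm_head = 0 if tied_embeddings else v * h    # untied LM head adds another v*h
    return v * h + n_layers * (attn + ffn + norms) + h + lm_head


def fits_on_gpu(total_bytes: float, gpu_vram_gib: float) -> bool:
    """Fit check: estimated total against raw VRAM (framework overhead not included)."""
    return total_bytes <= gpu_vram_gib * 1024**3
```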
Roofline Analysis (per layer, decode)

Arithmetic Intensity (AI) = FLOPs / Bytes. If AI is below the GPU's ops:byte ratio, the operation is Memory Bound; otherwise it is Compute Bound.

Per-operation breakdown with columns: Operation, FLOPs, Bytes, AI (F/B), Bound.
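To make the roofline bookkeeping concrete, here is a small sketch that computes the arithmetic intensity of one weight matrix applied during decode and compares it against an assumed, approximate A100 ops:byte ratio; the function name and the GPU numbers are illustrative, not part of the estimator.

```python
def gemm_decode_ai(batch_size: int, d_in: int, d_out: int,
                   bytes_per_weight: float, bytes_per_act: float) -> float:
    """Arithmetic intensity of one weight matrix applied to `batch_size` decode tokens.
    FLOPs: 2 * m * k * n for an (m x k) @ (k x n) matmul with m = batch_size.
    Bytes: weight matrix plus input/output activations (weights dominate at small batch)."""
    flops = 2 * batch_size * d_in * d_out
    bytes_moved = (d_in * d_out * bytes_per_weight
                   + batch_size * (d_in + d_out) * bytes_per_act)
    return flops / bytes_moved


# Example: FFN down-projection of a Llama-3.1-8B-like layer at batch 1, FP16:
#   ai = gemm_decode_ai(1, 14336, 4096, 2.0, 2.0)   -> roughly 1 FLOP/byte
# An A100's dense BF16 ops:byte ratio is roughly 312e12 / 2.0e12 ≈ 156,
# so this GEMM is firmly Memory Bound during decode.
```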
Note
This estimator targets Transformer-based LLMs (decoder-only, with SwiGLU FFN and RMSNorm). It does not support SSM-based models (Mamba, RWKV, etc.), Gated Linear Networks, or Diffusion models. Additionally:
  • The KV cache estimate assumes the full max sequence length is allocated up front. With paged attention (e.g., vLLM, TGI), pages are allocated on demand, so actual usage may be lower.
  • Inference frameworks have their own memory overhead (CUDA context ~300–500 MB, framework buffers, etc.) which is not included here.
Citation
@misc{fujii2024acceleratinglargelanguagemodel,
  title={Accelerating Large Language Model Training with 4D Parallelism and Memory Consumption Estimator},
  author={Kazuki Fujii and Kohei Watanabe and Rio Yokota},
  year={2024},
  eprint={2411.06465},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2411.06465},
}