Estimate the total GPU memory required for LLM inference: model weights, KV cache, and activation memory. Supports weight, KV cache, and activation quantization, and fetches the model config directly from Hugging Face.
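The tool fetches the config itself; outside the tool, the same fields can be pulled with the Hugging Face `transformers` library. A minimal sketch, with the model ID and field selection chosen purely for illustration:

```python
# Minimal sketch: fetch an LLM config from the Hugging Face Hub and extract
# the architecture fields a memory estimate needs. Model ID is illustrative.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("Qwen/Qwen2.5-7B")

arch = {
    "hidden_size": config.hidden_size,
    "num_layers": config.num_hidden_layers,
    "num_attention_heads": config.num_attention_heads,
    # GQA models expose fewer KV heads; fall back to full MHA if the field is absent.
    "num_kv_heads": getattr(config, "num_key_value_heads", config.num_attention_heads),
    "intermediate_size": config.intermediate_size,
    "vocab_size": config.vocab_size,
}
print(arch)
```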
The estimator takes the following inputs:
- Model Architecture
- MoE (set Num Experts = 1 for dense)
- Quantization
- Inference Config
- GPU
Memory Breakdown (per GPU)
| Component | Memory | Formula |
|---|---|---|
| Model Weights | — | |
| KV Cache | — | |
| Activation Memory (1 layer peak) | — | |
| Total | — | |
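For reference, the sketch below shows one way these rows can be combined, taking the parameter count from the Model Parameters section and the KV cache size from the KV Cache Formula section further down. The function name, the tensor-parallel sharding assumption, and the per-layer activation term are illustrative, not the tool's exact implementation.

```python
def memory_breakdown_gib(
    num_params: float,        # total parameters (see Model Parameters below)
    weight_bits: int,         # e.g. 16 for FP16/BF16, 8 or 4 after weight quantization
    kv_cache_bytes: float,    # see the KV Cache Formula section below
    activation_bytes: float,  # peak activation memory of one layer
    num_gpus: int = 1,        # tensor-parallel degree: weights and KV heads are sharded
) -> dict[str, float]:
    """Per-GPU memory estimate mirroring the breakdown table above."""
    gib = 1024 ** 3
    weights = num_params * weight_bits / 8 / num_gpus
    kv = kv_cache_bytes / num_gpus
    act = activation_bytes    # kept unsharded here for simplicity
    return {
        "model_weights": weights / gib,
        "kv_cache": kv / gib,
        "activation_1_layer_peak": act / gib,
        "total": (weights + kv + act) / gib,
    }

# Example: 8e9 params in 4-bit weights, 2 GiB of KV cache, 1 GiB activation peak, 1 GPU
print(memory_breakdown_gib(8e9, 4, 2 * 1024**3, 1 * 1024**3))
```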
Model Parameters
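The tool derives this count from the fetched config. The sketch below is a rough count for a dense decoder-only model with SwiGLU FFN and untied embeddings (the architecture named in the Note), ignoring small terms such as RMSNorm weights; it is an approximation rather than the tool's exact formula, and for MoE the FFN term would be multiplied by the number of experts.

```python
def count_params_dense_swiglu(
    vocab_size: int,
    hidden_size: int,
    num_layers: int,
    num_attention_heads: int,
    num_kv_heads: int,
    intermediate_size: int,
    tie_embeddings: bool = False,
) -> int:
    """Approximate parameter count for a decoder-only SwiGLU Transformer.
    Ignores RMSNorm weights and biases (negligible at LLM scale)."""
    head_dim = hidden_size // num_attention_heads
    # Attention: Q and O projections are hidden x hidden; K and V shrink with GQA.
    attn = 2 * hidden_size * hidden_size + 2 * hidden_size * num_kv_heads * head_dim
    # SwiGLU FFN: gate, up, and down projections.
    ffn = 3 * hidden_size * intermediate_size
    per_layer = attn + ffn
    embeddings = vocab_size * hidden_size * (1 if tie_embeddings else 2)
    return num_layers * per_layer + embeddings

# Llama-3.1-8B-like config values (illustrative):
print(count_params_dense_swiglu(128256, 4096, 32, 32, 8, 14336) / 1e9)  # ~8.03 billion
```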
KV Cache Formula
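The tool renders the formula from the selected config; a commonly used form, assuming grouped-query attention and a factor of 2 for the K and V tensors, is:

$$
M_{\mathrm{KV}} \;=\; 2 \times n_{\mathrm{layers}} \times n_{\mathrm{kv}} \times d_{\mathrm{head}} \times s \times b \times \mathrm{bytes}_{\mathrm{elem}}
$$

where $n_{\mathrm{kv}}$ is the number of KV heads, $s$ the max sequence length, $b$ the batch size, and $\mathrm{bytes}_{\mathrm{elem}}$ is 2 for FP16/BF16 or 1 with 8-bit KV cache quantization.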
GPU Fit Check
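The fit check is a comparison of the estimated per-GPU total against the selected GPU's memory. A minimal sketch, where the reserve for framework overhead (see the Note below) is an assumed value:

```python
def fits_on_gpu(total_per_gpu_gib: float,
                gpu_memory_gib: float,
                reserve_gib: float = 1.0) -> bool:
    """True if the estimated per-GPU footprint fits, leaving `reserve_gib`
    for CUDA context and framework buffers (see the Note below)."""
    return total_per_gpu_gib <= gpu_memory_gib - reserve_gib

# Example: an 18.4 GiB estimate on a 24 GiB GPU
print(fits_on_gpu(18.4, 24.0))  # True
```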
Roofline Analysis (per layer, decode)
Arithmetic Intensity (AI) = FLOPs / Bytes. If AI is below the GPU's ops:byte ratio (peak FLOP/s divided by memory bandwidth), the operation is Memory Bound; otherwise it is Compute Bound.
| Operation | FLOPs | Bytes | AI (F/B) | Bound |
|---|---|---|---|---|
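As a worked example of the classification rule, the sketch below computes the AI of a single weight matrix-vector multiply during decode at batch size 1, where each weight byte is read once and used for roughly one multiply-add. The GPU peak numbers are illustrative (A100-class), not tied to any specific selection in the tool.

```python
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    return flops / bytes_moved

def classify(ai: float, peak_flops: float, mem_bw_bytes_per_s: float) -> str:
    """Memory bound if AI is below the GPU's ops:byte ratio, else compute bound."""
    ops_per_byte = peak_flops / mem_bw_bytes_per_s
    return "Memory Bound" if ai < ops_per_byte else "Compute Bound"

# Decode-step GEMV through one 4096x4096 FP16 weight matrix at batch size 1:
# ~2*N*K FLOPs, and the dominant traffic is reading the N*K 2-byte weights once.
n, k = 4096, 4096
flops = 2 * n * k
bytes_moved = n * k * 2
ai = arithmetic_intensity(flops, bytes_moved)   # = 1.0 FLOP/byte

# Illustrative GPU peaks: ~312 TFLOP/s FP16 and ~2 TB/s HBM bandwidth.
print(ai, classify(ai, 312e12, 2.0e12))         # Memory Bound
```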
References
- Fujii et al., “Accelerating Large Language Model Training with 4D Parallelism and Memory Consumption Estimator” (arXiv:2411.06465), by the author of this tool
- Williams et al., “Roofline: An Insightful Visual Performance Model” (2009)
- JAX Scaling Book, “Rooflines”
- kipply, “Transformer Inference Arithmetic”
Note
This estimator targets Transformer-based LLMs (decoder-only, with SwiGLU FFN and RMSNorm). It does not support SSM-based models (Mamba, RWKV, etc.), Gated Linear Networks, or Diffusion models. Additionally:
- KV cache assumes the full max sequence length is allocated. With paged attention (e.g., vLLM, TGI), pages are allocated on demand, so actual usage may be lower.
- Inference frameworks have their own memory overhead (CUDA context ~300–500 MB, framework buffers, etc.), which is not included here.
Citation
@misc{fujii2024acceleratinglargelanguagemodel,
title={Accelerating Large Language Model Training with 4D Parallelism and Memory Consumption Estimator},
author={Kazuki Fujii and Kohei Watanabe and Rio Yokota},
year={2024},
eprint={2411.06465},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2411.06465},
}