Interactive Estimators
Practical client-side tools for planning LLM training and inference. All calculations run entirely in your browser. Fetch model configs directly from HuggingFace.
Estimate total GPU memory for LLM inference: model weights, KV cache, and activation memory. Supports weight, KV cache, and activation quantization. Model configs can be fetched directly from HuggingFace, loaded from a preset (Llama-3.1-8B, Llama-3.1-70B, Llama-3.1-405B, Qwen3-8B, Qwen3-235B-A22B (MoE), DeepSeek-V3 (MoE)), or entered manually.

Inputs:
- Model architecture: hidden size (h), dense intermediate size (h_ffn), number of layers (L), attention heads (a), KV heads (k), head dimension (d), vocabulary size (v), and whether word embeddings are tied to the LM head.
- MoE (set the number of experts to 1 for dense models): routed experts, shared experts, expert FFN size (moe_intermediate_size), and experts activated per token (top-k).
- Quantization: weight precision (BF16/FP16 at 2 bytes, FP32 at 4 bytes, FP8/INT8 at 1 byte, NVFP4/MXFP4/INT4 at 0.5 bytes), KV cache precision (FP16/BF16, FP8/INT8, NVFP4/MXFP4/INT4), and activation precision (FP16/BF16, FP8).
- Inference config: max sequence length, batch size, and tensor parallel (TP) size.
- GPU type: A100 (40 GB or 80 GB), H100 (80 GB), H100 NVL (94 GB), H200 (141 GB), B200 (180 GB).

Outputs: a per-GPU memory breakdown with the formula for each component (model weights, KV cache, peak one-layer activation memory, and the total), the model parameter count, the KV cache formula, and a GPU fit check. A per-layer decode roofline analysis reports arithmetic intensity (AI = FLOPs / bytes): if AI is below the GPU's ops:byte ratio, the operation is memory bound; otherwise it is compute bound.
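To make the arithmetic concrete, here is a minimal sketch of the kind of calculation this estimator performs for a dense grouped-query-attention model. The config fields, formulas, and example numbers are illustrative assumptions, not the tool's actual code, and small terms such as norm weights are ignored.

```typescript
// Rough per-GPU inference memory estimate for a dense GQA transformer.
// All names and formulas here are illustrative assumptions, not the tool's code.

interface ModelConfig {
  h: number;      // hidden size
  hFfn: number;   // dense intermediate (FFN) size
  L: number;      // number of layers
  a: number;      // attention heads
  k: number;      // KV heads
  d: number;      // head dimension
  v: number;      // vocab size
  tieEmbeddings: boolean;
}

const GiB = 1024 ** 3;

// Approximate parameter count: embeddings + per-layer attention + SwiGLU MLP
// (+ untied LM head). Norm and bias terms are ignored as negligible.
function paramCount(c: ModelConfig): number {
  const attn = c.h * c.d * (c.a + 2 * c.k) + c.a * c.d * c.h; // Wq, Wk, Wv, Wo
  const mlp = 3 * c.h * c.hFfn;                               // gate, up, down
  const embed = c.v * c.h * (c.tieEmbeddings ? 1 : 2);
  return embed + c.L * (attn + mlp);
}

// KV cache: 2 tensors (K and V) per layer, each of shape [batch, seq, k, d].
function kvCacheBytes(c: ModelConfig, seq: number, batch: number, kvBytes: number): number {
  return 2 * c.L * c.k * c.d * seq * batch * kvBytes;
}

function perGpuMemoryGiB(
  c: ModelConfig, seq: number, batch: number,
  weightBytes: number, kvBytes: number, tp: number,
): number {
  const weights = paramCount(c) * weightBytes;
  const kv = kvCacheBytes(c, seq, batch, kvBytes);
  // Activations add a smaller, single-layer-peak term that is omitted here.
  return (weights + kv) / tp / GiB;
}

// Example: a Llama-3.1-8B-like config, BF16 weights, FP8 KV cache, 8k context.
const llama8b: ModelConfig = {
  h: 4096, hFfn: 14336, L: 32, a: 32, k: 8, d: 128, v: 128256, tieEmbeddings: false,
};
console.log(perGpuMemoryGiB(llama8b, 8192, 1, 2, 1, 1).toFixed(1), "GiB");

// Decode-time roofline: memory bound when FLOPs/byte < the GPU's ops:byte ratio.
const arithmeticIntensity = (flops: number, bytes: number) => flops / bytes;
const isMemoryBound = (flops: number, bytes: number, peakTflops: number, bwTBs: number) =>
  arithmeticIntensity(flops, bytes) < (peakTflops * 1e12) / (bwTBs * 1e12);
```

For the 8B-class example above this lands around 15–16 GiB at TP=1, dominated by the BF16 weights; the KV cache term grows linearly with batch size and sequence length, which is why long contexts or large batches push serving toward higher-memory GPUs or KV cache quantization.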
Estimate per-GPU memory consumption during LLM training with 5D parallelism (TP, PP, DP, CP, EP). Supports MoE, the distributed optimizer, and NCCL buffer estimation. Model configs can be fetched from HuggingFace, loaded from the same presets as the inference estimator, or entered manually with the same architecture and MoE fields.

Inputs:
- Precision and optimizer: parameter dtype (BF16 at 2 bytes, FP32 at 4 bytes, FP8 at 1 byte, NVFP4/MXFP4 at 0.5 bytes), gradient dtype (BF16 or FP32), and optimizer (Adam: FP32 master weights + m + v = 12 bytes/param; SGD: FP32 master weights + m = 8 bytes/param).
- Training config: sequence length (s) and micro batch size (b).
- Parallelism: TP, PP, DP, CP, and EP sizes; distributed optimizer on or off (shards optimizer states across DP); PP scheduler (1F1B or interleaved 1F1B) with the number of virtual PP stages; and an optional standalone embedding stage (embedding/LM head on separate PP stages instead of the first/last stage).
- GPU type: A100 (40 GB or 80 GB), H100 (80 GB), H100 NVL (94 GB), H200 (141 GB), B200 (180 GB).

Outputs: a per-GPU memory breakdown with the formula for each component (parameters, gradients, optimizer states, activations, estimated NCCL buffers, and the total), the model parameter count, and a GPU fit check, plus a per-layer training roofline analysis using the same AI = FLOPs / bytes rule as above.
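As a rough sketch of how the static (non-activation) terms combine, the snippet below sums parameters, gradients, and optimizer states per GPU under a simplified sharding model. The field names, the even TP×PP split, and the ZeRO-1-style distributed-optimizer rule are assumptions for illustration, not the estimator's exact formulas; activations and NCCL buffers are left out.

```typescript
// Rough per-GPU static training memory: parameters + gradients + optimizer states.
// Sharding rules are deliberately simplified; this is not the estimator's code.

interface TrainSetup {
  totalParams: number;            // full model parameter count
  paramBytes: number;             // e.g. 2 for BF16 params
  gradBytes: number;              // e.g. 4 for FP32 grads
  optimizerBytesPerParam: number; // 12 for Adam (FP32 master + m + v), 8 for SGD
  tp: number;                     // tensor parallel size
  pp: number;                     // pipeline parallel size
  dp: number;                     // data parallel size
  distributedOptimizer: boolean;  // shard optimizer states across the DP group
}

const GiB = 1024 ** 3;

function staticMemoryGiB(s: TrainSetup): number {
  // Assume parameters split evenly across TP and PP ranks; DP replicates them.
  const paramsPerGpu = s.totalParams / (s.tp * s.pp);
  const params = paramsPerGpu * s.paramBytes;
  const grads = paramsPerGpu * s.gradBytes;
  let optim = paramsPerGpu * s.optimizerBytesPerParam;
  if (s.distributedOptimizer) optim /= s.dp; // ZeRO-1-style sharding across DP
  return (params + grads + optim) / GiB;
}

// Example: an 8B-parameter model, BF16 params, FP32 grads, Adam,
// TP=1, PP=1, DP=8 with the distributed optimizer enabled (~56 GiB static).
console.log(
  staticMemoryGiB({
    totalParams: 8e9, paramBytes: 2, gradBytes: 4, optimizerBytesPerParam: 12,
    tp: 1, pp: 1, dp: 8, distributedOptimizer: true,
  }).toFixed(1),
  "GiB",
);
```

With the distributed optimizer enabled, the 12 bytes/param Adam term is divided across the DP group, which is typically the largest static-memory saving at high data-parallel degrees.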