Interactive Estimators
Practical client-side tools for planning LLM training and inference. All calculations run entirely in your browser. Fetch model configs directly from HuggingFace.
Visualize how GPU processes are distributed across nodes in distributed LLM training. Configure the cluster (1 to 64 nodes, 1 to 8 GPUs per node) and the parallelism sizes (TP, EP, CP, DP, PP), and see an interactive visualization of the process-to-GPU mapping with color-coded communicator groups. Preset configurations range from a single 8-GPU node (TP=8) up to 32 nodes of 8 GPUs (TP=8, EP=8, DP=4, PP=2 for MoE).

Communicator Basics: In distributed training, GPUs are organized into process groups (communicators). Each group defines a set of GPUs that need to communicate for a specific type of parallelism. A single GPU belongs to multiple communicator groups simultaneously, one for each parallelism dimension.
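The mapping the visualizer draws can be sketched in a few lines. This is a minimal sketch, assuming a Megatron-LM-style rank ordering in which TP varies fastest and PP slowest (tp, then cp, dp, pp); real frameworks let you configure both the ordering and which dimensions are present, so treat the order here as an illustrative assumption.

```python
# Map a global rank to its coordinate along each parallelism dimension,
# assuming the order tp -> cp -> dp -> pp (TP fastest-varying).
def rank_coords(rank, tp=1, cp=1, dp=1, pp=1):
    coords = {}
    for name, size in [("tp", tp), ("cp", cp), ("dp", dp), ("pp", pp)]:
        coords[name] = rank % size
        rank //= size
    return coords

# Two GPUs share the communicator for dimension `dim` exactly when all
# their *other* coordinates match -- this is what makes one GPU a member
# of several groups at once, one per dimension.
def same_group(r1, r2, dim, **sizes):
    c1, c2 = rank_coords(r1, **sizes), rank_coords(r2, **sizes)
    return all(c1[k] == c2[k] for k in c1 if k != dim)

# Example: the "2N×8G: TP=8, PP=2" preset. Rank 10 sits at TP
# coordinate 2 inside the second pipeline stage.
print(rank_coords(10, tp=8, pp=2))  # {'tp': 2, 'cp': 0, 'dp': 0, 'pp': 1}
```

With this layout, ranks 0-7 (node 0) form one TP group and ranks 8-15 (node 1) another, which is why TP is usually kept within a node where NVLink bandwidth is highest.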
Estimate total GPU memory for LLM inference: model weights, KV cache, and activation memory. Supports weight, KV-cache, and activation quantization, with model configs fetched directly from HuggingFace or chosen from presets (Llama-3.1-8B/70B/405B, Qwen3-8B, Qwen3-235B-A22B, DeepSeek-V3, or a custom architecture). Architecture inputs cover hidden size (h), dense intermediate size (h_ffn), number of layers (L), attention heads (a), KV heads (k), head dim (d), vocab size (v), tied word embeddings, and MoE settings (routed and shared experts, expert FFN size moe_intermediate_size, experts per token top-k; set num experts = 1 for dense). Weights can be stored in BF16/FP16 (2 bytes), FP32 (4 bytes), FP8/INT8 (1 byte), or NVFP4/MXFP4/INT4 (0.5 bytes), with analogous precision options for the KV cache and activations. The inference config sets max sequence length, batch size, and TP size; GPU types range from A100 (40 GB) to B200 (180 GB). The tool reports a per-GPU memory breakdown (model weights, KV cache, single-layer peak activation memory, and total) with the formula for each component, total model parameters, the KV cache formula, a GPU fit check, and a per-layer decode roofline analysis: arithmetic intensity (AI) = FLOPs / bytes, and if AI is below the GPU's ops:byte ratio the operation is memory bound, otherwise compute bound.
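The two largest terms in the breakdown can be sketched directly. This is a minimal sketch under common assumptions, not the page's exact formulas: a standard GQA KV cache storing K and V per token per layer, weights and KV heads split evenly across TP ranks, and activation memory and allocator fragmentation ignored.

```python
# Per-GPU KV cache size: K and V each hold (kv_heads / tp) * head_dim
# elements per token per layer, hence the leading factor of 2.
def kv_cache_bytes(num_layers, kv_heads, head_dim, seq_len, batch,
                   bytes_per_elem=2, tp=1):
    return (2 * num_layers * (kv_heads // tp) * head_dim
            * seq_len * batch * bytes_per_elem)

# Per-GPU weight memory, assuming an even split across TP ranks.
def weight_bytes(num_params, bytes_per_param=2, tp=1):
    return num_params * bytes_per_param / tp

# Example: a Llama-3.1-8B-like config (L=32, k=8, d=128) at 8192-token
# context, batch 1, BF16 KV cache, no TP.
gib = 1024 ** 3
print(kv_cache_bytes(32, 8, 128, 8192, 1) / gib)  # 1.0 (exactly 1 GiB)
```

The example makes the roofline intuition concrete: at decode time each step must re-read this whole gigabyte of KV cache to produce one token, which is why decode attention is almost always memory bound.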
Estimate per-GPU memory consumption during LLM training with 5D parallelism (TP, PP, DP, CP, EP). Supports MoE, distributed optimizer, and NCCL buffer estimation, with model configs fetched from HuggingFace or chosen from the same presets and architecture fields as the inference estimator. Precision and optimizer inputs cover param dtype (BF16, FP32, FP8, or NVFP4/MXFP4), grad dtype (BF16 or FP32), and the optimizer: Adam (FP32 master weights + momentum + variance = 12 B/param) or SGD (FP32 master weights + momentum = 8 B/param). The training config sets sequence length (s) and micro batch size (b); the parallelism config sets the five sizes plus the distributed optimizer toggle (shard optimizer state across DP), the PP scheduler (1F1B or interleaved 1F1B with virtual PP stages), and an optional standalone embedding stage (embed/LM head as separate PP stages instead of living in the first/last stage). GPU types range from A100 (40 GB) to B200 (180 GB). The tool reports a per-GPU breakdown of parameters, gradients, optimizer states, activations, estimated NCCL buffers, and total, each with its formula, plus total model parameters, a GPU fit check, and a per-layer training roofline analysis using the same arithmetic-intensity rule as the inference estimator.
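The static part of the training budget (everything except activations and NCCL buffers) reduces to a bytes-per-parameter count. A minimal sketch, assuming BF16 params and grads with the Adam accounting above (12 B/param of optimizer state), and assuming the distributed optimizer shards only the optimizer state evenly across DP ranks; real implementations such as Megatron-LM differ in details, e.g. where the FP32 master weights are counted.

```python
# Static training memory per parameter: param copy + gradient copy +
# optimizer state, with the optimizer state optionally sharded over DP.
def static_bytes_per_param(param_bytes=2, grad_bytes=2,
                           optim_bytes=12, dp=1, dist_optim=False):
    optim = optim_bytes / dp if dist_optim else optim_bytes
    return param_bytes + grad_bytes + optim

# BF16 + Adam, no sharding: 2 + 2 + 12 = 16 B/param, i.e. ~128 GB of
# static memory for an 8B-parameter model before any parallelism.
print(static_bytes_per_param())                       # 16
# Distributed optimizer over DP=4: 2 + 2 + 12/4 = 7 B/param.
print(static_bytes_per_param(dp=4, dist_optim=True))  # 7.0
```

This is why the distributed optimizer matters so much at scale: the 12 B/param of Adam state dwarfs the 4 B/param of BF16 params and grads, and it is the only static term that shrinks with DP.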