GPU Estimate

Overview

Most modern AI models scale efficiently across multiple GPUs. Estimating GPU memory (VRAM) and compute time up front is critical for NAIRR proposals: it avoids out-of-memory errors and wasted allocations. This guide separates two distinct cases: inference, where you only run a trained model forward, and training or fine-tuning, where you also update the model weights. The memory footprints of these two cases differ by roughly an order of magnitude.

Estimating GPU Memory for Inference

In an autoregressive transformer, each new token attends to all previous tokens. To avoid recomputing the keys and values of previous tokens at every step, the model stores them for every layer and attention head. This stored data is called the KV cache. Caching reduces the attention cost of generating each new token from \(O(T^2)\) to \(O(T)\), where \(T\) is the current context length, at the cost of VRAM that grows linearly with \(T\):

\[\textrm{KV-cache} = 2 \times L \times H \times d \times T \times b,\]

where the factor 2 comes from storing both keys and values, \(L\) is the number of transformer layers, \(H\) is the number of attention heads, \(d\) is the dimension of each head, \(T\) is the context length in tokens, and \(b\) is the number of bytes per element (\(b = 2\) for FP16).

With typical values of these parameters, \(L=32\), \(H=32\), \(d=128\), and \(b=2\), the KV cache occupies about 0.5 MB per token, or just over 0.5 GB for a context length of one thousand tokens. For the purpose of resource estimation, we can treat 1 GB per thousand tokens, equivalently 0.001 GB per token, as a conservative upper bound for models of this size.
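The arithmetic behind the 0.5 GB figure can be checked with a few lines of Python. This is a minimal sketch of the KV-cache formula above; the function name `kv_cache_bytes` is made up for illustration, and the formula assumes every attention head stores its own keys and values.

```python
def kv_cache_bytes(layers, heads, head_dim, tokens, bytes_per_elem=2):
    """KV-cache size in bytes: 2 (keys and values) x L x H x d x T x b."""
    return 2 * layers * heads * head_dim * tokens * bytes_per_elem


# Values quoted in the text: L=32, H=32, d=128, FP16 (b=2), T=1000 tokens
size = kv_cache_bytes(layers=32, heads=32, head_dim=128, tokens=1000)
print(f"{size / 1e9:.3f} GB")  # 0.524 GB
```

Doubling the context length doubles the result, which is why long contexts dominate inference memory.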

The runtime overhead, comprising the CUDA context, cuBLAS and cuDNN workspaces, and kernel launch buffers, can vary from 300 MB to 1 GB per process.

A quick rule of thumb for estimating GPU VRAM in GB:

\[\begin{aligned} \mathrm{VRAM}_{\text{inference}} \;(\mathrm{GB}) \;\approx\; & \underbrace{2 \times \mathrm{params}_{(\mathrm{B})}}_{\text{weight VRAM}} \;+\; \underbrace{1 \times \mathrm{context}_{(\mathrm{k\ tokens})}}_{\text{KV-cache VRAM}} \\ &+\;\underbrace{0.15 \times (\text{weight VRAM} + \text{KV-cache VRAM}) + 1}_{\text{runtime overhead}} \end{aligned}\]

where weight VRAM is the first term \(2 \times \mathrm{params}_\mathrm{B}\) and KV-cache VRAM is the second term \(1 \times \mathrm{context}_\mathrm{k}\). The overhead is modeled as 15 percent of the combined weight and cache footprint plus a 1 GB constant for the CUDA context and workspace allocations.

Example. For StableCode with 3B parameters and 16k context, VRAM is approximately 6 GB for weights plus 16 GB for KV cache plus 4.3 GB of overhead, totaling about 26 GB. This fits on an A100, H100, or 32 GB V100 for inference.
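Written out term by term, the StableCode example follows directly from the rule of thumb, with 3 for the parameter count in billions and 16 for the context length in thousands of tokens:

\[\begin{aligned} \text{weight VRAM} &= 2 \times 3 = 6\ \mathrm{GB} \\ \text{KV-cache VRAM} &= 1 \times 16 = 16\ \mathrm{GB} \\ \text{runtime overhead} &= 0.15 \times (6 + 16) + 1 = 4.3\ \mathrm{GB} \\ \mathrm{VRAM}_{\text{inference}} &\approx 6 + 16 + 4.3 = 26.3\ \mathrm{GB} \end{aligned}\]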

Note

For inference, context length is often the major VRAM bottleneck. Consider dropping old tokens or using a sliding window.
Actual usage depends on the framework, runtime, and settings such as CUDA graphs, KV cache eviction policy, and preallocation.

Inference VRAM Requirement Estimator

VRAM ≈ params(B) × bytes/param + KV-cache + overhead
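The estimator formula can be sketched as a small Python function. This is an illustrative sketch, not the page's own calculator: the function name, the `BYTES_PER_PARAM` table, and the choice to leave the KV-cache term at 1 GB per thousand tokens regardless of weight quantization are all assumptions made here.

```python
# Bytes per parameter for common weight formats
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def estimate_inference_vram_gb(params_b, context_tokens, quant="fp16"):
    """Rule-of-thumb inference VRAM in GB (weights + KV cache + overhead)."""
    weights = params_b * BYTES_PER_PARAM[quant]   # GB, params_b in billions
    kv_cache = context_tokens / 1000 * 1.0        # assumed 1 GB per 1k tokens
    overhead = 0.15 * (weights + kv_cache) + 1.0  # 15 percent plus 1 GB constant
    return weights + kv_cache + overhead


print(f"{estimate_inference_vram_gb(3, 16_000):.1f} GB")          # 26.3 GB
print(f"{estimate_inference_vram_gb(3, 16_000, 'int4'):.1f} GB")  # 21.1 GB
```

With a 16k context, quantizing the weights of a 3B model saves only about 5 GB, because the KV-cache term dominates the total.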

A Short Script to Measure Memory Consumption and Performance

Install PyTorch and Transformers, then run the script below on a machine with GPU access. While it runs, you can watch GPU usage in a second terminal with watch -n 0.5 nvidia-smi.

import torch, time
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL = "gpt2-large"
DTYPE = torch.float16
DEVICE = "cuda"
max_new_tokens = 256

prompt = "Explain KV cache in one paragraph."
# download or load the correct tokenizer for the model
tokenizer = AutoTokenizer.from_pretrained(MODEL, use_fast=True)
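# load model weights in half precision and move them to the GPU
# (older Transformers releases name this argument torch_dtype instead of dtype)
model = AutoModelForCausalLM.from_pretrained(MODEL, dtype=DTYPE).to(DEVICE)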

# tokenize input as tensors and move to GPU
inputs = tokenizer(prompt, return_tensors="pt").to(DEVICE)
# warm-up run to initialize CUDA kernels and allocate buffers
with torch.inference_mode():
    _ = model.generate(**inputs, max_new_tokens=32, do_sample=False)

# wait for GPU to finish the process
torch.cuda.synchronize()
# reset previous peak memory stats
torch.cuda.reset_peak_memory_stats()

# start timing
t0 = time.perf_counter()
with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=max_new_tokens,
                        do_sample=False)

torch.cuda.synchronize()
t1 = time.perf_counter()

new_tokens = out.shape[-1] - inputs["input_ids"].shape[-1]

tok_per_s = new_tokens / (t1 - t0)
peak_alloc = torch.cuda.max_memory_allocated() / 1024**3
peak_reserved = torch.cuda.max_memory_reserved() / 1024**3

print(f"Generated tokens: {new_tokens}")
print(f"Time: {t1 - t0:.3f} s")
print(f"Throughput: {tok_per_s:.2f} tokens/s")
print(f"Peak allocated: {peak_alloc:.2f} GB")
print(f"Peak reserved:  {peak_reserved:.2f} GB")

Here, Peak allocated is the memory actually occupied by tensors, which is a good approximation of real usage. Peak reserved is the memory held by the PyTorch caching allocator, which may be higher because the allocator keeps freed blocks for reuse. For small models, GPU memory is often dominated by fixed overhead such as the CUDA context and libraries rather than the model itself, which is why nvidia-smi may report higher VRAM usage than Peak allocated alone.

On a machine with an NVIDIA GeForce GTX 1050 Ti with Max-Q Design, the code produced:

Generated tokens: 256
Time: 17.774 s
Throughput: 14.40 tokens/s
Peak allocated: 1.53 GB
Peak reserved:  1.67 GB

while nvidia-smi showed a maximum memory usage of 1.8 GB.

Estimating GPU Memory for Training

A practical planning heuristic for transformer models trained with Adam and mixed precision:

\[\mathrm{VRAM}_{\text{training}} \;(\mathrm{GB}) \;\approx\; 40 \times \mathrm{params}_{(\mathrm{billions})}\]

Breakdown of the 40 times factor. For a model trained in mixed precision with the Adam optimizer, the major VRAM consumers per billion parameters are:

FP16 weights plus FP32 master copy: roughly 6 GB per billion parameters, accounting for 2 plus 4 bytes per parameter
Gradients in FP16: roughly 2 GB per billion parameters
Adam optimizer states with FP32 momentum and variance: roughly 8 GB per billion parameters
Activations with typical gradient checkpointing: roughly 20 to 24 GB per billion parameters, depending on batch size and sequence length

Together these total roughly 36 to 40 GB per billion parameters. The 40 times figure is a conservative upper bound that absorbs minor sources like temporary buffers and fragmentation.
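The sum can be verified with a short sketch that adds the per-component figures from the breakdown above. The dictionary keys and the function name are illustrative; the numbers are the ones listed in the text, in GB per billion parameters.

```python
# GB per billion parameters as (low, high) ranges, taken from the breakdown above
COMPONENTS = {
    "weights (FP16 + FP32 master)": (6, 6),
    "gradients (FP16)": (2, 2),
    "Adam states (FP32)": (8, 8),
    "activations (checkpointed)": (20, 24),
}


def training_vram_range_gb(params_b):
    """Return (low, high) training VRAM in GB for params_b billion parameters."""
    low = sum(lo for lo, _ in COMPONENTS.values()) * params_b
    high = sum(hi for _, hi in COMPONENTS.values()) * params_b
    return low, high


print(training_vram_range_gb(1))  # (36, 40)
print(training_vram_range_gb(7))  # (252, 280)
```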

Example. A 7B parameter model requires approximately 7 × 40 = 280 GB of VRAM for training, which would need at least four A100 80 GB GPUs with model parallelism, or eight A100 40 GB GPUs.
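A quick way to turn the heuristic into a GPU count is to divide the total by the per-GPU memory and round up. The sketch below assumes memory divides evenly across GPUs, which real parallelism schemes only approximate; the function name `gpus_needed` is hypothetical.

```python
import math


def gpus_needed(params_b, gpu_vram_gb, gb_per_b_params=40):
    """Minimum GPU count if the training footprint split evenly across GPUs."""
    total_gb = params_b * gb_per_b_params
    return math.ceil(total_gb / gpu_vram_gb)


print(gpus_needed(7, 80))  # 4
print(gpus_needed(7, 40))  # 7
```

For 40 GB cards the bare ceiling is seven GPUs with no memory to spare, so requesting eight, as in the example, leaves some headroom.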

Note

Activation checkpointing trades compute for memory and can reduce the activations component significantly at the cost of about 30 percent more compute time.
Techniques like ZeRO in DeepSpeed, FSDP in PyTorch, or tensor parallelism can shard weights, gradients, and optimizer states across GPUs, reducing per-GPU VRAM.
The 40 times heuristic assumes a reasonable batch size. Very large batch sizes will increase activation memory beyond this estimate.

Minimal Monitoring

Sample VRAM usage from the shell every 2 seconds; the peak is the largest memory.used value observed:

nvidia-smi --query-gpu=memory.total,memory.used,gpu_name --format=csv -l 2

In-code snapshot with PyTorch:

import torch
# ... after warmup or inside training loop
torch.cuda.reset_peak_memory_stats()
# run a representative step or small loop...
peak = torch.cuda.max_memory_allocated() / (1024**3)
print(f"Peak allocated VRAM: {peak:.2f} GB")

Profiling Tools

NVIDIA-SMI Usage

nvidia-smi is available on GPU-enabled nodes and reports per-GPU and per-process memory and utilization. It is the fastest way to sanity-check VRAM usage and GPU load.

Basic usage

nvidia-smi

Typical output

Wed Oct 15 20:58:25 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.12              Driver Version: 550.90.12      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------|
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla V100-PCIE-16GB           On  |   00000000:3B:00.0 Off |                  Off |
| N/A   27C    P0             37W /  250W |   13830MiB /  16384MiB |      76%     Default |
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|=========================================================================================|
|    0   N/A  N/A     27515      C   .../envs/vllm/bin/python                    13818MiB |
+-----------------------------------------------------------------------------------------+

What to watch:

Memory-Usage shows used versus total VRAM; there is OOM risk if approaching 100 percent.
GPU-Util shows the percent of time kernels keep the GPU busy.
Processes table shows which PID and program is consuming VRAM.

Watch continuously, refreshing every 0.5 seconds:

watch -n 0.5 nvidia-smi

Log to CSV over time for later plotting:

timeout 60s nvidia-smi --query-gpu=timestamp,power.draw,memory.used,temperature.gpu \
 --format=csv,nounits -l 1 > gpu_usage.csv
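Once the log exists, the peak can be pulled out with the standard library. The sketch below assumes the header names the memory column `memory.used`, possibly followed by a unit suffix, and the sample rows are made up for illustration rather than real measurements.

```python
import csv
import io


def peak_memory_mib(csv_text):
    """Return the largest memory.used sample (MiB) in nvidia-smi CSV output."""
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    header = next(reader)
    # The header may carry a unit suffix such as "memory.used [MiB]"
    col = next(i for i, name in enumerate(header)
               if name.strip().startswith("memory.used"))
    return max(float(row[col]) for row in reader if row)


# Made-up sample rows for illustration only, not real measurements
sample = """timestamp, power.draw [W], memory.used [MiB], temperature.gpu
2025/10/15 20:58:25.000, 37.12, 13830, 27
2025/10/15 20:58:26.000, 41.50, 13902, 28
"""
print(peak_memory_mib(sample))  # 13902.0
```

To analyze a real log, pass the file contents instead, for example peak_memory_mib(open("gpu_usage.csv").read()).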

Per-process view, showing memory by PID (written to a separate file so the previous log is not overwritten):

timeout 60s nvidia-smi --query-compute-apps=pid,process_name,used_memory \
 --format=csv,nounits -l 1 > gpu_procs.csv

Find the right node on SLURM clusters:

# Which node is my job on?
squeue -u $USER
# SSH to that node to run nvidia-smi there if your site allows:
ssh <node-name>

Tips

Sample after warm-up to capture steady-state VRAM. JIT compilation and CUDA graphs can cause initial spikes.
Combine with /usr/bin/time -v to capture CPU and RAM alongside GPU stats.
If VRAM is near capacity, try a smaller batch or sequence length, activation checkpointing, or quantization.

References

[OSC-GPU]

Ohio Supercomputer Center, HOWTO: Estimating and Profiling GPU Memory Usage for Generative AI. Available at: https://www.osc.edu/resources/getting_started/howto/howto_estimating_and_profiling_gpu_memory_usage_for_generative_ai (accessed October 20, 2025).