The Memory and Computation Cost of GPT
Memory
- Activation values: up to roughly 50 times the parameter memory at large batch sizes
- Optimizer state (AdamW): 2 times the parameter memory
- Gradients: the same as the parameter memory
- KV cache: roughly 0.5 times the parameter memory
Number of Parameters
- Number of parameters: approximately $12h^2l$, where $h$ is the hidden size and $l$ is the number of layers
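A quick sanity check of the $12h^2l$ estimate; the GPT-3-scale hyperparameters ($h = 12288$, $l = 96$) below are assumed for illustration and should land near the published 175B count.

```python
# Parameter-count estimate: params ≈ 12 * h^2 * l
# (ignores embeddings, biases, and layer norms)
def approx_params(h: int, l: int) -> int:
    return 12 * h * h * l

# GPT-3-scale hyperparameters, assumed for illustration
print(f"{approx_params(h=12288, l=96) / 1e9:.0f}B parameters")  # ~174B, close to the published 175B
```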
Memory Usage
- Parameter, gradient, and optimizer-state memory is independent of the input data.
Memory usage during training (excluding activation values)
- Parameters: $\Phi$ in float16 plus $\Phi$ in float32 (master copy): $2\Phi + 4\Phi = 6\Phi$ bytes
- Gradients: $\Phi$ in float16 plus $\Phi$ in float32: $2\Phi + 4\Phi = 6\Phi$ bytes
- AdamW optimizer states (momentum and variance): $2\Phi$ in float32: $8\Phi$ bytes
- Total: $20\Phi$ bytes (worked example below)
Memory usage during inference (excluding activation values)
- Parameters: $\Phi$ in float16: $2\Phi$ bytes
- Total: $2\Phi$ bytes
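A minimal sketch of the two totals above, assuming mixed-precision (float16/float32) training with AdamW; the GPT-3-scale parameter count is an assumed example.

```python
# Memory excluding activations, for a model with phi parameters
FP16, FP32 = 2, 4  # bytes per element

def training_bytes(phi: int) -> int:
    params = phi * (FP16 + FP32)   # fp16 weights + fp32 master copy: 6 * phi
    grads = phi * (FP16 + FP32)    # fp16 gradients + fp32 gradients:  6 * phi
    adamw = 2 * phi * FP32         # momentum + variance in fp32:      8 * phi
    return params + grads + adamw  # 20 * phi bytes in total

def inference_bytes(phi: int) -> int:
    return phi * FP16              # fp16 weights only: 2 * phi bytes

phi = 175 * 10**9  # GPT-3-scale parameter count, assumed for illustration
print(f"training:  {training_bytes(phi) / 2**30:,.0f} GiB")   # ~3,260 GiB (20 * phi bytes)
print(f"inference: {inference_bytes(phi) / 2**30:,.0f} GiB")  # ~326 GiB   (2 * phi bytes)
```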
Activation Values
Approximately $(34bsh + 5bs^2a)l$ bytes, where $b$ is the batch size, $s$ the sequence length, $h$ the hidden size, $a$ the number of attention heads, and $l$ the number of layers (float16 activations, no recomputation)
- When $b=1$ (at GPT-3 scale with $s=2048$), approximately 0.8 times the parameter memory; the ratio grows linearly with the batch size
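A sketch of the activation estimate, assuming float16 activations without recomputation; the GPT-3-scale shapes below are assumed for illustration.

```python
# Activation memory in bytes: (34*b*s*h + 5*b*s^2*a) * l, fp16 activations, no recomputation
def activation_bytes(b: int, s: int, h: int, a: int, l: int) -> int:
    return (34 * b * s * h + 5 * b * s * s * a) * l

# GPT-3-scale shapes, assumed for illustration
h, a, l, s, params = 12288, 96, 96, 2048, 175 * 10**9
act = activation_bytes(b=1, s=s, h=h, a=a, l=l)
print(f"{act / 1e9:.0f} GB, {act / (2 * params):.2f}x the fp16 weights")  # ~275 GB, ~0.79x at b=1
# The ratio scales linearly with b: b=64 gives roughly 50x, matching the summary at the top.
```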
KV Cache
Approximately $2b(s+n)hl$ elements, i.e. $4b(s+n)hl$ bytes in float16, where $s$ is the input length and $n$ the number of generated tokens
- Approximately 0.5 times the parameter memory for a moderately large batch
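A sketch of the KV cache size, assuming float16 keys and values; the batch size, input length, and output length below are assumptions for illustration, not figures from the text.

```python
# KV cache: 2 * b * (s + n) * h * l elements; 2 bytes each in fp16
def kv_cache_bytes(b: int, s: int, n: int, h: int, l: int, bytes_per_elem: int = 2) -> int:
    return 2 * b * (s + n) * h * l * bytes_per_elem

# GPT-3-scale model; b=64 sequences, 512 input tokens, 32 generated tokens (assumed for illustration)
kv = kv_cache_bytes(b=64, s=512, n=32, h=12288, l=96)
print(f"{kv / 1e9:.0f} GB, {kv / (2 * 175e9):.2f}x the fp16 weights")  # ~164 GB, ~0.47x
```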
Computation
Computational Complexity
Forward pass: approximately $24bsh^2l$ FLOPs
Relationship between Computational Complexity and Number of Parameters
Forward-pass FLOPs $\approx 2 \times$ (number of tokens) $\times$ (number of parameters):
$$ \frac{24bsh^2l}{12h^2l \cdot bs} = 2 $$
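A quick numeric check of the ratio above, with shapes assumed for illustration.

```python
# Forward FLOPs ≈ 24*b*s*h^2*l, parameters ≈ 12*h^2*l, tokens in the batch = b*s
b, s, h, l = 8, 2048, 12288, 96  # shapes assumed for illustration
forward_flops = 24 * b * s * h * h * l
params, tokens = 12 * h * h * l, b * s
print(forward_flops / (params * tokens))  # 2.0 FLOPs per token per parameter
```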
Computation Time
- Backward propagation is approximately twice the computation of forward propagation.
- Activation value recomputation is approximately the same as forward propagation.
- Total: approximately $8 \times$ (number of tokens) $\times$ (number of parameters) FLOPs, i.e. forward (2) + backward (4) + recomputation (2).
$$ \text{training time} \approx \frac{8 \times \text{number of tokens} \times \text{number of parameters}}{\text{number of GPUs} \times \text{peak FLOPS per GPU} \times \text{GPU utilization}} $$
- GPU utilization is approximately between 0.3 and 0.55.
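A sketch of the training-time estimate; the token count, GPU count, A100 peak of 312 TFLOPS (float16), and 0.45 utilization are assumptions for illustration, not figures from the text.

```python
# Training time ≈ 8 * tokens * params / (num_gpus * peak_flops_per_gpu * utilization)
def training_days(tokens: float, params: float, num_gpus: int,
                  peak_flops: float, utilization: float) -> float:
    seconds = 8 * tokens * params / (num_gpus * peak_flops * utilization)
    return seconds / 86400  # seconds per day

# Assumed example: GPT-3-scale run of 300B tokens on 1024 A100s (312 TFLOPS fp16 peak), 45% utilization
days = training_days(tokens=300e9, params=175e9, num_gpus=1024,
                     peak_flops=312e12, utilization=0.45)
print(f"{days:.0f} days")  # ~34 days
```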