Capacity & Cost Estimators: Tool 5 of 8
GPU Fleet Sizing Calculator
Size a self-hosted GPU fleet: enter model size, quantization, and target QPS to get a GPU count and a monthly cost estimate.
Model Weights              3.5 GB
Memory / Instance          4.5 GB        (weights + KV-cache)
GPUs / Instance            1             (A10G, 24 GB)
QPS / Instance             200
Instances Needed           5             (before redundancy)
Total Instances            10            (with 2x buffer)
Total GPUs                 10            (A10G, 24 GB)
Max Concurrent / Instance  78 requests
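The concurrency figure can be back-calculated from the VRAM numbers above. This is a minimal sketch, and the ~0.25 GB KV-cache footprint per concurrent request is an assumption inferred from the displayed values ((24 GB - 4.5 GB) / 0.25 GB = 78), not a number the tool states:

```python
GPU_VRAM_GB = 24.0        # A10G total VRAM
RESIDENT_GB = 4.5         # weights + baseline KV-cache (from the table above)
KV_PER_REQUEST_GB = 0.25  # assumed per-request KV-cache footprint (inferred)

def max_concurrent(vram_gb=GPU_VRAM_GB, resident_gb=RESIDENT_GB,
                   kv_per_request_gb=KV_PER_REQUEST_GB):
    # Remaining VRAM divided by the per-request KV footprint, rounded down.
    return int((vram_gb - resident_gb) / kv_per_request_gb)

print(max_concurrent())  # 78
```

The real ceiling depends on context length and the serving engine's batching policy, so treat this as an upper bound, not a guarantee.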
On-Demand          $10/hour    $7.3K/month
Reserved (1-year)  $6/hour     $4.4K/month
Savings            $2.9K/month
Estimates as of March 2026. QPS numbers are approximate and vary with request complexity, context length, and batching efficiency. KV-cache overhead estimated at 30% of model weights. Actual GPU utilization depends on continuous batching configuration (vLLM, TensorRT-LLM). GPU pricing reflects typical cloud on-demand and 1-year reserved rates.
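The arithmetic behind these cards can be sketched in a few lines. Assumptions not shown by the tool: a 1,000 QPS target (implied by 5 instances at 200 QPS each), per-instance hourly rates of $1.00 on-demand and $0.60 reserved (back-calculated from the fleet totals), and a 730-hour billing month:

```python
import math

HOURS_PER_MONTH = 730  # common cloud billing convention (assumed)

def size_fleet(weights_gb, qps_target, qps_per_instance,
               kv_cache_overhead=0.30, redundancy=2.0):
    # Memory per instance: weights plus KV-cache overhead (30% of weights).
    mem_per_instance = weights_gb * (1 + kv_cache_overhead)
    # Instances to meet the QPS target, then a 2x redundancy buffer.
    instances_needed = math.ceil(qps_target / qps_per_instance)
    total_instances = math.ceil(instances_needed * redundancy)
    return mem_per_instance, instances_needed, total_instances

def monthly_cost(total_instances, hourly_rate_per_instance):
    return total_instances * hourly_rate_per_instance * HOURS_PER_MONTH

mem, needed, total = size_fleet(weights_gb=3.5, qps_target=1000,
                                qps_per_instance=200)
on_demand = monthly_cost(total, hourly_rate_per_instance=1.00)   # assumed rate
reserved = monthly_cost(total, hourly_rate_per_instance=0.60)    # assumed rate
print(f"Memory/instance: {mem:.2f} GB, instances: {needed} -> {total} with buffer")
print(f"On-demand ${on_demand:,.0f}/mo, reserved ${reserved:,.0f}/mo, "
      f"savings ${on_demand - reserved:,.0f}/mo")
```

With these inputs the sketch reproduces the cards above: 4.55 GB per instance (shown rounded to 4.5), 5 instances before redundancy, 10 after, and roughly $7.3K vs. $4.4K per month.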