GPU & Inference Hardware
NVIDIA A100-80GB
A high-end data center GPU with 80 GB of high-bandwidth memory, designed for running large AI models. Think of it as the heavy-duty truck of the GPU world. Serves roughly 50 QPS for 34B models and roughly 15 QPS for 70B models (which run on a two-card pair). Around $2/hr on-demand. The workhorse of LLM serving as of 2026.
NVIDIA A10G (24 GB)
A smaller, more affordable GPU with 24 GB of memory. Good enough for running compact 7B models. Like a delivery van instead of a truck: less capacity, but cheaper and more efficient for lighter loads. Fits a 7B INT4 model (~12 GB total including KV-cache and overhead). Serves roughly 200 QPS for inline completions. Around $1/hr.
NVIDIA H100 (80 GB)
The next generation after A100, roughly 2-3x faster for LLM inference. Think of it as a faster truck that carries the same load but gets there sooner. Higher cost per hour but better throughput per dollar at scale. Preferred for new deployments when the budget allows.
Model Weights
The billions of numbers that a model learned during training, stored in GPU memory so the model can use them to generate responses. Like a brain's neural connections, but stored as numbers on a chip. Weight memory is roughly parameter count x bytes per parameter: a 7B model in INT4 (0.5 bytes/param) needs roughly 4 GB for weights, and a 70B model in FP16 (2 bytes/param) needs roughly 140 GB. The remaining GPU memory goes to KV-cache and batch overhead.
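In code, that estimate is a single multiplication. A minimal sketch (the helper name and the precision keys are illustrative, not from any particular library):

```python
# Weights-only memory: GB ≈ billions of parameters × bytes per parameter.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(params_billion: float, precision: str) -> float:
    """Approximate GPU memory (GB) needed just to hold the weights."""
    return params_billion * BYTES_PER_PARAM[precision]

print(weight_memory_gb(7, "int4"))    # 3.5  -> "roughly 4 GB"
print(weight_memory_gb(70, "fp16"))   # 140.0
```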
GPU Memory Math
How to figure out whether a model fits on a GPU. Total GPU memory needed = model weights + KV-cache per concurrent request + batch overhead. If a 70B FP16 model needs 172 GB total but each A100 has 80 GB, the model must be split across multiple cards (called tensor parallelism): ceil(172 / 80) = 3 A100s in this example. This is the most common sizing calculation in AI infrastructure.
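The same arithmetic as a hedged sketch, where the 24 GB KV-cache and 8 GB overhead figures are assumed placeholders chosen to reproduce the 172 GB example (real values depend on context length and concurrency):

```python
import math

def gpus_needed(weights_gb: float, kv_cache_gb: float, overhead_gb: float,
                gpu_memory_gb: float = 80.0) -> int:
    """How many GPUs of a given size it takes to hold the full memory footprint."""
    total_gb = weights_gb + kv_cache_gb + overhead_gb
    return math.ceil(total_gb / gpu_memory_gb)

# 70B FP16: 140 GB weights + (assumed) 24 GB KV-cache + 8 GB overhead = 172 GB
print(gpus_needed(weights_gb=140, kv_cache_gb=24, overhead_gb=8))  # 3 A100-80GB
```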
Tensor Parallelism
Splitting a single model across multiple GPUs so each GPU holds part of every layer. Like splitting a book's pages across multiple readers, where each reader handles their portion of every chapter simultaneously. Required when a model's weights don't fit on one GPU. A 70B FP16 model (140 GB of weights) needs at least 2x A100-80GB with tensor parallelism.
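A toy, CPU-only illustration of the idea: one layer's weight matrix is split column-wise across two simulated devices, each computes its slice, and the partial outputs are concatenated. This shows only the math; production frameworks handle the actual multi-GPU communication.

```python
import numpy as np

x = np.random.randn(1, 4096)             # activations for one token
w = np.random.randn(4096, 8192)          # one layer's weight matrix

w_gpu0, w_gpu1 = np.split(w, 2, axis=1)  # each "GPU" holds half the columns
y_gpu0 = x @ w_gpu0                      # partial result on device 0
y_gpu1 = x @ w_gpu1                      # partial result on device 1

y = np.concatenate([y_gpu0, y_gpu1], axis=1)
assert np.allclose(y, x @ w)             # gathered result matches the full layer
```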
Pipeline Parallelism
Splitting a model by layers, not within layers. GPU 1 handles layers 1-40, GPU 2 handles layers 41-80. Like an assembly line where each station does a different step. Simpler than tensor parallelism but adds latency because each GPU waits for the previous one. Often combined with tensor parallelism in very large deployments.
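A toy sketch of the assembly-line flow, with a fake layer function standing in for real transformer layers and the stage boundaries taken from the example above:

```python
def fake_layer(x: float) -> float:
    return x * 1.0001                      # stand-in for a transformer layer

def run_stage(x: float, layer_ids: range) -> float:
    for _ in layer_ids:
        x = fake_layer(x)
    return x

hidden = 1.0
hidden = run_stage(hidden, range(1, 41))   # "GPU 1": layers 1-40
hidden = run_stage(hidden, range(41, 81))  # "GPU 2": layers 41-80, runs after GPU 1
```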
QPS per GPU
How many requests per second a single GPU can handle. Depends on model size: smaller models are faster. 7B INT4: roughly 200 QPS on an A10G. 34B INT8: roughly 50 QPS on an A100. 70B FP16: roughly 15 QPS on a 2xA100 pair. These numbers assume continuous batching with vLLM.
Fleet Sizing Formula
How to calculate the total number of GPUs needed for a service. GPUs needed = peak_QPS / QPS_per_GPU x 2 (failover buffer). The 2x multiplier ensures the service stays up during rolling deploys and GPU failures. Example: 3,000 QPS on 7B = 3,000 / 200 x 2 = 30 A10G GPUs.
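As a sketch in Python, using the throughput figures from the QPS per GPU entry (the dictionary keys are just illustrative labels):

```python
import math

QPS_PER_GPU = {"7b-int4-a10g": 200, "34b-int8-a100": 50, "70b-fp16-2xa100": 15}

def fleet_size(peak_qps: float, qps_per_gpu: float, buffer: float = 2.0) -> int:
    """GPUs needed = peak QPS / QPS per GPU x failover buffer, rounded up."""
    return math.ceil(peak_qps / qps_per_gpu * buffer)

print(fleet_size(3000, QPS_PER_GPU["7b-int4-a10g"]))  # 30 A10G GPUs
```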
API vs Self-Hosted
Two ways to access LLM inference. API means paying a provider (Anthropic, OpenAI) per token, no infrastructure to manage, fast to start. Self-hosted means running models on owned or rented GPUs. API is simpler but 8-12x more expensive per token at scale. Breakeven is typically 5-50M requests/month. Most teams start with API and switch to self-hosted as volume grows.
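A hedged sketch of the breakeven comparison; per-token prices and tokens per request vary widely by provider and workload, so they are left as parameters rather than filled in with invented numbers:

```python
def api_monthly_cost(requests: float, tokens_per_request: float,
                     usd_per_million_tokens: float) -> float:
    """Pay-per-token cost for one month of API traffic."""
    return requests * tokens_per_request / 1e6 * usd_per_million_tokens

def self_hosted_monthly_cost(num_gpus: int, usd_per_gpu_hour: float) -> float:
    """Fixed cost of running a GPU fleet around the clock (~720 hrs/month)."""
    return num_gpus * usd_per_gpu_hour * 720

def breakeven_requests(num_gpus: int, usd_per_gpu_hour: float,
                       tokens_per_request: float,
                       usd_per_million_tokens: float) -> float:
    """Monthly request volume at which API and self-hosted costs are equal."""
    per_request = tokens_per_request / 1e6 * usd_per_million_tokens
    return self_hosted_monthly_cost(num_gpus, usd_per_gpu_hour) / per_request
```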
On-Demand vs Reserved Instances
Two pricing models for cloud GPUs. On-demand pricing means paying as you go: flexible but expensive (~$2/hr for an A100). Reserved instances mean committing for 1-3 years at a 40-60% discount (~$1.20/hr). (Spot instances are cheaper still, but can be reclaimed with little notice, which makes them risky for latency-sensitive serving.) For production inference workloads that run 24/7, reserved instances are almost always worth it. The savings compound quickly across a fleet of 30-60 GPUs.
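The savings are easy to make concrete with the rates quoted above, here for a 30-GPU A100 fleet:

```python
def monthly_fleet_cost(num_gpus: int, usd_per_hour: float) -> float:
    """Cost of running the fleet 24/7 for a ~720-hour month."""
    return num_gpus * usd_per_hour * 720

on_demand = monthly_fleet_cost(30, 2.00)   # $43,200/month
reserved = monthly_fleet_cost(30, 1.20)    # $25,920/month
print(on_demand - reserved)                # $17,280/month saved
```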
vLLM
The most popular open-source framework for serving LLMs on GPUs. It handles the complex parts: PagedAttention for efficient KV-cache memory management, continuous batching to maximize GPU utilization, and speculative decoding for speed. Think of it as nginx for LLMs. The standard choice for self-hosted serving.
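A minimal offline-inference sketch with vLLM's Python API (the model name is just an example checkpoint; production deployments more commonly run vLLM's OpenAI-compatible server instead):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf")        # loads weights onto the GPU
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Write a haiku about GPUs."], params)
print(outputs[0].outputs[0].text)
```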
On-Device Inference
Running a small model (7B INT4) directly on the developer's laptop instead of calling a cloud API. Tools like Ollama and llama.cpp make this possible. Zero network latency, works offline, and code never leaves the machine (good for privacy). Used as a fallback when cloud providers are slow or down.
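A minimal sketch of calling a local model through Ollama's HTTP API, assuming the Ollama daemon is running and a model has already been pulled (the model name here is just an example):

```python
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama2",                      # example local model
        "prompt": "Explain KV-cache in one sentence.",
        "stream": False,                        # return one JSON object
    },
)
print(resp.json()["response"])
```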