vLLM
The go-to LLM inference engine, built around PagedAttention for squeezing real throughput out of your GPUs
Why It Exists
LLM inference is memory-bound. That is the core problem. During autoregressive generation, every token forces the GPU to read the entire model's weights from memory. Meanwhile, the KV cache (which stores attention key/value pairs for all previous tokens) grows linearly with sequence length. The vanilla HuggingFace Transformers approach pre-allocates a contiguous memory block sized for the maximum possible sequence length. In practice, this wastes 60-80% of GPU memory on empty space. Fewer concurrent requests fit in memory, and the GPU just sits there waiting.
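To put numbers on that, here is a back-of-the-envelope sizing sketch. The model dimensions are assumptions roughly in the range of a modern 8B model, not figures pulled from any specific config:

```python
# Back-of-the-envelope KV cache sizing for one request.
# Assumed, Llama-8B-like dimensions -- adjust for your model.
num_layers = 32
num_kv_heads = 8          # grouped-query attention
head_dim = 128
bytes_per_value = 2       # FP16

# Per token: keys + values, across every layer.
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value

max_len = 32_768          # what a naive implementation pre-allocates for
actual_len = 500          # what the request actually uses

preallocated = max_len * kv_bytes_per_token
used = actual_len * kv_bytes_per_token
print(f"pre-allocated: {preallocated / 1e9:.2f} GB, used: {used / 1e9:.3f} GB")
print(f"wasted: {100 * (1 - used / preallocated):.1f}%")
```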
vLLM came out of UC Berkeley in 2023 and attacked this problem head-on with PagedAttention. The idea is simple: manage KV cache memory in non-contiguous pages, the same way an operating system handles virtual memory. No more fragmentation. No more waste. The result is 2-24x more concurrent requests on identical hardware. It became the default serving engine for open-weight LLMs in production pretty quickly, and for good reason.
How It Works
PagedAttention: Standard attention implementations allocate one contiguous buffer per request, sized to the model's max sequence length. A model with a 32K context window allocates 32K x num_layers x 2 (keys and values) x hidden_dim x 2 bytes (FP16) per request, even if the actual sequence is 500 tokens long. That is absurd. PagedAttention splits the KV cache into fixed-size blocks (16 tokens per block by default) and allocates them on demand as the sequence grows. Blocks do not need to sit next to each other in memory. A block table maps logical positions to physical locations, exactly like a page table in an OS kernel.
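A toy sketch of that bookkeeping in plain Python. It mirrors the idea, not vLLM's actual data structures:

```python
# Toy model of PagedAttention's block table, for intuition only.
BLOCK_SIZE = 16  # tokens per KV cache block

class BlockAllocator:
    def __init__(self, num_physical_blocks: int):
        self.free = list(range(num_physical_blocks))

    def allocate(self) -> int:
        return self.free.pop()            # any free block will do; no contiguity needed

class Sequence:
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []  # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        # Grow the table only when the current block fills up.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

allocator = BlockAllocator(num_physical_blocks=1024)
seq = Sequence(allocator)
for _ in range(40):                       # 40 tokens -> ceil(40/16) = 3 blocks
    seq.append_token()
print(seq.block_table)                    # three physical ids; scattered is fine
```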
Continuous Batching: Static batching groups N requests into a batch and waits for the slowest one to finish before accepting new work. Continuous batching (sometimes called iteration-level scheduling) is smarter. It injects new requests into the running batch at every generation step. The moment a request finishes (hits EOS or max tokens), its slot gets filled immediately. The GPU stays busy, and throughput goes up dramatically.
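The core loop, reduced to a runnable toy. The Request class and the fixed batch limit below are stand-ins for vLLM's scheduler and KV cache accounting:

```python
# Iteration-level (continuous) batching, stripped to the core idea.
import random
from collections import deque

class Request:
    def __init__(self, rid: int, max_new_tokens: int):
        self.rid = rid
        self.max_new_tokens = max_new_tokens
        self.generated = 0

    def step(self) -> bool:
        """Generate one token; return True when the request is finished."""
        self.generated += 1
        hit_eos = random.random() < 0.05
        return hit_eos or self.generated >= self.max_new_tokens

waiting = deque(Request(i, max_new_tokens=random.randint(5, 40)) for i in range(16))
running: list[Request] = []
MAX_BATCH = 4  # stand-in for "enough free KV cache blocks"

step = 0
while running or waiting:
    # Admit new requests at every iteration, not once per batch.
    while waiting and len(running) < MAX_BATCH:
        running.append(waiting.popleft())

    # One forward pass produces one token for every running request.
    finished = [req for req in running if req.step()]

    # Finished slots are reclaimed immediately and refilled next iteration.
    for req in finished:
        running.remove(req)
    step += 1

print(f"served 16 requests in {step} iterations with batch size {MAX_BATCH}")
```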
Model Execution: vLLM loads model weights into GPU memory and runs inference through optimized CUDA kernels. When a model is too large for one GPU, tensor parallelism splits each layer's weight matrices across multiple GPUs, with all-reduce operations syncing intermediate results. The API server speaks the OpenAI chat completions format, so if an app already uses the OpenAI client library, pointing it at vLLM just works.
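For example, assuming a vLLM server is already running locally on the default port 8000 with a Llama model loaded, switching an existing OpenAI-client app over is essentially a one-line change:

```python
# Point the standard OpenAI client at a local vLLM server.
# Assumes `vllm serve meta-llama/Llama-3.1-8B-Instruct` is running on port 8000.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # vLLM's OpenAI-compatible endpoint
    api_key="EMPTY",                      # ignored unless the server sets --api-key
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # must match the served model name
    messages=[{"role": "user", "content": "Explain PagedAttention in one sentence."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```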
Architecture Deep Dive
Scheduler: The scheduler handles request queuing, priority, and preemption. When GPU memory gets tight, lower-priority requests get preempted. Their KV cache blocks are swapped to CPU memory or flagged for recomputation later. It balances fairness and throughput using strategies like shortest-remaining-first or round-robin. In practice, the preemption logic is what keeps vLLM stable under bursty traffic rather than just falling over.
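A rough sketch of that decision, with a made-up policy. The real scheduler weighs more factors than this, and the thresholds here are invented for illustration:

```python
# Toy preemption policy, for intuition only. SWAP moves a victim's KV blocks to
# CPU RAM; RECOMPUTE drops them and replays the prompt later. Which is cheaper
# depends on sequence length vs PCIe bandwidth.
from enum import Enum

class PreemptionMode(Enum):
    SWAP = "swap"
    RECOMPUTE = "recompute"

def pick_victims(running: list[dict], blocks_needed: int) -> list[tuple[dict, PreemptionMode]]:
    victims, reclaimed = [], 0
    # Assumed convention: larger number = lower priority; evict lowest priority first,
    # longest sequences as the tie-breaker.
    for req in sorted(running, key=lambda r: (-r["priority"], -r["num_blocks"])):
        if reclaimed >= blocks_needed:
            break
        mode = PreemptionMode.SWAP if req["num_blocks"] > 32 else PreemptionMode.RECOMPUTE
        victims.append((req, mode))
        reclaimed += req["num_blocks"]
    return victims

running = [{"id": "a", "priority": 0, "num_blocks": 40},
           {"id": "b", "priority": 2, "num_blocks": 10},
           {"id": "c", "priority": 1, "num_blocks": 64}]
print(pick_victims(running, blocks_needed=50))
```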
Quantization: vLLM supports several quantization formats to shrink the memory footprint. AWQ (Activation-aware Weight Quantization) and GPTQ compress weights to 4-bit integers with minimal accuracy loss. FP8 quantization (on H100/H200 GPUs) halves memory compared to FP16 while staying close to lossless. Concrete numbers: Llama 3.1 70B needs 140GB in FP16, which means two A100 80GB GPUs. In AWQ 4-bit, it drops to 35GB and fits on a single card. That difference changes the infrastructure cost calculation completely.
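The arithmetic behind those numbers, for anyone who wants to redo it for a different model size (weights only; KV cache and activations come on top of this):

```python
# Rough weight-memory arithmetic for a 70B-parameter model.
params = 70e9
print(f"FP16:      {params * 2   / 1e9:.0f} GB")  # 2 bytes per parameter   -> ~140 GB
print(f"FP8:       {params * 1   / 1e9:.0f} GB")  # 1 byte per parameter    -> ~70 GB
print(f"AWQ 4-bit: {params * 0.5 / 1e9:.0f} GB")  # ~0.5 bytes per parameter -> ~35 GB
```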
Speculative Decoding: When latency matters more than raw throughput, speculative decoding pairs a small "draft" model (say, Llama 3.1 8B) with the main model (Llama 3.1 70B). The draft model proposes K candidate tokens cheaply. The main model then verifies all K tokens in one forward pass, which costs roughly the same as generating a single token. Tokens that pass verification are kept. Rejected ones get resampled. The result is a 2-3x drop in per-token generation latency, and the output distribution is mathematically identical. No quality tradeoff.
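The verify-and-accept loop, sketched with toy stand-in models. Real implementations use rejection sampling against the two models' probability distributions; the greedy check below only conveys the control flow:

```python
K = 4  # number of tokens the draft model proposes per step

def draft_model(tokens: list[int]) -> int:       # cheap stand-in, not a real model
    return (tokens[-1] + 1) % 50_000

def target_model(tokens: list[int]) -> int:      # expensive stand-in, not a real model
    return (tokens[-1] + 1) % 50_000 if tokens[-1] % 7 else (tokens[-1] + 2) % 50_000

def speculative_step(context: list[int]) -> list[int]:
    # 1. Draft proposes K tokens autoregressively (cheap).
    proposed, ctx = [], list(context)
    for _ in range(K):
        tok = draft_model(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. Target checks all K positions "in one forward pass" (simulated here by one
    #    call per position -- on a GPU these verifications are batched together).
    accepted, ctx = [], list(context)
    for tok in proposed:
        expected = target_model(ctx)
        if tok == expected:
            accepted.append(tok)      # draft token verified, keep it
            ctx.append(tok)
        else:
            accepted.append(expected)  # take the target's token and stop
            break
    return accepted

print(speculative_step([1, 2, 3]))
```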
Fireworks AI runs a customized vLLM fork and hits sub-200ms time-to-first-token for 70B parameter models. The project has over 40,000 GitHub stars and 400+ contributors, which says a lot about where the community has placed its bets.
Deployment Best Practices
Match the model to the hardware first. On a single A100 80GB, it is possible to run Llama 3.1 8B in FP16, Llama 3.1 70B in AWQ 4-bit, or Mixtral 8x7B in FP16. On 4x A100 80GB with tensor parallelism, Llama 3.1 70B runs in FP16 or Llama 3.1 405B in AWQ 4-bit. Set --max-model-len to the actual max context the application needs, not whatever the model card says. If the longest request is 8K tokens, do not set it to 128K.
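As one concrete starting point, the same settings map directly onto vLLM's Python engine arguments. The checkpoint name and values here are illustrative, not recommendations for every workload:

```python
# Example: Llama 3.1 70B, AWQ 4-bit, on a single A100 80GB.
# Values are illustrative starting points; tune them per workload.
from vllm import LLM, SamplingParams

llm = LLM(
    model="hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",  # example AWQ checkpoint
    quantization="awq",
    max_model_len=8192,              # size for the longest request you actually serve
    gpu_memory_utilization=0.90,     # leave headroom; see the monitoring notes below
    enable_prefix_caching=True,      # reuse KV blocks for shared prompt prefixes
    # tensor_parallel_size=4,        # only when the model spans multiple GPUs
)

outputs = llm.generate(["Summarize PagedAttention in two sentences."],
                       SamplingParams(max_tokens=128))
print(outputs[0].outputs[0].text)
```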
For monitoring, watch three metrics: gpu_cache_usage_perc (aim for 80-95% under load), num_requests_running, and avg_generation_throughput_toks_per_s. Export them to Prometheus. If cache usage stays low, the batch size or concurrency is too low. If it is pegged at 100%, request preemption is imminent and latency will spike. Finding the sweet spot for a given workload takes some iteration, but those three numbers tell the story.
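A quick way to eyeball those three metrics before wiring up Prometheus: poll the server's /metrics endpoint directly. Recent vLLM versions prefix metric names with vllm:, so check your version's actual output for the exact spellings:

```python
# Quick-and-dirty poll of the three metrics called out above, straight from the
# Prometheus-format /metrics endpoint the vLLM server exposes.
import urllib.request

WATCH = ("gpu_cache_usage_perc", "num_requests_running",
         "avg_generation_throughput_toks_per_s")

with urllib.request.urlopen("http://localhost:8000/metrics") as resp:
    for line in resp.read().decode().splitlines():
        if not line.startswith("#") and any(name in line for name in WATCH):
            print(line)
```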
Pros
- • 2-24x higher throughput than naive HuggingFace inference, thanks to PagedAttention
- • OpenAI-compatible API, so you can swap it in without changing your client code
- • Supports 50+ model architectures: Llama, Mistral, Qwen, Gemma, and more
- • Continuous batching keeps GPU utilization high across concurrent requests
- • Large, active open-source community shipping features fast
Cons
- • NVIDIA GPUs (CUDA) are basically required. AMD and Intel support is still rough
- • Large models (70B+) need careful memory tuning or you will hit OOM errors
- • You will spend time tuning config flags to match your specific hardware
- • Inference only. No built-in fine-tuning support
- • Multi-node tensor parallelism adds real networking headaches
When to use
- • You are self-hosting open-weight LLMs and need production-grade serving
- • The model's native serving framework cannot keep up with your throughput needs
- • You want an OpenAI-compatible endpoint for internal services
- • Your request volume is high enough that owning GPUs beats paying per-token API costs
When NOT to use
- • Low request volume where API providers are cheaper than renting or owning GPUs
- • You need proprietary models like GPT-4 or Claude (those are not self-hostable)
- • You do not have GPU infrastructure and are not ready to set it up
- • You are rapidly experimenting across many models and setup time matters more than throughput
Key Points
- • PagedAttention treats KV cache memory the way an OS treats virtual memory. Attention keys and values get stored in non-contiguous blocks, which eliminates the 60-80% memory waste seen when naive implementations pre-allocate one big contiguous buffer per request
- • Continuous batching inserts new requests into a running batch at every generation step instead of waiting for the current batch to finish. This pushes GPU utilization from the 30-50% range (static batching) up to 85-95%
- • Tensor parallelism splits each layer's weight matrices across GPUs on the same node. Pipeline parallelism splits consecutive layers into stages across nodes. For a 70B model, the standard deployment is 4-way tensor parallelism on 4x A100 80GB GPUs
- • Speculative decoding runs a small draft model to propose candidate tokens, then the main model verifies them in a single forward pass. The result is 2-3x latency improvement with zero quality loss
- • Prefix caching reuses KV cache blocks for shared prompt prefixes (system prompts, few-shot examples) across requests. This is a big deal for RAG applications where every query shares the same system prompt
Common Mistakes
- ✗ Leaving --max-model-len at the default, which is the model's full context length (128K for Llama 3.1, for example). If requests only use 4K tokens, the engine still has to budget memory and scheduling headroom for 128K-token sequences, which throttles concurrency for no benefit
- ✗ Setting --gpu-memory-utilization to 0.99 and then hitting OOM during peak traffic. Keep it at 0.85-0.90 so there is headroom for KV cache growth when request bursts hit
- ✗ Skipping quantization when it would help. AWQ or GPTQ 4-bit cuts weight memory by roughly 75% with barely any quality drop. A 70B model fits on one A100 80GB instead of a multi-GPU FP16 deployment
- ✗ Forgetting --enable-prefix-caching for RAG workloads. Without it, identical system prompts re-compute KV cache on every single request, burning 30-50% of compute for no reason
- ✗ Sticking to one GPU when tensor parallelism would help. Even if a model fits on a single GPU, 2-way TP can cut per-token latency substantially (close to half in the best case) by parallelizing matrix multiplications