vLLM
The go-to LLM inference engine, built around PagedAttention for squeezing real throughput out of your GPUs
Why It Exists
LLM inference is memory-bound. That is the core problem. During autoregressive generation, every token forces the GPU to read the entire model's weights from memory. Meanwhile, the KV cache (which stores attention key/value pairs for all previous tokens) grows linearly with sequence length. The vanilla HuggingFace Transformers approach pre-allocates a contiguous memory block sized for the maximum possible sequence length. In practice, this wastes 60-80% of GPU memory on empty space. Fewer concurrent requests fit in memory, and the GPU just sits there waiting.
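To put numbers on that, here is a back-of-the-envelope sizing sketch. The model dimensions are assumptions roughly in the range of a modern 8B model, not figures pulled from any specific config:

```python
# Back-of-the-envelope KV cache sizing for one request.
# Assumed, Llama-8B-like dimensions -- adjust for your model.
num_layers = 32
num_kv_heads = 8          # grouped-query attention
head_dim = 128
bytes_per_value = 2       # FP16

# Per token: keys + values, across every layer.
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value

max_len = 32_768          # what a naive implementation pre-allocates for
actual_len = 500          # what the request actually uses

preallocated = max_len * kv_bytes_per_token
used = actual_len * kv_bytes_per_token
print(f"pre-allocated: {preallocated / 1e9:.2f} GB, used: {used / 1e9:.3f} GB")
print(f"wasted: {100 * (1 - used / preallocated):.1f}%")
```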
vLLM came out of UC Berkeley in 2023 and attacked this problem head-on with PagedAttention. The idea is simple: manage KV cache memory in non-contiguous pages, the same way an operating system handles virtual memory. No more fragmentation. No more waste. The result is 2-24x more concurrent requests on identical hardware. It became the default serving engine for open-weight LLMs in production pretty quickly, and for good reason.
How It Works
PagedAttention: Standard attention implementations allocate one contiguous buffer per request, sized to the model's max sequence length. A model with a 32K context window allocates 32K x num_layers x 2 (keys and values) x hidden_dim x 2 bytes (FP16) per request, even if the actual sequence is 500 tokens long. That is absurd. PagedAttention splits the KV cache into fixed-size blocks (16 tokens per block by default) and allocates them on demand as the sequence grows. Blocks do not need to sit next to each other in memory. A block table maps logical positions to physical locations, exactly like a page table in an OS kernel.
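A toy sketch of that bookkeeping in plain Python. It mirrors the idea, not vLLM's actual data structures:

```python
# Toy model of PagedAttention's block table, for intuition only.
BLOCK_SIZE = 16  # tokens per KV cache block

class BlockAllocator:
    def __init__(self, num_physical_blocks: int):
        self.free = list(range(num_physical_blocks))

    def allocate(self) -> int:
        return self.free.pop()            # any free block will do; no contiguity needed

class Sequence:
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []  # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        # Grow the table only when the current block fills up.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

allocator = BlockAllocator(num_physical_blocks=1024)
seq = Sequence(allocator)
for _ in range(40):                       # 40 tokens -> ceil(40/16) = 3 blocks
    seq.append_token()
print(seq.block_table)                    # three physical ids; scattered is fine
```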
Continuous Batching: Static batching groups N requests into a batch and waits for the slowest one to finish before accepting new work. Continuous batching (sometimes called iteration-level scheduling) is smarter. It injects new requests into the running batch at every generation step. The moment a request finishes (hits EOS or max tokens), its slot gets filled immediately. The GPU stays busy, and throughput goes up dramatically.
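The core loop, reduced to a runnable toy. The Request class and the fixed batch limit below are stand-ins for vLLM's scheduler and KV cache accounting:

```python
# Iteration-level (continuous) batching, stripped to the core idea.
import random
from collections import deque

class Request:
    def __init__(self, rid: int, max_new_tokens: int):
        self.rid = rid
        self.max_new_tokens = max_new_tokens
        self.generated = 0

    def step(self) -> bool:
        """Generate one token; return True when the request is finished."""
        self.generated += 1
        hit_eos = random.random() < 0.05
        return hit_eos or self.generated >= self.max_new_tokens

waiting = deque(Request(i, max_new_tokens=random.randint(5, 40)) for i in range(16))
running: list[Request] = []
MAX_BATCH = 4  # stand-in for "enough free KV cache blocks"

step = 0
while running or waiting:
    # Admit new requests at every iteration, not once per batch.
    while waiting and len(running) < MAX_BATCH:
        running.append(waiting.popleft())

    # One forward pass produces one token for every running request.
    finished = [req for req in running if req.step()]

    # Finished slots are reclaimed immediately and refilled next iteration.
    for req in finished:
        running.remove(req)
    step += 1

print(f"served 16 requests in {step} iterations with batch size {MAX_BATCH}")
```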
Model Execution: vLLM loads model weights into GPU memory and runs inference through optimized CUDA kernels. When a model is too large for one GPU, tensor parallelism splits each layer's weight matrices across multiple GPUs, with all-reduce operations syncing intermediate results. The API server speaks the OpenAI chat completions format, so if an app already uses the OpenAI client library, pointing it at vLLM just works.
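For example, assuming a vLLM server is already running locally on the default port 8000 with a Llama model loaded, switching an existing OpenAI-client app over is essentially a one-line change:

```python
# Point the standard OpenAI client at a local vLLM server.
# Assumes `vllm serve meta-llama/Llama-3.1-8B-Instruct` is running on port 8000.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # vLLM's OpenAI-compatible endpoint
    api_key="EMPTY",                      # ignored unless the server sets --api-key
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # must match the served model name
    messages=[{"role": "user", "content": "Explain PagedAttention in one sentence."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```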
Architecture Deep Dive
Scheduler: The scheduler handles request queuing, priority, and preemption. When GPU memory gets tight, lower-priority requests get preempted. Their KV cache blocks are swapped to CPU memory or flagged for recomputation later. It balances fairness and throughput using strategies like shortest-remaining-first or round-robin. In practice, the preemption logic is what keeps vLLM stable under bursty traffic rather than just falling over.
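A rough sketch of that decision, with a made-up policy. The real scheduler weighs more factors than this, and the thresholds here are invented for illustration:

```python
# Toy preemption policy, for intuition only. SWAP moves a victim's KV blocks to
# CPU RAM; RECOMPUTE drops them and replays the prompt later. Which is cheaper
# depends on sequence length vs PCIe bandwidth.
from enum import Enum

class PreemptionMode(Enum):
    SWAP = "swap"
    RECOMPUTE = "recompute"

def pick_victims(running: list[dict], blocks_needed: int) -> list[tuple[dict, PreemptionMode]]:
    victims, reclaimed = [], 0
    # Assumed convention: larger number = lower priority; evict lowest priority first,
    # longest sequences as the tie-breaker.
    for req in sorted(running, key=lambda r: (-r["priority"], -r["num_blocks"])):
        if reclaimed >= blocks_needed:
            break
        mode = PreemptionMode.SWAP if req["num_blocks"] > 32 else PreemptionMode.RECOMPUTE
        victims.append((req, mode))
        reclaimed += req["num_blocks"]
    return victims

running = [{"id": "a", "priority": 0, "num_blocks": 40},
           {"id": "b", "priority": 2, "num_blocks": 10},
           {"id": "c", "priority": 1, "num_blocks": 64}]
print(pick_victims(running, blocks_needed=50))
```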
Quantization: vLLM supports several quantization formats to shrink the memory footprint. AWQ (Activation-aware Weight Quantization) and GPTQ compress weights to 4-bit integers with minimal accuracy loss. FP8 quantization (on H100/H200 GPUs) halves memory compared to FP16 while staying close to lossless. Concrete numbers: Llama 3.1 70B needs 140GB in FP16, which means two A100 80GB GPUs. In AWQ 4-bit, it drops to 35GB and fits on a single card. That difference changes the infrastructure cost calculation completely.
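The arithmetic behind those numbers, for anyone who wants to redo it for a different model size (weights only; KV cache and activations come on top of this):

```python
# Rough weight-memory arithmetic for a 70B-parameter model.
params = 70e9
print(f"FP16:      {params * 2   / 1e9:.0f} GB")  # 2 bytes per parameter   -> ~140 GB
print(f"FP8:       {params * 1   / 1e9:.0f} GB")  # 1 byte per parameter    -> ~70 GB
print(f"AWQ 4-bit: {params * 0.5 / 1e9:.0f} GB")  # ~0.5 bytes per parameter -> ~35 GB
```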
Speculative Decoding: When latency matters more than raw throughput, speculative decoding pairs a small "draft" model (say, Llama 3.1 8B) with the main model (Llama 3.1 70B). The draft model proposes K candidate tokens cheaply. The main model then verifies all K tokens in one forward pass, which costs roughly the same as generating a single token. Tokens that pass verification are kept. Rejected ones get resampled. The result is a 2-3x drop in per-token generation latency, and the output distribution is mathematically identical. No quality tradeoff.
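The verify-and-accept loop, sketched with toy stand-in models. Real implementations use rejection sampling against the two models' probability distributions; the greedy check below only conveys the control flow:

```python
K = 4  # number of tokens the draft model proposes per step

def draft_model(tokens: list[int]) -> int:       # cheap stand-in, not a real model
    return (tokens[-1] + 1) % 50_000

def target_model(tokens: list[int]) -> int:      # expensive stand-in, not a real model
    return (tokens[-1] + 1) % 50_000 if tokens[-1] % 7 else (tokens[-1] + 2) % 50_000

def speculative_step(context: list[int]) -> list[int]:
    # 1. Draft proposes K tokens autoregressively (cheap).
    proposed, ctx = [], list(context)
    for _ in range(K):
        tok = draft_model(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. Target checks all K positions "in one forward pass" (simulated here by one
    #    call per position -- on a GPU these verifications are batched together).
    accepted, ctx = [], list(context)
    for tok in proposed:
        expected = target_model(ctx)
        if tok == expected:
            accepted.append(tok)      # draft token verified, keep it
            ctx.append(tok)
        else:
            accepted.append(expected)  # take the target's token and stop
            break
    return accepted

print(speculative_step([1, 2, 3]))
```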
Fireworks AI runs a customized vLLM fork and hits sub-200ms time-to-first-token for 70B parameter models. The project has over 40,000 GitHub stars and 400+ contributors, which says a lot about where the community has placed its bets.
Deployment Best Practices
Match the model to the hardware first. On a single A100 80GB, it is possible to run Llama 3.1 8B in FP16, Llama 3.1 70B in AWQ 4-bit, or Mixtral 8x7B in FP16. On 4x A100 80GB with tensor parallelism, Llama 3.1 70B runs in FP16 or Llama 3.1 405B in AWQ 4-bit. Set --max-model-len to the actual max context the application needs, not whatever the model card says. If the longest request is 8K tokens, do not set it to 128K.
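As one concrete starting point, the same settings map directly onto vLLM's Python engine arguments. The checkpoint name and values here are illustrative, not recommendations for every workload:

```python
# Example: Llama 3.1 70B, AWQ 4-bit, on a single A100 80GB.
# Values are illustrative starting points; tune them per workload.
from vllm import LLM, SamplingParams

llm = LLM(
    model="hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",  # example AWQ checkpoint
    quantization="awq",
    max_model_len=8192,              # size for the longest request you actually serve
    gpu_memory_utilization=0.90,     # leave headroom; see the monitoring notes below
    enable_prefix_caching=True,      # reuse KV blocks for shared prompt prefixes
    # tensor_parallel_size=4,        # only when the model spans multiple GPUs
)

outputs = llm.generate(["Summarize PagedAttention in two sentences."],
                       SamplingParams(max_tokens=128))
print(outputs[0].outputs[0].text)
```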
For monitoring, watch three metrics: gpu_cache_usage_perc (aim for 80-95% under load), num_requests_running, and avg_generation_throughput_toks_per_s. Export them to Prometheus. If cache usage stays low, the batch size or concurrency is too low. If it is pegged at 100%, request preemption is imminent and latency will spike. Finding the sweet spot for a given workload takes some iteration, but those three numbers tell the story.
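A quick way to eyeball those three metrics before wiring up Prometheus: poll the server's /metrics endpoint directly. Recent vLLM versions prefix metric names with vllm:, so check your version's actual output for the exact spellings:

```python
# Quick-and-dirty poll of the three metrics called out above, straight from the
# Prometheus-format /metrics endpoint the vLLM server exposes.
import urllib.request

WATCH = ("gpu_cache_usage_perc", "num_requests_running",
         "avg_generation_throughput_toks_per_s")

with urllib.request.urlopen("http://localhost:8000/metrics") as resp:
    for line in resp.read().decode().splitlines():
        if not line.startswith("#") and any(name in line for name in WATCH):
            print(line)
```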
Pros
- • 2-24x higher throughput than naive HuggingFace inference, thanks to PagedAttention
- • OpenAI-compatible API, so you can swap it in without changing your client code
- • Supports 50+ model architectures: Llama, Mistral, Qwen, Gemma, and more
- • Continuous batching keeps GPU utilization high across concurrent requests
- • Large, active open-source community shipping features fast
Cons
- • NVIDIA GPUs (CUDA) are basically required. AMD and Intel support is still rough
- • Large models (70B+) need careful memory tuning or you will hit OOM errors
- • You will spend time tuning config flags to match your specific hardware
- • Inference only. No built-in fine-tuning support
- • Multi-node tensor parallelism adds real networking headaches
When to use
- • You are self-hosting open-weight LLMs and need production-grade serving
- • The model's native serving framework cannot keep up with your throughput needs
- • You want an OpenAI-compatible endpoint for internal services
- • Your request volume is high enough that owning GPUs beats paying per-token API costs
When NOT to use
- • Low request volume where API providers are cheaper than renting or owning GPUs
- • You need proprietary models like GPT-4 or Claude (those are not self-hostable)
- • You do not have GPU infrastructure and are not ready to set it up
- • You are rapidly experimenting across many models and setup time matters more than throughput
Key Points
- • PagedAttention treats KV cache memory the way an OS treats virtual memory. Attention keys and values get stored in non-contiguous blocks, which eliminates the 60-80% memory waste seen when naive implementations pre-allocate one big contiguous buffer per request
- • Continuous batching inserts new requests into a running batch at every generation step instead of waiting for the current batch to finish. This pushes GPU utilization from the 30-50% range (static batching) up to 85-95%
- • Tensor parallelism splits each layer's weight matrices across GPUs on the same node. Pipeline parallelism splits consecutive layers into stages across nodes. For a 70B model, the standard deployment is 4-way tensor parallelism on 4x A100 80GB GPUs
- • Speculative decoding runs a small draft model to propose candidate tokens, then the main model verifies them in a single forward pass. The result is 2-3x latency improvement with zero quality loss
- • Prefix caching reuses KV cache blocks for shared prompt prefixes (system prompts, few-shot examples) across requests. This is a big deal for RAG applications where every query shares the same system prompt
Common Mistakes
- ✗ Leaving --max-model-len at the default, which is the model's full context length (128K for Llama 3.1, for example). If requests only use 4K tokens, the engine still has to budget memory and scheduling headroom for 128K-token sequences, which throttles concurrency for no benefit
- ✗ Setting --gpu-memory-utilization to 0.99 and then hitting OOM during peak traffic. Keep it at 0.85-0.90 so there is headroom for KV cache growth when request bursts hit
- ✗ Skipping quantization when it would help. AWQ or GPTQ 4-bit cuts weight memory by roughly 75% with barely any quality drop. A 70B model fits on one A100 80GB instead of a multi-GPU FP16 deployment
- ✗ Forgetting --enable-prefix-caching for RAG workloads. Without it, identical system prompts re-compute KV cache on every single request, burning 30-50% of compute for no reason
- ✗ Sticking to one GPU when tensor parallelism would help. Even if a model fits on a single GPU, 2-way TP can cut per-token latency substantially (close to half in the best case) by parallelizing matrix multiplications