LLM Model Tiers & Quantization
Tokens
The basic unit LLMs work with. Text is split into tokens before the model processes it. A token is roughly 4 characters of English or 0.75 words. Code uses slightly more tokens than prose because of syntax characters like brackets and semicolons. A 100-line function is roughly 500-800 tokens. Pricing, context windows, and speed are all measured in tokens.
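A quick way to see this concretely is to run a tokenizer. The sketch below uses the tiktoken library as one example (every model family ships its own tokenizer, so exact counts vary):

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by GPT-4-era models

prose = "The quick brown fox jumps over the lazy dog."
code = "for (let i = 0; i < items.length; i++) { total += items[i].price; }"

# Code tokenizes less efficiently: brackets, semicolons, and operators
# often become their own tokens.
for label, text in [("prose", prose), ("code", code)]:
    n = len(enc.encode(text))
    print(f"{label}: {len(text)} chars -> {n} tokens ({len(text)/n:.1f} chars/token)")
```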
Parameters (7B, 34B, 70B)
A model's size is measured by how many numbers (parameters) it learned during training. 7B means 7 billion parameters. More parameters generally means better reasoning but more GPU memory and slower inference. Think of it like engine size in a car: a bigger engine is more powerful but burns more fuel.
FP16 (16-bit float)
The default serving format for a model's parameters, using 16 bits per number. Strictly speaking this is half precision (training traditionally uses 32-bit floats), but for inference it is effectively lossless and serves as the quality baseline everything else is compared against. A 70B model needs ~140 GB of GPU memory for the weights alone. Used for agent tasks and complex reasoning where quality matters most.
INT8 (8-bit integer)
A compression technique that rounds each parameter from 16 bits down to 8 bits. Like converting a high-res photo to medium-res: smaller file, barely noticeable difference. Halves memory vs FP16 (~70 GB for 70B). Roughly 1% quality drop. Good balance for multi-line code completions.
INT4 (GPTQ / AWQ)
Aggressive compression down to 4 bits per parameter, using algorithms like GPTQ or AWQ to minimize quality loss. Quarters memory vs FP16 (~35 GB for 70B). About 3% drop for completions, 8% for reasoning. Used for fast inline autocomplete where speed matters more than depth.
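The memory figures in the three entries above are simple arithmetic: parameter count times bytes per parameter. A minimal sketch (weights only; KV-cache and activations add more on top):

```python
def weight_memory_gb(params_billions: float, bits_per_param: int) -> float:
    # Weights only -- ignores KV-cache, activations, and the small
    # per-tensor overhead real quantization formats carry.
    return params_billions * 1e9 * bits_per_param / 8 / 1e9

for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"70B @ {name}: ~{weight_memory_gb(70, bits):.0f} GB")
# 70B @ FP16: ~140 GB
# 70B @ INT8: ~70 GB
# 70B @ INT4: ~35 GB
```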
TTFT (Time to First Token)
How quickly the model starts responding, measured from when the request is sent to when the first token appears. This is what developers actually feel as speed. Target: under 200ms for autocomplete, under 3s for agent tasks. A 200ms TTFT feels instant even if the full response takes 500ms.
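One way to measure it is to time the gap before the first streamed token. A minimal sketch, assuming a hypothetical client whose streaming call yields tokens as they arrive:

```python
import time

def measure_ttft(stream):
    # `stream` is any iterator yielding tokens as they arrive --
    # a stand-in for a real client's streaming API.
    start = time.perf_counter()
    next(stream)                     # block until the first token lands
    ttft = time.perf_counter() - start
    for _ in stream:                 # drain the remaining tokens
        pass
    total = time.perf_counter() - start
    return ttft, total

# Hypothetical usage:
#   ttft, total = measure_ttft(client.stream("complete this function ..."))
#   print(f"TTFT {ttft*1e3:.0f} ms, full response {total*1e3:.0f} ms")
```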
KV-Cache
Think of it as the model's short-term memory for the current conversation. It stores the attention state (the key and value tensors) computed while processing the prompt. When a developer types one more character, most of the prompt is unchanged, so the system reuses the cached state instead of recomputing it from scratch. This can turn a 100ms computation into 10ms.
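As one concrete illustration, the Hugging Face transformers library exposes this cache as past_key_values. The sketch below (with GPT-2 as a small stand-in model) runs the prompt once, then processes a single new token against the cached state:

```python
# pip install transformers torch
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt_ids = tok("def fibonacci(n):", return_tensors="pt").input_ids

with torch.no_grad():
    # Full pass over the prompt; keep the attention keys/values.
    out = model(prompt_ids, use_cache=True)
    past = out.past_key_values

    # One new token arrives: only it is processed, attending to the
    # cached state instead of re-reading the whole prompt.
    next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
    out = model(next_id, past_key_values=past, use_cache=True)
```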
Context Window
The maximum amount of text a model can read and respond to in a single call, measured in tokens. Like a desk: bigger desk fits more documents, but papers in the middle tend to get lost. 4K-8K tokens is standard, 128K-200K for frontier models. Quality often degrades in the middle of very long contexts (lost-in-the-middle effect).
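In practice, serving code has to trim the prompt to fit. A minimal sketch, assuming a hypothetical count_tokens helper, that keeps the newest messages and reserves room for the response:

```python
def fit_to_window(messages, count_tokens, window=8192, reserve=1024):
    # Walk newest-first, keeping messages until the budget is spent;
    # `reserve` leaves space for the model's reply.
    budget = window - reserve
    kept, used = [], 0
    for msg in reversed(messages):
        cost = count_tokens(msg)
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))
```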
Temperature and Top-p
Controls for how 'creative' or 'random' the model's output is. Temperature 0.0 means always pick the most likely next token (deterministic, good for code). Temperature 1.0 means more variety (good for brainstorming). Top-p (nucleus sampling) limits choices to the smallest set of tokens whose probabilities sum to p. For code completions, low temperature (0.2-0.4) with top-p 0.95 is typical.
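Both knobs are straightforward to implement over raw next-token scores. A minimal numpy sketch (greedy when temperature is 0):

```python
import numpy as np

def sample(logits, temperature=0.3, top_p=0.95):
    if temperature == 0:
        return int(np.argmax(logits))      # deterministic: always the top token

    # Temperature scaling, then softmax (shifted for numerical stability).
    probs = np.exp((logits - logits.max()) / temperature)
    probs /= probs.sum()

    # Nucleus sampling: keep the smallest set of tokens whose
    # probabilities sum to top_p, then renormalize and sample.
    order = np.argsort(probs)[::-1]
    cutoff = int(np.searchsorted(np.cumsum(probs[order]), top_p)) + 1
    kept = order[:cutoff]
    return int(np.random.choice(kept, p=probs[kept] / probs[kept].sum()))
```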
Speculative Decoding
A speed trick that uses teamwork between two models. A small fast model (say 7B) guesses the next several tokens. The large accurate model (say 70B) checks all of those guesses in a single forward pass, which costs about the same as generating one token, so verification is far cheaper than generation. Correct guesses are kept; the first wrong one is replaced by the large model's own token and drafting resumes from there. Provides a 1.5-2x speedup in practice.
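A toy sketch of one draft-then-verify round, simplified to greedy matching (production implementations use a rejection-sampling rule that provably preserves the large model's output distribution). Both models here are hypothetical callables that return one predicted next token per position of the input sequence:

```python
def speculative_step(draft_model, target_model, tokens, k=4):
    # 1. Draft: the small model guesses k tokens one at a time (cheap).
    draft = list(tokens)
    for _ in range(k):
        draft.append(draft_model(draft)[-1])

    # 2. Verify: ONE large-model forward pass scores every guess at once.
    predictions = target_model(draft)

    # 3. Accept the longest prefix where draft and target agree; the
    #    first disagreement is replaced by the target's own token.
    accepted = list(tokens)
    for i in range(len(tokens), len(draft)):
        if draft[i] == predictions[i - 1]:
            accepted.append(draft[i])        # guess confirmed for free
        else:
            accepted.append(predictions[i - 1])
            break
    return accepted
```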
Continuous Batching
A smarter way to share GPU resources across multiple requests. Instead of waiting for all requests in a batch to finish before serving any of them, GPU slots are freed as individual requests complete and immediately given to waiting requests. Like a restaurant that seats new diners as soon as a table opens, not waiting until every table is free. GPU utilization jumps from roughly 40% to 80-90%.
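A toy scheduling loop that captures the idea; step_fn stands in for one decode step on a single request and returns True when that request finishes:

```python
from collections import deque

def continuous_batching(requests, step_fn, max_slots=8):
    waiting, active = deque(requests), []
    while waiting or active:
        # Refill any freed slots BEFORE the next decode step, so new
        # requests never wait for the whole batch to drain.
        while waiting and len(active) < max_slots:
            active.append(waiting.popleft())
        # One decode step for everything currently seated.
        active = [req for req in active if not step_fn(req)]
```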
Fill-in-the-Middle (FIM)
A special prompt format designed for editing existing code, not just writing new code at the end. The prompt includes code BEFORE the cursor and code AFTER the cursor. The model generates what goes in the middle. This way the model knows not to duplicate the code that already exists below. Critical for real-world editing where developers work inside functions, not just at the end of a file.
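The sentinel tokens that mark the three regions vary by model family; the sketch below uses StarCoder-style markers as one concrete example:

```python
def build_fim_prompt(prefix: str, suffix: str) -> str:
    # StarCoder-style sentinels; other families use different markers
    # (e.g. Code Llama's <PRE>/<SUF>/<MID>).
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"

prefix = "def median(values):\n    s = sorted(values)\n    "
suffix = "\n    return s[mid]"
prompt = build_fim_prompt(prefix, suffix)
# The model generates the middle (e.g. "mid = len(s) // 2") and, seeing
# the suffix, knows not to re-emit the `return` already below the cursor.
```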
MoE (Mixture of Experts)
A model architecture where only a fraction of the parameters activate for each token. DeepSeek V3, for example, has ~671B total parameters but activates only ~37B per token, making it nearly as fast as a 34B dense model while drawing on the knowledge capacity of a much larger one; Mistral's MoE models (e.g. Mixtral) take the same approach. The trade-off: more total memory is needed to hold all the experts, even though only a few fire per request.
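A toy top-k routing layer in PyTorch shows the mechanism (real MoE layers add load-balancing losses and fused kernels on top):

```python
import torch
import torch.nn.functional as F

def moe_forward(x, router, experts, k=2):
    # x: [tokens, dim]; router: Linear(dim, n_experts); experts: list of
    # small networks. Only k of len(experts) experts run per token.
    scores = F.softmax(router(x), dim=-1)        # [tokens, n_experts]
    weights, chosen = scores.topk(k, dim=-1)     # top-k experts per token
    weights = weights / weights.sum(dim=-1, keepdim=True)

    out = torch.zeros_like(x)
    for slot in range(k):
        for e, expert in enumerate(experts):
            mask = chosen[:, slot] == e          # tokens routed to expert e
            if mask.any():
                out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
    return out
```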
Fine-Tuning vs RAG
Two ways to give an LLM knowledge it doesn't have from training. Fine-tuning permanently changes the model weights by training on new data (expensive, slow, bakes knowledge in). RAG retrieves relevant information at query time and injects it into the prompt (cheaper, keeps data fresh, no model changes). For code assistants, RAG is preferred because codebases change constantly and fine-tuning can't keep up.
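A minimal RAG loop, with search_index and llm as hypothetical interfaces standing in for whatever vector store and model client a real assistant uses:

```python
def answer_with_rag(question, search_index, llm, k=3):
    # Retrieve at query time; nothing about the model's weights changes,
    # so the index can be rebuilt as often as the codebase changes.
    snippets = search_index.query(question, top_k=k)
    context = "\n\n".join(s.text for s in snippets)
    prompt = (
        "Answer using the repository context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    return llm.complete(prompt)
```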