LLM Prompt Engineering for Code

System Prompt

The hidden instructions sent to the LLM before any user content. For code assistants, this includes: 'Only use imports that exist in this project', 'Match the surrounding code style', 'Preserve all function signatures during refactors'. The system prompt shapes every response without the developer seeing it. Getting it right is the difference between helpful and dangerous suggestions.

FIM Prompt Format

Fill-in-the-Middle. The prompt is split into three special sections: prefix (code before the cursor), suffix (code after the cursor), and the model generates the middle. Uses special tokens like <|fim_prefix|>, <|fim_suffix|>, <|fim_middle|>. Without FIM, the model only sees what's above the cursor and might generate code that conflicts with what's already below.

Context Window Budget

The total context window (e.g., 4K tokens for fast completions) must be carefully divided. A typical budget: current file region (800 tokens), suffix after cursor (200), imports and types (400), open tabs and recent edits (300), RAG-retrieved snippets (300), git diff (150), LSP diagnostics (100). Every source competes for limited space, so a scoring system ranks them by relevance.

Scope-Aware Truncation

Instead of naively taking the first N tokens of a file, use the AST to identify structurally important parts: imports (lines 1-20), the enclosing class definition (line 150), and the current method (lines 280-310). Include those and skip the irrelevant lines in between. This gives the model the skeleton of the file plus the precise area being edited, using far fewer tokens.

Multi-Candidate Generation

Instead of generating one completion, generate 3-5 candidates at different temperature settings (0.2, 0.4, 0.8) to get diverse options. Score each candidate on syntax validity (does it parse?), import existence (do referenced packages exist?), style match (indentation, naming), model confidence (log probability), and deduplication. Show the highest-scoring candidate as ghost text.

Guardrail Injection

Rules injected into prompts to prevent common LLM mistakes. For completions: 'Only use imports that exist in this project.' For refactors: 'Preserve all function signatures.' For bug fixes: 'Minimal change, do not refactor unrelated code.' For test generation: 'Use the same test framework as existing tests.' Without guardrails, the model halluccinates packages and rewrites entire files.

Relevance Scoring Formula

Each context source gets a score to determine whether it's included in the prompt. A typical formula: score = (1 / distance_from_cursor) x recency_weight x import_depth_bonus x edit_frequency_bonus. Sources are sorted by score and included until the token budget is full. Recently edited files that are imported by the current file rank highest.

Prompt Templates by Task

Different tasks need different prompt structures. Inline completion uses FIM format. Refactoring uses instruction + before/after code with 'preserve signatures' guardrail. Test generation uses the function under test + 'use same test framework' guardrail. Bug fixes use error message + code + 'minimal change' guardrail. Each template is tuned for its specific use case.

Instruction Hierarchy

When file contents become part of the prompt (which happens every time the agent reads a file), malicious content in those files could try to hijack the LLM. Defense: the system prompt takes permanent precedence over any user-provided or file-provided content. This is called instruction hierarchy. Combined with content safety classifiers, it prevents prompt injection attacks.

Streaming and Word-Boundary Buffering

The model generates sub-word tokens. Token 'proc' followed by 'essPayment' should appear as 'processPayment', not flash 'proc' then replace. The system buffers the first 3-5 tokens before flushing to the UI. After the first flush, each token is sent immediately. This prevents visual jitter while keeping the perceived speed fast.

Context Poisoning Prevention

RAG can retrieve malicious code from the codebase (e.g., a test fixture with obfuscated exploit code). If this code enters the prompt, the model might reproduce the malicious pattern. Fix: run a content safety classifier on all retrieved chunks before injecting them into the prompt. Chunks flagged as potentially harmful are excluded from the context.

Persistence Rate Over Acceptance Rate

Acceptance rate (did the developer press Tab?) is a misleading metric. Developers sometimes Tab-accept a suggestion and immediately delete it (Ctrl+Z). Persistence rate measures what they actually keep after 30 seconds. A suggestion that's accepted then deleted is a failure, not a success. Optimize for persistence rate, not acceptance rate.