How do you answer LLM system design interview questions step by step?

Updated June 9, 2026 · 8 min read · Crack ML Interview

TL;DR

Answering an LLM system design question well requires a consistent five-step structure: clarify scope and SLAs, design the inference serving stack, layer in retrieval if required, define observability, and address cost and scale. High-scoring candidates proactively name specific components like vLLM for serving, Pinecone or pgvector for retrieval, and p50/p99 latency targets rather than staying generic. Skipping any step, especially observability and cost, is one of the most common sources of lost points at companies like OpenAI, Anthropic, and Databricks.

The Five-Step Framework for LLM System Design Questions

Step 1: Clarify scope before drawing anything

Spend the first three to five minutes asking clarifying questions that shape every downstream decision. Is this an online serving system or a batch pipeline? What is the latency SLA: under 200ms time-to-first-token, or is 2 seconds acceptable? What is the traffic scale: requests per second or daily active users? What model size and provider constraints exist? Is this a retrieval-augmented system or direct generation? Writing the answers on the whiteboard before designing ensures you solve the right problem and demonstrates structured thinking.
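
When the interviewer states traffic in daily active users rather than requests per second, a quick conversion gives you a number to design against. The sketch below is a back-of-envelope estimate only; the requests-per-user figure and the peak-to-average factor are illustrative assumptions, not values from any real system.

```python
import math

SECONDS_PER_DAY = 86_400


def average_rps(daily_active_users: int, requests_per_user_per_day: float) -> float:
    """Average requests per second implied by a DAU figure."""
    return daily_active_users * requests_per_user_per_day / SECONDS_PER_DAY


def peak_rps(avg_rps: float, peak_factor: float) -> int:
    """Peak requests per second, rounded up, for a given peak-to-average ratio."""
    return math.ceil(avg_rps * peak_factor)


if __name__ == "__main__":
    # Assumed inputs: 1M DAU, 10 requests per user per day, 3x peak factor.
    avg = average_rps(1_000_000, 10)
    print(f"average RPS: {avg:.1f}")
    print(f"peak RPS:    {peak_rps(avg, 3.0)}")
```

Stating the assumed peak factor out loud matters more than the exact value, since the interviewer can then correct it.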

Step 2: Design the inference serving stack

Describe the model serving layer with specific technology choices. Name the model server, with vLLM and Hugging Face's Text Generation Inference (TGI) being common open-source options, and explain why: continuous batching schedules work at the level of individual decoding steps, so new requests join the running batch as soon as earlier sequences finish instead of waiting for the whole batch to complete, and PagedAttention allocates KV cache in fixed-size blocks, which keeps fragmentation and over-reservation low. Add a load balancer distributing requests across replicas. Explain how the KV cache grows linearly with sequence length and with the number of concurrent sequences, and why that caps batch size and therefore throughput. Interviewers at OpenAI and Anthropic specifically probe whether you understand KV cache memory arithmetic.
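
The KV cache arithmetic can be sketched in a few lines. For a standard decoder-only transformer, each token stores one key and one value vector per layer per KV head. The model shape used below (32 layers, 32 KV heads, head dimension 128, 16-bit values) is an illustrative 7B-class configuration, and the 80 GiB GPU and 14 GiB weight footprint are assumptions; the estimate also ignores activations and runtime overhead.

```python
GIB = 2**30


def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_element: int = 2) -> int:
    """Bytes of KV cache one token occupies across all layers.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element


def max_concurrent_sequences(gpu_memory_bytes: int, weight_bytes: int,
                             tokens_per_sequence: int, bytes_per_token: int) -> int:
    """Upper bound on sequences whose KV cache fits next to the weights."""
    free_bytes = max(gpu_memory_bytes - weight_bytes, 0)
    return free_bytes // (tokens_per_sequence * bytes_per_token)


if __name__ == "__main__":
    # Assumed 7B-class shape with fp16 cache entries.
    per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)
    print(f"KV cache per token: {per_token / 2**20:.2f} MiB")

    # Assumed hardware: 80 GiB GPU, 14 GiB of weights, 4096-token sequences.
    fit = max_concurrent_sequences(80 * GIB, 14 * GIB, 4096, per_token)
    print(f"max concurrent 4096-token sequences: {fit}")
```

Under these assumptions each token costs 0.5 MiB, so a single 4096-token sequence consumes 2 GiB. That is why long contexts shrink the achievable batch size so quickly, and why models with fewer KV heads (grouped-query attention) serve more concurrent requests.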

Steps 3 through 5: retrieval, observability, and cost

If the question involves RAG, design the retrieval layer: chunking strategy and chunk size tradeoffs, an embedding model, a vector database such as Pinecone for managed scale or pgvector for PostgreSQL integration, and a reranker to improve top-k precision before passing context to the LLM. Then address observability: define the metrics you would monitor, specifically p50 and p99 latency (split into time-to-first-token and end-to-end latency where streaming matters), throughput in tokens per second, and a hallucination rate estimated with a judge model or citation-grounding checks. Finally address cost and scale: GPU fleet sizing, spot instance strategy, and request queuing to absorb traffic spikes without overprovisioning.
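
To make the latency and throughput metrics concrete, here is a minimal sketch of computing p50 and p99 from recorded request latencies with the nearest-rank method, plus a tokens-per-second figure. The function names and sample values are hypothetical; a production system would normally rely on its metrics backend's histogram support rather than sorting raw samples.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def tokens_per_second(total_tokens: int, elapsed_seconds: float) -> float:
    """Aggregate generation throughput over a measurement window."""
    return total_tokens / elapsed_seconds


if __name__ == "__main__":
    # Hypothetical per-request latencies in milliseconds.
    latencies_ms = [120, 135, 150, 180, 210, 240, 300, 450, 900, 2400]
    print("p50:", percentile(latencies_ms, 50))
    print("p99:", percentile(latencies_ms, 99))
    print("throughput:", tokens_per_second(12_000, 4.0), "tokens/s")
```

The gap between p50 and p99 in a sample like this is the point worth raising in the interview: a healthy median can hide a long tail caused by long prompts or queueing.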

Applying the Framework to Common Interview Questions

Design a RAG chatbot for an enterprise knowledge base

Clarify: expected document volume, query latency target, and whether answers need citations. Inference stack: vLLM serving a 7B or 70B model with continuous batching. Retrieval layer: ingest documents as overlapping 512-token chunks, embed them with a sentence transformer, store the vectors in pgvector, retrieve the top 20 by cosine similarity, and rerank with a cross-encoder down to the top 5. Observability: a faithfulness score comparing the generated answer against the retrieved context, latency per query, and retrieval miss rate. Cost: cache embeddings for frequent queries, route simple queries to a smaller model, and escalate to a larger model only for complex ones.
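
The chunk-then-retrieve part of this pipeline can be sketched with the standard library alone. This is an illustration under stated assumptions: tokens are plain list items, embeddings are supplied by the caller, and similarity is computed by brute force rather than by a vector database index. The cross-encoder reranking stage is omitted, and all function names are hypothetical.

```python
import math


def chunk_tokens(tokens: list, chunk_size: int, overlap: int) -> list[list]:
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: list[float], vectors: list[list[float]], k: int) -> list[int]:
    """Indices of the k vectors most similar to the query, best first."""
    ranked = sorted(range(len(vectors)),
                    key=lambda i: cosine_similarity(query, vectors[i]),
                    reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    # Toy sizes; the design above uses 512-token chunks and top-20 retrieval.
    print(chunk_tokens(list(range(10)), chunk_size=4, overlap=2))
    print(top_k([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], k=2))
```

Overlap trades index size for recall: a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some tokens twice.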

Design an inference API serving 100,000 RPS

Clarify: token length distribution, acceptable p99 latency, and whether requests are streaming or batch. At 100K RPS the bottleneck moves from single-instance memory to fleet coordination. Design a multi-region deployment with a global load balancer, autoscaling GPU clusters per region, and a request queue using Kafka or SQS to absorb bursts. Use semantic caching at the API gateway layer to serve repeated or near-duplicate queries without hitting the model. Discuss spot instance strategies and minimum warm capacity to avoid cold-start latency spikes.
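
A rough fleet-sizing calculation shows why this question is about fleet coordination. The sketch below counts only decode throughput and ignores prefill cost, cache hits, and regional redundancy. The per-replica throughput, average output length, and target utilization are placeholder assumptions you would replace with measured values.

```python
import math


def replicas_needed(peak_rps: float, avg_output_tokens: float,
                    replica_tokens_per_second: float,
                    target_utilization: float) -> int:
    """Replicas required to sustain peak token demand at a utilization ceiling."""
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    required_tokens_per_second = peak_rps * avg_output_tokens
    usable_per_replica = replica_tokens_per_second * target_utilization
    return math.ceil(required_tokens_per_second / usable_per_replica)


if __name__ == "__main__":
    # Assumed: 200 output tokens per request, 5,000 tokens/s per replica,
    # and a 70% utilization target to leave headroom for bursts.
    n = replicas_needed(peak_rps=100_000, avg_output_tokens=200,
                        replica_tokens_per_second=5_000,
                        target_utilization=0.7)
    print(f"replicas needed: {n}")
```

With these placeholder numbers the answer is in the thousands of replicas, which is the cue to discuss semantic caching and model routing as ways to reduce the token demand itself rather than only adding GPUs.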

Common Mistakes That Cost Points in LLM System Design Rounds

Staying generic and never naming real components

Answers that describe a cache layer, a model server, and a database without naming specific technologies score significantly lower than answers that say vLLM, PagedAttention, Pinecone, and pgvector. Interviewers use technology choices as a proxy for real production experience. You do not need to know every configuration detail, but you must be able to justify why you chose one option over alternatives.

Skipping observability and cost as afterthoughts

Many candidates spend all their time on the happy-path architecture and mention monitoring in passing at the end. Interviewers at leading AI companies consistently deduct points for this. Treat observability and cost as first-class design decisions: define your SLIs and SLOs early, name the specific metrics you would alert on, and discuss GPU cost per million tokens as a design constraint that influences every other architectural choice you make.
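
GPU cost per million tokens is simple to derive once you have an hourly GPU price and a sustained throughput figure. The sketch below uses an hourly price and throughput that are purely illustrative assumptions, not quotes from any provider.

```python
def cost_per_million_tokens(gpu_hourly_usd: float, tokens_per_second: float) -> float:
    """USD cost to generate one million tokens on a single GPU replica."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_usd / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    # Assumed: $2.50 per GPU-hour and 2,500 tokens/s sustained throughput.
    cost = cost_per_million_tokens(gpu_hourly_usd=2.50, tokens_per_second=2_500)
    print(f"${cost:.4f} per million tokens")
```

Because throughput sits in the denominator, anything that raises sustained tokens per second (larger batches, better KV cache utilization) lowers unit cost directly, which is how this metric ties back to the serving-stack decisions.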

LLM Serving System Components: Latency and Cost Tradeoffs

Component	Purpose	Key Technology Options	Latency Impact	Cost Driver	Common Failure Mode
Model server	Run inference	vLLM, TGI, TensorRT-LLM	Highest	GPU compute	KV cache OOM, underutilized batching
Vector database	Store and retrieve embeddings	Pinecone, pgvector, Weaviate, Qdrant	Low (10–50ms)	Storage and query cost	Stale embeddings, poor chunking
Reranker	Improve retrieval precision	Cross-encoder, Cohere Rerank	Medium (50–200ms)	CPU/small GPU compute	Added latency exceeds SLA
Semantic cache	Avoid redundant inference	Redis with vector similarity	Near-zero on hit	Memory	Low hit rate if queries are diverse
Request queue	Absorb traffic bursts	Kafka, SQS, in-process queue	Adds queuing delay	Messaging cost	Queue depth grows unbounded under sustained load
Hallucination monitor	Detect grounding failures	LLM judge, citation checker	Async, no serving impact	Additional inference cost	Judge model itself hallucinates
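
The semantic cache row can be illustrated with a minimal in-memory sketch: store each response alongside the embedding of its query, and serve a cached response only when a new query's embedding is similar enough. The class name, the 0.95 threshold, and the linear scan are assumptions made for illustration; a real deployment would use a vector index, such as the Redis vector similarity option listed in the table, and would add eviction and TTLs.

```python
import math
from typing import Optional


def _cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    """Hypothetical cache keyed by query embedding instead of exact text."""

    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold  # assumed similarity cutoff for a hit
        self._entries = []          # list of (embedding, response) pairs

    def put(self, embedding: list[float], response: str) -> None:
        self._entries.append((list(embedding), response))

    def get(self, embedding: list[float]) -> Optional[str]:
        """Return the most similar cached response, or None on a miss."""
        best_score, best_response = 0.0, None
        for stored, response in self._entries:
            score = _cosine(embedding, stored)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None


if __name__ == "__main__":
    cache = SemanticCache(threshold=0.95)
    cache.put([1.0, 0.0], "cached answer")
    print(cache.get([0.99, 0.1]))  # near-duplicate query embedding
    print(cache.get([0.0, 1.0]))   # unrelated query embedding
```

The threshold is the main tuning knob: set it too low and users receive answers to a different question, set it too high and the hit rate collapses, which is the failure mode the table calls out.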

Who this is for

Backend engineer interviewing for an LLM platform role at a Series B AI startup

Profile: Strong distributed systems background, has built microservice APIs at scale, but has only used LLMs via API calls and has never designed an inference serving stack.

Pain points: Defaults to describing the architecture as a generic microservice with a model API call, misses KV cache mechanics, and does not mention continuous batching or retrieval reranking.

Strategy: Practice the five-step framework on three to four canned questions before the interview. Focus specifically on memorizing the KV cache memory formula and the continuous batching explanation, since these are the two most common probe points. Use Crack ML Interview's LLM system design question bank to get exposure to the vocabulary and component naming that scores points.

ML researcher moving into an AI infrastructure engineer role at a large tech company

Profile: Deep familiarity with transformer internals and training dynamics, but limited exposure to serving infrastructure, load balancing, or distributed systems patterns.

Pain points: Can explain attention and KV cache mathematically but frames the entire design around a single model instance, missing fleet coordination, autoscaling, queuing, and cost optimization.

Strategy: Study serving infrastructure patterns explicitly: load balancing strategies, autoscaling policies, request queuing, and semantic caching. Use the researcher background as an advantage by leading with KV cache and batching mechanics, then layer in the infrastructure decisions around them. This combination of ML depth plus engineering breadth is the highest-scoring profile.

FAQ

Q: How long should each part of the framework take in a 45-minute system design round?

A: A reasonable split is five minutes for scope clarification, fifteen minutes for the inference stack, ten minutes for retrieval if required, eight minutes for observability and cost, and seven minutes for follow-up questions and tradeoff discussion. Adjust based on what the interviewer probes most.

Q: Do I need to cover all five steps even if the question does not mention retrieval?

A: Clarification, inference stack, observability, and cost are always relevant. Skip the retrieval step if the question is pure generation without knowledge grounding. If you are unsure, ask the interviewer during the clarification phase whether retrieved context is part of the scope.

Q: How specific do I need to be about technology choices like vLLM versus TGI?

A: Name at least one specific option per component and be able to state one concrete reason for choosing it over alternatives. You do not need to know configuration internals. For example, saying "vLLM, because it supports continuous batching and PagedAttention, which improve memory utilization under variable-length requests" is sufficient depth for most interview contexts.
