TRACE · Part 4 of 4 ← Previous

Why Does an LLM Answer One Piece at a Time?

2026-05-28 53 min read

trace systems llms transformers gpu inference distributed-systems ai-infra

Note: this is an AI-assisted exploration written for my own understanding, not a description of any one company's proprietary serving stack. It explains the common mechanisms used by large-scale LLM systems and cites public sources where possible.

A child points at a rainbow through the window and asks why the colors are there.

You open chat and type:

Explain rainbows simply.

The cursor blinks.

Then the first word appears.

Not the whole answer. Not a finished paragraph. Just the first little piece.

Then another piece. Then another. The sentence grows in front of you, as if the machine is thinking and typing at the same time.

That should feel strange.

If a normal web page answers you, it usually sends a finished thing: a search result page, a JSON response, a file. But an LLM chat answer arrives unfinished. The service is not hiding a completed essay and slowly revealing it for drama. It is choosing one small piece of text, sending it, feeding that piece back into its own input, and choosing again.

So the real question is not just “how do LLMs work?”

The sharper question is:

What has to stay true, at every layer, for the next little piece to appear?

That first visible fragment has a path. It starts as your rainbow question, becomes bytes on a wire, gets admitted as future work, turns into numbered text fragments, passes through learned tables on GPU chips, wins a little probability contest, and returns as pixels on your screen.

Let’s follow it.

The Journey

Here is the whole loop at the highest level:

Static fallback: a chat request moves from message assembly to nearby networking, front-door checks, text splitting, shared GPU work, repeated piece generation, streaming, and finally browser painting.

Keep three objects in your hand:

Object	The plain question
Your message	what exactly did the browser send?
The running copy	which loaded model is doing the work?
The next piece	what is being chosen and streamed right now?

Most confusion comes from calling all of that “the model.” The answer on your screen is not one object. It is a loop that stayed coherent long enough for a tiny piece of text to reach your eyes.

Our running thread will stay small:

Moment	Toy object
What you typed	`Explain rainbows simply.`
First useful answer fragment	something like `Rainbows`
The thing we follow	one next piece leaving the loop

The Browser Does Not Send a Thought

You press Enter.

The browser does not send a thought. It sends a web request.

That request carries your visible text, but also the quiet labels around it: which conversation this belongs to, which account is asking, which model or product mode is requested, whether the answer should stream, what tools may be available, and how the server should cancel the work if you hit stop.

Before that request can move safely, the browser has to do ordinary internet work. It finds the server address. It opens a transport connection. It checks the server identity. It creates encryption keys. Only then do your chat bytes cross the network.

Streaming changes the shape. The browser is not asking for a closed package. It is asking the server to keep a response open so answer fragments can keep arriving.

That open stream is the first clue. The request is not “please give me an answer.” It is closer to:

Start a loop, and send me each safe piece as soon as it exists.

Max output 1200 stream response allow tools attach cancel handle

request envelope

conversation_idconv_8f31

request_idreq_91ca

modelreasoning-large

streamtrue

max_output_tokens1200

toolssearch, calculator

cancel_handleattached

downstream consequence

gatewaycharge quota against reasoning-large

backend choiceneeds a model path that can use tools

work queuereserve output budget near 1200 tokens

streamerhold connection open and flush chunks

cleanupclient stop can free remembered work

Static fallback: an LLM request carries conversation id, request id, model alias, stream preference, tool permissions, output budget, and cancellation state. Those fields become accounting, backend-choice, work-queue, and cleanup constraints.

This is why a boring network problem can look like a model problem. DNS can fail. Login can expire. A proxy can time out. A mobile network can sleep. The model may be perfectly healthy while the stream dies before the next piece reaches you.

Measurement	What it means
New TCP plus TLS 1.3 setup	request data waits for network handshakes first
Reused HTTP/2 or HTTP/3 connection	a chat request can skip fresh setup
Streamed event	a product chunk of bytes, not necessarily one model token
Cancel signal	must travel backward so future GPU work can stop

The server now has an encrypted request.

But it still does not owe you an answer.

First it has to decide whether it can afford to promise one.

The Front Door Makes a Promise

The first serious question is not “what should the model say?”

It is: should this request enter at all?

An LLM service has a front door: an edge service, API gateway, router, or some mixture of those. It checks the ordinary things first. Is the session valid? Is the API key real? Is this account allowed to use this model? Has it crossed a quota?

Then it asks a stranger question:

How much future work is this one click asking us to reserve?

A short prompt asking for one sentence is small. A pasted legal contract asking for a long analysis is not. A reasoning model may need hidden work before the visible answer. A tool-enabled answer may need outside calls. A long stream may keep memory occupied for many seconds.

So the front door estimates before the model runs. It estimates input tokens, maximum output tokens, hidden reasoning budget if the product exposes one, tool permissions, region constraints, and whether any warm running copy can hold the conversation state.

Prompt tokens 8,000 Max output 1,200 Reasoning multiplier 2.0x Free memory blocks 64 region locked by policy

Request id req-1041p

Tier token budget 36,000

Estimated total tokens 10,400

Hidden reasoning estimate 1,200

Memory blocks needed 41/64

Latency objective interactive

identity session, org, model alias

quota tokens per minute, spend cap

policy region, abuse, safety precheck

capacity loaded model and memory blocks

route admit to an already-loaded model group

Static fallback: admission control estimates prompt tokens, output tokens, hidden reasoning work, account quota, policy constraints, and free conversation-memory blocks before sending a request to an already-loaded model group.

Think of a restaurant host with a kitchen behind them. Seating too many tables is not kindness. It creates a room full of people waiting for food that cannot arrive. For LLMs, the kitchen is GPU time, memory, queues, policy checks, and stream connections.

That is admission control: deciding whether accepting this request would make the system break its promises.

Prompt tokens 8,000 Max output 1,200 Reasoning budget 2x Queue age 420 ms Free memory budget 72%

Reserved tokens 12,080 2% of tier TPM

Hidden work estimate 2,880 not necessarily shown to user

Conversation memory 25% needed 72% free

Routing priority 2 used only after policy and residency checks

Teaching estimate: real admission systems use model-specific profilers, measured prompt tokenization, per-region capacity, policy state, and scheduler feedback. The point here is the reservation shape, not the exact formula.

The key word is reserve. Future tokens are not here yet, but the system must treat them as real. If it accepts every long answer optimistically, the fleet can run out of memory halfway through everybody’s streams. If it rejects too cautiously, expensive chips sit idle while users wait outside.

This is where “AI” first becomes a capacity-planning problem.

If this layer breaks, the error may look unfair: “too many requests,” “quota exceeded,” “try again later.” It may happen while some GPUs elsewhere are idle, because policy, region, model choice, or memory shape says your request cannot safely move there.

The front door lets you in.

But “in” is not a place.

It is a choice among many rooms.

The Hallway Behind the Sign

The model name you click is usually not one machine.

It is a sign on a hallway.

Behind the sign may be several rooms: one old snapshot, one new canary snapshot, one region closer to you, one region allowed by policy, one pool that supports tools, one pool that can handle images, one smaller fallback model, one larger model reserved for paid traffic.

Routing is the act of choosing the room.

The Production Stack Is Not Just "A Model"

Browser / app

send message, keep answer path open

visible wait

Nearby web service

private connection, request label

network wait

Front door

identity, usage, and safety checks

can enter?

Router

choose a warm machine group

right room

Model runner

prepare input, share turns, release pieces

work packing

Worker process

loaded model copy and saved past work

runner health

Accelerator machine

number-table math on special chips

chip time

Fleet

many machine groups, updates, failures

whole service

The router is not only looking for “a GPU.” It needs a warm replica with the right weights, the right tokenizer, the right context limit, the right tool support, the right safety envelope, enough memory, and a queue that will not make the first token arrive too late.

This is why a product can feel uneven. The model did not become moody. You may have crossed into a different region, a different snapshot, a colder replica, or a pool under different load.

The hallway also explains rollouts. A new snapshot can sit behind the same visible name while receiving only a thin slice of traffic. If it behaves well, the slice grows. If not, the router can send users back to the old room.

So before the model sees your prompt, the request has already become a routing problem:

Choice	Why it matters
Region	affects latency, policy, and data rules
Snapshot	affects behavior and quality
Replica group	affects queue and memory pressure
Feature pool	affects tools, vision, context length, and formats
Priority	affects who waits when demand spikes

The router finally picks a path.

Now the model gets your message?

Not yet.

Your Message Was Never Alone

You typed one sentence.

The model usually receives much more.

A chat product builds a prompt: the full input sequence the model can read on this turn. It may include system instructions, developer instructions, conversation history, tool descriptions, safety rules, retrieved documents, image-derived representations, and your newest message.

In everyday speech, “prompt” means what you typed. In the machine, prompt means the whole packed bundle.

That distinction matters because the model does not have a private notebook it can secretly consult during a plain serving call. If the model should use something, that thing must be placed where the model can read it, or retrieved by a tool and placed there later.

1 Read request visible text plus metadata

2 Fetch memory recent turns or summaries

3 Retrieve sparse, dense, or hybrid search

4 Rerank keep only context worth spending tokens on

5 Assemble ordered prompt the model will actually see

conversation memory RAG documents tool schemas image input

Retrieved chunks 12 Chunks after rerank 4

system policyrecent historyuser messageretrieved chunkstool schemas

Total prompt tokens 4,370

RAG chunks dropped 8 retrieved but not worth context budget

What the model can use weights + assembled prompt + executed tool results

Static fallback: a chat product builds the real prompt from policy, memory, retrieved chunks, tool schemas, images converted to vectors or tokens, and the user's visible message. Retrieval must still be reranked and fit into the context window.

Now add the familiar modern trick: retrieval.

Suppose you ask about a private policy document. The model weights are not supposed to know that document. The product has to find the relevant passage and put it into the prompt. That is the core idea behind retrieval-augmented generation, or RAG: retrieve evidence first, then generate with that evidence visible.

But evidence has its own journey. A document is cleaned, split into chunks, turned into searchable number handles called embeddings, placed in an index, searched at question time, reranked, trimmed to fit, and finally inserted into the prompt.

Trace step 3/5 Evidence token budget 3200

1 ingest policy.pdf is cleaned, chunked, embedded, and indexed before the chat

2 retrieve the new question searches sparse and dense indexes

3 rerank top chunks are reordered for relevance and source quality

4 pack the best chunk is trimmed into the prompt with citation metadata

5 generate the answer must use the chunk, not just sound plausible

Evidence tokens requested 780

Budget result evidence fits

Failure mode wrong chunk retrieved or evidence pushed out by token budget

Static fallback: RAG, tools, memory, and image inputs help only when their evidence is retrieved or executed, validated, packed into the prompt, and actually used by the final generation.

A citation at the end is only as strong as that path. Was the right document ingested? Was the right chunk found? Did the reranker keep it? Did the prompt include it? Did the final sentence actually stay inside what the evidence says?

Tools follow the same rule. The model can write a structured request such as “call search with these words” or “run this calculation.” Product code validates it, runs the outside system, and feeds the result back into the prompt. The tool is useful only if the result truly crosses back into the model’s readable context.

So here is the first-principles test for memory, RAG, agents, tools, and vision:

What evidence actually entered the context, and what forced the final answer to respect it?

If the answer is “nothing concrete,” the feature is only a suggestion to a text generator.

Thing	Why it matters
Hidden instructions	spend the same context budget as visible text
Retrieved chunks	must fit beside the conversation and system instructions
Tool result	has to be validated before it becomes context
Citation	proves little unless tied to evidence actually supplied

The Prompt Budget Is a Suitcase

The prompt can hold a lot.

It cannot hold everything.

That makes prompt assembly a packing problem. Recent conversation turns matter. System instructions matter. Retrieved evidence matters. Tool descriptions matter. Safety rules matter. Images may matter. User files may matter. But the model has a context limit, and every hidden instruction spends the same kind of room as your visible text.

So the chat app has to decide what travels.

Old messages may be dropped. They may be summarized. A summary may preserve the decision but lose the exact wording that made it important. Retrieval may bring back one paragraph and leave behind the table that explains it. A tool schema may be included because the model might need it, even if this turn never calls the tool.

This is why “the model forgot” can mean several different things.

Maybe the app never put the old fact into this turn. Maybe it summarized the fact badly. Maybe retrieval missed it. Maybe the model saw it and failed to use it. Maybe the suitcase was full of lower-value material that pushed the useful thing out.

The suitcase metaphor is useful because it forces the real question:

What got packed, what got left behind, and who made that choice?

For long chats, that question matters more than the word “memory.” A memory feature is only real if it changes what the model can read or what the product loop can retrieve and verify.

For our rainbow question, the suitcase is probably tiny. But the same packing rule handles a hundred-page policy, a tool list, a pasted screenshot, or a year-long conversation.

And that creates the next trap: even a small-looking suitcase can be expensive once the text is cut up.

Prompt item	Hidden cost
System instructions	tokens before your message begins
Tool schemas	tokens even if no tool is called
Conversation history	grows with every turn unless trimmed
Retrieved chunks	compete with history and instructions
Images	become model-readable representations that still consume budget

The Evidence Can Fall Out

Grounding sounds clean from far away.

Retrieve evidence. Put evidence in the prompt. Generate an answer.

But every verb in that sentence can fail.

The document may never have been ingested. It may have been split at a bad boundary, so the sentence with the answer lives in one chunk and the sentence that explains it lives in another. The search query may find a nearby topic instead of the right passage. The reranker may throw away the useful chunk. The prompt packer may trim the evidence to make room for conversation history. The model may see the right passage and still write a sentence that goes beyond it.

That is why a citation is not magic. A citation is a claim about a path.

Evidence step 4/6

1 query rewrite

2 retrieve candidates

3 rerank source

4 pack evidence

5 generate claim

6 verify support

Retrieved / tool evidence Plan A allows refunds within 30 days if usage is under 10 hours.

Evidence in prompt? yes

Draft claim Plan A refunds are available within 30 days under the usage limit.

Toy support check claim follows from evidence

Static fallback: a grounded claim requires query processing, retrieval or tool execution, reranking, prompt packing, generation, and a final support check against the evidence that actually reached the model. This miniature uses a toy check; production systems need stronger citation and entailment checks.

The toy checker above is deliberately simple, but the question is the real one:

Which source supported this sentence, was that source actually in the prompt, and did the final text stay inside what it says?

This is also where tools and agents become less mystical.

A tool call is a loop inside the loop. The model writes structured text that means “call this function with these arguments.” Product code validates it, runs the outside thing, receives a result, and places that result back into context. If the call is malformed, unauthorized, too slow, or surprising, the product must decide whether to retry, ask the user, fall back, or stop.

An agent is not a different species of model. It is a product loop wrapped around a model: plan, call a tool, read the result, update context, continue. A multi-agent system splits that loop into roles, but now it must manage shared memory, duplicate work, disagreement, and runaway loops.

The useful test is plain:

Did the extra loop improve the task enough to pay for the extra time, cost, and failure surface?

Images follow the same pattern. An image is cut into patches or otherwise converted into model-readable number tables. A vision-language model can combine those image-derived representations with text. The exact design varies, but the journey is familiar: human content becomes numbered pieces or number tables, then the model operates on those numbers.

The prompt now exists.

It is still text.

And the model does not read text.

Text Becomes Little Numbers

The model reads integers.

Before the prompt can enter the model, a tokenizer cuts text into pieces and maps each piece to a number. A token might be a word, part of a word, punctuation, whitespace, an emoji fragment, or a byte-level piece of text. A token id is the integer assigned to that piece.

Try changing the text below. The exact ids are toy ids, but the shape is real: text becomes pieces, pieces become numbers.

Prompt text

Tell id 32082 #0

space id 30159 #1

me id 6039 #2

space id 30159 #3

why id 47609 #4

space id 30159 #5

GPUs id 29054 #6

space id 30159 #7

make id 46465 #8

space id 30159 #9

LLMs id 36849 #10

space id 30159 #11

fast id 4447 #12

. id 45873 #13

Why this matters: context size, waiting time, memory use, and billing all count tokens. A long word, code identifier, emoji, or pasted log can cost more token positions than your eyes expect.

Why not just use words?

Because human text is messy. Names appear. Code appears. Typos appear. New slang appears. Many languages mix scripts. A fixed word dictionary would constantly fail. Subword tokenization is the compromise: common words can be single tokens, while rare words are built from smaller pieces.

After tokenization, the prompt looks conceptually like this:

[128000, 9125, 374, 279, 2768, 315, ...]

Those ids do not contain meaning by themselves. They are more like library card numbers. Token id 9125 tells the model which row to pull from a learned table.

That first pull is an embedding lookup. The row it pulls is a vector: a list of numbers. Put one vector row beside every token position, and the prompt stops being prose. It becomes a table.

token text refund

token id 34952

embedding row 4,096 numbers

prompt matrix 8,192 x 4,096

Hidden width 4,096 Prompt tokens 8,192 Bytes per number 2

One embedding row 8.0 KiB

Prompt matrix 64.0 MiB

128k-row table 0.98 GiB

This is the first clean crossing:

Human text has become a rectangular block of numbers.

The tokenizer also explains many surprises. A short-looking code identifier can become many tokens. A compact log line can become expensive. A prompt injection hidden in retrieved text can consume enough token budget to push useful evidence out.

Merge steps 6/12

Current rule mode + l -> model Common words and spaces merge into larger pieces.

The id 6009 pos 0

space id 1032 pos 1

model id 77737 pos 2

space id 1032 pos 3

a id 1097 pos 4

n id 1110 pos 5

s id 1115 pos 6

w id 1119 pos 7

e id 1101 pos 8

r id 1114 pos 9

s id 1115 pos 10

Initial pieces 17

Current token count 11

Context cost 11 positions

Static fallback: BPE-style tokenizers repeatedly merge common adjacent pieces. The final pieces are mapped to fixed integer ids that must match the model's embedding table.

Tokens are not a footnote. They are the accounting unit for cost, context, waiting time, and the later memory the model must keep while generating.

Short Text Can Be Long to a Model

Humans count words.

LLM services count tokens.

That difference creates small surprises. A normal English sentence may tokenize compactly. A stack trace may explode. A long code identifier may split into many pieces. Random-looking base64, minified JSON, logs, mixed scripts, or pasted tables can look short on screen and still become expensive inside the model.

This matters before any clever reasoning begins.

Each extra token is another embedding lookup. During prefill, it is another row that must pass through the transformer. During attention, it may be another position to compare with earlier positions. During generation, it may become another key/value entry remembered by the cache.

So tokenization is upstream of almost every cost:

Human surprise	Machine consequence
”This pasted log is only a page”	many token positions
”This identifier is one word”	many subword pieces
”This JSON is compact”	dense syntax still spends tokens
”This retrieval chunk is short”	may crowd out other evidence

The tokenizer is also a compatibility boundary. A model trained with one tokenizer cannot simply be served with another. The ids would point at the wrong learned rows. The arithmetic might still run, but it would be reading the wrong shelves.

So our rainbow question is no longer a sentence. It is a row of shelf numbers.

But the shelves only matter because someone filled them earlier.

Measurement	Example value
GPT-3 largest vocabulary	50,257 BPE tokens
GPT-3 largest hidden width	12,288 numbers per token vector
GPT-3 largest depth	96 layers
Prompt length	counted in tokens, not words

Now we have rows of numbers.

But rows of numbers cannot answer anything unless the learned numbers are already waiting somewhere.

The Learned Numbers Were Already Waiting

The most expensive part of your answer happened before you arrived.

Months earlier, a training run took huge piles of text and played one game again and again:

Given the pieces so far, what piece should come next?

Start with one sentence:

Paris is the capital of France.

The model sees “Paris is the capital of” and guesses the next token. If it gives “France” a low score, the training program measures that miss with a number called loss. Then it works backward to discover which learned numbers should move.

A weight is one learned number. A gradient is a direction signal for a weight: move this number up a little, or down a little, if you want the loss to shrink. Backpropagation is the bookkeeping walk that carries blame from the wrong final score back through the operations that contributed to it. An optimizer turns those directions into careful nudges.

Raw score 1.2 Nudge size 0.20 Workers 8

piece raw score belief share direction

Paris 1.20 49.6% -0.504

London 0.70 30.1% 0.301

blue -0.40 10.0% 0.100

quickly -0.80 6.7% 0.067

unsafe -1.40 3.7% 0.037

Target score after toy update 1.301

Workers sharing directions 8 workers agree

Toy update message 80 bytes

Static fallback: a training position produces raw scores, turns them into belief shares, measures loss, sends update directions backward, shares those directions across workers, and then nudges the weights.

One sentence barely matters. Training is that tiny correction repeated across enormous mixtures of text and enormous numbers of weights. The point is not to memorize one line. The point is to make many shared numbers better at the next-token game across many contexts.

The first trained model is usually not yet a good assistant. It may continue text well, but a chat product wants more: answer questions, follow instructions, refuse harmful requests, use tools, cite sources when asked. So a second family of training steps shapes behavior after base training.

1 Choose examples questions, tasks, or conversations

2 Produce answers human labels or sampled model answers

3 Score signal reference, preference, reward, or verifier

4 Compute loss turn scores into pressure on weights

5 Update policy change the assistant for future serving

Examples 512 Sampled answers per example 8 Avg answer pieces 600

Training signal score each answer relative to its group

Items scored 4,096 groups of sampled answers

Generated answer pieces 2,457,600 zero here means labels or pairs already exist

Weight update group-based preference update

Static fallback: one post-training path learns from labeled answers, one learns from human preferences, one learns from chosen/rejected pairs, and one compares groups of sampled answers.

In post-training, examples or preferences become a signal. The signal becomes a loss. The loss becomes another weight update. The updated snapshot then has to pass evaluations before it can serve users.

Training step 3/6

1 example A training example asks for helpful behavior, tool use, refusal, math, or citation.

2 answers The current model or humans produce candidate answers.

3 signal A target, preference, reward, verifier, or principle turns behavior into supervision.

4 loss Training code converts that signal into a number to minimize.

5 update Gradients nudge a copy of the model weights.

6 evaluation gate Offline checks, adversarial cases, and small test releases decide whether this snapshot can serve users.

Supervision signal preferred answer beats rejected answer

What the loss tries to do raise the preferred answer relative to the rejected answer

Failure to test depends on pair quality and reference behavior

Static fallback: post-training turns examples and candidate answers into targets, preferences, rewards, verifier results, or principles. Those signals become losses, losses become weight updates, and eval gates decide whether a snapshot is deployable.

The names can stay plain:

Name	Plain shape
Supervised fine-tuning	imitate good instruction-answer examples
RLHF	learn from human preferences, then update toward preferred answers
DPO	train directly from preferred-versus-rejected answer pairs
Rollout	a sampled answer used as training material, not a final user answer

Different labs use different recipes. The steady rule is simpler: when you press Enter, you are not training the model. You are asking a frozen snapshot to run forward using your current prompt.

Your chat may produce logs or feedback for future training. But the answer you are watching comes from weights already loaded for this request.

Quantity	Public example
GPT-3 largest model	175 billion parameters
The Pile public corpus	825 GiB of English text from 22 subsets
Transformer paper training run	3.5 days on 8 GPUs for WMT14 English-to-French
InstructGPT preference result	1.3B model preferred over 175B GPT-3 on the paper’s prompt distribution

The Snapshot Carries Old Scars

Frozen does not mean perfect.

It means the mistakes are frozen too.

If training data over-represented a bad pattern, the weights may carry that habit. If duplicated text leaked into the mixture, the model may repeat memorized fragments more often. If post-training rewarded confident answers too strongly, the model may learn to sound certain when the evidence is weak. If safety examples were too broad, it may refuse harmless requests. If they were too narrow, it may under-refuse dangerous ones.

Serving cannot repair those weights on the fly. It can add context, use tools, choose safer decoding rules, or route to a different snapshot. But it cannot make the frozen weights know a fact they never learned and never receive.

This is why hallucinations have to be diagnosed by layer.

Some are training failures: the learned habit is bad.

Some are prompt failures: the needed evidence was not in context.

Some are retrieval failures: the wrong chunk was supplied.

Some are decoding failures: the right evidence was visible, but the next-token path drifted.

Some are product-loop failures: a tool result was ignored, a citation was trusted too easily, or a policy gate fired at the wrong time.

The symptom is one sentence on your screen. The owner may be anywhere in the journey.

The Second Training Can Also Hurt

Post-training makes a base model behave more like an assistant.

But every shaping step has a cost.

A domain fine-tune can make a model better at one domain and worse outside it. The human version would be practicing only legal contracts for weeks and then becoming clumsy at casual email. The model version is called catastrophic forgetting: extra training improves one behavior while damaging behavior the base model already had.

Teams fight this by mixing old and new examples, making smaller updates, freezing parts of the model, and testing old skills before release.

Another tool is distillation.

In distillation, a smaller or specialized model learns from answers produced by a stronger model. The important shape is teacher and student. Instead of learning only from raw human-written text, the student learns from model-generated targets that carry some of the teacher’s behavior.

There are compact adaptation methods too. LoRA trains small extra matrices rather than rewriting every weight. QLoRA combines that idea with quantized base weights so adaptation can fit on smaller hardware. Those details matter when teams want a model to learn a style, task, or domain without paying the cost of a full training run.

Names multiply quickly: SFT, RLHF, DPO, GRPO, LoRA, QLoRA, distillation.

The clock-post version is simpler:

What signal changes future answers, and which learned numbers are allowed to move?

But a snapshot on disk is not yet a running model.

Files have to become a live machine.

A Model File Is Not a Model Yet

Before your request arrives, some worker has to load a model artifact.

An artifact is the bundle needed for serving: weights, tokenizer rules, chat format, safety settings, number formats, split plan, and configuration files. If those pieces disagree, the math can run while the meaning is broken.

The live copy is called a replica. A replica is not another mind. It is one loaded running copy of the same snapshot, ready to accept work.

Warmup step 5/7

1 alias resolve chat-model-large -> snapshot 2026-05-28-a

2 split plan this worker owns one slice of each large table

3 recipe check table names, shapes, number formats, text-piece count

4 read pieces storage -> ordinary memory through local disk or network path

5 copy to GPU memory ordinary memory -> GPU memory for weights and workspace

6 warm runtime prepare GPU programs, communication setup, and remembered-token pool

7 register healthy router can send real sequences to this replica group

Current artifact copy to GPU memory

Check being enforced enough GPU memory after weights, scratch, and remembered-token room

Routing answer not eligible

Static fallback: a large model replica must choose the right snapshot, load the correct model pieces, validate the text-splitting rule and table shapes, copy weights into GPU memory, reserve room for remembered tokens, warm GPU programs, and register healthy before traffic reaches it.

Loading a large replica is not like starting a small web server. The worker checks files, verifies checksums, reads weight shards, copies bytes into CPU memory, moves them into GPU memory, warms GPU programs, allocates workspace, reserves room for future tokens, and only then announces that it can serve.

A shard is one stored piece of a larger model. A tensor is an organized block of numbers with a shape. The shape matters because GPU programs expect exact dimensions. If a worker loads the wrong block, the next operation may not know how to multiply it.

Warmup step 3/5 Parameters 70B GPU workers 4 Bytes per weight 2

recipe read model recipe, text-splitting rule, chat format, split plan wrong text-piece count or unsupported setting

pieces check table names, shapes, number formats, fingerprints missing model piece or shape mismatch

arrange repack learned numbers for GPU programs number format not supported

GPU memory copy learned numbers into GPU memory and save workspace not enough room for remembered tokens

ready run test work and mark replica healthy cold route delays first answer piece

Total raw weights 130.4 GiB

Per-worker piece 32.6 GiB

Serving invariant all workers ready or no route

Static fallback: a model snapshot becomes live only after recipe checks, model-piece verification, layout packing, GPU-memory loading, workspace allocation, and health registration.

Think of a stage set arriving in crates. One crate has the wall. One has the lights. One has the floor. Opening a crate does not give you a play. The crew has to assemble the set, wire the lights, test the rig, and only then let the audience in.

That is model warmup.

The Weights Are Only the First Rent

Parameters 70B Weight bytes 2 Layers 80 KV heads 8 Concurrent sequences 16 Context tokens 8,192 GPU memory 80 GiB

weights

runtime

Weights 130.4 GiB

KV cache 40.0 GiB

Runtime overhead 10.4 GiB

Total 180.8 GiB

This is a teaching estimate, not a capacity planner. Real systems add split model pieces, temporary workspace, warmed GPU-program overhead, leftover gaps, smaller-number details, add-on weights, and safety margins.

Static fallback: serving memory is dominated by model weights plus KV cache. Weights scale with parameter count and precision; KV cache scales with concurrent sequences, context length, layers, KV heads, head dimension, and bytes per value.

Now comes the second surprise.

“The weights fit” is not the same as “the service works.”

A GPU must hold its share of weights, but also temporary workspace, communication buffers, allocator gaps, safety margin, and the growing memory for active conversations. The model file is the first rent. Every open stream keeps paying.

Parameters 70B Weight bytes 2 TP GPUs 4 Active sequences 24 Context tokens 8,192 GPU memory 80 GiB

GPU 1 68.1 / 80 GiB

weights KV runtime margin

GPU 2 68.1 / 80 GiB

weights KV runtime margin

GPU 3 68.1 / 80 GiB

weights KV runtime margin

GPU 4 68.1 / 80 GiB

weights KV runtime margin

Weight shard 32.6 GiB

KV shard 15.0 GiB

Runtime + comm 12.1 GiB

Free per GPU 11.9 GiB

Teaching model: KV cache is shown as evenly sharded by tensor-parallel degree. Real placement depends on the attention layout, runtime, quantization, allocator fragmentation, and the exact parallelism plan.

Static fallback: a model replica must fit weights, KV cache, workspace, communication buffers, and safety margin on every tensor-parallel GPU. Fitting raw weights alone is not enough.

If one GPU in a group runs out of room, the whole running copy is in trouble. A deployment plan is therefore not just “70B on four GPUs.” It is a budget for what every GPU must hold at the instant the next decode step launches.

Quantity	What it means
175B parameters at 16-bit	about 350 billion bytes before overhead
70B parameters at 16-bit	about 140 billion bytes before overhead
80 GB GPU	not enough for a raw 70B 16-bit model plus serving memory
KV cache	grows with active tokens and active conversations

The Contract Can Break Before Math Starts

Several failures at this layer look like “the model is down” even though no model math has run.

The tokenizer file can mismatch the embedding table. The chat template can wrap messages differently from the format used in post-training. The split plan can expect four GPUs while the runtime has eight. A shard can fail checksum. A runtime can load the weights but not support the requested smaller-number format. A cold replica can be healthy but not yet warm, so routing to it creates a long pause before the first token.

Some failures are worse because they do not crash.

If the tokenizer and embedding table disagree, token id 128006 may point at the wrong row. The matrix multiply can still run. The output can still look like language. But the meaning is poisoned at the boundary.

So a production artifact is not “a file with intelligence inside.”

It is a contract:

Piece	Must agree with
Weights	layer count, tensor names, shapes, number formats
Tokenizer	embedding rows and special-token ids
Chat template	post-training format and tool-call format
Split plan	GPU count, shard ownership, communication steps
Runtime kernels	precision, layout, attention variant, cache layout
Evaluation record	behavior that is safe enough to route traffic

If the contract is wrong, the first visible symptom may be a timeout, a refusal spike, nonsense output, or just “the new model feels worse.”

Now the model is warm.

Your prompt can finally enter the stack.

The Model Reads the Whole Prompt Once

The prompt is now a table: one row per token position, many numbers per row.

A transformer layer edits that table. It takes in one row per token and returns one row per token. The shape stays the same, but the numbers inside the rows change.

Stack many layers and each token row becomes more informed by its context.

The famous step is attention.

Take the phrase:

river bank
central bank

The word “bank” needs earlier words to make sense. Near “river,” it should lean toward land beside water. Near “central,” it should lean toward finance.

Attention is the mechanism that lets one token position look back at earlier positions and ask: which of you matters to me right now?

1 Embeddings token id -> vector

2 Attention compare queries with keys

3 MLP transform each position

4 Logits vector -> vocabulary scores

5 Sampler scores -> one next token

Prompt tokens 2,048 Hidden width 8,192 Layers 80

Attention score cells in prefill 4,194,304 tokens × tokens before kernels optimize memory traffic

One activation matrix 0.03 GiB tokens × hidden × 2 bytes

KV cache after prefill 0.63 GiB tokens × layers × K/V × KV heads × head dim

Each position creates three learned views of itself:

Name	Plain meaning
Query	what this position is looking for
Key	what an earlier position offers for matching
Value	the information copied if the match is useful

The query compares with keys. The comparison gives strengths. The strengths mix values. The mixed result flows back into the current token row.

If that sounds abstract, keep the library picture. The query is what you search for. The keys are labels on cards. The values are the records you actually read.

After attention, another part of the layer edits each position privately. This is often called an MLP or feed-forward network. If attention lets positions talk to each other, the MLP lets each position digest what it has received.

The layer does not replace the stream. It edits it and adds the edit back. Those add-back paths are residual connections. They help many layers accumulate useful changes without every layer having to rebuild the whole representation from scratch.

At the final layer, the model turns the last token’s row into one raw score for every possible next token. These raw scores are logits.

The model still has not chosen a word.

It has created a score landscape.

Quantity	Scaling behavior
Attention score cells for `n` prompt tokens	`n x n` per head before optimized tiling
2,048-token prompt attention cells	4,194,304 per head
8,192-token prompt attention cells	67,108,864 per head
GPT-3 output logits	50,257 raw scores

The Symbols Are Just the Same Story Tighter

You do not need the symbols to keep reading.

But the symbols are useful because they show where the pain comes from.

Let the hidden width be H, the number of token positions be T, and the number of attention heads be A. The layer starts with something shaped like this:

X: [batch, T, H]

The model multiplies X by learned tables to make queries, keys, and values:

Q = X Wq
K = X Wk
V = X Wv

Then it reshapes them so several attention heads can run side by side:

[batch, A, T, head_dim]

Each head compares positions with positions. That makes a T x T score table before optimized kernels avoid materializing the giant table in the naive way.

Double the prompt length, and the simple score table wants four times as many cells.

That is why long context is not just “more words.” It changes the size of the comparison problem.

The compact attention story is:

compare queries with keys
block future positions
turn scores into mix weights
mix the values

The causal mask is the “block future positions” rule. While generating token 42, the model may use tokens 1 through 41. It may not use token 43, because token 43 does not exist yet.

The Private Editor Matters Too

Attention gets the fame.

The MLP does a lot of the work.

If attention is how positions talk to each other, the MLP is how each position changes itself after hearing the room. It applies learned table operations to each token row independently. A gate can let some intermediate features through strongly and damp others.

That private edit is computationally large. In many transformer layers, the MLP uses big matrix multiplies that dominate arithmetic work. It is less intuitive than attention because it does not have the friendly “this word looks at that word” story. But it is where a lot of per-token transformation happens.

Some models replace one big MLP with a mixture of experts. Picture a workshop with many specialist benches. For each token, a small router chooses a few benches. Only those experts run, so the model can contain many weights without using all of them on every token.

That saves compute per token, but creates a new systems problem. Tokens have to be sent to the GPUs that hold their chosen experts. If too many tokens choose the same expert, that expert becomes a bottleneck. If experts live across machines, the model has created an all-to-all communication problem inside the layer.

Again the pattern repeats:

A clever model trick becomes a scheduling and communication trick when it serves real traffic.

The prompt has now been read once.

But the answer is not one token long.

If the model had to reread the whole past from scratch every time, generation would crawl.

So it remembers the useful parts.

The Past Becomes a Cache

Imagine a 500-token answer.

For the first answer token, the model reads your whole prompt. For the second, it needs the prompt plus the first answer token. For the third, it needs the prompt plus the first two answer tokens.

The earlier work does not change.

So serving systems keep a KV cache. KV means key/value. The plain idea is simpler: while reading earlier tokens, each layer creates reusable notes. During generation, the model can look back at those notes instead of recomputing the whole past.

Target token: server

For "server"

When

many

people

chat

once

the

server

batches

tokens

Context tokens 4,096 Layers 80 KV heads 8

Approx KV cache for one sequence 1.25 GiB context × layers × 2 × KV heads × head dim × bytes

Static fallback: causal attention lets each generated token read earlier tokens only. The server stores prior keys and values per layer, so decode can append one new cache entry instead of recomputing the whole prompt.

The first phase is prefill. That is the big read of the whole prompt. It fills the cache.

The second phase is decode. That is the one-token loop. Generate one token, add its new notes to the cache, then use the enlarged cache to generate the next token.

Prompt tokens 4,096 Generated tokens 512 Active sequences 24 Layers 80 KV heads 8

tokens x layers x 2 x kv_heads x head_dim x bytes The 2 is one key tensor plus one value tensor per layer.

Tokens per sequence 4,608

KV per sequence 1.41 GiB

Logical 16-token pages 6,912

Approx prompt-token recompute avoided 2,093,056

Static fallback: KV-cache memory scales with tokens, layers, key/value tensors, KV heads, head dimension, bytes per value, and active sequences. It saves repeated prefill work by storing past keys and values, but it makes concurrency memory-bound.

This is why two speed numbers matter:

User-visible thing	What it includes
Time to first token	queueing, routing, tokenization, and prefill
Tokens per second	repeated decode steps after the first token

The cache saves work, but it spends memory. Every active conversation carries remembered state until it finishes or is cancelled. Long prompts, long answers, and many simultaneous users all push on the same memory budget.

If KV memory fills, the problem starts to look less like “AI” and more like operating systems. The server may queue, reject, swap, evict, recompute, or route elsewhere. You can have arithmetic capacity left and still be blocked by memory.

Factor	Effect
KV tensors per layer	2: keys and values
KV memory vs context length	grows linearly with active tokens
KV memory vs concurrency	grows linearly with active sequences
vLLM reported throughput gain	2-4x over FasterTransformer and ORCA in evaluated workloads

The First Read and the Later Ticks Want Different Machines

Prefill and decode are not the same kind of work.

Prefill reads the whole prompt. It likes big chunks of work and fills the cache for many positions at once.

Decode advances one token at a time. It likes steady, low-latency turns. Each turn is smaller, but it must happen again and again, and token 257 cannot exist until token 256 has been chosen.

Some serving designs separate those phases. One pool specializes in prompt reading. Another specializes in decode. That can improve hardware use, but it creates a new problem:

How does the remembered state get from the prefill world to the decode world?

The KV cache may need to move, be shared, be rebuilt, compressed, restored, or spilled. What looked like “saved work” becomes data with a location, lifetime, owner, and transfer cost.

This is why long-context serving keeps producing new runtime designs. The difficult object is not only the model weight file. It is the active conversation state.

The Cache Has a Lifetime

A cache entry is not just “memory used by the model.”

It has an owner.

It belongs to a sequence. It lives on a device or in a memory block. It may be paged, reused, evicted, swapped, compressed, or freed. If the user cancels, it should die. If the stream finishes, it should die. If the request is preempted, the system has to decide whether to keep it, move it, or recompute it later.

This is why KV-cache allocators look a little like operating systems. They manage blocks. They fight fragmentation. They track which sequence owns which pages. They try to keep enough free space for the next request without wasting too much memory on gaps.

There is also a positive trick: prefix caching.

If many requests begin with the same prefix, the service may avoid rereading that shared beginning. System prompts, tool definitions, or repeated application scaffolding can create common prefixes. Save that work once, and later requests can start from the cached state.

But prefix caching has its own rules. The prefix must match exactly enough. The cache must still be valid for the same model, tokenizer, and prompt format. A tiny hidden instruction change can ruin the match.

The object keeps getting more concrete:

The past is not an idea. It is blocks of numbers with addresses and owners.

The cache explains the memory pressure.

Now look lower. Those blocks still have to be read by real hardware.

Where are those numbers physically moving?

The Numbers Touch Metal

Inside the GPU, the model is no longer language.

It is number tables moving between kinds of memory and arithmetic circuits.

Some time is spent doing math. GPUs contain many arithmetic units, including tensor cores, circuits built to multiply small number tables quickly.

But a number cannot be multiplied until it arrives.

That is the quieter half of the story.

GPU HBM is high-bandwidth memory packaged close to the chip. It can hold large streams of numbers. SRAM sits on the chip itself. It is faster, but much smaller. Registers are smaller still, and closer to the arithmetic.

Far and large. Close and tiny.

An LLM kernel is often a choreography for moving the right numbers from the large place into the fast place just before the arithmetic needs them.

Tile M 64 Tile N 128 Tile K 128 Bytes per value 2

Math work 2 x M x N x K 2.10 MFLOPs

Minimum traffic A tile + B tile + output 64.0 KiB

Arithmetic intensity 32.0 FLOPs/byte ridge point: 298.5

107 TFLOP/s attainable under this toy roofline

Trace rule: a faster advertised FLOP number helps only after the kernel keeps enough data on chip and feeds the tensor cores.

Static fallback: matrix multiplication cost is about 2*M*N*K FLOPs, but runtime also depends on bytes read and written. Low arithmetic intensity makes a kernel memory-bound.

Trace one attention tile. A tile is a small block of a larger table. A block of GPU threads pulls a slice of cached keys from HBM into faster on-chip storage. Tensor cores multiply query tiles by key tiles. The kernel keeps a running summary, streams the matching value tile, mixes it into the output, and moves on.

If the tile is chosen well, the chip reuses nearby numbers and avoids writing a giant temporary table back to HBM. If the tile is chosen poorly, the chip spends its time fetching and storing instead of multiplying.

This is why the same attention math can be slow or fast depending on the memory path.

The physical bottom is not “electricity” in the abstract. It is charge stored in memory cells, signals crossing package traces, transistor gates switching, SRAM banks feeding arithmetic units, network cards moving packets, and cooling systems carrying heat away so the chips can keep their clocks.

The model math and the data center have become the same story.

Quantity	Public example
CUDA warp	32 threads
H100 SXM HBM bandwidth	3.35 TB/s
H100 NVLink bandwidth	up to 900 GB/s bidirectional per GPU in Hopper systems
Common inference number formats	16-bit, 8-bit, or 4-bit values

Smaller Numbers Change the Path

A weight stored in 16 bits takes twice as much memory as a weight stored in 8 bits.

That makes quantization tempting.

Quantization is the craft of storing numbers with fewer bits while trying to preserve behavior. It is like rounding measurements on a map. Round gently and you save little. Round aggressively and roads no longer line up.

There are several places to round.

Weight-only quantization shrinks the fixed model weights. That helps fit and bandwidth. Quantizing intermediate activations can speed or shrink temporary work, but mistakes there can be more visible because those numbers are being actively transformed. KV-cache quantization attacks the memory that grows with context and concurrency.

The important point is that quantization is not only a file-size trick.

It changes which kernels run. It changes memory bandwidth. It changes cache layout. It can change quality failures. A 4-bit model that fits may still be worse than an 8-bit model that barely misses the target. A quantized KV cache may save active memory while introducing subtle degradation on long contexts.

The practical question is not “can the model be smaller?”

It is:

Which numbers can be made smaller without breaking the behavior users care about?

One GPU may not be enough.

So the model is split.

And splitting creates a new problem: the pieces have to talk.

More GPUs Means More Waiting Lines

If a model is too large for one GPU, the service has a few levers.

It can store numbers with fewer bits. That is quantization: a careful rounding of model numbers to save memory while trying to preserve behavior.

It can split one big matrix operation across GPUs. That is tensor parallelism.

It can put different layers on different GPUs. That is pipeline parallelism.

It can run more full copies so different requests go to different replicas. That is data parallelism.

Parameters 70B

Weight precision

2 bytes/weight estimate

Tensor-parallel ranks 4 GPU memory 80 GiB Active sequences 48 Context tokens 8,192

weights

runtime

Weight shard/rank 32.6 GiB

KV/rank estimate 30.0 GiB

Runtime reserve 5.0 GiB

Headroom/rank 12.4 GiB

Teaching estimate: real deployments also depend on FP8 formats, group scales, quantization metadata, activation precision, KV-cache layout, allocator behavior, and kernel support.

The trap is that “use more GPUs” is not a complete answer.

More GPUs add memory and arithmetic, but they also add communication. If GPU 0 computes one slice of a result and GPU 1 computes another slice, the next operation cannot pretend either slice is complete.

The chips need a coordinated exchange.

Tensor-parallel ranks 4 Hidden width 8,192 Tokens in microbatch 16 Trace step 4/5

rank 0 2,048 cols local shard

rank 1 2,048 cols local shard

rank 2 2,048 cols straggler risk

rank 3 2,048 cols local shard

rank 4 unused not in group

rank 5 unused not in group

rank 6 unused not in group

rank 7 unused not in group

Per-rank weight slice hidden x 2,048

Activation payload 0.25 MiB

Approx collective traffic 0.38 MiB

Invariant next op waits for all ranks

Static fallback: tensor parallelism splits a matrix across ranks. Local matrix multiplies produce partial tensors, then collectives assemble or reduce them before the next operation can continue.

A collective operation is a group communication step. In an all-reduce, every GPU contributes partial values and every GPU receives the combined result. In an all-gather, every GPU contributes a shard and every GPU receives the assembled tensor.

That collective is a waiting point. The fastest GPU waits for the slowest one.

GPU 0 shard 0

GPU 1 shard 1

GPU 2 shard 2

GPU 3 shard 3

GPU 4 shard 4

GPU 5 shard 5

GPU 6 shard 6

GPU 7 shard 7

GPUs participating 8 Payload per sync 64 MiB Link bandwidth 300 GB/s

What crosses partial activations

Synchronization shape all-reduce / all-gather

Where the wait appears inside many layers

Bandwidth lower bound 0.365 ms before software, topology, and queueing overhead

one matrix is split across GPUs. The lower bound is deliberately optimistic: real systems also pay kernel launch, routing, topology, congestion, and straggler costs.

Static fallback: tensor parallelism synchronizes partial layer results, pipeline parallelism sends activations between stages, data parallel training all-reduces gradients, and expert parallelism all-to-all routes tokens to selected experts.

Now the interconnect is part of the model. NVLink, NVSwitch, InfiniBand, or high-speed Ethernet are not just infrastructure around the answer. They can be inside the path of each generated token.

Parallelism trades one pain for another: less pressure on local memory, more pressure on communication and synchronization.

Quantity	Public example
HGX H100 GPUs per baseboard	8
H100 NVLink bandwidth in HGX H100	900 GB/s bidirectional per GPU
Common tensor-parallel collective	all-reduce or all-gather
Common expert-parallel communication	all-to-all token dispatch

The model has now produced raw scores.

The answer still has not been written.

It has to choose one next piece.

The Raffle for the Next Piece

At the end of the stack, the model has one raw score for every token in its vocabulary.

The simplest rule is greedy decoding: pick the highest-scoring token.

That is easy, but it can make writing brittle or repetitive. Chat systems often sample instead.

Sampling means choosing randomly, but not equally. Imagine a raffle. Better tokens get more tickets. After “Paris is the capital of,” the token for “France” gets many tickets. “Banana” gets almost none.

The raw scores are not ticket counts yet. They can be negative, huge, or tiny. They do not add up to anything useful. A function called softmax turns raw scores into probabilities: ticket shares that add up to 1.

Temperature 0.8 Top-p 0.90 Seed 42 mask unsafe token

token raw score ticket share in top-p

the 3.2 49.8% yes

answer 2.7 26.7% yes

is 2.2 14.3% yes

probably 1.4 5.2% no

because 1.1 3.6% no

banana -0.6 0.4% no

unsafe masked 0.0% no

Loop invariant: after this one token is chosen, it is appended to the context, written into KV cache, streamed if allowed, and the whole model runs again for the next token.

Static fallback: raw scores are adjusted by temperature, turned into ticket shares, trimmed by top-p or masks, sampled, appended to the context, and used in the next decode step.

Then the product can reshape the raffle:

Control	Plain effect
Temperature	sharpens or flattens the ticket spread
Top-p	keeps the smallest group of likely tokens whose probabilities reach a threshold
Repetition penalty	removes tickets from tired repeats
Masks	remove invalid or unsafe tokens entirely

Then one token is drawn.

That token is appended to the sequence. The KV cache gets one new set of notes. The model runs again. Another raffle. Another token. Another stream event.

That last paragraph is the loop.

Trace position

Current layer serving runtime

Tensor shape active batch: [B sequences]

Memory movement checks queue, KV blocks, priority, cancellation

Synchronization none yet

your sequence is placed beside other sequences for one token step

Active sequences 24 Context tokens 8,192 Hidden width 8,192 Tensor-parallel GPUs 4

Activation per layer 384 KiB [B, 1, H] at 2 bytes

KV read per layer 768.0 MiB past K/V for active batch

KV write per token step 7.5 MiB new K/V across all layers

Hidden per TP shard 2,048 idealized shard width

Logits buffer 5.9 MiB assuming 128k vocabulary

Static fallback: one generated token moves through scheduler admission, last-token activation, attention projections, KV-cache reads, collectives, MLP, final logits, sampling, KV append, detokenization, and network streaming.

One visible token is not one operation. It is a scheduler decision, a batch slot, dozens of layer passes, memory reads, matrix-multiply tiles, possible GPU-to-GPU exchanges, a logits vector, a sampling choice, a cache append, a detokenized fragment, and a network flush.

Reasoning models add another wrinkle. They may spend extra serving-time compute before the final answer: hidden scratch tokens, multiple sampled attempts, verifier passes, code execution, search, or tool calls. The product may show none of that internal work and stream only the final answer.

1 Allocate budget how much compute can this question spend?

2 Generate work reasoning tokens, candidate answers, or tool plans

3 Check work verifier, reward model, tests, citations, or policy

4 Choose output return one answer, not all internal work

Candidate attempts 1 Hidden / scratch tokens 1,800 Visible answer tokens 450 Verifier passes 0 Tool calls 0

Mode shape extra inference-time tokens are spent before the answer

Total token work 2,250 generated plus verifier estimate

Hidden share 80% work the user may not see directly

Latency pressure 5.0x rough multiplier versus a 450-token direct answer

Static fallback: direct decoding spends one answer stream; reasoning models spend extra inference-time tokens; best-of-N samples multiple candidates and verifies them; tool loops add external-call latency and extra model passes.

So “reasoning” in the chat app often means more one-token loops, more verification, more queue occupancy, and more memory pressure in the machinery underneath.

The Small Model Can Run Ahead

There is a speed trick that sounds like cheating.

Let a small model guess the next few tokens.

Then let the large model check them.

This is speculative decoding. The small draft model proposes a short future. The large target model verifies several proposed tokens in parallel. If the draft was right, the system accepts multiple tokens for roughly one expensive target pass. If the draft was wrong, the system falls back without changing the target model’s intended probability distribution, depending on the exact algorithm.

The user sees faster text.

The system sees a bet.

If the draft model is often right, the bet pays. If it is often wrong, the checking overhead may not help. The trick works best when the small model is cheap enough, aligned enough with the large model, and the serving stack can schedule the draft and verify work without creating new bottlenecks.

This is the same theme again: the math idea is only half the story. The serving win depends on the queue, memory, kernels, and batch shape around it.

Reasoning Spends Invisible Tokens

A normal chat request is already variable length.

A reasoning request is variable length twice.

The final answer length varies. The hidden or auxiliary work varies too.

The extra work might be hidden scratch tokens, visible reasoning-like text, multiple sampled solution attempts, verifier passes, code execution, search, or tool calls. The product may show only the final answer, but the scheduler still has to reserve room for the work that produced it.

That changes the front door, the cache, and the queue.

The front door must estimate a larger promise. The cache may hold hidden state the user never sees. The scheduler may have to protect ordinary streams from a few long reasoning jobs. The stream gate may hold the final answer until checks finish.

“Think harder” is not free.

At the bottom, it means more iterations, more memory lifetime, more tool time, and more ways for a request to be interrupted.

The next piece exists now.

But it is still an id.

It has to become text without breaking the stream.

The Stream Comes Back Alive

The serving worker detokenizes the chosen token id back into bytes or text fragments. The service frames the fragment as a stream event. The edge forwards it over the open connection. The browser decodes bytes, updates UI state, and paints.

That is why you can see half a sentence.

The server is not waiting for the whole answer. It is releasing pieces as soon as product rules allow.

But release is not always “print immediately.” Some tokens are word fragments. Some byte fragments do not form a valid UTF-8 character until the next piece arrives. A JSON tool call may be nonsense until its braces close. A citation marker may need source metadata. A safety check may hold a risky phrase until neighboring text makes it clear.

Generated fragments 1/4

Detokenizer buffer The bytes form ordinary text

pass UTF-8/text

pass tool syntax

pass policy

pass client open

Release decision release now

Backend action flush stream event and schedule the next decode iteration

token id selected

detokenizer buffer

stream frame

browser paint

reverse cancellation

Static fallback: streamed output is not printed blindly. It passes through detokenization, valid text buffering, tool/citation/policy gates, stream framing, browser rendering, and cancellation cleanup.

Release step 3/5

Current token fragment answer

1 detokenize fragment

2 merge into text buffer

3 frame as stream event

4 browser appends text

5 paint

Static fallback: generated token ids are detokenized, buffered until valid text or structures exist, checked against citation and safety gates, framed as stream events, decoded by the browser, and painted.

Streaming is a small protocol inside the product: buffer when needed, flush when safe, and keep enough state to stop cleanly.

Cancellation is the reverse journey. You hit stop. The browser closes or aborts. The edge marks the stream dead. The serving frontend marks the sequence cancelled. The scheduler removes it from future iterations. The KV allocator releases its blocks.

If a GPU kernel is already running, that iteration may finish. But the next one should not include work nobody will read.

The one-user loop is complete.

But the service is not serving one user.

Everyone Else Is in the Loop Too

Your request is one colored line in a giant moving map.

One user wants low latency: little waiting for their own answer. The provider wants high throughput: many useful tokens served per second across everyone.

Those goals fight.

Think of a bus. Leaving immediately is great for the first passenger, but wastes seats. Waiting until the bus is full uses the bus well, but makes the first passenger wait. A GPU has the same tension.

For one request, decode generates one token at a time. That is too little work to keep a large GPU busy. The trick is to generate the next token for many requests in the same model step.

Time slice 0

Active GPU batch 1 requests

Completed 0/5

Queued 0

Utilization hint low

prefill

not arrived

prefill: read prompt decode: one token per iteration arrival 0 decode token-iterations emitted so far

Static fallback: continuous batching works at the iteration level. Finished sequences leave a running batch, and newly arrived sequences can enter the next decoding step instead of waiting for every original batch-mate to finish.

People do not arrive in neat groups. One prompt is short. One is huge. One answer ends after ten tokens. Another keeps going for a thousand. A fixed batch would quickly fill with holes.

So modern serving systems rebuild the batch while work is already moving. That is continuous batching, also called in-flight batching.

Time slice 6 KV-page budget 88 Token budget 24 cancel seq-1842 after slice 5

KV pages in launch 67/88

Token work in launch 18/24

Waiting records 1

Decision protect running decodes; leave low-priority work queued

sequence phase KV pages priority tenant scheduler result

seq-1844 prefill 42 5 enterprise in next GPU launch

seq-1843 decode 6 4 interactive in next GPU launch

seq-1842 decode 19 3 paid in next GPU launch

seq-1845 prefill 78 1 batch KV budget full

seq-1842

decode

seq-1843

decode

seq-1844

prefill

seq-1845

prefill

seq-1846

not arrived

Static fallback: an LLM scheduler rebuilds each GPU launch from mutable sequence records. It chooses among prefill and decode records using token budget, KV-cache pages, priority, deadline, and cancellation state.

Every active conversation has a record attached to it: input length, generated length, prefill or decode state, KV-cache pages, remaining token budget, priority, and cancellation status.

The scheduler repeatedly asks:

Which records can fit in the next model step without breaking compute and memory budgets?

Tick 2/3 KV budget 82 Prefill chunk 4,096 Decode slots 3

A17 · paid decode keep stream smooth 18 KV pages

B04 · free prefill chunk prompt 12 KV pages

C92 · enterprise decode protect deadline 41 KV pages

D31 · paid cancelled release KV pages 9 KV pages

E08 · batch waiting admit if budget remains 0 KV pages

Decode records admitted 2

Prefill chunk admitted yes

Active KV pages 71/82

Teaching simulation: real schedulers use more state, but the invariant is the same: each tick admits, chunks, protects, cancels, and frees sequence records under KV and compute budgets.

Static fallback: each scheduler tick selects active sequence records under compute and KV-cache budgets, drops cancelled records, chunks prefills, and protects decode streams.

Now cancellation is not a UI nicety. It frees cache pages. A huge prompt is not just slow. It may need to be read in chunks so dozens of active streams do not freeze. A priority decode is not merely another queue item. It is a visible next token waiting on this iteration.

Measurement	What it tells you
Time to first token	queueing plus prefill plus first decode
Inter-token latency	time between streamed tokens
Aggregate tokens per second	fleet throughput
KV-cache occupancy	memory pressure from active conversations
Prefix-cache hit rate	repeated prompt beginnings saved from re-reading
Goodput	useful completed tokens under speed targets

The Batch Is Rebuilt Every Tick

Ordinary web servers often think in whole requests.

LLM serving cannot stop there.

A generated answer is many small iterations. On each iteration, the scheduler chooses which active conversations advance by one step. A request that already finished should leave. A newly arrived request may enter. A cancelled request should disappear. A long prompt may need prefill time. A short decode may need one quick next-token turn.

This is continuous batching.

The batch is not a sealed box. It is a moving roster.

That moving roster creates real policy choices. Should a huge prompt be allowed to block many streams while it is prefilling? Should short requests jump ahead? Should a paid tier get smoother inter-token latency? Should a nearly finished answer get priority so it frees memory sooner? Should a request with a giant future output limit be admitted now or held outside?

None of these are purely mathematical questions. They are product behavior leaking out of scheduling policy.

The best scheduler is not the one that maximizes one number. It protects useful work:

Pressure	Bad outcome if ignored
Low latency	users stare at a blank answer
High throughput	GPUs run underfilled
Memory pressure	streams are rejected or preempted
Fairness	one heavy user hurts many light users
Cancellation	paid work continues after nobody is reading

The user sees typing.

The runtime sees a live packing problem.

Under load, failure can be sudden. Queues rise. Users wait. Some cancel or retry. Retries add traffic. More open streams hold more cache. More cache pressure creates more queueing.

So the next question is not “can one rainbow answer finish?”

It is: can thousands of unfinished answers keep moving without knocking each other over?

The Fleet Keeps the Trick Boring

A good LLM service makes the miracle feel boring.

You press Enter. Text appears.

Behind that calm surface, the fleet is always changing. Some replicas are warming. Some are being drained for an update. Some hold old model snapshots. Some hold a canary snapshot. Some are overloaded because a feature just sent a traffic spike. Some are slow because one GPU link is unhappy.

Rollout stage 3/6 Traffic weight 5% Quality delta -1 Latency delta 6% Safety delta 0

snapshot copied weights and tokenizer land in model storage

canary warm small replica group loads and passes health checks

shadow compare real prompts run without exposing answers

traffic shift router sends a small percentage to the new snapshot

guardrail check quality, safety, latency, KV, and errors are compared

commit or rollback increase weight or drain the bad group

Behavior health inside guardrail

Serving health inside guardrail

Route action keep observing

This is why the fleet needs a diagnosis habit. The same sentence from a user, “the model feels worse today,” is not a bug report with one obvious owner.

It might mean the new snapshot changed behavior. It might mean a retrieval index missed evidence. It might mean a safety gate got stricter. It might mean one GPU in a tensor-parallel group is slow, so tokens arrive unevenly. It might mean the frontend is buffering stream fragments badly. The human symptom is vague because many layers meet at the same little piece of text.

So the fleet’s job is not only to serve. It is to notice which loop is drifting before the user has to name it.

A new model rollout is not just a behavior experiment. It is also a systems experiment. A safer model may be slower. A faster runtime may change rare formatting behavior. A smaller-number build may fit more traffic but damage quality in edge cases.

Canary traffic 5% Minutes observed 18

1 candidate snapshot weights, tokenizer, runtime config

2 offline evals known regressions and adversarial cases

3 shadow traffic real prompts, hidden answers

4 canary small visible traffic slice

5 promote increase routing weight

6 rollback drain streams and restore previous target

Signal citation faithfulness

Observed 89.0

Guardrail must stay above 92

Samples 3,780 (100% confidence hint)

Control action rollback model snapshot; keep runtime unchanged

Static fallback: production LLM fleets use offline evals, shadow traffic, canaries, online metrics, and rollback controls to separate model-quality regressions from serving or hardware regressions.

A Rollout Is Two Experiments

A new snapshot can be tested without showing it to users.

That is shadow traffic. The old snapshot answers the user. The new snapshot receives a duplicate prompt in the background. Its answer is logged, scored, compared, and discarded. Shadow traffic lets a team see whether the new model is slower, costlier, more likely to refuse, worse at citations, or better at a target task before users depend on it.

If the shadow path looks healthy, a canary begins.

Now a small visible slice of users reaches the new snapshot. The provider watches two families of numbers at once.

Behavior numbers:

did users regenerate more?
did citation faithfulness improve or fall?
did safety violations change?
did the model follow tool formats?
did human or model graders prefer it?

Systems numbers:

did first-token time change?
did inter-token latency change?
did KV-cache pressure change?
did GPU utilization change?
did error rate or cancellation rate move?

A rollout fails if either family fails. A model can be smarter and too slow. A runtime can be faster and subtly worse. A safety change can reduce bad output and over-refuse good requests.

This is why rollback is not only for crashes. Sometimes rollback means: the service is healthy, but the answers got worse.

The same complaint can come from different layers:

Symptom	Possible owner
late first token	front door, queue, cold replica, huge prefill
slow stream	scheduler, GPU kernel, interconnect, KV pressure
bad citation	retrieval, prompt packing, model behavior, eval coverage
truncated answer	length cap, stop token, policy gate, stream bug
network error	browser, proxy, edge, cancellation path

Incident trace 2/5

1 symptom answers sound fluent but cited tables do not support the claim

2 metric citation faithfulness 96 -> 82

3 owner eval and model rollout loop

4 action rollback snapshot, add table-heavy multilingual eval slice

5 user-visible result some users see wrong sourced answers before canary stops

Current question metric

citation faithfulness 96 -> 82

Static fallback: production LLM incidents must be traced from user symptom to metric, owning loop, control action, and user-visible result. Quality regressions, bad GPUs, safety bypasses, and KV-cache storms have different fixes.

Stability is a set of feedback loops. Watch queue depth. Watch first-token time. Watch inter-token time. Watch KV occupancy. Watch hardware health. Watch quality scores by snapshot. Watch citation faithfulness. Watch policy violations. Then decide what harm to accept.

Load shedding is deliberate refusal. A clear failure for some users can be safer than letting everyone enter a queue so long that they time out and retry.

Autoscaling helps later, but not instantly. A warm LLM replica may require copying hundreds of gigabytes of weights, verifying shards, warming kernels, allocating cache blocks, and gradually routing traffic. During the overloaded minute, the fast controls are admission, routing, batching policy, output limits, and rejection.

Incident step 2/4

leading signal long-context traffic pushes KV blocks toward the guardrail

local symptom time to first token rises while inter-token latency is still normal

control action admit short prompts, cap output, shed low-priority long requests

recovery check KV occupancy falls and active streams finish instead of timing out

Operational invariant: a fix is only real when the owning metric moves and the recovery check passes. Otherwise the system may just have moved harm to another layer.

Traffic pressure 80%

Metric p95 TTFT

Observed 1800

Guardrail must stay below 1200

Owning loop admission and scheduler

Static fallback: production stability means mapping symptoms to the right loop: admission for load, eval for quality, safety gates for jailbreaks, hardware routing for slow shards, and rollback when a snapshot regresses.

Stability Means Choosing Which Pain

Every control protects something and harms something.

Rejecting long prompts protects KV cache, but hurts users with legitimate long documents.

Lowering max output protects decode capacity, but may truncate good answers.

Routing to a smaller model protects availability, but may reduce quality.

Tightening a safety check reduces bad output, but may over-refuse.

Holding output for a stronger check improves safety, but makes the stream feel slower.

So stability is not one switch. It is a set of named tradeoffs. A mature service names both sides:

Control	Protects	Can harm
Admission limit	queues and cache	availability for heavy users
Output cap	decode capacity	completeness
Smaller fallback	uptime	answer quality
Stricter policy gate	safety	helpfulness
Region reroute	local overload	latency or policy fit
Canary rollback	quality	rollout speed

The discipline is to map the symptom to the loop that owns it. A late first token may be admission, prefill, routing, or cold replicas. Bad citations may be retrieval, prompt packing, model behavior, or evaluation coverage. Slow streamed tokens may be GPU kernels, collectives, queue policy, or KV pressure.

The journey matters because it narrows the fix.

The fleet’s job is not to make one answer possible once.

It is to make the next answer possible again and again, while traffic, models, prompts, memory, hardware, and quality all keep moving.

The Loop Closes

Now go back to the first visible word.

It did not come from “the model” in one vague sense.

It came from a loop:

Journey step 5/8

months before trained weights optimization writes tensors

0 ms request admitted gateway reserves tokens and capacity

before GPU prompt matrix token ids become vectors

first model pass prefill cache context writes KV pages

many times decode tick batch runs one next-token step

each tick sampled token logits become one id

return path streamed text detokenizer and gates release bytes

afterward fleet feedback metrics, evals, incidents, future training

Final invariant: the model never emits a whole answer in one act. It repeatedly turns the current context into one next-token distribution, while the serving system keeps the loop admitted, batched, cached, synchronized, sampled, checked, and streamed.

Static fallback: training creates weights; the gateway admits the request; tokens become vectors; prefill writes KV cache; decode repeatedly samples and streams tokens; fleet metrics feed future operations and training.

Your browser opened a stream. The front door admitted future work. The app packed a prompt larger than the sentence you typed. The tokenizer cut it into numbered pieces. A warm replica held frozen learned numbers. Transformer layers edited rows of numbers. The KV cache remembered the past. GPU memory fed arithmetic circuits. Distributed workers may have exchanged partial results. The sampler chose one next id. The stream gate released safe text. The browser painted it.

Then the chosen piece was fed back in.

And the loop ran again.

That is why the answer appears before it is finished. There is no finished answer yet. There is only the next piece, chosen under the pressure of everything before it.

LLM Chat at Every Level

Layer	What the chat is at this layer
Human	a question and a growing answer
Browser	a request body and an open response stream
Edge network	encrypted sessions, retries, timeouts, cancellation
Front door	identity, quota, routing, and capacity promises
Chat app	prompt packing, tools, retrieval, conversation state
Tokenizer	a sequence of integer ids
Model	embeddings, transformer layers, logits, sampling
Serving runtime	batches, KV-cache blocks, schedulers, warm replicas
GPU	HBM reads, SRAM tiles, tensor-core multiplies
Multi-GPU node	sharded weights and collective exchanges
Cluster	rollouts, queues, incidents, load shedding
Physical reality	transistors switching, memory cells charging, heat leaving the rack

Reference after the journey

Term	Meaning
Weight	a learned number stored in the model
Parameter	another name for a learned weight
Gradient	direction saying how a weight should change to reduce loss
Backpropagation	reverse bookkeeping pass that sends correction signals backward
Optimizer	update rule that turns gradients into weight nudges
Prompt	the full input sequence sent to the model for this turn
Context	readable input available to the model during this turn
RAG	retrieve external context before generation
Token	a text piece represented by one integer id
Token id	the integer assigned to one token
Embedding	the vector looked up for a token id
Vector	a list of numbers
Matrix	a rectangular table of numbers
Tensor	an organized block of numbers with a shape
Replica	a running copy of a model snapshot
Shard	one stored piece of a larger model
Attention	compare-and-mix step where positions look back at earlier positions
MLP	per-position editor inside a transformer layer
Logit	raw score for a possible next token
Softmax	function that turns scores into probabilities
KV cache	stored keys and values from previous positions
Prefill	pass that reads the prompt and fills KV cache
Decode	repeated pass that generates one new token
Quantization	storing numbers with fewer bits while trying to preserve behavior
Tensor parallelism	splitting one large matrix operation across GPUs
Collective	group communication operation among GPUs
Continuous batching	updating the active batch between generation iterations
Scheduler	runtime part that chooses which active requests run next
Load shedding	deliberate rejection of some work to keep the service stable

Why Does an LLM Answer One Piece at a Time?

One Chat Request, Many Layers

The Browser Sends a Control Envelope

request envelope

downstream consequence

Admission Control Counts Tokens, Not Just Requests

Admission Is a Reservation, Not a Boolean

The Production Stack Is Not Just "A Model"

The Prompt Is Built, Not Merely Sent

Evidence Has to Reach the Prompt

One Claim Needs an Evidence Chain

The Model Sees Token IDs

A Token Id Selects One Row of Weights

A Tokenizer Is a Fixed Merge Recipe

One Training Position Becomes a Weight Nudge

Post-Training Is a Factory for Better Future Answers

One Example Becomes a Post-Training Signal

A Model Snapshot Becomes a Warm Replica

A Model Artifact Becomes a Live Replica

The First Wall Is Memory

One Replica Has to Fit on Every Shard

One Forward Pass Has Two Different Costs

Attention Reads the Past; KV Cache Keeps It

For "server"

KV Cache Turns Recompute into Memory

FLOPs Only Matter If Data Arrives Fast Enough

A 70B Deployment Plan Is a Per-Rank Budget

One Tensor-Parallel Layer Has a Waiting Point

More GPUs Means More Waiting Places

Raw Scores Become One Sampled Token

One Token, Fully Traced

Reasoning Is Inference-Time Compute, Not Magic

A Streamed Token Has Gates

A Generated Token Is Not Always Released

Serving Is a Scheduling Problem

The Batch Is Rebuilt Every Iteration

One Scheduler Tick Chooses the Next Batch

A Fleet Change Is a Controlled Experiment

A Canary Is a Control Loop

Production Debugging Finds the Owning Loop

One Incident Is a Timeline, Not a Vibe

Stability Means Choosing the Right Control Loop

The Whole Journey Is One Repeated Loop