How LLMs work

Just enough to reason about their behavior in research

Note: Learning objectives
  • Explain, at a working level, what an LLM is doing when it generates text.
  • Distinguish the major training stages and connect them to observed model behaviour.
  • Use the right vocabulary for failure modes: fabrication (made-up facts and citations) and confabulation (the mechanism that produces them).
Warning: Scope

This is a working model of how LLMs behave, not a mechanistic account. For a more rigorous treatment, see the linked technical readings. What’s here will let you reason about behaviour in your workflow, not about model internals.

What an LLM is

A language model is trained to predict the next token given all previous tokens. Tokens are not words. They are subword units, roughly 3 to 4 characters on average for English prose. “Transformer” and ” transformer” are different tokens. “GPT” may be a single token. An emoji may expand into several. Tokenisation is opaque to users and causes systematic failures on tasks that require character-level reasoning: counting letters, precise string manipulation, certain code patterns.
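
If you want to see tokenisation directly, here is one way to inspect it. This is a sketch, not part of the original text: it assumes the open-source tiktoken library is installed, and other tokenisers split text differently.

```python
import tiktoken  # pip install tiktoken; the splits shown are specific to this tokeniser

enc = tiktoken.get_encoding("cl100k_base")
for text in ["Transformer", " transformer", "GPT", "deoxyribonucleic acid"]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]  # decode each token id back to its text piece
    print(f"{text!r}: {len(ids)} token(s) -> {pieces}")
```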

The prediction is carried out by a transformer, an architecture that uses attention mechanisms to weight every token in the context against every other token at each layer. The practical upshot: the model can integrate information from anywhere in the context window, but it does not distinguish things that are true from patterns that complete this sequence plausibly. There is no internal “fact store” being queried. There is a very large set of learned associations over token sequences.

Attention in one head: a toy calculation

To make this concrete, here is how one attention head computes which tokens to weight when building the representation for a single token. Use the four-token sentence:

gene · is · not · expressed

Each token is mapped to three small vectors (Q for query, K for key, V for value) by learned weight matrices. For illustration, use toy two-dimensional vectors. Real models use 64 to 128 dimensions per head.

Token | K vector
gene | [2, 1]
is | [0, 1]
not | [1, 0]
expressed | [1, 1]

The query for “expressed” is Q = [1, 2]. To find how much “expressed” should attend to each token in the sentence (including itself), compute the dot product Q · K_i:

Token | Q · K_i | Scaled (÷ √2 ≈ 1.41)
gene | 1×2 + 2×1 = 4 | 4 ÷ 1.41 ≈ 2.83
is | 1×0 + 2×1 = 2 | 2 ÷ 1.41 ≈ 1.41
not | 1×1 + 2×0 = 1 | 1 ÷ 1.41 ≈ 0.71
expressed | 1×1 + 2×1 = 3 | 3 ÷ 1.41 ≈ 2.12

Scaling by √d_k prevents the dot products from growing so large that softmax saturates. Apply softmax to the scaled scores (exp of each, divided by the sum):

Token | exp(score) | Attention weight
gene | exp(2.83) ≈ 16.9 | 0.54
is | exp(1.41) ≈ 4.1 | 0.13
not | exp(0.71) ≈ 2.0 | 0.06
expressed | exp(2.12) ≈ 8.3 | 0.27
sum | 31.3 | 1.00

The output for “expressed” is then the weighted sum of all V vectors:

output = 0.54·V_gene + 0.13·V_is + 0.06·V_not + 0.27·V_expressed

What this tells you: with these toy vectors, “expressed” draws mostly from “gene” (54%) and from itself (27%), and barely from “not” (6%). In a real trained model the weights would look very different: the training signal would push “not” much higher, because negation is semantically critical. The arithmetic above is real; only the Q and K vectors are invented. Every head in every layer of a frontier model runs this same computation, for every token pair, millions of times per forward pass.
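
If you want to check the arithmetic yourself, the same calculation fits in a few lines of NumPy. The numbers are the toy values from the tables above; nothing here corresponds to a real model’s weights.

```python
import numpy as np

tokens = ["gene", "is", "not", "expressed"]
K = np.array([[2, 1],    # gene
              [0, 1],    # is
              [1, 0],    # not
              [1, 1]])   # expressed
q = np.array([1, 2])     # query vector for "expressed"

scores = K @ q                                    # dot products: [4, 2, 1, 3]
scaled = scores / np.sqrt(K.shape[1])             # divide by sqrt(d_k) = sqrt(2)
weights = np.exp(scaled) / np.exp(scaled).sum()   # softmax

for t, w in zip(tokens, weights):
    print(f"{t:>9}: {w:.2f}")                     # 0.54, 0.13, 0.06, 0.27
# A real head would also compute the output, weights @ V, and Q, K, V themselves
# would come from learned projections of the token embeddings, not hand-picked values.
```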

Two inference-time parameters that matter:

  • Temperature scales the sharpness of the next-token distribution before sampling (see the sketch after this list). Temperature 0 is greedy decoding: always pick the most probable token; it is nominally deterministic and useful for structured output and code. Higher temperatures increase diversity and apparent creativity, but also incoherence and fabrication rate. API defaults vary, but most sit around 0.7 to 1.0.
  • Context window is the maximum number of tokens the model can attend to in a single inference call. Modern frontier models have windows of 100K to 1M tokens. Effective use at very long contexts is uneven: information in the middle of a very long context is attended to less reliably than information at the start or end (the “lost in the middle” finding).
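
Here is a minimal sketch of what temperature does to a next-token distribution. The logits are made up for illustration; real models produce one logit per vocabulary entry, and providers may add further sampling tricks (top-p, top-k) that this sketch ignores.

```python
import numpy as np

# Made-up logits for a four-token vocabulary.
logits = np.array([4.0, 3.5, 1.0, -2.0])

def softmax_with_temperature(logits, T):
    z = logits / T               # T < 1 sharpens the distribution, T > 1 flattens it
    z = z - z.max()              # subtract the max for numerical stability
    return np.exp(z) / np.exp(z).sum()

for T in (0.2, 0.7, 1.5):
    print(T, np.round(softmax_with_temperature(logits, T), 3))
# As T approaches 0, the distribution collapses onto the single most probable token
# (greedy decoding); higher T spreads probability onto less likely tokens.
```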

Three things to keep in mind as a working researcher:

  • An LLM is a function from a token sequence to a probability distribution over next tokens (see the sketch after this list).
  • The model has no internal representation of “what is true”. It only has “what is likely given the training distribution and the in-context evidence”.
  • Long contexts are not memory. The prompt window is the entire world the model can see at inference time.
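
To make the first bullet concrete, here is a toy autoregressive loop. Everything in it is invented for illustration (the vocabulary, the stand-in model, the sampling); the point is only the shape of the computation: the model maps a token sequence to a distribution, one token is sampled and appended, and the loop repeats with no memory beyond the sequence itself.

```python
import numpy as np

VOCAB = ["gene", "is", "not", "expressed", "."]   # toy vocabulary, invented

def toy_model(tokens: list[int]) -> np.ndarray:
    """Stand-in for an LLM: token sequence in, probability distribution out.
    This toy ignores its input; a real model conditions on every token in the window."""
    logits = np.random.randn(len(VOCAB))
    return np.exp(logits) / np.exp(logits).sum()  # softmax -> probabilities

def generate(model, prompt: list[int], n_new: int) -> list[int]:
    tokens = list(prompt)
    for _ in range(n_new):
        probs = model(tokens)                     # the model sees only this sequence
        tokens.append(int(np.random.choice(len(VOCAB), p=probs)))
    return tokens

print([VOCAB[i] for i in generate(toy_model, [0, 1], 5)])
```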

Training stages (simplified)

Modern frontier models go through at least four stages:

  1. Pretraining: next-token prediction on a large corpus of text and code. This is the source of broad knowledge and most factual content. It is also the source of factual errors and biases in the training distribution.
  2. Supervised fine-tuning (SFT): training on curated instruction-and-response pairs. Shapes the assistant format and instruction-following behaviour.
  3. Preference optimisation (RLHF, DPO, RLAIF, constitutional methods): training against human or AI preferences. Shapes helpfulness, honesty, and refusals. It also materially affects factual and reasoning behaviour, not just tone.
  4. Safety, character, and red-team passes: additional tuning for refusals, persona, and adversarial robustness.

Don’t memorise this list. The point: post-training is not just “polish”. Two models with similar pretraining can behave very differently because of preference and safety tuning.

Why models fabricate

We use fabrication for the outcome (made-up facts, fake citations, invented function signatures) and confabulation for the mechanism (the model filling gaps in plausible ways from the training distribution). “Hallucination” is the common term, but it understates the harm when applied to citations.

Why it happens:

  • Training rewards fluent, plausible completion, not abstention or calibrated uncertainty.
  • For rare facts (a specific paper, a recent tool version, a niche protocol), the model interpolates from neighbours in the training distribution.
  • Preference tuning further pushes toward confident, helpful-sounding answers.

Huang et al. (2023) distinguish two useful types:

  • Factuality fabrication: the output contradicts a verifiable real-world fact.
  • Faithfulness fabrication: the output contradicts a source the model was given. For example, a misquote from a paper you pasted in.

Both happen, and they need different defences.

What this means for you

  • Facts at the long tail of the training distribution (specific papers, recent tools, niche protocols) are the highest-risk outputs.
  • Context you provide in the prompt is qualitatively more reliable than facts the model recalls.
  • Retrieval (RAG), tool use, and grounding in real documents reduce fabrication, but they do not eliminate it. They can introduce faithfulness fabrication where they reduce factuality fabrication.

Check your understanding

  1. A colleague asks why a frontier LLM “knows” the abstract of a recent paper from your subfield but invents the figure legends. What is the structural reason in terms of training distribution and confabulation?
  2. You set temperature = 0 and the model still produces a different answer to the same prompt across two API calls. Name two plausible causes.
  3. Distinguish factuality fabrication from faithfulness fabrication with one example each from a research workflow.

Answers:

  1. Abstracts of widely cited papers appear many times in the training corpus (broad coverage at the head of the distribution); figure legends from a single paper appear once or not at all (long tail). When asked for the legends, the model has no learned association to recall and confabulates a plausible completion.
  2. Non-determinism in the inference stack: batched sampling can break ties differently across calls; many providers do not guarantee bit-exact determinism even at temperature 0; system prompts or hidden context (date, region) may differ between calls.
  3. Factuality: the model invents a paper title and DOI (“Smith et al. 2021, Nature Methods”) that does not exist. Faithfulness: you paste a real paper and ask for a summary; the model says the study used n=200 mice when the paper says n=20.

Further reading