Jan 9, 2026·6 min read

What is an LLM, really? A mental model for developers

A no-hype mental model of how large language models actually work, written for engineers who want to build with them instead of just prompt them.

Most explanations of large language models are either marketing or math. Neither helps you ship. Here is the model I actually use when building with them.

An LLM is a function. It takes a sequence of tokens and returns a probability distribution over the next token. That's it. Everything else — chat, reasoning, agents, tool calls — is scaffolding wrapped around that one primitive.

Tokens, not words

Text is chopped into tokens: roughly 3 to 4 characters each. "unbelievable" might be three tokens. This matters because you pay per token, context limits are measured in tokens, and the model literally cannot see characters the way you do. Asking it to count letters in a word is asking a fish to climb a tree.

Next-token prediction is the whole engine

The model was trained on a huge corpus to answer one question over and over: given everything so far, what comes next? Stack billions of parameters and enough data, and "predict the next token well" quietly forces the model to learn grammar, facts, code patterns, and a fuzzy world model — because all of those help it predict better.

Generation is just this in a loop:

Sampling is where temperature lives. Low temperature picks the most likely token and feels deterministic. Higher temperature gambles on less likely tokens and feels creative — or wrong.

Why it hallucinates

The model optimizes for plausible, not true. A confident wrong answer and a confident right answer look identical to the loss function if both are fluent continuations. There is no internal "do I actually know this?" flag. Hallucination isn't a bug bolted on top — it's the same machinery that makes the model useful, pointed at a gap in its training.

That reframes your job. You don't ask an LLM to be correct. You give it the context that makes the correct answer the most probable continuation: retrieved documents, examples, tool results.

The context window is your RAM

The model has no memory between calls. Every request is a cold start. "Memory" in a chat app is just the transcript replayed into the prompt each turn. When people say an agent "remembers," they mean something is stuffing relevant history back into the context window.

Practical consequences:

Order matters. Models weight recent and early tokens differently; burying instructions in the middle of a giant prompt hurts.
Garbage in, garbage out, expensively. Every irrelevant token costs money and dilutes attention.
Few-shot examples are programming. Showing 3 input-output pairs often beats a paragraph of instructions.

What this buys you

Once you internalize "stateless next-token function with no truth oracle," the design patterns fall out on their own. RAG is just dynamically building a better prompt. Tool use is letting the model emit a structured token sequence you intercept and execute. An agent is a while-loop around all of it.

The engineers who build reliable LLM features aren't the ones with the cleverest prompts. They're the ones who treat the model as a component with sharp edges — and engineer around those edges.

If you want the concepts underneath this — embeddings, training objectives, and where models break — work through the Machine Learning track or browse the full lesson library. When you're ready to test whether the theory stuck, jump into Cruxible and rank against an AI tuned to your tier. Can you still beat the machine?