How an LLM works, without a single equation

You may have heard the metaphor of the stochastic parrot. Or the prediction engine. They’re good metaphors, but they’re still abstract. Here’s how it really works - precisely enough to make useful decisions, without a single equation.

The token: the basic unit

An LLM doesn’t read words. It reads tokens. A token is a fragment of text - sometimes a whole word, sometimes part of a word, sometimes just a space or punctuation mark. In English, a token corresponds on average to 3/4 of a word. In French, a little less, as words are longer on average.

Why does it matter? Because the length of the text you send to a model is measured in tokens, not words. And API prices are charged per token. 1,000 tokens is roughly 750 words in English, or 650 in French.
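
To make the arithmetic concrete, here is a minimal sketch that turns a word count into an approximate token count and cost. It uses only the ratios quoted above; the price of 3.0 per million tokens is an invented figure for illustration, not any provider’s actual rate, and a real tokenizer will give somewhat different counts.

```python
# Rule-of-thumb ratios from the text: 1,000 tokens is roughly
# 750 English words or 650 French words.
WORDS_PER_1000_TOKENS = {"en": 750, "fr": 650}


def estimate_tokens(text: str, language: str = "en") -> int:
    """Estimate the token count of a text from its word count."""
    words = len(text.split())
    return round(words * 1000 / WORDS_PER_1000_TOKENS[language])


def estimate_cost(tokens: int, price_per_million_tokens: float) -> float:
    """Cost of sending `tokens` tokens at a given (hypothetical) price."""
    return tokens * price_per_million_tokens / 1_000_000


contract = "word " * 3000  # stand-in for a 3,000-word English contract
tokens = estimate_tokens(contract, "en")
print(tokens)  # 4000
print(estimate_cost(tokens, price_per_million_tokens=3.0))  # 0.012
```

The same 3,000 words in French would come out at a higher estimate, since the ratio of words to tokens is lower.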

When you send “Bonjour, pouvez-vous analyser ce contrat ?” (“Hello, can you analyze this contract?”), the model doesn’t receive a sentence. It receives a sequence of tokens: [“Bon”, “jour”, “,”, “pou”, “vez”, “-vous”, “anal”, “yser”, “ce”, “contra”, “t”, “?”] (approximately - the exact breakdown depends on the model’s tokenizer).
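
The splitting itself can be imitated with a toy tokenizer. The sketch below is a simplification: it uses a tiny invented vocabulary and a greedy longest-match rule, whereas production tokenizers rely on learned schemes such as byte-pair encoding and vocabularies of tens of thousands of entries. It is applied here to the English version of the sentence.

```python
# Hypothetical mini-vocabulary, invented for this example.
VOCAB = {"Hello", ",", " can", " you", " analy", "ze", " this", " contract", "?"}


def tokenize(text: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match split; unknown characters become one-character tokens."""
    tokens = []
    max_len = max(len(entry) for entry in vocab)
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink until something matches.
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in vocab or length == 1:
                tokens.append(piece)
                i += length
                break
    return tokens


print(tokenize("Hello, can you analyze this contract?", VOCAB))
# ['Hello', ',', ' can', ' you', ' analy', 'ze', ' this', ' contract', '?']
```

Note that “analyze” is split in two because the whole word is not in this vocabulary: that is how a word ends up costing more than one token.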

Context: short-term memory

An LLM has no persistent memory. What it “knows” about your conversation is what’s in the context window - the sequence of tokens it receives on each call.

The context window has a maximum size. GPT-4 Turbo can process up to 128,000 tokens. Claude 3.5 Sonnet, 200,000. That’s a lot of tokens. It doesn’t mean that the model uses everything with the same efficiency. Research shows that models tend to retain what’s at the beginning and end of the window better than what’s in the middle. It’s not an absolute rule, but it’s a documented bias.
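
One practical consequence: an application has to decide what goes into the window on each call. The sketch below shows one simple strategy among several, keeping only the most recent messages that fit a token budget. The function names are hypothetical, and the token estimate reuses the rule of thumb that a token is about 3/4 of an English word rather than a real tokenizer.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate: about 4 tokens for every 3 English words."""
    return round(len(text.split()) * 4 / 3)


def fit_to_window(messages: list[str], max_tokens: int) -> list[str]:
    """Keep the most recent messages whose combined estimate fits in max_tokens."""
    kept = []
    used = 0
    for message in reversed(messages):  # walk from newest to oldest
        cost = estimate_tokens(message)
        if used + cost > max_tokens:
            break  # everything older than this point is dropped
        kept.append(message)
        used += cost
    return list(reversed(kept))  # restore chronological order


history = [
    "My name is Ada and I work on contracts.",
    "Please summarize clause four.",
    "What is my name?",
]
print(fit_to_window(history, max_tokens=12))
# ['Please summarize clause four.', 'What is my name?']
```

With this deliberately tiny budget, the message containing the name no longer fits, so the model has no way to answer the last question: whatever is outside the window does not exist for it.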

Temperature: the creativity/reliability cursor

When the model predicts the next token, it doesn’t always choose the most likely one. There’s a parameter called temperature that regulates the degree of randomness in the choice.

Low temperature (close to 0): the model almost always chooses the most probable token. The output is nearly deterministic and repetitive, which suits factual tasks.

High temperature (close to or above 1): the model diversifies its choices, exploring less likely tokens. More creative, more varied output, but also more likely to go off in unexpected directions.

For a factual use case (information extraction, format checking, filing): low temperature. For creative writing: higher temperature. Most users won’t touch it - default interfaces use an intermediate temperature.
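
The effect of temperature can be sketched in a few lines. The version below uses the common formulation in which raw scores are divided by the temperature before being turned into probabilities; the scores themselves are invented, and real systems often combine temperature with other sampling settings that are not shown here.

```python
import math
import random


def softmax_with_temperature(scores: dict[str, float], temperature: float) -> dict[str, float]:
    """Turn raw scores into probabilities. Temperature must be greater than 0."""
    scaled = {token: score / temperature for token, score in scores.items()}
    peak = max(scaled.values())  # subtracting the max keeps exp() numerically stable
    exps = {token: math.exp(value - peak) for token, value in scaled.items()}
    total = sum(exps.values())
    return {token: value / total for token, value in exps.items()}


def sample(scores: dict[str, float], temperature: float, rng: random.Random) -> str:
    """Draw one token at random according to the temperature-adjusted probabilities."""
    probs = softmax_with_temperature(scores, temperature)
    return rng.choices(list(probs), weights=list(probs.values()), k=1)[0]


# Invented scores for the token following "The capital of France is"
scores = {"Paris": 3.0, "Lyon": 1.5, "Tokyo": 0.5}

for temperature in (0.1, 1.0, 2.0):
    probs = softmax_with_temperature(scores, temperature)
    print(temperature, {token: round(p, 3) for token, p in probs.items()})

print(sample(scores, temperature=1.0, rng=random.Random(0)))
```

At 0.1 nearly all of the probability sits on the top-scoring token; at 2.0 the alternatives take a much larger share, which is where both the variety and the surprises come from.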

Why it hallucinates, mechanically

Now you know the parts. The token prediction, the context, the temperature. Put them together and hallucination becomes inevitable.

The model predicts the next most likely token. When it’s talking about something it saw little of during training (a recent event, a little-known person, a specific regulation), it doesn’t have a strong signal. It predicts anyway, because it can’t not predict. And the token it predicts may be formally plausible (a plausible number for a date, a plausible name for a person) but factually false.

The model doesn’t know that it doesn’t know. It has no metacognition about its shortcomings. It generates truth and invention with the same degree of fluency.
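
A deliberately simplified sketch of that mechanism, with invented probabilities: the prediction step returns a token whether the distribution is sharply peaked or almost flat, and nothing in the returned text records which case it was.

```python
def predict(probabilities: dict[str, float]) -> str:
    """Return the most likely token. There is no 'I don't know' branch."""
    return max(probabilities, key=probabilities.get)


# Invented distributions for the year completing "The treaty was signed in ..."
well_known = {"1957": 0.92, "1958": 0.05, "1956": 0.03}  # strong signal
obscure = {"1957": 0.27, "1962": 0.25, "1949": 0.24, "1971": 0.24}  # weak signal

print(predict(well_known), max(well_known.values()))  # 1957 0.92
print(predict(obscure), max(obscure.values()))  # 1957 0.27
```

Both calls print the same year. In the second case it is barely more likely than three other candidates, yet it reads exactly the same in the generated sentence.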

Models trained with RLHF (Reinforcement Learning from Human Feedback) may be more likely to produce smooth, confident responses, even on uncertain topics, because human annotators tend to prefer assured responses to hesitant ones.

Source: InstructGPT - Ouyang et al. (2022), https://arxiv.org/abs/2203.02155 (2022-03-04)

This is a crucial point. Fine-tuning with human feedback (RLHF), which makes models more pleasant to use, can make hallucination worse in one respect: displayed confidence. Human annotators reward responses that appear assured. The model learns to appear confident, even when it isn’t.

RAG: a useful crutch, not a cure

RAG stands for Retrieval-Augmented Generation. The idea: instead of putting everything into context or asking the model to “memorize” everything during training, we retrieve relevant documents on the fly and inject them into the context before asking the question.

Example: you have a database of 10,000 contracts. For each question, we look for the 5 contracts closest semantically to the question, put them in context, and the model answers based on these documents.

RAG reduces hallucinations within the area the documents cover. If the answer is in the retrieved documents, the model is much more likely to find it. If it isn’t, the model can still hallucinate. And if the documents themselves contain errors, the model will pass them on.
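
The retrieval step can be sketched as follows. This is a minimal illustration under stated assumptions: word-count vectors and cosine similarity stand in for a real embedding model, the three contracts are invented, and the function names are hypothetical. The final call to a model is left out.

```python
import math
from collections import Counter


def vectorize(text: str) -> Counter:
    """Bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question: str, documents: list[str], k: int) -> list[str]:
    """Return the k documents most similar to the question."""
    q = vectorize(question)
    ranked = sorted(documents, key=lambda doc: cosine(q, vectorize(doc)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, documents: list[str], k: int = 2) -> str:
    """Inject the retrieved documents into the context ahead of the question."""
    context = "\n".join(f"- {doc}" for doc in retrieve(question, documents, k))
    return f"Answer using only these documents:\n{context}\n\nQuestion: {question}"


contracts = [
    "lease contract for the Lyon office with a notice period of six months",
    "supplier contract for printer maintenance renewed every year",
    "employment contract with a trial period of three months",
]
print(build_prompt("what is the notice period for the lyon office lease", contracts))
```

Note that `retrieve` always returns k documents, even when none of them is relevant to the question. The model then receives a context that looks authoritative but does not contain the answer, which is one way hallucination survives RAG.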

What you can deduce for your use cases

These mechanisms have direct implications:

Tasks suitable for an LLM: format transformation, summary, classification on well-defined categories, code generation on common patterns, first draft of writing.

Risky tasks without precautions: extraction of precise factual information (dates, figures, proper names), legal or medical verification, anything that depends on recent knowledge not present in the training data.

Unsuitable tasks without specific architecture: anything that requires long-term memory, anything that requires guaranteed formal reasoning, anything that does not tolerate undetected errors.

The practical rule: the higher the cost of an error in your context, the more you need a mechanism for human or automated validation of model output. This mechanism has a cost. This cost is part of the total cost of your AI project.
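
As an illustration of the automated side of that validation, here is a minimal sketch that checks a model’s extraction output before anything downstream uses it. The field names and the expected JSON shape are assumptions made for the example, not a standard.

```python
import json
from datetime import date

REQUIRED_FIELDS = ("party", "signature_date", "amount")  # hypothetical schema


def validate_extraction(raw_output: str) -> list[str]:
    """Return a list of problems; an empty list means the output passed the checks."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(data, dict):
        return ["output is not a JSON object"]

    problems = [f"missing field: {field}" for field in REQUIRED_FIELDS if field not in data]
    if "signature_date" in data:
        try:
            date.fromisoformat(str(data["signature_date"]))
        except ValueError:
            problems.append("signature_date is not a valid ISO date")
    if "amount" in data and not isinstance(data["amount"], (int, float)):
        problems.append("amount is not a number")
    return problems


output = '{"party": "Acme", "signature_date": "2023-02-30", "amount": 1200}'
print(validate_extraction(output))
# ['signature_date is not a valid ISO date']
```

Checks like these catch malformed or impossible values, such as a 30th of February. They do not catch a date that is well-formed but wrong, which is why high-stakes cases still need a human review or a comparison against the source document.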
