What powers ChatGPT

ChatGPT, Claude, and Gemini all work this way.

An LLM (large language model) is a machine that guesses the next word. That is it. It reads everything that came before, weighs each word against every other word, and picks a likely word to follow. Do that a few hundred times in a row and you have a paragraph. Do it billions of times during training and you get ChatGPT.
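The loop described above can be sketched in a few lines. This is a toy under stated assumptions, not how any real model stores its knowledge: `TOY_MODEL`, `predict_next`, and `generate` are hypothetical names, the probabilities are made up, and the lookup only sees the last two words, whereas a real LLM computes its probabilities from the whole context with a neural network. It also always takes the single most probable word, while real chatbots usually sample among the likely ones.

```python
# Hypothetical probabilities for the word that follows each pair of words.
TOY_MODEL = {
    ("<start>", "the"): {"chef": 0.6, "pilot": 0.4},
    ("the", "chef"): {"tasted": 0.7, "cooked": 0.3},
    ("chef", "tasted"): {"the": 0.9, "it": 0.1},
    ("tasted", "the"): {"soup": 0.45, "sauce": 0.30, "wine": 0.15, "door": 0.10},
    ("the", "soup"): {"<end>": 1.0},
}


def predict_next(context):
    """Return {word: probability} for the next word, given the words so far."""
    key = tuple(context[-2:])
    # Unknown contexts simply end the sentence in this toy.
    return TOY_MODEL.get(key, {"<end>": 1.0})


def generate(prompt, max_words=10):
    """Repeat the guess-append loop until the model predicts the end."""
    words = ["<start>"] + list(prompt)
    for _ in range(max_words):
        probs = predict_next(words)
        best = max(probs, key=probs.get)  # greedy: take the most probable word
        if best == "<end>":
            break
        words.append(best)  # the guess becomes part of the context
    return words[1:]


print(" ".join(generate(["the"])))  # the chef tasted the soup
```

Each guess is appended to the context before the next guess is made. That feedback is what turns a one-word predictor into a paragraph writer.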

Interactive demo: pick a sentence. The bars show how much attention each word gets, next to the model’s top-3 guesses for the next word.

The process

The transformer architecture has one key idea: attention. Before picking the next word, the model scores every word in the context for how relevant it is to the prediction it is about to make. Words that matter more get higher scores. Irrelevant words get scores close to zero and are effectively ignored.

In the sentence “The chef tasted the ???” the model attends to “chef” and “tasted.” Those two words almost guarantee the answer is a food. In “The pilot radioed the ???” it attends to “pilot” and “radioed” instead.
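Here is a minimal sketch of that scoring step, using the standard scaled dot-product formula. The two-number vectors are invented for illustration. Real models learn vectors with hundreds or thousands of dimensions, and `attention_weights` is a hypothetical helper name.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query, keys):
    """Score one query vector against every key vector (scaled dot product)."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)


words = ["The", "chef", "tasted", "the"]
# Made-up key vectors: "chef" and "tasted" point the same way as the query.
keys = [[0.1, 0.0], [2.0, 1.0], [1.0, 2.0], [0.1, 0.1]]
query = [1.5, 1.5]  # made-up query for the position predicting ???

weights = attention_weights(query, keys)
for word, weight in zip(words, weights):
    print(f"{word:>7}: {weight:.2f}")
```

With these made-up numbers, nearly all of the weight lands on “chef” and “tasted,” and the two copies of “the” get almost none.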

This happens in parallel across many different “heads.” Each attention head looks for a different kind of relationship: subject-verb agreement, pronoun references, topic cues. The outputs get combined, passed through a feedforward layer, and the process repeats for every layer of the network, sometimes 96 layers deep, as in GPT-3.
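The “many heads, then combine” step might look like this in miniature. It is a simplification under stated assumptions: real transformers give each head its own learned projection matrices and mix the combined output with one more learned matrix, and this sketch leaves all of those out and simply slices the vectors. `attend`, `split_heads`, and `multi_head_attention` are hypothetical names.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """One head: blend the value vectors, weighted by query-key similarity."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    width = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(width)]


def split_heads(vector, num_heads):
    """Cut one vector into equal slices, one slice per head."""
    size = len(vector) // num_heads
    return [vector[h * size:(h + 1) * size] for h in range(num_heads)]


def multi_head_attention(query, keys, values, num_heads):
    """Run each head on its own slice, then concatenate the results."""
    q_heads = split_heads(query, num_heads)
    k_heads = [split_heads(k, num_heads) for k in keys]
    v_heads = [split_heads(v, num_heads) for v in values]
    output = []
    for h in range(num_heads):
        head_keys = [k[h] for k in k_heads]
        head_values = [v[h] for v in v_heads]
        output.extend(attend(q_heads[h], head_keys, head_values))
    return output


# Three context words, each a made-up 4-number vector, split across 2 heads.
context = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
out = multi_head_attention([1.0, 0.0, 0.0, 1.0], context, context, num_heads=2)
print(out)
```

Because each head only sees its own slice, the two heads can settle on different words in the same sentence.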

GPT-4 is reported to have about 1.8 trillion parameters, though OpenAI has not confirmed the figure. Each one is a weight. Each weight was nudged again and again during training, a tiny step at a time, until the model’s next-word guesses became accurate.
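To see what “nudged” means, here is a toy with three adjustable numbers instead of trillions. It is a sketch under stated assumptions: the candidate words, the learning rate, and the step count are all made up, and the update rule is plain gradient descent on cross-entropy loss, which is the textbook version of what real training does at vastly larger scale.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


candidates = ["soup", "laptop", "door"]
target = candidates.index("soup")  # the word that actually came next
logits = [0.0, 0.0, 0.0]           # the toy "weights", one score per word
learning_rate = 0.5                # hypothetical step size

before = softmax(logits)[target]
for step in range(200):
    probs = softmax(logits)
    for i in range(len(logits)):
        # Gradient of cross-entropy loss with respect to each score:
        # predicted probability minus 1 for the right word, minus 0 otherwise.
        grad = probs[i] - (1.0 if i == target else 0.0)
        logits[i] -= learning_rate * grad  # one small nudge
after = softmax(logits)[target]

print(f"P(soup) before training: {before:.2f}")
print(f"P(soup) after training:  {after:.2f}")
```

No single nudge changes much. Repeated over many examples, the nudges add up to a model whose guesses match the text it was trained on.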

Modern frontier models do more than answer in a single pass. The 2025 to 2026 generation reasons before answering. The model generates a hidden chain of thought, tests hypotheses, backtracks from dead ends, and verifies its own logic, all still produced one token at a time. This is test-time compute: the model spends extra cycles thinking before it speaks. It costs more per answer, and it produces sharper logic on hard problems.
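The trade-off can be illustrated with a small puzzle solver. This is an analogy only, not how reasoning models are built: a real model writes its reasoning as tokens, while this sketch uses an ordinary backtracking search. `greedy_guess` and `think_then_answer` are hypothetical names, and the puzzle is to pick numbers that add up to a target.

```python
def greedy_guess(numbers, target):
    """Answer in one pass: grab each number that still fits, never look back."""
    chosen, remaining = [], target
    for n in sorted(numbers, reverse=True):
        if n <= remaining:
            chosen.append(n)
            remaining -= n
    return chosen


def think_then_answer(numbers, target):
    """Try a hypothesis, verify it, backtrack from dead ends. Returns (answer, steps)."""
    ordered = sorted(numbers, reverse=True)
    steps = 0

    def search(start, remaining, chosen):
        nonlocal steps
        steps += 1                      # every hypothesis costs compute
        if remaining == 0:              # verified: the numbers add up
            return list(chosen)
        for i in range(start, len(ordered)):
            if ordered[i] <= remaining:
                chosen.append(ordered[i])
                result = search(i + 1, remaining - ordered[i], chosen)
                if result is not None:
                    return result
                chosen.pop()            # dead end: backtrack
        return None

    return search(0, target, []), steps


numbers, target = [7, 5, 3, 8], 16
print(greedy_guess(numbers, target))       # [8, 7], which sums to 15: wrong
print(think_then_answer(numbers, target))  # ([8, 5, 3], 5)
```

The quick guess is cheap and wrong. The careful version spends five steps, hits one dead end, and returns an answer it has checked.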

A familiar example

Think of finishing a sentence for someone mid-thought. You do not process each word in isolation. You hold the whole sentence in your head and every word you already heard informs what comes next. If they say “I need to catch the...” you hear “catch” and your brain is already preparing for “bus,” “train,” or “flight.” Attention works the same way, with numbers.

Variants include

Encoder-only (BERT)

BERT reads a sentence and produces a meaning vector for each word. It does not generate text. It understands it. Search engines use BERT to understand your query. Sentiment classifiers, document taggers, and entity extractors are often fine-tuned BERTs.

Decoder-only (GPT)

GPT reads everything to the left and generates one token at a time. ChatGPT, Claude, and Gemini are all decoder-only transformers. Every word they write is a next-token prediction, one after another, until the response is done.
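The difference between the first two variants comes down to which words each position is allowed to look at. A minimal sketch, with hypothetical helper names, where `True` means “may attend to”:

```python
def bidirectional_mask(n):
    """Encoder-only (BERT-style): every position sees every position."""
    return [[True] * n for _ in range(n)]


def causal_mask(n):
    """Decoder-only (GPT-style): each position sees itself and everything to its left."""
    return [[col <= row for col in range(n)] for row in range(n)]


tokens = ["The", "chef", "tasted"]
for name, mask in [("encoder", bidirectional_mask(3)), ("decoder", causal_mask(3))]:
    print(name)
    for token, row in zip(tokens, mask):
        visible = [t for t, allowed in zip(tokens, row) if allowed]
        print(f"  {token:>7} sees {visible}")
```

In the decoder mask, “The” sees only itself while “tasted” sees all three words. That restriction is what lets the model be trained to predict each next token without peeking at the answer.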

Encoder-decoder (T5, the original Transformer for translation)

The encoder reads the input and creates a compressed representation. The decoder uses that representation to generate the output. Translation and summarisation tools often use this architecture. The encoder handles the source language; the decoder produces the target.

Multimodal transformers (GPT-4V, Gemini)

The same attention mechanism that works on words works on image patches. Multimodal models treat an image as a sequence of patches and process them alongside text tokens. You can describe a photo, answer questions about a diagram, or read a chart, all with one model.
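Turning an image into a sequence of patches is simple to sketch. Assumptions: the image is a small grid of numbers whose height and width divide evenly by the patch size, and `image_to_patches` is a hypothetical helper. Real models also pass each patch through a learned layer to turn it into a vector, which is omitted here.

```python
def image_to_patches(image, patch_size):
    """Split a 2-D grid of pixel values into a left-to-right, top-to-bottom list of patches."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# A made-up 4x4 "image" with pixel values 0..15.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
patches = image_to_patches(image, patch_size=2)

# Image patches and text tokens end up in one sequence for the model.
sequence = [("image", p) for p in patches] + [("text", w) for w in ["What", "is", "this", "?"]]
for kind, item in sequence:
    print(kind, item)
```

Once the patches sit in the same sequence as the words, attention can relate a word in the question to a region of the picture.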

The transformer architecture handles more than text. Modern models turn voice, images, and video into tokens, much as they tokenise words. Gemini and Claude can read your screenshots. Voice models like ChatGPT Advanced Voice and Gemini Flash Live handle audio without transcribing to text first. Tokens now represent any kind of input.

The breaking point

A transformer has no clock. It has no memory between conversations. Everything it “knows” is baked into its weights at training time. The thing that feels like memory is a long input: the app resends the conversation so far with every new message. When that input outgrows the context window, the earliest tokens get dropped.
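A minimal sketch of that limit, assuming the simplest possible policy of keeping only the most recent tokens. Real products differ, and some summarise old turns instead of cutting them. Words stand in for tokens here, the 8-token window is made up, and `fit_to_context` is a hypothetical helper.

```python
def fit_to_context(tokens, max_tokens):
    """Keep only the most recent tokens that fit in the window."""
    start = max(0, len(tokens) - max_tokens)
    return tokens[start:]


turns = ["My name is Ada .", "I like chess .", "What is my name ?"]

history = []
for turn in turns:
    history.extend(turn.split())  # the whole conversation is resent each time

visible = fit_to_context(history, max_tokens=8)
print(visible)
print("Ada" in visible)  # False: the name fell out of the window
```

The model is asked for a name it can no longer see. Nothing was forgotten in the human sense. The words were simply never in the input.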

Why this feels different from older AI

Older AI did one thing well. A spam filter recognised spam. A translator translated. Each was trained on one task. LLMs broke that pattern. One model handles translation, summarisation, code, conversation, and analysis. The same weights do all of it. Nobody programmed those tasks in one by one. The generality came from training on broad data at scale.

Your takeaway

Every response from every chatbot, every AI writing tool, every autocomplete suggestion on your phone is a machine guessing the next word.

Use this page with your AI

Copy the prompt below. Paste it into Claude, Copilot, Gemini, or any AI you use. The AI will ask you simple questions, then teach the page back to you using your own work as the example.

You are an expert teacher and AI strategist. Read the page below as your reference material.

PAGE: LLMs (normal AI)

CONTENT:
## Essence
An LLM is a machine that guesses the next word. LLM means large language model. It reads everything that came before, weighs each word against every other word, and picks the most probable word to follow. Do that a few hundred times in a row and you have a paragraph. Do it billions of times during training and you get ChatGPT.

## The process
The transformer architecture has one key idea: attention. Before picking the next word, the model scores every word in the context for how relevant it is to the prediction it is about to make. Words that matter more get higher scores. Irrelevant words get scores close to zero and are effectively ignored.

In the sentence "The chef tasted the ???" the model attends to "chef" and "tasted." Those two words almost guarantee the answer is a food. In "The pilot radioed the ???" it attends to "pilot" and "radioed" instead.

This happens across many different "heads." Each attention head looks for a different kind of relationship: subject-verb agreement, pronoun references, topic cues. The outputs get combined, passed through a feedforward layer, and the process repeats for every layer of the network, sometimes 96 layers deep.

GPT-4 is reported to have about 1.8 trillion parameters, though OpenAI has not confirmed the figure. Each one is a weight. Each weight was nudged again and again during training, a tiny step at a time, until the model's next-word guesses became accurate.

## A familiar example
Think of finishing a sentence for someone mid-thought. You do not process each word in isolation. You hold the whole sentence in your head and every word you already heard informs what comes next. If they say "I need to catch the..." you hear "catch" and your brain is already preparing for "bus," "train," or "flight." Attention works the same way, with numbers.

## Variants include
- Encoder-only (BERT). BERT reads a sentence and produces a meaning vector for each word. It does not generate text. It understands it. Search engines use BERT to understand your query. Sentiment classifiers, document taggers, and entity extractors are often fine-tuned BERTs.
- Decoder-only (GPT). GPT reads everything to the left and generates one token at a time. ChatGPT, Claude, and Gemini are all decoder-only transformers. Every word they write is a next-token prediction, one after another, until the response is done.
- Encoder-decoder (T5, the original Transformer for translation). The encoder reads the input and creates a compressed representation. The decoder uses that representation to generate the output. Translation and summarisation tools often use this architecture.
- Multimodal transformers (GPT-4V, Gemini). The same attention mechanism that works on words works on image patches. Multimodal models treat an image as a sequence of patches and process them alongside text tokens. You can describe a photo, answer questions about a diagram, or read a chart, all with one model.

## The breaking point
A transformer has no clock. It has no memory between conversations. Everything it "knows" is baked into its weights at training time. The thing that feels like memory is a long input. Context window limits force the model to drop early tokens.

## Why this feels different from older AI
Older AI did one thing well. A spam filter recognised spam. A translator translated. Each was trained on one task. LLMs broke that pattern. One model handles translation, summary, code, conversation, and analysis. The same weights do all of it. Training data alone produced this generality.

## Your takeaway
Every response from every chatbot, every AI writing tool, every autocomplete suggestion on your phone is a machine guessing the next word.

Your job:

1. Ask me 3 to 5 simple questions about my work, my situation, and what I would actually use AI for. One question at a time. Wait for my answer between each.
2. Once you have my answers, explain the key ideas from the page back to me, using my answers as the example.
3. Suggest one concrete next step I could take this week. Tie it back to the page.
4. Push back if my answer is vague. Ask me to be specific.

Rules for you:
- Do not flatter me.
- Do not agree with me when I am wrong.
- If you do not know something, say so.
- Be brief. Two paragraphs at most per turn.
- Ask one question at a time. Do not stack questions.
