
LLMs (normal AI)

(the context machines)

An LLM is a machine that guesses the next word. That is it. That is the whole trick. It reads everything that came before, weighs each word against every other word, and picks the most likely word to follow. Do that a few hundred times in a row and you have a paragraph. Do it billions of times during training and you get ChatGPT.
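That guessing loop can be sketched in a few lines. This is a toy, with a hand-written probability table standing in for the billions of weights a real model uses; the words and numbers are made up for illustration.

```python
# Hypothetical next-word probabilities; a real model computes these
# from its weights instead of looking them up in a table.
NEXT_WORD = {
    ("I", "need", "to", "catch", "the"): {"bus": 0.5, "train": 0.3, "flight": 0.2},
}

def next_word(context, table):
    """Pick the most likely word to follow the context."""
    probs = table.get(tuple(context[-5:]))  # weigh the words that came before
    if probs is None:
        return None
    return max(probs, key=probs.get)  # greedy: take the top guess

print(next_word(["I", "need", "to", "catch", "the"], NEXT_WORD))  # bus
```

A real model also sometimes samples a lower-ranked word instead of always taking the top one, which is why the same prompt can produce different answers.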

[Interactive demo: pick a sentence and the bars show how much attention each word gets. For the example shown, the top-3 next-word guesses are "laptop" (34%), "email" (22%), and "door" (18%).]

The process

The transformer architecture has one key idea: attention. Before picking the next word, the model scores every other word in the context against the word it is about to predict. Words that matter more get higher scores. Words that are irrelevant get ignored.

In the sentence “The chef tasted the ???” the model attends heavily to “chef” and “tasted.” Those two words almost guarantee the answer is a food. In “The pilot radioed the ???” it attends to “pilot” and “radioed” instead. The context shifts. The prediction shifts.
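The scoring step above can be sketched with tiny made-up vectors: each word's score is a dot product against the current position, and the scores are squashed into weights that sum to 1 (a softmax). The vectors here are invented for illustration, not real model values.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys):
    """Score the query against every context word, then normalise."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    return softmax(scores)

# Made-up vectors: "chef" and "tasted" point the same way as the query.
keys = {"The": [0.0, 0.1], "chef": [1.0, 0.9], "tasted": [0.9, 1.0], "the": [0.1, 0.0]}
query = [1.0, 1.0]  # the position about to predict "???"
weights = attention_weights(query, list(keys.values()))
for word, w in zip(keys, weights):
    print(f"{word:>7}: {w:.2f}")  # "chef" and "tasted" dominate
```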

This happens simultaneously across many different “heads.” Each attention head looks for a different kind of relationship: subject-verb agreement, pronoun references, topic cues. The outputs get combined, passed through a feedforward layer, and the process repeats for every layer of the network, sometimes 96 layers deep.
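A head's output is just a weighted average of value vectors, and the heads' outputs are concatenated. A minimal sketch, with two hypothetical heads and invented weights:

```python
def head_output(weights, values):
    """Weighted average of value vectors: one attention head's output."""
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

# Four words, each with a made-up 2-number value vector.
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.2, 0.8]]

head1 = head_output([0.1, 0.6, 0.2, 0.1], values)  # e.g. tracks the subject
head2 = head_output([0.4, 0.1, 0.1, 0.4], values)  # e.g. tracks the verb

combined = head1 + head2  # concatenated; a real model then mixes this
print(combined)           # with a learned linear layer, per layer, per token
```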

GPT-4 has an estimated 1.8 trillion parameters. Each one is a weight. Each weight was nudged billions of times during training, until the model’s next-word guesses became accurate.
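One of those nudges looks like this in miniature: predict, measure the error, move the weight a small step against it. A single made-up weight and target stand in for trillions.

```python
def nudge(weight, inp, target, lr=0.1):
    """One training step: prediction, error, small correction."""
    prediction = weight * inp
    error = prediction - target
    return weight - lr * error * inp  # step the weight against the error

w = 0.0
for _ in range(100):            # a real model repeats this billions of times
    w = nudge(w, inp=1.0, target=0.7)
print(round(w, 2))              # the weight settles near the target: 0.7
```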

A familiar example

Think of finishing a sentence for someone mid-thought. You do not process each word in isolation. You hold the whole sentence in your head and every word you already heard informs what comes next. If they say “I need to catch the...” you hear “catch” and your brain is already preparing for “bus,” “train,” or “flight.” Attention works the same way, just with numbers.

Variants include

Encoder-only (BERT)

BERT reads a sentence and produces a meaning vector for each word. It does not generate text. It understands it. Search engines use BERT to understand your query. Sentiment classifiers, document taggers, and entity extractors are usually fine-tuned BERTs.

Decoder-only (GPT)

GPT reads everything to the left and generates one token at a time. ChatGPT, Claude, and Gemini are all decoder-only transformers. Every word they write is a next-token prediction, one after another, until the response is done.
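The generation loop itself is short. A sketch of decoder-only generation, with a lookup table playing the part of the model (the table and sentence are invented for illustration):

```python
def generate(prompt, predict_next, max_tokens=10):
    """Decoder-only generation: append one predicted token at a time."""
    tokens = prompt[:]
    for _ in range(max_tokens):
        nxt = predict_next(tokens)  # the model sees everything to the left
        if nxt is None:
            break                   # the model decided the response is done
        tokens.append(nxt)
    return tokens

# A stand-in "model": a lookup on the most recent token.
FOLLOW = {"The": "chef", "chef": "tasted", "tasted": "the", "the": "soup", "soup": None}
print(" ".join(generate(["The"], lambda toks: FOLLOW.get(toks[-1]))))
# The chef tasted the soup
```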

Encoder-decoder (T5, the original Transformer for translation)

The encoder reads the input and creates a compressed representation. The decoder uses that representation to generate the output. Translation and summarisation tools often use this architecture. The encoder handles the source language; the decoder produces the target.

Multimodal transformers (GPT-4V, Gemini)

The same attention mechanism that works on words works on image patches. Multimodal models treat an image as a sequence of patches and process them alongside text tokens. You can describe a photo, answer questions about a diagram, or read a chart, all with one model.
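Turning an image into a sequence is mechanical: cut the pixel grid into patches and flatten each one into a "token". A minimal sketch on a made-up 4x4 grid of pixel values:

```python
def patchify(image, patch=2):
    """Split a 2-D grid of pixels into flat patches, read like tokens."""
    rows, cols = len(image), len(image[0])
    patches = []
    for r in range(0, rows, patch):
        for c in range(0, cols, patch):
            patches.append([image[r + i][c + j]
                            for i in range(patch) for j in range(patch)])
    return patches

image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
print(patchify(image))  # 4 patches of 4 pixels, fed to attention like words
```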

The breaking point

A transformer has no clock. It does not know what time it is. It has no memory between conversations. Every time you start a new chat, it starts from scratch. Everything it “knows” is baked into its weights at training time. The thing that feels like memory is just a long input. When the context window fills up, it forgets the beginning.
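That forgetting is literal truncation. A sketch, assuming a tiny 8-token window and a made-up chat, of how the earliest words simply fall off:

```python
def fit_context(messages, window=8):
    """Keep only the most recent tokens that fit in the context window."""
    tokens = [tok for msg in messages for tok in msg.split()]
    return tokens[-window:]  # everything earlier is dropped, not summarised

chat = ["hello there",
        "please remember the code word is mango",
        "what is the weather like today"]
print(fit_context(chat))  # the opening words are gone
```

Real models count subword tokens rather than whole words, and windows run to hundreds of thousands of tokens, but the cliff at the edge works the same way.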

Your takeaway

Every response from every chatbot, every AI writing tool, every autocomplete suggestion on your phone is a machine guessing the next word. The main difference between a bad AI and a good one is how many times it has practiced that guess.
