Training
(how they learn)
Training is the process that turns a blank network into a useful one. It is a loop. The model makes a guess. The guess gets scored. The score tells the model how wrong it was. The model adjusts every weight slightly to be less wrong next time. Then it does that again. A trillion times. That is training.
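The loop above can be sketched in a few lines. This is a minimal illustration, not how real systems are built: one weight, a squared-error score, and a made-up dataset where the correct relationship is y = 3x.

```python
# A minimal sketch of the training loop: guess, score, adjust, repeat.
# One weight, squared-error loss, toy data following y = 3x.
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]  # (input, correct answer)

w = 0.5              # an arbitrary starting value
learning_rate = 0.01

for step in range(1000):               # the loop, repeated many times
    for x, y_true in data:
        guess = w * x                  # the model makes a guess
        error = guess - y_true         # the guess gets scored
        gradient = 2 * error * x       # how wrong, and in which direction
        w -= learning_rate * gradient  # adjust the weight to be less wrong

print(round(w, 3))  # → 3.0: the weight has learned the relationship
```

A real model does exactly this, just with billions of weights instead of one.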
The process
Every weight in a neural network starts as a random number. The forward pass runs an input through the network and produces a prediction. The loss function measures how far off that prediction was from the correct answer. The further off, the higher the loss.
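The forward pass and the loss can be shown with a single made-up example. The "network" here is a lone neuron with one randomly initialised weight; the target value is an assumption for illustration.

```python
import random

# One forward pass and its loss, for a one-neuron "network".
random.seed(0)
w = random.uniform(-1, 1)   # every weight starts as a random number

x, y_true = 2.0, 6.0        # one labelled example (the answer is 6.0)

prediction = w * x                 # forward pass: input runs through the network
loss = (prediction - y_true) ** 2  # squared error: further off, higher loss

print(loss > 0)  # → True: the random weight's guess is wrong
```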
Backpropagation works backwards through the network. It calculates how much each weight contributed to the error. Weights that caused more error get nudged more. Weights that were fine get nudged less. This nudge is the learning step.
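The "nudge in proportion to blame" idea can be demonstrated with two weights, where one input drives the output far more than the other. The numbers here are invented for illustration.

```python
# Two weights, one prediction: prediction = w1*x1 + w2*x2.
w1, w2 = 0.5, 0.5
x1, x2 = 4.0, 0.1          # x1 influences the output far more than x2
y_true = 1.0
lr = 0.01

prediction = w1 * x1 + w2 * x2
error = prediction - y_true

# Backpropagation assigns blame: each weight's gradient is the error
# scaled by how much that weight influenced the output.
grad_w1 = 2 * error * x1
grad_w2 = 2 * error * x2

w1 -= lr * grad_w1         # the learning step
w2 -= lr * grad_w2

# The weight that contributed more to the error moved more.
print(abs(grad_w1) > abs(grad_w2))  # → True
```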
The size of each nudge is called the learning rate. Too large and the model overshoots and never settles. Too small and training takes forever. Getting this right is half the work.
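Both failure modes are easy to reproduce. The sketch below minimises the loss (w − 3)², whose gradient is 2(w − 3), with three assumed learning rates.

```python
# Gradient descent on loss = (w - 3)^2, starting from w = 0.
def train(lr, steps=50):
    w = 0.0
    for _ in range(steps):
        gradient = 2 * (w - 3.0)   # derivative of (w - 3)^2
        w -= lr * gradient
    return w

too_large  = train(lr=1.5)    # overshoots further every step, diverges
too_small  = train(lr=0.001)  # creeps toward 3, nowhere near after 50 steps
just_right = train(lr=0.1)    # settles at 3
```

With lr=1.5 each step flips the sign of the error and doubles it, so the weight explodes; with lr=0.001 it barely moves; lr=0.1 lands on the answer.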
Two ways to train
Classical training
You show the model labelled examples. It makes a guess. The loss measures how wrong it was. The weight update nudges every dial to make the next guess slightly better. Repeat until the loss is small.
A decision tree trains in seconds. A spam filter trains in minutes.
During training, a large language model sees each token in its training data and must predict the next one. Recent models do this across roughly 15 trillion tokens, orders of magnitude more text than a person could read in a lifetime.
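The task itself fits in a few lines. Real LLMs learn it by gradient descent over trillions of tokens; the toy model below just counts which token follows which in a made-up corpus, but the objective — given this token, predict the next — is the same.

```python
from collections import Counter, defaultdict

# A toy next-token predictor: a bigram table built by counting.
corpus = "the cat sat on the mat the cat ran".split()

next_counts = defaultdict(Counter)
for token, next_token in zip(corpus, corpus[1:]):
    next_counts[token][next_token] += 1   # every adjacent pair is a lesson

def predict(token):
    # Guess the most frequently observed next token.
    return next_counts[token].most_common(1)[0][0]

print(predict("the"))  # → "cat" (follows "the" twice; "mat" only once)
```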
A familiar example
Think of a student studying for an exam with a practice paper. They answer a question. They check the mark scheme. They see which parts they got wrong. They go back and study those parts harder. Then they take another practice paper. Then another. After enough practice papers, the exam is easy. Training is that loop, but instead of a student and a practice paper, it is a model and 15 trillion tokens.
The surprising part
Nobody programs the model to learn facts about chemistry, history, or code. It picks those up as side effects of getting better at predicting the next word. A model trained to complete sentences about medicine learns medical facts, not because anyone told it to, but because those facts make its predictions more accurate. What emerges from the training loop is not what anyone designed.
Your takeaway
The model you are talking to was shaped entirely by what it was wrong about, billions of times, on text it will never see again. Every capability it has, and every gap, traces back to what it was trained to predict.