The Economist
April 22nd 2023
Science & technology
“Language Models are Few-Shot Learners”.
Before it sees any training data, the weights in GPT-3’s neural network are mostly random. As a result, any text it generates will be gibberish. Pushing its output towards something which makes sense, and eventually something that is fluent, requires training.
GPT-3 was trained on several sources of data, but the bulk of it comes from snapshots of the entire internet between 2016 and 2019 taken from a database called Common Crawl. There’s a lot of junk text on the internet, so the initial 45 terabytes were filtered using a different machine-learning model to select just the high-quality text: 570 gigabytes of it, a dataset that could fit on a modern laptop. In addition, GPT-4 was trained on an unknown quantity of images, probably several terabytes. By comparison AlexNet, a neural network that reignited image-processing excitement in the 2010s, was trained on a dataset of 1.2m labelled images, a total of 126 gigabytes—less than a tenth of the size of GPT-4’s likely dataset.
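The filtering idea can be sketched in a few lines. The real pipeline used a trained machine-learning classifier; here a crude, made-up `quality_score` (the share of purely alphabetic words) stands in for it, purely for illustration.

```python
# A minimal sketch of web-text filtering. The real filter was a trained
# classifier; quality_score below is a hypothetical stand-in.

def quality_score(document):
    """Crude proxy for text quality: fraction of alphabetic words."""
    words = document.split()
    if not words:
        return 0.0
    return sum(w.isalpha() for w in words) / len(words)

crawl = [
    "Ice cream is a frozen dessert made from milk and cream",
    "click HERE >>> $$$ w1n fr33 pr1zes $$$ <<<",
]

# Keep only documents that score above a threshold.
kept = [doc for doc in crawl if quality_score(doc) > 0.8]
```

Run at internet scale, a filter like this is what shrank 45 terabytes of crawled text down to 570 gigabytes of higher-quality material.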
To train, the LLM quizzes itself on the text it is given. It takes a chunk, covers up some words at the end, and tries to guess what might go there. Then the LLM uncovers the answer and compares it to its guess. Because the answers are in the data itself, these models can be trained in a “self-supervised” manner on massive datasets without requiring human labellers.
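The quizzing step above can be sketched as code: slide over a list of words, and for each position pair the visible chunk with the covered-up word that follows it. The labels come from the text itself, which is why no human annotator is needed.

```python
# A minimal sketch of self-supervised example creation for
# next-word prediction. Function and variable names are illustrative.

def make_examples(tokens, context_len=4):
    """Pair each chunk of context with the word the model must guess."""
    examples = []
    for i in range(context_len, len(tokens)):
        context = tokens[i - context_len:i]  # the visible chunk
        target = tokens[i]                   # the covered-up word
        examples.append((context, target))
    return examples

text = "the model quizzes itself on the text it is given".split()
for context, target in make_examples(text)[:3]:
    print(context, "->", target)
```

A single sentence already yields several training examples, which is how a fixed corpus can supply billions of guess-and-check rounds.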
The model’s goal is to make its guesses
as good as possible by making as few errors
as possible. Not all errors are equal,
though. If the original text is “I love ice
cream”, guessing “I love ice hockey” is better
than “I love ice are”. A number called the
loss captures how bad each guess is.
After a few guesses, the loss is sent back
into the neural network and used to nudge
the weights in a direction that will produce
better answers.
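A toy calculation makes the loss and the nudge concrete. This is not GPT’s actual training code; the numbers are invented. A standard choice of loss is the negative log of the probability the model assigned to the true next word, so confidently wrong guesses are penalised hardest, and each weight is then nudged against the slope of the loss.

```python
import math

# Toy illustration with made-up numbers: the loss for one guess is the
# negative log of the probability the model gave to the true next word.

def loss(prob_of_truth):
    return -math.log(prob_of_truth)

# A model that rated "cream" likely is penalised far less than one
# that gave it almost no probability:
good_guess = loss(0.5)   # model put 50% on the true word
bad_guess = loss(0.01)   # model put 1% on the true word

# Gradient descent then nudges each weight against the slope of the
# loss; a small learning rate keeps the nudges gentle.
weight = 0.8
gradient = 0.25          # d(loss)/d(weight), from backpropagation
learning_rate = 0.1
weight -= learning_rate * gradient   # weight is now 0.775
```

Repeated over billions of guesses, these tiny nudges are what turn mostly random weights into a fluent model.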