Everything Product People Need to Know About Transformers (Part 2: GPT)
Or, How to Act Like You Know About GPT-3
This is Part 2 in the three-part series on Transformers for Product People. Click here for Part 1, and here for Part 3. This article builds on concepts and information covered in Part 1, so if you aren't yet familiar with transformers, starting from Part 1 is recommended.

There is a face to all the developments happening in NLP right now, and that face is GPT-3. Why? Because GPT-3 has some fairly scary generative capabilities. There have been provocative headlines: “Meet GPT-3. It Has Learned to Code” [1], “A Robot Wrote This Article. Are You Scared Yet?” [2], and “OpenAI’s GPT-3 Writes Poetry, Music, and Code” [3] (I actually recommend the third article for the best walkthrough of GPT-3’s capabilities). It’s true, GPT-3 does some pretty cool things. My favorite applications of GPT-3 are:
- Generating a full email from a short list of bullet points, put together by https://magicemail.io/.
- Designing a software application using natural language. Tell GPT-3 what you want, and it will generate the Figma code. This use case is covered in two of the above articles because it’s really quite crazy.
- Using GPT-3 to generate click-baity titles for posts. This author used GPT-3 to push his posts to the top of Hacker News. Quite impressive and barely more dystopian than the status quo.
But why exactly is GPT-3 the model getting all this attention? Is GPT all you need to know? No.
GPT-3 has achieved headline-grabbing performance most of all by being huge. GPT-3 dwarfs all other current models in size: 175 billion parameters, roughly 10x the next-largest model (Microsoft’s 17-billion-parameter Turing-NLG [4]) and over 500x the largest BERT. This basically means OpenAI paid more money to make GPT-3 more powerful than the competition. Creating a bigger model that works better is still impressive, though. GPT-3 demonstrates that language model performance scales as a power law of model size, dataset size, and the amount of computation. In layman’s terms, throwing more resources at GPT will make it more powerful, but you get a seriously diminishing bang for your buck the more you spend. And how many bucks are we talking to train GPT-3? Reportedly, in the tens of millions (USD, for all my international readers) [1].
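To put rough numbers on that diminishing return, here is a back-of-the-envelope sketch in Python. The power-law exponent (roughly 0.076 for parameter count) is taken from OpenAI’s scaling-laws paper, not from this article, so treat the figures as illustrative rather than exact.

```python
# Rough illustration of power-law scaling of language-model loss with model size.
# alpha ~ 0.076 is reported in OpenAI's scaling-laws work (Kaplan et al., 2020);
# it is an outside assumption, not a figure from this post.
alpha = 0.076
params_gpt2 = 1.5e9    # GPT-2
params_gpt3 = 175e9    # GPT-3

# Loss scales roughly as N^(-alpha): ~117x more parameters improves loss by only
# a factor of about 1.44, i.e. loss drops to roughly 70% of its previous value.
improvement = (params_gpt3 / params_gpt2) ** alpha
ratio = params_gpt3 / params_gpt2
print(f"Loss improves by about {improvement:.2f}x for {ratio:.0f}x more parameters")
```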
To get a sense of how big GPT-3 is, look at this chart from the beginning of 2020:

The original GPT model (second from left) has 110M parameters. New models trended larger, though some researchers also began scaling existing models down while maintaining performance. Takeaway: researchers in 2019 thought 1.5B parameters was too big. On the far right is a model from Microsoft, Turing-NLG, with 17B parameters [4]; this is really big, they thought. To reiterate, GPT-3 has 175B parameters. Basically, make this chart 10x taller and you could fit GPT-3. So really, really huge is the point I’m trying to make.
While GPT-3 doesn’t deserve all the attention, it deserves plenty. GPT-1 created the generative pre-training framework, which made transformer models useful across virtually every natural language processing task. To understand the breakthroughs happening in NLP, you need to understand the generative pre-training (GPT) framework. GPT is very powerful for three reasons:
- Training is easy
- Learning can be transferred to numerous downstream tasks
- It effectively generates text
All of this is contained in the name: generative pre-training. So to explain the model, I’m going to walk through the name.
Generative
The GPT architecture was developed by adapting the transformer to the task of generating Wikipedia articles from source documents. A model is trained on the relationship between the text of a Wikipedia article’s cited sources and the body of the article itself; it can then ingest the cited documents for a new article and attempt to generate that article’s body. This experiment yielded the transformer-decoder adaptation. In the words of the study [5]:
“We suspect that for monolingual text-to-text tasks redundant information is re-learned about language in the encoder and decoder…We introduce a simple but effective modification to T-ED [the transformer encoder-decoder] for long sequences that drops the encoder module (almost reducing model parameters by half for a given hyper-parameter set), combines the input and output sequences into a single “sentence” and is trained as a standard language model.”
The decoder components of encoder-decoder models are sometimes referred to as generators, and they function well as standalone components. This generator functions as a language model, which means that it is optimized to predict the next word in a sentence. This is key for understanding the pros and cons of GPT.
Attention in this model functions unidirectionally, meaning that the model can only look at previous words when predicting the next word, and not the words that will follow the mystery token. In BERT, which I will cover later, attention is bidirectional. To train BERT, words in the middle of a sentence are masked, and the model uses information from the entire sentence (both sides of the word) to predict the masked word. This difference means that GPT is optimized for language tasks in which unidirectional information is sufficient. Basically, language modeling.
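For the more hands-on reader, here is a tiny sketch of what “unidirectional” means in practice: a causal mask that hides every future position. This is a generic PyTorch illustration of the idea, not code from GPT itself.

```python
import torch

# A causal (unidirectional) attention mask for a 5-token sequence: position i
# may only attend to positions 0..i. GPT-style decoders apply a mask like this,
# which is what lets them be trained purely as next-word predictors.
seq_len = 5
causal_mask = torch.tril(torch.ones(seq_len, seq_len))
print(causal_mask)
# tensor([[1., 0., 0., 0., 0.],
#         [1., 1., 0., 0., 0.],
#         [1., 1., 1., 0., 0.],
#         [1., 1., 1., 1., 0.],
#         [1., 1., 1., 1., 1.]])
```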

Language modeling is a recursive task: each predicted word is fed back in to help predict the next one. This enables the model to A) train on any text, because the text itself supplies the labels, and B) generate new text of arbitrary length (with the caveat that the longer the generated text, the less it depends on the seed sequence, and the more randomness propagates through the generated content). Together, these two abilities are extremely powerful for transfer learning.
In summary: GPT adapts the transformer model to continually predict the next word in a sentence. It is trained to understand language by predicting what comes next based on what has come before. Given a seed sentence, the model can then generate new text by continually predicting the word that should follow.
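If you want to see this next-word loop in action, here is a minimal sketch using the openly available GPT-2 model through the Hugging Face transformers library (GPT-3 itself is only reachable through OpenAI’s API); the seed sentence is made up for illustration.

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Seed sentence (illustrative); the model continues it one token at a time.
seed = "The product manager opened the meeting by"
input_ids = tokenizer.encode(seed, return_tensors="pt")

# generate() repeatedly predicts the most likely next token and appends it.
output_ids = model.generate(input_ids, max_length=30, do_sample=False)
print(tokenizer.decode(output_ids[0]))
```

Which brings us to: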
Pre-training
NLP solutions have traditionally been beyond the scope of most business problems because of cost: developing a model required a large corpus of labeled training data. Enter transfer learning. Researchers at OpenAI sought to train the transformer-decoder as a language model on a large corpus of text, and then transfer that knowledge to a downstream task. The trained language model would still need to be fine-tuned on labeled data, but on a much smaller set, because it could leverage what it had already learned about language. Basically, if you need me to solve a riddle in English, I need to understand the riddle, but I also need to understand English. GPT is meant to be an off-the-shelf model that already understands English and just needs to be fine-tuned on the specific challenge it needs to solve. I.e., it transfers its learning of English to the harder task of solving the riddle in English.
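As a concrete (and heavily simplified) sketch of what that fine-tuning step might look like, here is how you could bolt a small task-specific head onto a pre-trained GPT-2 backbone using the Hugging Face transformers library. The task (spam detection) and every name here are illustrative assumptions, not details from the GPT papers.

```python
import torch
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
backbone = GPT2Model.from_pretrained("gpt2")      # the part that already "knows English"

# A tiny task-specific head: this is what your (small) labeled dataset trains.
classifier = torch.nn.Linear(backbone.config.n_embd, 2)  # e.g. spam vs. not spam

def logits_for(text: str) -> torch.Tensor:
    tokens = tokenizer(text, return_tensors="pt")
    hidden = backbone(**tokens).last_hidden_state         # (1, seq_len, hidden_dim)
    return classifier(hidden[:, -1, :])                   # read off the last token

# Fine-tuning updates the head (and optionally the backbone) on thousands of
# labeled examples rather than the billions of words used in pre-training.
optimizer = torch.optim.AdamW(
    list(backbone.parameters()) + list(classifier.parameters()), lr=5e-5
)
```

The point is the asymmetry: the backbone was pre-trained on enormous amounts of text you never have to label, while the head only needs your comparatively tiny labeled dataset.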

You can ask, “if the model knows English, shouldn’t it also have some knowledge of its own? Words are signifiers, after all, and when I form a sentence there is an idea behind it.” That is the hypothesis behind GPT-3, a scaled-up version of GPT (roughly 1000x the parameters of the original) that turns out to be capable of inference-time fine-tuning.
Aside: This is also the hypothesis behind the argument that GPT-3 borders on true artificial intelligence, which is buttressed by its uncanny ability to generate believable content. I will note, however, that human use of language functions quite differently from recursively picking the most likely next word. Additionally, with 175B parameters, GPT-3 has stored quite a bit of content, so on the spectrum from a book to a person, I’d say the intelligence of GPT-3 is closer to the book.
Inference-time fine-tuning means that the model does not need to train on labeled data and back-propagate error to adjust its parameters. Instead, it leverages what the authors call “meta-learning, which in the context of language models means the model develops a broad set of skills and pattern recognition abilities at training time, and then uses those abilities at inference time to rapidly adapt to or recognize the desired task” [6]. More specifically, they state: “For each task, we evaluate GPT-3 under 3 conditions: (a) ‘few-shot learning’, or in-context learning where we allow as many demonstrations as will fit into the model’s context window (typically 10 to 100), (b) ‘one-shot learning’, where we allow only one demonstration, and (c) ‘zero-shot’ learning, where no demonstrations are allowed and only an instruction in natural language is given to the model.”
Here’s what this means: the model is so good, you don’t even need to fine-tune it on your task using supervised learning. If you don’t know what that means, you don’t need to know what that means anymore! In few-shot learning, you just feed it a few examples of what you want it to do along with your actual problem. In one-shot learning, you give it one example. In zero-shot, you just say, for example: “is this sentence grammatically correct: I ated all my muffins.” GPT-3 can manage that.
Not needing to fine-tune the model at training time means that GPT-3 can truly be used off the shelf through an API. You might as well try getting it to do just about anything.
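Here is roughly what those three settings look like as raw prompts. The wording is my own illustration; the only thing that changes between them is how many worked examples you pack into the text before your actual question.

```python
# Zero-shot: just ask.
zero_shot = "Is this sentence grammatically correct: I ated all my muffins."

# One-shot: one worked example, then your question.
one_shot = """Correct the grammar of each sentence.

Sentence: She go to the store yesterday.
Corrected: She went to the store yesterday.

Sentence: I ated all my muffins.
Corrected:"""

# Few-shot: as many demonstrations as fit in the context window (typically 10-100).
few_shot = """Correct the grammar of each sentence.

Sentence: She go to the store yesterday.
Corrected: She went to the store yesterday.

Sentence: Him and me was late.
Corrected: He and I were late.

Sentence: I ated all my muffins.
Corrected:"""

# Any of these strings is sent to the hosted model as-is (e.g. through OpenAI's
# completions API); no parameters of GPT-3 are ever updated.
```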
Applications Revisited
In general, the applications of GPT follow directly from its architecture. Due to the size of GPT-3, however, people have found that the model has some surprising abilities. I will also note that while GPT-3 currently has all the media hype, some tasks are better suited to BERT-based architectures.
As a language model, GPT is best suited to text prediction tasks. Examples of text prediction tasks are:
1. Autocomplete. Gmail’s Smart Compose leverages the transformer decoder as a language model to predict what the user will say next. Smart Compose also uses the transformer encoder to pass contextual information (previous email, time of day, etc.) to the decoder for more personalized suggestions.
GPT has the model complexity to generate full texts from prompts. Two examples cited above are generating full emails from bullet points and generating click-baity titles for posts.
2. Summarization. As in the original paper, Generating Wikipedia by Summarizing Long Sequences [5], GPT is ideally suited to summarization tasks. You seed the model with the long-form text and then have it generate a summary word by word. In an inference-time setting, this is achieved by simply adding a “summary:” instruction at the beginning of the output sequence (a short prompt sketch appears after this list). If you didn’t understand what that meant, don’t worry, it’s only relevant for the people literally running the models or trying to understand how they’re run.
3. Evaluation. GPT-3 can effectively be used to solve challenges like this one: “The lions ate the zebras because they are carnivores. They refers to: _____”. This can be framed as an autocomplete problem, since it is still next-word prediction, but getting the next word right requires understanding the sentence quite well. This type of challenge is called the Winograd Schema Challenge, and GPT-3 performs well on it.
4. Translation. GPT-3 also performs well in tasks that are arguably “multilingual”. This post shows different people applying GPT-3 to generate JSX/React, Figma code (probably CSS), Keras code, and more from natural-language prompts [7]. These are basically translation tasks, and GPT-3 performs them with no training-time learning. While further developments in transformer scaling and training will likely yield better performance specifically on translation tasks, these results show that GPT-3 is already viable as a tool for “translating” from one language or syntax to another.
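To make items 2 and 4 concrete, here is roughly how those tasks are framed as plain next-word prediction. The prompt wording and the React example are my own illustration, not taken from the cited posts.

```python
long_article = "..."  # the long-form source text would go here

# Summarization: put the instruction at the start of the output sequence, so the
# model's continuation *is* the summary.
summarization_prompt = f"{long_article}\n\nsummary:"

# "Translation" into code: describe what you want, label the target syntax,
# and let the model continue word by word.
translation_prompt = (
    "Translate the description into a React component.\n\n"
    'Description: a button labelled "Subscribe" that turns green when clicked.\n'
    "React:"
)
```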
References
[1] Metz, Cade. “Meet GPT-3. It Has Learned to Code (and Blog and Argue).” The New York Times, The New York Times, 24 Nov. 2020, www.nytimes.com/2020/11/24/science/artificial-intelligence-ai-gpt3.html.
[2] “A Robot Wrote This Entire Article. Are You Scared Yet, Human? | GPT-3.” The Guardian, Guardian News and Media, 8 Sept. 2020, www.theguardian.com/commentisfree/2020/sep/08/robot-wrote-this-article-gpt-3.
[3] “OpenAI’s GPT-3 Neural Network Writes Poetry, Music and Code. Why Is It Still Far from Real AI, but Is Able to Change the World.” HybridTechCar, 8 Aug. 2020, hybridtechcar.com/2020/08/08/openais-gpt-3-neural-network-writes-poetry-music-and-code-why-is-it-still-far-from-real-ai-but-is-able-to-change-the-world/.
[4] “Turing-NLG: A 17-Billion-Parameter Language Model by Microsoft.” Microsoft Research, 13 Feb. 2020, www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/.
[5] P. J. Liu, M. Saleh, E. Pot, B. Goodrich, R. Sepassi, L. Kaiser, and N. Shazeer. “Generating Wikipedia by Summarizing Long Sequences.” ICLR, 2018.
[6] Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.
[7] O’Regan, Simon. “GPT-3: Demos, Use-Cases, Implications.” Medium, Towards Data Science, 22 July 2020, towardsdatascience.com/gpt-3-demos-use-cases-implications-77f86e540dc1.