How ChatGPT Works
This story is a rough draft. Check back later for the fully polished story. Last Updated: 2023-06-29
The title is presumptuous considering there is no published paper explaining how ChatGPT was made and I do not count myself an expert. But it’s now popular to create clickbait using the name “ChatGPT”, so why not?
JK. This isn’t one of those “Let me tell you about something vaguely similar to ChatGPT after I attract you to my web page using a title that makes it sound like I will tell you all about the actual ChatGPT.” No, I will do my best to learn what I can about ChatGPT and explain how it works to the best of my ability. In other words, I will try to make this story part of the 5% of stories with “ChatGPT” in their title which are not clickbait. While all of my friends and coworkers just try out ChatGPT and show each other how cool it is or how funny some of its outputs are, I would rather understand how it was made. That seems much more useful than just seeing what it can do. I want to learn how to make something similar and overcome some of its weaknesses while making good use of its strengths. As I learn, I write. This exercise — like most of my blog — is 80% for me and 20% for you. We learn the most when we try to teach.
What OpenAI Did before ChatGPT
OpenAI developed a series of “generative pre-trained transformer” models before developing ChatGPT. Based on the titles of the associated papers, we see the following sequence of progress: GPT-1 improved language understanding by using generative pre-training. GPT-2 improved language modeling through unsupervised multitask learning. GPT-3 showed how language models can be few-shot learners. GPT-4 isn’t available and there is no published research yet, but some speculate it will be multi-modal.
InstructGPT was an earlier, ChatGPT-like model. It was trained for single-turn request-response interactions, while ChatGPT was trained to handle longer, multi-turn dialogs. ChatGPT was made, I assume, based on the lessons OpenAI learned while building InstructGPT.
Pieces of ChatGPT
Tokenizer
ChatGPT and other LLMs rely on input text being broken into pieces. Each piece is about a word-sized sequence of characters or smaller. We call those sub-word tokens. ChatGPT’s tokenizer is probably something like the one in OpenAI’s tokenizer demo. OpenAI recommends using the tiktoken Python library to count tokens before sending text to its API.
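For example, here is a minimal sketch of counting tokens with tiktoken. I am assuming the gpt-3.5-turbo model name here; the exact encoding ChatGPT uses internally is not published.

```python
# Minimal sketch: counting tokens with OpenAI's tiktoken library.
import tiktoken

# encoding_for_model looks up the encoding used for a given API model name.
enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

text = "ChatGPT breaks input text into sub-word tokens."
tokens = enc.encode(text)

print(tokens)              # a list of integer token ids
print(len(tokens))         # the token count you are limited by
print(enc.decode(tokens))  # decoding round-trips back to the original text
```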
Large Language Model
ChatGPT is built on a 175-billion-parameter, transformer-based decoder model in the GPT-3.5 family.
Technically, I understand that ChatGPT is a decoder-only model, but there is room for debate since it must encode its instruction input text to some degree. So it could be argued that ChatGPT is an encoder-decoder architecture if we ignore the historical and over-simplistic taxonomy of LLMs.
This kind of language model is a neural version of the older auto-regressive sequence model, which is designed to predict the next time step of a continuous signal given the history of that signal. An auto-regressive language model is designed to output a probability distribution over all possible next tokens. It can be used to generate language by sampling from that probability distribution, i.e., the next token is picked at random, with higher-probability tokens more likely to be chosen.
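Here is a toy sketch of that sampling loop, with a made-up stand-in for the model (this is not ChatGPT’s real code, just the general idea):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat", "."]

def next_token_logits(history):
    # Stand-in for the real model: one score (logit) per vocabulary entry.
    return rng.normal(size=len(vocab))

def sample_next_token(history, temperature=1.0):
    logits = next_token_logits(history) / temperature
    probs = np.exp(logits - logits.max())
    probs = probs / probs.sum()        # softmax -> probability distribution
    # Higher-probability tokens are chosen more often, but any token can win.
    return rng.choice(vocab, p=probs)

history = ["the"]
for _ in range(5):
    history.append(sample_next_token(history))
print(" ".join(history))
```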
The language model in ChatGPT is better than previous models partly because of an increased context size. It might use chained prompts. ChatGPT probably has a context length of between 4,096 and 8,192 sub-word tokens (or about 3,000 words, according to Ari Seff), whereas GPT-3’s is 2,048 sub-word tokens. The API version of ChatGPT (as of 3/2023) has a limit of 4,096 tokens for the gpt-3.5-turbo-0301 model, and that limit includes both input (prompt) and output (bot response) tokens.
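Because prompt and response share the same window, a caller has to budget tokens. A simplified sketch of that budgeting using tiktoken (the real API accounting also adds a small per-message overhead, which I ignore here):

```python
import tiktoken

CONTEXT_LIMIT = 4096  # gpt-3.5-turbo-0301: prompt + response combined
enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

def tokens_left_for_response(prompt: str) -> int:
    # Whatever the prompt doesn't use is what remains for the bot's reply.
    return CONTEXT_LIMIT - len(enc.encode(prompt))

print(tokens_left_for_response("Summarize the history of the transformer architecture."))
```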
Transformers
Data
Some say that ChatGPT was trained on 570GB of text.
Pretraining
Batch size.
How ChatGPT Was Made
I said there was no ChatGPT paper, but I should say that the people behind Open Assistant believe that the InstructGPT paper is close enough.
Step 1: Pre-Training
One of the models in the GPT-3.5 series was trained from scratch on Azure AI supercomputing infrastructure and finished training sometime in early 2022.
This is where the generative self-supervised pre-training occurs. In this task, the model is trained to predict/generate the next token in some large corpus of documents given a random location and the preceding token history at that location. It’s easy to acquire this training data because it’s “unlabeled” text. So a lot can be gathered, which gives the learning process enough statistical power to start recognizing subtle patterns in language. What an old-style language model would have considered too obscure and abstract to be learnable, a large language model trained on a lot of text will find learnable. This is where ChatGPT started learning things about language and the world that would astound its first users when it was released.
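A toy sketch of that objective, as I understand the standard next-token-prediction setup (this is my assumption of the usual recipe, not OpenAI’s code):

```python
import numpy as np

rng = np.random.default_rng(0)
corpus = [3, 7, 7, 1, 4, 0, 2, 7, 5, 1]   # token ids from the "unlabeled" text
VOCAB_SIZE = 8

def model_next_token_probs(history):
    # Stand-in for the transformer: anything that maps a token history to a
    # probability distribution over the vocabulary.
    logits = rng.normal(size=VOCAB_SIZE)
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

def training_loss_at_random_position():
    t = rng.integers(1, len(corpus))       # pick a random location in the corpus
    history, target = corpus[:t], corpus[t]
    probs = model_next_token_probs(history)
    return -np.log(probs[target])          # penalize low probability on the true next token

print(np.mean([training_loss_at_random_position() for _ in range(100)]))
```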
But this task of predicting next tokens in a large, unfocused corpus is misaligned with the ultimate dialog task. Most experts say the pre-training task is a superset of the dialog task, in which case the later fine-tuning steps effectively refine the model into behaving according to one style of response generation among the many possible styles found in the original pre-training corpus, without injecting any new knowledge of language or of the world into the model.
Step 2: Supervised Learning
Collect demonstration data: labelers write the desired responses to a set of prompts.
Train a supervised policy: fine-tune the pre-trained model on those demonstrations, as sketched below.
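A rough sketch of how a single fine-tuning example might be assembled. The exact data format OpenAI used is not public; masking the prompt positions out of the loss is a common convention, not a confirmed detail:

```python
def build_sft_example(prompt_ids, demonstration_ids, ignore_index=-100):
    # The model sees prompt + demonstration, but only the demonstration tokens
    # contribute to the loss; prompt positions are masked out with ignore_index.
    input_ids = prompt_ids + demonstration_ids
    labels = [ignore_index] * len(prompt_ids) + demonstration_ids
    return input_ids, labels

# Hypothetical token ids for a labeler-written prompt and demonstration response.
print(build_sft_example([11, 42, 7], [99, 3, 5]))
```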
Step 3: Training a Reward Model
Collect comparison data: labelers rank several model outputs for the same prompt from best to worst.
Train a reward model: train a model to predict which output the labelers would prefer, as sketched below.
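The InstructGPT paper trains the reward model on pairs of responses to the same prompt, where labelers marked one as better. A minimal sketch of that pairwise ranking loss:

```python
import numpy as np

def reward_ranking_loss(score_chosen: float, score_rejected: float) -> float:
    # -log(sigmoid(score_chosen - score_rejected)): small when the reward model
    # already scores the human-preferred response higher, large when it doesn't.
    return float(np.log1p(np.exp(-(score_chosen - score_rejected))))

print(reward_ranking_loss(2.0, 0.5))   # ranking already correct -> small loss
print(reward_ranking_loss(0.5, 2.0))   # ranking wrong -> large loss
```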
Step 4: Reinforcement Learning
Proximal Policy Optimization (PPO) is the reinforcement learning algorithm used: the fine-tuned model (the policy) is optimized against the reward model using PPO.
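A compact sketch of the generic PPO clipped surrogate objective. This is textbook PPO, not OpenAI’s actual training code; in RLHF the advantage is derived from the reward model’s score, typically with a KL penalty against the supervised model to keep the policy from drifting too far:

```python
import numpy as np

def ppo_clipped_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    ratio = np.exp(logp_new - logp_old)                 # how much the policy changed
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # Take the pessimistic (smaller) estimate so a single update can't move
    # the policy too far from the one that generated the samples.
    return np.minimum(ratio * advantage, clipped * advantage)

# A response the reward model liked (positive advantage): the objective rewards
# increasing its probability, but only within the clipping band.
print(ppo_clipped_objective(logp_new=-1.0, logp_old=-1.2, advantage=1.5))
```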
I recently discovered this great talk by Andrej Karpathy. It’s one of the most authoritative and insightful explanations of how ChatGPT works and how to get LLMs to work in applications: https://build.microsoft.com/en-US/sessions/db3f4859-cd30-4445-a0cd-553c3304f8e2
Join the CAI Dialog on Slack at cai-dialog.slack.com