Messing with GPT-3

Why OpenAI's GPT-3 doesn't do what you think it does, and what this all means

UPDATE 07/20/2020: After seeing a few more examples of people’s completed projects with the API, I’ve narrowed things down to some of my favorite applications, while just providing a link to a continuously updated GitHub awesome list.

UPDATE 08/07/2020: I’ve gotten a few DMs about additional research directions that could give more insight to GPT-3’s capabilities. In response, I’ve added a few lines on why many of these ideas are redundant and don’t actually do too much to distinguish GPT-3’s reasoning abilities from that of a fuzzy pattern-matcher tested on contaminated data.

UPDATE 08/09/2020: I just had to add this cool new example: My friends Zain Shah, Cathy Chen, and Blake Stedman Hood have put together a fantastic GPT-3 exploration tool. You can check out the live version here (BYO API Key)

Context for writing this

The internet has been exploding with GPT-3 news. There are plenty of headlines claiming this as a precursor to Artificial General Intelligence. There are plenty of people claiming that this is just overhyped, and that it’s going to be a letdown just like they saw GPT-2 as. There are people in both those camps being as sensationalist and vitriolic as they possibly can, just for the sake of getting their post ranked higher on people’s news-feeds. There are still more people asking how the heck they can get their hands on this model.

This post is the best guide I could make for parsing through all of this noise, starting by doing the one thing many of the people in this conversation haven’t done: actually reading the 40-page paper. Beyond this, I’ll go into detail about the model’s strengths and shortcomings (ones NOT covered in the paper), some of the various applications people have come up with, along with advice on what this means for you, whether you’re an ML engineer or not.

What is GPT-3?

The following is my best effort at summarizing the work behind GPT-3:

Introductions

GPT-3 (from “Generative Pretrained Transformer 3”) is a language model, one of the latest in a succession from OpenAI, that is an order of magnitude larger than its predecessor GPT-2. This was recently showcased in a 40-page paper, Language Models are Few-Shot Learners, where the authors demonstrate that a big enough model can learn to carry out language tasks that it has never seen before.

Hey OpenAI, at some point it might just be easier to add a hyperlink to the org chart

How GPT-3 works

When we say language model, we’re referring to a model that can take in an incomplete sequence of words (be this an incomplete sentence or a question without an answer), and generate language continuing this prompt…and that’s it. GPT-3 learned to do this with self-supervised learning (predicting the next token) on the “Common Crawl (filtered)” dataset (about 410 billion tokens, or over 130 times larger than Wikipedia), which comes from crawling the entire internet.

There is not one single GPT-3 model, though we’re usually referring to the biggest version. Like in any good ML architecture paper, the authors trained models of various sizes. These versions range from GPT-3 Small (only 125 million parameters) to GPT-3 175B (aka ’The GPT-3’), which is much bigger than the 1.5B-parameter GPT-2. If that sounds crazy already, this largest architecture has 96 layers, each with 96 attention heads (each head having 128 dimensions), trained with a batch size of 3.2 million tokens.

There are plenty of great visualizations out there of how the individual transformers make up the encoder and decoder blocks of language models, but there aren’t a lot that show the true scale of the model. Here are some examples of what it’s like to look at GPT-2 in Tensorboard…GPT-3 is 116x bigger than this.

If you’re not already familiar with the concept of transformers, go read up and resume at this spot when the previous sentences don’t sound like gobbledygook.

Taken from the ‘Attention is all you need’ paper, demonstrating the main point about why attention is so great: Maximum path lengths, per-layer complexity and minimum number of sequential operations for different layer types. n is the sequence length, d is the representation dimension, k is the kernel size of convolutions and r the size of the neighborhood in restricted self-attention.

It turns out that’s what OpenAI needed all those Microsoft Azure credits for. GPT-3 isn’t quite like BERT. For one, it’s not bi-directional. GPT-3 is what’s known as an autoregressive model (it goes from left to right). It IS the same model and architecture as GPT-2; it just uses more layers, wider layers, and a colossal amount of data compared to GPT-2. There are no real new insights on the architecture or the training strategies. GPT-3 uses the same modified initialization, pre-normalization, and reversible tokenization as GPT-2 (though there are some changes, with GPT-3 using alternating dense and locally banded sparse attention patterns in the layers of the transformer, similar to the Sparse Transformer). The real insight is that, when a language model gets big enough, it exhibits some interesting properties regarding few-shot or even zero-shot learning.

For something like BERT, you would typically do pre-training by feeding the model a lot of data (so it can learn about the language of interest). Once you do this, the model would be fine-tuned, or trained further on a specific task like sentiment analysis or translation. This fine-tuning is done by updating the weights of the original pre-trained model via further gradient steps. The issue is that for each downstream task you need a giant labeled dataset. The GPT-3 authors argue that this isn’t the only way to adapt language models, and that you can skip ahead to evaluating right on the test dataset in a zero-shot fashion. GPT-3 isn’t fully zero-shot in the traditional sense. GPT-3’s zero-shot mode works by being given task instructions and a prompt; if such a schema or task showed up anywhere in the crawl data, the model can often carry the task out. This is contrasted with one-shot learning, where in addition to the task instructions and the unfinished prompt, you also provide one example of how the prompt would be expected to be completed (and as you can imagine, few-shot mode involves providing several examples). The advantage here is immediately clear: you only need to train one model instead of a bunch of fine-tuned BERT models.

Example of zero- vs. one- vs. few-shot performance. Most of the plots in this paper seem to be some variant of this
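To make the zero-/one-/few-shot distinction concrete, here’s a minimal sketch of how the three kinds of prompts differ, using the English-to-French example the paper itself uses for illustration (the exact formatting and the helper function are my own):

```python
# Sketch of zero-, one-, and few-shot prompts. The sea otter / cheese examples mirror the
# paper's illustrative English->French figure; the formatting and helper are my own.

TASK = "Translate English to French:"
EXAMPLES = [
    ("sea otter", "loutre de mer"),
    ("peppermint", "menthe poivrée"),
    ("plush giraffe", "girafe en peluche"),
]
QUERY = "cheese"

def build_prompt(n_examples: int) -> str:
    """n_examples = 0 -> zero-shot, 1 -> one-shot, >1 -> few-shot."""
    lines = [TASK, ""]
    for english, french in EXAMPLES[:n_examples]:
        lines.append(f"{english} => {french}")
    lines.append(f"{QUERY} =>")  # the model is asked to complete this last line
    return "\n".join(lines)

print(build_prompt(0))  # zero-shot: just the task description and the prompt
print(build_prompt(1))  # one-shot: a single worked example
print(build_prompt(3))  # few-shot: several worked examples

# No gradients are updated in any of these settings; all the "learning" lives in the prompt.
```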

Experiments

After such a description in the paper, this is where the authors go into the experimental results to back up these claims. While a lot of these results are impressive, this is where we also start to see some discrepancies in the claims (but we’ll get to that later). One obvious criticism of a lot of performance metrics in the paper is that the model is so big that it’s just memorizing the training data. The authors assert that it’s learning true underlying representations of the tasks, and successfully interpolating. That being said, if you have a neural network with 175 billion parameters, what it is likely able to do is store huge swaths of the training data in the form of latent weights. Given a task, it’s likely that the model can do something like very fuzzy regex-matching when it encounters a prompt. When you go into the paper with this view, a lot of the results suddenly make a lot of sense.

Experiments Part 1: Typical NLP tasks

For language modeling, they give a blunt visualization of how the validation loss drops as you scale up the model and compute (with model scale and performance following a power law). The experiments show familiar NLP tasks like one-word continuation, and GPT-3 is very good at them. But chances are you’re wondering where things like question answering and reasoning come in.

Experiments Part 2: Question-answering

The authors also demonstrate GPT-3’s usefulness on closed-book question-answering. With a few-shot version of the model (and by few-shot, we mean 64 examples), GPT-3’s performance can exceed the state-of-the-art fine-tuned model. This tells you what this model has learned about the world just by crawling gigantic amounts of text (though it still underperforms when it comes to open-domain natural question answering).

Experiments Part 3: Translation Tasks

The same trend of larger models performing better extends to translation tasks, though GPT-3 usually performs much better when translating another language into English (beating many unsupervised models) than the other way around (and the authors almost seem to admit that they’re not familiar enough with the literature to confidently say how good the translation performance metrics are).

Experiments Part 4: Winogrande

The Winogrande dataset from the Allen Institute is a large-scale extension of the Winograd Schema Challenge (WSC) (Levesque, Davis, and Morgenstern 2011). The original WSC is a benchmark for commonsense reasoning consisting of 273 expert-crafted pronoun-resolution problems (e.g., “The trophy doesn’t fit in the suitcase because it is too big. What is too big?”), originally designed to be unsolvable for statistical models that rely on selectional preferences or word associations; Winogrande scales this format up to tens of thousands of crowd-sourced problems. In effect, something more standardized than a vague Turing test.

GPT-3 can outperform a fine-tuned BERT-Large, but not a fine-tuned RoBERTa-Large (though it’s still giving some stiff competition). This may seem innocuous compared to the results on the previous tasks, but we’re beginning to see cracks in the performance of GPT-3. Specifically, we begin to see the model struggle compared to the SOTA on tasks where overfitting by memorization is less of an option.

Experiments Part 5: Common Sense Reasoning

The PhysicalQA task is trickier to memorize and cram for. It involves questions that try to probe knowledge about how the world works (e.g., “if a ball is dropped, will it fall?”). What’s remarkable is that the biggest GPT-3 actually does perform better than the previous SOTA (even in zero-shot mode). That being said, the authors include a footnote bringing up some possible dataset contamination. For example, the PIQA dataset looks remarkably similar to their training data (though training was so long and expensive that re-training bug-free was out of the question). Even though they make the case that this contamination isn’t an issue, GPT-3’s performance suffers compared to the previous SOTA (quite a lot, too) on tasks like ARC (not to be confused with François Chollet’s ARC) and OpenBookQA. Without this one PIQA task that GPT-3 pretty much memorized the answers to, claiming GPT-3 has the ability to reason is a lot less defensible.

GPT-3’s performance is especially bad on the RACE reading comprehension benchmark (reading passages of text and then answering multiple-choice questions about them, almost like the SAT Reading section). This makes a lot of sense, as this is something that’s especially hard when your go-to approach is interpolating the training data (this is much closer to actual reasoning). While it outperforms fine-tuned BERT models on lots of tasks, GPT-3 still can’t quite hold a candle to the SOTA models on benchmarks like SuperGLUE or BoolQ. Again, these are all much more reliant on reasoning than language modeling. Contrast this with tasks like COPA, a task much closer to language modeling, where GPT-3 almost approaches the SOTA accuracy (it basically requires picking which of two alternative answers was more popular based on the training data).

Experiments Part 6: Natural Language Inference

NLI refers to the ability to understand the relationship between two sentences (e.g., whether one entails, contradicts, or is neutral with respect to the other), checking for logical consistency.

Since this also involves a lot of reasoning, the model struggles a lot. In fact, this time its performance is closer to random guessing than to fine-tuned BERT performance.

Experiments Part 7: Custom experiments and synthetic data

The authors went the extra mile, and invented their own synthetic data and original experiments. This included such tasks as arithmetic problems (that’s right, this is a 175-billion-parameter model being tested on 2-digit addition). The assertion is that, since the numbers are being represented as tokens rather than integers or floats, the model could only do this if it were gaining some genuine understanding of the task. Most of the GPT-3 models of 6.7B params and under completely fail at all of these tasks. However, the 13-billion and 175-billion-parameter models show dramatic improvement on two- and three-digit addition and subtraction (with the 175B version even getting above 80% accuracy on each of those). Of course, even the largest GPT-3’s performance tanks when it comes to four-digit and longer addition/subtraction, and even just two-digit multiplication. The authors claim this is due to multiplication being trickier than addition, but this still hasn’t ruled out the possibility that natural-language examples of addition and subtraction are simply more common on the internet, and that there is plenty of room for those types of problems to be stored in memory.

This makes a lot of sense when you compare the zero-, one-, and few-shot results. The dramatic jump in performance between the zero-shot and one-shot settings can easily be explained by using the prompt to filter through the training data, followed by simple pattern matching. After all, this model was trained on data from the internet, which very well may include websites with summed tabular data. Go ahead, google some of these numbers in the addition problems and you can find results that are just addition tables, or even educational websites where the answers to these problems are already given.

For those of you without the time or energy to do the above task, these are the kinds of results you can get if you just search for the numbers in the problems presented to GPT-3: Tables of numbers.

This also explains why the performance tanks when you increase the digits (after all, these are less common in search results). The same issue could be raised with the word-scrambling and manipulation tasks (there are a lot of sites with answers to these problems). A better task would be to check whether the model can scramble words instead of unscrambling them (nobody’s making websites about a task this trivial, but this would require GPT-3 to actually understand its task).
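There’s also a quick way to see why multi-digit arithmetic is an awkward fit for a language model in the first place: look at how the BPE tokenizer actually splits numbers. The snippet below uses the GPT-2 tokenizer from the `transformers` library (GPT-3 reuses the same BPE scheme); the specific inputs are made up and the exact splits should be treated as illustrative:

```python
# Inspect how a GPT-2-style BPE tokenizer splits arithmetic strings.
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

for text in ["48 + 76 = 124", "21 x 34 = 714", "98765 + 43210 ="]:
    print(f"{text!r} -> {tokenizer.tokenize(text)}")

# Multi-digit numbers are often split into multi-digit chunks rather than individual digits,
# so "doing addition" amounts to predicting the right chunk of characters, not carrying digits.
```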

One of the biggest examples of this pattern matching is the SAT Analogies task. This task is a lot easier if sites like thesaurus.com were part of the training data. As you would expect, GPT-3 easily outperforms the average college applicant. Then again, this is exactly the kind of task you would expect an NLP model (even a pre-GPT-3 model) to excel at.

Experiments Part 8: Fake-news generation

Now, the part that has everyone worried is the section about GPT-3-generated news articles. When it comes to getting humans to tell whether an article was written by humans or by this language model, the accuracy rate was barely above random. This held even after the human subjects spent more time scrutinizing the outputs. One could interpret this in two ways: 1) The model has learned our language well enough to generate convincing news articles. 2) The model can filter the training data, and fill in the rest of the prompt by interpolating the many repetitive news articles out there. The authors assert that this isn’t possible because the articles and prompts being generated weren’t part of the training data, but one can still take a substring from the output and search for it: one of the first results is a Google Books passage that sounds very similar to the model’s output, plus a more recent AP article with the same-sounding language.

OpenAI’s sample model output, plus the results of 5 seconds searching on Google.

True, maybe the exact article wasn’t there, but the language wasn’t exactly uncommon. You can contrast this with the examples of the articles the humans could most easily recognize (which included such features as Megyn Kelly being on the Tonight Show). This news article task demonstrates that this model is extremely competent at grammar, to the point where it’s basically functioning as a fuzzy search engine.

Conclusions (mine, not the authors’)

I don’t think this is something truly intelligent. Even some of the more entertaining tasks, like creating a sentence with a new word, could be easily solved by filtering the memorized training data for similar grammars (like the kinds of grammars you could find on a dictionary.com page). This is especially apparent with the Poor English/Good English grammar-correction task (contrary to what the paper figure might imply, the model did not come up with the “Good English” prefix on its own). As with the scrambling problem, this is also likely a case of the task being a relatively common topic on some websites. If the model truly understood the task, it would be able to do the reverse: converting good English into poor English. This would require understanding the task AND going against the grain of the model’s function as a good-English language model.

I’ve brought up a lot of examples where it seems like the 175B model is overfitting to the training data. It’s not like the authors hadn’t anticipated this. They include a big figure going over their deduplication efforts, but the issue is that these efforts are too weak. They’re using things like N-gram deduplication; for a task like this you need something much fuzzier, based on word meanings rather than exact overlaps. It also seems unlikely that the corpora for the benchmark tasks can be meaningfully “partially clean”: for example, if any part of the Winograd task made it into the Crawl dataset, the entire corpus should be treated as contaminated.
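To make the “too weak” complaint concrete, here’s a toy version of an exact N-gram overlap check in the spirit of the paper’s deduplication (the 13-gram window and the example strings are mine). A light paraphrase of a benchmark item produces zero exact overlap even though it carries the same content:

```python
# Toy N-gram contamination check. Exact-match N-grams miss paraphrases entirely,
# which is the core weakness being pointed out above.

def ngrams(text: str, n: int = 13) -> set:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

benchmark_item = ("the trophy would not fit in the brown suitcase because it was too big "
                  "what was too big")
paraphrase = ("the trophy could not be placed inside the brown suitcase since it was "
              "far too large what was too large")

print(len(ngrams(benchmark_item) & ngrams(benchmark_item)))  # non-zero: flags itself
print(len(ngrams(benchmark_item) & ngrams(paraphrase)))      # 0: the paraphrase sails through
```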

There’s a 5-page broader impact statement where they say bad people can take this model and do bad things. There, I just saved you the time it would have taken you to read those 5 pages.

How does one use GPT-3?

If you saw the GPT-3 Playground demo (and example), you were probably immediately excited to gain access. If you look on GitHub for GPT-3, you’ll notice that the repo is devoid of actual code. This is because GPT-3 is being released through an API, and access to said API is intentionally being limited. There are three main options for accessing it:

  1. Propose a research project that could be done with an existing institution, and submit a research proposal to OpenAI.
  2. Identify a product area where the API could be used, get your company on board, and submit a security clearance request to OpenAI.
  3. Find someone that already has beta access and persuade them to let you submit commands from their machine in return for walking their dog for two weeks. It’s cumbersome, and it’s not exactly a sneaky method (OpenAI can see all the commands and prompts being sent their way, after all), but at least it grants some kind of access.

Given the enormous list of people on the API waitlist, and given my impatience, I decided to ignore options 1 and 2 for now.

When you finally gain access to the API, there are a variety of options at your disposal for testing it out. For one, there’s the GPT-3 playground.

I mean, at least the UI is more visually striking than in the days of ELIZA

While this is fun for the first 6 hours, at some point you’ll realize you want to interact with GPT-3 more quickly. Programmatically interacting with GPT-3 can take on several different forms. For example, you could use the command line interface like the following and get the subsequent output.

curl https://api.openai.com/v1/engines/davinci/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer █████████████████████████████████" \
-d '{"prompt": "Q: What is human life expectancy in the United States?\nA: Human life expectancy in the United States is 78 years.\n\nQ: What is the meaning of life?\n"}'

(yes, the API key is censored. No, “inspect element” in your developer tools doesn’t reveal anything. You didn’t think it was going to be that easy, did you? 😛)

{
    "id": "cmpl-RRcXqSMfadADL1tst9gxPTJ9",
    "object": "text_completion",
    "created": 1591865181,
    "model": "davinci:2020-05-03",
    "choices": [{
        "text": "A: The meaning of life is 42.",
        "index": 0,
        "logprobs": null,
        "finish_reason": "stop"
    }]
}

Of course, even this may get cumbersome after a while. Eventually you will probably resort to building your own script for interacting with the API. It should be noted that while OpenAI presents its API as resembling a Python package, more likely you’re going to be using fire to interact with the API through Python. See Max Woolf’s GitHub for the kinds of scripts you would use for interacting with the GPT-3 API (again, none of these will be usable if you yourself don’t have an API key).
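If you’d rather not depend on someone else’s scripts, a bare-bones wrapper is only a few lines. The sketch below uses `requests` against the same endpoint as the curl call above; the parameter names mirror the beta completions API, but treat this as a starting point and double-check them against OpenAI’s documentation:

```python
# Minimal Python wrapper around the completions endpoint shown in the curl example above.
import os
import requests

API_URL = "https://api.openai.com/v1/engines/davinci/completions"
API_KEY = os.environ["OPENAI_API_KEY"]  # keep the key out of source control

def complete(prompt: str, max_tokens: int = 64, temperature: float = 0.7) -> str:
    """Send a prompt to the API and return the first completion's text."""
    response = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"prompt": prompt, "max_tokens": max_tokens, "temperature": temperature},
    )
    response.raise_for_status()
    return response.json()["choices"][0]["text"]

if __name__ == "__main__":
    print(complete("Q: What is the meaning of life?\nA:"))
```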

What can you build with GPT-3?

For all its faults, GPT-3 still has many uses. While it might be inaccurate to say that GPT-3 can make outputs for prompts it has never encountered (it has in fact encountered a lot of them), its strength may be in the fact that it has encountered so many prompts in its training. Given that social media has been exploding with use cases since the Beta program began, I’ve included some of my favorite examples below (you can see a more comprehensive and more frequently updated list here):

What are some common pitfalls with using GPT-3?

In the API waitlist form itself, OpenAI has a few areas that they’re looking for research on.

We’re initially scoping our research collaborations to the following areas, though we welcome suggestions for areas of focus in the future. The questions for each area are illustrative - we’d be delighted to receive research proposals that answer different questions.

Make fun of OpenAI all you want, but it’s nice that they’re still forthcoming about this stuff. OpenAI listed 5 initial research areas, so we’ll start by going through those:

Pitfalls brought up on the Beta Webpage

Fairness and Representation

How should performance criteria be established for fairness and representation? How can responsive systems be established to effectively support the goals of fairness and representation in specific, deployed contexts?

The problem with any model is that it’s going to have biases introduced into it by the data it is trained on. If GPT-3 truly has been overfit on the internet, then there is some room for some pretty big biases in the dataset. Some of these biases are relatively benign (for example, the training data was from before November 2019, so GPT-3 doesn’t know about COVID-19). As for some less benign examples, these include racist content, sexist content, racist AND sexist content, and content that generally sounds like the stuff that gets users banned from social media for making the platform unprofitably unpleasant for everyone else to use.

At the moment, the problem with removing said biases is that this will likely involve some form of re-training, which is a truly Herculean task.

Robustness

Generative models have uneven capability surfaces, with the potential for surprisingly strong and surprisingly weak areas of capability. How robust are large language models to “natural” perturbations in text, such as phrasing the same idea in different ways or with/without typos? Can we predict the kinds of domains and tasks for which large language models are more likely to be robust (or not robust), and how does this relate to the training data? How do robustness and fine-tuning interrelate? How can robustness be measured in the context of few-shot learning (e.g. across variations in prompts)?

As you can see throughout the paper, especially the news article generation, the model is extremely good at language tasks that require it to predict the next words (if it assumes it’s continuing a random webpage in the English language). Many of the computational tasks it is good at are those with questions and answers available on pages like Wikipedia, StackOverflow, and various academic websites. If a task doesn’t have such answers immediately available right after the question on a webpage (as is the case for many 4-digit multiplication problems), then it’s far weaker at those tasks.

Model Exploration

Models like those served by the API have a variety of capabilities which we are yet to explore. We’re excited by investigations in many areas including linguistic properties, commonsense or grounded reasoning, and potential uses for many other NLP problems (especially those involving generation).

As mentioned in the discussion of the paper, GPT-3 is very good at generating responses if the prompt follows a grammar that it has encountered in its training data (especially if it’s a grammar found on an English-language page).

If you’re experimenting with all the different tasks that the model can do, you will probably very quickly encounter GPT-3’s memory limits. Specifically, GPT-3 has no memory, and its window of input is limited to 2048 BPE tokens, or about 500-1000 words. This will adversely impact some tasks more (like creative writing any longer than 2 single-spaced pages) than others (like repetitive Q&A), but all is not lost. Some of the same principles for fixing this in GPT-2 also apply to GPT-3.
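The most common of those principles is a rolling context window: keep only the most recent text that fits in the window and prepend it to each new prompt. Here’s a minimal sketch, with the 2048-token budget approximated in characters (a real version would count BPE tokens with the tokenizer); the `complete()` helper is the hypothetical wrapper sketched earlier:

```python
# Rolling-context workaround for the fixed input window (the same trick used with GPT-2).
# The model itself remembers nothing between calls; all "memory" lives in the prompt we build.

MAX_CONTEXT_CHARS = 6000  # rough character stand-in for ~2048 BPE tokens

history = []

def rolling_prompt(new_input):
    """Join the conversation so far and keep only the tail that fits the budget."""
    context = "\n".join(history + [new_input])
    return context[-MAX_CONTEXT_CHARS:]

def step(new_input):
    prompt = rolling_prompt(new_input)
    output = complete(prompt)            # the API wrapper sketched above
    history.extend([new_input, output])  # carry the exchange forward for the next call
    return output
```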

As for exploration of GPT-3’s capabilities on these problems, I highly recommend you check out the tool my friends at Belay Labs have built: The GPT-3 Explorer. Rather than using GPT-3 through a terminal, this tool lets you:

  • save your entire history (securely to Firebase)
  • share your runs (example run)
  • annotate each run with notes [ctrl + n]
  • use Vim-style keybindings ([esc] to open/close the history drawer, [j/k] to toggle through runs)
  • append the previous run’s output to the current prompt [ctrl + enter]
  • bring your own API key and secure it with Google OAuth

GPT-3 Explorer in progress (bring your own API Key)

Interdisciplinary Research

How does AI intersect with other disciplines such as philosophy, cognitive science, and sociolinguistics? We’re interested in exploring if and how general technologies like the API can be a tool to support research that engages with fields beyond AI.

This one currently has a big limiter: everyone currently has access to the same model, and there is no real opportunity for people outside of OpenAI to do task-specific fine-tuning. True, the main takeaway of the paper was that you can do few-shot and one-shot learning with GPT-3 far more easily than you could with GPT-2 (which typically required further fine-tuning to reduce the amount of topic-switching mid-paragraph). That being said, generating outputs in a lot of scientific domains will very quickly fall short of expectations (or coherence, for that matter).

Misuse Potential

How can systems like the API be misused? What sorts of ‘red teaming’ approaches can we develop to help us and other AI developers think about responsibly deploying technologies like this? Is it helpful to build prototypes of malicious systems to help us understand the threat landscape?

Hooooooooooooo boy…where to start?

Large parts of AI research focus on the alignment problem. Namely, how do you guarantee that a given AI’s goals align with those of humans? One of the immediate problems is that not all humans share the same values. In fact, dare I say it, some humans are complete assholes:

I swear all my usual clients are nowhere near this shady. I don’t know why these creeps were the first to pop up in my recommendations

The GPT-3 paper already had a societal impact section, but this probably only scratches the surface. Ad algorithms and text-based deepfakes aside (see all these other articles on that latter point), there are still plenty of other ways this can go wrong.

For example, as seen in the text-deepfakes section of the paper, we got very convincing news articles that closely matched other documents we could find on the web through a Google search. While this may (at least partially) allay fears about GPT-3-powered bots feeding the American public eye-catching stories about how everyone hates everyone else, there is still misuse potential as the ultimate plagiarism tool. We have a language model that can do autocomplete with compressed representations of huge parts of the internet. GPT-3’s usefulness may come from the fact that it can reproduce writings on the internet well enough to sound coherent, but not so well that plagiarism-detecting software immediately goes on alert. It may be that we need to add overworked consultants and academics to the category of people that could abuse language models like GPT-3.

Other Practical issues

Speed

As anyone that has used the API so far can tell you, it takes a long time. Let’s not kid ourselves, a lot needs to be done on the latency side before GPT-3 starts replacing Google. As can be expected for a 175-billion-parameter model, running inference on a distributed system (no matter how well-optimized), followed by sending those results over a (possibly spotty) internet connection, is going to be much more sluggish than your typical Siri conversation. There were more than a few times when the latency prompted investigations into whether the WiFi was working, or whether the API key had been mis-entered.
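In practice, this means any script you build around the API should assume calls will occasionally stall or fail, and retry with a backoff rather than hang forever. A minimal sketch (wrapping the hypothetical `complete()` helper from earlier):

```python
# Retry a flaky/slow API call with exponential backoff instead of assuming the WiFi died.
import time
import requests

def with_retries(call, retries=4, first_delay=1.0):
    """Run `call()` and retry on network/HTTP errors, doubling the wait each time."""
    delay = first_delay
    for attempt in range(retries):
        try:
            return call()
        except requests.exceptions.RequestException as err:
            if attempt == retries - 1:
                raise  # out of retries; let the caller deal with it
            print(f"Request failed ({err}); retrying in {delay:.0f}s...")
            time.sleep(delay)
            delay *= 2

# Example usage with the wrapper sketched earlier:
# answer = with_retries(lambda: complete("Q: What is the meaning of life?\nA:"))
```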

Copyright

If you’re trying to use GPT-3 for some sort of creative project, there’s another practical issue you should really pay attention to. As far as ML programming goes, using GPT-3 is one of the more unconventional cases, so let’s cover some important details. For one, given the transformative nature of the GPT-3 project, OpenAI is the sole copyright-holder of the model (yes, not even anyone that contributed to the dataset it trained on gets a claim). Given that you’ll most likely have GPT-3 doing most of the work on generating the outputs, it will be very difficult to claim any kind of copyright on the outputs of the model (even if any human can claim a copyright, it’s likely leaning in favor of OpenAI). Your best-case scenario for any kind of IP protection is if you create complex and sophisticated prompts by spending a long time (i.e., weeks or months) experimenting and pushing through all the trial and error involved. If that is the case, then maybe…maybe…a case could be made for human contribution (and thus creation of a copyright). Still, don’t get your hopes up. That could change just as fast as OpenAI pivoted to a “capped profit” model.

What does this all mean for AI Engineers?

As anyone could guess, even without access to GPT-3, this is going to change even the day-to-day details of AI as an industry. There is clearly going to be enormous demand, even beyond the obviously shady use cases. Most AI engineers would do well to set aside some time to put down what they’re working on and learn the ins and outs of GPT-3 (even if it’s just so you can get people to listen when you express your opinion that GPT-3 is overhyped).

In terms of frameworks, it’s difficult to come up with a good one so early. One can think of this as the collective knowledge of the internet (including any gaps in that knowledge) compressed into an API. Even without any true reasoning, it’s easy to see how a fuzzy pattern matcher would be extremely useful in many applications (beyond those mentioned above). Manipulating GPT-3 may become a skill in its own right, whether that’s learning to introduce the right distribution of examples for few-shot learning, or coming up with clever prompts to make it seem like GPT-3 is more intelligent than it truly is (like the ML equivalent of teaching a horse “to do math” by just having it continuously stomp its hoof and interrupting with praise just as it arrives at the right answer).

Beyond GPT-3 specifically, one can make a case that this is a sign to be bullish on API-first products. If GPT-3 is any indication, machine learning work in the future may require less and less expertise. This would in turn make it much easier for companies and projects to be built around machine learning models without understanding the underlying model or the math underneath. Is this a good or bad development? As is usually the case with paradigm shifts in machine learning, a mix of both.

Speaking of development, one of the largest-impact use cases for GPT-3 (and similar language models) will be code completion. One of the biggest takeaways from my conversation with Tim O’Reilly back in January (this was back before people his age and people my age had to segregate for the sake of containing a 21st-century plague) was that knowledge of algorithms for whiteboard challenges would likely go the same way as a London cabbie’s knowledge of the streets: The real disruptor isn’t something that helps you study for the test more easily, it’s a system that lets a complete newbie easily navigate the system you studied for. As Tim put it, most of the challenge of software engineering these days is really just knowing when to import and use certain libraries. The next paradigm shift would be to software engineering what the Garmin or SatNav was to driving. OpenAI already demonstrated the beginnings of something like this back in May:

GPT-3-based code-completer building a simple function, like the kind you would expect in the “easy” section of LeetCode

GPT-3-based code-completer building a custom function that also makes use of a previous custom function

GPT-3-based code-completer building a customized class with methods

This even extends to library-specific knowledge like building an image classifier in Keras

Even if we’re ascribing this behavior to GPT-3 being a fuzzy pattern matcher rather than being truly intelligent, it’s a fuzzy pattern matcher that has seen the patterns in sites like StackOverflow. At best, this could mean that AI engineers could someday do away with the time-consuming Googling and StackOverflow-checking that occupies 60% of their day. At worst, it could mean additional barriers to employment beyond the open secret that you need to have TA’ed an algorithms class.

What are some open research questions about GPT-3?

Research areas repeating what was already done in the paper.

There are plenty of posts and demos showing off the capabilities of GPT-3. These range from continuations of the experiments done in the paper, to imposing a Turing test on GPT-3 (spoiler alert, it failed). I think the main problem with most of these is that it’s not really all that clear to what extent GPT-3 is reasoning about a task, and to what extent it’s just using the one-shot and few-shot prompts to filter out enough of its overfit latent weights to interpolate the training data. Consider the following examples:

Now, looking at this list, you might say “Matt, you’re dumping over this paper and these people’s posts pretty hard here. Are you just doing this because you know inflammatory posts get more views on social media even if the overall information content is the same?”

That is a very good question.

…Moving on.

The most valuable areas of research would likely be on tasks that can more easily distinguish reasoning from overfitting on the whole internet. I’ve included a few open prompts below:

New research areas focused on the existing model and weights.

  • Programming Language Translation: GPT-3 demonstrated limited ability to translate sentences in various languages into English. If this ability is based on data it’s encountered on pages scraped from the internet, can this translation ability be extended to programming languages? Facebook recently released a model optimized for exactly that task. This model can take programs written in Python and convert them to languages like C++ (and this was achievable with unsupervised learning). GPT-3 has already demonstrated an ability to create Python scripts from natural language descriptions of the functions. Programming language translation might not be too big of a leap (many coding interview prep sites show algorithm solutions in multiple languages); a sketch of what such a prompt might look like follows the figure below. Python is great for prototyping and research, but sometimes you want the speed associated with C++ or Rust, or you want to embed code within a web page like you can with JavaScript or PHP, or you want your code to run seamlessly on iOS like Swift or on Android like Java. Elon Musk even described this process of prototyping in Python before converting to C++ happening at Tesla. Being able to translate languages could arguably be more valuable than just reducing time spent on StackOverflow. After all, scaling up systems beyond prototypes typically requires being a polyglot when it comes to language choice.

Workflow for the TransCoder model in “Unsupervised Translation of Programming Languages”
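Here’s the kind of few-shot prompt the translation idea above might start from. Nothing in it comes from TransCoder or the GPT-3 paper; it’s a hypothetical prompt you could feed to the completions endpoint sketched earlier:

```python
# Hypothetical few-shot prompt for Python -> C++ translation.
PROMPT = """Translate Python to C++.

Python:
def add(a, b):
    return a + b
C++:
int add(int a, int b) {
    return a + b;
}

Python:
def is_even(n):
    return n % 2 == 0
C++:
"""

# completion = complete(PROMPT)
# A "successful" completion would look something like: bool is_even(int n) { return n % 2 == 0; }
```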

  • Handling Compositionality: Given that this is an NLP model, one might expect GPT-3 to already be good at generating grammars that give the first-glance impression of being able to handle compositionality. However, testing this would require custom tests that focus almost exclusively on compositional statements. Specifically, this would require solving problems based on a hierarchy of information in sentences. This has been one of the aims of symbolic reasoning systems for decades, from understanding orders of operations, to rewriting algorithms in terms of specific components, to verifying mathematical statements, to even coming up with new equations to describe systems. When we say we want “common sense reasoning”, we usually mean being able to do all of these types of tasks in natural language instead of equations. In fact, recent work has taken architectures like tree-stacked LSTMs, previously used for symbolic reasoning, and extended them to natural language understanding tasks involving topic hierarchies. Better yet, such research is being conducted on benchmarks that aren’t solvable just by looking up the grammars in a Google search. This brings me to the next area of research…
  • Adversarial Testing: Training data contamination was one of the biggest issues with the paper. As such, one of the most valuable areas of research is to intentionally seek out areas where GPT-3 fails. In scientific research, it’s good practice to explore counterfactual scenarios for any hypotheses you have. If your mental model of GPT-3 is that of a neural network with a rudimentary reasoning system, what would be the kinds of tests that would turn the screws on that model? In terms of tools, it would be extremely interesting to see how GPT-3 interacts with PAIR-code’s Language Interpretability Tool. The obvious barrier to this tool is that the full model is not available yet.

This toolkit has a few extensions already for generating counterfactual examples

  • Using GPT-3’s overfitting as a way to spot automatically-generated text: There are a few groups working on ways of detecting AI-generated text. For example, the GLTR app is based on research into detecting automatically-generated text. If we try putting GPT-3’s outputs into this tool’s live demo, here’s what we get:

    • This is one of the few areas where GPT-3’s overfitting might actually be useful. In principle, it should be possible to modify GPT-3 so that one can see how much a given text matches the kind of text that could be found with a simple Google search. Certain types of writing or patterns may show up in GPT-3-generated content more often, simply by virtue of using super-common grammars. If this is cross-referenced with an actual internet search of the content (or better yet, using activations of language models themselves as indicators), this could be a useful detection tool.
  • Detecting Influential Training Instances: When a lot of people talk about interpretability in vision, they usually think of it in terms of which of the inputs a model was trained on are influencing a current decision output. With a language model, it would be nice if we could point to which specific training example had the most influence on a given output (this research definitely exists, but the issue remains of how to run computationally expensive influence functions on something as gigantic as GPT-3). This is complicated further by the fact that Basu et al. (2020) demonstrated that the fragility of influence functions rises with the size of a network (and in case we hadn’t made it clear, GPT-3 is absolutely gigantic). This research is most likely contingent on both new strategies for influence function stability, as well as techniques for partitioning influence functions for multi-stage pre-training.

An example from the original Pang Wei Koh paper, though I know from experience that this works on sequence models

Speaking of analyzing specific model weights…

From the Vision-weight-stealing paper (our paper)

From the BERT-weight-stealing paper

New research areas involving modification of the underlying model

  • Re-training GPT-3 on non-human-intelligible grammars: Maybe this is the former biologist in me talking, but I’m curious how much could be learned by analyzing large databases of genomic or protein data. For example, proteins can be represented as sequences of amino acids (each being represented by a letter), known as primary structure. In proteomics, one of the goals is to predict from this primary structure the secondary structure (how this sequence folds on itself to form things like helices and sheets), tertiary structure (how these helices and sheets form things like binding pockets), quaternary structure (how these folded proteins interact with other proteins or molecules), and so on. I’m definitely not the first person to think of this idea. This research area might be further out than one might initially think. Not only would this require re-training on strings that definitely do not resemble English, but a non-bidirectional model would hit a wall pretty quickly due to the fact that DNA sequences are usually accompanied by modified palindromes. It would also require huge overhauls in the input size that GPT-3 can accommodate. This brings me to my next recommended research area…

An example of the kinds of interactions detected by analyzing the specialized attention heads

  • Improving GPT-3 with random, window, and global attention: As mentioned before, one of the drawbacks of GPT-3 is that it suffers from quadratic memory problems. Attention-based transformers were a game-changer in NLP because they reduced the time complexity of many aspects of sequence analysis that previously relied on RNNs. That being said, the quadratic resource requirements of the attention mechanism are one of (if not THE) main roadblocks for scaling up transformers to longer sequences. The paper Big Bird: Transformers for Longer Sequences replaces the full quadratic attention mechanism with a combination of random attention, window attention, and global attention (similar in principle to the reasoning behind Longformer); a toy sketch of the resulting sparsity pattern follows the figure below. Not only does this allow the processing of longer sequences, translating to SOTA experimental results, but the authors show that BigBird comes with theoretical guarantees of universal approximation and Turing completeness. It’s not a perfect replacement for previous transformers (the assumption that O(n) performance gains offset an O(log n) loss is a big assumption that definitely doesn’t hold in 100% of cases, and the proofs are not airtight).

Taken from Zaheer, Guruganesh, et al.
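To make the sparsity pattern above concrete, here’s a toy construction of a BigBird-style boolean attention mask (sliding window + a few global tokens + a sprinkle of random connections). The sizes and counts are arbitrary and this isn’t the paper’s actual implementation, just an illustration of how few entries survive compared to full attention:

```python
# Toy BigBird-style attention mask: window + global + random, instead of the full O(n^2) mask.
import numpy as np

def sparse_attention_mask(seq_len, window=3, n_global=2, n_random=2, seed=0):
    rng = np.random.default_rng(seed)
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for i in range(seq_len):
        # window attention: each token attends to its local neighborhood
        lo, hi = max(0, i - window), min(seq_len, i + window + 1)
        mask[i, lo:hi] = True
        # random attention: a few arbitrary long-range connections per token
        mask[i, rng.choice(seq_len, size=n_random, replace=False)] = True
    # global attention: the first n_global tokens see everything and are seen by everything
    mask[:n_global, :] = True
    mask[:, :n_global] = True
    return mask

mask = sparse_attention_mask(64)
print(f"{mask.sum()} of {mask.size} attention entries kept vs. full attention")
```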

  • Reduced Precision Weight Transformers: If you’ve ever seen one of Ilya Sutskever’s presentations, he may have brought up the analogy of humans’ exponential increase in brain capacity starting only a few million years ago. In fact, Ilya has repeated almost the exact same slides in at least 3 different speaking events (all right, we get it already). The emphasis is usually on the fact that, at a certain point in time, humans became the smartest animals in their environment, save for other humans that we had to cooperate and/or compete with. However, humans also faced additional pressures that constrained brain size, such as the energy requirements for sustaining non-essential processes, as well as our unusually risky and painful reproductive cycles. If we apply this analogy to models like GPT-3, the question becomes how to compress the model down to its most generalizable bare essentials to save on memory. One area I’ve been keenly interested in is reduced-precision neural networks. These may involve replacing float32 weights with float16s to make neural networks run faster on TPUs, or even something more extreme like 8-bit, 4-bit, or even 1-bit neural network weights (a minimal sketch of the core idea follows the figure below). While such networks may initially take a hit when it comes to accuracy, the improvements in memory burden and runtime are phenomenal (even on devices without GPUs or TPUs). This research area might be a long way off. Quantized image classification models are still playing catch-up with the non-quantized models, and quantized transformers are still a very recent development (only 8-bit transformers have managed to even approach their full-precision counterparts, with more work still being needed on 1-bit precision). Still, this area is even more important when you consider its implications for AI safety: it’s not just the amount of raw compute power that an AI has that’s dangerous, but how much compute can be fit into certain low-memory devices.

Taken from Chaofei Fan’s ‘Quantized Transformer’ paper
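For what “reduced precision” means mechanically, here’s a minimal post-training quantization sketch: map float32 weights to int8 with a single symmetric per-tensor scale, then dequantize. Real quantized transformers involve far more (per-channel scales, quantization-aware training, calibrated activations); this only shows the core trade-off of 4x smaller weights for a small rounding error:

```python
# Minimal symmetric int8 post-training quantization of a weight tensor.
import numpy as np

def quantize_int8(w):
    scale = np.abs(w).max() / 127.0                      # one scale for the whole tensor
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(512, 512).astype(np.float32)          # stand-in for a weight matrix
q, scale = quantize_int8(w)

print("max abs rounding error:", np.abs(w - dequantize(q, scale)).max())
print("bytes: float32 =", w.nbytes, "| int8 =", q.nbytes)  # 4x smaller
```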

  • Image Generative Pre-Training 2.0: A few months ago, OpenAI researchers led by Mark Chen adapted the GPT-2 architecture to predict the next parts of sequences of pixels instead of sequences of words. This technique, Image Generative Pre-Training (iGPT), does not use any of the convolutional layers one would think are a prerequisite in vision tasks. Despite that, the features it learns by filling in missing parts of images reached 72% accuracy on ImageNet classification (just behind the 76.5% SOTA from SimCLR). True, iGPT required the input images to be downsampled due to restrictions on input sizes. Despite this (or possibly because of this), iGPT outperformed SimCLR when fine-tuned and evaluated on the CIFAR datasets. Beyond just the qualitative differences between using GPT-2 and GPT-3 for this task, it would be interesting to see if a modified version of GPT-3 that can handle larger input sequences could handle ImageNet data without downsampling.

The obvious downside is that such a technique may be more arduous to train than even the largest GPT-3 model

(taken from the minGPT repo) The underlying model is only 300 lines of code. It’s the training and dataset acquisition that’s the hard part.

  • How confident is GPT-3?: Anyone that’s used GPT-3 has probably noticed that when it’s wrong, it often seems confidently wrong. One of the traits of effective liars is that they’re often able to believe what they’re saying to a certain degree. In the case of a language model, it would be wonderful if we could analyze how confident the GPT-3 model actually is. Ideally, we would like to create a probabilistic version of GPT-3 which can take in a prompt, return error bounds on the confidence of its predictions, or even refuse to create an output if it’s not sufficiently confident. Much like with the quantized transformers, this would require not only re-training GPT-3, but a fundamental rewrite of the underlying architecture.
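Short of that fundamental rewrite, a crude proxy is already exposed by the API: the per-token log-probabilities (the `logprobs` field that shows up as null in the sample response earlier in this post). The sketch below averages them and refuses to answer below an arbitrary threshold; the exact response shape is based on the beta API and may differ, so treat it as an assumption to verify:

```python
# Crude confidence proxy: average the per-token log-probabilities returned by the API
# and refuse to answer when the average falls below a (totally arbitrary) threshold.
import os
import requests

API_URL = "https://api.openai.com/v1/engines/davinci/completions"

def complete_with_confidence(prompt, threshold=-1.5, max_tokens=32):
    resp = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
        json={"prompt": prompt, "max_tokens": max_tokens, "logprobs": 0},
    )
    resp.raise_for_status()
    choice = resp.json()["choices"][0]
    token_logprobs = choice["logprobs"]["token_logprobs"]
    avg_logprob = sum(token_logprobs) / len(token_logprobs)
    if avg_logprob < threshold:
        return None, avg_logprob   # "I'm not confident enough to answer"
    return choice["text"], avg_logprob
```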

Cited as:

@article{mcateer2020gpt3,
    title = "Messing with GPT-3",
    author = "McAteer, Matthew",
    journal = "matthewmcateer.me",
    year = "2020",
    url = "https://matthewmcateer.me/blog/messing-with-gpt-3/"
}

If you notice mistakes and errors in this post, don’t hesitate to contact me at [contact at matthewmcateer dot me] and I will be very happy to correct them right away! Alternatively, you can follow me on Twitter and reach out to me there.

See you in the next post 😄
