Messing with GPT-Neo

How a reproduction of GPT-3 compares to the original, and how to use it.

What is going on?

In short, EleutherAI is a loose-knit group of independent researchers that is reconstructing GPT-3. Its GPT-Neo model (which comes in 1.3B and 2.7B sizes) is a transformer model designed using EleutherAI’s replication of the GPT-3 architecture. GPT-Neo was trained on the Pile, a large-scale curated dataset created by EleutherAI for the specific purpose of training such models. While the full size of GPT-3 hasn’t been replicated yet (team member Connor Leahy estimated that the model could be finished as early as August), all of the existing GPT-Neo models are now available on HuggingFace.

GPT-Neo has plenty of great things going for it. For example, there’s no need to wait for approval on OpenAI’s waitlist. The models are all open source, and EleutherAI is actively working on larger versions. Given the availability of GPT-Neo, there is a growing community of researchers and developers using it. Still, GPT-3 has an edge over Neo in other areas. The largest GPT-3 (175 billion parameters) is still MUCH larger than the largest available GPT-Neo (2.7 billion parameters). GPT-3 also has a much larger community of researchers, more detailed documentation, AND a pretty well put-together web sandbox. GPT-3 is also hosted on OpenAI’s cloud infrastructure, which has been optimized to return results in a few seconds. By comparison, if you try to run GPT-Neo locally you might have to wait several minutes for a response (unless you pay extra for cloud GPUs or an A100).

How does one use GPT-Neo?

Here are two Colab notebooks where you can try the models (don’t use the 2.7B notebook if you don’t have Colab Pro).

GPT-Neo 1.3B Exploration (use if you DON’T have Colab Pro)

GPT-Neo 2.7B Exploration (use if you DO have Colab Pro)

When using GPT-Neo, you input a text prompt, and the model produces a continuation of it. The length of these continuations is bounded by the min length and max length parameters.
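For readers who would rather script this than click through a notebook, here is a minimal sketch of how such a call might look. It assumes the Hugging Face transformers text-generation pipeline and the EleutherAI/gpt-neo-1.3B checkpoint; I am not claiming this is what the notebooks above run internally. The helper names build_generation_kwargs and generate_continuation are made up for illustration.

```python
def build_generation_kwargs(min_length=50, max_length=70, temperature=0.7):
    """Collect and sanity-check the sampling settings (hypothetical helper)."""
    if min_length < 0 or max_length < min_length:
        raise ValueError("need 0 <= min_length <= max_length")
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    return {
        "do_sample": True,          # sample instead of always taking the top token
        "min_length": min_length,   # lower bound on output length, in tokens
        "max_length": max_length,   # upper bound on output length, in tokens
        "temperature": temperature,
    }


def generate_continuation(prompt, model_name="EleutherAI/gpt-neo-1.3B", **settings):
    """Run the prompt through a text-generation pipeline (assumed setup)."""
    # Imported lazily so the settings helper works without transformers installed.
    from transformers import pipeline

    generator = pipeline("text-generation", model=model_name)
    outputs = generator(prompt, **build_generation_kwargs(**settings))
    return outputs[0]["generated_text"]


# Example call (downloads several GB of weights on first use):
# print(generate_continuation("There once was a man from Nantucket",
#                             min_length=50, max_length=70, temperature=0.7))
```

Note that, as far as I understand the pipeline, both length bounds are counted in tokens rather than characters or words, and the prompt counts toward them.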

For example, suppose we want to get GPT-Neo to complete a dirty limerick. If you remember my earlier post about GPT-3, one of my criticisms was that the larger version seems to memorize pieces of the data it was trained on. GPT-Neo has a much smaller capacity for memorization than GPT-3 to begin with, and on top of that the temperature parameter lets us control how much the output resembles memorized training data (a few have tried to phrase this as “creativity”, but I’m skeptical that this is an appropriate label).

Prompt : "There once was a man from Nantucket"
Min Length : 50
Max Length : 70
Temperature : .7
('There once was a man from Nantucket. He was a great friend of mine. He was a '
 'good man. He was a good friend of mine.\n'
 'My father was a good man.\n'
 'And so is the president of the United States, my mother.\n'
 'And so is the pope.\n'
 'And so are all')

So at least in a few instances, the temperature parameter seems to be a good way to steer around (some of) what the model has learned from crude language.
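To make the role of temperature a bit more concrete, here is a toy sketch of the usual way it is applied: the model's raw scores (logits) are divided by the temperature before being turned into probabilities. The logits below are invented for illustration and do not come from GPT-Neo.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up scores for three candidate next tokens.
toy_logits = [2.0, 1.0, 0.1]

for t in (0.2, 0.7, 1.5):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures pile more probability onto the highest-scoring token, while higher temperatures spread it across the alternatives.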

Performance evaluations for GPT-Neo

GPT-Neo was trained as an autoregressive language model. This means that its core functionality is taking a string of text and predicting the next token. To see how it stood up to the original GPT-3, the authors evaluated its performance on a bunch of linguistic and scientific reasoning benchmarks.
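As a toy illustration of what “autoregressive” means, the sketch below builds a tiny bigram model from word counts and generates text by repeatedly predicting the most likely next token and appending it. This is a stand-in, not GPT-Neo: the real model uses a transformer over subword tokens, and the corpus and function names here are invented.

```python
from collections import Counter, defaultdict


def train_bigram_counts(tokens):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def generate_greedy(counts, prompt_tokens, max_new_tokens=5):
    """Repeatedly append the most frequent follower of the last token."""
    out = list(prompt_tokens)
    for _ in range(max_new_tokens):
        followers = counts.get(out[-1])
        if not followers:
            break  # the last token was never seen with a follower
        out.append(followers.most_common(1)[0][0])
    return out


corpus = "the cat sat on the mat and the cat sat on the sofa".split()
counts = train_bigram_counts(corpus)
print(" ".join(generate_greedy(counts, ["the"], max_new_tokens=4)))
```

Each generated token is fed back in as context for the next prediction, which is the same loop a large language model runs, just with a far richer notion of context.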

| Dataset and Task \ Model and Size | GPT-Neo 1.3B | GPT-2 1.5B | GPT-Neo 2.7B | GPT-3 Ada |
|---|---|---|---|---|
| Pile BPB (Linguistic Reasoning) | 0.7527 | 1.0468 | 0.7165 | 0.9631 |
| Pile PPL (Linguistic Reasoning) | 6.159 | ----- | 5.646 | ----- |
| Wikitext PPL (Linguistic Reasoning) | 13.10 | 17.48 | 11.39 | ----- |
| Lambada PPL (Linguistic Reasoning) | 7.498 | 10.634 | 5.626 | 9.954 |
| Lambada Acc (Linguistic Reasoning) | 57.23% | 51.21% | 62.22% | 51.60% |
| Winogrande (Linguistic Reasoning) | 55.01% | 59.40% | 56.50% | 52.90% |
| Hellaswag (Linguistic Reasoning) | 38.66% | 40.03% | 42.73% | 35.93% |
| MathQA (Physical and Scientific Reasoning) | 24.05% | 23.64% | 24.72% | 24.29% |
| PubMedQA (Physical and Scientific Reasoning) | 54.40% | 58.33% | 57.54% | 52.80% |
| Piqa (Physical and Scientific Reasoning) | 71.11% | 70.78% | 72.14% | 68.88% |
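For anyone unfamiliar with the PPL and BPB rows, here is a small sketch of how perplexity and bits per byte are commonly computed from a model's per-token log-probabilities (lower is better for both). The probabilities below are invented, and I am assuming the standard definitions rather than reproducing the exact evaluation code behind these numbers.

```python
import math


def perplexity(token_logprobs):
    """exp of the average negative log-likelihood per token (natural log)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def bits_per_byte(token_logprobs, text):
    """Total negative log-likelihood in bits, divided by the UTF-8 byte count."""
    total_bits = -sum(token_logprobs) / math.log(2)
    return total_bits / len(text.encode("utf-8"))


# Pretend a model split "hello world" into two tokens and assigned them
# probabilities of 0.25 and 0.5 (made-up numbers).
text = "hello world"
toy_logprobs = [math.log(0.25), math.log(0.5)]

print("perplexity:", perplexity(toy_logprobs))
print("bits per byte:", bits_per_byte(toy_logprobs, text))
```

Bits per byte normalizes by the length of the raw text instead of the token count, which makes it easier to compare models that use different tokenizers.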

Was this ethical?

The release of GPT-3 raised plenty of concerns, much like GPT-2 before it. EleutherAI’s stated goal is to rebuild OpenAI’s full 175-billion-parameter version of GPT-3, while giving extra attention to weeding out various social biases. Specifically, the team measured word pairings and used sentiment analysis to rate the data on gender, religion, and racial bias. At least in the examples that they presented (it’s unclear how much cherry-picking has been going on), they showed that what they deemed “unacceptably high levels of bias” were removed.

EleutherAI has also been trying to avoid many of the biases that were inherent to GPT-3’s training set. The Pile training corpus (developed by EleutherAI, and the one used to train GPT-Neo) comprises 825GB of text. In addition to established text datasets, it includes books, GitHub repositories, webpages, IRC chat logs, and medical, physics, math, computer science, and philosophy papers. While this appears to be a step in the right direction when it comes to data diversity, the authors admit that the dataset still contains lewd, abrasive, and profane language. Their recommendation for this is to have a human in the loop inspecting outputs.

GPT-Neo is also intended to be much more open than GPT-3. Microsoft has an exclusive license to the full model, while others can sign up for access to test a limited version of the API. GPT-3 made headlines worldwide, but few coders have actually been able to use it. Needless to say, this made a lot of people very unhappy, especially after OpenAI had transitioned from a “Non-profit” to a “Capped-profit” organization. In my original post on GPT-3, I raised the possibility that, in a worst-case scenario, the careers of ML engineers could end up being defined by their access (or lack thereof) to top-performing models. As of this writing, the GPT-Neo project is being hosted by CoreWeave, which has been giving the project free access to infrastructure. It plans eventually to host instances for paying customers (not just Microsoft).

So at least on the axes of social bias and openness, EleutherAI has a case that it’s approaching these more cautiously than the original GPT-3 (how well they achieve this with the future 175B+ GPT-Neo remains to be seen). The biggest remaining question is whether there’s such a thing as making the model too available. In response, I’d argue that keeping GPT-3 “closed source” was never going to be possible in the long run, and that GPT-3’s handling was a perfect case study of how not to keep the proverbial genie in the bottle.

Random people and organizations trying to reconstruct GPT-3 was probably inevitable. After all, OpenAI did release details of the architecture for the model. All it would take to rebuild GPT-3 would be a sufficiently motivated party with at least a minimum ability to read CS papers plus access to huge compute reserves, not to mention a lack of caution about the negative externalities of misusing GPT-3.

It turns out the Beijing Academy of Artificial Intelligence (BAAI) may fit this description pretty well. Founded in 2018, the BAAI was created to help the Chinese government achieve its goal of becoming the “global center of AI”. The projects BAAI works on and promotes also seem to be geared towards curbing China’s AI brain drain (many of the country’s most talented engineers wind up leaving for work overseas). According to Synced Review, BAAI recently released four new models collectively referred to as “Wu Dao”: Wen Yuan, Wen Lan, Wen Hui, and Wen Su.

Most of these seem to have been built specifically to compete with other high-profile machine learning models from organizations like OpenAI (and no, it hasn’t escaped me that OpenAI uses Azure, one of the services whose private data was breached in the SolarWinds hack). Wen Su is based on BERT, Wen Lan is a text-image retrieval model based on CLIP, Wen Hui is trying to steal DALL-E’s thunder, and last but not least Wen Yuan is trying to be a successor to GPT-3. According to AI Technology Review, Wen Yuan is a 2.6-billion-parameter language model (with plans to scale the model up to 100 billion parameters later this year) that matched or exceeded GPT-3’s performance in various Chinese- and English-language tasks (though as we saw with GPT-Neo, reproducibility with large language models isn’t always as straightforward as we’d like).

Officially, these tools have been described as having non-confrontational uses. In January, researchers associated with the project told Wired that it could help citizens navigate aspects of China’s bureaucracy such as the Beijing Motor Vehicles Administration (I will admit, it would be wonderful to be able to trust a large language model with helping me navigate the DMV). Still, as far back as GPT-2, there was plenty of worry about the use of such language models in bad faith. In the same way that deepfakes could be used for blackmail, the worry is that such generative models could be used for tasks like astroturfing campaigns or sophisticated spear-phishing attacks. Considering the resources and effort China has already put into astroturfing (see the tweets below), and hacking US companies and government agencies, reconstructing GPT-3 may have just been a question of who would do it first.

Was this inevitable?

When the United States first developed its nuclear weapons, plenty of people within the government felt like the threat of their use was a geopolitical ace in the hole. This all changed when the USSR demonstrated to the world that it had nukes of its own. The US using nuclear weapons to end WWII had the unintended side effect of proving that it had plans and designs worth stealing.

Fast forward about 70 years. When OpenAI decided to delay or outright cancel releases of its larger machine learning models, many of the arguments in favor were along the lines of not letting the source code or model weights fall into the hands of bad-faith actors. At the very least, researchers needed more time to figure out ways of preventing their misuse. Unfortunately, some “bad-faith actors” might have only needed proof of GPT-3’s existence to begin work on copying or stealing the work. If we assume that the reproduction of GPT-3 by outside organizations was inevitable, EleutherAI’s mission is much more justifiable.

Going back to the analogy of nuclear weapons, it certainly would have been nice if investigations into their non-warfare uses (e.g., Operation Plowshare and Project Orion) had begun in earnest much earlier. Likewise, EleutherAI was probably wise to waste no time in making a version of GPT-3 with more of the harmful social biases finetuned out. If either nuclear weapons or socially destabilizing language models cannot be fully contained, then it’s a moral imperative to investigate benign uses or, failing that, damage control.


Cited as:

    title = "Messing with GPT-Neo",
    author = "McAteer, Matthew",
    journal = "",
    year = "2021",
    url = ""

If you notice mistakes and errors in this post, don’t hesitate to contact me at [contact at matthewmcateer dot me] and I will be very happy to correct them right away! Alternatively, you can follow me on Twitter and reach out to me there.

See you in the next post 😄
