"When I first considered using GPT-J, I was concerned that I'd have to compromise on the quality of completions; GPT-3 has many more billions of parameters, after all. I didn't want to release a product I wasn't happy with, so when I saw that fine-tuned GPT-J was producing significantly better results with only a few hundred training examples, I was blown away. Forefront had the unique ability to make inference on GPT-J really fast, and it effortlessly scaled up to meet the massive demand we had when the app went viral. I received many messages from users telling me it was the most realistic AI chatbot they had ever used. The project was a great success, and it wouldn't have been possible without the tech and support provided by Forefront."
While the world impatiently waited for Kanye West's new album, "DONDA", to drop, Wesam Jawich, a software engineer at Google, had an idea: what if you could just ask the outspoken artist when the album was dropping? And so the idea was born to create an AI that would simulate a text conversation with Ye.
The first step was to create a dataset of Kanye West dialogue to train GPT-J on. The dataset used was a compilation of Kanye interview transcripts, tweets, lyrics, and manufactured conversations. After compiling this data into a single text file, the model was trained using Google's TPU v3-8 for 15 minutes. The result was a GPT-J model fine-tuned to imitate Kanye West.
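The compilation step above can be sketched in a few lines. This is a minimal illustration, not the actual script used for TalkToKanye: the folder names and file layout are assumptions, and the only real requirement is that everything ends up in one plain-text file.

```python
import pathlib

# Hypothetical layout: one folder per source type, each holding .txt files
# (interview transcripts, tweets, lyrics, manufactured conversations).
SOURCES = ["interviews", "tweets", "lyrics", "conversations"]

def build_dataset(root: str, out_path: str) -> int:
    """Concatenate every source file into a single plain-text training file.

    Returns the number of documents included.
    """
    chunks = []
    for source in SOURCES:
        for path in sorted(pathlib.Path(root, source).glob("*.txt")):
            text = path.read_text(encoding="utf-8").strip()
            if text:
                chunks.append(text)
    # Separate documents with a blank line so they stay distinct in training.
    pathlib.Path(out_path).write_text("\n\n".join(chunks), encoding="utf-8")
    return len(chunks)
```

The resulting file is what gets fed to the fine-tuning job on the TPU.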
The next step was to deploy the fine-tuned GPT-J model behind an API that Jawich could integrate into his application, so that user requests would generate Kanye responses. The majority of GPT-J models are deployed to Google's TPU v2-8. At Forefront, we've taken a different approach by making several low-level machine code optimizations to ensure best-in-class cost and throughput. The result is the ability to deploy GPT-J in one click and scale to high concurrent usage with fewer replicas. For a more detailed guide on deploying GPT-J on Forefront, check out our recent tutorial.
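To give a feel for the integration step, here's a minimal sketch of building a completion request against a deployed model. The endpoint URL, header names, and payload fields below are placeholders, not Forefront's actual API; the real values come from your deployment's dashboard.

```python
import json
import urllib.request

API_URL = "https://example.com/v1/completions"  # placeholder endpoint
API_KEY = "YOUR_API_KEY"                        # placeholder credential

def build_request(prompt: str, max_tokens: int = 64) -> urllib.request.Request:
    """Assemble a chat-style completion request for the fine-tuned model."""
    payload = json.dumps({
        # Frame the user's message as one turn of a dialogue and let the
        # model complete Kanye's reply.
        "prompt": f"User: {prompt}\nKanye:",
        "max_tokens": max_tokens,
        "temperature": 0.8,
        "stop": ["User:"],  # cut off before the model invents the next user turn
    }).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=payload,
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
    )
```

Sending the request with `urllib.request.urlopen` (or any HTTP client) and extracting the completion text is all the application layer needs on top of this.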
There are two aspects of performance worth noting: how effective the model was at the given task of imitating Kanye, and how quickly the API responded.
While judging how well a model imitates Kanye is certainly a subjective matter, we'll let you be the judge. Here are Kanye responses from OpenAI's 175B-parameter GPT-3 Davinci compared to those from the fine-tuned 6B-parameter GPT-J.
Prompt: Why won't you just release Donda now?
Prompt: How have you been since your divorce from Kim?
Prompt: How is your relationship with Drake?
And here are some more examples and candid reactions to the fine-tuned GPT-J Kanye:
At its peak, TalkToKanye had 700 concurrent users, each sending multiple requests a minute. This load was handled with 3 replicas of GPT-J, which maintained response speeds at various token lengths with no significant increase in latency despite the spike in traffic.
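A quick back-of-envelope calculation puts those figures in perspective. The post only says "multiple requests a minute", so the 2-requests-per-user-per-minute rate below is our assumption, not a measured number.

```python
# Rough load estimate at peak traffic.
concurrent_users = 700
requests_per_user_per_minute = 2  # assumed; "multiple requests a minute"
replicas = 3

total_requests_per_second = concurrent_users * requests_per_user_per_minute / 60
requests_per_second_per_replica = total_requests_per_second / replicas

print(f"~{total_requests_per_second:.0f} req/s total, "
      f"~{requests_per_second_per_replica:.1f} req/s per replica")
```

Under that assumption, each replica absorbs on the order of 8 requests per second, which is the load the 3-replica deployment sustained.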
To get a detailed breakdown on response speeds, you can download our inference speed comparisons to see how GPT-J on Forefront stacks up against other deployment methods.
It would be natural to assume that performance and cost come with a trade-off. After all, no one expects a Honda Civic to beat a Porsche in a race. However, when it comes to deploying GPT-J on Forefront, increased performance comes at a lower cost.
We experimented with cost by switching to GPT-3 Curie during a period of high concurrent usage. Before switching, there were 3 GPT-J replicas deployed on Forefront. After running on Curie for 1.5 hours, the costs totaled over $12, or over $8 per hour. If Jawich had used Davinci instead of Curie, the cost would've been $120 for the 90 minutes.
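The Davinci figure follows directly from the Curie experiment: at the time, OpenAI priced Davinci tokens at roughly 10x the per-token rate of Curie, so the same traffic scales the bill by the same factor.

```python
# Cost extrapolation from the 90-minute Curie experiment.
curie_cost = 12.0            # dollars observed over the switch to Curie
hours = 1.5                  # duration of the experiment
davinci_to_curie_ratio = 10  # Davinci's per-token price relative to Curie's

curie_hourly = curie_cost / hours                   # $8 per hour
davinci_cost = curie_cost * davinci_to_curie_ratio  # $120 for the 90 minutes
davinci_hourly = davinci_cost / hours               # $80 per hour
```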
By using GPT-J on Forefront instead of GPT-3 Davinci, Jawich saved 10x on costs with a better performing model for the given task.
TalkToKanye ended up being quite the trial by fire for the Forefront platform. Not only did we gain visibility into how well a fine-tuned GPT-J model can perform against the largest transformer-based language model today, GPT-3 Davinci, but we saw a real-world example of the platform's ability to scale to hundreds of concurrent users while saving 10x on costs compared to Davinci.