How to Deploy GPT-J

The Forefront Team
August 30, 2021

More than one year has passed since the public release of OpenAI's API for GPT-3. Since then, thousands of developers and hundreds of companies have started building on the platform to apply the transformer-based language model to a variety of NLP problems.

In its wake, EleutherAI, a team of AI researchers open-sourcing their work, released their first implementation of a GPT-like system, the 2.7B parameter GPT-Neo, and most recently, the 6B parameter GPT-J. Before getting into GPT-J deployments, let's understand why a company or developer would use GPT-J in the first place.

So why would one prefer to use the open-source 6B parameter GPT-J over the 175B parameter GPT-3 Davinci? The answer comes down to cost and performance.

First, let's talk about cost. With GPT-3, you pay per 1000 tokens. For the unacquainted, you can think of tokens as pieces of words, where 1000 tokens are about 750 words. So with GPT-3, your costs scale directly with usage. On the other hand, the open-source GPT-J can be deployed to cloud infrastructure, giving you effectively unlimited usage while you pay only for the cloud hardware hosting the model.
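To make the trade-off concrete, here is a minimal sketch of the break-even arithmetic between per-token pricing and flat hardware pricing. The function names and the prices in the usage note are placeholders chosen for illustration, not actual GPT-3 or cloud rates; only the "1000 tokens are about 750 words" ratio comes from the text above.

```python
def api_cost(tokens, price_per_1k_tokens):
    """Usage-based cost: every 1000 tokens processed is billed."""
    return tokens / 1000 * price_per_1k_tokens


def hosted_cost(hours, price_per_hour):
    """Hosted cost: you pay for hardware time, however many tokens you run."""
    return hours * price_per_hour


def break_even_tokens(hours, price_per_hour, price_per_1k_tokens):
    """Token volume at which the per-token bill equals the hardware bill."""
    return hosted_cost(hours, price_per_hour) / price_per_1k_tokens * 1000


def tokens_to_words(tokens):
    """Rule of thumb from the article: 1000 tokens are about 750 words."""
    return tokens * 0.75
```

For example, with a hypothetical 2.00 per hour for hardware running 720 hours a month and a hypothetical 0.05 per 1000 tokens, `break_even_tokens(720, 2.00, 0.05)` comes out to about 28.8 million tokens per month. Below that volume the per-token plan is cheaper under these assumed prices; above it, the hosted model is.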

Now let's talk about performance. "Bigger is better" has become an adage for a reason, and transformer-based language models are no exception. A 100B parameter transformer model will generally outperform a 10B parameter one, but the keyword is generally. Unless you're trying to solve artificial general intelligence, you probably have a specific use case in mind. This is where fine-tuning GPT-J, that is, specializing the model on a dataset for a specific task, can lead to better performance than GPT-3 Davinci on that task.

Now that we've covered why one would use GPT-J over GPT-3, namely to lower costs at scale and achieve better performance on specific tasks, let's discuss how to deploy GPT-J.

How to deploy GPT-J on Forefront

For this tutorial, we'll be deploying the standard GPT-J-6B.

Create deployment

Once logged in, you can click "New deployment".

Select Vanilla GPT-J

From here, add a name and an optional description for your deployment, then select "Vanilla GPT-J".

Press "Deploy"

Navigate to your newly created deployment, and press "Deploy" to deploy your Vanilla GPT-J model.

Replica count

From your deployment, you can adjust the replica count as usage increases to maintain fast response speeds at scale.
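As a rough guide to choosing a replica count, the sketch below applies Little's law: the number of requests in flight is about the arrival rate times the average latency. The function name, its parameters, and the idea that each replica handles a fixed number of concurrent requests are assumptions made for illustration, not a description of how the platform schedules traffic.

```python
import math


def replicas_needed(requests_per_second, avg_latency_seconds,
                    concurrent_requests_per_replica=1):
    """Estimate replicas from traffic using Little's law.

    in_flight = arrival rate * average latency; each replica is assumed
    to serve a fixed number of requests at once (hypothetical figure).
    """
    in_flight = requests_per_second * avg_latency_seconds
    # Always keep at least one replica running.
    return max(1, math.ceil(in_flight / concurrent_requests_per_replica))
```

Under these assumptions, 10 requests per second at 2 seconds each, with 4 concurrent requests per replica, works out to `replicas_needed(10, 2.0, 4)`, which is 5 replicas. Measure your own latency before relying on an estimate like this.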

To begin inferencing, copy the URL under the deployment name and refer to our docs for a full set of instructions on passing requests and receiving responses.

You can expect all the parameters you'd typically use with GPT-3, such as response length, temperature, top P, top K, repetition penalty, and stop sequences.
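The following is a minimal sketch of what a request carrying these parameters might look like, using only the Python standard library. The URL, the bearer-token header, and every JSON key name (`text`, `length`, `top_p`, and so on) are assumptions for illustration; the docs define the actual request schema.

```python
import json
import urllib.request

# Placeholders: use the URL copied from your deployment and your own key.
DEPLOYMENT_URL = "https://example.invalid/your-deployment-url"
API_KEY = "YOUR_API_KEY"


def build_payload(prompt, length=64, temperature=0.7, top_p=1.0,
                  top_k=40, repetition_penalty=1.0, stop_sequences=None):
    """Collect the sampling parameters into a JSON-serializable dict.

    Key names are hypothetical; check the docs for the real ones.
    """
    return {
        "text": prompt,
        "length": length,
        "temperature": temperature,
        "top_p": top_p,
        "top_k": top_k,
        "repetition_penalty": repetition_penalty,
        "stop_sequences": list(stop_sequences or []),
    }


def build_request(url, api_key, payload):
    """Wrap the payload in an HTTP POST request without sending it."""
    body = json.dumps(payload).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",  # assumed auth scheme
    }
    return urllib.request.Request(url, data=body, headers=headers,
                                  method="POST")


def send(request):
    """Send the request and decode a JSON response (performs network I/O)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

A call would then look like `send(build_request(DEPLOYMENT_URL, API_KEY, build_payload("Once upon a time", length=32)))`, once the placeholders and key names are replaced with the real ones.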


You can also navigate to Playground to experiment with your new GPT-J deployment without needing to use Postman or implement any code.

Deploying GPT-J on Forefront takes only a few minutes. On top of the simplicity we bring to the deployment process, we've made several low-level machine code optimizations enabling your models to run at a fraction of the cost compared to deploying on Google's TPU v2 with no loss in throughput. If you're ready to get started deploying GPT-J, get in touch with our team.

Ready to try GPT-J?

Increase throughput, fine-tune for free, and save up to 33% on inference costs. Try GPT-J on Forefront today.
