{ "cells": [ { "cell_type": "markdown", "id": "2054403f", "metadata": {}, "source": [ "# Serve GPT-J on SageMaker with DJLServing using PySDK\n", "\n", "In this notebook, we explore how to host a fine-tuned GPT-J parameter model on SageMaker using [Deep Java Library (DJL) on Amazon SageMaker](https://sagemaker.readthedocs.io/en/stable/frameworks/djl/index.html).\n", "\n", "Language models have recently exploded in both size and popularity. In 2018, BERT-large entered the scene and, with its 340M parameters and novel transformer architecture, set the standard on NLP task accuracy. Within just a few years, state-of-the-art NLP model size has grown by more than 500x with models such as OpenAI’s 175 billion parameter GPT-3 and similarly sized open source Bloom 176B raising the bar on NLP accuracy. This increase in the number of parameters is driven by the simple and empirically-demonstrated positive relationship between model size and accuracy: more is better. With easy access from models zoos such as Hugging Face and improved accuracy in NLP tasks such as classification and text generation, practitioners are increasingly reaching for these large models. However, deploying them can be a challenge because of their size.\n", "\n", "Model parallelism can help deploy large models that would normally be too large for a single GPU. With model parallelism, we partition and distribute a model across multiple GPUs. Each GPU holds a different part of the model, resolving the memory capacity issue for the largest deep learning models with billions of parameters. This notebook uses tensor parallelism techniques which allow GPUs to work simultaneously on the same layer of a model and achieve low latency inference relative to a pipeline parallel solution.\n", "\n", "SageMaker has rolled out DeepSpeed container which now provides users with the ability to leverage the managed serving capabilities and help to provide the un-differentiated heavy lifting.\n", "\n", "In this notebook, we deploy the fine tuned GPT-J model. DeepSpeed is used for tensor parallelism inference while DJLServing handles inference requests and the distributed workers. For further reading on DeepSpeed you can refer to https://arxiv.org/pdf/2207.00032.pdf \n" ] }, { "cell_type": "code", "execution_count": null, "id": "c780d5fc", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Instal boto3 library to create model and run inference workloads\n", "%pip install -Uqq boto3 awscli" ] }, { "cell_type": "markdown", "id": "d0d4c85a", "metadata": {}, "source": [ "## Section to Download Model from S3\n", "\n", "In this section we download the model archive from S3. We will decompress the file and inspect the artifacts. 
" ] }, { "cell_type": "code", "execution_count": null, "id": "4f5a06ca", "metadata": { "tags": [] }, "outputs": [], "source": [ "import sagemaker\n", "import boto3\n", "\n", "bucket = sagemaker.session.Session().default_bucket()\n", "print(bucket)\n", "\n", "from sagemaker import get_execution_role\n", "\n", "role = sagemaker.get_execution_role() # execution role for the endpoint\n", "session = (\n", " sagemaker.session.Session()\n", ") # sagemaker session for interacting with different AWS APIs\n", "region = session._region_name\n", "\n", "sm_client = boto3.client(\"sagemaker\")" ] }, { "cell_type": "markdown", "id": "f37ba706-b12d-4427-928e-dadfdd02a811", "metadata": {}, "source": [ "Next cell controls which local path to use for fetching the model" ] }, { "cell_type": "code", "execution_count": null, "id": "4fddb70e-ac30-4d03-bfad-c73155bb58b3", "metadata": { "tags": [] }, "outputs": [], "source": [ "local_model_path = \"./model\"" ] }, { "cell_type": "markdown", "id": "b625f99d-2591-4617-94ef-428324a7b4ac", "metadata": {}, "source": [ "