{ "cells": [ { "cell_type": "markdown", "id": "0e48316a", "metadata": {}, "source": [ "\n", "# Serve GPT-NeoX-20B on SageMaker with DeepSpeed using the DJL container\n", "\n", "In this notebook, we explore how to host a large language model on SageMaker using the latest container that packages some of the most popular open source libraries for model parallel inference, such as DeepSpeed and Hugging Face Accelerate. We use DJLServing as the model serving solution in this example. DJLServing is a high-performance universal model serving solution powered by the Deep Java Library (DJL) that is programming language agnostic. To learn more about DJL and DJLServing, you can refer to our recent blog post (https://aws.amazon.com/blogs/machine-learning/deploy-bloom-176b-and-opt-30b-on-amazon-sagemaker-with-large-model-inference-deep-learning-containers-and-deepspeed/).\n", "\n", "Language models have recently exploded in both size and popularity. In 2018, BERT-large entered the scene and, with its 340M parameters and novel transformer architecture, set the standard on NLP task accuracy. Within just a few years, state-of-the-art NLP model size has grown by more than 500x, with models such as OpenAI’s 175 billion parameter GPT-3 and the similarly sized open source Bloom 176B raising the bar on NLP accuracy. This increase in the number of parameters is driven by the simple and empirically demonstrated positive relationship between model size and accuracy: more is better. With easy access to models from model zoos such as Hugging Face, and improved accuracy on NLP tasks such as classification and text generation, practitioners are increasingly reaching for these large models. However, deploying them can be a challenge because of their size.\n", "\n", "Model parallelism can help deploy large models that would normally be too large for a single GPU. With model parallelism, we partition and distribute a model across multiple GPUs. Each GPU holds a different part of the model, resolving the memory capacity issue for the largest deep learning models with billions of parameters. This notebook uses tensor parallelism techniques, which allow GPUs to work simultaneously on the same layer of a model and achieve low latency inference relative to a pipeline parallel solution.\n", "\n", "SageMaker has rolled out a DeepSpeed container that provides users the ability to leverage its managed serving capabilities and takes care of the undifferentiated heavy lifting.\n", "\n", "In this notebook, we deploy the open source GPT-NeoX-20B model across GPUs on an ml.g5.24xlarge instance. DeepSpeed is used for tensor parallel inference while DJLServing handles inference requests and the distributed workers. For further reading on DeepSpeed, you can refer to https://arxiv.org/pdf/2207.00032.pdf.\n" ] }, { "cell_type": "markdown", "id": "ff136863", "metadata": {}, "source": [ "## License agreement\n", " - View the license information at https://github.com/EleutherAI/gpt-neox/blob/main/LICENSE before using the model.\n", " - This notebook is a sample notebook and not intended for production use. Please refer to the license at https://github.com/aws/mit-0.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "9ea19605", "metadata": {}, "outputs": [], "source": [ "import sagemaker\n", "import jinja2\n", "from sagemaker import image_uris\n", "import boto3\n", "import os\n", "import time\n", "import json\n", "from pathlib import Path" ] }, { "cell_type": "code", "execution_count": null, "id": "361b75a6", "metadata": {}, "outputs": [], "source": [ "role = sagemaker.get_execution_role()  # execution role for the endpoint\n", "sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs\n", "bucket = sess.default_bucket()  # bucket to house artifacts\n", "model_bucket = sess.default_bucket()  # bucket to house artifacts\n", "s3_code_prefix = (\n", "    \"hf-large-model-djl-/code_gpt_neox\"  # folder within bucket where code artifact will go\n", ")\n", "\n", "region = sess._region_name\n", "account_id = sess.account_id()\n", "\n", "s3_client = boto3.client(\"s3\")\n", "sm_client = boto3.client(\"sagemaker\")\n", "smr_client = boto3.client(\"sagemaker-runtime\")\n", "\n", "jinja_env = jinja2.Environment()" ] }, { "cell_type": "markdown", "id": "4c6d6f88", "metadata": {}, "source": [ "## Create a SageMaker compatible model artifact, upload it to S3, and bring your own inference script\n", "\n", "SageMaker Large Model Inference containers can be used to host models without providing your own inference code. This is extremely useful when there is no custom pre-processing of the input data or post-processing of the model's predictions.\n", "\n", "However, in this notebook we demonstrate how to deploy a model with custom inference code.\n", "\n", "SageMaker needs the model artifacts to be packaged as a tarball. In this example, we provide the following files: `serving.properties` and `model.py`.\n", "\n", "The tarball has the following layout:\n", "\n", "```\n", "code\n", "├── serving.properties\n", "└── model.py\n", "```\n", "\n", "- `serving.properties` is the configuration file that can be used to configure the model server.\n", "- `model.py` is the script that handles any requests for serving.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "748515f8", "metadata": {}, "outputs": [], "source": [ "!mkdir -p code_gpt_neox_deepspeed" ] }, { "cell_type": "markdown", "id": "b200bd54", "metadata": {}, "source": [ "#### Create serving.properties\n", "\n", "This is a configuration file that indicates to DJL Serving which model parallelization and inference optimization libraries you would like to use. Depending on your need, you can set the appropriate configuration.\n", "\n", "Here is a list of the settings that we use in this configuration file:\n", "- `engine`: The engine for DJL to use. In this case, it is **DeepSpeed**.\n", "- `option.entryPoint`: The entry point Python file or module. This should align with the engine that is being used.\n", "- `option.s3url`: Set this to the URI of the Amazon S3 bucket that contains the model.\n", "\n", "  If you want to download the model from huggingface.co instead, you can set `option.model_id` to the model ID of a pretrained model hosted inside a model repository on huggingface.co (https://huggingface.co/models). The container then uses this model ID to download the corresponding model repository from huggingface.co.\n", "- `option.tensor_parallel_degree`: Set this to the number of GPU devices over which DeepSpeed needs to partition the model. This parameter also controls the number of workers per model that are started when DJL Serving runs. For example, on an 8-GPU machine where we create 8 partitions, we get 1 worker per model to serve requests. For further reading on DeepSpeed, you can follow https://www.deepspeed.ai/tutorials/inference-tutorial/#initializing-for-inference.\n", "\n", "For more details on the configuration options and an exhaustive list, you can refer to the documentation: https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-large-model-configuration.html.\n", "\n", "DeepSpeed uses tensor parallelism, where individual layers (tensors) are sharded across devices; for example, each GPU can hold a slice of each layer. The input data is sent to all GPU devices, where a partial result is computed on each GPU. The partial results are then collected through an All-Gather operation to compute the final result. Tensor parallelism generally provides higher GPU utilization and better performance.\n" ] }, 
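{ "cell_type": "markdown", "id": "tp-sketch-note", "metadata": {}, "source": [ "To make the All-Gather description above concrete, the next cell is a minimal, CPU-only sketch of column-wise tensor parallelism using plain `numpy` (assumed to be available in this notebook's kernel). It only illustrates the math; DeepSpeed performs the actual sharding and cross-GPU communication for us at inference time.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "tp-sketch-code", "metadata": {}, "outputs": [], "source": [ "# Minimal illustration of tensor (column) parallelism for a single linear layer.\n", "# Plain numpy arrays stand in for per-GPU shards; DeepSpeed does the real\n", "# sharding and communication across GPUs for us.\n", "import numpy as np\n", "\n", "x = np.random.rand(1, 8)  # one input activation with hidden size 8\n", "W = np.random.rand(8, 16)  # full weight matrix of the layer\n", "\n", "# Shard the weight column-wise across 4 \"devices\"\n", "shards = np.split(W, 4, axis=1)  # each shard is 8 x 4\n", "\n", "# Each device computes a partial result on its own shard (in parallel on real GPUs)\n", "partials = [x @ shard for shard in shards]\n", "\n", "# All-Gather: collect the partial results to form the full output\n", "y_parallel = np.concatenate(partials, axis=1)\n", "\n", "# Same result as computing with the unsharded weight\n", "assert np.allclose(y_parallel, x @ W)" ] }, 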
{ "cell_type": "code", "execution_count": null, "id": "7cb726c8", "metadata": {}, "outputs": [], "source": [ "# define a variable to contain the s3url of the location that has the model\n", "pretrained_model_location = f\"s3://sagemaker-examples-files-prod-{region}/models/gpt-neox-20b-model/\"\n", "print(f\"Pretrained model will be downloaded from ---- > {pretrained_model_location}\")" ] }, { "cell_type": "code", "execution_count": null, "id": "c295397f", "metadata": {}, "outputs": [], "source": [ "%%writefile ./code_gpt_neox_deepspeed/serving.properties\n", "engine=DeepSpeed\n", "option.tensor_parallel_degree=4\n", "option.s3url={{ s3url }}" ] }, { "cell_type": "code", "execution_count": null, "id": "9019c06a", "metadata": {}, "outputs": [], "source": [ "# we plug the appropriate model location into our `serving.properties` file based on the region in which this notebook is running\n", "template = jinja_env.from_string(Path(\"code_gpt_neox_deepspeed/serving.properties\").open().read())\n", "Path(\"code_gpt_neox_deepspeed/serving.properties\").open(\"w\").write(template.render(s3url=pretrained_model_location))\n", "!pygmentize code_gpt_neox_deepspeed/serving.properties | cat -n" ] }, { "cell_type": "markdown", "id": "e63791d5", "metadata": {}, "source": [ "### Create the inference file that will be used to load the model, shard it, and then use the model for NLP tasks" ] }, { "cell_type": "code", "execution_count": null, "id": "f9669901", "metadata": {}, "outputs": [], "source": [ "%%writefile ./code_gpt_neox_deepspeed/model.py\n", "from djl_python import Input, Output\n", "import deepspeed\n", "import torch\n", "import logging\n", "import math\n", "import os\n", "from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer\n", "\n", "\n", "def load_model(properties):\n", "    # load the tokenizer and model, then let DeepSpeed partition the model across GPUs\n", "    tensor_parallel = properties[\"tensor_parallel_degree\"]\n", "    model_location = properties['model_dir']\n", "    if \"model_id\" in properties:\n", "        model_location = properties['model_id']\n", "    logging.info(f\"Loading model in {model_location}\")\n", "\n", "    tokenizer = AutoTokenizer.from_pretrained(model_location)\n", "\n", "    model = AutoModelForCausalLM.from_pretrained(model_location,\n", "                                                 low_cpu_mem_usage=True)\n", "    if \"dtype\" in properties:\n", "        if properties[\"dtype\"] == \"float16\":\n", "            model.to(torch.float16)\n", "        if properties[\"dtype\"] == \"bfloat16\":\n", "            model.to(torch.bfloat16)\n", "\n", "    logging.info(f\"Starting DeepSpeed init with TP={tensor_parallel}\")\n", "    model = deepspeed.init_inference(model,\n", "                                     mp_size=tensor_parallel,\n", "                                     dtype=model.dtype,\n", "                                     replace_method='auto',\n", "                                     replace_with_kernel_inject=True)\n", "    return model.module, tokenizer\n", "\n", "\n", "model = None\n", "tokenizer = None\n", "generator = None\n", "\n", "\n", "def run_inference(model, tokenizer, data, params):\n", "    generate_kwargs = params\n", "    tokenizer.pad_token = tokenizer.eos_token\n", "    input_tokens = tokenizer.batch_encode_plus(data,\n", "                                               return_tensors=\"pt\",\n", "                                               padding=True)\n", "    for t in input_tokens:\n", "        if torch.is_tensor(input_tokens[t]):\n", "            input_tokens[t] = input_tokens[t].to(torch.cuda.current_device())\n", "    outputs = model.generate(**input_tokens, **generate_kwargs)\n", "    return tokenizer.batch_decode(outputs, skip_special_tokens=True)\n", "\n", "\n", "def handle(inputs: Input):\n", "    # entry point that DJL Serving calls for every inference request\n", "    global model, tokenizer\n", "    if not model:\n", "        model, tokenizer = load_model(inputs.get_properties())\n", "\n", "    if inputs.is_empty():\n", "        # Model server makes an empty call to warmup the model on startup\n", "        return None\n", "    data = inputs.get_as_json()\n", "\n", "    input_sentences = data[\"inputs\"]\n", "    params = data[\"parameters\"]\n", "\n", "    outputs = run_inference(model, tokenizer, input_sentences, params)\n", "    result = {\"outputs\": outputs}\n", "    return Output().add_as_json(result)" ] }, 
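{ "cell_type": "markdown", "id": "payload-contract-note", "metadata": {}, "source": [ "Given the `handle` function above, the endpoint will expect a JSON body with an `inputs` list of prompts and a `parameters` dictionary of generation kwargs, and it will return a JSON object with an `outputs` list. A quick sketch of that contract (the values shown are illustrative only):\n", "\n", "```python\n", "# request body that our model.py handler parses\n", "payload = {\n", "    \"inputs\": [\"Amazon.com is the best\"],\n", "    \"parameters\": {\"max_length\": 50, \"do_sample\": True},\n", "}\n", "# response body it returns: {\"outputs\": [\"<generated text for each prompt>\"]}\n", "```\n" ] }, 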
{ "cell_type": "markdown", "id": "5dd60a29", "metadata": {}, "source": [ "**Image URI for the DJL container being used here**" ] }, { "cell_type": "code", "execution_count": null, "id": "3884d357", "metadata": {}, "outputs": [], "source": [ "# inference_image_uri = f\"{account_id}.dkr.ecr.{region}.amazonaws.com/djl-ds:latest\"\n", "inference_image_uri = (\n", "    f\"763104351884.dkr.ecr.{region}.amazonaws.com/djl-inference:0.20.0-deepspeed0.7.5-cu116\"\n", ")\n", "print(f\"Image going to be used is ---- > {inference_image_uri}\")" ] }, { "cell_type": "markdown", "id": "4ddce346", "metadata": {}, "source": [ "**Create the tarball and then upload it to an S3 location**" ] }, { "cell_type": "code", "execution_count": null, "id": "c9c52338", "metadata": {}, "outputs": [], "source": [ "!rm -f model.tar.gz\n", "!tar czvf model.tar.gz -C code_gpt_neox_deepspeed ." ] }, 
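{ "cell_type": "markdown", "id": "tarball-check-note", "metadata": {}, "source": [ "Optionally, list the archive contents to confirm that `serving.properties` and `model.py` sit at the root of the tarball, matching the layout described earlier. This is a quick sanity check and is not required for deployment.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "tarball-check-code", "metadata": {}, "outputs": [], "source": [ "# quick sanity check of the archive layout (serving.properties and model.py at the root)\n", "!tar tzf model.tar.gz" ] }, 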
{ "cell_type": "code", "execution_count": null, "id": "f388dd32", "metadata": {}, "outputs": [], "source": [ "s3_code_artifact = sess.upload_data(\"model.tar.gz\", bucket, s3_code_prefix)\n", "print(f\"S3 code or model tarball uploaded to --- > {s3_code_artifact}\")" ] }, { "cell_type": "markdown", "id": "e60ecd16", "metadata": {}, "source": [ "### To create the endpoint, the steps are:\n", "\n", "1. Create the model using the image container and the model tarball uploaded earlier\n", "2. Create the endpoint config using the following key parameters\n", "\n", "    a) Instance type is ml.g5.24xlarge\n", "\n", "    b) ContainerStartupHealthCheckTimeoutInSeconds is 2400 to ensure the health check starts after the model is ready\n", "3. Create the endpoint using the endpoint config created above\n" ] }, { "cell_type": "markdown", "id": "649cdd53", "metadata": {}, "source": [ "#### Create the Model\n", "Use the image URI for the DJL container and the S3 location to which the tarball was uploaded.\n", "\n", "The container downloads the model into the `/tmp` space on the instance because SageMaker maps `/tmp` to the Amazon Elastic Block Store (Amazon EBS) volume that is mounted when we specify the endpoint creation parameter `VolumeSizeInGB`. It leverages `s5cmd` (https://github.com/peak/s5cmd), which offers very fast download speeds and is hence extremely useful when downloading large models.\n", "\n", "For instances like p4dn, which come with instance storage volumes, we can continue to leverage `/tmp` on the container. The size of this mount is large enough to hold the model.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "72f58be6", "metadata": {}, "outputs": [], "source": [ "from sagemaker.utils import name_from_base\n", "\n", "model_name = name_from_base(f\"gpt-neox-djl20-ds\")\n", "print(model_name)\n", "\n", "create_model_response = sm_client.create_model(\n", "    ModelName=model_name,\n", "    ExecutionRoleArn=role,\n", "    PrimaryContainer={\n", "        \"Image\": inference_image_uri,\n", "        \"ModelDataUrl\": s3_code_artifact,\n", "    },\n", ")\n", "model_arn = create_model_response[\"ModelArn\"]\n", "\n", "print(f\"Created Model: {model_arn}\")" ] }, { "cell_type": "code", "execution_count": null, "id": "38025fec", "metadata": {}, "outputs": [], "source": [ "endpoint_config_name = f\"{model_name}-config\"\n", "endpoint_name = f\"{model_name}-endpoint\"\n", "\n", "endpoint_config_response = sm_client.create_endpoint_config(\n", "    EndpointConfigName=endpoint_config_name,\n", "    ProductionVariants=[\n", "        {\n", "            \"VariantName\": \"variant1\",\n", "            \"ModelName\": model_name,\n", "            \"InstanceType\": \"ml.g5.24xlarge\",\n", "            \"InitialInstanceCount\": 1,\n", "            \"ContainerStartupHealthCheckTimeoutInSeconds\": 2400,\n", "        },\n", "    ],\n", ")\n", "endpoint_config_response" ] }, { "cell_type": "code", "execution_count": null, "id": "53d84ba6", "metadata": {}, "outputs": [], "source": [ "create_endpoint_response = sm_client.create_endpoint(\n", "    EndpointName=f\"{endpoint_name}\", EndpointConfigName=endpoint_config_name\n", ")\n", "print(f\"Created Endpoint: {create_endpoint_response['EndpointArn']}\")" ] }, { "cell_type": "markdown", "id": "d7454013", "metadata": {}, "source": [ "### This step can take ~10 minutes or longer, so please be patient" ] }, { "cell_type": "code", "execution_count": null, "id": "0de9b6eb", "metadata": {}, "outputs": [], "source": [ "import time\n", "\n", "resp = sm_client.describe_endpoint(EndpointName=endpoint_name)\n", "status = resp[\"EndpointStatus\"]\n", "print(\"Status: \" + status)\n", "\n", "while status == \"Creating\":\n", "    time.sleep(60)\n", "    resp = sm_client.describe_endpoint(EndpointName=endpoint_name)\n", "    status = resp[\"EndpointStatus\"]\n", "    print(\"Status: \" + status)\n", "\n", "print(\"Arn: \" + resp[\"EndpointArn\"])\n", "print(\"Status: \" + status)" ] }, { "cell_type": "markdown", "id": "932d8421", "metadata": {}, "source": [ "#### While you wait for the endpoint to be created, you can read more about:\n", "- [Deep Learning containers for large model inference](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-large-model-dlc.html)\n", "- [DeepSpeed](https://www.deepspeed.ai/tutorials/inference-tutorial/#initializing-for-inference)" ] }, { "cell_type": "markdown", "id": "12f1fa1b", "metadata": {}, "source": [ "#### Leverage Boto3 to invoke the endpoint\n", "\n", "This is a generative model, so we pass in text as a prompt and the model completes the sentence and returns the results.\n", "\n", "You can pass a batch of prompts as input to the model. This is done by setting `inputs` to the list of prompts. The model then returns a result for each prompt. Text generation can be configured using appropriate parameters, which need to be passed to the endpoint as a dictionary of `kwargs` under `parameters`. Refer to this documentation for more details: https://huggingface.co/docs/transformers/main/en/main_classes/text_generation#transformers.GenerationConfig.\n", "\n", "The code sample below invokes the endpoint with a batch of prompts and also sets some generation parameters.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "7c9ff1d6", "metadata": {}, "outputs": [], "source": [ "%%time\n", "prompts = [\"Amazon.com is the best\", \"Amazon.com is the best\"]\n", "response_model = smr_client.invoke_endpoint(\n", "    EndpointName=endpoint_name,\n", "    Body=json.dumps(\n", "        {\n", "            \"inputs\": prompts,\n", "            \"parameters\": {\n", "                \"max_length\": 50,\n", "                \"do_sample\": True,\n", "            },\n", "        }\n", "    ),\n", "    ContentType=\"application/json\",\n", ")\n", "\n", "# the response body is a JSON string of the form {\"outputs\": [...]}\n", "response_model['Body'].read().decode('utf8')" ] }, { "cell_type": "markdown", "id": "c27db166", "metadata": {}, "source": [ "## Conclusion\n", "In this post, we demonstrated how to use SageMaker large model inference containers to host GPT-NeoX. We used DeepSpeed’s model parallel techniques with multiple GPUs on a single SageMaker machine learning instance. For more details about Amazon SageMaker and its large model inference capabilities, refer to the following:\n", "\n", "* Model parallelism and large model inference on SageMaker (https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-large-model-inference.html)\n", "* Amazon SageMaker now supports deploying large models through configurable volume size and timeout quotas (https://aws.amazon.com/about-aws/whats-new/2022/09/amazon-sagemaker-deploying-large-models-volume-size-timeout-quotas/)\n", "* Real-time inference – Amazon SageMaker (https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints.html)\n", "\n" ] }, { "cell_type": "markdown", "id": "b11c344c", "metadata": {}, "source": [ "## Clean Up" ] }, { "cell_type": "code", "execution_count": null, "id": "9947d080", "metadata": {}, "outputs": [], "source": [ "# - Delete the endpoint\n", "sm_client.delete_endpoint(EndpointName=endpoint_name)" ] }, { "cell_type": "code", "execution_count": null, "id": "f92e2391", "metadata": {}, "outputs": [], "source": [ "# - In case the endpoint failed, we still want to delete the endpoint config and the model\n", "sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)\n", "sm_client.delete_model(ModelName=model_name)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.1" }, "vscode": { "interpreter": { "hash": "916dbcbb3f70747c44a77c7bcd40155683ae19c65e1c03b4aa3499c5328201f1" } } }, "nbformat": 4, "nbformat_minor": 5 }