{ "cells": [ { "cell_type": "markdown", "id": "b9bd68bd-8f1a-4fc3-ae77-738448c9b59c", "metadata": {}, "source": [ "# GPT-NeoX-20B on SageMaker using Hugging Face Text Generation Inference (TGI) DLC\n", "\n", "This notebook demonstrates how to deploy GPT-NeoX-20B Language Model using Hugging Face Text Generation Inference (TGI) Deep Learning Container on Amazon SageMaker.\n", "GPT-NeoX-20B is a 20 billion parameter autoregressive language model trained on the Pile using the GPT-NeoX library\n", "\n", "TGI is an open source, high performance inference library that can be used to deploy large language models from Hugging Face’s repository in minutes. The library includes advanced functionality like model parallelism and dynamic batching to simplify production inference with large language models like flan-t5-xxl, LLaMa, StableLM, and GPT-NeoX.\n" ] }, { "cell_type": "markdown", "id": "a1ec0d15-478f-42f5-ad46-b7d0f3faddb4", "metadata": {}, "source": [ "## Setup\n", "\n", "Install the SageMaker Python SDK\n", "\n", "First, make sure that the latest version of SageMaker SDK is installed." ] }, { "cell_type": "code", "execution_count": null, "id": "520ceeee-e03e-4ed8-9a76-bb3a3f9e2724", "metadata": { "tags": [] }, "outputs": [], "source": [ "%pip install pip==8.1.1 --quiet #Added due to intermittent distutils package error - optional if no error\n", "%pip install \"sagemaker>=2.163.0\" --quiet" ] }, { "cell_type": "markdown", "id": "94ec60d3-e859-4818-a3f0-86665a1e1db2", "metadata": {}, "source": [ "## Setup account and role\n", "\n", "Then, we import the SageMaker python SDK and instantiate a sagemaker_session which we use to determine the current region and execution role." ] }, { "cell_type": "code", "execution_count": null, "id": "1e0943c1-e1b6-4e02-94fd-316dec595698", "metadata": { "tags": [] }, "outputs": [], "source": [ "import sagemaker\n", "from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri\n", "import time\n", "\n", "sagemaker_session = sagemaker.Session()\n", "region = sagemaker_session.boto_region_name\n", "role = sagemaker.get_execution_role()" ] }, { "cell_type": "markdown", "id": "a8cb0ec6-e370-4103-a35f-ae160c67702d", "metadata": {}, "source": [ "## Retrieve the LLM Image URI\n", "\n", "We use the helper function get_huggingface_llm_image_uri() to generate the appropriate image URI for the Hugging Face Large Language Model (LLM) inference.\n", "\n", "The function takes a required parameter backend and several optional parameters. The backend specifies the type of backend to use for the model, the values can be \"lmi\" and \"huggingface\". The \"lmi\" stands for SageMaker LMI inference backend, and \"huggingface\" refers to using Hugging Face TGI inference backend." ] }, { "cell_type": "code", "execution_count": null, "id": "36512f00-f890-4dce-b862-51b588198dff", "metadata": { "tags": [] }, "outputs": [], "source": [ "image_uri = get_huggingface_llm_image_uri(backend=\"huggingface\", region=region) # or lmi" ] }, { "cell_type": "markdown", "id": "e60bd044-af1d-45db-8885-e4e61b8f2b6c", "metadata": {}, "source": [ "## Create the Hugging Face Model\n", "\n", "Next we configure the model object by specifying a unique name, the image_uri for the managed TGI container, and the execution role for the endpoint. 
Additionally, we specify a number of environment variables, including HF_MODEL_ID, which identifies the model on the Hugging Face Hub that will be deployed, and HF_TASK, which configures the inference task to be performed by the model.\n", "\n", "You should also define SM_NUM_GPUS, which specifies the tensor parallelism degree of the model. Tensor parallelism is used to split the model across multiple GPUs, which is necessary when working with LLMs that are too big for a single GPU. Here, you should set SM_NUM_GPUS to the number of available GPUs on your selected instance type. For example, in this tutorial, we set SM_NUM_GPUS to 4 because our selected instance type ml.g5.12xlarge has 4 available GPUs.\n", "\n", "Additionally, we could reduce the memory footprint of the model by setting the HF_MODEL_QUANTIZE environment variable, for example to \"bitsandbytes\"." ] }, { "cell_type": "code", "execution_count": null, "id": "82720cc4-586f-4649-9a0b-dd8bee9fcee9", "metadata": { "tags": [] }, "outputs": [], "source": [ "model_name = \"gpt-neox-20b-\" + time.strftime(\"%Y-%m-%d-%H-%M-%S\", time.gmtime())\n", "\n", "hub = {\"HF_MODEL_ID\": \"EleutherAI/gpt-neox-20b\", \"HF_TASK\": \"text-generation\", \"SM_NUM_GPUS\": \"4\"}\n", "\n", "model = HuggingFaceModel(name=model_name, env=hub, role=role, image_uri=image_uri)" ] }, { "cell_type": "markdown", "id": "fc2cf50c-c934-4b77-942b-09a60fcecdd9", "metadata": {}, "source": [ "## Creating a SageMaker Endpoint\n", "\n", "Next we deploy the model by invoking the deploy() function. Here we use an ml.g5.12xlarge instance, which comes with 4 NVIDIA A10G GPUs. By setting the SM_NUM_GPUS environment variable to 4 in the previous code block, we indicated that this model should be sharded across all 4 GPU devices." ] }, { "cell_type": "code", "execution_count": null, "id": "cb1c3c99-b178-4002-b6a9-067435d3d4b3", "metadata": { "tags": [] }, "outputs": [], "source": [ "predictor = model.deploy(\n", "    initial_instance_count=1, instance_type=\"ml.g5.12xlarge\", endpoint_name=model_name\n", ")" ] }, { "cell_type": "markdown", "id": "2dc09603-867d-4a37-9bad-05129f0a640a", "metadata": {}, "source": [ "## Running Inference\n", "\n", "Once the endpoint is up and running, we can evaluate the model using the predict() function."
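, "\n", "The TGI container accepts a JSON payload with an inputs string and an optional parameters object; predict() sends exactly this payload. If you later need to invoke the endpoint from outside the SageMaker Python SDK, the same payload can be sent with the low-level SageMaker runtime client. A minimal sketch, assuming the endpoint name chosen at deployment:\n", "\n", "```python\n", "import json\n", "\n", "import boto3\n", "\n", "# low-level client for invoking SageMaker endpoints\n", "smr = boto3.client(\"sagemaker-runtime\")\n", "\n", "response = smr.invoke_endpoint(\n", "    EndpointName=model_name,  # the endpoint_name passed to model.deploy()\n", "    ContentType=\"application/json\",\n", "    Body=json.dumps({\"inputs\": \"Hello, my name is\", \"parameters\": {\"max_new_tokens\": 32}}),\n", ")\n", "print(json.loads(response[\"Body\"].read())[0][\"generated_text\"])\n", "```"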
] }, { "cell_type": "code", "execution_count": null, "id": "e26d8603-4091-49ac-8d91-4944ac73cb01", "metadata": {}, "outputs": [], "source": [ "input_data = {\n", " \"inputs\": \"The diamondback terrapin was the first reptile to\",\n", " \"parameters\": {\"do_sample\": True, \"max_new_tokens\": 100, \"temperature\": 0.7, \"watermark\": True},\n", "}\n", "\n", "response = predictor.predict(input_data)\n", "print(response[0][\"generated_text\"])" ] }, { "cell_type": "markdown", "id": "fd85a0ad-0952-4e92-bd60-ad473147bfaf", "metadata": {}, "source": [ "## Create sample chatbot application backed by SageMaker (Optional)" ] }, { "cell_type": "code", "execution_count": null, "id": "d6cd476b-3e77-407c-9be8-673a2c31a1b9", "metadata": { "tags": [] }, "outputs": [], "source": [ "%pip install --upgrade pip --quiet\n", "%pip install gradio --upgrade --quiet" ] }, { "cell_type": "code", "execution_count": null, "id": "1744fcc4-09cf-4fa9-8b00-2d7d57a9d012", "metadata": { "tags": [] }, "outputs": [], "source": [ "import gradio as gr\n", "\n", "# hyperparameters for llm\n", "parameters = {\n", " \"do_sample\": True,\n", " \"top_p\": 0.7,\n", " \"temperature\": 0.7,\n", " \"top_k\": 50,\n", " \"max_new_tokens\": 256,\n", " \"repetition_penalty\": 1.03,\n", " \"stop\": [\"<|endoftext|>\"],\n", "}\n", "\n", "with gr.Blocks() as demo:\n", " gr.Markdown(\"## Chat with GPT-NeoX-20B LLM using Amazon SageMaker\")\n", " with gr.Column():\n", " chatbot = gr.Chatbot()\n", " with gr.Row():\n", " with gr.Column():\n", " message = gr.Textbox(\n", " label=\"Chat Message Box\", placeholder=\"Chat Message Box\", show_label=False\n", " )\n", " with gr.Column():\n", " with gr.Row():\n", " submit = gr.Button(\"Submit\")\n", " clear = gr.Button(\"Clear\")\n", "\n", " def respond(message, chat_history):\n", " # convert chat history to prompt\n", " converted_chat_history = \"\"\n", " if len(chat_history) > 0:\n", " for c in chat_history:\n", " converted_chat_history += (\n", " f\"<|prompter|>{c[0]}<|endoftext|><|assistant|>{c[1]}<|endoftext|>\"\n", " )\n", " prompt = f\"{converted_chat_history}<|prompter|>{message}<|endoftext|><|assistant|>\"\n", "\n", " # send request to endpoint\n", " llm_response = predictor.predict({\"inputs\": prompt, \"parameters\": parameters})\n", "\n", " # remove prompt from response\n", " parsed_response = llm_response[0][\"generated_text\"][len(prompt) :]\n", " chat_history.append((message, parsed_response))\n", " return \"\", chat_history\n", "\n", " submit.click(respond, [message, chatbot], [message, chatbot], queue=False)\n", " clear.click(lambda: None, None, chatbot, queue=False)\n", "\n", "demo.launch(share=True)" ] }, { "cell_type": "markdown", "id": "65c49ba2-bccb-4498-926c-3cd9da857a15", "metadata": {}, "source": [ "We can use this URL hosting the sample application to perform text generation with the model." ] }, { "cell_type": "markdown", "id": "aef71ec8-34b7-4b83-b9f2-c18f866eb1f7", "metadata": {}, "source": [ "## Cleaning Up\n", "\n", "After you've finished using the endpoint, it's important to delete it to avoid incurring unnecessary costs." 
] }, { "cell_type": "code", "execution_count": null, "id": "08f63a15-3dde-45db-b62b-811ba99919d4", "metadata": {}, "outputs": [], "source": [ "predictor.delete_model()\n", "predictor.delete_endpoint()" ] }, { "cell_type": "markdown", "id": "d7d1b2bb-028b-40ea-955a-3e6285f26502", "metadata": { "tags": [] }, "source": [ "## Conclusion\n", "\n", "In this tutorial, we used a TGI container to deploy BLOOM-560m using 4 GPUs on a SageMaker ml.g5.12xlarge instance. With Hugging Face's Text Generation Inference and SageMaker Hosting, you can easily host large language models like GPT-NeoX, flan-t5-xxl, and LLaMa." ] } ], "metadata": { "availableInstances": [ { "_defaultOrder": 0, "_isFastLaunch": true, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 4, "name": "ml.t3.medium", "vcpuNum": 2 }, { "_defaultOrder": 1, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 8, "name": "ml.t3.large", "vcpuNum": 2 }, { "_defaultOrder": 2, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.t3.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 3, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.t3.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 4, "_isFastLaunch": true, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 8, "name": "ml.m5.large", "vcpuNum": 2 }, { "_defaultOrder": 5, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.m5.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 6, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.m5.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 7, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 64, "name": "ml.m5.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 8, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 128, "name": "ml.m5.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 9, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 192, "name": "ml.m5.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 10, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 256, "name": "ml.m5.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 11, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 384, "name": "ml.m5.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 12, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 8, "name": "ml.m5d.large", "vcpuNum": 2 }, { "_defaultOrder": 13, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.m5d.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 14, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.m5d.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 15, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 64, "name": "ml.m5d.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 16, "_isFastLaunch": false, "category": "General 
purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 128, "name": "ml.m5d.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 17, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 192, "name": "ml.m5d.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 18, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 256, "name": "ml.m5d.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 19, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 384, "name": "ml.m5d.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 20, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": true, "memoryGiB": 0, "name": "ml.geospatial.interactive", "supportedImageNames": [ "sagemaker-geospatial-v1-0" ], "vcpuNum": 0 }, { "_defaultOrder": 21, "_isFastLaunch": true, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 4, "name": "ml.c5.large", "vcpuNum": 2 }, { "_defaultOrder": 22, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 8, "name": "ml.c5.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 23, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.c5.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 24, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.c5.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 25, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 72, "name": "ml.c5.9xlarge", "vcpuNum": 36 }, { "_defaultOrder": 26, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 96, "name": "ml.c5.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 27, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 144, "name": "ml.c5.18xlarge", "vcpuNum": 72 }, { "_defaultOrder": 28, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 192, "name": "ml.c5.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 29, "_isFastLaunch": true, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.g4dn.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 30, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.g4dn.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 31, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 64, "name": "ml.g4dn.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 32, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 128, "name": "ml.g4dn.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 33, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 4, "hideHardwareSpecs": false, "memoryGiB": 192, "name": "ml.g4dn.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 34, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 256, "name": "ml.g4dn.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 35, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, 
"hideHardwareSpecs": false, "memoryGiB": 61, "name": "ml.p3.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 36, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 4, "hideHardwareSpecs": false, "memoryGiB": 244, "name": "ml.p3.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 37, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 8, "hideHardwareSpecs": false, "memoryGiB": 488, "name": "ml.p3.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 38, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 8, "hideHardwareSpecs": false, "memoryGiB": 768, "name": "ml.p3dn.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 39, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.r5.large", "vcpuNum": 2 }, { "_defaultOrder": 40, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.r5.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 41, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 64, "name": "ml.r5.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 42, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 128, "name": "ml.r5.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 43, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 256, "name": "ml.r5.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 44, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 384, "name": "ml.r5.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 45, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 512, "name": "ml.r5.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 46, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 768, "name": "ml.r5.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 47, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.g5.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 48, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.g5.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 49, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 64, "name": "ml.g5.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 50, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 128, "name": "ml.g5.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 51, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 256, "name": "ml.g5.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 52, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 4, "hideHardwareSpecs": false, "memoryGiB": 192, "name": "ml.g5.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 53, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 4, "hideHardwareSpecs": false, "memoryGiB": 384, "name": "ml.g5.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 54, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 8, "hideHardwareSpecs": false, "memoryGiB": 768, "name": "ml.g5.48xlarge", "vcpuNum": 192 
}, { "_defaultOrder": 55, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 8, "hideHardwareSpecs": false, "memoryGiB": 1152, "name": "ml.p4d.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 56, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 8, "hideHardwareSpecs": false, "memoryGiB": 1152, "name": "ml.p4de.24xlarge", "vcpuNum": 96 } ], "instance_type": "ml.t3.medium", "kernelspec": { "display_name": "Python 3 (Data Science)", "language": "python", "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/datascience-1.0" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.10" } }, "nbformat": 4, "nbformat_minor": 5 }