{ "cells": [ { "attachments": {}, "cell_type": "markdown", "id": "c039d896-8fb1-4181-aa7d-a0c21671376f", "metadata": { "tags": [] }, "source": [ "# Hosting bloom-7b1 on Amazon SageMaker using HuggingFace Text Generation Inference (TGI)\n", "\n", "HuggingFace [Text Generation Inference](https://huggingface.co/blog/sagemaker-huggingface-llm) (TGI) is a model server for text generation inference. TGI is powered by Python, Rust and gRPC, an open source high performance Remote Procedure Call framework.\n", "\n", "TGI enables high-performance text generation using Tensor Parallelism and dynamic batching for the most popular open-source LLMs, including StarCoder, BLOOM, GPT-NeoX, Llama, and T5.\n", "\n", "TGI is optimized for many large language [model architectures](https://github.com/huggingface/text-generation-inference#optimized-architectures). BLOOM is one of them.\n", "\n", "## BLOOM-7B1\n", "\n", "[BLOOM](https://bigscience.huggingface.co/) is an Open-Acces autoregressive Large Language Model (LLM), trained to continue text from a prompt on vast amounts of text data using industrial-scale computational resources. BLOOM can be instructed to perform text tasks it hasn't been explicitly trained for, by casting them as text generation tasks. The BLOOM model was trained using a mix of floating point precision formats. For models with up to 7.1 billion parameters, float16 precision was used. However, for the largest model, BLOOM with 176 billion parameters, **bfloat16** precision was used. \n", "\n", " Refer to [Bloom: A 176B-parameter Open-Access Multilingual Language Model](https://arxiv.org/abs/2211.05100) paper for more details.\n", "\n", "In this notebook, we show how to deploy [bloom-7b1](https://huggingface.co/bigscience/bloom-7b1) model using HuggingFace TGI Deep Learning Container (DLC) on Amazon SageMaker.\n" ] }, { "attachments": {}, "cell_type": "markdown", "id": "a6e7007e-72d3-428d-b335-b9848f9e2175", "metadata": {}, "source": [ "## Install Pre-requisites\n", "\n", "Install the SageMaker Python SDK\n", "First, make sure that the latest version of SageMaker SDK is installed.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "id": "91edf772-d370-4d68-8ef0-0c4557c674ab", "metadata": { "tags": [] }, "outputs": [], "source": [ "!pip install \"sagemaker>=2.163.0\" watermark --upgrade --quiet" ] }, { "cell_type": "code", "execution_count": null, "id": "37e7ed6b-da2c-4aee-81b7-e71b7d91d817", "metadata": { "tags": [] }, "outputs": [], "source": [ "%load_ext watermark\n", "%watermark -p sagemaker" ] }, { "attachments": {}, "cell_type": "markdown", "id": "a65bc692-fd6e-44e4-9982-7801cb9fe4ad", "metadata": {}, "source": [ "## Setup account and role\n", "\n", "We instantiate `sagemaker.Session()` which we use to determine the current region and execution role." 
] }, { "cell_type": "code", "execution_count": null, "id": "643a1d60-fc2e-4d6b-86c3-0976a67b31aa", "metadata": { "tags": [] }, "outputs": [], "source": [ "from datetime import datetime\n", "from pathlib import Path\n", "from uuid import uuid4\n", "\n", "import sagemaker\n", "from sagemaker import get_execution_role\n", "from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri\n", "\n", "sm_session = sagemaker.Session()\n", "role = sagemaker.get_execution_role()\n", "region = sm_session.boto_region_name\n", "\n", "HF_MODEL_ID = \"bigscience/bloom-7b1\"\n", "\n", "\n", "print(f\"Role: {role}\")\n", "print(f\"Region: {region}\")" ] }, { "attachments": {}, "cell_type": "markdown", "id": "1f853f77-0e86-486e-b1d5-fc06fe1ca0c1", "metadata": {}, "source": [ "## Create model using HF Text-Generation-Inference (TGI) backend\n", "\n", "We use the helper function `get_huggingface_llm_image_uri()` to retrieve the new HuggingFace TGI powered DLC.\n", "\n", "- The function takes a required parameter `backend` and other optional parameters. \n", "- The _backend_ specifies the type of backend to use for the model, valid values are **`lmi`** and **`huggingface`**. \n", " - using **`huggingface`** invokes Hugging Face Text-Generation-Inference (TGI) backend\n", " - specifying **`lmi`** will use SageMaker Large Model Inference (LMI) backend (with `djl-deepspeed` as the framework)\n", "\n", "\n", "Additionally, we specify the following environment variables to be use while model creation:\n", "- `HF_MODEL_ID` : model_id as defined in `huggingface` hub. e.g. `'bigscience/bloom-7b1'`\n", "- `HF_TASK` : NLP task for the model in use. e.g. `'text-generation'`\n", "- `HF_MODEL_QUANTIZE`: library to use for quantization e.g. `'bitsandbytes'`\n", "- `SM_NUM_GPUS`: Number of GPUs used per replica, specifies the tensor parallelism degree of the model. e.g. `4`\n", "\n", "\n", "Set `SM_NUM_GPUS` to the number of available GPUs when you plan to host your model on an instance type with multiple GPUs.\n", "\n", "TGI will automatically distribute and shard the model across all GPUs." 
] }, { "cell_type": "code", "execution_count": null, "id": "cae53589-c4cf-433e-be83-930716f20a8a", "metadata": { "tags": [] }, "outputs": [], "source": [ "model_name = str(Path(HF_MODEL_ID.split(\"/\")[-1]))\n", "suffix = f\"{str(uuid4())[:5]}-{datetime.now().strftime('%d%b%Y')}\"\n", "\n", "# define model name, endpoint_name\n", "model_name = f\"{model_name}-{suffix}\"\n", "endpoint_name = model_name\n", "\n", "# We use ml.g5.12xlarge instance type with 4 GPUs\n", "instance_type = \"ml.g5.12xlarge\"\n", "\n", "# retrieve the llm image uri with huggingface TGI backend\n", "llm_image_uri = get_huggingface_llm_image_uri(\n", " backend=\"huggingface\", version=\"0.8.2\", region=region, session=sm_session\n", ")\n", "\n", "# print(f\"llm image uri: {llm_image_uri}\")\n", "\n", "# define hugging face config values\n", "config = {\n", " \"HF_MODEL_ID\": HF_MODEL_ID,\n", " \"HF_TASK\": \"text-generation\",\n", " \"SM_NUM_GPUS\": \"4\", # Number of GPU used per replica\n", " \"HF_MODEL_QUANTIZE\": \"bitsandbytes\", # Quantization library used\n", "}\n", "\n", "print(f\"Creating model: {model_name} using {HF_MODEL_ID}\")\n", "\n", "# create HuggingFaceModel with the HF TGI image uri\n", "model = HuggingFaceModel(name=model_name, env=config, role=role, image_uri=llm_image_uri)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "ba62a5f1-662a-4376-825f-3d8be2915240", "metadata": {}, "source": [ "### Deploy model to Amazon SageMaker real-time endpoint\n", "\n", "**NOTE:** Model deployment takes approx. 10+mins" ] }, { "cell_type": "code", "execution_count": null, "id": "40b6bb70-fcac-4522-b8d7-fe3a18b5d3af", "metadata": { "tags": [] }, "outputs": [], "source": [ "health_check_timeout = 600\n", "\n", "print(f\"Deploying model to endpoint: {endpoint_name} | {instance_type} ...\")\n", "\n", "predictor = model.deploy(\n", " initial_instance_count=1,\n", " instance_type=instance_type,\n", " endpoint_name=endpoint_name,\n", " container_startup_health_check_timeout=health_check_timeout,\n", ")" ] }, { "attachments": {}, "cell_type": "markdown", "id": "3664a3ae-efef-410f-acb4-77810d2ff1a9", "metadata": {}, "source": [ "### Test Inference\n", "\n", "Once the endpoint is `InService` status, we can evaluate the model using the `predict()` function.\n", "\n", "Refere [here](https://huggingface.co/blog/sagemaker-huggingface-llm#4-run-inference-and-chat-with-our-model) for a full list of parameters supported by TGI." ] }, { "cell_type": "code", "execution_count": null, "id": "03090179-3fb8-482a-b079-4d9a85001188", "metadata": { "tags": [] }, "outputs": [], "source": [ "input_data = {\n", " \"inputs\": \"What is the weather forecast for Seattle today?\",\n", " \"parameters\": {\n", " \"do_sample\": True,\n", " \"max_new_tokens\": 100,\n", " \"temperature\": 0.7,\n", " \"watermark\": True,\n", " },\n", "}\n", "\n", "prompt_length = len(input_data[\"inputs\"])\n", "\n", "response = predictor.predict(input_data)\n", "\n", "output = response[0][\"generated_text\"][prompt_length:]\n", "\n", "print(output)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "9f04468b-61ee-4988-95de-59276d2943fb", "metadata": {}, "source": [ "## Cleanup\n", "\n", "After you've finished testing the endpoint, it's important to delete the `model` and `endpoint` resources to avoid incurring charges." 
] }, { "cell_type": "code", "execution_count": null, "id": "ceb90ca4-0e3d-4a8c-954c-ee2a237b464a", "metadata": { "tags": [] }, "outputs": [], "source": [ "try:\n", " print(f\"Deleting model: {model_name}\")\n", " predictor.delete_model()\n", "except Exception as e:\n", " print(f\"{e}\")\n", "\n", "try:\n", " print(f\"Deleting endpoint: {endpoint_name}\")\n", " predictor.delete_endpoint()\n", "except Exception as e:\n", " print(f\"{e}\")" ] }, { "attachments": {}, "cell_type": "markdown", "id": "e47a7e5b-c1be-413d-b0d2-94c3c5f37cd8", "metadata": {}, "source": [ "### Conclusion\n", "\n", "In this tutorial, we used a HuggingFace TGI container to deploy **BLOOM-7B1** to an Amazon SageMaker real-time endpoint. \n", "\n", "With Hugging Face's Text Generation Inference and SageMaker Hosting, you can easily host large language models like Bloom-7b1, BLOOM-530M, GPT-NeoX, flan-t5-xxl and many more." ] } ], "metadata": { "availableInstances": [ { "_defaultOrder": 0, "_isFastLaunch": true, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 4, "name": "ml.t3.medium", "vcpuNum": 2 }, { "_defaultOrder": 1, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 8, "name": "ml.t3.large", "vcpuNum": 2 }, { "_defaultOrder": 2, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.t3.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 3, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.t3.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 4, "_isFastLaunch": true, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 8, "name": "ml.m5.large", "vcpuNum": 2 }, { "_defaultOrder": 5, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.m5.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 6, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.m5.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 7, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 64, "name": "ml.m5.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 8, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 128, "name": "ml.m5.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 9, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 192, "name": "ml.m5.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 10, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 256, "name": "ml.m5.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 11, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 384, "name": "ml.m5.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 12, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 8, "name": "ml.m5d.large", "vcpuNum": 2 }, { "_defaultOrder": 13, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.m5d.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 14, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 32, 
"name": "ml.m5d.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 15, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 64, "name": "ml.m5d.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 16, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 128, "name": "ml.m5d.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 17, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 192, "name": "ml.m5d.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 18, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 256, "name": "ml.m5d.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 19, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 384, "name": "ml.m5d.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 20, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": true, "memoryGiB": 0, "name": "ml.geospatial.interactive", "supportedImageNames": [ "sagemaker-geospatial-v1-0" ], "vcpuNum": 0 }, { "_defaultOrder": 21, "_isFastLaunch": true, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 4, "name": "ml.c5.large", "vcpuNum": 2 }, { "_defaultOrder": 22, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 8, "name": "ml.c5.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 23, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.c5.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 24, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.c5.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 25, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 72, "name": "ml.c5.9xlarge", "vcpuNum": 36 }, { "_defaultOrder": 26, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 96, "name": "ml.c5.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 27, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 144, "name": "ml.c5.18xlarge", "vcpuNum": 72 }, { "_defaultOrder": 28, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 192, "name": "ml.c5.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 29, "_isFastLaunch": true, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.g4dn.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 30, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.g4dn.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 31, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 64, "name": "ml.g4dn.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 32, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 128, "name": "ml.g4dn.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 33, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 4, "hideHardwareSpecs": false, "memoryGiB": 192, "name": "ml.g4dn.12xlarge", "vcpuNum": 48 }, { 
"_defaultOrder": 34, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 256, "name": "ml.g4dn.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 35, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 61, "name": "ml.p3.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 36, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 4, "hideHardwareSpecs": false, "memoryGiB": 244, "name": "ml.p3.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 37, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 8, "hideHardwareSpecs": false, "memoryGiB": 488, "name": "ml.p3.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 38, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 8, "hideHardwareSpecs": false, "memoryGiB": 768, "name": "ml.p3dn.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 39, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.r5.large", "vcpuNum": 2 }, { "_defaultOrder": 40, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.r5.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 41, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 64, "name": "ml.r5.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 42, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 128, "name": "ml.r5.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 43, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 256, "name": "ml.r5.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 44, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 384, "name": "ml.r5.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 45, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 512, "name": "ml.r5.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 46, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 768, "name": "ml.r5.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 47, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.g5.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 48, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.g5.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 49, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 64, "name": "ml.g5.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 50, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 128, "name": "ml.g5.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 51, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 256, "name": "ml.g5.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 52, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 4, "hideHardwareSpecs": false, "memoryGiB": 192, "name": "ml.g5.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 53, "_isFastLaunch": false, "category": "Accelerated computing", 
"gpuNum": 4, "hideHardwareSpecs": false, "memoryGiB": 384, "name": "ml.g5.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 54, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 8, "hideHardwareSpecs": false, "memoryGiB": 768, "name": "ml.g5.48xlarge", "vcpuNum": 192 }, { "_defaultOrder": 55, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 8, "hideHardwareSpecs": false, "memoryGiB": 1152, "name": "ml.p4d.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 56, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 8, "hideHardwareSpecs": false, "memoryGiB": 1152, "name": "ml.p4de.24xlarge", "vcpuNum": 96 } ], "instance_type": "ml.t3.medium", "kernelspec": { "display_name": "nlp", "language": "python", "name": "nlp" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.16" } }, "nbformat": 4, "nbformat_minor": 5 }