{ "cells": [ { "cell_type": "markdown", "id": "e98a7aa5-7969-4933-86a4-1b2bdb8adfcd", "metadata": {}, "source": [ "# Multiple Ensembles with GPU models using Amazon SageMaker in MME mode" ] }, { "cell_type": "markdown", "id": "d52d9586-ecba-4f7b-a582-41365acad13c", "metadata": {}, "source": [ "In this notebook, we will re-use a couple of examples listed under the parent folder ../ensemble/, and deploy them using MME. In order to create a working example and for clarify reasons, the relevant part of the notebooks are re-listed here." ] }, { "cell_type": "markdown", "id": "54f71e9b-bb40-4c4c-b106-b34f6bceecdb", "metadata": {}, "source": [ "#### A. TF+DALI Ensemble" ] }, { "cell_type": "markdown", "id": "e43dd308-5ad1-4766-acb4-1635d87c4af3", "metadata": {}, "source": [ "In this ensemble, the DALI pipeline pre-processes the input using CPU. The input from this model is fed into the TF Inception model, which runs on GPU" ] }, { "cell_type": "markdown", "id": "dd45dcd9-e77d-4119-898f-a18c649912a4", "metadata": {}, "source": [ "#### B. TRT+Python Ensemble" ] }, { "cell_type": "markdown", "id": "73347a4f-99ce-4329-8485-8d411736d1fc", "metadata": {}, "source": [ "In this ensemble, a TRT model (BERT) and the post-process python models run on GPU, whereas the pre-process model runs on CPU" ] }, { "cell_type": "markdown", "id": "8ac9f7a6-37ce-4507-b5bf-08b5091bc8c4", "metadata": {}, "source": [ "#### In both the examples, one more GPU models are executed on the same host, and each example is an ensemble with multiple models working together to create a pipeline reflective of a single model" ] }, { "cell_type": "markdown", "id": "6140dea3-06c4-41b3-beb1-c17c105c5aa8", "metadata": {}, "source": [ "## 1.Setup" ] }, { "cell_type": "code", "execution_count": null, "id": "f6dc7f85-4967-49b4-a87e-90705db73775", "metadata": {}, "outputs": [], "source": [ "!pip install -qU pip awscli boto3 sagemaker --quiet\n", "!pip install nvidia-pyindex --quiet\n", "!pip install tritonclient[http] --quiet" ] }, { "cell_type": "code", "execution_count": null, "id": "5564289b-fa35-4e0d-a825-31fcac1240a9", "metadata": {}, "outputs": [], "source": [ "# Note: We are installing NVIDIA DALI Cuda in the below step. You need to execute this notebook on a GPU based instance.\n", "!pip install --extra-index-url https://developer.download.nvidia.com/compute/redist --upgrade nvidia-dali-cuda110" ] }, { "cell_type": "code", "execution_count": null, "id": "ff5254ca-9566-4a61-b5f0-8a850f8743e2", "metadata": {}, "outputs": [], "source": [ "import boto3, json, sagemaker, time\n", "from sagemaker import get_execution_role\n", "import nvidia.dali as dali\n", "import nvidia.dali.types as types" ] }, { "cell_type": "code", "execution_count": null, "id": "9481ea09-619c-494e-adea-3ba853977777", "metadata": {}, "outputs": [], "source": [ "# SageMaker varaibles\n", "sm_client = boto3.client(service_name=\"sagemaker\")\n", "runtime_sm_client = boto3.client(\"sagemaker-runtime\")\n", "sagemaker_session = sagemaker.Session(boto_session=boto3.Session())\n", "role = get_execution_role()\n", "\n", "# Other Variables\n", "instance_type = \"ml.g4dn.4xlarge\"\n", "sm_model_name = \"triton-tf-dali-ensemble-\" + time.strftime(\"%Y-%m-%d-%H-%M-%S\", time.gmtime())\n", "endpoint_config_name = \"triton-tf-dali-ensemble-\" + time.strftime(\n", " \"%Y-%m-%d-%H-%M-%S\", time.gmtime()\n", ")\n", "endpoint_name = \"triton-tf-dali-ensemble-\" + time.strftime(\"%Y-%m-%d-%H-%M-%S\", time.gmtime())" ] }, { "cell_type": "markdown", "id": "ba95b0b7-931c-4806-8b51-479e804ae2a8", "metadata": {}, "source": [ "## 2. TF+DALI Ensemble" ] }, { "cell_type": "code", "execution_count": null, "id": "5de42d02-d708-4369-9891-aa168a3f5092", "metadata": {}, "outputs": [], "source": [ "!mkdir -p model_repository/inception_graphdef/1\n", "!mkdir -p model_repository/dali/1\n", "!mkdir -p model_repository/ensemble_dali_inception/1" ] }, { "cell_type": "code", "execution_count": null, "id": "01a4e7a1-a219-45b9-a3fd-801c4472a920", "metadata": {}, "outputs": [], "source": [ "!wget -O /tmp/inception_v3_2016_08_28_frozen.pb.tar.gz \\\n", " https://storage.googleapis.com/download.tensorflow.org/models/inception_v3_2016_08_28_frozen.pb.tar.gz" ] }, { "cell_type": "code", "execution_count": null, "id": "7ec7b9e5-c664-4c84-aa87-954368c8a574", "metadata": {}, "outputs": [], "source": [ "!(cd /tmp && tar xzf inception_v3_2016_08_28_frozen.pb.tar.gz)\n", "!mv /tmp/inception_v3_2016_08_28_frozen.pb model_repository/inception_graphdef/1/model.graphdef" ] }, { "cell_type": "markdown", "id": "65e83d45-d264-45bb-82a5-1c0a945c480f", "metadata": {}, "source": [ "Write model config for ensemble" ] }, { "cell_type": "code", "execution_count": null, "id": "edb6adb9-4443-4e06-a441-2fefe3f67ba0", "metadata": {}, "outputs": [], "source": [ "%%writefile model_repository/ensemble_dali_inception/config.pbtxt\n", "name: \"ensemble_dali_inception\"\n", "platform: \"ensemble\"\n", "max_batch_size: 256\n", "input [\n", " {\n", " name: \"INPUT\"\n", " data_type: TYPE_UINT8\n", " dims: [ -1 ]\n", " }\n", "]\n", "output [\n", " {\n", " name: \"OUTPUT\"\n", " data_type: TYPE_FP32\n", " dims: [ 1001 ]\n", " }\n", "]\n", "ensemble_scheduling {\n", " step [\n", " {\n", " model_name: \"dali\"\n", " model_version: -1\n", " input_map {\n", " key: \"DALI_INPUT_0\"\n", " value: \"INPUT\"\n", " }\n", " output_map {\n", " key: \"DALI_OUTPUT_0\"\n", " value: \"preprocessed_image\"\n", " }\n", " },\n", " {\n", " model_name: \"inception_graphdef\"\n", " model_version: -1\n", " input_map {\n", " key: \"input\"\n", " value: \"preprocessed_image\"\n", " }\n", " output_map {\n", " key: \"InceptionV3/Predictions/Softmax\"\n", " value: \"OUTPUT\"\n", " }\n", " }\n", " ]\n", "}" ] }, { "cell_type": "markdown", "id": "c9567999-46df-4ebe-b495-bbddfca0d049", "metadata": {}, "source": [ "Model config for DALI backend" ] }, { "cell_type": "code", "execution_count": null, "id": "c5261ba8-5522-48d3-9d5d-c9806f75fbf2", "metadata": {}, "outputs": [], "source": [ "%%writefile model_repository/dali/config.pbtxt\n", "name: \"dali\"\n", "backend: \"dali\"\n", "max_batch_size: 256\n", "input [\n", " {\n", " name: \"DALI_INPUT_0\"\n", " data_type: TYPE_UINT8\n", " dims: [ -1 ]\n", " }\n", "]\n", "output [\n", " {\n", " name: \"DALI_OUTPUT_0\"\n", " data_type: TYPE_FP32\n", " dims: [ 299, 299, 3 ]\n", " }\n", "]\n", "parameters: [\n", " {\n", " key: \"num_threads\"\n", " value: { string_value: \"12\" }\n", " }\n", "]" ] }, { "cell_type": "markdown", "id": "6f3ca718-d79f-4343-9ad2-341933fead7e", "metadata": {}, "source": [ "Model config for inception, using GPU" ] }, { "cell_type": "code", "execution_count": null, "id": "251b492e-6ddc-49b5-a6ab-1056c2fc4427", "metadata": {}, "outputs": [], "source": [ "%%writefile model_repository/inception_graphdef/config.pbtxt\n", "name: \"inception_graphdef\"\n", "platform: \"tensorflow_graphdef\"\n", "max_batch_size: 256\n", "input [\n", " {\n", " name: \"input\"\n", " data_type: TYPE_FP32\n", " format: FORMAT_NHWC\n", " dims: [ 299, 299, 3 ]\n", " }\n", "]\n", "output [\n", " {\n", " name: \"InceptionV3/Predictions/Softmax\"\n", " data_type: TYPE_FP32\n", " dims: [ 1001 ]\n", " label_filename: \"inception_labels.txt\"\n", " }\n", "]\n", "instance_group [\n", " {\n", " kind: KIND_GPU\n", " }\n", "]" ] }, { "cell_type": "markdown", "id": "915cc311-f324-4a34-8564-511f052ac13c", "metadata": {}, "source": [ "Download inception_labels.txt" ] }, { "cell_type": "code", "execution_count": null, "id": "b10da8a9-d36a-4a9e-b04d-b3ecce5ea793", "metadata": {}, "outputs": [], "source": [ "region = boto3.Session().region_name\n", "\n", "s3_client = boto3.client(\"s3\")\n", "s3_client.download_file(\n", " f\"sagemaker-example-files-prod-{region}\", \"datasets/labels/inception_labels.txt\", f\"model_repository/inception_graphdef/inception_labels.txt\"\n", ")" ] }, { "cell_type": "markdown", "id": "826136b2-636e-4d9a-aa92-5cc4285660f5", "metadata": {}, "source": [ "Create DALI Pipeline" ] }, { "cell_type": "code", "execution_count": null, "id": "edba82e7-22fd-4521-ac29-67e090d608fb", "metadata": { "scrolled": true, "tags": [] }, "outputs": [], "source": [ "@dali.pipeline_def(batch_size=3, num_threads=1, device_id=0)\n", "def pipe():\n", " \"\"\"Create a pipeline which reads images and masks, decodes the images and returns them.\"\"\"\n", " images = dali.fn.external_source(device=\"cpu\", name=\"DALI_INPUT_0\")\n", " images = dali.fn.decoders.image(images, device=\"mixed\", output_type=types.RGB)\n", " images = dali.fn.resize(\n", " images, resize_x=299, resize_y=299\n", " ) # resize image to the default 299x299 size\n", " images = dali.fn.crop_mirror_normalize(\n", " images,\n", " dtype=types.FLOAT,\n", " output_layout=\"HWC\",\n", " crop=(299, 299), # crop image to the default 299x299 size\n", " mean=[0.485 * 255, 0.456 * 255, 0.406 * 255], # crop a central region of the image\n", " std=[0.229 * 255, 0.224 * 255, 0.225 * 255], # crop a central region of the image\n", " )\n", " return images\n", "\n", "\n", "pipe().serialize(filename=\"model_repository/dali/1/model.dali\")" ] }, { "cell_type": "markdown", "id": "d602aa3a-39a0-4a2a-843c-26a6c348cde9", "metadata": {}, "source": [ "Upload model artifacts to S3" ] }, { "cell_type": "code", "execution_count": null, "id": "01233cc4-36b8-4a04-869f-33d006a203e5", "metadata": { "scrolled": true, "tags": [] }, "outputs": [], "source": [ "!tar -cvzf model_tf_dali.tar.gz -C model_repository .\n", "model_uri = sagemaker_session.upload_data(\n", " path=\"model_tf_dali.tar.gz\", key_prefix=\"triton-mme-gpu-ensemble\"\n", ")\n", "print(\"S3 model uri: {}\".format(model_uri))" ] }, { "cell_type": "markdown", "id": "f5ec61ba-d8d6-4255-9ea6-de9f2064b783", "metadata": {}, "source": [ "## 3. TRT + Python Ensemble" ] }, { "cell_type": "markdown", "id": "cda1f0d4-d662-4e42-9836-196a42d2bcc4", "metadata": {}, "source": [ "For this example, we will download a pretrained model from transformers library. The rest of the models i.e. pre-process and post-process, along with config.pbtxt for all models are included in the folder `ensemble_hf`" ] }, { "cell_type": "code", "execution_count": null, "id": "ee652b24-3bb5-4452-b550-e9bb517472d3", "metadata": {}, "outputs": [], "source": [ "model_id = \"sentence-transformers/all-MiniLM-L6-v2\"" ] }, { "cell_type": "code", "execution_count": null, "id": "c05299c4-c577-4194-828f-9b810352cb44", "metadata": { "scrolled": true, "tags": [] }, "outputs": [], "source": [ "! docker run --gpus=all --rm -it -v `pwd`/workspace:/workspace nvcr.io/nvidia/pytorch:23.03-py3 /bin/bash generate_model_trt.sh $model_id" ] }, { "cell_type": "code", "execution_count": null, "id": "6df47566-ea06-4454-b351-653d81101284", "metadata": {}, "outputs": [], "source": [ "! mkdir -p ensemble_hf/bert-trt/1 && mv workspace/model.plan ensemble_hf/bert-trt/1/model.plan && rm -rf workspace/model.onnx workspace/core*" ] }, { "cell_type": "markdown", "id": "066f93c3-fc75-410e-860c-d37ce12d3b52", "metadata": {}, "source": [ "Create a custom python conda environment with required dependencies installed" ] }, { "cell_type": "code", "execution_count": null, "id": "4ef8757c-f2ef-4a45-8ab8-1c4eb5db9fb4", "metadata": { "scrolled": true, "tags": [] }, "outputs": [], "source": [ "!bash conda_dependencies.sh\n", "!cp processing_env.tar.gz ensemble_hf/postprocess/ && cp processing_env.tar.gz ensemble_hf/preprocess/\n", "!rm processing_env.tar.gz" ] }, { "cell_type": "markdown", "id": "4ea52afb-8b6b-421b-aa32-1c3c8be2a172", "metadata": {}, "source": [ "Upload model artifacts to S3" ] }, { "cell_type": "code", "execution_count": null, "id": "57316610-92c3-4109-83a0-7b497092a219", "metadata": {}, "outputs": [], "source": [ "!tar -C ensemble_hf/ -czf model_trt_python.tar.gz .\n", "model_uri = sagemaker_session.upload_data(\n", " path=\"model_trt_python.tar.gz\", key_prefix=\"triton-mme-gpu-ensemble\"\n", ")\n", "\n", "print(\"S3 model uri: {}\".format(model_uri))" ] }, { "cell_type": "markdown", "id": "969a47b8-e1d5-463a-8f0e-e61e1be8c118", "metadata": {}, "source": [ "## 4. Run ensembles on SageMaker MME GPU instance" ] }, { "cell_type": "code", "execution_count": null, "id": "b322601f-2f78-4bf9-9368-c4f7ddb0fefc", "metadata": {}, "outputs": [], "source": [ "account_id_map = {\n", " \"us-east-1\": \"785573368785\",\n", " \"us-east-2\": \"007439368137\",\n", " \"us-west-1\": \"710691900526\",\n", " \"us-west-2\": \"301217895009\",\n", " \"eu-west-1\": \"802834080501\",\n", " \"eu-west-2\": \"205493899709\",\n", " \"eu-west-3\": \"254080097072\",\n", " \"eu-north-1\": \"601324751636\",\n", " \"eu-south-1\": \"966458181534\",\n", " \"eu-central-1\": \"746233611703\",\n", " \"ap-east-1\": \"110948597952\",\n", " \"ap-south-1\": \"763008648453\",\n", " \"ap-northeast-1\": \"941853720454\",\n", " \"ap-northeast-2\": \"151534178276\",\n", " \"ap-southeast-1\": \"324986816169\",\n", " \"ap-southeast-2\": \"355873309152\",\n", " \"cn-northwest-1\": \"474822919863\",\n", " \"cn-north-1\": \"472730292857\",\n", " \"sa-east-1\": \"756306329178\",\n", " \"ca-central-1\": \"464438896020\",\n", " \"me-south-1\": \"836785723513\",\n", " \"af-south-1\": \"774647643957\",\n", "}" ] }, { "cell_type": "code", "execution_count": null, "id": "4c8b6623-f155-41ab-b749-50320e191bab", "metadata": {}, "outputs": [], "source": [ "region = boto3.Session().region_name\n", "if region not in account_id_map.keys():\n", " raise (\"UNSUPPORTED REGION\")" ] }, { "cell_type": "code", "execution_count": null, "id": "1e43e42b-5fbd-4185-add0-42946205e2cc", "metadata": {}, "outputs": [], "source": [ "base = \"amazonaws.com.cn\" if region.startswith(\"cn-\") else \"amazonaws.com\"\n", "triton_image_uri = \"{account_id}.dkr.ecr.{region}.{base}/sagemaker-tritonserver:23.03-py3\".format(\n", " account_id=account_id_map[region], region=region, base=base\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "7b67e2c2-aaa8-4e02-98c8-a594055bf5de", "metadata": {}, "outputs": [], "source": [ "models_s3_location = model_uri.rstrip(\"model_trt_python.tar.gz\")" ] }, { "cell_type": "code", "execution_count": null, "id": "58ac818e-e2b8-490b-a89c-43b9a1d9bcb3", "metadata": {}, "outputs": [], "source": [ "models_s3_location" ] }, { "cell_type": "code", "execution_count": null, "id": "af4f84c4-107a-46a1-a76c-607bcd84c0a1", "metadata": {}, "outputs": [], "source": [ "container = {\n", " \"Image\": triton_image_uri,\n", " \"ModelDataUrl\": models_s3_location,\n", " \"Mode\": \"MultiModel\",\n", " \"Environment\": {\"SAGEMAKER_TRITON_DEFAULT_MODEL_NAME\": \"ensemble_dali_inception\"},\n", "}\n", "\n", "create_model_response = sm_client.create_model(\n", " ModelName=sm_model_name, ExecutionRoleArn=role, PrimaryContainer=container\n", ")\n", "\n", "model_arn = create_model_response[\"ModelArn\"]\n", "\n", "print(f\"Model Arn: {model_arn}\")" ] }, { "cell_type": "code", "execution_count": null, "id": "b828eacc-e4f3-473d-9c62-add9605e3b85", "metadata": {}, "outputs": [], "source": [ "create_endpoint_config_response = sm_client.create_endpoint_config(\n", " EndpointConfigName=endpoint_config_name,\n", " ProductionVariants=[\n", " {\n", " \"InstanceType\": instance_type,\n", " \"InitialVariantWeight\": 1,\n", " \"InitialInstanceCount\": 1,\n", " \"ModelName\": sm_model_name,\n", " \"VariantName\": \"AllTraffic\",\n", " }\n", " ],\n", ")\n", "\n", "endpoint_config_arn = create_endpoint_config_response[\"EndpointConfigArn\"]\n", "\n", "print(f\"Endpoint Config Arn: {endpoint_config_arn}\")" ] }, { "cell_type": "code", "execution_count": null, "id": "8dedb286-d46b-4f9a-a0dd-46a76f08c43b", "metadata": {}, "outputs": [], "source": [ "create_endpoint_response = sm_client.create_endpoint(\n", " EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name\n", ")\n", "\n", "endpoint_arn = create_endpoint_response[\"EndpointArn\"]\n", "\n", "print(f\"Endpoint Arn: {endpoint_arn}\")" ] }, { "cell_type": "code", "execution_count": null, "id": "96238343-c277-4554-8f3b-42f2781c4e71", "metadata": {}, "outputs": [], "source": [ "rv = sm_client.describe_endpoint(EndpointName=endpoint_name)\n", "status = rv[\"EndpointStatus\"]\n", "print(f\"Endpoint Creation Status: {status}\")\n", "\n", "while status == \"Creating\":\n", " time.sleep(60)\n", " rv = sm_client.describe_endpoint(EndpointName=endpoint_name)\n", " status = rv[\"EndpointStatus\"]\n", " print(f\"Endpoint Creation Status: {status}\")\n", "\n", "endpoint_arn = rv[\"EndpointArn\"]\n", "\n", "print(f\"Endpoint Arn: {endpoint_arn}\")\n", "print(f\"Endpoint Status: {status}\")" ] }, { "cell_type": "markdown", "id": "7a799d30-fe17-47e1-aecf-96810d337bc2", "metadata": {}, "source": [ "## 5. Create inference payload and send requests to respective models" ] }, { "cell_type": "markdown", "id": "22393821-89de-45a7-aabb-cd30c156586c", "metadata": {}, "source": [ "### 5.1. TF + Dali Ensemble" ] }, { "cell_type": "code", "execution_count": null, "id": "6a5f5cba-eb8f-4c3e-9ca8-c9baa855d985", "metadata": {}, "outputs": [], "source": [ "sample_img_fname = \"shiba_inu_dog.jpg\"\n", "\n", "import numpy as np\n", "\n", "region = boto3.Session().region_name\n", "s3_client = boto3.client(\"s3\")\n", "s3_client.download_file(\n", " f\"sagemaker-example-files-prod-{region}\", \"datasets/image/pets/shiba_inu_dog.jpg\", sample_img_fname\n", ")\n", "\n", "\n", "def load_image(img_path):\n", " \"\"\"\n", " Loads image as an encoded array of bytes.\n", " This is a typical approach you want to use in DALI backend\n", " \"\"\"\n", " with open(img_path, \"rb\") as f:\n", " img = f.read()\n", " return np.array(list(img)).astype(np.uint8)\n", "\n", "\n", "rv = load_image(sample_img_fname)\n", "print(f\"Shape of image {rv.shape}\")\n", "\n", "rv2 = np.expand_dims(rv, 0)\n", "print(f\"Shape of expanded image array {rv2.shape}\")\n", "\n", "payload = {\n", " \"inputs\": [\n", " {\n", " \"name\": \"INPUT\",\n", " \"shape\": rv2.shape,\n", " \"datatype\": \"UINT8\",\n", " \"data\": rv2.tolist(),\n", " }\n", " ]\n", "}" ] }, { "cell_type": "code", "execution_count": null, "id": "c2827460-695d-4f8e-ace0-579b953a4d53", "metadata": { "scrolled": true, "tags": [] }, "outputs": [], "source": [ "# Run inference - first inference request will take substantially longer than next ones as the model is loaded on the first requestß\n", "\n", "response = runtime_sm_client.invoke_endpoint(\n", " EndpointName=endpoint_name,\n", " ContentType=\"application/octet-stream\",\n", " Body=json.dumps(payload),\n", " TargetModel=\"model_tf_dali.tar.gz\",\n", ")\n", "\n", "print(json.loads(response[\"Body\"].read().decode(\"utf8\")))" ] }, { "cell_type": "markdown", "id": "9c3c232a-a9b2-41db-8ee0-6792bb19fd1a", "metadata": {}, "source": [ "### 5.2 TRT + Python backend" ] }, { "cell_type": "markdown", "id": "bcd38f82-ba5e-455a-b8db-1392f5d4abe5", "metadata": {}, "source": [ "For this example, we will use binary+json payload. If you choose to not use binary+json payload, you can submit an inference payload similar to the TF + Dali ensemble example previously." ] }, { "cell_type": "code", "execution_count": null, "id": "af260e2f-c6e1-4fea-ba12-7899f8ca2354", "metadata": {}, "outputs": [], "source": [ "import tritonclient.http as http_client" ] }, { "cell_type": "code", "execution_count": null, "id": "0e5f20a7-0b10-480a-a128-00aa5c7e84b6", "metadata": {}, "outputs": [], "source": [ "text_inputs = [\"Sentence 1\", \"Sentence 2\"]\n", "\n", "inputs = []\n", "inputs.append(http_client.InferInput(\"INPUT0\", [len(text_inputs), 1], \"BYTES\"))\n", "\n", "batch_request = [[text_inputs[i]] for i in range(len(text_inputs))]\n", "\n", "input0_real = np.array(batch_request, dtype=np.object_)\n", "\n", "inputs[0].set_data_from_numpy(input0_real, binary_data=True)\n", "\n", "len(input0_real)\n", "\n", "outputs = []\n", "outputs.append(http_client.InferRequestedOutput(\"finaloutput\"))\n", "\n", "request_body, header_length = http_client.InferenceServerClient.generate_request_body(\n", " inputs, outputs=outputs\n", ")\n", "\n", "print(request_body)" ] }, { "cell_type": "code", "execution_count": null, "id": "e1dcc896-dc80-4398-8d22-d10763b1a537", "metadata": { "scrolled": true, "tags": [] }, "outputs": [], "source": [ "# Run inference - first inference request will take substantially longer than next ones as the model is loaded on the first requestß\n", "\n", "response = runtime_sm_client.invoke_endpoint(\n", " EndpointName=endpoint_name,\n", " ContentType=\"application/vnd.sagemaker-triton.binary+json;json-header-size={}\".format(\n", " header_length\n", " ),\n", " Body=request_body,\n", " TargetModel=\"model_trt_python.tar.gz\",\n", ")\n", "\n", "## json.loads fails\n", "# a = json.loads(response[\"Body\"].read().decode(\"utf8\"))\n", "\n", "header_length_prefix = \"application/vnd.sagemaker-triton.binary+json;json-header-size=\"\n", "header_length_str = response[\"ContentType\"][len(header_length_prefix) :]\n", "\n", "# Read response body\n", "result = http_client.InferenceServerClient.parse_response_body(\n", " response[\"Body\"].read(), header_length=int(header_length_str)\n", ")\n", "\n", "outputs_data = result.as_numpy(\"finaloutput\")\n", "\n", "for idx, output in enumerate(outputs_data):\n", " print(text_inputs[idx])\n", " print(output)" ] }, { "cell_type": "markdown", "id": "652fcdce-c420-43c2-9a1e-22d0a7b18f3f", "metadata": {}, "source": [ "## Terminate your resources" ] }, { "cell_type": "code", "execution_count": null, "id": "29d3f8f4-7c32-4d0d-a421-8149e9987dd9", "metadata": {}, "outputs": [], "source": [ "sm_client.delete_endpoint(EndpointName=endpoint_name)\n", "sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)\n", "sm_client.delete_model(ModelName=sm_model_name)" ] }, { "cell_type": "markdown", "id": "fb64ac12-765d-4317-8ef5-00b7d2a0304a", "metadata": {}, "source": [ "## Notes: \n", "\n", "Here are some note-worthy points when developing with MME on GPU with ensembles:\n", "\n", "1. Larger ensembles are not recommended for smaller instance types due to different memory management behavior of framework backends e.g., Tensorflow backend does not release memory upon UNLOAD due to the framework's GPU memory allocation strategy. When coupled with other model backends, non-deterministic behavior may occur.\n", "2. Each ensemble is treated as a single-model in SageMaker i.e. hierarchy of models is not flat.\n", "3. Model names may be re-used across ensembles, however, each ensemble must have its own copy of the model with duplicated name.\n" ] }, { "cell_type": "markdown", "id": "1b3aba65-18df-4287-8a85-1945625ae830", "metadata": {}, "source": [ "## Conclusion:\n", "We saw an example of the capability of SageMaker-Triton container to support multiple ensembles on GPU. We also saw two different methods of sending inference payload to the models. Customizing the hosting job is further possible by setting environment variables. The supported environment variables are listed here - https://github.com/triton-inference-server/server/blob/main/docker/sagemaker/serve. " ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" } }, "nbformat": 4, "nbformat_minor": 5 }