{ "cells": [ { "cell_type": "markdown", "id": "ce5723a5", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "# Prerequisites\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. \n", "\n", "![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-2/sagemaker-triton|ensemble|sentence-transformer-trt|examples|triton_sentence_embeddings.ipynb)\n", "\n", "---" ] }, { "cell_type": "markdown", "id": "ce5723a5", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "\n", "Install the necessary Python modules to use and interact with [NVIDIA Triton Inference Server](https://github.com/triton-inference-server/server/)." ] }, { "cell_type": "code", "execution_count": null, "id": "f5995424", "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "! pip install torch sagemaker transformers tritonclient[all]" ] }, { "cell_type": "markdown", "id": "bd720222-93da-4dc9-9508-3d8b926706d3", "metadata": {}, "source": [ "
\n", "Warning: You should run this notebook on a SageMaker Notebook Instance with access to the same GPU as the instance you will deploy your model to (g4dn is the one configured by default in this example). There are model optimization steps contained in this notebook that are GPU architecture-dependent.\n", " \u2b07\u2b07\u2b07\u2b07\u2b07 change in the next cell if required\n", "
\n" ] }, { "cell_type": "code", "execution_count": null, "id": "15aa9357-3d14-4f8f-b1a7-46f12331c4a6", "metadata": {}, "outputs": [], "source": [ "# change this cell if you are running this notebook in a different instance type\n", "notebook_instance_type = 'ml.g4dn.xlarge'" ] }, { "cell_type": "markdown", "id": "4d9d12fa", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## Part 1 - Setup" ] }, { "cell_type": "code", "execution_count": null, "id": "3aef44c4", "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "import argparse\n", "import boto3\n", "import copy\n", "import datetime\n", "import json\n", "import numpy as np\n", "import os\n", "import pandas as pd\n", "import pprint\n", "import re\n", "import sagemaker\n", "import sys\n", "import time\n", "from time import gmtime, strftime\n", "import tritonclient.http as http_client" ] }, { "cell_type": "code", "execution_count": null, "id": "fe9a1636", "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "session = boto3.Session()\n", "role = sagemaker.get_execution_role()\n", "\n", "sm_client = session.client(\"sagemaker\")\n", "sagemaker_session = sagemaker.Session(boto_session=session)\n", "sm_runtime_client = boto3.client(\"sagemaker-runtime\")\n", "\n", "region = boto3.Session().region_name" ] }, { "cell_type": "markdown", "id": "d31659f5", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "***" ] }, { "cell_type": "markdown", "id": "14a0ba73", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## Part 2 - Generate TensorRT Model\n", "\n", "In the following cells, we are using [HuggingFace Auto Classes](https://huggingface.co/docs/transformers/model_doc/auto) to load a pre-trained model from the [HuggingFace Model Hub](https://huggingface.co/models). 
We then convert the model to ONNX format and compile it with NVIDIA TensorRT (namely its command-line wrapper tool, `trtexec`), using the scripts provided in the official AWS sample for [SageMaker Triton](https://github.com/aws/amazon-sagemaker-examples/tree/main/sagemaker-triton).\n", "\n", "NVIDIA TensorRT is an SDK that facilitates high-performance machine learning inference. You can use it to create `engines` from models that have already been trained, \n", "optimizing for a selected GPU architecture. Triton natively supports the TensorRT runtime, which enables you to easily deploy a TensorRT engine and pair it with the rich features that Triton provides.\n", "\n", "### Parameters:\n", "\n", "* `model_id`: Model identifier from the Hugging Face Model Hub" ] }, { "cell_type": "code", "execution_count": null, "id": "5e3f1883", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false }, "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "model_id = \"sentence-transformers/all-MiniLM-L6-v2\"" ] }, { "cell_type": "markdown", "id": "436d115f", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "### Option 1 - TensorRT Model with Amazon SageMaker Studio" ] }, { "cell_type": "markdown", "id": "629bfade-8f14-44b6-9be7-88a2fdb84ba9", "metadata": {}, "source": [ "> **WARNING**: The next cell will only work if you have first created a custom Studio image, as described in Step 2 of this repository's README. Change `RUNNING_IN_STUDIO` to `True` if this is the case."
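 ] }, { "cell_type": "markdown", "id": "onnx-export-sketch", "metadata": {}, "source": [ "Under either option, the conversion step inside the provided scripts boils down to a `torch.onnx.export` call with dynamic axes. Below is a simplified, hypothetical sketch of that call; the input/output names and opset version are illustrative assumptions, not the scripts' exact values:\n", "\n", "```python\n", "import torch\n", "from transformers import AutoModel, AutoTokenizer\n", "\n", "model_id = \"sentence-transformers/all-MiniLM-L6-v2\"\n", "tokenizer = AutoTokenizer.from_pretrained(model_id)\n", "model = AutoModel.from_pretrained(model_id).eval()\n", "\n", "sample = tokenizer([\"example sentence\"], return_tensors=\"pt\")\n", "\n", "# dynamic axes let batch size and sequence length vary per request;\n", "# TensorRT treats all remaining dimensions as fixed and optimizes for them\n", "torch.onnx.export(\n", "    model,\n", "    (sample[\"input_ids\"], sample[\"attention_mask\"]),\n", "    \"model.onnx\",\n", "    input_names=[\"input_ids\", \"attention_mask\"],\n", "    output_names=[\"last_hidden_state\"],\n", "    dynamic_axes={\n", "        \"input_ids\": {0: \"batch_size\", 1: \"seq_len\"},\n", "        \"attention_mask\": {0: \"batch_size\", 1: \"seq_len\"},\n", "    },\n", "    opset_version=12,\n", ")\n", "```"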
] }, { "cell_type": "code", "execution_count": null, "id": "6df2bf7d-a595-46b1-8d1a-43ab946bc858", "metadata": {}, "outputs": [], "source": [ "RUNNING_IN_STUDIO = False" ] }, { "cell_type": "code", "execution_count": null, "id": "7c0eb376", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false }, "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "if RUNNING_IN_STUDIO:\n", " !/bin/bash ./workspace/generate_model_trt.sh $model_id && rm -rf ensemble_hf/bert-trt/1 && mkdir -p ensemble_hf/bert-trt/1 && cp ./model.plan ensemble_hf/bert-trt/1/model.plan && rm -rf ./model.plan ./conversion_bs16_dy.txt ./model.onnx" ] }, { "cell_type": "markdown", "id": "b34cf8a7", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "### Option 2 - TensorRT Model with SageMaker Notebook Instances \n", "\n", "To make sure we use TensorRT version and dependencies that are compatible with the ones in our Triton container, we compile the model using the corresponding version of NVIDIA's PyTorch container image.\n", "\n", "If you take a look at the python files within the `workspace` folder, you will see that we are first convert the model into ONNX format, specifying dynamic axis indexes so that inputs with a different batch size and sequence length can be passed to the model. TensorRT will treat other input dimensions as fixed, and optimize for those.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "4f1e86f3", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false }, "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "! docker run --gpus=all --rm -it -v `pwd`/workspace:/workspace nvcr.io/nvidia/pytorch:22.10-py3 /bin/bash generate_model_trt.sh $model_id" ] }, { "cell_type": "code", "execution_count": null, "id": "77a1fc87-a3d4-46c5-bd6a-126ef00c92f9", "metadata": {}, "outputs": [], "source": [ "! 
mkdir -p ensemble_hf/bert-trt/1 && mv workspace/model.plan ensemble_hf/bert-trt/1/model.plan && rm -rf workspace/model.onnx workspace/core*" ] }, { "cell_type": "markdown", "id": "fe095659", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "Explore the output logs of the compilation process; at the very end, there is a section headed \"=== Performance summary ===\", which gives us a series of metrics on the obtained engine's performance (latency, throughput, etc.)." ] }, { "cell_type": "markdown", "id": "8455c0ca", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## Part 3 - Run Local Triton Inference Server" ] }, { "cell_type": "markdown", "id": "9e6e6a82-66c9-4db5-9d61-c75a7d2372a0", "metadata": {}, "source": [ "When using the Triton Python backend, you can include your own environment and dependencies. The recommended way to do this is to use [conda pack](https://conda.github.io/conda-pack/) to generate a conda environment archive in `tar.gz` format, include it in your model repository, and point to it in the `config.pbtxt` file of the Python models that should use it, adding the snippet:\n", "\n", "```\n", "parameters: {\n", "  key: \"EXECUTION_ENV_PATH\",\n", "  value: {string_value: \"$$TRITON_MODEL_DIRECTORY/your_env.tar.gz\"}\n", "}\n", "\n", "```\n", "Let's create this file and save it to the pre- and postprocessing model repositories."
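 ] }, { "cell_type": "markdown", "id": "conda-pack-sketch", "metadata": {}, "source": [ "The `conda_dependencies.sh` script run in the next cell builds this archive for us. Conceptually, it does something like the following sketch (the environment name and package list here are illustrative assumptions, not the script's exact contents):\n", "\n", "```bash\n", "# create an environment with the dependencies the Python backend models need\n", "conda create -y -n processing_env python=3.8 numpy\n", "conda run -n processing_env pip install transformers conda-pack\n", "\n", "# archive it; this produces the processing_env.tar.gz referenced by EXECUTION_ENV_PATH\n", "conda run -n processing_env conda-pack -o processing_env.tar.gz\n", "```"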
] }, { "cell_type": "code", "execution_count": null, "id": "859edea3-dd44-4068-9743-0bd085531b01", "metadata": {}, "outputs": [], "source": [ "!bash conda_dependencies.sh\n", "!cp processing_env.tar.gz ensemble_hf/postprocess/ && cp processing_env.tar.gz ensemble_hf/preprocess/\n", "!rm processing_env.tar.gz" ] }, { "cell_type": "markdown", "id": "ed0cd1c9-c8ab-4f75-a368-a719dd165c04", "metadata": {}, "source": [ "> **WARNING**: The cells under part 3 will only work if run within a SageMaker Notebook Instance!\n" ] }, { "cell_type": "markdown", "id": "9c22df98", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "\n", "\n", "The following cells run the Triton Inference Server container in the background and load all the models within the folder `/ensemble_hf`. The docker won't fail if one or more of the model fails because of `--exit-on-error=false`, which is useful for iterative code and model repository building. Remove `-d` to see the logs." ] }, { "cell_type": "code", "execution_count": null, "id": "35cac085", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false }, "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "!sudo docker system prune -f" ] }, { "cell_type": "code", "execution_count": null, "id": "d6965857", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false }, "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "!docker run --gpus=all -d --shm-size=4G --rm -p8000:8000 -p8001:8001 -p8002:8002 -v$(pwd)/ensemble_hf:/model_repository nvcr.io/nvidia/tritonserver:22.10-py3 tritonserver --model-repository=/model_repository --exit-on-error=false --strict-model-config=false\n", "time.sleep(90)" ] }, { "cell_type": "code", "execution_count": null, "id": "b9950b9c", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false }, "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "CONTAINER_ID=!docker container ls -q\n", "FIRST_CONTAINER_ID = CONTAINER_ID[0]" ] }, { 
"cell_type": "markdown", "id": "3f903432-6e87-449c-84ea-4c3ed2fa445b", "metadata": {}, "source": [ "Uncomment the next cell and run it to view the container logs and understand Triton model loading." ] }, { "cell_type": "code", "execution_count": null, "id": "9ca9f7dc", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false }, "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "!docker logs $FIRST_CONTAINER_ID" ] }, { "cell_type": "markdown", "id": "7052821e-b394-4946-9246-859ba9d13632", "metadata": {}, "source": [ "
\n", "Warning: Rerun the cell above to check the container logs until you verify that Triton has loaded all models successfully, otherwise inference request will fail.\n", "
" ] }, { "cell_type": "markdown", "id": "f5946837", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "### Test TensorRT model by invoking the local Triton Server" ] }, { "cell_type": "code", "execution_count": null, "id": "7b5775f1", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false }, "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# Start a local Triton client\n", "try:\n", " triton_client = http_client.InferenceServerClient(url=\"localhost:8000\", verbose=True)\n", "except Exception as e:\n", " print(\"context creation failed: \" + str(e))\n", " sys.exit()" ] }, { "cell_type": "code", "execution_count": null, "id": "36b46556", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false }, "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# Create inputs to send to Triton\n", "model_name = \"ensemble\"\n", "\n", "text_inputs = [\"Sentence 1\", \"Sentence 2\"]\n", "\n", "# Text is passed to Trtion as BYTES\n", "inputs = []\n", "inputs.append(http_client.InferInput(\"INPUT0\", [len(text_inputs), 1], \"BYTES\"))\n", "\n", "# We need to structure batch inputs as such\n", "batch_request = [[text_inputs[i]] for i in range(len(text_inputs))]\n", "input0_real = np.array(batch_request, dtype=np.object_)\n", "\n", "inputs[0].set_data_from_numpy(input0_real)" ] }, { "cell_type": "code", "execution_count": null, "id": "b47da0cb", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false }, "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "outputs = []\n", "\n", "outputs.append(http_client.InferRequestedOutput(\"finaloutput\"))" ] }, { "cell_type": "code", "execution_count": null, "id": "2588e261", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false }, "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "results = triton_client.infer(model_name=model_name, inputs=inputs, outputs=outputs)" ] }, { "cell_type": "code", "execution_count": null, "id": 
"a405f8ee", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false }, "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "outputs_data = results.as_numpy(\"finaloutput\")\n", "\n", "for idx, output in enumerate(outputs_data):\n", " print(text_inputs[idx])\n", " print(output)" ] }, { "cell_type": "code", "execution_count": null, "id": "1d88a95c", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false }, "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# Use this to stop the container that was started in detached mode\n", "!docker kill $FIRST_CONTAINER_ID" ] }, { "cell_type": "markdown", "id": "dc75efff", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "***" ] }, { "cell_type": "markdown", "id": "ef0e365f", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## Part 4 - Deploy Triton to SageMaker Real-Time Endpoint" ] }, { "cell_type": "markdown", "id": "706db9cb", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "### Deploy with SageMaker Triton container" ] }, { "cell_type": "markdown", "id": "bdceed91-9fbc-4ea3-ab9e-0e2599ba7281", "metadata": {}, "source": [ "First we get the URI for the Sagemaker Triton container image that matches the one we used for TensorRT model compilation." 
] }, { "cell_type": "code", "execution_count": null, "id": "e0dd85b6", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false }, "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "account_id_map = {\n", " \"us-east-1\": \"785573368785\",\n", " \"us-east-2\": \"007439368137\",\n", " \"us-west-1\": \"710691900526\",\n", " \"us-west-2\": \"301217895009\",\n", " \"eu-west-1\": \"802834080501\",\n", " \"eu-west-2\": \"205493899709\",\n", " \"eu-west-3\": \"254080097072\",\n", " \"eu-north-1\": \"601324751636\",\n", " \"eu-south-1\": \"966458181534\",\n", " \"eu-central-1\": \"746233611703\",\n", " \"ap-east-1\": \"110948597952\",\n", " \"ap-south-1\": \"763008648453\",\n", " \"ap-northeast-1\": \"941853720454\",\n", " \"ap-northeast-2\": \"151534178276\",\n", " \"ap-southeast-1\": \"324986816169\",\n", " \"ap-southeast-2\": \"355873309152\",\n", " \"cn-northwest-1\": \"474822919863\",\n", " \"cn-north-1\": \"472730292857\",\n", " \"sa-east-1\": \"756306329178\",\n", " \"ca-central-1\": \"464438896020\",\n", " \"me-south-1\": \"836785723513\",\n", " \"af-south-1\": \"774647643957\",\n", "}\n", "\n", "if region not in account_id_map.keys():\n", " raise (\"UNSUPPORTED REGION\")\n", "\n", "base = \"amazonaws.com.cn\" if region.startswith(\"cn-\") else \"amazonaws.com\"\n", "\n", "triton_image_uri = \"{account_id}.dkr.ecr.{region}.{base}/sagemaker-tritonserver:22.10-py3\".format(\n", " account_id=account_id_map[region], region=region, base=base\n", ")\n", "\n", "triton_image_uri" ] }, { "cell_type": "code", "execution_count": null, "id": "081d1204", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false }, "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "print(sagemaker_session.default_bucket())" ] }, { "cell_type": "code", "execution_count": null, "id": "0adf2f45", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false }, "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "ensemble_prefix = 
\"triton-tests/ensemble-singlemodel\"\n", "!tar -C ensemble_hf/ -czf ensemble-sentencetrans.tar.gz .\n", "model_uri_tf = sagemaker_session.upload_data(\n", " path=\"ensemble-sentencetrans.tar.gz\", key_prefix=ensemble_prefix\n", ")\n", "\n", "print(\"S3 model uri: {}\".format(model_uri_tf))" ] }, { "cell_type": "code", "execution_count": null, "id": "da4eb8ef", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false }, "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# Important to define what which one of the models loaded by Triton is the default to be served by SM\n", "# That is, SAGEMAKER_TRITON_DEFAULT_MODEL_NAME\n", "container_model = {\n", " \"Image\": triton_image_uri,\n", " \"ModelDataUrl\": model_uri_tf,\n", " \"Mode\": \"SingleModel\",\n", " \"Environment\": {\"SAGEMAKER_TRITON_DEFAULT_MODEL_NAME\": \"ensemble\"},\n", "}" ] }, { "cell_type": "markdown", "id": "16b0ce98-b921-476d-8921-5e3ffb5cc7d4", "metadata": {}, "source": [ "Register the model with Sagemaker." ] }, { "cell_type": "code", "execution_count": null, "id": "e15ae6ac", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false }, "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "sm_model_name = \"triton-sentence-ensemble\" + time.strftime(\"%Y-%m-%d-%H-%M-%S\", time.gmtime())\n", "\n", "create_model_response = sm_client.create_model(\n", " ModelName=sm_model_name, ExecutionRoleArn=role, PrimaryContainer=container_model\n", ")\n", "\n", "print(\"Model Arn: \" + create_model_response[\"ModelArn\"])" ] }, { "cell_type": "markdown", "id": "695e71ab-c842-4001-8674-0d5c67ffd79f", "metadata": {}, "source": [ "Create an endpoint configuration." 
] }, { "cell_type": "code", "execution_count": null, "id": "c6096778", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false }, "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "endpoint_config_name = \"triton-sentence-ensemble\" + time.strftime(\n", " \"%Y-%m-%d-%H-%M-%S\", time.gmtime()\n", ")\n", "\n", "create_endpoint_config_response = sm_client.create_endpoint_config(\n", " EndpointConfigName=endpoint_config_name,\n", " ProductionVariants=[\n", " {\n", " \"InstanceType\": notebook_instance_type,\n", " \"InitialVariantWeight\": 1,\n", " \"InitialInstanceCount\": 1,\n", " \"ModelName\": sm_model_name,\n", " \"VariantName\": \"AllTraffic\",\n", " }\n", " ],\n", ")\n", "\n", "print(\"Endpoint Config Arn: \" + create_endpoint_config_response[\"EndpointConfigArn\"])" ] }, { "cell_type": "markdown", "id": "c4893f46-dd98-4e60-9490-36026f128446", "metadata": {}, "source": [ "Deploy the endpoint." ] }, { "cell_type": "code", "execution_count": null, "id": "e9932ae0", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false }, "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "endpoint_name = \"triton-sentence-ensemble\" + time.strftime(\"%Y-%m-%d-%H-%M-%S\", time.gmtime())\n", "\n", "create_endpoint_response = sm_client.create_endpoint(\n", " EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name\n", ")\n", "\n", "print(\"Endpoint Arn: \" + create_endpoint_response[\"EndpointArn\"])" ] }, { "cell_type": "markdown", "id": "a3ec6a48-5fc5-4495-9b1d-10f94ef95277", "metadata": {}, "source": [ "Wait for the endpoint to be up and running." 
] }, { "cell_type": "code", "execution_count": null, "id": "f257b3e7", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false }, "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "resp = sm_client.describe_endpoint(EndpointName=endpoint_name)\n", "status = resp[\"EndpointStatus\"]\n", "print(\"Status: \" + status)\n", "\n", "while status == \"Creating\":\n", " time.sleep(60)\n", " resp = sm_client.describe_endpoint(EndpointName=endpoint_name)\n", " status = resp[\"EndpointStatus\"]\n", " print(\"Status: \" + status)\n", "\n", "print(\"Arn: \" + resp[\"EndpointArn\"])\n", "print(\"Status: \" + status)" ] }, { "cell_type": "markdown", "id": "7aee2ad4", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "***" ] }, { "cell_type": "markdown", "id": "59448028", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "### Test the SageMaker Triton Endpoint" ] }, { "cell_type": "code", "execution_count": null, "id": "d55e6ea4", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false }, "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "text_inputs = [\"Sentence 1\", \"Sentence 2\"]\n", "\n", "inputs = []\n", "inputs.append(http_client.InferInput(\"INPUT0\", [len(text_inputs), 1], \"BYTES\"))\n", "\n", "batch_request = [[text_inputs[i]] for i in range(len(text_inputs))]\n", "\n", "input0_real = np.array(batch_request, dtype=np.object_)\n", "\n", "inputs[0].set_data_from_numpy(input0_real, binary_data=True)\n", "\n", "len(input0_real)" ] }, { "cell_type": "code", "execution_count": null, "id": "71acf686", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false }, "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "outputs = []\n", "\n", "outputs.append(http_client.InferRequestedOutput(\"finaloutput\"))" ] }, { "cell_type": "code", "execution_count": null, "id": "8cc9a386", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false }, "pycharm": { "name": "#%%\n" } }, 
"outputs": [], "source": [ "request_body, header_length = http_client.InferenceServerClient.generate_request_body(\n", " inputs, outputs=outputs\n", ")\n", "\n", "print(request_body)" ] }, { "cell_type": "code", "execution_count": null, "id": "2782361a", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false }, "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "response = sm_runtime_client.invoke_endpoint(\n", " EndpointName=endpoint_name,\n", " ContentType=\"application/vnd.sagemaker-triton.binary+json;json-header-size={}\".format(\n", " header_length\n", " ),\n", " Body=request_body,\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "65724cd1", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false }, "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "## json.loads fails\n", "# a = json.loads(response[\"Body\"].read().decode(\"utf8\"))\n", "\n", "header_length_prefix = \"application/vnd.sagemaker-triton.binary+json;json-header-size=\"\n", "header_length_str = response[\"ContentType\"][len(header_length_prefix) :]\n", "\n", "# Read response body\n", "result = http_client.InferenceServerClient.parse_response_body(\n", " response[\"Body\"].read(), header_length=int(header_length_str)\n", ")\n", "\n", "outputs_data = result.as_numpy(\"finaloutput\")\n", "\n", "for idx, output in enumerate(outputs_data):\n", " print(text_inputs[idx])\n", " print(output)" ] }, { "cell_type": "markdown", "id": "fca4e66f-645e-4d92-a1fb-8f0e67f89264", "metadata": {}, "source": [ "## Part 5 - Clean up" ] }, { "cell_type": "code", "execution_count": null, "id": "72bb18c2-1b1d-4798-bc99-3fb0ed0b5a56", "metadata": {}, "outputs": [], "source": [ "sm_client.delete_endpoint(EndpointName=endpoint_name)\n", "sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)\n", "sm_client.delete_model(ModelName=sm_model_name)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Notebook CI 
Test Results\n", "\n", "This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.\n", "\n", "![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-1/sagemaker-triton|ensemble|sentence-transformer-trt|examples|triton_sentence_embeddings.ipynb)\n", "\n", "![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-2/sagemaker-triton|ensemble|sentence-transformer-trt|examples|triton_sentence_embeddings.ipynb)\n", "\n", "![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-1/sagemaker-triton|ensemble|sentence-transformer-trt|examples|triton_sentence_embeddings.ipynb)\n", "\n", "![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ca-central-1/sagemaker-triton|ensemble|sentence-transformer-trt|examples|triton_sentence_embeddings.ipynb)\n", "\n", "![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/sa-east-1/sagemaker-triton|ensemble|sentence-transformer-trt|examples|triton_sentence_embeddings.ipynb)\n", "\n", "![This eu-west-1 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-1/sagemaker-triton|ensemble|sentence-transformer-trt|examples|triton_sentence_embeddings.ipynb)\n", "\n", "![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-2/sagemaker-triton|ensemble|sentence-transformer-trt|examples|triton_sentence_embeddings.ipynb)\n", "\n", "![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-3/sagemaker-triton|ensemble|sentence-transformer-trt|examples|triton_sentence_embeddings.ipynb)\n", "\n", "![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-central-1/sagemaker-triton|ensemble|sentence-transformer-trt|examples|triton_sentence_embeddings.ipynb)\n", "\n", "![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-north-1/sagemaker-triton|ensemble|sentence-transformer-trt|examples|triton_sentence_embeddings.ipynb)\n", "\n", "![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-1/sagemaker-triton|ensemble|sentence-transformer-trt|examples|triton_sentence_embeddings.ipynb)\n", "\n", "![This ap-southeast-2 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-2/sagemaker-triton|ensemble|sentence-transformer-trt|examples|triton_sentence_embeddings.ipynb)\n", "\n", "![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-1/sagemaker-triton|ensemble|sentence-transformer-trt|examples|triton_sentence_embeddings.ipynb)\n", "\n", "![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-2/sagemaker-triton|ensemble|sentence-transformer-trt|examples|triton_sentence_embeddings.ipynb)\n", "\n", "![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-south-1/sagemaker-triton|ensemble|sentence-transformer-trt|examples|triton_sentence_embeddings.ipynb)\n" ] } ], "metadata": { "instance_type": "ml.g4dn.xlarge", "kernelspec": { "display_name": "conda_pytorch_p39", "language": "python", "name": "conda_pytorch_p39" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.13" } }, "nbformat": 4, "nbformat_minor": 5 }