{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "776a6990",
   "metadata": {},
   "source": [
    "## SageMaker Serverless Inference Provisioned Concurrency XGBoost\n",
    "\n",
    "Amazon SageMaker Serverless Inference is a purpose-built inference option that makes it easy for customers to deploy and scale ML models. Serverless Inference is ideal for workloads which have idle periods between traffic spurts and can tolerate cold starts. Serverless endpoints also automatically launch compute resources and scale them in and out depending on traffic, eliminating the need to choose instance types or manage scaling policies. \n",
    "\n",
    "Serverless Inference however can be prone to cold-starts, as if your serverless endpoint does not receive traffic for a while and then your endpoint suddenly receives new requests, it can take some time for your endpoint to spin up the compute resources to process the requests. In this notebook we specifically explore <b>Provisioned Concurrency</b>, a new feature in Serverless Inference which can help mitigate this issue. With Provisioned Concurrency you can keep the compute enviroment initialized and reduce cold-start as your serverless endpoint is kept ready.\n",
    "\n",
    "For this notebook we'll be working with the SageMaker XGBoost Algorithm to train a model and then deploy a serverless endpoint. We will be using the public S3 Abalone regression dataset for this example."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5a0f75a8",
   "metadata": {},
   "source": [
    "### Notebook Setting\n",
    "\n",
    "- <b>SageMaker Classic Notebook Instance</b>: ml.m5.xlarge Notebook Instance & conda_python3 Kernel\n",
    "- <b>SageMaker Studio</b>: Python 3 (Data Science)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2e698428",
   "metadata": {},
   "source": [
    "### Setup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ca8d44b9",
   "metadata": {},
   "outputs": [],
   "source": [
    "! pip install sagemaker botocore boto3 awscli --upgrade"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "10d894b8",
   "metadata": {},
   "outputs": [],
   "source": [
    "! pwd"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8501d8d7",
   "metadata": {},
   "source": [
    "### SageMaker Setup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "136a7145",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Setup clients\n",
    "import boto3\n",
    "\n",
    "client = boto3.client(service_name=\"sagemaker\")\n",
    "runtime = boto3.client(service_name=\"sagemaker-runtime\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "03a4dbb9",
   "metadata": {},
   "outputs": [],
   "source": [
    "import sagemaker\n",
    "from sagemaker.estimator import Estimator\n",
    "\n",
    "boto_session = boto3.session.Session()\n",
    "region = boto_session.region_name\n",
    "print(region)\n",
    "\n",
    "sagemaker_session = sagemaker.Session()\n",
    "base_job_prefix = \"xgboost-example\"\n",
    "role = sagemaker.get_execution_role()\n",
    "print(role)\n",
    "\n",
    "default_bucket = sagemaker_session.default_bucket()\n",
    "s3_prefix = base_job_prefix\n",
    "\n",
    "training_instance_type = \"ml.m5.xlarge\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2035b16a",
   "metadata": {},
   "source": [
    "### Model Training\n",
    "\n",
    "We will run a training job on the Abalone Regression Dataset with the Built-In XGBoost Algorithm. We will then utilized the trained model artifacts to deploy a Serverless Endpoint. If you have pre-trained model artifacts you can also deploy them directly to Serverless Inference and skip this portion."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f7ab33db",
   "metadata": {},
   "outputs": [],
   "source": [
    "# retrieve data\n",
    "!aws s3 cp s3://sagemaker-sample-files/datasets/tabular/uci_abalone/train_csv/abalone_dataset1_train.csv ."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "35d6f93f",
   "metadata": {},
   "outputs": [],
   "source": [
    "# upload data to S3\n",
    "!aws s3 cp abalone_dataset1_train.csv s3://{default_bucket}/xgboost-regression/train.csv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "68d04df4",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sagemaker.inputs import TrainingInput\n",
    "\n",
    "training_path = f\"s3://{default_bucket}/xgboost-regression/train.csv\"\n",
    "train_input = TrainingInput(training_path, content_type=\"text/csv\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e0e01db2",
   "metadata": {},
   "outputs": [],
   "source": [
    "model_path = f\"s3://{default_bucket}/{s3_prefix}/xgb_model\"\n",
    "\n",
    "# retrieve xgboost image\n",
    "image_uri = sagemaker.image_uris.retrieve(\n",
    "    framework=\"xgboost\",\n",
    "    region=region,\n",
    "    version=\"1.0-1\",\n",
    "    py_version=\"py3\",\n",
    "    instance_type=training_instance_type,\n",
    ")\n",
    "\n",
    "# Configure Training Estimator\n",
    "xgb_train = Estimator(\n",
    "    image_uri=image_uri,\n",
    "    instance_type=training_instance_type,\n",
    "    instance_count=1,\n",
    "    output_path=model_path,\n",
    "    sagemaker_session=sagemaker_session,\n",
    "    role=role,\n",
    ")\n",
    "\n",
    "# Set Hyperparameters\n",
    "xgb_train.set_hyperparameters(\n",
    "    objective=\"reg:linear\",\n",
    "    num_round=50,\n",
    "    max_depth=5,\n",
    "    eta=0.2,\n",
    "    gamma=4,\n",
    "    min_child_weight=6,\n",
    "    subsample=0.7,\n",
    "    silent=0,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "51a8fac4",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Fit model\n",
    "xgb_train.fit({\"train\": train_input})"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4dad10eb",
   "metadata": {},
   "source": [
    "### Retrieve Model Artifacts\n",
    "\n",
    "If you have a pre-trained model, provide these in a model.tar.gz as SageMaker expects a tarball format for the model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "62d1ed2a",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Retrieve model data from training job\n",
    "model_artifacts = xgb_train.model_data\n",
    "model_artifacts"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2dcfbb8b",
   "metadata": {},
   "source": [
    "### SageMaker Model Creation\n",
    "\n",
    "Here we can specify the container image you are using as well as your model artifacts."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6e85754c",
   "metadata": {},
   "outputs": [],
   "source": [
    "from time import gmtime, strftime\n",
    "\n",
    "model_name = \"xgboost-serverless-pc\" + strftime(\"%Y-%m-%d-%H-%M-%S\", gmtime())\n",
    "print(\"Model name: \" + model_name)\n",
    "\n",
    "# dummy environment variables\n",
    "byo_container_env_vars = {\"SAGEMAKER_CONTAINER_LOG_LEVEL\": \"20\", \"SOME_ENV_VAR\": \"myEnvVar\"}\n",
    "\n",
    "create_model_response = client.create_model(\n",
    "    ModelName=model_name,\n",
    "    Containers=[\n",
    "        {\n",
    "            \"Image\": image_uri,\n",
    "            \"Mode\": \"SingleModel\",\n",
    "            \"ModelDataUrl\": model_artifacts,\n",
    "            \"Environment\": byo_container_env_vars,\n",
    "        }\n",
    "    ],\n",
    "    ExecutionRoleArn=role,\n",
    ")\n",
    "\n",
    "print(\"Model Arn: \" + create_model_response[\"ModelArn\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "077e16c3",
   "metadata": {},
   "source": [
    "### SageMaker Endpoint Configuration\n",
    "\n",
    "Here you can specify your <b>ProvisionedConcurrency</b> parameter, ensure that this is less than or equal to the Maximum Concurrency that you specify for the endpoint. In this instance since we are comparing performance between a vanilla serverless endpoint and a provisioned serverless endpoint we will create two endpoint configs: One with Provisioned Concurrency enabled and the other without."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "63db6973",
   "metadata": {},
   "outputs": [],
   "source": [
    "xgboost_epc_name_pc = \"xgboost-serverless-epc-pc\" + strftime(\"%Y-%m-%d-%H-%M-%S\", gmtime())\n",
    "xgboost_epc_name_on_demand = \"xgboost-serverless-epc-on-demand\" + strftime(\"%Y-%m-%d-%H-%M-%S\", gmtime())\n",
    "\n",
    "endpoint_config_response_pc = client.create_endpoint_config(\n",
    "    EndpointConfigName=xgboost_epc_name_pc,\n",
    "    ProductionVariants=[\n",
    "        {\n",
    "            \"VariantName\": \"byoVariant\",\n",
    "            \"ModelName\": model_name,\n",
    "            \"ServerlessConfig\": {\n",
    "                \"MemorySizeInMB\": 4096,\n",
    "                \"MaxConcurrency\": 1,\n",
    "                # Providing Provisioned Concurrency in EPC\n",
    "                \"ProvisionedConcurrency\": 1\n",
    "            },\n",
    "        },\n",
    "    ],\n",
    ")\n",
    "\n",
    "endpoint_config_response_on_demand = client.create_endpoint_config(\n",
    "    EndpointConfigName=xgboost_epc_name_on_demand,\n",
    "    ProductionVariants=[\n",
    "        {\n",
    "            \"VariantName\": \"byoVariant\",\n",
    "            \"ModelName\": model_name,\n",
    "            \"ServerlessConfig\": {\n",
    "                \"MemorySizeInMB\": 4096,\n",
    "                \"MaxConcurrency\": 1,\n",
    "            },\n",
    "        },\n",
    "    ],\n",
    ")\n",
    "\n",
    "print(\"Endpoint Configuration Arn Provisioned Concurrency: \" + endpoint_config_response_pc[\"EndpointConfigArn\"])\n",
    "print(\"Endpoint Configuration Arn On Demand Serverless: \" + endpoint_config_response_on_demand[\"EndpointConfigArn\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b4f61737",
   "metadata": {},
   "source": [
    "### Endpoint Creation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2d5b5fad",
   "metadata": {},
   "outputs": [],
   "source": [
    "endpoint_name_pc = \"xgboost-serverless-ep-pc\" + strftime(\"%Y-%m-%d-%H-%M-%S\", gmtime())\n",
    "\n",
    "create_endpoint_response = client.create_endpoint(\n",
    "    EndpointName=endpoint_name_pc,\n",
    "    EndpointConfigName=xgboost_epc_name_pc,\n",
    ")\n",
    "\n",
    "print(\"Endpoint Arn Provisioned Concurrency: \" + create_endpoint_response[\"EndpointArn\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "550e25f9",
   "metadata": {},
   "outputs": [],
   "source": [
    "# wait for endpoint to reach a terminal state (InService) using describe endpoint\n",
    "import time\n",
    "\n",
    "describe_endpoint_response = client.describe_endpoint(EndpointName=endpoint_name_pc)\n",
    "\n",
    "while describe_endpoint_response[\"EndpointStatus\"] == \"Creating\":\n",
    "    describe_endpoint_response = client.describe_endpoint(EndpointName=endpoint_name_pc)\n",
    "    print(describe_endpoint_response[\"EndpointStatus\"])\n",
    "    time.sleep(15)\n",
    "\n",
    "describe_endpoint_response"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bb155e3e",
   "metadata": {},
   "outputs": [],
   "source": [
    "endpoint_name_on_demand = \"xgboost-serverless-ep-on-demand\" + strftime(\"%Y-%m-%d-%H-%M-%S\", gmtime())\n",
    "\n",
    "create_endpoint_response = client.create_endpoint(\n",
    "    EndpointName=endpoint_name_on_demand,\n",
    "    EndpointConfigName=xgboost_epc_name_on_demand,\n",
    ")\n",
    "\n",
    "print(\"Endpoint Arn Provisioned Concurrency: \" + create_endpoint_response[\"EndpointArn\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "84b153f1",
   "metadata": {},
   "outputs": [],
   "source": [
    "# wait for endpoint to reach a terminal state (InService) using describe endpoint\n",
    "import time\n",
    "\n",
    "describe_endpoint_response = client.describe_endpoint(EndpointName=endpoint_name_on_demand)\n",
    "\n",
    "while describe_endpoint_response[\"EndpointStatus\"] == \"Creating\":\n",
    "    describe_endpoint_response = client.describe_endpoint(EndpointName=endpoint_name_on_demand)\n",
    "    print(describe_endpoint_response[\"EndpointStatus\"])\n",
    "    time.sleep(15)\n",
    "\n",
    "describe_endpoint_response"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e4f65bb4",
   "metadata": {},
   "source": [
    "### Sample Inference"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8b7885fd",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "\n",
    "#On Demand Serverless Endpoint Test\n",
    "response = runtime.invoke_endpoint(\n",
    "    EndpointName=endpoint_name_on_demand,\n",
    "    Body=b\".345,0.224414,.131102,0.042329,.279923,-0.110329,-0.099358,0.0\",\n",
    "    ContentType=\"text/csv\",\n",
    ")\n",
    "\n",
    "print(response[\"Body\"].read())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "73f4b21f",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "\n",
    "#Provisioned Endpoint Test\n",
    "response = runtime.invoke_endpoint(\n",
    "    EndpointName=endpoint_name_pc,\n",
    "    Body=b\".345,0.224414,.131102,0.042329,.279923,-0.110329,-0.099358,0.0\",\n",
    "    ContentType=\"text/csv\",\n",
    ")\n",
    "\n",
    "print(response[\"Body\"].read())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2bce42d3",
   "metadata": {},
   "source": [
    "### Evaluate Performance of PC vs On Demand Serverless Inference\n",
    "\n",
    "Note that the following cell will take an hour to run. In this block we will wait 10 minutes between each request to compare cold-start times between a On Demand endpoint and the endpoint we have with Provisioned Concurrency enabled."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "381d728e",
   "metadata": {},
   "outputs": [],
   "source": [
    "import time\n",
    "import numpy as np\n",
    "print(\"Testing cold start for serverless inference with PC vs no PC\")\n",
    "\n",
    "pc_times = []\n",
    "non_pc_times = []\n",
    "\n",
    "# ~50 minutes\n",
    "for i in range(5):\n",
    "    time.sleep(600)\n",
    "    start_pc = time.time()\n",
    "    pc_response = runtime.invoke_endpoint(\n",
    "        EndpointName=endpoint_name_pc,\n",
    "        Body=b\".345,0.224414,.131102,0.042329,.279923,-0.110329,-0.099358,0.0\",\n",
    "        ContentType=\"text/csv\",\n",
    "    )\n",
    "    end_pc = time.time() - start_pc\n",
    "    pc_times.append(end_pc)\n",
    "\n",
    "    start_no_pc = time.time()\n",
    "    response = runtime.invoke_endpoint(\n",
    "        EndpointName=endpoint_name_on_demand,\n",
    "        Body=b\".345,0.224414,.131102,0.042329,.279923,-0.110329,-0.099358,0.0\",\n",
    "        ContentType=\"text/csv\",\n",
    "    )\n",
    "    end_no_pc = time.time() - start_no_pc\n",
    "    non_pc_times.append(end_no_pc)\n",
    "\n",
    "pc_cold_start = np.mean(pc_times)\n",
    "non_pc_cold_start = np.mean(non_pc_times)\n",
    "\n",
    "print(\"Provisioned Concurrency Serverless Inference Average Cold Start: {}\".format(pc_cold_start))\n",
    "print(\"On Demand Serverless Inference Average Cold Start: {}\".format(non_pc_cold_start))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "caa373b1",
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "\n",
    "data = {'PC Cold-Start':pc_cold_start, 'On Demand Cold-Start':non_pc_cold_start}\n",
    "cold_starts = list(data.keys())\n",
    "values = list(data.values())\n",
    "  \n",
    "fig = plt.figure(figsize = (10, 5))\n",
    " \n",
    "# creating the bar plot\n",
    "plt.bar(cold_starts, values, color ='maroon',\n",
    "        width = 0.4)\n",
    " \n",
    "plt.xlabel(\"Serverless Inference Options\")\n",
    "plt.ylabel(\"Cold-Start Average Times\")\n",
    "plt.title(\"Provisioned Concurrency vs On Demand Serverless Inference Cold-Start Times\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ca9b9beb",
   "metadata": {},
   "source": [
    "### Cleanup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6a27fcfb",
   "metadata": {},
   "outputs": [],
   "source": [
    "client.delete_model(ModelName=model_name)\n",
    "client.delete_endpoint_config(EndpointConfigName=xgboost_epc_name_pc)\n",
    "client.delete_endpoint_config(EndpointConfigName=xgboost_epc_name_on_demand)\n",
    "client.delete_endpoint(EndpointName=endpoint_name_pc)\n",
    "client.delete_endpoint(EndpointName=endpoint_name_on_demand)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "conda_python3",
   "language": "python",
   "name": "conda_python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}