{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "e2527f22",
   "metadata": {},
   "source": [
    "# Deploy SageMaker Serverless Endpoint\n",
    "\n",
    "## Sentiment Binary Classification (fine-tuning with KoELECTRA-Small-v3 model and Naver Sentiment Movie Corpus dataset)\n",
    "\n",
    "- KoELECTRA: https://github.com/monologg/KoELECTRA\n",
    "- Naver Sentiment Movie Corpus Dataset: https://github.com/e9t/nsmc\n",
    "\n",
    "---\n",
    "\n",
    "## Overview\n",
    "\n",
    "Amazon SageMaker Serverless Inference는 re:Invent 2021에 런칭된 신규 추론 옵션으로 호스팅 인프라 관리에 대한 부담 없이 머신 러닝을 모델을 쉽게 배포하고 확장할 수 있도록 제작된 신규 추론 옵션입니다. SageMaker Serverless Inference는 컴퓨팅 리소스를 자동으로 시작하고 트래픽에 따라 자동으로 스케일 인/아웃을 수행하므로 인스턴스 유형을 선택하거나 스케일링 정책을 관리할 필요가 없습니다. 따라서, 트래픽 급증 사이에 유휴 기간이 있고 콜드 스타트를 허용할 수 있는 워크로드에 이상적입니다.\n",
    "\n",
    "## Difference from Lambda Serverless Inference\n",
    "\n",
    "\n",
    "### Lambda Serverless Inference\n",
    "\n",
    "- Lambda 컨테이너용 도커 이미지 빌드/디버그 후 Amazon ECR(Amazon Elastic Container Registry)에 푸시\n",
    "- Option 1: Lambda 함수를 생성하여 직접 모델 배포 수행\n",
    "- Option 2: SageMaker API로 SageMaker에서 모델 배포 수행 (`LambdaModel` 및 `LambdaPredictor` 리소스를 순차적으로 생성) 단, Option 2를 사용하는 경우 적절한 권한을 직접 설정해 줘야 합니다.\n",
    "    - SageMaker과 연결된 role 대해 ECR 억세스를 허용하는 policy 생성 및 연결\n",
    "    - SageMaker 노트북에서 lambda를 실행할 수 있는 role 생성\n",
    "    - Lambda 함수가 ECR private 리포지토리에 연결하는 억세스를 허용하는 policy 생성 및 연결 \n",
    "\n",
    "\n",
    "### SageMaker Serverless Inference\n",
    "\n",
    "기존 Endpoint 배포 코드에서 Endpoint 설정만 변경해 주시면 되며, 별도의 도커 이미지 빌드가 필요 없기에 쉽고 빠르게 서버리스 추론을 수행할 수 있습니다.\n",
    "\n",
    "**주의**\n",
    "- 현재 서울 리전을 지원하지 않기 때문에 아래 리전 중 하나를 선택해서 수행하셔야 합니다.\n",
    "    - 현재 지원하는 리전: US East (Northern Virginia), US East (Ohio), US West (Oregon), EU (Ireland), Asia Pacific (Tokyo) and Asia Pacific (Sydney)\n",
    "- boto3, botocore, sagemaker, awscli는 2021년 12월 버전 이후를 사용하셔야 합니다."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "16b6b42a",
   "metadata": {},
   "source": [
    "<br>\n",
    "\n",
    "## 1. Upload Model Artifacts\n",
    "---\n",
    "\n",
    "모델을 아카이빙하여 S3로 업로드합니다."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7ff9b59c",
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install -qU sagemaker botocore boto3 awscli\n",
    "!pip install --ignore-installed PyYAML\n",
    "!pip install transformers==4.12.5"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "00e80119",
   "metadata": {},
   "outputs": [],
   "source": [
    "import torch\n",
    "import torchvision\n",
    "import torchvision.models as models\n",
    "import sagemaker\n",
    "from sagemaker import get_execution_role\n",
    "from sagemaker.utils import name_from_base\n",
    "from sagemaker.pytorch import PyTorchModel\n",
    "import boto3\n",
    "import datetime\n",
    "import time\n",
    "from time import strftime,gmtime\n",
    "import json\n",
    "import os\n",
    "import io\n",
    "import torchvision.transforms as transforms\n",
    "from src.utils import print_outputs, upload_model_artifact_to_s3, NLPPredictor \n",
    "\n",
    "role = get_execution_role()\n",
    "boto_session = boto3.session.Session()\n",
    "sm_session = sagemaker.session.Session()\n",
    "sm_client = boto_session.client(\"sagemaker\")\n",
    "sm_runtime = boto_session.client(\"sagemaker-runtime\")\n",
    "region = boto_session.region_name\n",
    "bucket = sm_session.default_bucket()\n",
    "prefix = 'serverless-inference-kornlp-nsmc'\n",
    "\n",
    "print(f'region = {region}')\n",
    "print(f'role = {role}')\n",
    "print(f'bucket = {bucket}')\n",
    "print(f'prefix = {prefix}')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1b5ae146",
   "metadata": {},
   "outputs": [],
   "source": [
    "model_variant = 'modelA'\n",
    "nlp_task = 'nsmc'\n",
    "model_path = f'model-{nlp_task}'\n",
    "model_s3_uri = upload_model_artifact_to_s3(model_variant, model_path, bucket, prefix)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "64c8e40e",
   "metadata": {},
   "source": [
    "<br>\n",
    "\n",
    "## 2. Create SageMaker Serverless Endpoint\n",
    "---\n",
    "\n",
    "SageMaker Serverless Endpoint는 기존 SageMaker 리얼타임 엔드포인트 배포와 99% 유사합니다. 1%의 차이가 무엇일까요? Endpoint 구성 설정 시, ServerlessConfig에서 메모리 크기(`MemorySizeInMB`), 최대 동시 접속(`MaxConcurrency`)에 대한 파라메터만 추가하시면 됩니다.\n",
    "\n",
    "```python\n",
    "sm_client.create_endpoint_config(\n",
    "  ...\n",
    "  \"ServerlessConfig\": {\n",
    "    \"MemorySizeInMB\": 2048,\n",
    "    \"MaxConcurrency\": 20\n",
    "  }\n",
    ")\n",
    "```\n",
    "\n",
    "자세한 내용은 아래 링크를 참조해 주세요.\n",
    "- Amazon SageMaker Developer Guide - Serverless Inference: https://docs.aws.amazon.com/sagemaker/latest/dg/serverless-endpoints.html"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f427e61c",
   "metadata": {},
   "source": [
    "### Create Inference containter definition for Model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f1f8c748",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sagemaker.image_uris import retrieve\n",
    "\n",
    "deploy_instance_type = 'ml.m5.xlarge'\n",
    "pt_ecr_image_uri = retrieve(\n",
    "    framework='pytorch',\n",
    "    region=region,\n",
    "    version='1.7.1',\n",
    "    py_version='py3',\n",
    "    instance_type = deploy_instance_type,\n",
    "    accelerator_type=None,\n",
    "    image_scope='inference'\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fe77df3e",
   "metadata": {},
   "source": [
    "### Create a SageMaker Model\n",
    "\n",
    "`create_model` API를 호출하여 위 코드 셀에서 생성한 컨테이너의 정의를 포함하는 모델을 생성합니다."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4f505b02",
   "metadata": {},
   "outputs": [],
   "source": [
    "model_name = f\"KorNLPServerless-{nlp_task}-{strftime('%Y-%m-%d-%H-%M-%S', gmtime())}\"\n",
    "\n",
    "create_model_response = sm_client.create_model(\n",
    "    ModelName=model_name,\n",
    "    Containers=[\n",
    "        {\n",
    "            \"Image\": pt_ecr_image_uri,\n",
    "            \"Mode\": \"SingleModel\",\n",
    "            \"ModelDataUrl\": model_s3_uri,\n",
    "            \"Environment\": {\n",
    "                \"SAGEMAKER_CONTAINER_LOG_LEVEL\": \"20\",\n",
    "                \"SAGEMAKER_PROGRAM\": \"inference_nsmc.py\",\n",
    "                \"SAGEMAKER_SUBMIT_DIRECTORY\": model_s3_uri,\n",
    "            },                \n",
    "        }        \n",
    "        \n",
    "    ],\n",
    "    ExecutionRoleArn=role,\n",
    ")\n",
    "print(f\"Created Model: {create_model_response['ModelArn']}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6232b6d3",
   "metadata": {},
   "source": [
    "### Create Endpoint Configuration\n",
    "\n",
    "`ServerlessConfig`으로 엔드포인트에 대한 서버리스 설정을 조정할 수 있습니다. 최대 동시 호출(`MaxConcurrency`; max concurrent invocations)은 1에서 50 사이이며, `MemorySize`는 1024MB, 2048MB, 3072MB, 4096MB, 5120MB 또는 6144MB를 선택할 수 있습니다."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "eab981df",
   "metadata": {},
   "outputs": [],
   "source": [
    "endpoint_config_name = f\"KorNLPServerlessEndpointConfig-{nlp_task}-{strftime('%Y-%m-%d-%H-%M-%S', gmtime())}\"\n",
    "endpoint_config_response = sm_client.create_endpoint_config(\n",
    "    EndpointConfigName=endpoint_config_name,\n",
    "    ProductionVariants=[\n",
    "        {\n",
    "            \"VariantName\": \"AllTraffic\",\n",
    "            \"ModelName\": model_name,\n",
    "            \"ServerlessConfig\": {\n",
    "                \"MemorySizeInMB\": 4096,\n",
    "                \"MaxConcurrency\": 20,\n",
    "            },            \n",
    "        },\n",
    "    ],\n",
    ")\n",
    "print(f\"Created EndpointConfig: {endpoint_config_response['EndpointConfigArn']}\")"
   ]
  },
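  {
   "cell_type": "markdown",
   "id": "b7e4a19c",
   "metadata": {},
   "source": [
    "The `ServerlessConfig` limits quoted in this notebook can be checked locally before calling `create_endpoint_config`. The following is a minimal sketch, not part of the original notebook: the helper name `build_serverless_variant` and its default values are assumptions made for illustration, and the allowed values are simply the ones stated in this notebook, which may differ by Region or SDK version.\n",
    "\n",
    "```python\n",
    "# Hypothetical helper (not a SageMaker API): builds one serverless production\n",
    "# variant and checks the limits quoted in this notebook.\n",
    "ALLOWED_MEMORY_MB = (1024, 2048, 3072, 4096, 5120, 6144)\n",
    "MAX_CONCURRENCY_RANGE = (1, 50)\n",
    "\n",
    "\n",
    "def build_serverless_variant(model_name, memory_mb=4096, max_concurrency=20,\n",
    "                             variant_name='AllTraffic'):\n",
    "    if memory_mb not in ALLOWED_MEMORY_MB:\n",
    "        raise ValueError(f'memory_mb must be one of {ALLOWED_MEMORY_MB}, got {memory_mb}')\n",
    "    low, high = MAX_CONCURRENCY_RANGE\n",
    "    if not low <= max_concurrency <= high:\n",
    "        raise ValueError(f'max_concurrency must be in [{low}, {high}], got {max_concurrency}')\n",
    "    # Same dictionary shape as the ProductionVariants entry used in this notebook\n",
    "    return {\n",
    "        'VariantName': variant_name,\n",
    "        'ModelName': model_name,\n",
    "        'ServerlessConfig': {\n",
    "            'MemorySizeInMB': memory_mb,\n",
    "            'MaxConcurrency': max_concurrency,\n",
    "        },\n",
    "    }\n",
    "\n",
    "\n",
    "variant = build_serverless_variant('my-model', memory_mb=2048, max_concurrency=5)\n",
    "print(variant['ServerlessConfig'])\n",
    "```\n",
    "\n",
    "The returned dictionary could then be passed as `ProductionVariants=[variant]`, so an invalid setting fails fast in the notebook instead of in the API call."
   ]
  },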
  {
   "cell_type": "markdown",
   "id": "34271860",
   "metadata": {},
   "source": [
    "### Create a SageMaker Multi-container endpoint\n",
    "\n",
    "create_endpoint API로 멀티 컨테이너 엔드포인트를 생성합니다. 기존의 엔드포인트 생성 방법과 동일합니다."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d93e4b6f",
   "metadata": {},
   "outputs": [],
   "source": [
    "endpoint_name = f\"KorNLPServerlessEndpoint-{nlp_task}-{strftime('%Y-%m-%d-%H-%M-%S', gmtime())}\"\n",
    "endpoint_response = sm_client.create_endpoint(\n",
    "    EndpointName=endpoint_name, \n",
    "    EndpointConfigName=endpoint_config_name\n",
    ")\n",
    "print(f\"Creating Endpoint: {endpoint_response['EndpointArn']}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1d435bd7",
   "metadata": {},
   "source": [
    "`describe_endpoint` API를 사용하여 엔드포인트 생성 상태를 확인할 수 있습니다. SageMaker 서버리스 엔드포인트는 일반적인 엔드포인트 생성보다 빠르게 생성됩니다. (약 2-3분)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f9288c7e",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "waiter = boto3.client('sagemaker').get_waiter('endpoint_in_service')\n",
    "print(\"Waiting for endpoint to create...\")\n",
    "waiter.wait(EndpointName=endpoint_name)\n",
    "resp = sm_client.describe_endpoint(EndpointName=endpoint_name)\n",
    "print(f\"Endpoint Status: {resp['EndpointStatus']}\")"
   ]
  },
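  {
   "cell_type": "markdown",
   "id": "c2d85f3e",
   "metadata": {},
   "source": [
    "As an alternative to the built-in waiter, the status can be polled through `describe_endpoint` directly, which makes it easier to log progress or apply a custom timeout. The following is a rough sketch under stated assumptions: `wait_for_endpoint`, its parameters, and `fake_describe` are hypothetical names introduced here, and the stand-in function replaces `sm_client.describe_endpoint` so that the example runs without AWS access.\n",
    "\n",
    "```python\n",
    "import time\n",
    "\n",
    "\n",
    "def wait_for_endpoint(describe, endpoint_name, poll_seconds=15, timeout_seconds=900):\n",
    "    '''Poll describe(EndpointName=...) until the endpoint leaves the Creating state.'''\n",
    "    deadline = time.monotonic() + timeout_seconds\n",
    "    while True:\n",
    "        status = describe(EndpointName=endpoint_name)['EndpointStatus']\n",
    "        if status != 'Creating':\n",
    "            return status  # for example InService or Failed\n",
    "        if time.monotonic() >= deadline:\n",
    "            raise TimeoutError(f'{endpoint_name} still Creating after {timeout_seconds}s')\n",
    "        time.sleep(poll_seconds)\n",
    "\n",
    "\n",
    "# Offline stand-in for sm_client.describe_endpoint (assumption for this sketch)\n",
    "_statuses = iter(['Creating', 'Creating', 'InService'])\n",
    "\n",
    "\n",
    "def fake_describe(EndpointName):\n",
    "    return {'EndpointStatus': next(_statuses)}\n",
    "\n",
    "\n",
    "print(wait_for_endpoint(fake_describe, 'demo-endpoint', poll_seconds=0))\n",
    "```\n",
    "\n",
    "In the notebook itself, `sm_client.describe_endpoint` would be passed in place of `fake_describe`, and the caller should treat any returned status other than `InService` as a failed deployment."
   ]
  },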
  {
   "cell_type": "markdown",
   "id": "f0694a36",
   "metadata": {},
   "source": [
    "### Direct Invocation for Model \n",
    "\n",
    "최초 호출 시 Cold start로 지연 시간이 발생하지만, 최초 호출 이후에는 warm 상태를 유지하기 때문에 빠르게 응답합니다. 물론 수 분 동안 호출이 되지 않거나 요청이 많아지면 cold 상태로 바뀐다는 점을 유의해 주세요."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9610286f",
   "metadata": {},
   "outputs": [],
   "source": [
    "model_sample_path = 'samples/nsmc.txt'\n",
    "!cat $model_sample_path\n",
    "with open(model_sample_path, mode='rb') as file:\n",
    "    model_input_data = file.read()  \n",
    "\n",
    "model_response = sm_runtime.invoke_endpoint(\n",
    "    EndpointName=endpoint_name,\n",
    "    ContentType=\"application/jsonlines\",\n",
    "    Accept=\"application/jsonlines\",\n",
    "    Body=model_input_data\n",
    ")\n",
    "\n",
    "model_outputs = model_response['Body'].read().decode()\n",
    "print()\n",
    "print_outputs(model_outputs)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "57441617",
   "metadata": {},
   "source": [
    "### Check Model Latency"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a61db0d0",
   "metadata": {},
   "outputs": [],
   "source": [
    "import time\n",
    "start = time.time()\n",
    "for _ in range(10):\n",
    "    model_response = sm_runtime.invoke_endpoint(\n",
    "    EndpointName=endpoint_name,\n",
    "    ContentType=\"application/jsonlines\",\n",
    "    Accept=\"application/jsonlines\",\n",
    "    Body=model_input_data\n",
    ")\n",
    "inference_time = (time.time()-start)\n",
    "print(f'Inference time is {inference_time:.4f} ms.')"
   ]
  },
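  {
   "cell_type": "markdown",
   "id": "d9f06a27",
   "metadata": {},
   "source": [
    "Because the first call to a serverless endpoint may include a cold start, it can help to record each call separately instead of timing the whole loop. The following is a minimal sketch, not part of the original notebook: `measure_latencies_ms`, `summarize`, and `fake_invoke` are hypothetical names, and `fake_invoke` stands in for `sm_runtime.invoke_endpoint(...)` so that the example runs offline.\n",
    "\n",
    "```python\n",
    "import time\n",
    "\n",
    "\n",
    "def measure_latencies_ms(invoke, num_calls=10):\n",
    "    '''Call invoke() num_calls times and return each latency in milliseconds.'''\n",
    "    latencies = []\n",
    "    for _ in range(num_calls):\n",
    "        start = time.perf_counter()\n",
    "        invoke()\n",
    "        latencies.append((time.perf_counter() - start) * 1000)\n",
    "    return latencies\n",
    "\n",
    "\n",
    "def summarize(latencies):\n",
    "    '''Separate the first (possibly cold) call from the remaining warm calls.'''\n",
    "    warm = latencies[1:]\n",
    "    return {\n",
    "        'first_ms': latencies[0],\n",
    "        'warm_avg_ms': sum(warm) / len(warm) if warm else None,\n",
    "    }\n",
    "\n",
    "\n",
    "# Stand-in for the real endpoint invocation (assumption for this sketch)\n",
    "def fake_invoke():\n",
    "    time.sleep(0.001)\n",
    "\n",
    "\n",
    "stats = summarize(measure_latencies_ms(fake_invoke, num_calls=5))\n",
    "print(stats)\n",
    "```\n",
    "\n",
    "With the real endpoint, `invoke` could be a small function that wraps the `invoke_endpoint` call used in this notebook. A large gap between `first_ms` and `warm_avg_ms` would suggest that the first call hit a cold start."
   ]
  },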
  {
   "cell_type": "markdown",
   "id": "0645b4d2",
   "metadata": {},
   "source": [
    "<br>\n",
    "\n",
    "## Clean Up\n",
    "---"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "da123982",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "sm_client.delete_endpoint(EndpointName=endpoint_name)\n",
    "sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)\n",
    "sm_client.delete_model(ModelName=model_name)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "conda_pytorch_latest_p37",
   "language": "python",
   "name": "conda_pytorch_latest_p37"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}