{ "cells": [ { "cell_type": "markdown", "id": "71a329f0", "metadata": {}, "source": [ "# Open-LLAMA 7B implementation using LMI container on SageMaker\n", "### Model source: https://github.com/openlm-research/open_llama\n", "#### Model download hub: https://huggingface.co/openlm-research/open_llama_7b_400bt_preview\n", "#### License: Apache-2.0\n", "In this tutorial, you will bring your own container from Docker Hub to SageMaker and run inference with it.\n", "Make sure the following permissions are granted before running the notebook:\n", "\n", "- ECR push/pull access\n", "- S3 bucket push access\n", "- SageMaker access\n", "\n", "#### Attribution: this notebook is based on the content of https://github.com/deepjavalibrary/djl-demo/tree/master and was debugged with the help of lanking520.\n", "\n", "## Step 1: Upgrade the SageMaker SDK and import dependencies" ] }, { "cell_type": "code", "execution_count": 1, "id": "ce21e78b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Note: you may need to restart the kernel to use updated packages.\n" ] } ], "source": [ "%pip install --upgrade pip --quiet" ] }, { "cell_type": "code", "execution_count": null, "id": "67fa3208", "metadata": {}, "outputs": [], "source": [ "%pip install sagemaker boto3 awscli --upgrade --quiet" ] }, { "cell_type": "code", "execution_count": 3, "id": "ec9ac353", "metadata": {}, "outputs": [], "source": [ "import boto3\n", "import sagemaker\n", "from sagemaker import Model, serializers, deserializers\n", "\n", "role = sagemaker.get_execution_role() # execution role for the endpoint\n", "sess = sagemaker.session.Session() # SageMaker session for interacting with different AWS APIs\n", "region = sess.boto_region_name # region name of the current SageMaker Studio environment\n", "account_id = sess.account_id() # account ID of the current SageMaker Studio environment" ] }, { "cell_type": "code", "execution_count": 4, "id": "7f195c0b", "metadata": {}, "outputs": [ {
"name": "stdout", "output_type": "stream", "text": [ "arn:aws:iam::328296961357:role/service-role/AmazonSageMaker-ExecutionRole-20191125T182032 us-west-2 328296961357\n" ] } ], "source": [ "print(role, region, account_id)" ] }, { "cell_type": "code", "execution_count": 5, "id": "a2d4667f", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'2.161.0'" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sagemaker.__version__" ] }, { "cell_type": "markdown", "id": "71542f98", "metadata": {}, "source": [ "## Step 2: Pull the Docker image from Docker Hub and push it to an ECR repository (optional)\n", "*Note: you can either use a prebuilt container or run the cell below (change its cell type from 'raw' to 'code').*\n", "\n", "*Note: make sure your AWS credentials have permission to push to the ECR repository.*\n", "\n", "This process may take a while, depending on the container size and your network bandwidth.\n", "### Note: you only need to push this container once. Once it is in ECR, you can reference the image via \n", "image_uri = f\"{account_id}.dkr.ecr.{region}.amazonaws.com/{repo_name}:latest\"" ] }, { "cell_type": "raw", "id": "7ba4fa61", "metadata": {}, "source": [ "%%sh\n", "\n", "# The name of our container\n", "repo_name=djlserving-byoc\n", "# Target container\n", "target_container=\"deepjavalibrary/djl-serving:0.22.1-deepspeed\"\n", "\n", "account=$(aws sts get-caller-identity --query Account --output text)\n", "\n", "# Get the region defined in the current configuration (default to us-west-2 if none defined)\n", "region=$(aws configure get region)\n", "region=${region:-us-west-2}\n", "\n", "fullname=\"${account}.dkr.ecr.${region}.amazonaws.com/${repo_name}:latest\"\n", "echo \"Creating ECR repository ${fullname}\"\n", "\n", "# If the repository doesn't exist in ECR, create it.\n", "\n", "aws ecr describe-repositories --repository-names \"${repo_name}\" > /dev/null 2>&1\n", "\n", "if [ $? 
-ne 0 ]\n", "then\n", "    aws ecr create-repository --repository-name \"${repo_name}\" > /dev/null\n", "fi\n", "\n", "# Get the login password from ECR and pass it directly to docker login\n", "aws ecr get-login-password --region ${region} | docker login --username AWS --password-stdin \"${account}.dkr.ecr.${region}.amazonaws.com\"\n", "\n", "# Pull the image from Docker Hub, tag it with the full ECR name,\n", "# and push it to ECR.\n", "echo \"Start pulling container: ${target_container}\"\n", "\n", "docker pull ${target_container}\n", "docker tag ${target_container} ${fullname}\n", "docker push ${fullname}" ] }, { "cell_type": "markdown", "id": "81deac79", "metadata": {}, "source": [ "## Step 3: Prepare the model artifacts\n", "The LMI container expects the following artifacts to set up the model:\n", "- serving.properties (required): Defines the model server settings\n", "- model.py (optional): A Python file that defines the core inference logic\n", "- requirements.txt (optional): Any additional pip packages to install" ] }, { "cell_type": "code", "execution_count": 7, "id": "b011bf5f", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Writing serving.properties\n" ] } ], "source": [ "%%writefile serving.properties\n", "engine=Python\n", "option.tensor_parallel_degree=1\n", "option.model_id=openlm-research/open_llama_7b_400bt_preview" ] }, { "cell_type": "code", "execution_count": 8, "id": "4dd9b8ce", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Writing model.py\n" ] } ], "source": [ "%%writefile model.py\n", "from djl_python import Input, Output\n", "from djl_python.streaming_utils import StreamingUtils\n", "import os\n", "import deepspeed\n", "import torch\n", "import logging\n", "from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer\n", "from transformers.models.llama.modeling_llama import LlamaDecoderLayer\n", "##\n", "from transformers import LlamaTokenizer, 
LlamaForCausalLM\n", "\n", "model = None\n", "tokenizer = None\n", "\n", "\n", "def get_model(properties):\n", "    model_name = properties[\"model_id\"]\n", "    tensor_parallel_degree = properties[\"tensor_parallel_degree\"]\n", "    max_tokens = int(properties.get(\"max_tokens\", \"1024\"))\n", "    dtype = torch.float16\n", "\n", "    # Load the weights in fp16 and let device_map place them on the available GPU(s)\n", "    model = LlamaForCausalLM.from_pretrained(model_name, low_cpu_mem_usage=True, torch_dtype=dtype, device_map='auto')\n", "    tokenizer = LlamaTokenizer.from_pretrained(model_name)\n", "    tokenizer.pad_token = tokenizer.eos_token\n", "\n", "    return model, tokenizer\n", "\n", "\n", "def inference(inputs):\n", "    try:\n", "        input_map = inputs.get_as_json()\n", "        data = input_map.pop(\"inputs\", input_map)\n", "        parameters = input_map.pop(\"parameters\", {})\n", "        outputs = Output()\n", "\n", "        enable_streaming = inputs.get_properties().get(\n", "            \"enable_streaming\", \"false\").lower() == \"true\"\n", "        if enable_streaming:\n", "            stream_generator = StreamingUtils.get_stream_generator(\n", "                \"DeepSpeed\")\n", "            outputs.add_stream_content(\n", "                stream_generator(model, tokenizer, data,\n", "                                 **parameters))\n", "            return outputs\n", "\n", "        tokenizer.add_special_tokens({'pad_token': '[PAD]'})\n", "        tokenizer.padding_side = 'left'\n", "        input_tokens = tokenizer(data, padding=True,\n", "                                 return_tensors=\"pt\").to(\n", "                                     torch.cuda.current_device())\n", "        with torch.no_grad():\n", "            output_tokens = model.generate(input_tokens.input_ids, **parameters)\n", "        generated_text = tokenizer.batch_decode(output_tokens,\n", "                                                skip_special_tokens=True)\n", "\n", "        outputs.add_as_json([{\"generated_text\": s} for s in generated_text])\n", "        return outputs\n", "    except Exception as e:\n", "        logging.exception(\"Huggingface inference failed\")\n", "        # Return the error to the caller instead of silently returning None\n", "        outputs = Output().error(str(e))\n", "        return outputs\n", "\n", "\n", "def handle(inputs: Input) -> None:\n", "    global model, tokenizer\n", "    # Load the model lazily on the first call\n", "    if not model:\n", "        model, tokenizer = get_model(inputs.get_properties())\n", 
"\n", "    if inputs.is_empty():\n", "        # Model server makes an empty call to warm up the model on startup\n", "        return None\n", "\n", "    return inference(inputs)" ] }, { "cell_type": "code", "execution_count": 9, "id": "e8b50a6c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Writing requirements.txt\n" ] } ], "source": [ "%%writefile requirements.txt\n", "transformers==4.28.1\n", "protobuf==3.20.1" ] }, { "cell_type": "code", "execution_count": 10, "id": "b0142973", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "mymodel/\n", "mymodel/requirements.txt\n", "mymodel/model.py\n", "mymodel/serving.properties\n" ] } ], "source": [ "%%sh\n", "mkdir mymodel\n", "mv serving.properties mymodel/\n", "mv model.py mymodel/\n", "mv requirements.txt mymodel/\n", "tar czvf mymodel.tar.gz mymodel/\n", "rm -rf mymodel" ] }, { "cell_type": "markdown", "id": "2e58cf33", "metadata": {}, "source": [ "## Step 4: Build the SageMaker endpoint\n", "In this step, we build a SageMaker endpoint from scratch.\n", "\n", "### 4.1 Upload the artifact to S3 and create a SageMaker model" ] }, { "cell_type": "code", "execution_count": 12, "id": "38b1e5ca", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "S3 Code or Model tar ball uploaded to --- > s3://sagemaker-us-west-2-328296961357/large-model-lmi/code/mymodel.tar.gz\n", "763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.22.1-deepspeed0.8.3-cu118\n" ] } ], "source": [ "s3_code_prefix = \"large-model-lmi/code\"\n", "bucket = sess.default_bucket() # bucket to house artifacts\n", "code_artifact = sess.upload_data(\"mymodel.tar.gz\", bucket, s3_code_prefix)\n", "print(f\"S3 Code or Model tar ball uploaded to --- > {code_artifact}\")\n", "\n", "# Alternative: use the image pushed to ECR in Step 2 (in place of the retrieve() call below)\n", "#repo_name=\"djlserving-byoc\"\n", "#image_uri = f\"{account_id}.dkr.ecr.{region}.amazonaws.com/{repo_name}:latest\"\n", "image_uri = sagemaker.image_uris.retrieve(\n", "
framework=\"djl-deepspeed\",\n", "    region=region,\n", "    version=\"0.22.1\"\n", ")\n", "\n", "\n", "print(image_uri)\n", "\n", "model = Model(image_uri=image_uri, model_data=code_artifact, role=role)" ] }, { "cell_type": "markdown", "id": "004f39f6", "metadata": {}, "source": [ "### 4.2 Create the SageMaker endpoint\n", "\n", "Specify the instance type and the endpoint name." ] }, { "cell_type": "code", "execution_count": 13, "id": "8e0e61cd", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--------------!" ] } ], "source": [ "instance_type = \"ml.g5.2xlarge\" # single-GPU instance; one GPU is enough because tensor_parallel_degree is 1\n", "\n", "endpoint_name = sagemaker.utils.name_from_base(\"open-llama-lmi-model\")\n", "\n", "model.deploy(initial_instance_count=1,\n", "             instance_type=instance_type,\n", "             endpoint_name=endpoint_name,\n", "             container_startup_health_check_timeout=900)\n", "\n", "# Requests and responses are JSON, so specify the JSON serializer and deserializer\n", "predictor = sagemaker.Predictor(\n", "    endpoint_name=endpoint_name,\n", "    sagemaker_session=sess,\n", "    serializer=serializers.JSONSerializer(),\n", "    deserializer=deserializers.JSONDeserializer(),\n", ")" ] }, { "cell_type": "markdown", "id": "bb63ee65", "metadata": {}, "source": [ "## Step 5a: Test and benchmark inference latency\n", "### Latency depends heavily on the 'max_new_tokens' parameter" ] }, { "cell_type": "code", "execution_count": 14, "id": "2bcef095", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2.2340340614318848\n" ] } ], "source": [ "import time\n", "tic = time.time()\n", "predictor.predict(\n", "    {\"inputs\": \"tuna sandwich nutritional content is \", \"parameters\": {\"max_new_tokens\": 16}}\n", ")\n", "toc = time.time()\n", "print(toc - tic) # wall-clock latency of a single invocation, in seconds" ] }, { "cell_type": "markdown", "id": "807e258b", "metadata": {}, "source": [ "## Let us define a helper 
function to plot a histogram of the invocation latency distribution" ] }, { "cell_type": "code", "execution_count": 15, "id": "3b35a270", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Matplotlib is building the font cache; this may take a moment.\n" ] } ], "source": [ "import matplotlib.pyplot as plt\n", "import time\n", "import numpy as np\n", "from tqdm import tqdm\n", "\n", "def _latency_hist_plot(endpoint_name, invocation_number=100, sleep_time=1):\n", "    # Note: endpoint_name is not used here; requests go through the global `predictor`\n", "    latency_array = []\n", "    for i in tqdm(range(invocation_number)):\n", "        tic = time.time()\n", "        response_ = predictor.predict(\n", "            {\"inputs\": \"Large model inference is\", \"parameters\": {\"max_new_tokens\": 256}}\n", "        )\n", "        toc = time.time()\n", "        latency_array.append(toc - tic) # per-invocation latency in seconds\n", "        time.sleep(sleep_time)\n", "\n", "    latency_array_np = np.array(latency_array)\n", "    _ = plt.hist(latency_array_np, bins='auto') # arguments are passed to np.histogram\n", "    plt.title(\"Invocation Latency Histogram with 'auto' bins\")\n", "    plt.show()" ] }, { "cell_type": "code", "execution_count": 16, "id": "020c7bd0", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "100%|██████████| 10/10 [01:53<00:00, 11.35s/it]\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "114.2704861164093\n", "CPU times: user 258 ms, sys: 39.5 ms, total: 298 ms\n", "Wall time: 1min 54s\n" ] } ], "source": [ "%%time\n", "inv_start_time=time.time()\n", "invocation_number = 10\n", "# Real-time endpoint\n", "_latency_hist_plot(endpoint_name, invocation_number, sleep_time = 1)\n", "inv_lapse_time=time.time()-inv_start_time\n", "print(inv_lapse_time)" ] }, { "cell_type": "code", "execution_count": 17, "id": "a53c2f1d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "open-llama-lmi-model-2023-06-02-00-16-24-723\n", "us-west-2\n" ] } ], "source": [ "endpoint_name = predictor.endpoint_name\n", "print(endpoint_name)\n", "print(region)" ] }, { "cell_type": "markdown", "id": "a1305440", "metadata": {}, "source": [ "## Step 5b: Analyze Inference Latency via CloudWatch" ] }, { "cell_type": "code", "execution_count": 18, "id": "9b31abc1", "metadata": {}, "outputs": [], "source": [ "# https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-test-endpoints.html\n", "import pandas as pd\n", "\n", "cw = boto3.client(\"cloudwatch\", region_name=region)\n", "\n", "def get_invocation_metrics_for_endpoint(endpoint_name, metric_name, start_time, end_time):\n", "# metric = \"Sum\"\n", " metric = \"Average\"\n", " metrics = cw.get_metric_statistics(\n", " Namespace=\"AWS/SageMaker\",\n", " MetricName=metric_name,\n", " StartTime=start_time,\n", " EndTime=end_time,\n", " Period=1,\n", " Statistics=[metric],\n", " Dimensions=[\n", " {\"Name\": \"EndpointName\", \"Value\": endpoint_name},\n", " {\"Name\": \"VariantName\", \"Value\": \"AllTraffic\"}\n", " ]\n", " )\n", " return (\n", " pd.DataFrame(metrics[\"Datapoints\"])\n", " .sort_values(\"Timestamp\")\n", " .set_index(\"Timestamp\")\n", " .drop(\"Unit\", axis=1)\n", " .rename(columns={metric: metric_name})\n", " )\n", "# return metrics" ] }, { "cell_type": "code", 
"execution_count": 19, "id": "17e16745", "metadata": {}, "outputs": [], "source": [ "import datetime\n", "\n", "def plot_endpoint_metrics(start_time=None, end_time=None):\n", "# start_time = start_time or datetime.datetime.now() - datetime.timedelta(seconds=inv_lapse_time+60)\n", "# end_time = datetime.datetime.now()\n", " model_metrics = get_invocation_metrics_for_endpoint(\n", " endpoint_name, \"ModelLatency\", start_time, end_time\n", " )\n", " overhead_metrics = get_invocation_metrics_for_endpoint(\n", " endpoint_name, \"OverheadLatency\", start_time, end_time\n", " )\n", " total_metrics = model_metrics.join(overhead_metrics)\n", " total_metrics[\"ModelLatency\"] = total_metrics[\"ModelLatency\"]/1000\n", " total_metrics[\"OverheadLatency\"] = total_metrics[\"OverheadLatency\"]/1000\n", "# total_metrics[\"TotalLatency in ms\"] = total_metrics[[\"ModelLatency\",\"OverheadLatency\"]].sum(axis=1)\n", "# total_metrics = total_metrics.drop(['ModelLatency', 'OverheadLatency'], axis=1)\n", " total_metrics.plot()\n", " return total_metrics" ] }, { "cell_type": "code", "execution_count": 20, "id": "9249d41e", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2023-06-02 00:26:07.841647\n", "2023-06-02 00:23:13.571161\n" ] } ], "source": [ "endtime=datetime.datetime.now()\n", "print(endtime)\n", "startime = endtime - datetime.timedelta(seconds=inv_lapse_time+60)\n", "print(startime)\n" ] }, { "cell_type": "code", "execution_count": 21, "id": "3b43f54e", "metadata": {}, "outputs": [], "source": [ "#wait for cloudwatch metrics to populate\n", "time.sleep(300)" ] }, { "cell_type": "code", "execution_count": 22, "id": "2b3f536a", "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "total_metrics = plot_endpoint_metrics(start_time=startime, end_time=endtime)\n", "#total_metrics = plot_endpoint_metrics(start_time=startime, end_time=endtime)" ] }, { "cell_type": "code", "execution_count": 23, "id": "43d10a1d", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
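<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th>Timestamp</th>\n", " <th>ModelLatency</th>\n", " <th>OverheadLatency</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>2023-06-02 00:24:00+00:00</th>\n", " <td>8692.3982</td>\n", " <td>22.5878</td>\n", " </tr>\n", " <tr>\n", " <th>2023-06-02 00:25:00+00:00</th>\n", " <td>10338.1598</td>\n", " <td>7.3332</td>\n", " </tr>\n", " <tr>\n", " <th>2023-06-02 00:26:00+00:00</th>\n", " <td>10314.6550</td>\n", " <td>5.9240</td>\n", " </tr>\n", " </tbody>\n", "</table>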
" ], "text/plain": [ " ModelLatency OverheadLatency\n", "Timestamp \n", "2023-06-02 00:24:00+00:00 8692.3982 22.5878\n", "2023-06-02 00:25:00+00:00 10338.1598 7.3332\n", "2023-06-02 00:26:00+00:00 10314.6550 5.9240" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Latency expressed in ms\n", "total_metrics" ] }, { "cell_type": "markdown", "id": "401d0e40", "metadata": {}, "source": [ "## Clean up the environment" ] }, { "cell_type": "code", "execution_count": null, "id": "5552dd36", "metadata": {}, "outputs": [], "source": [ "sess.delete_endpoint(endpoint_name)\n", "sess.delete_endpoint_config(endpoint_name)\n", "model.delete_model()" ] } ], "metadata": { "kernelspec": { "display_name": "conda_pytorch_p39", "language": "python", "name": "conda_pytorch_p39" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.15" } }, "nbformat": 4, "nbformat_minor": 5 }