{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# MLOps workshop with Amazon SageMaker\n", "\n", "## Module 04: Amazon SageMaker Model Monitor\n", "This notebook shows how to:\n", "* Host a machine learning model in Amazon SageMaker and capture inference requests, results, and metadata \n", "* Analyze a training dataset to generate baseline constraints\n", "* Monitor a live endpoint for violations against constraints\n", "\n", "---\n", "## Background\n", "\n", "Amazon SageMaker provides every developer and data scientist with the ability to build, train, and deploy machine learning models quickly. Amazon SageMaker is a fully-managed service that encompasses the entire machine learning workflow. You can label and prepare your data, choose an algorithm, train a model, and then tune and optimize it for deployment. You can deploy your models to production with Amazon SageMaker to make predictions and lower costs than was previously possible.\n", "\n", "In addition, Amazon SageMaker enables you to capture the input, output and metadata for invocations of the models that you deploy. It also enables you to analyze the data and monitor its quality. In this notebook, you learn how Amazon SageMaker enables these capabilities.\n", "\n", "---\n", "## Setup\n", "\n", "To get started, make sure you have these prerequisites completed.\n", "\n", "* Specify an AWS Region to host your model.\n", "* An IAM role ARN exists that is used to give Amazon SageMaker access to your data in Amazon Simple Storage Service (Amazon S3). See the documentation for how to fine tune the permissions needed. \n", "* Create an S3 bucket used to store the data used to train your model, any additional model data, and the data captured from model invocations. For demonstration purposes, you are using the same bucket for these. In reality, you might want to separate them with different security policies." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "!pip install --upgrade sagemaker -q" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import boto3\n", "import os\n", "import numpy as np\n", "import pandas as pd\n", "import sagemaker\n", "from sagemaker import get_execution_role\n", "from sklearn.model_selection import train_test_split\n", "\n", "region = boto3.Session().region_name\n", "role = get_execution_role()\n", "sess = sagemaker.session.Session()\n", "bucket = sess.default_bucket() \n", "prefix = 'tf-2-workflow'\n", "\n", "s3_capture_upload_path = 's3://{}/{}/monitoring/datacapture'.format(bucket, prefix)\n", "\n", "reports_prefix = '{}/reports'.format(prefix)\n", "s3_report_path = 's3://{}/{}'.format(bucket,reports_prefix)\n", "\n", "print(\"Capture path: {}\".format(s3_capture_upload_path))\n", "print(\"Report path: {}\".format(s3_report_path))\n", "print(f'SageMaker Version: {sagemaker.__version__}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# PART A: Capturing real-time inference data from Amazon SageMaker endpoints\n", "Create an endpoint to showcase the data capture capability in action.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Deploy the model to Amazon SageMaker\n", "Start with deploying the trained TensorFlow model from lab 03." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import boto3\n", "\n", "def get_latest_training_job_name(base_job_name):\n", " client = boto3.client('sagemaker')\n", " response = client.list_training_jobs(NameContains=base_job_name, SortBy='CreationTime', \n", " SortOrder='Descending', StatusEquals='Completed')\n", " if len(response['TrainingJobSummaries']) > 0 :\n", " return response['TrainingJobSummaries'][0]['TrainingJobName']\n", " else:\n", " raise Exception('Training job not found.')\n", "\n", "def get_training_job_s3_model_artifacts(job_name):\n", " client = boto3.client('sagemaker')\n", " response = client.describe_training_job(TrainingJobName=job_name)\n", " s3_model_artifacts = response['ModelArtifacts']['S3ModelArtifacts']\n", " return s3_model_artifacts\n", "\n", "latest_training_job_name = get_latest_training_job_name('tf-2-workflow')\n", "print(latest_training_job_name)\n", "model_path = get_training_job_s3_model_artifacts(latest_training_job_name)\n", "print(model_path)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here, you create the model object with the image and model data." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.tensorflow.model import TensorFlowModel\n", "\n", "tensorflow_model = TensorFlowModel(\n", " model_data = model_path,\n", " role = role,\n", " framework_version = '2.6'\n", ")\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from time import gmtime, strftime\n", "from sagemaker.model_monitor import DataCaptureConfig\n", "\n", "endpoint_name = 'tf-2-workflow-endpoint-' + strftime(\"%Y-%m-%d-%H-%M-%S\", gmtime())\n", "print(endpoint_name)\n", "\n", "predictor = tensorflow_model.deploy(\n", " initial_instance_count=1,\n", " instance_type='ml.m5.xlarge',\n", " endpoint_name=endpoint_name,\n", " data_capture_config=DataCaptureConfig(\n", " enable_capture=True,\n", " sampling_percentage=100,\n", " destination_s3_uri=s3_capture_upload_path\n", " )\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Prepare dataset\n", "\n", "Next, we'll import the dataset. The dataset itself is small and relatively issue-free. For example, there are no missing values, a common problem for many other datasets. Accordingly, preprocessing just involves normalizing the data." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "!aws s3 cp s3://sagemaker-sample-files/datasets/tabular/california_housing/cal_housing.tgz ." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "!tar -zxf cal_housing.tgz 2>/dev/null" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "columns = [\n", " \"longitude\",\n", " \"latitude\",\n", " \"housingMedianAge\",\n", " \"totalRooms\",\n", " \"totalBedrooms\",\n", " \"population\",\n", " \"households\",\n", " \"medianIncome\",\n", " \"medianHouseValue\",\n", "]\n", "df = pd.read_csv(\"CaliforniaHousing/cal_housing.data\", names=columns, header=None)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "columns_to_normalize = [\n", " 'medianIncome', 'housingMedianAge', 'totalRooms', \n", " 'totalBedrooms', 'population', 'households', 'medianHouseValue'\n", "]\n", "\n", "for column in columns_to_normalize:\n", " df[column] = np.log1p(df[column])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "X = df.drop(\"medianHouseValue\", axis=1)\n", "Y = df[\"medianHouseValue\"].copy()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.33)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "x_train = x_train.to_numpy()\n", "x_test = x_test.to_numpy()\n", "y_train = y_train.to_numpy()\n", "y_test = y_test.to_numpy()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "from sklearn.preprocessing import StandardScaler\n", "\n", "scaler = StandardScaler()\n", "scaler.fit(x_train)\n", "x_train = scaler.transform(x_train)\n", "x_test = scaler.transform(x_test)\n", "print(x_train.shape)\n", 
"print(x_test.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Invoke the deployed model\n", "\n", "You can now send data to this endpoint to get inferences in real time. Because you enabled the data capture in the previous steps, the request and response payload, along with some additional metadata, is saved in the Amazon Simple Storage Service (Amazon S3) location you have specified in the DataCaptureConfig." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This step invokes the endpoint with included sample data for about 3 minutes. Data is captured based on the sampling percentage specified and the capture continues until the data capture option is turned off." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time \n", "\n", "import time\n", "\n", "print(\"Sending test traffic to the endpoint {}. \\nPlease wait...\".format(endpoint_name))\n", "\n", "flat_list =[]\n", "for item in x_test[0:600]:\n", " result = predictor.predict(item)['predictions'] \n", " flat_list.append(float('%.1f'%(np.array(result))))\n", " time.sleep(0.5)\n", " \n", "print(\"Done!\")\n", "print('predictions: \\t{}'.format(np.array(flat_list)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## View captured data\n", "\n", "Now list the data capture files stored in Amazon S3. You should expect to see different files from different time periods organized based on the hour in which the invocation occurred. The format of the Amazon S3 path is:\n", "\n", "`s3://{destination-bucket-prefix}/{endpoint-name}/{variant-name}/yyyy/mm/dd/hh/filename.jsonl`\n", "\n", "<b>Note that the delivery of capture data to Amazon S3 can require a couple of minutes so next cell might error. 
If this happens, please retry after a minute.</b>" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s3_client = boto3.Session().client('s3')\n", "result = s3_client.list_objects(Bucket=bucket, Prefix='tf-2-workflow/monitoring/datacapture/')\n", "capture_files = [capture_file.get(\"Key\") for capture_file in result.get('Contents')]\n", "print(\"Found Capture Files:\")\n", "print(\"\\n \".join(capture_files))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, view the contents of a single capture file. You should see all the data captured in an Amazon SageMaker-specific JSON Lines file. Take a quick peek at the first few lines in the captured file." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def get_obj_body(obj_key):\n", " return s3_client.get_object(Bucket=bucket, Key=obj_key).get('Body').read().decode(\"utf-8\")\n", "\n", "capture_file = get_obj_body(capture_files[-1])\n", "print(capture_file[:2000])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, a single captured line is pretty-printed below as JSON so that you can inspect it more easily." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import json\n", "print(json.dumps(json.loads(capture_file.split('\\n')[0]), indent=2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, each inference request is captured on one line of the JSON Lines file. The line contains both the input and output merged together. The content type of the request is recorded in the `observedContentType` value, and the `encoding` value exposes the encoding used for the input and output payloads.\n", "\n", "To recap, you observed how you can enable capturing the input and output payloads of an endpoint with the `DataCaptureConfig` parameter. 
You have also observed what the captured format looks like in Amazon S3. Next, continue to explore how Amazon SageMaker helps with monitoring the data collected in Amazon S3." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# PART B: Model Monitor - Baselining and continuous monitoring" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In addition to collecting the data, Amazon SageMaker provides the capability for you to monitor and evaluate the data observed by the endpoints. For this:\n", "1. Create a baseline against which you compare the real-time traffic. \n", "1. Once a baseline is ready, set up a schedule to continuously evaluate and compare against the baseline." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Constraint suggestion with baseline/training dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The training dataset with which you trained the model is usually a good baseline dataset. Note that the training dataset schema and the inference dataset schema should exactly match (i.e. the number and order of the features).\n", "\n", "From the training dataset you can ask Amazon SageMaker to suggest a set of baseline `constraints` and generate descriptive `statistics` to explore the data. For this example, upload the training dataset that was used to train the model. If you already have it in Amazon S3, you can directly point to it."
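] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because the baseline and inference schemas must match exactly, a quick check like the following can catch mismatches early. This is a self-contained sketch with toy data (the column names are made up, not the housing features):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "baseline = pd.DataFrame({\"f0\": [1.0], \"f1\": [2.0], \"f2\": [3.0]})\n", "inference_row = [0.5, 1.5, 2.5]  # features must arrive in the same order as the baseline columns\n", "\n", "assert len(inference_row) == baseline.shape[1], \"feature count mismatch\"\n", "print(\"feature count matches:\", baseline.shape[1])"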
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Prepare training dataset with headers" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = df.drop(['medianHouseValue'], axis=1)\n", "df.to_csv(\"training-dataset-with-header.csv\", index = False)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.info()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# copy over the training dataset to Amazon S3 (if you already have it in Amazon S3, you could reuse it)\n", "baseline_prefix = prefix + '/baselining'\n", "baseline_data_prefix = baseline_prefix + '/data'\n", "baseline_results_prefix = baseline_prefix + '/results'\n", "\n", "baseline_data_uri = 's3://{}/{}'.format(bucket,baseline_data_prefix)\n", "baseline_results_uri = 's3://{}/{}'.format(bucket, baseline_results_prefix)\n", "print('Baseline data uri: {}'.format(baseline_data_uri))\n", "print('Baseline results uri: {}'.format(baseline_results_uri))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "training_data_file = open(\"training-dataset-with-header.csv\", 'rb')\n", "s3_key = os.path.join(baseline_prefix, 'data', 'training-dataset-with-header.csv')\n", "boto3.Session().resource('s3').Bucket(bucket).Object(s3_key).upload_fileobj(training_data_file)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create a baselining job with training dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that you have the training data ready in Amazon S3, start a job to `suggest` constraints. `DefaultModelMonitor.suggest_baseline(..)` starts a `ProcessingJob` using an Amazon SageMaker provided Model Monitor container to generate the constraints." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.model_monitor import DefaultModelMonitor\n", "from sagemaker.model_monitor.dataset_format import DatasetFormat\n", "\n", "my_default_monitor = DefaultModelMonitor(\n", " role=role,\n", " instance_count=1,\n", " instance_type='ml.m5.xlarge',\n", " volume_size_in_gb=20,\n", " max_runtime_in_seconds=3600,\n", ")\n", "\n", "my_default_monitor.suggest_baseline(\n", " baseline_dataset=baseline_data_uri+'/training-dataset-with-header.csv',\n", " dataset_format=DatasetFormat.csv(header=True),\n", " output_s3_uri=baseline_results_uri,\n", " wait=True\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Explore the generated constraints and statistics" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s3_client = boto3.Session().client('s3')\n", "result = s3_client.list_objects(Bucket=bucket, Prefix=baseline_results_prefix)\n", "report_files = [report_file.get(\"Key\") for report_file in result.get('Contents')]\n", "print(\"Found Files:\")\n", "print(\"\\n \".join(report_files))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "baseline_job = my_default_monitor.latest_baselining_job\n", "schema_df = pd.io.json.json_normalize(baseline_job.baseline_statistics().body_dict[\"features\"])\n", "schema_df.head(10)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "constraints_df = pd.io.json.json_normalize(baseline_job.suggested_constraints().body_dict[\"features\"])\n", "constraints_df.head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. 
Analyzing collected data for data quality issues" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create a schedule" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can create a model monitoring schedule for the endpoint created earlier. Use the baseline resources (constraints and statistics) to compare against the real-time traffic." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From the analysis above, you saw how the captured data is saved: that is the standard input and output format for TensorFlow models. But Model Monitor is framework-agnostic, and expects a specific format [explained in the docs](https://docs.aws.amazon.com/sagemaker/latest/dg/model-monitor-pre-and-post-processing.html#model-monitor-pre-processing-script):\n", "- Input\n", " - Flattened JSON `{\"feature0\": <value>, \"feature1\": <value>...}`\n", " - Tabular `\"<value>, <value>...\"`\n", "- Output:\n", " - Flattened JSON `{\"prediction0\": <value>, \"prediction1\": <value>...}`\n", " - Tabular `\"<value>, <value>...\"`\n", " \n", "We need to transform the input records to comply with this requirement. Model Monitor supports _pre-processing scripts_ in Python to transform the input. The cell below contains a script that works for our case."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%writefile preprocessing.py\n", "\n", "import json\n", "\n", "def preprocess_handler(inference_record):\n", " input_data = json.loads(inference_record.endpoint_input.data)\n", " input_data = {f\"feature{i}\": val for i, val in enumerate(input_data)}\n", " \n", " output_data = json.loads(inference_record.endpoint_output.data)[\"predictions\"][0][0]\n", " output_data = {\"prediction0\": output_data}\n", " \n", " return{**input_data}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll upload this script to an s3 destination and pass it as the `record_preprocessor_script` parameter to the `create_monitoring_schedule` call." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "preprocessor_s3_dest_path = f\"s3://{bucket}/{prefix}/artifacts/modelmonitor\"\n", "preprocessor_s3_dest = sagemaker.s3.S3Uploader.upload(\"preprocessing.py\", preprocessor_s3_dest_path)\n", "print(preprocessor_s3_dest)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.model_monitor import CronExpressionGenerator\n", "from time import gmtime, strftime\n", "\n", "mon_schedule_name = 'DEMO-tf-2-workflow-model-monitor-schedule-' + strftime(\"%Y-%m-%d-%H-%M-%S\", gmtime())\n", "my_default_monitor.create_monitoring_schedule(\n", " monitor_schedule_name=mon_schedule_name,\n", " endpoint_input=predictor.endpoint,\n", " record_preprocessor_script=preprocessor_s3_dest,\n", " output_s3_uri=s3_report_path,\n", " statistics=my_default_monitor.baseline_statistics(),\n", " constraints=my_default_monitor.suggested_constraints(),\n", " schedule_cron_expression=CronExpressionGenerator.hourly(),\n", " enable_cloudwatch_metrics=True,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Generating violations artificially\n", "\n", "In order to get some result relevant to 
monitoring analysis, you can artificially generate some inferences with feature values that cause specific violations, and then invoke the endpoint with this data.\n", "\n", "Looking at our `housingMedianAge` and `medianIncome` features:\n", "\n", "- housingMedianAge - median house age in block group\n", "- medianIncome - median income in block group\n", "\n", "Let's simulate a situation where `housingMedianAge` and `medianIncome` are both set to -10, a value far outside the ranges observed in the training data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_with_violations = pd.read_csv(\"training-dataset-with-header.csv\")\n", "df_with_violations[\"housingMedianAge\"] = -10\n", "df_with_violations[\"medianIncome\"] = -10\n", "df_with_violations" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Start generating some artificial traffic\n", "The cell below starts a thread to send some traffic to the endpoint. Note that you need to stop the kernel to terminate this thread. If there is no traffic, the monitoring jobs are marked as `Failed` since there is no data to process." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from threading import Thread\n", "import time\n", "\n", "def invoke_endpoint():\n", " for item in df_with_violations.to_numpy():\n", " result = predictor.predict(item)['predictions'] \n", " time.sleep(0.5)\n", "\n", "def invoke_endpoint_forever():\n", " while True:\n", " invoke_endpoint()\n", " \n", "thread = Thread(target = invoke_endpoint_forever)\n", "thread.start()\n", "\n", "# Note that you need to stop the kernel to stop the invocations" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Describe and inspect the schedule\n", "Once you describe the schedule, observe that `MonitoringScheduleStatus` changes to `Scheduled`."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "desc_schedule_result = my_default_monitor.describe_schedule()\n", "print('Schedule status: {}'.format(desc_schedule_result['MonitoringScheduleStatus']))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### List executions\n", "The schedule starts jobs at the previously specified intervals. Here, you list the latest five executions. Note that if you are kicking this off after creating the hourly schedule, you might find the executions empty. You might have to wait until you cross the hour boundary (in UTC) to see executions kick off. The code below has the logic for waiting.\n", "\n", "Note: Even for an hourly schedule, Amazon SageMaker has a buffer period of 20 minutes to schedule your execution. You might see your execution start in anywhere from zero to ~20 minutes from the hour boundary. This is expected and done for load balancing in the backend." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mon_executions = my_default_monitor.list_executions()\n", "print(\"We created a hourly schedule above and it will kick off executions ON the hour (plus 0 - 20 min buffer.\\nWe will have to wait till we hit the hour...\")\n", "\n", "while len(mon_executions) == 0:\n", " print(\"Waiting for the 1st execution to happen...\")\n", " time.sleep(60)\n", " mon_executions = my_default_monitor.list_executions() " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Inspect a specific execution (latest execution)\n", "In the previous cell, you picked up the latest completed or failed scheduled execution. 
Here are the possible terminal states and what each of them means: \n", "* Completed - This means the monitoring execution completed and no issues were found in the violations report.\n", "* CompletedWithViolations - This means the execution completed, but constraint violations were detected.\n", "* Failed - The monitoring execution failed, possibly due to a client error (for example, incorrect role permissions) or infrastructure issues. Further examination of FailureReason and ExitMessage is necessary to identify what exactly happened.\n", "* Stopped - The job exceeded the maximum runtime or was manually stopped." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "latest_execution = mon_executions[-1] # latest execution's index is -1, second to last is -2 and so on..\n", "latest_execution.wait(logs=False)\n", "\n", "print(\"Latest execution status: {}\".format(latest_execution.describe()['ProcessingJobStatus']))\n", "print(\"Latest execution result: {}\".format(latest_execution.describe()['ExitMessage']))\n", "\n", "latest_job = latest_execution.describe()\n", "if latest_job['ProcessingJobStatus'] != 'Completed':\n", " print(\"====STOP==== \\n No completed executions to inspect further. 
Please wait until an execution completes or investigate previously reported failures.\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "report_uri = latest_execution.output.destination\n", "print('Report Uri: {}'.format(report_uri))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### List the generated reports" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from urllib.parse import urlparse\n", "s3uri = urlparse(report_uri)\n", "report_bucket = s3uri.netloc\n", "report_key = s3uri.path.lstrip('/')\n", "print('Report bucket: {}'.format(report_bucket))\n", "print('Report key: {}'.format(report_key))\n", "\n", "s3_client = boto3.Session().client('s3')\n", "result = s3_client.list_objects(Bucket=report_bucket, Prefix=report_key)\n", "report_files = [report_file.get(\"Key\") for report_file in result.get('Contents')]\n", "print(\"Found Report Files:\")\n", "print(\"\\n \".join(report_files))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Violations report" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If there are any violations compared to the baseline, they will be listed here." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "violations = my_default_monitor.latest_monitoring_constraint_violations()\n", "pd.set_option('display.max_colwidth', None)\n", "constraints_df = pd.json_normalize(violations.body_dict[\"violations\"])\n", "constraints_df.head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Triggering execution manually\n", "\n", "In order to trigger the execution manually, we first get all paths to data capture, baseline statistics, baseline constraints, etc.\n", "Then, we use a utility function, defined in <a href=\"./monitoringjob_utils.py\">monitoringjob_utils.py</a>, to run the processing job."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "result = s3_client.list_objects(Bucket=bucket, Prefix='tf-2-workflow/monitoring/datacapture/')\n", "capture_files = ['s3://{0}/{1}'.format(bucket, capture_file.get(\"Key\")) for capture_file in result.get('Contents')]\n", "\n", "print(\"Capture Files: \")\n", "print(\"\\n\".join(capture_files))\n", "\n", "data_capture_path = capture_files[len(capture_files) - 1][: capture_files[len(capture_files) - 1].rfind('/')]\n", "statistics_path = baseline_results_uri + '/statistics.json'\n", "constraints_path = baseline_results_uri + '/constraints.json'\n", "\n", "print(data_capture_path)\n", "print(preprocessor_s3_dest)\n", "print(statistics_path)\n", "print(constraints_path)\n", "print(s3_report_path)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from monitoringjob_utils import run_model_monitor_job_processor\n", "\n", "processor = run_model_monitor_job_processor(region, 'ml.m5.xlarge', \n", " role, \n", " data_capture_path, \n", " statistics_path, \n", " constraints_path, \n", " s3_report_path,\n", " preprocessor_path=preprocessor_s3_dest)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Inspect the execution" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import boto3\n", "\n", "def get_latest_model_monitor_processing_job_name(base_job_name):\n", " client = boto3.client('sagemaker')\n", " response = client.list_processing_jobs(NameContains=base_job_name, SortBy='CreationTime', \n", " SortOrder='Descending', StatusEquals='Completed')\n", " if len(response['ProcessingJobSummaries']) > 0 :\n", " return response['ProcessingJobSummaries'][0]['ProcessingJobName']\n", " else:\n", " raise Exception('Processing job not found.')\n", "\n", "def get_model_monitor_processing_job_s3_report(job_name):\n", " client = boto3.client('sagemaker')\n", " response = 
client.describe_processing_job(ProcessingJobName=job_name)\n", " s3_report_path = response['ProcessingOutputConfig']['Outputs'][0]['S3Output']['S3Uri']\n", " return s3_report_path\n", "\n", "latest_model_monitor_processing_job_name = get_latest_model_monitor_processing_job_name('sagemaker-model-monitor-analyzer')\n", "print(latest_model_monitor_processing_job_name)\n", "report_path = get_model_monitor_processing_job_s3_report(latest_model_monitor_processing_job_name)\n", "print(report_path)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.set_option('display.max_colwidth', None)\n", "df = pd.read_json('{}/constraint_violations.json'.format(report_path))\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Delete the resources\n", "\n", "You can keep your endpoint running to continue capturing data. If you do not plan to collect more data or use this endpoint further, you should delete the endpoint to avoid incurring additional charges. Note that deleting your endpoint does not delete the data that was captured during the model invocations. That data persists in Amazon S3 until you delete it yourself.\n", "\n", "Before deleting the endpoint, however, you need to delete the monitoring schedule."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "my_default_monitor.delete_monitoring_schedule()\n", "time.sleep(120) # actually wait for the deletion" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "predictor.delete_endpoint()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "instance_type": "ml.t3.medium", "kernelspec": { "display_name": "Python 3 (TensorFlow 2.6 Python 3.8 CPU Optimized)", "language": "python", "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/tensorflow-2.6-cpu-py38-ubuntu20.04-v1" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.2" } }, "nbformat": 4, "nbformat_minor": 4 }