{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# データ品質モニタリングのステップA" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "このノートブックを実行する時のヒント: \n", "- KernelはPython3(Data Science)で動作確認をしています。\n", "- デフォルトではSageMakerのデフォルトBucketを利用します。必要に応じて変更することも可能です。\n", "- 実際に動かさなくても出力を確認できるようにセルのアウトプットを残しています。きれいな状態から実行したい場合は、右クリックメニューから \"Clear All Outputs\"を選択して出力をクリアしてから始めてください。\n", "- 作成されたスケジュールはSageMaker Studioの`SageMaker resource` (左側ペインの一番下)のEndpointメニューからも確認可能です" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "モデル名は実行前に設定変更が必要です" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# step-0-train-model.ipynb でトレーニングしたモデルの名前に変更してください\n", "model_name = 'nyctaxi-xgboost-regression-2022-12-15-10-58-19-model'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "複数のノートブックで共通で使用する変数" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# エンドポイント名を指定する\n", "endpoint_name = 'nyctaxi-xgboost-endpoint'\n", "\n", "# エンドポイントConfigの名前を指定する\n", "endpoint_config_name = f'{endpoint_name}-config'\n", "\n", "# データ品質のモニタリングスケジュールの名前を指定する\n", "data_quality_monitoring_schedule = f'{endpoint_name}-data-quality-schedule'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "モニタリング結果を保管するための、ベースラインやレポートのS3上のPrefixを設定します" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "# ベースラインの出力先Prefixを設定する\n", "baseline_prefix = 'model_monitor/data_quality_baseline'\n", "\n", "# 時系列での可視化のために、複数のレポートに共通するPrefixを設定する\n", "report_prefix = 'model_monitor/data_quality_monitoring_report'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### A1. 推論エンドポイントにデータキャプチャの設定を行う" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### データキャプチャの設定を入れた新しい推論エンドポイントを作成する場合" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "import boto3\n", "import time\n", "import sagemaker" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "sm_client = boto3.client('sagemaker')\n", "bucket = sagemaker.Session().default_bucket()\n", "\n", "# Create endpoint config\n", "create_endpoint_config_response = sm_client.create_endpoint_config(\n", " EndpointConfigName=endpoint_config_name,\n", " ProductionVariants=[{\n", " 'InstanceType': 'ml.t2.medium',\n", " 'InitialVariantWeight': 1,\n", " 'InitialInstanceCount': 1,\n", " 'ModelName': model_name,\n", " 'VariantName': 'AllTraffic'}],\n", " # Set data capture config\n", " DataCaptureConfig={\n", " 'EnableCapture': True,\n", " 'InitialSamplingPercentage': 100,\n", " 'DestinationS3Uri': f's3://{bucket}/model_monitor/endpoint-data-capture',\n", " 'CaptureOptions': [{'CaptureMode': 'Input'}, {'CaptureMode': 'Output'}],\n", " 'CaptureContentTypeHeader': {\n", " 'CsvContentTypes': ['text/csv'],\n", " 'JsonContentTypes': ['application/json']\n", " }\n", " }\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "エンドポイントのデプロイに10分程度の時間がかかります" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Status: Creating\n", "Status: Creating\n", "Status: Creating\n", "Status: Creating\n", "Status: Creating\n", "Status: Creating\n", "Finished! InService\n" ] } ], "source": [ "def wait_for_endpoint_creation(endpoint_name):\n", " sm_client = boto3.client('sagemaker')\n", " \n", " # Check endpoint creation status\n", " resp = sm_client.describe_endpoint(EndpointName=endpoint_name)\n", " status = resp['EndpointStatus']\n", " while status=='Creating':\n", " print(\"Status: \" + status)\n", " time.sleep(60)\n", " resp = sm_client.describe_endpoint(EndpointName=endpoint_name)\n", " status = resp['EndpointStatus']\n", " \n", " print('Finished!', status)\n", "\n", " \n", "# Create endpoint\n", "create_endpoint_response = sm_client.create_endpoint(\n", " EndpointName=endpoint_name,\n", " EndpointConfigName=endpoint_config_name\n", ")\n", "wait_for_endpoint_creation(endpoint_name)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### A2. ベースラインを作成する\n", "ベースラインの計算には20分から25分程度の時間がかかります" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "import sagemaker\n", "from sagemaker import model_monitor\n", "from sagemaker.model_monitor.dataset_format import DatasetFormat\n", "\n", "my_default_monitor = model_monitor.DefaultModelMonitor(\n", " role=sagemaker.get_execution_role(),\n", " instance_count=1,\n", " instance_type='ml.t3.large',\n", " volume_size_in_gb=100,\n", " max_runtime_in_seconds=3600,\n", ")\n", "\n", "my_default_monitor.suggest_baseline(\n", " baseline_dataset=f's3://{bucket}/model_monitor/data_quality_baseline_input/',\n", " dataset_format=DatasetFormat.csv(header=True, output_columns_position='START'),\n", " output_s3_uri=f's3://{bucket}/{baseline_prefix}',\n", " wait=True\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### A3. モニタリングをスケジュールする" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "ベースラインを作成したModelMonitorインスタンスをそのまま使う場合" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "my_default_monitor.create_monitoring_schedule(\n", " monitor_schedule_name=data_quality_monitoring_schedule,\n", " endpoint_input=endpoint_name,\n", " output_s3_uri=f's3://{bucket}/{report_prefix}',\n", " statistics=my_default_monitor.baseline_statistics(),\n", " constraints=my_default_monitor.suggested_constraints(),\n", " schedule_cron_expression=model_monitor.CronExpressionGenerator.hourly(),\n", " enable_cloudwatch_metrics=True,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "作成されたスケジュールを確認" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'MonitoringScheduleArn': 'arn:aws:sagemaker:ap-northeast-1:370828233696:monitoring-schedule/nyctaxi-xgboost-endpoint-data-quality-schedule',\n", " 'MonitoringScheduleName': 'nyctaxi-xgboost-endpoint-data-quality-schedule',\n", " 'MonitoringScheduleStatus': 'Pending',\n", " 'MonitoringType': 'DataQuality',\n", " 'CreationTime': datetime.datetime(2022, 12, 16, 10, 19, 36, 357000, tzinfo=tzlocal()),\n", " 'LastModifiedTime': datetime.datetime(2022, 12, 16, 10, 19, 36, 447000, tzinfo=tzlocal()),\n", " 'MonitoringScheduleConfig': {'ScheduleConfig': {'ScheduleExpression': 'cron(0 * ? * * *)'},\n", " 'MonitoringJobDefinitionName': 'data-quality-job-definition-2022-12-16-10-19-35-972',\n", " 'MonitoringType': 'DataQuality'},\n", " 'EndpointName': 'nyctaxi-xgboost-endpoint',\n", " 'LastMonitoringExecutionSummary': {'MonitoringScheduleName': 'nyctaxi-xgboost-endpoint-data-quality-schedule',\n", " 'ScheduledTime': datetime.datetime(2022, 12, 16, 9, 0, tzinfo=tzlocal()),\n", " 'CreationTime': datetime.datetime(2022, 12, 16, 9, 5, 34, 301000, tzinfo=tzlocal()),\n", " 'LastModifiedTime': datetime.datetime(2022, 12, 16, 9, 5, 38, 923000, tzinfo=tzlocal()),\n", " 'MonitoringExecutionStatus': 'Failed',\n", " 'EndpointName': 'nyctaxi-xgboost-endpoint',\n", " 'FailureReason': 'Job inputs had no data'},\n", " 'ResponseMetadata': {'RequestId': 'b303d219-2fc3-44cc-aa0d-599f77e3ce0a',\n", " 'HTTPStatusCode': 200,\n", " 'HTTPHeaders': {'x-amzn-requestid': 'b303d219-2fc3-44cc-aa0d-599f77e3ce0a',\n", " 'content-type': 'application/x-amz-json-1.1',\n", " 'content-length': '921',\n", " 'date': 'Fri, 16 Dec 2022 10:19:36 GMT'},\n", " 'RetryAttempts': 0}}" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sm_client.describe_monitoring_schedule(MonitoringScheduleName=data_quality_monitoring_schedule)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# ステップAに必要なコードの実行はここまで" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 以下は参考" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 参考|既存の推論エンドポイントにデータキャプチャの設定を行う場合" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# データキャプチャの設定をする既存の推論エンドポイントの名前を設定\n", "endpoint_name = 'nyctaxi-xgboost-endpoint'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.model_monitor import DataCaptureConfig\n", "import sagemaker\n", "\n", "bucket = sagemaker.Session().default_bucket()\n", "data_capture_config = DataCaptureConfig(\n", " enable_capture = True,\n", " sampling_percentage=100,\n", " destination_s3_uri=f's3://{bucket}/model_monitor/endpoint-data-capture',\n", " capture_options=[\"REQUEST\", \"RESPONSE\"],\n", " csv_content_types=[\"text/csv\"],\n", " json_content_types=[\"application/json\"]\n", ")\n", "\n", "predictor = sagemaker.Predictor(endpoint_name=endpoint_name)\n", "predictor.update_data_capture_config(data_capture_config=data_capture_config)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 参考|過去に作成したベースラインを利用してスケジュールを作成する場合" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [], "source": [ "from sagemaker import model_monitor\n", "\n", "statistics_from_s3 = model_monitor.Statistics.from_s3_uri(f's3://{bucket}/{baseline_prefix}/statistics.json',)\n", "constraints_from_s3 = model_monitor.Constraints.from_s3_uri(f's3://{bucket}/{baseline_prefix}/constraints.json',)" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [], "source": [ "my_default_monitor = model_monitor.DefaultModelMonitor(\n", " role=sagemaker.get_execution_role(),\n", " instance_count=1,\n", " instance_type='ml.t3.large',\n", " volume_size_in_gb=100,\n", " max_runtime_in_seconds=3600,\n", ")\n", "\n", "my_default_monitor.create_monitoring_schedule(\n", " monitor_schedule_name=data_quality_monitoring_schedule,\n", " endpoint_input=endpoint_name,\n", " output_s3_uri=f's3://{bucket}/{report_prefix}',\n", " statistics=statistics_from_s3,\n", " constraints=constraints_from_s3,\n", " schedule_cron_expression=model_monitor.CronExpressionGenerator.daily(),\n", " enable_cloudwatch_metrics=True,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 参考|スケジュール用cron式の生成例" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(model_monitor.CronExpressionGenerator.hourly())\n", "print(model_monitor.CronExpressionGenerator.daily())\n", "print(model_monitor.CronExpressionGenerator.daily_every_x_hours(6))\n", "print(model_monitor.CronExpressionGenerator.daily_every_x_hours(6, starting_hour=2))" ] } ], "metadata": { "instance_type": "ml.t3.medium", "kernelspec": { "display_name": "Python 3 (Data Science)", "language": "python", "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:ap-northeast-1:102112518831:image/datascience-1.0" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.10" } }, "nbformat": 4, "nbformat_minor": 4 }