{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# B2. Analyzing Monitoring Results" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Tips for running this notebook: \n", "- This notebook loads a large volume of raw data, so run it on an instance with at least 8 GB of memory.\n", "- It has been tested with the Python 3 (Data Science) kernel.\n", "- In Step B, you need to send inference requests and then wait for the monitoring report to be written. If monitoring is scheduled hourly, the monitoring job runs between minute 0 and minute 20 of the hour following the inference (for example, in the 17:00 hour if you ran inference during the 16:00 hour), so check the status of the Processing Job and wait for it to complete before continuing.\n", "- By default, the SageMaker default bucket is used. You can change it if necessary.\n", "- Some cell outputs are kept so that you can review the results without running the notebook. To start from a clean state, choose \"Clear All Outputs\" from the right-click menu before you begin." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Variables shared across multiple notebooks" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "# Specify the endpoint name\n", "endpoint_name = 'nyctaxi-xgboost-endpoint'\n", "\n", "# Specify the endpoint config name\n", "endpoint_config_name = '{}-config'.format(endpoint_name)\n", "\n", "# Specify the name of the data quality monitoring schedule\n", "data_quality_monitoring_schedule = f'{endpoint_name}-data-quality-schedule'\n", "\n", "# Use the SageMaker default bucket as the Model Monitor bucket\n", "# If you use a different bucket, specify it here\n", "import sagemaker\n", "bucket = sagemaker.Session().default_bucket()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Option A: Run inference requests in this environment\n", "After sending the inference requests, you need to wait for the next scheduled monitoring job to run." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "# Set the output prefix for the baseline\n", "baseline_prefix = 'model_monitor/data_quality_baseline'\n", "\n", "# Set the prefix shared by multiple reports, used for time-series visualization\n", "report_prefix = 'model_monitor/data_quality_monitoring_report'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Run inference and wait for the next monitoring cycle" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "# Specify the date to run inference for\n", "prediction_target_date = '2021-08-15'\n", "\n", "# 
Specify the data sampling rate (match the setting used when the model was created)\n", "sampling_rate = 20" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "import os\n", "import boto3\n", "import pandas as pd\n", "import time\n", "from datetime import datetime\n", "import model_utils" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "def get_data_for_pred(target, sampling_rate):\n", "    # Load raw data for the previous month and the target month\n", "    previous_year, previous_month = model_utils.get_previous_year_month(target.year, target.month)\n", "    df_previous_month = model_utils.get_raw_data(previous_year, previous_month, sampling_rate)\n", "    df_current_month = model_utils.get_raw_data(target.year, target.month, sampling_rate)\n", "    df_data = pd.concat([df_previous_month, df_current_month])\n", "    del df_previous_month\n", "    del df_current_month\n", "\n", "    # Extract features, then keep only the target month\n", "    df_features = model_utils.extract_features(df_data)\n", "    df_features = model_utils.filter_current_month(df_features, target.year, target.month)\n", "\n", "    return df_features" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Loading data for 2021-08\n", "Predicting 2021-08-15 00:00:00 nyctaxi-xgboost-endpoint\n", "\n", "Sent inference requests for the 2021-08-15 data\n", "The monitoring job will run between minute 0 and minute 20 of the next hour\n" ] } ], "source": [ "target_date = pd.to_datetime(prediction_target_date)\n", "print('Loading data for', target_date.strftime('%Y-%m'))\n", "df_features = get_data_for_pred(target_date, sampling_rate)\n", "\n", "# Run predictions for the target date\n", "print('Predicting', target_date, endpoint_name)\n", "df_pred = df_features[df_features.index == target_date].copy()\n", "df_pred[['pred', 'inference_id']] = model_utils.exec_prediction(endpoint_name, df_pred)\n", "\n", "print('')\n", "print(f'Sent inference requests for the {prediction_target_date} data')\n", "print('The monitoring job will run between minute 0 and minute 20 of the next hour')" ] }, { "cell_type": "markdown", "metadata": {}, 
"source": [ "#### Check the reports written by the monitoring job" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2022-12-15 12:24:21 24055 model_monitor/data_quality_monitoring_report/nyctaxi-xgboost-endpoint/nyctaxi-xgboost-endpoint-data-quality-schedule/2022/12/15/12/constraint_violations.json\n", "2022-12-15 12:21:07 24353 model_monitor/data_quality_monitoring_report/nyctaxi-xgboost-endpoint/nyctaxi-xgboost-endpoint-data-quality-schedule/2022/12/15/12/constraints.json\n", "2022-12-15 12:21:07 408519 model_monitor/data_quality_monitoring_report/nyctaxi-xgboost-endpoint/nyctaxi-xgboost-endpoint-data-quality-schedule/2022/12/15/12/statistics.json\n", "2022-12-16 11:21:47 25817 model_monitor/data_quality_monitoring_report/nyctaxi-xgboost-endpoint/nyctaxi-xgboost-endpoint-data-quality-schedule/2022/12/16/11/constraint_violations.json\n", "2022-12-16 11:18:43 24353 model_monitor/data_quality_monitoring_report/nyctaxi-xgboost-endpoint/nyctaxi-xgboost-endpoint-data-quality-schedule/2022/12/16/11/constraints.json\n", "2022-12-16 11:18:43 568365 model_monitor/data_quality_monitoring_report/nyctaxi-xgboost-endpoint/nyctaxi-xgboost-endpoint-data-quality-schedule/2022/12/16/11/statistics.json\n" ] } ], "source": [ "!aws s3 ls s3://$bucket/$report_prefix/ --recursive" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If the reports were written successfully, the cell above lists the report files on S3. Set specific_report_prefix to the path up to model_monitor/xxxx/YYYY/MM/DD/HH." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "# Set the prefix of a specific report written by the monitoring job\n", "specific_report_prefix = 'model_monitor/data_quality_monitoring_report/nyctaxi-xgboost-endpoint/nyctaxi-xgboost-endpoint-data-quality-schedule/2022/12/16/11'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Option B: Analyze sample reports without waiting for a monitoring run\n", 
"Waiting for the monitoring job after running inference takes time. If you want to try the analysis and visualization with the reports bundled with the sample code, run the cell below to upload the sample reports to your S3 bucket. \n", "If you are visualizing your own reports, skip this cell." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "sagemaker.s3.S3Uploader.upload('data_quality_samples', f's3://{bucket}/model_monitor/data_quality_samples')\n", "\n", "baseline_prefix = 'model_monitor/data_quality_samples/baseline'\n", "report_prefix = 'model_monitor/data_quality_samples/reports'\n", "specific_report_prefix = 'model_monitor/data_quality_samples/reports/2020/03/16/01'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## B2-2. Get the most recent monitoring report from the Model Monitor instance" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "import random\n", "import boto3\n", "import numpy as np\n", "import pandas as pd\n", "\n", "import sagemaker\n", "from sagemaker import model_monitor\n", "from sagemaker import get_execution_role\n", "import monitor_utils as mu\n", "\n", "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "scrolled": true }, "outputs": [], "source": [ "default_monitor = model_monitor.DefaultModelMonitor(\n", "    role=sagemaker.get_execution_role(),\n", "    instance_count=1,\n", "    instance_type='ml.t3.medium',\n", "    volume_size_in_gb=100,\n", "    max_runtime_in_seconds=3600,\n", ")\n", "# Attach to the existing monitoring schedule and fetch its latest statistics\n", "existing_model_monitor = default_monitor.attach(monitor_schedule_name=data_quality_monitoring_schedule)\n", "statistics = existing_model_monitor.latest_monitoring_statistics()\n", "if statistics:\n", "    # An expression inside an if block is not auto-displayed, so display it explicitly\n", "    display(statistics.body_dict)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## B2-3. 
Get the baseline and reports from the S3 bucket" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "# Get the baseline statistics and constraints\n", "baseline_statistics = model_monitor.Statistics.from_s3_uri(f's3://{bucket}/{baseline_prefix}/statistics.json').body_dict\n", "baseline_constraints = model_monitor.Constraints.from_s3_uri(f's3://{bucket}/{baseline_prefix}/constraints.json').body_dict\n", "\n", "# Get the report statistics, constraints, and constraint violations\n", "statistics = model_monitor.Statistics.from_s3_uri(f's3://{bucket}/{specific_report_prefix}/statistics.json').body_dict\n", "constraints = model_monitor.Constraints.from_s3_uri(f's3://{bucket}/{specific_report_prefix}/constraints.json').body_dict\n", "constraint_violations = model_monitor.ConstraintViolations.from_s3_uri(f's3://{bucket}/{specific_report_prefix}/constraint_violations.json').body_dict" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "# Show the baseline statistics\n", "baseline_statistics" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "# Show the report statistics\n", "statistics" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "# Show the report constraints\n", "constraints" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "# Show the report constraint violations\n", "constraint_violations" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## B2-4. 
Visualize a single report" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Visualize constraint violations" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "mu.show_violation_df(\n", "    baseline_statistics=baseline_statistics,\n", "    latest_statistics=statistics,\n", "    violations=constraint_violations['violations'],\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Visualize distributions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Compare the baseline and the report" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "| feature | num_present | num_missing | mean | sum | std_dev | min | max |\n",
"|---|---|---|---|---|---|---|---|\n",
"| extra_mean_16slot | 192 | 0 | 0.956503 | 183.648663 | 0.316831 | 0.404255 | 1.873249 |\n",
"| fare_amount_mean_100slot | 192 | 0 | 14.976413 | 2875.471358 | 3.722576 | 10.900000 | 35.083333 |\n",
"| fare_amount_mean_104slot | 192 | 0 | 14.983530 | 2876.837749 | 3.725141 | 10.900000 | 35.083333 |\n",
"| fare_amount_mean_16slot | 192 | 0 | 14.443428 | 2773.138159 | 2.507316 | 11.177522 | 26.997273 |\n",
"| history_100slots | 192 | 0 | 200.354167 | 38468.000000 | 111.425609 | 6.000000 | 405.000000 |\n",
"| history_1338slots | 192 | 0 | 162.083333 | 31120.000000 | 99.804775 | 6.000000 | 437.000000 |\n",
"| history_1346slots | 192 | 0 | 168.510417 | 32354.000000 | 96.432402 | 6.000000 | 437.000000 |\n",
"| history_144slots | 192 | 0 | 200.614583 | 38518.000000 | 110.895256 | 6.000000 | 378.000000 |\n",
"| history_156slots | 192 | 0 | 201.479167 | 38684.000000 | 109.814235 | 6.000000 | 378.000000 |\n",
"| history_164slots | 192 | 0 | 203.760417 | 39122.000000 | 106.980016 | 6.000000 | 378.000000 |\n",
"| history_192slots | 192 | 0 | 200.755208 | 38545.000000 | 108.413644 | 10.000000 | 378.000000 |\n",
"| history_2008slots | 192 | 0 | 175.572917 | 33710.000000 | 106.679841 | 8.000000 | 373.000000 |\n",
"| history_2012slots | 192 | 0 | 178.119792 | 34199.000000 | 104.678980 | 8.000000 | 373.000000 |\n",
"| history_236slots | 192 | 0 | 191.473958 | 36763.000000 | 102.839019 | 10.000000 | 362.000000 |\n",
"| history_268slots | 192 | 0 | 184.979167 | 35516.000000 | 107.271655 | 10.000000 | 362.000000 |\n",
"| history_32slots | 192 | 0 | 200.526042 | 38501.000000 | 111.368941 | 11.000000 | 405.000000 |\n",
"| history_668slots | 192 | 0 | 181.057292 | 34763.000000 | 110.352758 | 5.000000 | 411.000000 |\n",
"| tolls_amount_mean_100slot | 192 | 0 | 0.463363 | 88.965641 | 0.375753 | 0.000000 | 2.848000 |\n",
"| tolls_amount_mean_12slot | 192 | 0 | 0.471510 | 90.529998 | 0.326537 | 0.000000 | 1.943939 |\n",
"| trip_distance_mean_192slot | 192 | 0 | 32.719694 | 6282.181235 | 254.713765 | 2.406549 | 2899.881852 |\n", "