{ "cells": [ { "cell_type": "markdown", "metadata": { "Collapsed": "false" }, "source": [ "# Using AutoGluon-Tabular with Amazon SageMaker\n", "\n", "[AutoGluon](https://github.com/awslabs/autogluon) automates the building of high-accuracy machine learning models. With just a few lines of code, you can train and deploy accurate deep learning models on tabular, image, and text data.\n", "\n", "This notebook shows how to use AutoGluon-Tabular on Amazon SageMaker with custom containers." ] }, { "cell_type": "markdown", "metadata": { "Collapsed": "false" }, "source": [ "## Setup\n", "\n", "This notebook uses the `conda_mxnet_p36` kernel." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "Collapsed": "false" }, "outputs": [], "source": [ "# Make sure docker compose is set up properly for local mode\n", "!./setup.sh" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "Collapsed": "false" }, "outputs": [], "source": [ "import os\n", "import boto3\n", "import sagemaker\n", "from time import sleep\n", "from collections import Counter\n", "import numpy as np\n", "import pandas as pd\n", "from sagemaker import get_execution_role, local, Model, utils, s3\n", "from sagemaker.estimator import Estimator\n", "from sagemaker.predictor import Predictor\n", "from sagemaker.serializers import CSVSerializer\n", "from sagemaker.deserializers import StringDeserializer\n", "from sklearn.metrics import accuracy_score, classification_report\n", "from IPython.display import display, HTML\n", "from IPython.core.interactiveshell import InteractiveShell\n", "\n", "# Print settings\n", "InteractiveShell.ast_node_interactivity = \"all\"\n", "pd.set_option('display.max_columns', 500)\n", "pd.set_option('display.max_rows', 10)\n", "\n", "# Account/s3 setup\n", "session = sagemaker.Session()\n", "local_session = local.LocalSession()\n", "bucket = session.default_bucket()\n", "prefix = 'sagemaker/autogluon-tabular'\n", "region = session.boto_region_name\n", "role = get_execution_role()\n", "client = session.boto_session.client(\n", "    \"sts\", region_name=region, 
endpoint_url=utils.sts_regional_endpoint(region)\n", "    )\n", "account = client.get_caller_identity()['Account']\n", "\n", "registry_uri_training = sagemaker.image_uris.retrieve('mxnet', region, version='1.6.0', py_version='py3', instance_type='ml.m5.2xlarge', image_scope='training')\n", "registry_uri_inference = sagemaker.image_uris.retrieve('mxnet', region, version='1.6.0', py_version='py3', instance_type='ml.m5.2xlarge', image_scope='inference')\n", "ecr_uri_prefix = account + '.' + '.'.join(registry_uri_training.split('/')[0].split('.')[1:])" ] }, { "cell_type": "markdown", "metadata": { "Collapsed": "false" }, "source": [ "### Building the Docker images" ] }, { "cell_type": "markdown", "metadata": { "Collapsed": "false" }, "source": [ "Build separate container images for training and inference, and push them to Amazon Elastic Container Registry (ECR)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "Collapsed": "false" }, "outputs": [], "source": [ "training_algorithm_name = 'autogluon-sagemaker-training'\n", "inference_algorithm_name = 'autogluon-sagemaker-inference'" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "Collapsed": "false" }, "outputs": [], "source": [ "!/bin/bash ./container-training/build_push_training.sh {account} {region} {training_algorithm_name} {ecr_uri_prefix} {registry_uri_training.split('/')[0].split('.')[0]} {registry_uri_training}\n", "!/bin/bash ./container-inference/build_push_inference.sh {account} {region} {inference_algorithm_name} {ecr_uri_prefix} {registry_uri_training.split('/')[0].split('.')[0]} {registry_uri_inference}" ] }, { "cell_type": "markdown", "metadata": { "Collapsed": "false" }, "source": [ "### Getting the data" ] }, { "cell_type": "markdown", "metadata": { "Collapsed": "false" }, "source": [ "In this example, we build a binary classification model that predicts whether a customer will accept a direct marketing offer. We download the data and split it into training and test sets. Because AutoGluon performs k-fold cross-validation automatically, there is no need to set aside validation data in advance." ] }, { "cell_type": "code", "execution_count": null, "metadata": { 
"Collapsed": "false" }, "outputs": [], "source": [ "# Download and unzip the data\n", "!aws s3 cp --region {region} s3://sagemaker-sample-data-{region}/autopilot/direct_marketing/bank-additional.zip .\n", "!unzip -qq -o bank-additional.zip\n", "!rm bank-additional.zip\n", "\n", "local_data_path = './bank-additional/bank-additional-full.csv'\n", "data = pd.read_csv(local_data_path)\n", "\n", "# Split train/test data\n", "train = data.sample(frac=0.7, random_state=42)\n", "test = data.drop(train.index)\n", "\n", "# Split test X/y\n", "label = 'y'\n", "y_test = test[label]\n", "X_test = test.drop(columns=[label])" ] }, { "cell_type": "markdown", "metadata": { "Collapsed": "false" }, "source": [ "##### Inspecting the data" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "Collapsed": "false" }, "outputs": [], "source": [ "train.head(3)\n", "train.shape\n", "\n", "test.head(3)\n", "test.shape\n", "\n", "X_test.head(3)\n", "X_test.shape" ] }, { "cell_type": "markdown", "metadata": { "Collapsed": "false" }, "source": [ "\n", "Upload the data to Amazon S3." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "Collapsed": "false" }, "outputs": [], "source": [ "train_file = 'train.csv'\n", "train.to_csv(train_file, index=False)\n", "train_s3_path = session.upload_data(train_file, key_prefix='{}/data'.format(prefix))\n", "\n", "test_file = 'test.csv'\n", "test.to_csv(test_file, index=False)\n", "test_s3_path = session.upload_data(test_file, key_prefix='{}/data'.format(prefix))\n", "\n", "X_test_file = 'X_test.csv'\n", "X_test.to_csv(X_test_file, index=False)\n", "X_test_s3_path = session.upload_data(X_test_file, key_prefix='{}/data'.format(prefix))" ] }, { "cell_type": "markdown", "metadata": { "Collapsed": "false" }, "source": [ "## Hyperparameter settings\n", "\n", "The minimal configuration is to specify `fit_args['label']`.\n", "\n", "You can pass additional settings to `autogluon.task.TabularPrediction.fit` through `fit_args`.\n", "\n", "The following example configures AutoGluon-Tabular hyperparameters in more detail, as described in [Predicting Columns in a Table - In 
Depth](https://autogluon.mxnet.io/tutorials/tabular_prediction/tabular-indepth.html#model-ensembling-with-stacking-bagging). For details, see [fit parameters](https://autogluon.mxnet.io/api/autogluon.task.html?highlight=eval_metric#autogluon.task.TabularPrediction.fit).\n", "\n", "When configuring this through SageMaker, each value in `fit_args['hyperparameters']` must be passed as a string.\n", "\n", "\n", "```python\n", "nn_options = {\n", "    'num_epochs': \"10\",\n", "    'learning_rate': \"ag.space.Real(1e-4, 1e-2, default=5e-4, log=True)\",\n", "    'activation': \"ag.space.Categorical('relu', 'softrelu', 'tanh')\",\n", "    'layers': \"ag.space.Categorical([100],[1000],[200,100],[300,200,100])\",\n", "    'dropout_prob': \"ag.space.Real(0.0, 0.5, default=0.1)\"\n", "}\n", "\n", "gbm_options = {\n", "    'num_boost_round': \"100\",\n", "    'num_leaves': \"ag.space.Int(lower=26, upper=66, default=36)\"\n", "}\n", "\n", "model_hps = {'NN': nn_options, 'GBM': gbm_options}\n", "\n", "fit_args = {\n", "    'label': 'y',\n", "    'presets': ['best_quality', 'optimize_for_deployment'],\n", "    'time_limits': 60*10,\n", "    'hyperparameters': model_hps,\n", "    'hyperparameter_tune': True,\n", "    'search_strategy': 'skopt'\n", "}\n", "\n", "hyperparameters = {\n", "    'fit_args': fit_args,\n", "    'feature_importance': True\n", "}\n", "```\n", "**Note:** The choice of hyperparameters may affect the size of the model package, which in turn can increase model upload time and training time. Including `'optimize_for_deployment'` in `fit_args['presets']` reduces the upload time." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "Collapsed": "false" }, "outputs": [], "source": [ "# Define required label and optional additional parameters\n", "fit_args = {\n", "    'label': 'y',\n", "    # Adding 'best_quality' to presets list will result in better performance (but longer runtime)\n", "    'presets': ['optimize_for_deployment'],\n", "}\n", "\n", "# Pass fit_args to SageMaker estimator hyperparameters\n", "hyperparameters = {\n", " 
'fit_args': fit_args,\n", "    'feature_importance': True\n", "}\n", "\n", "tags = [{\n", "    'Key': 'AlgorithmName',\n", "    'Value': 'AutoGluon-Tabular'\n", "}]" ] }, { "cell_type": "markdown", "metadata": { "Collapsed": "false" }, "source": [ "## Training\n", "\n", "To train on the notebook instance, set `instance_type` to `local`. To train on a dedicated training instance, `ml.m5.2xlarge` is recommended.\n", "\n", "**Note:** Depending on how many model types are trained, you may need to increase `volume_size`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "Collapsed": "false" }, "outputs": [], "source": [ "%%time\n", "\n", "instance_type = 'ml.m5.2xlarge'\n", "#instance_type = 'local'\n", "\n", "ecr_image = f'{ecr_uri_prefix}/{training_algorithm_name}:latest'\n", "\n", "estimator = Estimator(image_uri=ecr_image,\n", "                      role=role,\n", "                      instance_count=1,\n", "                      instance_type=instance_type,\n", "                      hyperparameters=hyperparameters,\n", "                      volume_size=100,\n", "                      tags=tags)\n", "\n", "# Set inputs. Test data is optional, but requires a label column.\n", "inputs = {'training': train_s3_path, 'testing': test_s3_path}\n", "\n", "estimator.fit(inputs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Reviewing the performance of the trained model" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from utils.ag_utils import launch_viewer\n", "\n", "launch_viewer(is_debug=False)" ] }, { "cell_type": "markdown", "metadata": { "Collapsed": "false" }, "source": [ "### Creating the model" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "Collapsed": "false" }, "outputs": [], "source": [ "# Create predictor object that sends CSV and receives plain-text predictions\n", "class AutoGluonTabularPredictor(Predictor):\n", "    def __init__(self, *args, **kwargs):\n", "        super().__init__(*args,\n", "                         serializer=CSVSerializer(),\n", "                         deserializer=StringDeserializer(), **kwargs)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "Collapsed": "false" }, "outputs": [], "source": [ "ecr_image = f'{ecr_uri_prefix}/{inference_algorithm_name}:latest'\n", 
"\n", "if instance_type == 'local':\n", "    model = estimator.create_model(image_uri=ecr_image, role=role)\n", "else:\n", "    model_uri = os.path.join(estimator.output_path, estimator._current_job_name, \"output\", \"model.tar.gz\")\n", "    model = Model(ecr_image, model_data=model_uri, role=role, sagemaker_session=session, predictor_cls=AutoGluonTabularPredictor)" ] }, { "cell_type": "markdown", "metadata": { "Collapsed": "false" }, "source": [ "### Batch transform" ] }, { "cell_type": "markdown", "metadata": { "Collapsed": "false" }, "source": [ "\n", "In local mode, you can use either `s3:////output/` or `file:///` as the output location.\n", "\n", "You can also evaluate prediction accuracy by including the ground-truth labels in the test data (in this example, we pass `test_s3_path` instead of `X_test_s3_path`)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "Collapsed": "false" }, "outputs": [], "source": [ "output_path = f's3://{bucket}/{prefix}/output/'\n", "# output_path = f'file://{os.getcwd()}'\n", "\n", "transformer = model.transformer(instance_count=1,\n", "                                instance_type=instance_type,\n", "                                strategy='MultiRecord',\n", "                                max_payload=6,\n", "                                max_concurrent_transforms=1,\n", "                                output_path=output_path)\n", "\n", "transformer.transform(test_s3_path, content_type='text/csv', split_type='Line')\n", "transformer.wait()" ] }, { "cell_type": "markdown", "metadata": { "Collapsed": "false" }, "source": [ "### Inference endpoint" ] }, { "cell_type": "markdown", "metadata": { "Collapsed": "false" }, "source": [ "##### Deploying the endpoint (remote or local mode)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "Collapsed": "false" }, "outputs": [], "source": [ "instance_type = 'ml.m5.2xlarge'\n", "#instance_type = 'local'\n", "\n", "predictor = model.deploy(initial_instance_count=1,\n", "                         instance_type=instance_type)" ] }, { "cell_type": "markdown", "metadata": { "Collapsed": "false" }, "source": [ "##### Attaching to the endpoint (reattach if the kernel has restarted)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "Collapsed": "false" }, "outputs": [], "source": [ "# Select standard 
or local session based on instance_type\n", "if instance_type == 'local':\n", "    sess = local_session\n", "else:\n", "    sess = session\n", "\n", "# Attach to endpoint\n", "predictor = AutoGluonTabularPredictor(predictor.endpoint_name, sagemaker_session=sess)" ] }, { "cell_type": "markdown", "metadata": { "Collapsed": "false" }, "source": [ "##### Predicting on unlabeled data" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "Collapsed": "false" }, "outputs": [], "source": [ "results = predictor.predict(X_test.to_csv(index=False)).splitlines()\n", "\n", "# Check output\n", "print(Counter(results))" ] }, { "cell_type": "markdown", "metadata": { "Collapsed": "false" }, "source": [ "##### Predicting on labeled data\n", "Prediction performance metrics are written to the endpoint logs." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "Collapsed": "false" }, "outputs": [], "source": [ "results = predictor.predict(test.to_csv(index=False)).splitlines()\n", "\n", "# Check output\n", "print(Counter(results))" ] }, { "cell_type": "markdown", "metadata": { "Collapsed": "false" }, "source": [ "##### Checking classification metrics" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "Collapsed": "false" }, "outputs": [], "source": [ "y_results = np.array(results)\n", "\n", "print(\"accuracy: {}\".format(accuracy_score(y_true=y_test, y_pred=y_results)))\n", "print(classification_report(y_true=y_test, y_pred=y_results, digits=6))" ] }, { "cell_type": "markdown", "metadata": { "Collapsed": "false" }, "source": [ "##### Deleting the inference endpoint" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "Collapsed": "false" }, "outputs": [], "source": [ "predictor.delete_endpoint()" ] } ], "metadata": { "kernelspec": { "display_name": "conda_mxnet_p36", "language": "python", "name": "conda_mxnet_p36" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": 
"3.6.10" } }, "nbformat": 4, "nbformat_minor": 4 }