{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# TensorFlow による学習とモデルホスティング\n", "\n", "TensorFlow を SageMaker 上で学習し、ホスティングするノートブックです。既存のコードに少しの変更を加えるだけで、SageMaker 上で TensorFlow のモデルを学習し、ホスティングすることが可能です。\n", "\n", "[SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk) は、SageMakerトレーニングインスタンスへのスクリプトの転送を処理します。トレーニングインスタンスでは、SageMakerのネイティブTensorFlowサポートがトレーニング関連の環境変数を設定し、トレーニングスクリプトを実行します。このチュートリアルでは、SageMaker Python SDKを使用してトレーニングジョブを起動し、トレーニングされたモデルを展開します。\n", "TensorFlow Training についての詳細についてはこちらの[ドキュメント](https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/using_tf.html#train-a-model-with-tensorflow)にアクセスしてください\n", "\n", "この例では、Pythonスクリプトを使用して、[MNISTデータセット](http://yann.lecun.com/exdb/mnist/)の分類モデルをトレーニングします。さらに、このノートブックは、[SageMaker TensorFlow Serving container](https://github.com/aws/sagemaker-tensorflow-serving-container)でリアルタイム推論を実行する方法を示します。 TensorFlow hosting の詳細なドキュメントについては、[こちら](https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/deploying_tensorflow_serving.html) にアクセスしてください。\n", "\n", "※本ハンズオンでは TensorFlow バージョン 2 以上で動作します。\n", "\n", "---\n", "\n", "## コンテンツ\n", "\n", "1. [環境のセットアップ](#1.環境のセットアップ)\n", "1. [学習データの準備](#2.学習データの準備)\n", "1. [分散学習用のスクリプトを作成する](#3.分散学習用のスクリプトを作成する)\n", "1. [TensorFlow Estimator を利用して学習ジョブを作成する](#4.TensorFlowEstimatorを利用して学習ジョブを作成する)\n", "1. [学習したモデルをエンドポイントにデプロイする](#5.学習したモデルをエンドポイントにデプロイする)\n", "1. [エンドポイントを呼び出し推論を実行する](#6.エンドポイントを呼び出し推論を実行する)\n", "1. [エンドポイントを削除する](#7.エンドポイントを削除する)\n", "---\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 1.環境のセットアップ\n", "\n", "まずは環境のセットアップを行いましょう。" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os, sagemaker, urllib\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "\n", "from sagemaker import get_execution_role\n", "\n", "sagemaker_session = sagemaker.Session()\n", "\n", "role = get_execution_role()\n", "region = sagemaker_session.boto_session.region_name\n", "\n", "print(f'Current SageMaker Python SDK Version = {sagemaker.__version__}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "注）このノートブックでは SageMaker SDK が 2.19.0 以上で動作します。上記の出力結果がそれ以前のバージョンになった際は、下記のセルの#を削除（コメントアウトを解除）して実行、Jupyterカーネルを再起動し、再度上記のセルを実行し、バージョンがアップデートされたことを確認してください。カーネルが再起動されない場合は、SageMaker SDK バージョン更新が反映されません。" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# !pip install -U --quiet \"sagemaker>=2.19.0\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 2.学習データの準備\n", "\n", "MNISTデータセットは、パブリックS3バケット ``sagemaker-sample-data-`` の下のプレフィックス ``tensorflow/mnist`` の下にロードされています。このプレフィックスの下には4つの ``.npy`` ファイルがあります：\n", "* ``train_data.npy``\n", "* ``eval_data.npy``\n", "* ``train_labels.npy``\n", "* ``eval_labels.npy``\n", "\n", "学習データが保存されている s3 の URI を変数に格納しておきます。" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "training_data_uri = f's3://sagemaker-sample-data-{region}/tensorflow/mnist/'\n", "print(training_data_uri)\n", "!aws s3 ls {training_data_uri}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 3.分散学習用のスクリプトを作成する\n", "\n", "このチュートリアルのトレーニングスクリプトは、TensorFlowの公式の[CNN MNISTの例](https://www.tensorflow.org/tutorials/images/cnn?hl=ja) をベースに作成されました。 SageMaker から渡された `` model_dir`` パラメーターを処理するように変更しています。これは、分散学習時のデータ共有、チェックポイント、モデルの永続保存などに使用できるS3パスです。また、トレーニング関連の変数を扱うために、引数をパースする関数も追加しました。\n", "\n", "トレーニングジョブの最後に、トレーニング済みモデルを環境変数 ``SM_MODEL_DIR`` に保存されているパスにエクスポートするステップを追加しました。このパスは常に ``/opt/ml/model`` をポイントします。 SageMaker は、トレーニングの終了時にこのフォルダー内のすべてのモデル成果物をS3にアップロードするため、これは重要です。\n", "\n", "スクリプト全体は次のとおりです。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "!pygmentize 'mnist.py'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\\# 4.TensorFlowEstimatorを利用して学習ジョブを作成する\n", "\n", "`sagemaker.tensorflow.TensorFlow`　estimator は、スクリプトモード対応の TensorFlow コンテナの指定、学習・推論スクリプトの S3 へのアップロード、および SageMaker トレーニングジョブの作成を行います。ここでいくつかの重要なパラメーターを呼び出しましょう。\n", "\n", "* `py_version`は` 'py3'`に設定されています。レガシーモードは Python 2 のみをサポートしているため、この学習スクリプトはスクリプトモードを使用していることを示しています。Python2は間もなく廃止されますが、 `py_version` を設定することでPython 2でスクリプトモードを使用できます。`'py2'`と` script_mode`を `True`にします。\n", "\n", "* `distributions` は、分散トレーニング設定を構成するために使用されます。インスタンスのクラスターまたは複数の GPU をまたいで分散学習を行う場合にのみ必要です。ここでは、分散トレーニングスキーマとしてパラメーターサーバーを使用しています。 SageMaker トレーニングジョブは同種のクラスターで実行されます。 SageMaker セットアップでパラメーターサーバーのパフォーマンスを向上させるために、クラスター内のすべてのインスタンスでパラメーターサーバーを実行するため、起動するパラメーターサーバーの数を指定する必要はありません。スクリプトモードは、[Horovod](https://github.com/horovod/horovod) による分散トレーニングもサポートしています。 `distributions` の設定方法に関する詳細なドキュメントは[こちら](https://github.com/aws/sagemaker-python-sdk/tree/master/src/sagemaker/tensorflow#distributed-training) をご参照ください。\n", "\n", "* 実際にモデル開発をする際はコード(ここでは `mnist.py` )にバグが混入していないか確認しながら実行することになりますが、トレーニングインスタンスを利用すると、インスタンスの起動に時間がかかるため、学習開始コマンドを打ち込んでから 10 分後に気づいてやり直し、となってしまうことがあります。そのオーバヘッドを防止するために、ローカルモードでの学習が Sagemaker ではサポートされています。``instance_type=local``を指定するだけで、ノートブックインスタンスで学習（＝インスタンスの立ち上げ時間なしで）を試すことができます。よくやるやり方としてはコードの確認用途のため、 epoch の数やデータを減らして動くかどうかの確認を行うことが多いです。\n", "\n", "また、Spot Instanceを用いて実行する場合は、下記のコードを `Estimator` の `train_instance_type` の次の行に追加しましょう。\n", "\n", "```python\n", " max_run = 5000, # 学習は最大で5000秒までにする設定\n", " use_spot_instances = 'True',\n", " max_wait = 7200 # 学習完了を待つ最大時間\n", "```\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.tensorflow import TensorFlow\n", "\n", "\n", "mnist_estimator = TensorFlow(entry_point='mnist.py',\n", " role=role,\n", " instance_count=2,\n", " # instance_type='local',\n", " instance_type='ml.p3.2xlarge',\n", " framework_version='2.1.0',\n", " py_version='py3',\n", " distribution={'parameter_server': {'enabled': True}},\n", " hyperparameters={\n", " \"epochs\": 4,\n", " 'batch-size':16\n", " }\n", "# max_run = 5000, # 学習は最大で5000秒までにする設定\n", "# use_spot_instances = 'True',\n", "# max_wait = 7200 # 学習完了を待つ最大時間\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## ``fit`` による学習ジョブの実行\n", "\n", "学習ジョブを開始するには、`estimator.fit（training_data_uri）` を呼び出します。\n", "\n", "ここでは、S3 ロケーションが入力として使用されます。 `fit` は、`training` という名前のデフォルトチャネルを作成します。これは、このS3ロケーションを指します。トレーニングスクリプトでは、 `SM_CHANNEL_TRAINING` に保存されている場所からトレーニングデータにアクセスできます。 `fit`は、他のいくつかのタイプの入力も受け入れます。詳細については、APIドキュメント[こちら](https://sagemaker.readthedocs.io/en/stable/estimators.html#sagemaker.estimator.EstimatorBase.fit) を参照してください。\n", "\n", "トレーニングが開始されると、TensorFlow コンテナは mnist.py を実行し、スクリプトの引数として　estimator から`hyperparameters` と `model_dir` を渡します。この例では、estimator 内で定義していないハイパーパラメーターは渡されず、 `model_dir` のデフォルトは `s3:///` であるため、スクリプトの実行は次のようになります。\n", "```bash\n", "python mnist.py --model_dir s3:///\n", "```\n", "トレーニングが完了すると、トレーニングジョブは保存されたモデルを TensorFlow serving にアップロードします。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "mnist_estimator.fit(training_data_uri)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 5.学習したモデルをエンドポイントにデプロイする\n", "\n", "`deploy（）`メソッドは SageMaker モデルを作成します。このモデルはエンドポイントにデプロイされ、リアルタイムで予測リクエストを処理します。スクリプトモードでトレーニングしたため、エンドポイントには TensorFlow Serving コンテナを使用します。このサービングコンテナは、SageMaker ホスティングプロトコルと互換性のあるWebサーバーの実装を実行します。 [独自の推論コードの使用](https://docs.aws.amazon.com/ja_jp/sagemaker/latest/dg/your-algorithms-inference-main.html) ドキュメントでは、SageMaker が推論コンテナを実行する方法について説明しています。\n", "\n", "traing と同様、 `instance_type` を `local` に指定するとノートブックインスタンスにデプロイすることが可能です。" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# predictor = mnist_estimator.deploy(initial_instance_count=1, instance_type='local')\n", "predictor = mnist_estimator.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 6.エンドポイントを呼び出し推論を実行する\n", "\n", "トレーニングデータをダウンロードして、推論の入力として使用してみましょう。\n", "\n", "入力データと出力データの形式は、[TensorFlow Serving REST API](https://www.tensorflow.org/serving/api_rest) の `Predict`メソッドのリクエストとレスポンスの形式に直接対応しています。 SageMaker の TensforFlow Serving エンドポイントは、単純化されたJSON形式、行区切りのJSONオブジェクト (\"jsons\" または \"jsonlines\")、CSV データなど、TensorFlow REST API の一部ではない追加の入力形式も受け入れることができます。\n", "\n", "この例では、入力として `numpy` 配列を使用しています。これは、簡略化されたJSON形式にシリアル化されます。さらに、TensorFlow serving は、次のコードに示すように、複数のアイテムを一度に処理することもできます。 TensorFlow serving を用いた SageMaker エンドポイントに対して予測を行う方法に関する詳細なドキュメントは[こちら](https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/tensorflow/deploying_tensorflow_serving.rst#making-predictions-against-a-sagemaker-endpoint)をご参照ください。\n", "\n", "まず、評価用のデータセットを取得します。" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!aws --region {region} s3 cp s3://sagemaker-sample-data-{region}/tensorflow/mnist/eval_data.npy eval_data.npy\n", "!aws --region {region} s3 cp s3://sagemaker-sample-data-{region}/tensorflow/mnist/eval_labels.npy eval_labels.npy\n", "\n", "eval_data = np.load('eval_data.npy').reshape(-1,28,28,1)\n", "eval_labels = np.load('eval_labels.npy')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "下記の ``k　= `` に、9950 までの好きな数字を入れて、評価する手書き文字のセットを選択しましょう。\n", "`predictor.predict(test_data)` で推論を行い、選んだ手書き文字認識に対する `prediction is`: 推論結果と、`label is`: 実際のラベルの値が出力され、その二つが一致していれば最後に `matched: True` と表示されます。" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "k = 1000 # choose your favorite number from 0 to 9950\n", "test_data = eval_data[k:k+50]\n", "test_data\n", "\n", "for i in range(5):\n", " for j in range(10):\n", " plt.subplot(5, 10, 10* i + j+1)\n", " plt.imshow(test_data[10 * i + j, :].reshape(28, 28), cmap='gray')\n", " plt.title(10* i + j+1)\n", " plt.tick_params(labelbottom=False, labelleft = False)\n", " plt.subplots_adjust(wspace=0.2, hspace=1)\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "predictions = predictor.predict(test_data.reshape(-1,28,28,1))\n", "for i in range(0, 50):\n", " prediction = np.argmax(predictions['predictions'][i])\n", " label = eval_labels[i+k]\n", " print(' [{}]: prediction is {}, label is {}, matched: {}'.format(i+1, prediction, label, prediction == label))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 7.エンドポイントを削除する\n", "\n", "余分なコストが発生しないように、検証が終わったら上記で作成したエンドポイントを削除しましょう。" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "predictor.delete_endpoint()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "conda_python3", "language": "python", "name": "conda_python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.10" }, "notice": "Copyright 2017 Amazon.com, Inc. or its affiliates. All Rights Reserved. Licensed under the Apache License, Version 2.0 (the \"License\"). You may not use this file except in compliance with the License. A copy of the License is located at http://aws.amazon.com/apache2.0/ or in the \"license\" file accompanying this file. This file is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License." }, "nbformat": 4, "nbformat_minor": 4 }