{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# SageMaker Tensorflow를 이용한 MNIST 학습\n",
"\n",
"MNIST는 필기 숫자 분류하는 문제로 이미지 처리의 테스트용으로 널리 사용되는 데이터 세트입니다. 28x28 픽셀 그레이스케일로 70,000개의 손으로 쓴 숫자 이미지가 레이블과 함께 구성됩니다. 데이터 세트는 60,000개의 훈련 이미지와 10,000개의 테스트 이미지로 분할됩니다. 0~9까지 10개의 클래스가 있습니다. 이 튜토리얼은 SageMaker에서 Tensorflow V2를 이용하여 MNIST 분류 모델을 훈련하는 방법을 보여줍니다.\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'2.45.0'"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import sagemaker \n",
"sagemaker.__version__"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import json\n",
"\n",
"import sagemaker\n",
"from sagemaker.tensorflow import TensorFlow\n",
"from sagemaker import get_execution_role\n",
"\n",
"sess = sagemaker.Session()\n",
"\n",
"role = get_execution_role()\n",
"\n",
"output_path='s3://' + sess.default_bucket() + '/tensorflow/mnist'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## TensorFlow Estimator\n",
"\n",
"Tensorflow 클래스를 사용하면 SageMaker의 컨테이너 환경에서 학습 스크립트를 실행할 수 있습니다. \n",
"다음 파라미터 설정을 통해 환경을 셋업합니다.\n",
"\n",
"\n",
"- entry_point: 트레이닝 컨테이너에서 신경망 학습을 위해 사용하는 사용자 정의 파이썬 파일. 다음 섹션에서 다시 논의됩니다.\n",
"- role: AWS 자원에 접근하기 위한 IAM 역할(role) \n",
"- instance_type: 스크립트를 실행하는 SAGEMAKER 인스턴스 유형. 본 노트북을 실행하기 위해 사용중인 SageMaker 인스턴스에서 훈련 작업을 실행하려면`local`로 설정하십시오.\n",
"- model_dir: 학습중에 체크 포인트 데이터와 모델을 내보내는 S3 Bucket URI. (default : None). 이 매개변수가 스크립트에 전달되는 것을 막으려면 `model_dir`=False 로 설정하 수 있습니다.\n",
"- instance count: 학습작업이 실행될 인스턴스의 갯수. 분산 학습을 위해서는 1 이상의 값이 필요합니다. \n",
"- output_path: 학습의 결과물 (모델 아티팩트와 out 파일)을 내보내는 S3 Bucket URI. \n",
"- framework_version: 사용하는 프레임워크의 버전\n",
"- py_version: 파이썬 버전\n",
"\n",
"보다 자세한 내용은 [the API reference](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html#sagemaker.estimator.EstimatorBase)를 참조합니다.\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 학습을 위한 entry point 스트립트 작성\n",
"\n",
"`entrypoint`를 통해 Tensorflow 모델을 학습하기 위한 Python 코드를 Enstimator (Tensroflow 클래스)에 제공합니다. \n",
"\n",
"SageMaker Tensorflow Estimator는 AWS의 관리환경으로 Tensorflow 실행환경이 저장된 도커 이미지를 가져올 것입니다. Estimator 클래스를 초기화할 때 사용한 파라미터 설정에 따라 스크립트를 실행합니다. \n",
"\n",
"실행되는 훈련 스크립트는 Amazon SageMaker 외부에서 실행될 수있는 훈련 스크립트와 매우 유사하지만 교육 이미지에서 제공하는 환경 변수에 액세스 하는 설정 등이 추가될 수 있습니다. 사용가능한 환경변수의 리스트를 확인하려면 다음 리소스 [the short list of environment variables provided by the SageMaker service](https://sagemaker.readthedocs.io/en/stable/frameworks/mxnet/using_mxnet.html?highlight=entry%20point)를 참고하십시오. 환경변수의 풀셋은 다음 링크 [the complete list of environment variables](https://github.com/aws/sagemaker-training-toolkit/blob/master/ENVIRONMENT_VARIABLES.md)에서 확인할 수 있습니다.\n",
"\n",
"본 예제에서는 `code/train.py` 스크립트를 사용합니다.\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[34mfrom\u001b[39;49;00m \u001b[04m\u001b[36m__future__\u001b[39;49;00m \u001b[34mimport\u001b[39;49;00m print_function\n",
"\n",
"\u001b[34mimport\u001b[39;49;00m \u001b[04m\u001b[36margparse\u001b[39;49;00m\n",
"\u001b[34mimport\u001b[39;49;00m \u001b[04m\u001b[36mlogging\u001b[39;49;00m\n",
"\u001b[34mimport\u001b[39;49;00m \u001b[04m\u001b[36mos\u001b[39;49;00m\n",
"\u001b[34mimport\u001b[39;49;00m \u001b[04m\u001b[36mjson\u001b[39;49;00m\n",
"\u001b[34mimport\u001b[39;49;00m \u001b[04m\u001b[36mgzip\u001b[39;49;00m\n",
"\u001b[34mimport\u001b[39;49;00m \u001b[04m\u001b[36mnumpy\u001b[39;49;00m \u001b[34mas\u001b[39;49;00m \u001b[04m\u001b[36mnp\u001b[39;49;00m\n",
"\u001b[34mimport\u001b[39;49;00m \u001b[04m\u001b[36mtraceback\u001b[39;49;00m\n",
"\n",
"\u001b[34mimport\u001b[39;49;00m \u001b[04m\u001b[36mtensorflow\u001b[39;49;00m \u001b[34mas\u001b[39;49;00m \u001b[04m\u001b[36mtf\u001b[39;49;00m\n",
"\u001b[34mfrom\u001b[39;49;00m \u001b[04m\u001b[36mtensorflow\u001b[39;49;00m\u001b[04m\u001b[36m.\u001b[39;49;00m\u001b[04m\u001b[36mkeras\u001b[39;49;00m\u001b[04m\u001b[36m.\u001b[39;49;00m\u001b[04m\u001b[36mlayers\u001b[39;49;00m \u001b[34mimport\u001b[39;49;00m Dense, Flatten, Conv2D\n",
"\u001b[34mfrom\u001b[39;49;00m \u001b[04m\u001b[36mtensorflow\u001b[39;49;00m\u001b[04m\u001b[36m.\u001b[39;49;00m\u001b[04m\u001b[36mkeras\u001b[39;49;00m \u001b[34mimport\u001b[39;49;00m Model\n",
"\n",
"\n",
"logging.basicConfig(level=logging.DEBUG)\n",
"\n",
"\u001b[37m# Define the model object\u001b[39;49;00m\n",
"\n",
"\u001b[34mclass\u001b[39;49;00m \u001b[04m\u001b[32mSmallConv\u001b[39;49;00m(Model):\n",
" \u001b[34mdef\u001b[39;49;00m \u001b[32m__init__\u001b[39;49;00m(\u001b[36mself\u001b[39;49;00m):\n",
" \u001b[36msuper\u001b[39;49;00m(SmallConv, \u001b[36mself\u001b[39;49;00m).\u001b[32m__init__\u001b[39;49;00m()\n",
" \u001b[36mself\u001b[39;49;00m.conv1 = Conv2D(\u001b[34m32\u001b[39;49;00m, \u001b[34m3\u001b[39;49;00m, activation=\u001b[33m'\u001b[39;49;00m\u001b[33mrelu\u001b[39;49;00m\u001b[33m'\u001b[39;49;00m)\n",
" \u001b[36mself\u001b[39;49;00m.flatten = Flatten()\n",
" \u001b[36mself\u001b[39;49;00m.d1 = Dense(\u001b[34m128\u001b[39;49;00m, activation=\u001b[33m'\u001b[39;49;00m\u001b[33mrelu\u001b[39;49;00m\u001b[33m'\u001b[39;49;00m)\n",
" \u001b[36mself\u001b[39;49;00m.d2 = Dense(\u001b[34m10\u001b[39;49;00m)\n",
" \n",
" \u001b[34mdef\u001b[39;49;00m \u001b[32mcall\u001b[39;49;00m(\u001b[36mself\u001b[39;49;00m, x):\n",
" x = \u001b[36mself\u001b[39;49;00m.conv1(x)\n",
" x = \u001b[36mself\u001b[39;49;00m.flatten(x)\n",
" x = \u001b[36mself\u001b[39;49;00m.d1(x)\n",
" \u001b[34mreturn\u001b[39;49;00m \u001b[36mself\u001b[39;49;00m.d2(x)\n",
"\n",
"\n",
"\u001b[37m# Decode and preprocess data\u001b[39;49;00m\n",
"\u001b[34mdef\u001b[39;49;00m \u001b[32mconvert_to_numpy\u001b[39;49;00m(data_dir, images_file, labels_file):\n",
" \u001b[33m\"\"\"Byte string to numpy arrays\"\"\"\u001b[39;49;00m\n",
" \u001b[34mwith\u001b[39;49;00m gzip.open(os.path.join(data_dir, images_file), \u001b[33m'\u001b[39;49;00m\u001b[33mrb\u001b[39;49;00m\u001b[33m'\u001b[39;49;00m) \u001b[34mas\u001b[39;49;00m f:\n",
" images = np.frombuffer(f.read(), np.uint8, offset=\u001b[34m16\u001b[39;49;00m).reshape(-\u001b[34m1\u001b[39;49;00m, \u001b[34m28\u001b[39;49;00m, \u001b[34m28\u001b[39;49;00m)\n",
" \n",
" \u001b[34mwith\u001b[39;49;00m gzip.open(os.path.join(data_dir, labels_file), \u001b[33m'\u001b[39;49;00m\u001b[33mrb\u001b[39;49;00m\u001b[33m'\u001b[39;49;00m) \u001b[34mas\u001b[39;49;00m f:\n",
" labels = np.frombuffer(f.read(), np.uint8, offset=\u001b[34m8\u001b[39;49;00m)\n",
"\n",
" \u001b[34mreturn\u001b[39;49;00m (images, labels)\n",
"\n",
"\u001b[34mdef\u001b[39;49;00m \u001b[32mmnist_to_numpy\u001b[39;49;00m(data_dir, train):\n",
" \u001b[33m\"\"\"Load raw MNIST data into numpy array\u001b[39;49;00m\n",
"\u001b[33m \u001b[39;49;00m\n",
"\u001b[33m Args:\u001b[39;49;00m\n",
"\u001b[33m data_dir (str): directory of MNIST raw data. \u001b[39;49;00m\n",
"\u001b[33m This argument can be accessed via SM_CHANNEL_TRAINING\u001b[39;49;00m\n",
"\u001b[33m \u001b[39;49;00m\n",
"\u001b[33m train (bool): use training data\u001b[39;49;00m\n",
"\u001b[33m\u001b[39;49;00m\n",
"\u001b[33m Returns:\u001b[39;49;00m\n",
"\u001b[33m tuple of images and labels as numpy array\u001b[39;49;00m\n",
"\u001b[33m \"\"\"\u001b[39;49;00m\n",
"\n",
" \u001b[34mif\u001b[39;49;00m train:\n",
" images_file = \u001b[33m\"\u001b[39;49;00m\u001b[33mtrain-images-idx3-ubyte.gz\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m\n",
" labels_file = \u001b[33m\"\u001b[39;49;00m\u001b[33mtrain-labels-idx1-ubyte.gz\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m\n",
" \u001b[34melse\u001b[39;49;00m:\n",
" images_file = \u001b[33m\"\u001b[39;49;00m\u001b[33mt10k-images-idx3-ubyte.gz\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m\n",
" labels_file = \u001b[33m\"\u001b[39;49;00m\u001b[33mt10k-labels-idx1-ubyte.gz\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m\n",
"\n",
" \u001b[34mreturn\u001b[39;49;00m convert_to_numpy(data_dir, images_file, labels_file)\n",
"\n",
"\n",
"\u001b[34mdef\u001b[39;49;00m \u001b[32mnormalize\u001b[39;49;00m(x, axis):\n",
" eps = np.finfo(\u001b[36mfloat\u001b[39;49;00m).eps\n",
"\n",
" mean = np.mean(x, axis=axis, keepdims=\u001b[34mTrue\u001b[39;49;00m)\n",
" \u001b[37m# avoid division by zero\u001b[39;49;00m\n",
" std = np.std(x, axis=axis, keepdims=\u001b[34mTrue\u001b[39;49;00m) + eps\n",
" \u001b[34mreturn\u001b[39;49;00m (x - mean) / std\n",
"\n",
"\u001b[37m# Training logic\u001b[39;49;00m\n",
"\n",
"\u001b[34mdef\u001b[39;49;00m \u001b[32mtrain\u001b[39;49;00m(args):\n",
" \u001b[37m# create data loader from the train / test channels\u001b[39;49;00m\n",
" x_train, y_train = mnist_to_numpy(data_dir=args.train, train=\u001b[34mTrue\u001b[39;49;00m)\n",
" x_test, y_test = mnist_to_numpy(data_dir=args.test, train=\u001b[34mFalse\u001b[39;49;00m)\n",
"\n",
" x_train, x_test = x_train.astype(np.float32), x_test.astype(np.float32)\n",
"\n",
" \u001b[37m# normalize the inputs to mean 0 and std 1\u001b[39;49;00m\n",
" x_train, x_test = normalize(x_train, (\u001b[34m1\u001b[39;49;00m, \u001b[34m2\u001b[39;49;00m)), normalize(x_test, (\u001b[34m1\u001b[39;49;00m, \u001b[34m2\u001b[39;49;00m))\n",
"\n",
" \u001b[37m# expand channel axis\u001b[39;49;00m\n",
" \u001b[37m# tf uses depth minor convention\u001b[39;49;00m\n",
" x_train, x_test = np.expand_dims(x_train, axis=\u001b[34m3\u001b[39;49;00m), np.expand_dims(x_test, axis=\u001b[34m3\u001b[39;49;00m)\n",
" \n",
" \u001b[37m# normalize the data to mean 0 and std 1\u001b[39;49;00m\n",
" train_loader = tf.data.Dataset.from_tensor_slices(\n",
" (x_train, y_train)).shuffle(\u001b[36mlen\u001b[39;49;00m(x_train)).batch(args.batch_size)\n",
"\n",
" test_loader = tf.data.Dataset.from_tensor_slices(\n",
" (x_test, y_test)).batch(args.batch_size)\n",
"\n",
" model = SmallConv()\n",
" model.compile()\n",
" loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=\u001b[34mTrue\u001b[39;49;00m)\n",
" optimizer = tf.keras.optimizers.Adam(\n",
" learning_rate=args.learning_rate, \n",
" beta_1=args.beta_1,\n",
" beta_2=args.beta_2\n",
" )\n",
"\n",
"\n",
" train_loss = tf.keras.metrics.Mean(name=\u001b[33m'\u001b[39;49;00m\u001b[33mtrain_loss\u001b[39;49;00m\u001b[33m'\u001b[39;49;00m)\n",
" train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name=\u001b[33m'\u001b[39;49;00m\u001b[33mtrain_accuracy\u001b[39;49;00m\u001b[33m'\u001b[39;49;00m)\n",
"\n",
" test_loss = tf.keras.metrics.Mean(name=\u001b[33m'\u001b[39;49;00m\u001b[33mtest_loss\u001b[39;49;00m\u001b[33m'\u001b[39;49;00m)\n",
" test_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name=\u001b[33m'\u001b[39;49;00m\u001b[33mtest_accuracy\u001b[39;49;00m\u001b[33m'\u001b[39;49;00m)\n",
"\n",
"\n",
" \u001b[90m@tf\u001b[39;49;00m.function\n",
" \u001b[34mdef\u001b[39;49;00m \u001b[32mtrain_step\u001b[39;49;00m(images, labels):\n",
" \u001b[34mwith\u001b[39;49;00m tf.GradientTape() \u001b[34mas\u001b[39;49;00m tape:\n",
" predictions = model(images, training=\u001b[34mTrue\u001b[39;49;00m)\n",
" loss = loss_fn(labels, predictions)\n",
" grad = tape.gradient(loss, model.trainable_variables)\n",
" optimizer.apply_gradients(\u001b[36mzip\u001b[39;49;00m(grad, model.trainable_variables))\n",
" \n",
" train_loss(loss)\n",
" train_accuracy(labels, predictions)\n",
" \u001b[34mreturn\u001b[39;49;00m \n",
" \n",
" \u001b[90m@tf\u001b[39;49;00m.function\n",
" \u001b[34mdef\u001b[39;49;00m \u001b[32mtest_step\u001b[39;49;00m(images, labels):\n",
" predictions = model(images, training=\u001b[34mFalse\u001b[39;49;00m)\n",
" t_loss = loss_fn(labels, predictions)\n",
" test_loss(t_loss)\n",
" test_accuracy(labels, predictions)\n",
" \u001b[34mreturn\u001b[39;49;00m\n",
" \n",
" \u001b[36mprint\u001b[39;49;00m(\u001b[33m\"\u001b[39;49;00m\u001b[33mTraining starts ...\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m)\n",
" \u001b[34mfor\u001b[39;49;00m epoch \u001b[35min\u001b[39;49;00m \u001b[36mrange\u001b[39;49;00m(args.epochs):\n",
" train_loss.reset_states()\n",
" train_accuracy.reset_states()\n",
" test_loss.reset_states()\n",
" test_accuracy.reset_states()\n",
" \n",
" \u001b[34mfor\u001b[39;49;00m batch, (images, labels) \u001b[35min\u001b[39;49;00m \u001b[36menumerate\u001b[39;49;00m(train_loader):\n",
" train_step(images, labels)\n",
" \n",
" \u001b[34mfor\u001b[39;49;00m images, labels \u001b[35min\u001b[39;49;00m test_loader:\n",
" test_step(images, labels)\n",
" \n",
" \u001b[36mprint\u001b[39;49;00m(\n",
" \u001b[33mf\u001b[39;49;00m\u001b[33m'\u001b[39;49;00m\u001b[33mEpoch \u001b[39;49;00m\u001b[33m{\u001b[39;49;00mepoch + \u001b[34m1\u001b[39;49;00m\u001b[33m}\u001b[39;49;00m\u001b[33m, \u001b[39;49;00m\u001b[33m'\u001b[39;49;00m\n",
" \u001b[33mf\u001b[39;49;00m\u001b[33m'\u001b[39;49;00m\u001b[33mLoss: \u001b[39;49;00m\u001b[33m{\u001b[39;49;00mtrain_loss.result()\u001b[33m}\u001b[39;49;00m\u001b[33m, \u001b[39;49;00m\u001b[33m'\u001b[39;49;00m\n",
" \u001b[33mf\u001b[39;49;00m\u001b[33m'\u001b[39;49;00m\u001b[33mAccuracy: \u001b[39;49;00m\u001b[33m{\u001b[39;49;00mtrain_accuracy.result() * \u001b[34m100\u001b[39;49;00m\u001b[33m}\u001b[39;49;00m\u001b[33m, \u001b[39;49;00m\u001b[33m'\u001b[39;49;00m\n",
" \u001b[33mf\u001b[39;49;00m\u001b[33m'\u001b[39;49;00m\u001b[33mTest Loss: \u001b[39;49;00m\u001b[33m{\u001b[39;49;00mtest_loss.result()\u001b[33m}\u001b[39;49;00m\u001b[33m, \u001b[39;49;00m\u001b[33m'\u001b[39;49;00m\n",
" \u001b[33mf\u001b[39;49;00m\u001b[33m'\u001b[39;49;00m\u001b[33mTest Accuracy: \u001b[39;49;00m\u001b[33m{\u001b[39;49;00mtest_accuracy.result() * \u001b[34m100\u001b[39;49;00m\u001b[33m}\u001b[39;49;00m\u001b[33m'\u001b[39;49;00m\n",
" )\n",
"\n",
" \u001b[37m# Save the model\u001b[39;49;00m\n",
" \u001b[37m# A version number is needed for the serving container\u001b[39;49;00m\n",
" \u001b[37m# to load the model\u001b[39;49;00m\n",
" version = \u001b[33m'\u001b[39;49;00m\u001b[33m00000000\u001b[39;49;00m\u001b[33m'\u001b[39;49;00m\n",
" ckpt_dir = os.path.join(args.model_dir, version)\n",
" \u001b[34mif\u001b[39;49;00m \u001b[35mnot\u001b[39;49;00m os.path.exists(ckpt_dir):\n",
" os.makedirs(ckpt_dir)\n",
" model.save(ckpt_dir)\n",
" \u001b[34mreturn\u001b[39;49;00m\n",
"\n",
"\n",
"\u001b[34mdef\u001b[39;49;00m \u001b[32mparse_args\u001b[39;49;00m():\n",
" parser = argparse.ArgumentParser()\n",
"\n",
" parser.add_argument(\u001b[33m'\u001b[39;49;00m\u001b[33m--batch-size\u001b[39;49;00m\u001b[33m'\u001b[39;49;00m, \u001b[36mtype\u001b[39;49;00m=\u001b[36mint\u001b[39;49;00m, default=\u001b[34m32\u001b[39;49;00m)\n",
" parser.add_argument(\u001b[33m'\u001b[39;49;00m\u001b[33m--epochs\u001b[39;49;00m\u001b[33m'\u001b[39;49;00m, \u001b[36mtype\u001b[39;49;00m=\u001b[36mint\u001b[39;49;00m, default=\u001b[34m1\u001b[39;49;00m)\n",
" parser.add_argument(\u001b[33m'\u001b[39;49;00m\u001b[33m--learning-rate\u001b[39;49;00m\u001b[33m'\u001b[39;49;00m, \u001b[36mtype\u001b[39;49;00m=\u001b[36mfloat\u001b[39;49;00m, default=\u001b[34m1e-3\u001b[39;49;00m)\n",
" parser.add_argument(\u001b[33m'\u001b[39;49;00m\u001b[33m--beta_1\u001b[39;49;00m\u001b[33m'\u001b[39;49;00m, \u001b[36mtype\u001b[39;49;00m=\u001b[36mfloat\u001b[39;49;00m, default=\u001b[34m0.9\u001b[39;49;00m)\n",
" parser.add_argument(\u001b[33m'\u001b[39;49;00m\u001b[33m--beta_2\u001b[39;49;00m\u001b[33m'\u001b[39;49;00m, \u001b[36mtype\u001b[39;49;00m=\u001b[36mfloat\u001b[39;49;00m, default=\u001b[34m0.999\u001b[39;49;00m)\n",
" \n",
" \u001b[37m# Environment variables given by the training image\u001b[39;49;00m\n",
" parser.add_argument(\u001b[33m'\u001b[39;49;00m\u001b[33m--model-dir\u001b[39;49;00m\u001b[33m'\u001b[39;49;00m, \u001b[36mtype\u001b[39;49;00m=\u001b[36mstr\u001b[39;49;00m, default=os.environ[\u001b[33m'\u001b[39;49;00m\u001b[33mSM_MODEL_DIR\u001b[39;49;00m\u001b[33m'\u001b[39;49;00m])\n",
" parser.add_argument(\u001b[33m'\u001b[39;49;00m\u001b[33m--train\u001b[39;49;00m\u001b[33m'\u001b[39;49;00m, \u001b[36mtype\u001b[39;49;00m=\u001b[36mstr\u001b[39;49;00m, default=os.environ[\u001b[33m'\u001b[39;49;00m\u001b[33mSM_CHANNEL_TRAINING\u001b[39;49;00m\u001b[33m'\u001b[39;49;00m])\n",
" parser.add_argument(\u001b[33m'\u001b[39;49;00m\u001b[33m--test\u001b[39;49;00m\u001b[33m'\u001b[39;49;00m, \u001b[36mtype\u001b[39;49;00m=\u001b[36mstr\u001b[39;49;00m, default=os.environ[\u001b[33m'\u001b[39;49;00m\u001b[33mSM_CHANNEL_TESTING\u001b[39;49;00m\u001b[33m'\u001b[39;49;00m])\n",
"\n",
" parser.add_argument(\u001b[33m'\u001b[39;49;00m\u001b[33m--current-host\u001b[39;49;00m\u001b[33m'\u001b[39;49;00m, \u001b[36mtype\u001b[39;49;00m=\u001b[36mstr\u001b[39;49;00m, default=os.environ[\u001b[33m'\u001b[39;49;00m\u001b[33mSM_CURRENT_HOST\u001b[39;49;00m\u001b[33m'\u001b[39;49;00m])\n",
" parser.add_argument(\u001b[33m'\u001b[39;49;00m\u001b[33m--hosts\u001b[39;49;00m\u001b[33m'\u001b[39;49;00m, \u001b[36mtype\u001b[39;49;00m=\u001b[36mlist\u001b[39;49;00m, default=json.loads(os.environ[\u001b[33m'\u001b[39;49;00m\u001b[33mSM_HOSTS\u001b[39;49;00m\u001b[33m'\u001b[39;49;00m]))\n",
"\n",
" \u001b[34mreturn\u001b[39;49;00m parser.parse_args()\n",
"\n",
"\n",
"\n",
"\u001b[34mif\u001b[39;49;00m \u001b[31m__name__\u001b[39;49;00m == \u001b[33m'\u001b[39;49;00m\u001b[33m__main__\u001b[39;49;00m\u001b[33m'\u001b[39;49;00m:\n",
" args = parse_args()\n",
" train(args)\n"
]
}
],
"source": [
"!pygmentize 'code/train.py'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 하이퍼파리미터 설정\n",
"\n",
"추가로, Tensorflow Estimator는 명령라인 매개변수로 학습작업에서 사용할 하이퍼파라미터를 전달합니다.\n",
"\n",
" Note: SageMaker Studio 에서는 local mode가 지원되지 않습니다. "
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"# set local_mode if you want to run the training script on the machine that runs this notebook\n",
"\n",
"instance_type='ml.c4.xlarge'\n",
" \n",
"est = TensorFlow(\n",
" entry_point='train.py',\n",
" source_dir='code', # directory of your training script\n",
" role=role,\n",
" framework_version='2.3.1',\n",
" model_dir=False, # don't pass --model_dir to your training script\n",
" py_version='py37',\n",
" instance_type=instance_type,\n",
" instance_count=1,\n",
" output_path=output_path,\n",
" hyperparameters={\n",
" 'batch-size':512,\n",
" 'epochs':10,\n",
" 'learning-rate': 1e-3,\n",
" 'beta_1' : 0.9,\n",
" 'beta_2' : 0.999\n",
" \n",
" }\n",
")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"학습 컨테이너는 아래와 같은 방식으로 하이퍼파라미터를 전달하고 스크립트를 실행할것입니다. \n",
"\n",
"```\n",
"python train.py --batch-size 32 --epochs 10 --learning-rate 0.001\n",
" --beta_1 0.9 --beta_2 0.999\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 학습 & 테스트 데이터 채널 지정 \n",
"\n",
"Tensorflow Estimator에게 학습 및 테스트 데이터셋을 찾을 수있는 위치를 알려야합니다. S3 버킷에 대한 링크 또는 로컬 모드를 사용하는 경우 로컬 파일 시스템의 경로가 될 수 있습니다. 이 예에서는 공용 S3 버킷에서 MNIST 데이터를 다운로드하고 기본 버킷에 업로드합니다."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"import logging\n",
"import boto3\n",
"from botocore.exceptions import ClientError\n",
"# Download training and testing data from a public S3 bucket\n",
"\n",
"def download_from_s3(data_dir='/tmp/data', train=True):\n",
" \"\"\"Download MNIST dataset and convert it to numpy array\n",
" \n",
" Args:\n",
" data_dir (str): directory to save the data\n",
" train (bool): download training set\n",
" \n",
" Returns:\n",
" None\n",
" \"\"\"\n",
" \n",
" if not os.path.exists(data_dir):\n",
" os.makedirs(data_dir)\n",
" \n",
" if train:\n",
" images_file = \"train-images-idx3-ubyte.gz\"\n",
" labels_file = \"train-labels-idx1-ubyte.gz\"\n",
" else:\n",
" images_file = \"t10k-images-idx3-ubyte.gz\"\n",
" labels_file = \"t10k-labels-idx1-ubyte.gz\"\n",
" \n",
" with open('code/config.json', 'r') as f:\n",
" config = json.load(f)\n",
"\n",
" # download objects\n",
" s3 = boto3.client('s3')\n",
" bucket = config['public_bucket']\n",
" for obj in [images_file, labels_file]:\n",
" key = os.path.join(\"datasets/image/MNIST\", obj)\n",
" dest = os.path.join(data_dir, obj)\n",
" if not os.path.exists(dest):\n",
" s3.download_file(bucket, key, dest)\n",
" return\n",
"\n",
"\n",
"download_from_s3('/tmp/data', True)\n",
"download_from_s3('/tmp/data', False)\n"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"# upload to the default bucket\n",
"\n",
"prefix = 'mnist'\n",
"bucket = sess.default_bucket()\n",
"loc = sess.upload_data(path='/tmp/data', bucket=bucket, key_prefix=prefix)\n",
"\n",
"channels = {\n",
" \"training\": loc,\n",
" \"testing\": loc\n",
"}\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"학습 실행시 `channels` 딕셔너리는 컨테이너 내에 `SM_CHANNEL_` 형태의 환경 변수를 만듭니다.\n",
"\n",
"본 사례에서는 `SM_CHANNEL_TRAINING`과 `SM_CHANNEL_TESTING` 이라는 이름으로 생성될 것입니다. `code/train.py` 에서 해당 값을 어떻게 참조하는지 살펴보십시오. 보다 자세한 내용은 [SM_CHANNEL_{channel_name}](https://github.com/aws/sagemaker-training-toolkit/blob/master/ENVIRONMENT_VARIABLES.md#sm_channel_channel_name)를 참조합니다.\n",
"\n",
"필요시 다음과 같이 검증 채널을 추가할 수 있습니다.\n",
"```\n",
"channels = {\n",
" 'training': train_data_loc,\n",
" 'validation': val_data_loc,\n",
" 'test': test_data_loc\n",
" }\n",
"```\n",
"위 코드에 의해서는 다음 채널이 스크립트에서 사용가능하게 될 것입니다. \n",
"`SM_CHANNEL_VALIDATION`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## SageMaker 학습작업 실행\n",
"\n",
"이제 훈련 컨테이너에는 교육용 스크립트를 실행할 수 있습니다. fit 명령을 호출하여 컨테이너를 시작할 수 있습니다\n"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2021-07-16 13:52:08 Starting - Starting the training job...\n",
"2021-07-16 13:52:32 Starting - Launching requested ML instancesProfilerReport-1626443528: InProgress\n",
"...\n",
"2021-07-16 13:53:07 Starting - Preparing the instances for training.........\n",
"2021-07-16 13:54:33 Downloading - Downloading input data...\n",
"2021-07-16 13:54:57 Training - Downloading the training image..\u001b[34m2021-07-16 13:55:19.238815: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:460] Initializing the SageMaker Profiler.\u001b[0m\n",
"\u001b[34m2021-07-16 13:55:19.249118: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:105] SageMaker Profiler is not enabled. The timeline writer thread will not be started, future recorded events will be dropped.\u001b[0m\n",
"\u001b[34m2021-07-16 13:55:19.630437: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:460] Initializing the SageMaker Profiler.\u001b[0m\n",
"\u001b[34m2021-07-16 13:55:23,655 sagemaker-training-toolkit INFO Imported framework sagemaker_tensorflow_container.training\u001b[0m\n",
"\u001b[34m2021-07-16 13:55:23,663 sagemaker-training-toolkit INFO No GPUs detected (normal if no gpus installed)\u001b[0m\n",
"\u001b[34m2021-07-16 13:55:24,157 sagemaker-training-toolkit INFO No GPUs detected (normal if no gpus installed)\u001b[0m\n",
"\u001b[34m2021-07-16 13:55:24,172 sagemaker-training-toolkit INFO No GPUs detected (normal if no gpus installed)\u001b[0m\n",
"\u001b[34m2021-07-16 13:55:24,187 sagemaker-training-toolkit INFO No GPUs detected (normal if no gpus installed)\u001b[0m\n",
"\u001b[34m2021-07-16 13:55:24,201 sagemaker-training-toolkit INFO Invoking user script\n",
"\u001b[0m\n",
"\u001b[34mTraining Env:\n",
"\u001b[0m\n",
"\u001b[34m{\n",
" \"additional_framework_parameters\": {},\n",
" \"channel_input_dirs\": {\n",
" \"testing\": \"/opt/ml/input/data/testing\",\n",
" \"training\": \"/opt/ml/input/data/training\"\n",
" },\n",
" \"current_host\": \"algo-1\",\n",
" \"framework_module\": \"sagemaker_tensorflow_container.training:main\",\n",
" \"hosts\": [\n",
" \"algo-1\"\n",
" ],\n",
" \"hyperparameters\": {\n",
" \"batch-size\": 512,\n",
" \"beta_1\": 0.9,\n",
" \"beta_2\": 0.999,\n",
" \"learning-rate\": 0.001,\n",
" \"epochs\": 10\n",
" },\n",
" \"input_config_dir\": \"/opt/ml/input/config\",\n",
" \"input_data_config\": {\n",
" \"testing\": {\n",
" \"TrainingInputMode\": \"File\",\n",
" \"S3DistributionType\": \"FullyReplicated\",\n",
" \"RecordWrapperType\": \"None\"\n",
" },\n",
" \"training\": {\n",
" \"TrainingInputMode\": \"File\",\n",
" \"S3DistributionType\": \"FullyReplicated\",\n",
" \"RecordWrapperType\": \"None\"\n",
" }\n",
" },\n",
" \"input_dir\": \"/opt/ml/input\",\n",
" \"is_master\": true,\n",
" \"job_name\": \"tensorflow-training-2021-07-16-13-52-08-478\",\n",
" \"log_level\": 20,\n",
" \"master_hostname\": \"algo-1\",\n",
" \"model_dir\": \"/opt/ml/model\",\n",
" \"module_dir\": \"s3://sagemaker-us-east-1-308961792850/tensorflow-training-2021-07-16-13-52-08-478/source/sourcedir.tar.gz\",\n",
" \"module_name\": \"train\",\n",
" \"network_interface_name\": \"eth0\",\n",
" \"num_cpus\": 4,\n",
" \"num_gpus\": 0,\n",
" \"output_data_dir\": \"/opt/ml/output/data\",\n",
" \"output_dir\": \"/opt/ml/output\",\n",
" \"output_intermediate_dir\": \"/opt/ml/output/intermediate\",\n",
" \"resource_config\": {\n",
" \"current_host\": \"algo-1\",\n",
" \"hosts\": [\n",
" \"algo-1\"\n",
" ],\n",
" \"network_interface_name\": \"eth0\"\n",
" },\n",
" \"user_entry_point\": \"train.py\"\u001b[0m\n",
"\u001b[34m}\n",
"\u001b[0m\n",
"\u001b[34mEnvironment variables:\n",
"\u001b[0m\n",
"\u001b[34mSM_HOSTS=[\"algo-1\"]\u001b[0m\n",
"\u001b[34mSM_NETWORK_INTERFACE_NAME=eth0\u001b[0m\n",
"\u001b[34mSM_HPS={\"batch-size\":512,\"beta_1\":0.9,\"beta_2\":0.999,\"epochs\":10,\"learning-rate\":0.001}\u001b[0m\n",
"\u001b[34mSM_USER_ENTRY_POINT=train.py\u001b[0m\n",
"\u001b[34mSM_FRAMEWORK_PARAMS={}\u001b[0m\n",
"\u001b[34mSM_RESOURCE_CONFIG={\"current_host\":\"algo-1\",\"hosts\":[\"algo-1\"],\"network_interface_name\":\"eth0\"}\u001b[0m\n",
"\u001b[34mSM_INPUT_DATA_CONFIG={\"testing\":{\"RecordWrapperType\":\"None\",\"S3DistributionType\":\"FullyReplicated\",\"TrainingInputMode\":\"File\"},\"training\":{\"RecordWrapperType\":\"None\",\"S3DistributionType\":\"FullyReplicated\",\"TrainingInputMode\":\"File\"}}\u001b[0m\n",
"\u001b[34mSM_OUTPUT_DATA_DIR=/opt/ml/output/data\u001b[0m\n",
"\u001b[34mSM_CHANNELS=[\"testing\",\"training\"]\u001b[0m\n",
"\u001b[34mSM_CURRENT_HOST=algo-1\u001b[0m\n",
"\u001b[34mSM_MODULE_NAME=train\u001b[0m\n",
"\u001b[34mSM_LOG_LEVEL=20\u001b[0m\n",
"\u001b[34mSM_FRAMEWORK_MODULE=sagemaker_tensorflow_container.training:main\u001b[0m\n",
"\u001b[34mSM_INPUT_DIR=/opt/ml/input\u001b[0m\n",
"\u001b[34mSM_INPUT_CONFIG_DIR=/opt/ml/input/config\u001b[0m\n",
"\u001b[34mSM_OUTPUT_DIR=/opt/ml/output\u001b[0m\n",
"\u001b[34mSM_NUM_CPUS=4\u001b[0m\n",
"\u001b[34mSM_NUM_GPUS=0\u001b[0m\n",
"\u001b[34mSM_MODEL_DIR=/opt/ml/model\u001b[0m\n",
"\u001b[34mSM_MODULE_DIR=s3://sagemaker-us-east-1-308961792850/tensorflow-training-2021-07-16-13-52-08-478/source/sourcedir.tar.gz\u001b[0m\n",
"\u001b[34mSM_TRAINING_ENV={\"additional_framework_parameters\":{},\"channel_input_dirs\":{\"testing\":\"/opt/ml/input/data/testing\",\"training\":\"/opt/ml/input/data/training\"},\"current_host\":\"algo-1\",\"framework_module\":\"sagemaker_tensorflow_container.training:main\",\"hosts\":[\"algo-1\"],\"hyperparameters\":{\"batch-size\":512,\"beta_1\":0.9,\"beta_2\":0.999,\"epochs\":10,\"learning-rate\":0.001},\"input_config_dir\":\"/opt/ml/input/config\",\"input_data_config\":{\"testing\":{\"RecordWrapperType\":\"None\",\"S3DistributionType\":\"FullyReplicated\",\"TrainingInputMode\":\"File\"},\"training\":{\"RecordWrapperType\":\"None\",\"S3DistributionType\":\"FullyReplicated\",\"TrainingInputMode\":\"File\"}},\"input_dir\":\"/opt/ml/input\",\"is_master\":true,\"job_name\":\"tensorflow-training-2021-07-16-13-52-08-478\",\"log_level\":20,\"master_hostname\":\"algo-1\",\"model_dir\":\"/opt/ml/model\",\"module_dir\":\"s3://sagemaker-us-east-1-308961792850/tensorflow-training-2021-07-16-13-52-08-478/source/sourcedir.tar.gz\",\"module_name\":\"train\",\"network_interface_name\":\"eth0\",\"num_cpus\":4,\"num_gpus\":0,\"output_data_dir\":\"/opt/ml/output/data\",\"output_dir\":\"/opt/ml/output\",\"output_intermediate_dir\":\"/opt/ml/output/intermediate\",\"resource_config\":{\"current_host\":\"algo-1\",\"hosts\":[\"algo-1\"],\"network_interface_name\":\"eth0\"},\"user_entry_point\":\"train.py\"}\u001b[0m\n",
"\u001b[34mSM_USER_ARGS=[\"--batch-size\",\"512\",\"--beta_1\",\"0.9\",\"--beta_2\",\"0.999\",\"--epochs\",\"10\",\"--learning-rate\",\"0.001\"]\u001b[0m\n",
"\u001b[34mSM_OUTPUT_INTERMEDIATE_DIR=/opt/ml/output/intermediate\u001b[0m\n",
"\u001b[34mSM_CHANNEL_TESTING=/opt/ml/input/data/testing\u001b[0m\n",
"\u001b[34mSM_CHANNEL_TRAINING=/opt/ml/input/data/training\u001b[0m\n",
"\u001b[34mSM_HP_BATCH-SIZE=512\u001b[0m\n",
"\u001b[34mSM_HP_BETA_1=0.9\u001b[0m\n",
"\u001b[34mSM_HP_BETA_2=0.999\u001b[0m\n",
"\u001b[34mSM_HP_LEARNING-RATE=0.001\u001b[0m\n",
"\u001b[34mSM_HP_EPOCHS=10\u001b[0m\n",
"\u001b[34mPYTHONPATH=/opt/ml/code:/usr/local/bin:/usr/local/lib/python37.zip:/usr/local/lib/python3.7:/usr/local/lib/python3.7/lib-dynload:/usr/local/lib/python3.7/site-packages\n",
"\u001b[0m\n",
"\u001b[34mInvoking script with the following command:\n",
"\u001b[0m\n",
"\u001b[34m/usr/local/bin/python3.7 train.py --batch-size 512 --beta_1 0.9 --beta_2 0.999 --epochs 10 --learning-rate 0.001\n",
"\n",
"\u001b[0m\n",
"\u001b[34mTraining starts ...\u001b[0m\n",
"\n",
"2021-07-16 13:55:33 Training - Training image download completed. Training in progress.\u001b[34m[2021-07-16 13:55:27.833 ip-10-2-196-137.ec2.internal:24 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None\u001b[0m\n",
"\u001b[34m[2021-07-16 13:55:28.122 ip-10-2-196-137.ec2.internal:24 INFO profiler_config_parser.py:102] User has disabled profiler.\u001b[0m\n",
"\u001b[34m[2021-07-16 13:55:28.123 ip-10-2-196-137.ec2.internal:24 INFO json_config.py:91] Creating hook from json_config at /opt/ml/input/config/debughookconfig.json.\u001b[0m\n",
"\u001b[34m[2021-07-16 13:55:28.124 ip-10-2-196-137.ec2.internal:24 INFO hook.py:199] tensorboard_dir has not been set for the hook. SMDebug will not be exporting tensorboard summaries.\u001b[0m\n",
"\u001b[34m[2021-07-16 13:55:28.124 ip-10-2-196-137.ec2.internal:24 INFO hook.py:253] Saving to /opt/ml/output/tensors\u001b[0m\n",
"\u001b[34m[2021-07-16 13:55:28.124 ip-10-2-196-137.ec2.internal:24 INFO state_store.py:75] The checkpoint config file /opt/ml/input/config/checkpointconfig.json does not exist.\u001b[0m\n",
"\u001b[34m[2021-07-16 13:55:28.125 ip-10-2-196-137.ec2.internal:24 INFO hook.py:413] Monitoring the collections: metrics, sm_metrics, losses\u001b[0m\n",
"\u001b[34mEpoch 1, Loss: 0.27551141381263733, Accuracy: 91.52999877929688, Test Loss: 0.11824200302362442, Test Accuracy: 96.58000183105469\u001b[0m\n",
"\u001b[34mEpoch 2, Loss: 0.07692711800336838, Accuracy: 97.78333282470703, Test Loss: 0.07241345942020416, Test Accuracy: 97.64999389648438\u001b[0m\n",
"\u001b[34mEpoch 3, Loss: 0.042056649923324585, Accuracy: 98.75166320800781, Test Loss: 0.05691447854042053, Test Accuracy: 98.11000061035156\u001b[0m\n",
"\u001b[34mEpoch 4, Loss: 0.02515381947159767, Accuracy: 99.28333282470703, Test Loss: 0.05645830184221268, Test Accuracy: 98.1500015258789\u001b[0m\n",
"\u001b[34mEpoch 5, Loss: 0.017273804172873497, Accuracy: 99.53166961669922, Test Loss: 0.05516675114631653, Test Accuracy: 98.33999633789062\u001b[0m\n",
"\u001b[34mEpoch 6, Loss: 0.012449586763978004, Accuracy: 99.65499877929688, Test Loss: 0.06764646619558334, Test Accuracy: 98.12999725341797\u001b[0m\n",
"\u001b[34mEpoch 7, Loss: 0.009526542387902737, Accuracy: 99.71666717529297, Test Loss: 0.05913087725639343, Test Accuracy: 98.29999542236328\u001b[0m\n",
"\u001b[34mEpoch 8, Loss: 0.004901242908090353, Accuracy: 99.90333557128906, Test Loss: 0.05749906972050667, Test Accuracy: 98.36000061035156\u001b[0m\n",
"\u001b[34mEpoch 9, Loss: 0.002548153977841139, Accuracy: 99.96833038330078, Test Loss: 0.056115925312042236, Test Accuracy: 98.5\u001b[0m\n",
"\n",
"2021-07-16 13:58:12 Uploading - Uploading generated training model\u001b[34mEpoch 10, Loss: 0.0011424849508330226, Accuracy: 99.99666595458984, Test Loss: 0.0559561550617218, Test Accuracy: 98.43000030517578\u001b[0m\n",
"\u001b[34m2021-07-16 13:55:24.543481: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:460] Initializing the SageMaker Profiler.\u001b[0m\n",
"\u001b[34m2021-07-16 13:55:24.543634: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:105] SageMaker Profiler is not enabled. The timeline writer thread will not be started, future recorded events will be dropped.\u001b[0m\n",
"\u001b[34m2021-07-16 13:55:24.571841: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:460] Initializing the SageMaker Profiler.\u001b[0m\n",
"\u001b[34mWARNING:tensorflow:From /usr/local/lib/python3.7/site-packages/tensorflow/python/training/tracking/tracking.py:111: Model.state_updates (from tensorflow.python.keras.engine.training) is deprecated and will be removed in a future version.\u001b[0m\n",
"\u001b[34mInstructions for updating:\u001b[0m\n",
"\u001b[34mThis property should not be used in TensorFlow 2.0, as updates are applied automatically.\u001b[0m\n",
"\u001b[34mWARNING:tensorflow:From /usr/local/lib/python3.7/site-packages/tensorflow/python/training/tracking/tracking.py:111: Model.state_updates (from tensorflow.python.keras.engine.training) is deprecated and will be removed in a future version.\u001b[0m\n",
"\u001b[34mInstructions for updating:\u001b[0m\n",
"\u001b[34mThis property should not be used in TensorFlow 2.0, as updates are applied automatically.\u001b[0m\n",
"\u001b[34mWARNING:tensorflow:From /usr/local/lib/python3.7/site-packages/tensorflow/python/training/tracking/tracking.py:111: Layer.updates (from tensorflow.python.keras.engine.base_layer) is deprecated and will be removed in a future version.\u001b[0m\n",
"\u001b[34mInstructions for updating:\u001b[0m\n",
"\u001b[34mThis property should not be used in TensorFlow 2.0, as updates are applied automatically.\u001b[0m\n",
"\u001b[34mWARNING:tensorflow:From /usr/local/lib/python3.7/site-packages/tensorflow/python/training/tracking/tracking.py:111: Layer.updates (from tensorflow.python.keras.engine.base_layer) is deprecated and will be removed in a future version.\u001b[0m\n",
"\u001b[34mInstructions for updating:\u001b[0m\n",
"\u001b[34mThis property should not be used in TensorFlow 2.0, as updates are applied automatically.\u001b[0m\n",
"\u001b[34m2021-07-16 13:58:09.140695: W tensorflow/python/util/util.cc:348] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.\u001b[0m\n",
"\u001b[34mINFO:tensorflow:Assets written to: /opt/ml/model/00000000/assets\u001b[0m\n",
"\u001b[34mINFO:tensorflow:Assets written to: /opt/ml/model/00000000/assets\n",
"\u001b[0m\n",
"\u001b[34m2021-07-16 13:58:09,801 sagemaker-training-toolkit INFO Reporting training SUCCESS\u001b[0m\n",
"\n",
"2021-07-16 13:58:34 Completed - Training job completed\n",
"ProfilerReport-1626443528: NoIssuesFound\n",
"Training seconds: 229\n",
"Billable seconds: 229\n"
]
}
],
"source": [
"est.fit(inputs=channels)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Check the Saved Model Data\n",
"\n",
"Now that training is complete, the model artifacts are saved to `output_path`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tf_mnist_model_data = est.model_data\n",
"print(\"Model artifact saved at:\\n\", tf_mnist_model_data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"Now store the variable `tf_mnist_model_data` in the current notebook kernel. In the next notebook, you will learn how to retrieve the model artifact and deploy it to a SageMaker endpoint.\n"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Stored 'tf_mnist_model_data' (str)\n"
]
}
],
"source": [
"%store tf_mnist_model_data"
]
},
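{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sketch of the retrieval step: in another notebook running on this instance, the stored variable can be restored with the `%store -r` magic. This assumes the `%store` cell above has already been executed."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Restore the variable saved with %store (intended for the next notebook)\n",
"%store -r tf_mnist_model_data\n",
"print(tf_mnist_model_data)"
]
},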
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Testing and Debugging the Script Before Running It in the Training Container\n",
"\n",
"The `train.py` used earlier is fully tested code and can be run in the training container as-is. While developing such a script, however, you may want to simulate and test the container environment locally before submitting it to SageMaker. If testing and debugging inside a container environment is cumbersome, you can adapt the following code."
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[34mfrom\u001b[39;49;00m \u001b[04m\u001b[36mtrain\u001b[39;49;00m \u001b[34mimport\u001b[39;49;00m train, parse_args\n",
"\n",
"\u001b[34mimport\u001b[39;49;00m \u001b[04m\u001b[36msys\u001b[39;49;00m\n",
"\u001b[34mimport\u001b[39;49;00m \u001b[04m\u001b[36mos\u001b[39;49;00m\n",
"\u001b[34mimport\u001b[39;49;00m \u001b[04m\u001b[36mboto3\u001b[39;49;00m\n",
"\u001b[34mimport\u001b[39;49;00m \u001b[04m\u001b[36mjson\u001b[39;49;00m\n",
"\n",
"dirname = os.path.dirname(os.path.abspath(\u001b[31m__file__\u001b[39;49;00m))\n",
"\n",
"\u001b[34mwith\u001b[39;49;00m \u001b[36mopen\u001b[39;49;00m(os.path.join(dirname, \u001b[33m\"\u001b[39;49;00m\u001b[33mconfig.json\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m), \u001b[33m\"\u001b[39;49;00m\u001b[33mr\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m) \u001b[34mas\u001b[39;49;00m f:\n",
" CONFIG = json.load(f)\n",
" \n",
"\u001b[34mdef\u001b[39;49;00m \u001b[32mdownload_from_s3\u001b[39;49;00m(data_dir=\u001b[33m'\u001b[39;49;00m\u001b[33m/tmp/data\u001b[39;49;00m\u001b[33m'\u001b[39;49;00m, train=\u001b[34mTrue\u001b[39;49;00m):\n",
" \u001b[33m\"\"\"Download MNIST dataset and convert it to numpy array\u001b[39;49;00m\n",
"\u001b[33m Args:\u001b[39;49;00m\n",
"\u001b[33m data_dir (str): directory to save the data\u001b[39;49;00m\n",
"\u001b[33m train (bool): download training set\u001b[39;49;00m\n",
"\u001b[33m Returns:\u001b[39;49;00m\n",
"\u001b[33m tuple of images and labels as numpy arrays\u001b[39;49;00m\n",
"\u001b[33m \"\"\"\u001b[39;49;00m\n",
" \n",
" \u001b[34mif\u001b[39;49;00m \u001b[35mnot\u001b[39;49;00m os.path.exists(data_dir):\n",
" os.makedirs(data_dir)\n",
" \n",
" \u001b[34mif\u001b[39;49;00m train:\n",
" images_file = \u001b[33m\"\u001b[39;49;00m\u001b[33mtrain-images-idx3-ubyte.gz\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m\n",
" labels_file = \u001b[33m\"\u001b[39;49;00m\u001b[33mtrain-labels-idx1-ubyte.gz\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m\n",
" \u001b[34melse\u001b[39;49;00m:\n",
" images_file = \u001b[33m\"\u001b[39;49;00m\u001b[33mt10k-images-idx3-ubyte.gz\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m\n",
" labels_file = \u001b[33m\"\u001b[39;49;00m\u001b[33mt10k-labels-idx1-ubyte.gz\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m\n",
"\n",
" \u001b[37m# download objects\u001b[39;49;00m\n",
" s3 = boto3.client(\u001b[33m'\u001b[39;49;00m\u001b[33ms3\u001b[39;49;00m\u001b[33m'\u001b[39;49;00m)\n",
" bucket = CONFIG[\u001b[33m\"\u001b[39;49;00m\u001b[33mpublic_bucket\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m]\n",
" \u001b[34mfor\u001b[39;49;00m obj \u001b[35min\u001b[39;49;00m [images_file, labels_file]:\n",
" key = os.path.join(\u001b[33m\"\u001b[39;49;00m\u001b[33mdatasets/image/MNIST\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m, obj)\n",
" dest = os.path.join(data_dir, obj)\n",
" \u001b[34mif\u001b[39;49;00m \u001b[35mnot\u001b[39;49;00m os.path.exists(dest):\n",
" s3.download_file(bucket, key, dest)\n",
" \u001b[34mreturn\u001b[39;49;00m\n",
"\n",
"\u001b[34mclass\u001b[39;49;00m \u001b[04m\u001b[32mEnv\u001b[39;49;00m:\n",
" \u001b[34mdef\u001b[39;49;00m \u001b[32m__init__\u001b[39;49;00m(\u001b[36mself\u001b[39;49;00m): \n",
" \u001b[37m# simulate container env\u001b[39;49;00m\n",
" os.environ[\u001b[33m\"\u001b[39;49;00m\u001b[33mSM_MODEL_DIR\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m] = \u001b[33m\"\u001b[39;49;00m\u001b[33m/tmp/tf/model\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m\n",
" os.environ[\u001b[33m\"\u001b[39;49;00m\u001b[33mSM_CHANNEL_TRAINING\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m]=\u001b[33m\"\u001b[39;49;00m\u001b[33m/tmp/data\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m\n",
" os.environ[\u001b[33m\"\u001b[39;49;00m\u001b[33mSM_CHANNEL_TESTING\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m]=\u001b[33m\"\u001b[39;49;00m\u001b[33m/tmp/data\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m\n",
" os.environ[\u001b[33m\"\u001b[39;49;00m\u001b[33mSM_HOSTS\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m] = \u001b[33m'\u001b[39;49;00m\u001b[33m[\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m\u001b[33malgo-1\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m\u001b[33m]\u001b[39;49;00m\u001b[33m'\u001b[39;49;00m\n",
" os.environ[\u001b[33m\"\u001b[39;49;00m\u001b[33mSM_CURRENT_HOST\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m]=\u001b[33m\"\u001b[39;49;00m\u001b[33malgo-1\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m\n",
" os.environ[\u001b[33m\"\u001b[39;49;00m\u001b[33mSM_NUM_GPUS\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m] = \u001b[33m\"\u001b[39;49;00m\u001b[33m0\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m\n",
" \n",
" \n",
"\u001b[34mif\u001b[39;49;00m \u001b[31m__name__\u001b[39;49;00m==\u001b[33m'\u001b[39;49;00m\u001b[33m__main__\u001b[39;49;00m\u001b[33m'\u001b[39;49;00m:\n",
" Env()\n",
" download_from_s3()\n",
" download_from_s3(train=\u001b[34mFalse\u001b[39;49;00m)\n",
" args = parse_args()\n",
" train(args)\n"
]
}
],
"source": [
"!pygmentize code/test_train.py"
]
},
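{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can run this local test script directly on the notebook instance. This is a sketch: it assumes the defaults in `parse_args` are sufficient and that the kernel environment has the TensorFlow dependencies the script needs."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Run the container-simulation test locally (assumes default hyperparameters)\n",
"!python code/test_train.py"
]
},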
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In [the next notebook](get_started_mnist_deploy.ipynb) you will see how to deploy your \n",
"trained model artifacts to a SageMaker endpoint. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"instance_type": "ml.t3.medium",
"kernelspec": {
"display_name": "conda_tensorflow2_p36",
"language": "python",
"name": "conda_tensorflow2_p36"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.13"
},
"notice": "Copyright 2017 Amazon.com, Inc. or its affiliates. All Rights Reserved. Licensed under the Apache License, Version 2.0 (the \"License\"). You may not use this file except in compliance with the License. A copy of the License is located at http://aws.amazon.com/apache2.0/ or in the \"license\" file accompanying this file. This file is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License."
},
"nbformat": 4,
"nbformat_minor": 4
}