{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "# Deep Learning\n", "Your first model took random shots and your second model tried matching the current game against past games; however, if no past games matched, it just guessed randomly. This model will do better by __learning__ the patterns of ship layouts, not just memorizing them. " ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Data\n", "We will use past game data to train our model. First, we need to download the data from the master game archive (note that the data is constantly updated as games are played). The following commands will be executed in the code cells below.\n", "```bash\n", "# Download and parse the data, and create a data sample\n", "./bin/download_data.sh\n", "# Create a directory to store the downloaded data\n", "mkdir -p ./containers/Shoot/CNN/mock/data/\n", "# Copy the sample data file into the mock directory of the code for this model\n", "cp ./data/data.min.json ./containers/Shoot/CNN/mock/data/data.json\n", "```\n", "You will then be able to test your model training locally." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%cd /home/ec2-user/SageMaker/GameDayRepo" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pwd;./bin/download_data.sh;mkdir -p ./containers/Shoot/CNN/mock/data/;cp ./data/data.min.json ./containers/Shoot/CNN/mock/data/data.json" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Code\n", "Please confirm that the Operations team has completed the initial setup, including registration of the team name, and check with the instructor that the GameEngine has started.\n", "You can run the commands in this section either from the code cells below or from a terminal.\n", "The code in /containers/Shoot/CNN contains the following files:\n", " \n", "- ### ./containers/Shoot/CNN/Makefile\n", "Makefile for creating the tar.gz archive of code for the [Amazon SageMaker](https://aws.amazon.com/sagemaker/) framework.\n", "\n", "- ### [./containers/Shoot/CNN/host.py](/edit/GameDayRepo/containers/Shoot/CNN/host.py)\n", "The code for generating inferences (don't run this file directly). In addition to running our model, it puts a tag in the session attribute that we will use in Athena to compare different models. You can change the tag value (which you should do for each model) by editing the host.py file.\n", "\n", "- ### [./containers/Shoot/CNN/test.py](/edit/GameDayRepo/containers/Shoot/CNN/test.py)\n", "Code for testing the host.py and train.py files locally. It creates a mock [Amazon SageMaker](https://aws.amazon.com/sagemaker/) runtime in order to run train.py and then host.py. You can edit this file to customize your test. If you get a message about missing dependencies (like mxnet), just install them with:\n", "```shell\n", "pip install mxnet # for example\n", "```\n", "\n", "- ### [./containers/Shoot/CNN/train.py](/edit/GameDayRepo/containers/Shoot/CNN/train.py)\n", "The training code (don't run this file directly).\n", "\n", "\n", "Next, run the test script locally to confirm the scripts work. To do this, you must first download the dataset and create a sample data set by running the download script in the bin directory:\n", "```shell\n", "./bin/download_data.sh # downloads data from the master archive, parses it, and creates a data sample\n", "```\n", "\n", "Then, in the code directory (/containers/Shoot/CNN), run:\n", "```shell\n", "mkdir -p mock/data\n", "cp ../../../data/data.min.json mock/data/data.json # copy the data sample to the mock directory\n", "```\n", "Now you can run the test script:\n", "```shell\n", "python ./test.py\n", "```" ] },
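{ "cell_type": "markdown", "metadata": {}, "source": [ "Before running the test script, it can help to sanity-check the sample data you just copied into the mock directory. The cell below is a minimal sketch: it loads mock/data/data.json and reports its top-level structure only, since the exact record schema is an assumption here." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Minimal sanity check of the sample data (a sketch; the exact record\n", "# schema is an assumption, so only the top-level structure is inspected).\n", "import json\n", "\n", "with open(\"./containers/Shoot/CNN/mock/data/data.json\") as f:\n", "    data = json.load(f)\n", "\n", "if isinstance(data, list):\n", "    print(f\"{len(data)} records loaded\")\n", "    if data:\n", "        print(f\"first record (truncated): {str(data[0])[:200]}\")\n", "elif isinstance(data, dict):\n", "    print(f\"top-level keys: {list(data)}\")\n", "else:\n", "    print(f\"top-level type: {type(data).__name__}\")" ] },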
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!cd ./containers/Shoot/CNN/ && python test.py" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Configure\n", "Before you can deploy your second model, you need to configure the deployment. As described in the [Endpoint Notebook](../Endpoint_Reference.ipynb), you will edit the deployment configuration in [/shoot-config.json](/edit/GameDayRepo/shoot-config.json). You need to set the following fields to the following values:\n", "\n", "|parameter field|Value|description|\n", "|---|---|---|\n", "|trainsourcefile|`shoot-CNN.tar.gz`|code for our model|\n", "|hostsourcefile|`shoot-CNN.tar.gz`|code for our model|\n", "|hyperparameters.epochs|3|Max epochs to train for|\n", "|hyperparameters.warm_up|1|How many epochs to wait before early stopping can start|\n", "|hyperparameters.patience|1|How many epochs with no improvement before early stopping|\n", "|hyperparameters.learning_rate|0.005|Learning rate for training|\n", "|hyperparameters.width|32|How many feature filters in each layer|\n", "|hyperparameters.depth|4|How many layers|\n", "|hyperparameters.batch_size|10|Batch size for training|\n", "|metrics|[{Name:\"Validation\",Regex:\"Testing loss: (.*?);\"},{Name:\"Training\",Regex:\"Training loss: (.*?);\"},{Name:\"Throughput\",Regex:\"Throughput=(.*?);\"}]|Metrics to send to CloudWatch Logs|\n", "|traininstancetype|\"ml.m5.xlarge\"|Instance type to train with|\n", "|channels|{\"train\":{\"path\":\"all/data.json\"}}|Path to the data in the data bucket|" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!cp shoot-config-cnn.json shoot-config.json" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Deploy\n", "1. Follow the instructions from the endpoint reference to commit your code and then push your changes to the remote branch.\n", "2. Download, parse, and upload the training data by running the download_data.sh and upload_data.sh scripts:\n", "```shell\n", "./bin/download_data.sh\n", "./bin/upload_data.sh\n", "```\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!./bin/download_data.sh\n", "!./bin/upload_data.sh" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "3. Tell Operations to deploy your changes." ] },
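{ "cell_type": "markdown", "metadata": {}, "source": [ "The metrics regexes in the configuration table above are what SageMaker uses to scrape metric values out of your training job's log output. The cell below is a small sketch of how those patterns match; the sample log line is a hypothetical stand-in for what train.py might print, not its actual output." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Sketch: how the metric regexes from shoot-config.json pull values out of\n", "# training log output. The sample log line below is a hypothetical example.\n", "import re\n", "\n", "metrics = {\n", "    \"Validation\": r\"Testing loss: (.*?);\",\n", "    \"Training\": r\"Training loss: (.*?);\",\n", "    \"Throughput\": r\"Throughput=(.*?);\",\n", "}\n", "\n", "sample_log = \"Training loss: 0.412; Testing loss: 0.387; Throughput=1250.0;\"\n", "\n", "for name, pattern in metrics.items():\n", "    match = re.search(pattern, sample_log)\n", "    if match:\n", "        print(name, \"=\", match.group(1))" ] },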
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Training Evaluation\n", "After your model has run, there are two metrics you will want to evaluate:\n", " 1. Validation Performance: how well your model did on the held-out data set.\n", " 2. Training Throughput: how many data items it can train on per second.\n", "\n", "To view these metrics:\n", "1. Go to the [SageMaker training jobs](https://us-west-2.console.aws.amazon.com/sagemaker/home?#/jobs)\n", "2. Select the training job you want to evaluate (probably the most recent one)\n", "3. Scroll down until you see the Monitor card. You will see three options:\n", "\n", " 1. View Algorithm Metrics: opens CloudWatch, where you can view a plot of your model's metrics. For this model, you can view the training score, validation score, and throughput.\n", " 2. View Logs: takes you to CloudWatch Logs to see the log output of the model's container (useful for debugging).\n", " 3. View Instance Metrics: opens CloudWatch, where you can view a plot of the instance metrics (CPU utilization, GPU utilization, memory usage, etc.)\n", " \n", "### Validation Performance\n", "The model you trained uses the [DICE score](https://forums.fast.ai/t/understanding-the-dice-coefficient/5838), a number between 0.0 and 1.0, with 1.0 representing a perfect model. You want your model's DICE score to be as high as possible. Increasing the width and depth hyperparameters will make your model more powerful and give you a higher DICE score (but will take longer to train).\n", "\n", "### Training Throughput\n", "Your first training job should have trained for just a few epochs. This gives you a good estimate of the training throughput, which you can then use to decide how many epochs to train your next model for. Too few epochs and your model does not learn enough; too many and your model never finishes in time to gain you points. As a team, you will need to decide how long you want to train a model for. Note that increasing the width and depth hyperparameters will decrease your throughput, while using a larger instance type will increase it. " ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Deployment Performance\n", "Next, monitor your model's performance. See [Athena](../Athena.ipynb) for details on setting up Athena (replace the values between `<>`, as in the Athena instructions). The following query will show win percentage by shooting type.\n", "\n", "```sql\n", "SELECT type,\n", " sum(won)/(sum(won)+sum(lost)) as won,\n", " sum(lost)/(sum(won)+sum(lost)) as lost\n", "FROM \n", " (SELECT json_extract(session,'$.shootType') AS type,\n", " CASE\n", " WHEN winner = '<team name>' THEN 1.0\n", " ELSE 0.0\n", " END AS won,\n", " CASE\n", " WHEN winner != '<team name>' THEN 1.0\n", " ELSE 0.0\n", " END AS lost\n", " FROM \n", " (SELECT teama.session AS session,\n", " teama.teamname AS team,\n", " winner\n", " FROM \"<database>\".\"<table>\"\n", " UNION\n", " SELECT teamb.session AS session,\n", " teamb.teamname AS team,\n", " winner\n", " FROM \"<database>\".\"<table>\" )\n", " WHERE team = '<team name>' )\n", " GROUP BY type\n", "```\n", "\n", "Try editing the hyperparameters or the code to improve your performance results. " ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "conda_mxnet_p36", "language": "python", "name": "conda_mxnet_p36" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 4 }